Lecture02

STA732
Statistical Inference
Lecture 02: Exponential families
Yuansi Chen
Spring 2023
Duke University
https://www2.stat.duke.edu/courses/Spring23/sta732.01/
Recap from Lecture 01
• Defined the statistical inference problem
  Statistical experiment, data, statistical model, loss function, risk function
• Discussed how to argue that an estimator is optimal
  Statistical optimality, in addition to empirical success, fast computation, simplicity, etc.
Goal of Lecture 02
• Introduce exponential families
• Examples
• Differential identities (how to get moments and cumulants from exponential families?)
Chap. 2 in Keener or Chap. 1.5 in Lehmann and Casella
Exponential families
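An $s$-parameter exponential family is a family $\mathcal{P} = \{P_\eta : \eta \in \Xi\}$ with densities $p_\eta$ w.r.t. a common measure $\mu$ on $\mathcal{X}$ of the form
$$p_\eta(x) = \exp\left(\eta^\top T(x) - A(\eta)\right) h(x)$$
• $T : \mathcal{X} \to \mathbb{R}^s$: sufficient statistic
• $h : \mathcal{X} \to \mathbb{R}$: carrier/base density
• $\eta \in \Xi \subseteq \mathbb{R}^s$: natural parameter
• $A : \mathbb{R}^s \to \mathbb{R}$: cumulant-generating function (cgf)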
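Notes on $A(\eta)$
For any $\eta$, the cgf $A(\eta)$ is determined by $h$ and $T$. Since $\int p_\eta \, d\mu = 1$ holds, we have
$$A(\eta) = \log\left[\int \exp\left(\eta^\top T(x)\right) h(x) \, d\mu(x)\right]$$
• We say $p_\eta$ is normalizable if $A(\eta) < \infty$
• Since $e^{A(\eta)}$ is the normalizing constant, $A(\eta)$ is also called the log-normalizing constant.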
Example 2.1
Take $\mu$ to be Lebesgue measure on $\mathbb{R}$, $s = 1$, $h = \mathbf{1}_{(0,\infty)}$ and $T(x) = x$. Then we have
$$A(\eta) = \log \int_0^\infty e^{\eta x} \, dx = \begin{cases} \log(-1/\eta), & \eta < 0 \\ \infty, & \eta \ge 0. \end{cases}$$
What is the corresponding $p_\eta(x)$? What distribution? Is it in the usual form?
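The closed form in Example 2.1 can be sanity-checked numerically. The sketch below is only an illustration, not part of the lecture: it approximates the integral with a midpoint rule on a truncated interval (the helper names, `upper`, and `steps` are assumptions chosen for this example) and compares the result with $\log(-1/\eta)$. It also evaluates the resulting density $p_\eta(x) = -\eta\, e^{\eta x}$ for $x > 0$, which is the Exponential density with rate $-\eta$.

```python
import math


def log_partition_exact(eta):
    """Closed form from Example 2.1: log(-1/eta) for eta < 0, +inf otherwise."""
    if eta >= 0:
        return math.inf
    return math.log(-1.0 / eta)


def log_partition_numeric(eta, upper=100.0, steps=100_000):
    """Midpoint-rule estimate of log of the integral of exp(eta * x) over (0, upper).

    Only meaningful for eta < 0, where truncating at `upper` drops a
    negligible tail; `upper` and `steps` are illustrative choices.
    """
    width = upper / steps
    total = sum(math.exp(eta * (i + 0.5) * width) for i in range(steps))
    return math.log(total * width)


def density(x, eta):
    """p_eta(x) = exp(eta * x - A(eta)) * 1{x > 0}, for eta < 0."""
    if x <= 0:
        return 0.0
    return math.exp(eta * x - log_partition_exact(eta))


for eta in (-0.5, -1.0, -2.0):
    print(eta, log_partition_numeric(eta), log_partition_exact(eta))
```

For $\eta \ge 0$ the integral diverges, so no truncated numerical estimate is attempted there.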
Notes on the natural parameter
The natural parameter space is the set of all normalizable $\eta$:
$$\Xi_1 = \{\eta : A(\eta) < \infty\}$$
We say $\mathcal{P}$ is in canonical form if $\Xi = \Xi_1$. Sometimes we take a strict subset $\Xi \subset \Xi_1$.
Other parameterizations for an exponential family
Take $\eta : \Omega \to \Xi$ and define
$$p_\theta(x) = \exp\left[\eta(\theta)^\top T(x) - B(\theta)\right] h(x), \qquad B(\theta) = A(\eta(\theta))$$
The family $\{p_\theta : \theta \in \Omega\}$ is also called an exponential family.
(Many distributions belong to exponential families (see Wikipedia), but some massaging is often needed to see it.)
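Example 2.2: normal with unknown mean and variance
The normal distribution $\mathcal{N}(\mu, \sigma^2)$, $\mu \in \mathbb{R}$, $\sigma^2 > 0$, has density
$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} = \exp\left[\frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \frac{\mu^2}{2\sigma^2} - \frac{1}{2} \log\left(2\pi\sigma^2\right)\right]$$
We identify
$$\theta = \begin{pmatrix} \mu \\ \sigma^2 \end{pmatrix}, \qquad \eta(\theta) = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix}, \qquad T(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}$$
$$h(x) = 1, \qquad B(\theta) = \frac{\mu^2}{2\sigma^2} + \frac{1}{2} \log\left(2\pi\sigma^2\right)$$
How to write in terms of natural parameters?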
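In terms of the natural parameter $\eta$, the density is
$$p_\eta(x) = \exp\left[\eta^\top \begin{pmatrix} x \\ x^2 \end{pmatrix} - A(\eta)\right]$$
where $\Xi = \{\eta \in \mathbb{R}^2 \mid \eta_2 < 0\}$ and
$$A(\eta) = -\frac{\eta_1^2}{4\eta_2} + \frac{1}{2} \log\left(-\frac{\pi}{\eta_2}\right)$$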
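As a quick consistency check on the two parameterizations of the normal family, the sketch below maps $\theta = (\mu, \sigma^2)$ to $\eta(\theta) = (\mu/\sigma^2, -1/(2\sigma^2))$ and confirms numerically that $A(\eta(\theta)) = B(\theta)$ and that the natural-parameter density agrees with the usual normal density. The function names and the test values are illustrative assumptions, not notation from the lecture.

```python
import math


def eta_from_theta(mu, sigma2):
    """Natural parameter eta(theta) = (mu / sigma^2, -1 / (2 sigma^2))."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)


def B(mu, sigma2):
    """Log-normalizer in the (mu, sigma^2) parameterization."""
    return mu ** 2 / (2.0 * sigma2) + 0.5 * math.log(2.0 * math.pi * sigma2)


def A(eta1, eta2):
    """Log-normalizer in the natural parameterization; requires eta2 < 0."""
    if eta2 >= 0:
        raise ValueError("eta2 must be negative for the density to normalize")
    return -eta1 ** 2 / (4.0 * eta2) + 0.5 * math.log(-math.pi / eta2)


def density_natural(x, eta1, eta2):
    """exp(eta^T (x, x^2) - A(eta)), with h(x) = 1."""
    return math.exp(eta1 * x + eta2 * x ** 2 - A(eta1, eta2))


def density_usual(x, mu, sigma2):
    """Standard form of the N(mu, sigma^2) density."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


mu, sigma2 = 1.5, 0.7  # arbitrary illustrative values
eta1, eta2 = eta_from_theta(mu, sigma2)
print(A(eta1, eta2), B(mu, sigma2))
print(density_natural(0.4, eta1, eta2), density_usual(0.4, mu, sigma2))
```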
$\{p_\eta : \eta \in \Xi\}$ lives inside an $s$-dimensional subspace
It is useful to think of “$\log \{p_\eta : \eta \in \Xi\}$” as a subset of an $s$-dimensional (affine) subspace of the log-density space
• $e^{f_\eta(x)}$ is always proportional to a density if integrable
• For an exponential family, we can write $f_\eta(x) = \log h(x) + \eta^\top T(x)$ (draw a picture)
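The form of an exponential family is not unique
Operations to express the same family
1. Change the common measure so $h(x) = 1$:
$$\mu \rightsquigarrow \tilde{\mu} \quad \text{with} \quad \frac{d\tilde{\mu}}{d\mu} = h$$
2. Reparameterize so $0 \in \Xi$: take $\eta_0 \in \Xi$ and set
$$\eta \rightsquigarrow \tilde{\eta} = \eta - \eta_0, \qquad h \rightsquigarrow \tilde{h} = p_{\eta_0}(x), \qquad A \rightsquigarrow \tilde{A}(\tilde{\eta}) = A(\eta_0 + \tilde{\eta}) - A(\eta_0)$$
3. Reparameterize with an invertible map $\mathbb{R}^s \to \mathbb{R}^s$.
...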
More examples
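Example 2.3: joint density of $n$ i.i.d. normal
Given $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, \sigma^2)$, the joint density is
$$p_\theta(x) = \prod_{i=1}^n \left[\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}\right] = \exp\left\{\sum_{i=1}^n \left[\frac{\mu}{\sigma^2} x_i - \frac{1}{2\sigma^2} x_i^2 - \frac{\mu^2}{2\sigma^2} - \frac{1}{2} \log\left(2\pi\sigma^2\right)\right]\right\}$$
$$\eta(\theta) = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix}, \qquad T(x) = \begin{pmatrix} \sum_i x_i \\ \sum_i x_i^2 \end{pmatrix}, \qquad B(\theta) = n B^{(1)}(\theta)$$
where $B^{(1)}$ denotes $B$ for a single observation.
Ex: in general, the joint density of $n$ i.i.d. random variables from an $s$-parameter exponential family is still an $s$-parameter exponential family with the same parameters.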
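The factorization in Example 2.3 can be checked numerically: the sum of the $n$ individual normal log-densities should equal $\eta(\theta)^\top T(x) - n B^{(1)}(\theta)$ with $T(x) = (\sum_i x_i, \sum_i x_i^2)$. The sketch below does this for a small made-up sample; the function names and data values are assumptions for illustration only.

```python
import math


def joint_log_density_direct(xs, mu, sigma2):
    """Sum of the individual N(mu, sigma^2) log-densities."""
    return sum(
        -((x - mu) ** 2) / (2.0 * sigma2) - 0.5 * math.log(2.0 * math.pi * sigma2)
        for x in xs
    )


def joint_log_density_expfam(xs, mu, sigma2):
    """eta(theta)^T T(x) - n * B1(theta), where T(x) = (sum x_i, sum x_i^2)."""
    eta = (mu / sigma2, -1.0 / (2.0 * sigma2))
    t_stat = (sum(xs), sum(x * x for x in xs))
    b_single = mu ** 2 / (2.0 * sigma2) + 0.5 * math.log(2.0 * math.pi * sigma2)
    return eta[0] * t_stat[0] + eta[1] * t_stat[1] - len(xs) * b_single


sample = [0.3, -1.2, 2.5, 0.9]  # made-up data for illustration
print(joint_log_density_direct(sample, 0.5, 2.0))
print(joint_log_density_expfam(sample, 0.5, 2.0))
```

Note that the data enter the exponential-family form only through the two sums in `t_stat`, regardless of the sample size.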
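Example: binomial
For $X \sim \text{Binomial}(n, \theta)$, $X$ has probability mass function
$$p_\theta(x) = \binom{n}{x} \theta^x (1-\theta)^{n-x} = \left(\frac{\theta}{1-\theta}\right)^x (1-\theta)^n \binom{n}{x} = \exp\left[\log\left(\frac{\theta}{1-\theta}\right) x + n \log(1-\theta)\right] \binom{n}{x}$$
This is a 1-parameter exponential family with
$$T(x) = x, \qquad \eta(\theta) = \log\left(\frac{\theta}{1-\theta}\right)$$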
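A small numerical check of the binomial example is sketched below. Inverting $\eta = \log(\theta/(1-\theta))$ gives $-n\log(1-\theta) = n \log(1 + e^\eta)$, which is used here as $A(\eta)$; this expression is derived for the illustration rather than quoted from the slide. The code compares the exponential-family form of the pmf with the direct formula, and uses a central finite difference (step size chosen arbitrarily) to see that $A'(\eta)$ is close to the binomial mean $n\theta$.

```python
import math


def binom_pmf_direct(x, n, theta):
    """Binomial(n, theta) pmf in its usual form."""
    return math.comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)


def A(eta, n):
    """Log-normalizer n * log(1 + e^eta), i.e. -n * log(1 - theta)."""
    return n * math.log1p(math.exp(eta))


def binom_pmf_expfam(x, n, eta):
    """exp(eta * x - A(eta)) * h(x) with carrier h(x) = C(n, x)."""
    return math.exp(eta * x - A(eta, n)) * math.comb(n, x)


def mean_from_A(eta, n, step=1e-5):
    """Central finite-difference approximation of A'(eta)."""
    return (A(eta + step, n) - A(eta - step, n)) / (2.0 * step)


n, theta = 10, 0.3  # illustrative values
eta = math.log(theta / (1.0 - theta))
print([round(binom_pmf_expfam(x, n, eta), 6) for x in range(n + 1)])
print(mean_from_A(eta, n), n * theta)
```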
Example: Poisson
For $X \sim \text{Poisson}(\lambda)$, $X$ has probability mass function
$$p_\lambda(x) = \frac{\lambda^x e^{-\lambda}}{x!} = \exp\left[\log(\lambda) x - \lambda\right] \frac{1}{x!}$$
This is a 1-parameter exponential family with
$$\eta(\lambda) = \log(\lambda)$$
Ex: try some on Wikipedia: Beta, Gamma, Dirichlet...
Differential Identities
Intuition for getting moments from cgf
Because the density integrates to 1, we always have
$$e^{A(\eta)} = \int e^{\eta^\top T(x)} h(x) \, d\mu(x)$$
Whenever a quantity is in the form of an “integral of an exponential tilt”, we can obtain moments by differentiating both sides.
Be careful: we need to be able to switch the order of differentiation and integration!
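Theorem 2.4 in Keener
Let $\Xi_f$ be the set of values of $\eta \in \mathbb{R}^s$ where
$$\int |f(x)| \exp\left[\eta^\top T(x)\right] h(x) \, d\mu(x) < \infty$$
Then the function
$$g(\eta) = \int f(x) \exp\left[\eta^\top T(x)\right] h(x) \, d\mu(x)$$
is continuous and has continuous partial derivatives of all orders for $\eta \in \Xi_f^\circ$ (the interior of $\Xi_f$).
In particular, taking $f = 1$ gives $g = e^{A}$, so $A(\eta)$ has partial derivatives of all orders on the interior of the natural parameter space.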
Proof sketch in 1-d (Chap. 2.3 in Keener)
We want to differentiate $e^{A(\eta)} = \int \exp\left[\eta T(x)\right] h(x) \, d\mu(x)$ under the integral sign
• It suffices to consider $\eta \in (-3\epsilon, 3\epsilon)$ and compute the derivative at $\eta = 0$
• Idea: use the dominated convergence theorem
• Construct a sequence of difference quotients that converges to the actual derivative
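One standard way to fill in the domination step is sketched below. It assumes that $e^{A(\eta)}$ is finite for all $\eta \in (-3\epsilon, 3\epsilon)$ and takes any sequence $\eta_n \to 0$ with $0 < |\eta_n| \le \epsilon$; the particular constants are chosen for this sketch and may differ from those in Keener's proof.

```latex
\begin{align*}
\frac{e^{A(\eta_n)} - e^{A(0)}}{\eta_n}
  &= \int \frac{e^{\eta_n T(x)} - 1}{\eta_n}\, h(x)\, d\mu(x),
  \qquad \frac{e^{\eta_n T(x)} - 1}{\eta_n} \to T(x) \text{ pointwise}, \\
\left| \frac{e^{\eta_n T(x)} - 1}{\eta_n} \right|
  &\le |T(x)|\, e^{\epsilon |T(x)|}
  && \text{since } |e^{a} - 1| \le |a|\, e^{|a|} \\
  &\le \frac{1}{\epsilon}\, e^{2\epsilon |T(x)|}
  && \text{since } |t| \le e^{\epsilon |t|} / \epsilon \\
  &\le \frac{1}{\epsilon} \left( e^{2\epsilon T(x)} + e^{-2\epsilon T(x)} \right).
\end{align*}
```

The final bound is integrable against $h \, d\mu$ because $\pm 2\epsilon \in (-3\epsilon, 3\epsilon)$, so dominated convergence gives $\frac{d}{d\eta} e^{A(\eta)} \big|_{\eta = 0} = \int T(x) h(x) \, d\mu(x)$.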
Proof:
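What do we get by differentiating $A(\eta)$?
By differentiating once, show that
$$\nabla A(\eta) = \mathbb{E}_\eta\left[T(X)\right]$$
Because
$$\frac{\partial}{\partial \eta_j} e^{A(\eta)} = \frac{\partial}{\partial \eta_j} \int \exp\left[\eta^\top T(x)\right] h(x) \, d\mu(x)$$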
Differentiating twice
By differentiating twice, show that
$$\nabla^2 A(\eta) = \mathrm{Var}_\eta\left[T(X)\right]$$
(the covariance matrix of $T(X)$ when $s > 1$)
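Both identities follow from differentiating $e^{A(\eta)} = \int e^{\eta^\top T(x)} h(x) \, d\mu(x)$ under the integral sign, which Theorem 2.4 permits on the interior of the natural parameter space. The derivation below is a sketch of one way to carry out the "show that" steps; the index notation $j, k$ is introduced here for convenience.

```latex
\begin{align*}
e^{A(\eta)} \frac{\partial A(\eta)}{\partial \eta_j}
  &= \int T_j(x)\, e^{\eta^\top T(x)} h(x)\, d\mu(x)
  \;\Longrightarrow\;
  \frac{\partial A(\eta)}{\partial \eta_j}
  = \int T_j(x)\, p_\eta(x)\, d\mu(x)
  = \mathbb{E}_\eta\left[T_j(X)\right], \\
e^{A(\eta)} \left( \frac{\partial^2 A(\eta)}{\partial \eta_j \partial \eta_k}
  + \frac{\partial A(\eta)}{\partial \eta_j} \frac{\partial A(\eta)}{\partial \eta_k} \right)
  &= \int T_j(x)\, T_k(x)\, e^{\eta^\top T(x)} h(x)\, d\mu(x) \\
  \;\Longrightarrow\;
  \frac{\partial^2 A(\eta)}{\partial \eta_j \partial \eta_k}
  &= \mathbb{E}_\eta\left[T_j(X)\, T_k(X)\right]
   - \mathbb{E}_\eta\left[T_j(X)\right] \mathbb{E}_\eta\left[T_k(X)\right]
  = \mathrm{Cov}_\eta\left(T_j(X), T_k(X)\right).
\end{align*}
```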
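Example: Poisson
$$p_\lambda(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$
$$T(x) = x, \qquad \eta(\lambda) = \log(\lambda), \qquad B(\lambda) = \lambda$$
For the natural parameter $\eta$, $A(\eta) = e^\eta$, so
$$\mathbb{E}_\eta[X] = \frac{d}{d\eta} e^\eta = e^\eta = \lambda$$
$$\mathrm{Var}_\eta[X] = \frac{d^2}{d\eta^2} e^\eta = e^\eta = \lambda$$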
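The Poisson identities can be checked numerically without any symbolic differentiation. The sketch below approximates $A'(\eta)$ and $A''(\eta)$ for $A(\eta) = e^\eta$ by central finite differences and compares them with the mean and variance obtained by summing the pmf directly. The step size, the truncation at 200 terms, and the function names are illustrative assumptions.

```python
import math


def A(eta):
    """Poisson log-normalizer in the natural parameterization."""
    return math.exp(eta)


def derivatives_of_A(eta, step=1e-4):
    """Central finite-difference approximations of A'(eta) and A''(eta)."""
    first = (A(eta + step) - A(eta - step)) / (2.0 * step)
    second = (A(eta + step) - 2.0 * A(eta) + A(eta - step)) / step ** 2
    return first, second


def poisson_mean_var_by_summation(lam, terms=200):
    """Mean and variance from the pmf, truncating the series at `terms`."""
    pmf = math.exp(-lam)  # P(X = 0)
    mean = 0.0
    second_moment = 0.0
    for x in range(terms):
        mean += x * pmf
        second_moment += x * x * pmf
        pmf *= lam / (x + 1)  # recursion P(X = x + 1) = P(X = x) * lam / (x + 1)
    return mean, second_moment - mean ** 2


lam = 3.5  # illustrative value
print(derivatives_of_A(math.log(lam)))
print(poisson_mean_var_by_summation(lam))
```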
Moment-generating function
For $T$ a random vector in $\mathbb{R}^s$, the moment-generating function of $T$ is
$$M_T(u) = \mathbb{E}\left[e^{u^\top T}\right]$$
The cumulant-generating function is
$$K_T(u) = \log\left(M_T(u)\right)$$
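Useful properties of moment-generating function
1. If two random variables have the same moment-generating function, finite in a neighborhood of $u = 0$, then they have the same distribution
2. Moments of $T$, denoted by
$$\mathbb{E}\left[T_1^{r_1} \times \cdots \times T_s^{r_s}\right]$$
can be found by differentiating $M_T$ at $u = 0$:
$$\frac{\partial^{r_1}}{\partial u_1^{r_1}} \cdots \frac{\partial^{r_s}}{\partial u_s^{r_s}} M_T(u) \Big|_{u=0}$$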
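Moment-generating function of exponential family
$$\begin{aligned}
M_\eta^{T(X)}(u) &= \mathbb{E}_\eta\left[e^{u^\top T(X)}\right] \\
&= \int e^{u^\top T} e^{\eta^\top T - A(\eta)} h \, d\mu \\
&= e^{A(\eta+u) - A(\eta)} \underbrace{\int e^{(\eta+u)^\top T - A(\eta+u)} h \, d\mu}_{=1} \\
&= e^{A(\eta+u) - A(\eta)}
\end{aligned}$$
provided $\eta + u$ lies in the natural parameter space. Hence, the cumulant-generating function is
$$K_T(u) = A(u + \eta) - A(\eta)$$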
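The identity $M(u) = e^{A(\eta+u) - A(\eta)}$ can be illustrated with the Poisson family, where $A(\eta) = e^\eta$ and $\eta = \log \lambda$, so the right-hand side becomes $\exp(\lambda(e^u - 1))$. The sketch below compares this with a direct evaluation of $\mathbb{E}[e^{uX}]$ from a truncated pmf series; the truncation length, test values, and function names are assumptions made for the example.

```python
import math


def A(eta):
    """Poisson log-normalizer in the natural parameterization."""
    return math.exp(eta)


def mgf_from_log_normalizer(u, lam):
    """exp(A(eta + u) - A(eta)) with eta = log(lam)."""
    eta = math.log(lam)
    return math.exp(A(eta + u) - A(eta))


def mgf_by_summation(u, lam, terms=200):
    """E[exp(u * X)] for X ~ Poisson(lam), truncating the series at `terms`."""
    pmf = math.exp(-lam)  # P(X = 0)
    total = 0.0
    for x in range(terms):
        total += math.exp(u * x) * pmf
        pmf *= lam / (x + 1)  # P(X = x + 1) from P(X = x)
    return total


for u in (-0.5, 0.3, 1.0):
    print(u, mgf_by_summation(u, 2.0), mgf_from_log_normalizer(u, 2.0))
```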
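Relationship between the moments and cumulants
For $s = 1$, from $M = e^K$, evaluating at $u = 0$ (where $K(0) = 0$) and writing $\kappa_j = K^{(j)}(0)$ for the $j$-th cumulant, we get
$$M' = K' e^K \;\Rightarrow\; \mathbb{E}[T] = \kappa_1$$
$$M'' = \left(K'' + K'^2\right) e^K \;\Rightarrow\; \mathbb{E}[T^2] = \kappa_2 + \kappa_1^2$$
$$M''' = \left(K''' + 3K'K'' + K'^3\right) e^K \;\Rightarrow\; \mathbb{E}[T^3] = \kappa_3 + 3\kappa_1\kappa_2 + \kappa_1^3$$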
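The three moment formulas translate directly into code. The sketch below implements them and tries them on a Poisson($\lambda$) variable, whose cgf $K(u) = \lambda(e^u - 1)$ has every cumulant equal to $\lambda$; the raw moments are cross-checked against a truncated pmf series. Function names and the truncation length are illustrative assumptions.

```python
import math


def raw_moments_from_cumulants(k1, k2, k3):
    """First three raw moments from the first three cumulants (s = 1)."""
    m1 = k1
    m2 = k2 + k1 ** 2
    m3 = k3 + 3 * k1 * k2 + k1 ** 3
    return m1, m2, m3


def poisson_raw_moments_by_summation(lam, terms=200):
    """E[X], E[X^2], E[X^3] for X ~ Poisson(lam) from a truncated pmf series."""
    pmf = math.exp(-lam)  # P(X = 0)
    moments = [0.0, 0.0, 0.0]
    for x in range(terms):
        for power in range(3):
            moments[power] += x ** (power + 1) * pmf
        pmf *= lam / (x + 1)
    return tuple(moments)


lam = 2  # every cumulant of Poisson(lam) equals lam
print(raw_moments_from_cumulants(lam, lam, lam))
print(poisson_raw_moments_by_summation(lam))
```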
Example 2.11: moments of normal
• Unknown $\mu$, but known $\sigma^2$
• Unknown $\mu$ and $\sigma^2$
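For the first case (unknown $\mu$, known $\sigma^2$), one possible route is sketched below. It treats $\mathcal{N}(\mu, \sigma^2)$ as a 1-parameter family with $T(x) = x$ and the $\mathcal{N}(0, \sigma^2)$ density as carrier; this choice of $h$ and the resulting $A$ are worked out for the sketch and are not necessarily the presentation used in Keener.

```latex
\begin{align*}
p_\eta(x) &= \exp\left[\eta x - A(\eta)\right] h(x),
  \qquad \eta = \frac{\mu}{\sigma^2}, \quad
  A(\eta) = \frac{\sigma^2 \eta^2}{2}, \quad
  h(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}}, \\
K(u) &= A(\eta + u) - A(\eta)
  = \sigma^2 \eta u + \frac{\sigma^2 u^2}{2}
  = \mu u + \frac{\sigma^2 u^2}{2}, \\
\kappa_1 &= \mu, \qquad \kappa_2 = \sigma^2, \qquad \kappa_j = 0 \ \text{for } j \ge 3, \\
\mathbb{E}[X] &= \mu, \qquad
\mathbb{E}[X^2] = \sigma^2 + \mu^2, \qquad
\mathbb{E}[X^3] = 3\mu\sigma^2 + \mu^3.
\end{align*}
```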
Proof:
Summary of useful properties of exponential families
$$p_\eta(x) = \exp\left(\eta^\top T(x) - A(\eta)\right) h(x)$$
1. The natural parameter space is convex
2. The joint density of $n$ i.i.d. draws from an exponential family is still in an exponential family
3. $T(x)$ is a sufficient statistic
4. $A(\eta)$ is infinitely differentiable on the interior of the natural parameter space (Theorem 2.4): easy to get moments
What is next?
• Sufficiency
• Factorization theorem
• Minimal sufficiency
Thank you