A generalized moment approach to sharp bounds for conditional expectations
Wouter J.E.C. van Eekelen
Booth School of Business, University of Chicago, Wouter.vanEekelen@chicagobooth.edu

January 2, 2024
In this paper, we address the problem of bounding conditional expectations when moment information of the underlying
distribution and the random event conditioned upon are given. To this end, we propose an adapted version of the
generalized moment problem which deals with this conditional information through a simple transformation. By
exploiting conic duality, we obtain sharp bounds that can be used for distribution-free decision-making under uncertainty.
Additionally, we derive computationally tractable mathematical programs for distributionally robust optimization (DRO)
with side information by leveraging core ideas from ambiguity-averse uncertainty quantification and robust optimization,
establishing a moment-based DRO framework for prescriptive stochastic programming.

Key words: Conditional expectations, sharp bounds, conic duality, prescriptive stochastic programming, distributionally
robust optimization with side information

1. Introduction
Distribution-free performance analysis of stochastic models strives to obtain tight bounds for the expec-
tation of an objective function of random variables, using only limited information about the underlying
probability distributions. Traditionally, given the moment information about the random variables, these
problems are modeled as generalized moment problems. The sharpest (i.e., “best possible”) upper and/or
lower bounds for these objective functions of random variables are found by solving semi-infinite optimiza-
tion problems, where the optimization is taken over all admissible distributions of the random variables. In
this paper, we explore a new situation in which, besides the given moment sequence, we possess stronger
information—that is, we have knowledge that a particular random event will occur, which pertains to the
realizations of the random variables rather than their underlying probability distribution. We are inter-
ested in solving the moment problem with the knowledge of this random event, which may be based on
assumptions, assertions, expert judgment or past observations. Alternatively, we may want to determine
the worst-case behavior of a stochastic model if such a random event occurs. Conditional expectations
offer a way to model this event information and provide us with the best estimate of the expected value
of a function of random variables, knowing that the specified random event will occur. It is of interest
to define a generalized moment framework for this new setting with conditional expectations, instead of
standard expectations in which the random variables are not restricted to take on a subset of the values
in the event set. However, random variables conditioned on a random event give rise to a different prob-
abilistic concept, requiring a distinct type of analysis compared to the traditional theory on generalized
moment problems.

Moment problems have been investigated by probability theorists since the end of the 19th century, see,
e.g., [14, 40, 62, 63]. In its most classical form, the problem of moments is a feasibility problem that aims
to determine if there exists a probability distribution that satisfies the given distributional information.
Chebyshev formulated the problem of finding the optimal bound, given only mean-variance information,
for the tail distribution of a random variable, which his student Markov later solved using continued frac-
tions techniques, ultimately resulting in the well-known Chebyshev inequality. It took until the 1950s and
1960s for interest in this type of problem to be rekindled, which gave rise to the vast literature on Cheby-
shev systems. We refer to the monograph of Karlin and Studden [33] for a comprehensive review. As a
prominent reason for the renewed interest in the generalized moment problem, we highlight the advent of
duality theory as a novel method for solving this problem, as demonstrated in works such as [31, 33, 34].
As summarized in [33], Marshall and Olkin [41] were the first to generalize Chebyshev’s inequality to the
multivariate setting. Since the traditional Chebyshev inequalities using only moment information are con-
sidered too loose for practical purposes, Mallows [38] used structural properties to sharpen these
bounds. In more recent research, the connections between moment problems, nonnegativity of polynomi-
als, and semidefinite programming have been exploited (see, e.g., [9, 35, 47, 48, 50]). Shapiro [56] discussed
the relationship between duality results for generalized moment problems and the more general theory
of conic duality, extending these concepts to more general topological vector spaces. Smith [60] revisited
the generalized moment problems in a contemporary discussion, highlighting its various applications in
decision theory. In his work, Smith briefly mentions the setting with prior information, but notes that the
resulting expectation is no longer linear in the probability measure, thus presenting a more challenging
problem that is not directly amenable to the techniques discussed in his work. As a solution, Smith [60]
suggested a linearization technique, as discussed in [36] and [17], which converts the problem into a series
of regular generalized moment problems. Along the lines of Shapiro [56], we define a novel variant of the
generalized problem of moments and use semi-infinite programming theory to obtain tight bounds in a
setting where the objective function is a fractional, rather than linear, function of the probability distribu-
tion. A comprehensive overview of general semi-infinite programming theory can be found in [30]. To the
best of our knowledge, there are no general techniques available for obtaining tight bounds on conditional
expectations. Although Mallows and Richter [39] provided some bounds for conditional expectations for
traditional power moments, these are only for the expectation of a single random variable and are not
necessarily tight. Moreover, Mallows and Richter [39] primarily used conditional information to sharpen
Chebyshev-type tail probability bounds. In contrast, we provide a general framework for obtaining the
best possible bounds for conditional expectations under limited information.
Extending the ideas described above to the decision-making setting, we are interested in obtaining tight
bounds for the expectation of some objective function of random variables, where the function also con-
tains a decision vector, in order to protect ourselves against the worst nature has to offer. Scarf [55] was
arguably the first to bring the concept of minimax optimization into the area of operations research. Scarf
considered a single-item newsvendor problem where the demand distribution is not precisely known, but
only characterized by limited information. The set of admissible distributions, also known as the ambigu-
ity set, contained all distributions with specified mean and variance. Scarf used duality techniques to find
the worst-case distribution that maximizes the total costs of the newsvendor, and then determined the
order quantity such that these maximal total costs were minimized. Subsequently, generalized moment
problems have been used to solve such minimax optimization problems in a vast amount of literature (see,
for instance, [10, 11, 21, 57, 59, 68]). Minimax optimization evolved into the field of distributionally robust
stochastic optimization, or distributionally robust optimization (DRO). This term was first coined by Delage
and Ye [19], and has since then become the standard terminology in the operations research community.
Distributionally robust optimization is advocated as the unifying paradigm between two distinct fields
for decision-making under uncertainty: robust optimization and stochastic programming. Robust opti-
mization [4, 5] provides an effective way to deal with problems subject to parameter uncertainty through
uncertainty sets. Although encoding uncertainty through these uncertainty sets often results in computa-
tionally tractable problems, the resulting solutions might turn out to yield overly conservative outcomes.
In contrast, stochastic programming captures parameter uncertainty with full distributional information,
but often yields intractable models [58]. Ambiguity sets provide a powerful modeling tool for capturing
distributional information about uncertain parameters in terms of their support and descriptive statistics.
A significant portion of the literature on distributionally robust optimization (DRO) is moment-based, with
partial information given by means, moments, and dispersion measures [19, 29, 51, 66]. As adequately de-
scribed in [66], the tractability of moment-based DRO problems relies on the intricate interplay between
the objective function and the structure of the ambiguity set. Other popular ambiguity sets are based on
statistical-distance measures, which restrict their members to be within a specific distance of a reference
distribution. Examples of these statistical-distance measures include 𝜙-divergences and the Wasserstein
distance (see, e.g., [2, 42]). In the present work, we focus on moment-based information.
Conditional-moment information has been used in various works to describe ambiguity sets. In the
minimax stochastic programming literature, Birge and Wets [11] incorporated such information into the
constraints of generalized moment problems to bound the objective values of stochastic programming
problems. de Klerk et al. [18] restricted the distributions in the ambiguity set to polynomial density func-
tions. They demonstrated that these ambiguity sets are highly expressive because they can conveniently ac-
commodate distributional information about conditional probabilities, conditional moments and marginal
distributions. Chen et al. [16] introduced scenario-wise ambiguity sets that capture information with con-
ditional expectation constraints based on generalized moments. Although these works use conditional
moment information, their objectives aim to maximize the conventional expectation. Consequently, their
approaches cannot directly handle the case in which the conditional expectation is the objective.
Here we should mention some work on distributionally robust optimization in which the objective func-
tion is fractional. When only support information is available, Gorissen [25] extended robust optimization
formulations to fractional programming in which the objective function is a fraction of two functions of the
uncertain parameters. Liu et al. [37] solved a DRO problem with moment constraints which consists of the
maximization of ambiguous fractional functions representing reward-risk ratios. Ji and Lejeune [32] also
investigated this class of fractional DRO problems using semi-infinite programming epigraphic formula-
tions to solve the ambiguous reward-risk ratio problem and, additionally, design a data-driven formulation
and solution framework using the Wasserstein ambiguity set. Using the conditional information, we can
use the conditional-expectation bounds to enhance “worst-case” decision-making in a DRO framework. As
for this setting, there does exist a separate thrust of research named contextual distributionally robust opti-
mization, in which the prior knowledge on the realizations of the random variables is commonly referred to
as side information. This area of research is a natural extension of the prescriptive stochastic programming
paradigm. In this paradigm, the central object of interest is the joint distribution of the side information
and the outcome random variables, which, if known, would result in more accurate estimations of the
outcome variable when conditioning this distribution on the side information given. In practice, however,
this joint distribution is usually not known precisely and is only estimated using a finite data sample. The
ultimate goal of prescriptive stochastic programming is to develop an optimization methodology that uses
the available side information to improve decision-making given only limited insights into the predictive
power of the side information on the uncertain outcome parameters (see, e.g., [1, 7, 61]). The contextual
DRO modeling paradigm assumes that, next to this side information, the joint distribution is contained in
an ambiguity set that is defined using the limited information available. Research on this paradigm is still
relatively scarce. We highlight a number of works. Esteban-Pérez and Morales [23] described ambiguity
through a partial mass transportation problem, and exploited probability-trimming methods to solve the
contextual DRO problem. In contrast, Nguyen et al. [46] worked directly with the optimal transport am-
biguity set, and the authors succeeded in finding tractable conic reformulations for DRO problems with
side information. Nguyen et al. [45] considered a Wasserstein-ball ambiguity set [42], which is centered
on the empirical distribution that follows from the available data sample of the side information and the
outcome parameters. Even though the Wasserstein-ball ambiguity set is a class of distributional ambiguity
sets obtained through the theory of optimal transport, the models and results obtained in [45, 46] behave
qualitatively differently due to special properties of the type-∞ Wasserstein distance, which is used to con-
struct the ambiguity set. The literature described above mostly considers ambiguity sets that are defined
through measures that define distances between probability distributions (such as the Wasserstein metric),
whereas in this paper, we shall focus on ambiguity sets described by generalized moment information.
The main contributions of the paper may be summarized as follows.

1. We expand the theory on semi-infinite programming and generalized moment problems, by deriving
duality results for linear-fractional programming in the semi-infinite setting. We extend several
well-known results from the theory on generalized moment problems in order to bound conditional
expectations, rather than standard expectations of functions of random variables. The fact that most
results carry over to this setting with conditional expectations is particularly intriguing because the
conditional expectation is a nonlinear function of the probability measure (thus not amenable to
standard techniques based on semi-infinite linear programming).

2. We apply these novel results for the generalized conditional-bound problem to univariate functions
of a simple random variable. Using primal-dual arguments, we obtain several closed-form bounds
for different types of dispersion information. In addition to generalized moment information, we
show that structural properties of the distribution can also be included. We further demonstrate our
approach by resolving a minimax optimization problem, taken from the robust monopoly pricing
literature. It is further asserted that most computations, in the univariate setting, are as tractable as
with the linear expectation operator.

3. We use findings from robust uncertainty quantification and distributionally robust convex optimiza-
tion to develop conic reformulations for the multivariate problem. We then apply these reformula-
tions to contextual DRO, presenting a generalized moment framework for distributionally robust
optimization with side information. The resulting framework is designed for conditional decision-
making, incorporating both the side information and the distributional information contained within
the ambiguity set. The computational tractability of the reformulations turns out to be closely re-
lated to that of distributionally robust convex optimization problems with support restrictions on
the random variables.

The remainder of the paper is organized as follows. In Section 2, we introduce the generalized moment
bound problem for conditional expectations and elaborate on the duality approach. Section 3 discusses sev-
eral examples of tight bounds for the conditional expected value of functions of a single random variable.
In Section 4, the moment-based contextual DRO framework is presented. Most proofs of minor results
are deferred to the Appendix. Finally, in Section 5, we conclude and provide several directions for future
research.

2. A duality framework for generalized conditional-bound problems


We first describe the problem of bounding conditional expectations in Section 2.1 and subsequently derive
the associated dual problem in Section 2.2. Then, in Section 2.3, we provide fundamental results that will
be employed in later sections to obtain the desired sharp bounds.

2.1. Problem statement

We aim to find the best upper bound for the conditional expectation of a random vector 𝑋. Let us first
introduce some notation. Let 𝔼 denote the expectation operator, and 𝑔(⋅) denote an arbitrary measurable
function of 𝑋 . The random vector 𝑋 is defined on the support Ω ⊆ ℝ𝑛 , which we assume is a closed set
endowed with the Borel sigma algebra BΩ . The random vector 𝑋 is governed by a probability measure
ℙ ∶ BΩ → [0, 1], such that for a measurable set S ∈ BΩ we have ℙ(S) = ℙ(𝑋 ∈ S). Furthermore, ℙ lies
in some convex set of probability measures P. Throughout the paper, the terms “probability measure” and
“probability distribution” are used interchangeably. We assume that Ξ ∈ BΩ is an arbitrary measurable
event modeling the random event observed, pertaining to the realization of 𝑋 . Let Ξ be a set with strictly
positive measure ℙ(𝑋 ∈ Ξ) > 0 so that 𝔼[𝑔(𝑋) | 𝑋 ∈ Ξ] is well-defined and denotes the conditional expectation
of 𝑔(𝑋) given that 𝑋 takes values in the set Ξ. We now have the necessary notation to develop an adapted version
of the generalized moment problem that incorporates random events or, using different terminology, side
information. The central problem in this paper can be formulated as follows:

        sup_{ℙ∈M+(Ω)} 𝔼ℙ[𝑔(𝑋) | 𝑋 ∈ Ξ]   subject to   𝔼ℙ[ℎ𝑗(𝑋)] = 𝑞𝑗 for 𝑗 = 0, …, 𝑚,        (1)

where M+ (Ω) denotes all nonnegative measures defined on the support Ω, and 𝑔, ℎ0 , … , ℎ𝑚 are real-valued,
measurable functions that model the objective function and the available (generalized) moment informa-
tion. The probability mass constraint is explicitly included as ℎ0 ≡ 1 and 𝑞0 = 1. Let P denote the ambiguity
set that contains the true probability distribution. For a given moment vector 𝐪, define the set
        P(𝐪) ∶= { ℙ ∈ P0(Ω) ∶ ∫ ℎ𝑗(𝑥) dℙ(𝑥) = 𝑞𝑗, 𝑗 = 0, …, 𝑚 },        (2)

which contains all probability distributions that comply with the given moment sequence. Here, we typ-
ically assume ℙ is an element of a set of probability measures P0 (Ω) with support contained in Ω. Thus,
the constraint ℎ0 ≡ 1 is implicitly assumed so as to normalize the measures in M+ (Ω) to obtain probability
distributions. As a closely related concept, we consider the cone of moments 𝐪 ∈ ℝ𝑚 that yield a nonempty
ambiguity set P(𝐪), which can be defined as

Q ∶= {𝐪 ∈ ℝ𝑚 ∶ ∃ℙ ∈ M+ (Ω) such that ℙ ∈ P(𝐪)} .

This set thus contains all moment vectors 𝐪 for which (1) has a solution. That is, the moment constraints
are consistent or, in other words, there exists a probability distribution feasible to the generalized moment
problem. We henceforth assume that the moment constraints are consistent so that the ambiguity set P
is nonempty, and therefore a feasible solution to problem (1) always exists.
Now that we have introduced the required notation and studied the constraints of (1), let us turn to
the objective function. Instead of the regular expectation of a function of a random vector studied in
generalized moment problems, 𝔼[𝑔(𝑋 )], we now study

        𝔼ℙ[𝑔(𝑋) | 𝑋 ∈ Ξ] = ∫_Ξ 𝑔(𝑥) dℚ(𝑥) = 𝔼ℙ[𝑔(𝑋)1Ξ(𝑋)] / 𝔼ℙ[1Ξ(𝑋)],        (3)
in which ℚ denotes the conditional probability measure, given that it exists. Notice that the objective
function is a fractional, and thus nonlinear, function of the probability distribution ℙ. As a result, (1)
belongs to the class of distributionally robust fractional optimization problems, which are fundamentally
more difficult to solve than the standard problem that simply maximizes the expectation of a function (see,
e.g., [32, 37]).

2.2. An equivalent problem and its dual
Problem (1) can be formulated equivalently as a semi-infinite linear-fractional program (LFP). The semi-
infinite LFP reformulation of (1) is given by

        sup_{ℙ⩾0}   ∫Ω 𝑔(𝑥)1Ξ(𝑥) dℙ(𝑥) / ∫Ω 1Ξ(𝑥) dℙ(𝑥)
        s.t.        ∫Ω ℎ𝑗(𝑥) dℙ(𝑥) = 𝑞𝑗,  ∀𝑗 = 0, …, 𝑚,        (4)
                    ℙ(𝑋 ∈ Ξ) > 0,
in which 1Ξ (𝑥) equals 1 if 𝑥 ∈ Ξ, and 0 otherwise. Here the optimization of the linear-fractional objective is
taken over infinite-dimensional variables (i.e., the probability distribution). We further have a finite num-
ber of constraints that describe the moment information. The final constraint ensures that the conditional
expectation is well-defined by avoiding conditioning on a set of measure zero. In the finite setting, these
linear-fractional programs can be reduced to linear programs through a Charnes-Cooper transformation
[13, Theorem 2]. If we generalize this to infinite-dimensional spaces, the Charnes-Cooper transformation
becomes

        dℚ(𝑥)/dℙ(𝑥) = 𝛼,   with   𝛼 = 1 / ∫Ω 1Ξ(𝑥) dℙ(𝑥).        (5)

In some sense, this generalized Charnes-Cooper transformation constitutes a change of measure, from the
original probability measure ℙ to its conditional counterpart ℚ. The variable 𝛼 is a scaling parameter that
essentially models the normalization on the random event.
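To see how the moment constraints transform under (5), note that the change of measure simply rescales every integral against ℙ by the constant 𝛼 ⩾ 0:

        ∫Ω ℎ𝑗(𝑥) dℚ(𝑥) = 𝛼 ∫Ω ℎ𝑗(𝑥) dℙ(𝑥) = 𝛼𝑞𝑗,   while   ∫Ω 1Ξ(𝑥) dℚ(𝑥) = ℙ(𝑋 ∈ Ξ)/ℙ(𝑋 ∈ Ξ) = 1.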
After transformation (5), problem (4) reduces to the semi-infinite linear program (LP)

        sup_{𝛼⩾0, ℚ⩾0}   ∫Ω 𝑔(𝑥)1Ξ(𝑥) dℚ(𝑥)
        s.t.             ∫Ω ℎ𝑗(𝑥) dℚ(𝑥) = 𝛼𝑞𝑗,  ∀𝑗 = 0, …, 𝑚,        (6)
                         ∫Ω 1Ξ(𝑥) dℚ(𝑥) = 1,

where all the right-hand sides of the constraints in (6) are scaled by 𝛼, and the last line ensures that ℚ is
a proper (conditional) distribution when defined on its support Ξ. To determine the dual of (6), one can
employ semi-infinite linear programming duality, as in Section 6 of [30]. It then follows from standard
calculations that the Lagrangian dual of (6) is

        inf_{𝜆0,…,𝜆𝑚+1}   𝜆𝑚+1
        s.t.              ∑_{𝑗=0}^{𝑚} 𝜆𝑗𝑞𝑗 ⩽ 0,        (7)
                          ∑_{𝑗=0}^{𝑚} 𝜆𝑗ℎ𝑗(𝑥) + 𝜆𝑚+1 1Ξ(𝑥) ⩾ 𝑔(𝑥)1Ξ(𝑥),  ∀𝑥 ∈ Ω,
where the dual variables 𝜆0 , … , 𝜆𝑚+1 are associated to each constraint in the primal (6). From this point
on, we use the shorthand notation ℎ𝑚+1 (𝑥) ∶= 1Ξ (𝑥). Notice that the dual problem has a finite number
of decision variables, but an infinite number of constraints. By virtue of weak duality, an upper bound for
(6) follows from a feasible solution to (7). The optimal dual solution to problem (7) yields a viable upper
bound by weak duality, but whether this bound is sharp (i.e., whether strong duality holds) is still an open
question, to which we will seek an answer in the next subsection.
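For completeness, the weak-duality step is the familiar one-line chain: for any (𝛼, ℚ) feasible to (6) and any (𝜆0, …, 𝜆𝑚+1) feasible to (7),

        ∫Ω 𝑔(𝑥)1Ξ(𝑥) dℚ(𝑥) ⩽ ∫Ω ( ∑_{𝑗=0}^{𝑚} 𝜆𝑗ℎ𝑗(𝑥) + 𝜆𝑚+1 1Ξ(𝑥) ) dℚ(𝑥) = 𝛼 ∑_{𝑗=0}^{𝑚} 𝜆𝑗𝑞𝑗 + 𝜆𝑚+1 ⩽ 𝜆𝑚+1,

where the last inequality uses 𝛼 ⩾ 0 and the first dual constraint.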

2.3. Strong duality of the generalized conditional-bound problem


The purpose of this section is twofold: we demonstrate that (7) is strongly dual to (1), also extending this
result to more general, convex sets of probability measures P0 that might model structural properties
such as symmetry and unimodality, and we show that (1) can be reduced to a finite-dimensional problem
in which one optimizes over a parametric family of distributions.
Since the objective function of (1) is nonlinear with respect to the expectation operator, it is not im-
mediately clear how to pass down sufficient conditions regarding strong duality for generalized moment
problems to the setting with conditional expectations. Therefore, in order to prove our main results, we use
an alternative formulation of problem (4). The equivalence of this formulation is provided by the following
lemma, which strongly hinges on the relation between fractional and parametric nonlinear programming
[20, 26].

Lemma 1. Suppose that problem (4) has a finite optimal value. Then problem (4) is equivalent to

        inf_{𝜏∈ℝ}   𝜏
        s.t.        sup_{ℙ∈P} 𝔼ℙ[𝑔(𝑋)1Ξ(𝑋) − 𝜏1Ξ(𝑋)] ⩽ 0.        (8)

That is, the optimal value 𝜏∗ agrees with that of (1), and the suprema in both (1) and (8) are achieved exactly
by the same extremal distribution ℙ∗, or achieved asymptotically by an identical sequence of distributions
{ℙ∗𝑘}𝑘⩾1.
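As a computational aside (our own sketch, not part of the original analysis): since assumption (A2) below guarantees 𝔼ℙ[1Ξ(𝑋)] ⩾ 𝜖 > 0, the constraint function 𝜏 ↦ supℙ∈P 𝔼ℙ[𝑔(𝑋)1Ξ(𝑋) − 𝜏1Ξ(𝑋)] is strictly decreasing, so 𝜏∗ can be located by bisection whenever the inner supremum can be evaluated by a solver for standard moment problems (e.g., the semidefinite reformulation of Section 3.3). A minimal Julia sketch, with inner_sup a hypothetical user-supplied oracle:

    # Dinkelbach-style bisection for τ* in (8); inner_sup(τ) is assumed to return
    # sup_{P ∈ 𝒫} E_P[g(X)·1_Ξ(X) − τ·1_Ξ(X)], e.g., computed via a moment-problem SDP.
    function conditional_bound(inner_sup; lo = -1e3, hi = 1e3, tol = 1e-8)
        while hi - lo > tol
            τ = (lo + hi) / 2
            # the constraint function is decreasing in τ, so bisection applies
            inner_sup(τ) <= 0 ? (hi = τ) : (lo = τ)
        end
        return hi   # smallest τ with inner_sup(τ) ⩽ 0, i.e., the sharp bound
    end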

This result is closely related to Proposition 2.1 in [37], which deals in a similar fashion with a class of
fractional DRO problems. It is evident that (8) is significantly easier to work with than the original semi-
infinite formulation (4) since the problem is linear with respect to ℙ, rather than linear-fractional. This
can be seen from the constraints of (8) in which the expectation 𝔼ℙ [⋅] appears linearly, whereas in (3), we
have a fraction of expectation operators. From the duality theory of moment problems, solving problem
(8) turns out to be equivalent to solving the dual (7). As a consequence, solving (4) will also be equivalent
to solving (7) due to the equivalence between (4) and (8). We next show strong duality holds. To this end,
we make the following assumptions:

(A1) The function 𝑔(𝑥) is bounded on the support Ω.

(A2) There exists an 𝜖 > 0 such that

        inf_{ℙ∈P} 𝔼ℙ[1Ξ(𝑋)] ⩾ 𝜖.

(A3) The Slater condition holds; that is, 𝐪 ∈ int(Q), where “int” denotes the interior of a set.

Assumption (A1) is standard and could be relaxed. It is satisfied, for example, when Ω is a compact set
and 𝑔(𝑥) is continuous, by virtue of Weierstrass’ Extreme Value Theorem. This assumption is used to
guarantee that the optimal value of (4) is finite. Assumption (A2) provides a sufficient condition for the
conditional expectation to be well-defined, as it avoids conditioning on a set of measure zero. Assumption
(A3) constitutes a Slater-type condition on the moments of 𝑋 , which is standard for generalized moment
problems; see, for example, Proposition 3.4 in [56]. Under these regularity conditions, we show strong
duality holds for a general, convex set of probability distributions P0 , possibly endowed with structural
properties (e.g., symmetry and unimodality). We will focus on structural ambiguity sets P0 (Ω) that possess
a mixture representation. In other words, we assume P0 can be “generated” by a convenient class of
distributions, say T, such that every distribution ℙ ∈ P0 can be written as a mixture (i.e., an infinite convex
combination) of the extremal distributions (i.e., the extreme points of the convex set P0 ) that constitute T.
For every Borel set S ∈ BΩ , it should thus hold that

        ℙ(𝑋 ∈ S) = ∫_T 𝕋𝐭(𝑋 ∈ S) d𝕄(𝐭),

where 𝕋𝐭 ∈ T ⊆ P0 is a parameterized representation of the family of extremal distributions of P0, and
𝕄 represents the mixture distribution that generates ℙ from the extremal distributions in T. This finite-
dimensional parameterization of the family of extremal distributions will prove useful when determining
the optimal bounds. For a thorough discussion on these structural ambiguity sets in the context of DRO,
we refer to the work of Popescu [50]. We use these general ambiguity sets with structural properties when
formulating our main results.
Finally, before formulating our main result, we introduce some final technical notation from conic du-
ality theory for generalized moment problems (see, e.g., [50, 56]). Denote by A = co(P0 ) the cone of
measures A generated by the set of probability distributions P0 . Define its dual cone as A∗ ∶= {ℎ ∈
H ∶ ∫Ω ℎ(𝑥) dℙ(𝑥) ⩾ 0, ∀ℙ ∈ A}, where H is the linear space of functions formed by combinations of
𝑔, ℎ0 , … , ℎ𝑚 , and the spaces of functions and measures are paired by the integral operator. We now have the
necessary preliminaries to demonstrate strong (conic) duality. Lemma 1, in conjunction with assumptions
(A1)–(A3), paves the way for us to formulate the main results.

Theorem 1 (Strong conic duality). Suppose that assumptions (A1)–(A3) hold. Then, the optimal value of
the primal problem (1) is finite and equals that of its dual problem

        inf_{𝜆0,…,𝜆𝑚+1}   𝜆𝑚+1
        s.t.              ∑_{𝑗=0}^{𝑚} 𝜆𝑗𝑞𝑗 ⩽ 0,        (9)
                          ∑_{𝑗=0}^{𝑚+1} 𝜆𝑗ℎ𝑗(𝑥) − 𝑔(𝑥)1Ξ(𝑥) ∈ A∗,

in which A∗ is the dual cone of A.

Proof. The assumptions ensure that, for all ℙ ∈ P,


        |𝔼ℙ[𝑔(𝑋)1Ξ(𝑋)] / 𝔼ℙ[1Ξ(𝑋)]| ⩽ (1/𝜖) |𝔼ℙ[𝑔(𝑋)1Ξ(𝑋)]| ⩽ (1/𝜖) sup_{ℙ∈P} |𝔼ℙ[𝑔(𝑋)1Ξ(𝑋)]| ⩽ (1/𝜖) sup_{𝑥∈Ω} |𝑔(𝑥)| < ∞.
Hence, instead of (1), it is equivalent to consider (8). Since the constraints of this problem are linear in the
probability distribution ℙ, standard conic duality for generalized moment problems suffices, which holds
under the Slater-type condition, as defined in [56]. Notice that if we substitute 𝜏 with 𝜆𝑚+1 and incorporate
the dual problem in the constraint of (8), then (8) is equivalent to (9). For A = M+ , the strong conic dual
problem (9) reduces to the semi-infinite LP (7). The result for general cones of measures follows from conic
duality arguments as in Theorem 3.1 of [50].

As a consequence, duality enables us to reduce the primal problem, which has infinite-dimensional vari-
ables, to a dual problem with 𝑚 + 1 variables, but with an infinite number of constraints. These constraints
are indexed by the probability measures, i.e., the constraints should hold ∀ℙ ∈ P0 . This indexation might
turn out to be difficult. However, this difficulty can be greatly reduced if we instead use the generating set
T, as shown in [50]. In this case, the dual cone can be reduced to A∗ = {ℎ ∈ H ∶ ∫Ω ℎ(𝑥) dℙ(𝑥) ⩾ 0, ∀ℙ ∈
T}, and hence, the indexing now only runs over the set of extreme points of P0 .
Another classical result states that the semi-infinite LP that models a generalized moment problem can be
reduced to an equivalent finite-dimensional problem with the same optimal value. The Richter-Rogosinski
Theorem (see, e.g., [27, 53, 60]) states that there exists an extremal distribution for the semi-infinite LP
with at most 𝑚 + 1 support points. Analogous to the basic solutions for conventional semi-infinite linear
programming, we define a basic distribution as a convex combination of extremal distributions of P0 . As
there are 𝑚 + 1 moment functions, these basic distributions consist of at most 𝑚 + 1 extreme points (i.e.,
elements of T). We let D(𝐪) denote the set of basic distributions that comply with the given moment
sequence 𝐪. We can then state the following result, which is an adaptation of the fundamental theorem for
convex classes of distributions, but now for the generalized conditional-bound problem.

Theorem 2 (Fundamental theorem for conditional expectations). Consider problem (1). Under as-
sumptions (A1)–(A3),
        sup_{ℙ∈P(𝐪)} 𝔼ℙ[𝑔(𝑋) | 𝑋 ∈ Ξ] = sup_{ℙ∈D(𝐪)} 𝔼ℙ[𝑔(𝑋) | 𝑋 ∈ Ξ].

Moreover, if the optimal value is attained, then there exists an optimal basic distribution, a convex combination
of 𝑚 + 1 probability distributions from the generating set T, that achieves this value.

Proof. Once again, under the stated assumptions, it suffices to consider the equivalent problem (8). Since
the constraints of (8) are linear in ℙ, standard generalized moment problem results apply. Hence, The-
orem 6.1 in [50] is applicable and yields the desired result. This is underpinned by Lemma 1, which es-
tablishes that the suprema in both (1) and (8) are attained either by the same extremal distribution or
asymptotically achieved by an identical sequence of distributions.

This theorem states that, even if the bound is not achieved, it is sufficient to consider only the basic
feasible distributions in D(𝐪) to determine the supremum. This result holds for general convex classes of
distributions P with the optimal distributions taken as convex combinations of the extremal distributions
in T. We further remark that both theorems also hold even when the supremum of the conditional expec-
tation grows infinitely large. In such cases, there exists a maximizing sequence of probability measures
(also taken from the set of basic distributions) for which the conditional expectation diverges. The proof
of Lemma 1 demonstrates that (A2) is sufficient for the equivalence of the two formulations. Assumption
(A1) primarily ensures the finiteness of the optimal value. Without (A1) the primal problem may lead to
an unbounded solution, rendering the dual problem infeasible by weak duality. Combined, the concept
of weak duality described in Section 2.2 and the reduction to the basic distributions generated by T as
proposed in Theorem 2 provide an effective way of solving problem (1), as will be demonstrated in the
next section.

3. Tight bounds for conditional expectations

In this section, we study several easy examples for the case with 𝑛 = 1; that is, 𝑋 is a random variable
conditioned on itself. First, in Section 3.1, we seek the best possible bounds for conditional expectation
𝔼[𝑋 | 𝑋 ⩾ 𝑡] when mean-dispersion information, and possibly structural properties of the underlying dis-
tribution, are given. We find the tight bounds using primal-dual arguments. In Section 3.2, we demonstrate
this also works for arbitrary choices of 𝑔(𝑥), using an example from the robust pricing literature. Finally,
Section 3.3 shows that, in the univariate setting, tight bounds can be obtained by solving semidefinite
programming problems.

3.1. Simple examples for mean-dispersion information

For the sake of exposition, we concentrate our efforts on the event Ξ = {𝑋 ⩾ 𝑡}. We thus seek to bound
the conditional expectation

        𝔼ℙ[𝑔(𝑋) | 𝑋 ⩾ 𝑡] = ∫Ω 𝑔(𝑥)1Ξ(𝑥) dℙ(𝑥) / ∫Ω 1Ξ(𝑥) dℙ(𝑥),
in which 1Ξ(𝑥) is the indicator function modeling the occurrence of the event {𝑋 ⩾ 𝑡} and ℙ is the
underlying probability distribution, which we assume lies in the mean-variance ambiguity set P(𝜇,𝜎)
that contains all distributions complying with the available mean-variance information. Then, the
problem of interest can be stated as

        sup_{ℙ⩾0}   ∫Ω 𝑔(𝑥)1Ξ(𝑥) dℙ(𝑥) / ∫Ω 1Ξ(𝑥) dℙ(𝑥)
        s.t.        ∫Ω dℙ(𝑥) = 1,   ∫Ω 𝑥 dℙ(𝑥) = 𝜇,   ∫Ω 𝑥² dℙ(𝑥) = 𝜎² + 𝜇²,        (10)

which is a semi-infinite LFP. Through the generalized Charnes-Cooper transformation, introduced in Sec-
tion 2.2, it is possible to write (10) as

        sup_{𝛼⩾0, ℚ⩾0}   ∫Ω 𝑔(𝑥)1Ξ(𝑥) dℚ(𝑥)
        s.t.             ∫Ω dℚ(𝑥) = 𝛼,   ∫Ω 𝑥 dℚ(𝑥) = 𝛼𝜇,   ∫Ω 𝑥² dℚ(𝑥) = 𝛼(𝜎² + 𝜇²),   ∫Ω 1Ξ(𝑥) dℚ(𝑥) = 1,        (11)

where 𝛼 is a decision variable modeling the reciprocal of the “worst-case” probability of 𝑋 exceeding 𝑡. The semi-infinite
linear programming dual of (11) is given by

        inf_{𝜆0,𝜆1,𝜆2,𝜆3}   𝜆3
        s.t.                𝜆0 + 𝜆1𝜇 + 𝜆2(𝜎² + 𝜇²) ⩽ 0,        (12)
                            𝜆0 + 𝜆1𝑥 + 𝜆2𝑥² + 𝜆3 1Ξ(𝑥) ⩾ 𝑔(𝑥)1Ξ(𝑥),  ∀𝑥 ∈ ℝ,

which, for this specific choice of random event Ξ, results in

        inf_{𝜆0,𝜆1,𝜆2,𝜆3}   𝜆3
        s.t.                𝜆0 + 𝜆1𝜇 + 𝜆2(𝜎² + 𝜇²) ⩽ 0,
                            𝜆0 + 𝜆1𝑥 + 𝜆2𝑥² ⩾ 0,  ∀𝑥 < 𝑡,
                            𝜆0 + 𝜆1𝑥 + 𝜆2𝑥² + 𝜆3 ⩾ 𝑔(𝑥),  ∀𝑥 ⩾ 𝑡.

Consider the standard conditional expectation (i.e., consider the function 𝑔 ∶ 𝑥 ↦ 𝑥). We next try to find
feasible solutions to the dual problem, and prove optimality by finding primal solutions with matching
objective values. Let 𝑡 < 𝜇. The dual problem of supℙ∈P(𝜇,𝜎) 𝔼[𝑋 | 𝑋 ⩾ 𝑡] is given by

        inf_{𝜆0,𝜆1,𝜆2,𝜆3}   𝜆3
        s.t.                𝜆0 + 𝜆1𝜇 + 𝜆2(𝜎² + 𝜇²) ⩽ 0,
                            𝜆0 + 𝜆1𝑥 + 𝜆2𝑥² ⩾ 0,  ∀𝑥 < 𝑡,        (13)
                            𝜆0 + 𝜆1𝑥 + 𝜆2𝑥² ⩾ 𝑥 − 𝜆3,  ∀𝑥 ⩾ 𝑡.

Denote the left-hand side of the constraints by 𝑀(𝑥) ∶= 𝜆0 + 𝜆1𝑥 + 𝜆2𝑥². The function 𝑀(𝑥) is dual
feasible when it is greater than or equal to 0 for 𝑥 < 𝑡 and greater than or equal to 𝑥 − 𝜆3 for 𝑥 ⩾ 𝑡.

[Figure 1: The two candidate dual parabolas 𝑀1(𝑥) and 𝑀2(𝑥) plotted against 1{𝑥⩾𝑡}(𝑥 − 𝜆3), with the tangency point 𝑥0 marked.]

The first dual function, 𝑀1 (𝑥), does not admit a feasible solution. Since 𝜎 > 0, from 𝜆0 +𝜆1 𝜇+𝜆2 (𝜎 2 +𝜇2 ) ⩽
0 it follows that 𝑀(𝜇) = 𝜆0 + 𝜆1 𝜇 + 𝜆2 𝜇2 < 0, but for the minimizer 𝑥 ∗ = −𝜆1 /(2𝜆2 ), 𝑀(𝑥 ∗ ) ⩾ 0. Thus, this
case is infeasible.
The second parabola, 𝑀2(𝑥), does admit a feasible solution. This function coincides with the function
𝑔(⋅) at 𝑡 and some point 𝑥0. Since 𝑀2(𝑡) = 0, 𝑀2(𝑥0) = 𝑥0 − 𝜆3 and 𝑀2′(𝑥0) = 1,

        𝜆0 = −𝑡(𝜆3(𝑡 − 2𝑥0) + 𝑥0²)/(𝑡 − 𝑥0)²,   𝜆1 = (𝑡² + 𝑥0(𝑥0 − 2𝜆3))/(𝑡 − 𝑥0)²,   𝜆2 = (𝜆3 − 𝑡)/(𝑡 − 𝑥0)².        (14)

Hence, this yields an optimization problem in two variables, min𝑥0 ,𝜆3 𝜆3 , with the additional constraint
𝜆0 + 𝜆1 𝜇 + 𝜆2 (𝜎 2 + 𝜇2 ) ⩽ 0, in which we substitute (14). If this constraint is tight, the dual problem can be

reduced to

        min_{𝜆3,𝑥0} 𝜆3 ≡ min_{𝑥0}  [(𝑡² + 𝑥0²)𝜇 − 𝑡(𝑥0² + 𝜇² + 𝜎²)] / [(𝑡 − 𝜇)(𝑡 − 2𝑥0 + 𝜇) − 𝜎²].
Optimizing over 𝑥0 , it then follows that

        𝑥0∗ = 𝜇 + 𝜎²/(𝜇 − 𝑡),   𝜆3∗ = 𝜇 + 𝜎²/(𝜇 − 𝑡),

with 𝜆3∗ a feasible upper bound for 𝔼[𝑋 |𝑋 ⩾ 𝑡]. To prove this bound is optimal, we construct a distribution
that (asymptotically) achieves 𝜆3∗ . From complementary slackness (see, e.g., [60]), we deduce that the
candidate distribution has all of its probability mass on two points: 𝑡⁻ and 𝑥0∗. Solving the system of
moment constraints in (10) yields the probabilities

        𝑝𝑡⁻ = 𝜎² / (𝜎² + (𝜇 − 𝑡)²),   𝑝𝑥0∗ = (𝜇 − 𝑡)² / (𝜎² + (𝜇 − 𝑡)²).

Indeed, for this two-point distribution,

        𝔼[𝑋 | 𝑋 ⩾ 𝑡] = 𝑥0∗ ⋅ 𝑝𝑥0∗ / ℙ(𝑋 ⩾ 𝑡) = 𝑥0∗ ⋅ 𝑝𝑥0∗ / 𝑝𝑥0∗ = 𝑥0∗ = 𝜇 + 𝜎²/(𝜇 − 𝑡),

ensuring the upper bound is tight. Hence, by weak duality,

        sup_{ℙ∈P(𝜇,𝜎)} 𝔼[𝑋 | 𝑋 ⩾ 𝑡] = 𝜇 + 𝜎²/(𝜇 − 𝑡).        (15)
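As a quick numerical sanity check (our own illustration in Julia, not from the original text), one can verify that the two-point distribution constructed above matches the prescribed moments and attains the bound (15):

    μ, σ, t = 1.0, 0.5, 0.0                  # any instance with t < μ
    x0  = μ + σ^2 / (μ - t)                  # support point x0*, equal to bound (15)
    pt  = σ^2 / (σ^2 + (μ - t)^2)            # mass placed just below t
    px0 = (μ - t)^2 / (σ^2 + (μ - t)^2)      # mass placed at x0*
    @assert pt + px0 ≈ 1                     # total probability mass
    @assert pt*t + px0*x0 ≈ μ                # first moment
    @assert pt*t^2 + px0*x0^2 ≈ σ^2 + μ^2    # second moment
    # conditional expectation: only x0* lies in {X ⩾ t}, so E[X | X ⩾ t] = x0*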

Notice that this bound is not actually attained, but it is achieved in the limit; that is, for 𝑡𝑘 = 𝑡 − 1/𝑘, as
𝑘 → ∞, this construction of the extremal distribution indeed gives the desired result. For 𝑡 ⩾ 𝜇, it can be
shown that

        sup_{ℙ∈P(𝜇,𝜎)} 𝔼[𝑋 | 𝑋 ⩾ 𝑡] = ∞,

since we can construct the following sequence of (maximizing) measures:

        ℙ𝑘 = (1/(𝑘²𝜎² + 1)) 𝛿_{𝜇+𝑘𝜎²} + (𝜎²/(𝜎² + 1/𝑘²)) 𝛿_{𝜇−1/𝑘}.

Letting 𝑘 → ∞ then results in 𝔼ℙ𝑘 [𝑋 | 𝑋 ⩾ 𝑡] → ∞. Taken together, we obtain the following result.

Proposition 1. For a real-valued random variable 𝑋 with distribution ℙ ∈ P(𝜇,𝜎), it holds that

        sup_{ℙ∈P(𝜇,𝜎)} 𝔼ℙ[𝑋 | 𝑋 ⩾ 𝑡] = 𝜇 + 𝜎²/(𝜇 − 𝑡)  for 𝑡 < 𝜇,   and   ∞  for 𝑡 ⩾ 𝜇.        (16)

A number of interesting observations can be drawn from this result. First, note that the maximizing
sequence for the second case, {ℙ𝑘 }, converges weakly to 𝛿𝜇 , which is not included in the ambiguity set.
Notice also that the solution to the second case becomes “uninformative,” as it diverges for values of 𝑡 ⩾ 𝜇.
Degenerate behavior like this also holds for different ambiguity sets, as we will see in later examples.

This result confirms tightness of the Mallows and Richter bound for conditional expectations under mean-
variance information, stated in [39]. Further, notice that the worst-case distribution that achieves the upper
bound matches the extremal distribution that yields the Cantelli lower bound for the tail probability, as
also shown in [24]. The authors of the latter work provide a constructive proof of tightness using the
extremal distribution that achieves the Cantelli bound. Despite Proposition 1 being a known result, this
is the first time it has been proven through a duality argument that provides immediate insight into the
extremal distributions.
The method discussed above can be applied to different types of dispersion information, not only the
traditional variance. For example, assume that instead of the variance, we consider the mean absolute
deviation from the mean (MAD), 𝑑 ∶= 𝔼|𝑋 − 𝜇|, as the measure of dispersion. Let P(𝜇,𝑑) denote the mean-
MAD ambiguity set, with the additional constraint that the support of 𝑋 is (a subset of) the interval [𝑎, 𝑏].
We can then prove the following result.

Proposition 2. For a real-valued random variable 𝑋 with distribution ℙ ∈ P(𝜇,𝑑), it holds that

        sup_{ℙ∈P(𝜇,𝑑)} 𝔼ℙ[𝑋 | 𝑋 ⩾ 𝑡] = 𝜇 + 𝑑(𝜇−𝑡)/(2(𝜇−𝑡) − 𝑑)  for 𝑡 < 𝜇 − 𝑑(𝑏−𝜇)/(2(𝑏−𝜇) − 𝑑),   and   𝑏  for 𝑡 ⩾ 𝜇 − 𝑑(𝑏−𝜇)/(2(𝑏−𝜇) − 𝑑).        (17)

Again, we see that the second case becomes uninformative, as it simply reduces to the robust solution
(i.e., it agrees with the upper bound of the support). It is worth noting here the interplay between the size
of the ambiguity set and the set that describes the random event/side information. Despite having only a
limited number of distributions to choose from, if the realizations of the random variable are limited to too
small an interval, the bounds become overly conservative, as an extremal distribution can be constructed
for which the moment constraints are satisfied, yet the support point on Ξ can be made arbitrarily large,
bounded, of course, by the upper bound of the support. For the mean-MAD ambiguity set, the extremal dis-
tribution also agrees with the distribution attaining the lower bound on the corresponding tail probability
[54].
Proposition 2 can be generalized further. Assume that the dispersion information is modeled through
the expectation of a convex function 𝑑(⋅) of the random variable 𝑋 , defined as 𝜎̄ ∶= 𝔼[𝑑(𝑋 )]. We can state
the following result for such arbitrary convex dispersion measures.

Proposition 3. Suppose that there exists a solution 𝑥0∗ ∈ (𝑡, ∞) to the equation

        (𝜎̄𝑡 − 𝜇𝑑(𝑡))/(𝑡𝑑(𝑥0) − 𝑥0𝑑(𝑡)) + (𝜇𝑑(𝑥0) − 𝜎̄𝑥0)/(𝑡𝑑(𝑥0) − 𝑥0𝑑(𝑡)) = 1,

such that the corresponding two-point distribution, with support {𝑡, 𝑥0∗}, is feasible. Then, for a real-valued
random variable 𝑋 with distribution ℙ ∈ P(𝜇,𝜎̄), it holds that

        sup_{ℙ∈P(𝜇,𝜎̄)} 𝔼ℙ[𝑋 | 𝑋 ⩾ 𝑡] = 𝑥0∗.        (18)
Proposition 3 covers a wide range of dispersion measures, not limited to only variance and MAD. It also
incorporates asymmetric measures of dispersion, such as semivariance, semimean deviations, and partial
moments. More generally, it encompasses all dispersion measures that are modeled using piecewise convex
functions. Since the 𝑝-norms on ℝ are convex, these are naturally included in the category of convex
dispersion measures as well. Another notable function that falls into this class is the Huber-loss function,
which has been extensively studied in the field of robust statistics.
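As a quick check of Proposition 3 (our own illustration, assuming the Roots.jl package is available): taking the quadratic dispersion 𝑑(𝑥) = 𝑥² with 𝜎̄ = 𝔼[𝑑(𝑋)] = 𝜎² + 𝜇² should recover the mean-variance bound of Proposition 1, and indeed it does:

    using Roots

    μ, σ, t = 1.0, 0.5, 0.0
    d(x) = x^2
    σ̄ = σ^2 + μ^2
    # the equation of Proposition 3, multiplied through by t*d(x0) − x0*d(t)
    eqn(x0) = (σ̄*t - μ*d(t)) + (μ*d(x0) - σ̄*x0) - (t*d(x0) - x0*d(t))
    x0 = find_zero(eqn, (μ, μ + 10σ))   # bracket the root to the right of the mean
    @assert x0 ≈ μ + σ^2 / (μ - t)      # agrees with Proposition 1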
As explained earlier, using only moment information often leads to overly conservative bounds and
pathological worst-case distributions. We require additional assumptions about the distribution’s shape
to sharpen the bounds and avoid the pathological two-point distributions that constitute the worst-case
scenario in the previous examples. We next study two such structural properties, i.e., symmetry and uni-
modality. The random variable 𝑋 is said to admit a symmetric distribution about a point 𝑚 if ℙ(𝑋 ∈
[𝑚 − 𝑥, 𝑚]) = ℙ(𝑋 ∈ [𝑚, 𝑚 + 𝑥]) for all 𝑥 ⩾ 0. A random variable 𝑋 has a unimodal distribution with mode
𝑚 if its distribution function is a concave function on (−∞, 𝑚] and convex on (𝑚, ∞). Both definitions are
generalized so that they admit probability distributions that allow for point masses at 𝑚. We next consider
a distribution that is symmetric about its mean 𝜇 with the values of the mean and variance given. Making
use of primal-dual arguments, we obtain the following result.
Proposition 4. For a real-valued random variable 𝑋 with a symmetric distribution ℙ ∈ P^sym_(𝜇,𝜎), it holds that

        sup_{ℙ∈P^sym_(𝜇,𝜎)} 𝔼ℙ[𝑋 | 𝑋 ⩾ 𝑡] = 𝜇 + (𝜇−𝑡)𝜎²/(2(𝑡−𝜇)² − 𝜎²)  for 𝑡 < 𝜇 − 𝜎;   𝜇 + 𝜎  for 𝜇 − 𝜎 ⩽ 𝑡 < 𝜇;   ∞  for 𝑡 ⩾ 𝜇.        (19)
Observe that the bounds are sharper than the bound in Proposition 1, but still vacuous for 𝑡 ⩾ 𝜇.
Combining the notions of symmetry and unimodality yields the following, even tighter, bounds:
Proposition 5. For a real-valued random variable 𝑋 with a symmetric, unimodal distribution ℙ ∈ P^uni_(𝜇,𝜎), it
holds that

        sup_{ℙ∈P^uni_(𝜇,𝜎)} 𝔼ℙ[𝑋 | 𝑋 ⩾ 𝑡] =
                [4𝜇(𝑥0∗)³ − 3𝜎²(𝑡 + 𝑥0∗ − 𝜇)(𝑡 − 𝑥0∗ + 𝜇)] / [4(𝑥0∗)³ − 6𝜎²(𝑡 + 𝑥0∗ − 𝜇)],  for 𝑡 < 𝜇 − √(3(2√2 + 1)/7) 𝜎,
                (1/2)(𝜇 + 𝑡 + √3𝜎),  for 𝜇 − √(3(2√2 + 1)/7) 𝜎 ⩽ 𝑡 < 𝜇,        (20)
                ∞,  for 𝑡 ⩾ 𝜇,

where 𝑥0∗ is the real-valued solution to the quartic equation

        6𝜎²𝑥0²(3(𝑡 − 𝜇)² − 𝑥0²) + 9𝜎⁴(𝜇 − 𝑡 − 𝑥0)² = 0,

which satisfies the condition 𝑥0∗ ⩾ √3𝜎.
Although Proposition 5 does not provide a closed-form solution, it does demonstrate the versatility of the
primal-dual arguments used to derive it. It further highlights that structural properties can be addressed
in conjunction with moment information, even for conditional-bound problems.

3.2. A robust pricing example

To demonstrate the primal-dual approach for an alternative objective function 𝑔(𝑥), we next turn our
attention to a specific minimax problem. We consider the objective of the robust monopoly-pricing problem;
see, e.g., [15, 22, 24]. This involves evaluating the revenue function

        Π(𝑝) ∶= 𝔼[𝑝1{𝑋⩾𝑝}] = 𝑝ℙ(𝑋 ⩾ 𝑝),

which models the expected revenue that a seller of a single item receives when the price posted is equal to
𝑝, and the valuation of customers is distributed as 𝑋 . As in [24], we will attempt to minimize the maximum
relative regret by posting the minimax selling price 𝑝∗ ; that is, we solve
        min_𝑝 max_{ℙ∈P}   max_𝑧 𝔼ℙ[Π(𝑧)] / 𝔼ℙ[Π(𝑝)].
We use the “min” and “max” operators only to avoid notational clutter, as it does not imply that the optima
are actually attained. Chen et al. [15] present various results for robust monopoly pricing and also consider

this relative regret criterion. In [15] an almost identical objective function is considered, namely,

        min_𝑝 max_ℙ { 1 − 𝔼ℙ[Π(𝑝)] / max_𝑧 𝔼ℙ[Π(𝑧)] } = 1 − max_𝑝 min_ℙ  𝔼ℙ[Π(𝑝)] / max_𝑧 𝔼ℙ[Π(𝑧)].

We, however, work with the reciprocal


        min_𝑝 max_ℙ  (max_𝑧 𝔼ℙ[Π(𝑧)]) / 𝔼ℙ[Π(𝑝)] = min_𝑝 max_ℙ max_𝑧  𝔼ℙ[Π(𝑧)] / 𝔼ℙ[Π(𝑝)] = min_𝑝 max_𝑧 max_ℙ  𝔼ℙ[Π(𝑧)] / 𝔼ℙ[Π(𝑝)],

where we swap the maximization operators to obtain the final equality. As in the proof of Theorem 4 in
[15], it is imperative to solve the semi-infinite optimization problem
        max_{ℙ∈P(𝜇,𝜎)}   𝔼ℙ[𝑧1{𝑋⩾𝑧}] / (𝑝ℙ(𝑋 ⩾ 𝑝)).        (21)

Notice that this problem effectively models the conditional expectation 𝔼[(𝑧/𝑝)1{𝑋⩾𝑧} | 𝑋 ∈ Ξ] with the
random event Ξ = {𝑋 ⩾ 𝑝}. As Chen et al. [15] try to optimize the dual problem directly, their proof
requires several lengthy, tedious arguments to obtain the tight bounds. We will simplify their proof using
primal-dual arguments as in the previous subsection. We assume that 𝑝 < 𝜇, as otherwise, it is possible
to construct an extremal distribution for which the relative regret ratio diverges; see [24]. For 𝑧 < 𝑝, the
expectation in the numerator will evaluate to 1. Hence,
        max_{ℙ∈P(𝜇,𝜎)}   𝔼ℙ[(𝑧/𝑝)1{𝑋⩾𝑧}] / ℙ(𝑋 ⩾ 𝑝) = (𝑧/𝑝) max_{ℙ∈P(𝜇,𝜎)}  1/ℙ(𝑋 ⩾ 𝑝) = (𝑧/𝑝) ⋅ 1 / ( min_{ℙ∈P(𝜇,𝜎)} ℙ(𝑋 ⩾ 𝑝) ).

To bound the latter, we can use the one-sided version of Chebyshev’s inequality (commonly known as
Cantelli’s inequality). It then follows that
        max_{ℙ∈P(𝜇,𝜎)}   𝔼ℙ[(𝑧/𝑝)1{𝑋⩾𝑧}] / ℙ(𝑋 ⩾ 𝑝) = (𝑧/𝑝) ⋅ (𝜎² + (𝜇 − 𝑝)²)/(𝜇 − 𝑝)².
Hence,

        max_{𝑧⩽𝑝} max_ℙ   𝔼ℙ[𝑧1{𝑋⩾𝑧}] / (𝑝ℙ(𝑋 ⩾ 𝑝)) = (𝜎² + (𝜇 − 𝑝)²)/(𝜇 − 𝑝)².
For 𝑧 ⩾ 𝑝, it holds that

        𝔼ℙ[(𝑧/𝑝)1{𝑋⩾𝑧}] / ℙ(𝑋 ⩾ 𝑝) = (𝑧/𝑝) ℙ(𝑋 ⩾ 𝑧)/ℙ(𝑋 ⩾ 𝑝) = (𝑧/𝑝) ℙ(𝑋 ⩾ 𝑧 ∩ 𝑋 ⩾ 𝑝)/ℙ(𝑋 ⩾ 𝑝) = (𝑧/𝑝) ℙ(𝑋 ⩾ 𝑧 | 𝑋 ⩾ 𝑝),

and thus, we require a bound for conditional probabilities. Using primal-dual arguments, we obtain the
following result.

Proposition 6. Suppose that 𝑧 ⩾ 𝑝 and 𝑝 < 𝜇. For a nonnegative random variable 𝑋 with distribution
ℙ ∈ P(𝜇,𝜎), it holds that

        sup_{ℙ∈P(𝜇,𝜎)} ℙ(𝑋 ⩾ 𝑧 | 𝑋 ⩾ 𝑝) =
                𝜎² / ((𝑧 − 𝜇)² + 𝜎²),  for 𝑧 ⩾ 𝜇 + 2𝜎²(𝜇 − 𝑝)/((𝜇 − 𝑝)² − 𝜎²),
                ( (𝜎² + (𝜇 − 𝑝)²) / (2𝑧(𝜇 − 𝑝) − 𝜎² − 𝜇² + 𝑝²) )²,  for (𝜎² + 𝜇(𝜇 − 𝑝))/(𝜇 − 𝑝) ⩽ 𝑧 ⩽ 𝜇 + 2𝜎²(𝜇 − 𝑝)/((𝜇 − 𝑝)² − 𝜎²),        (22)
                1,  otherwise.

Using these bounds, one can show that

        max_{𝑧⩾𝑝} sup_{ℙ∈P(𝜇,𝜎)}   𝔼ℙ[𝑧1{𝑋⩾𝑧}] / (𝑝ℙ(𝑋 ⩾ 𝑝)) = (𝜎² + 𝜇(𝜇 − 𝑝)) / (𝑝(𝜇 − 𝑝)).

Hence, combining the above results, the optimal price should solve

        min_{𝑝<𝜇} max{ 𝜎²/(𝑝(𝜇 − 𝑝)) + 𝜇/𝑝,   𝜎²/(𝜇 − 𝑝)² + 1 }.

It is shown by [24] that the optimal price is the value of 𝑝 on which both branches of the max operator
agree. Observe that by using primal-dual arguments, we have greatly reduced the number of necessary
calculations to obtain the desired result. Contrary to [15], we work with the primal and dual problems
concurrently so as to verify the optimality of the suggested solutions in a more effective manner. The
proposed duality approach likely extends to other pricing problems with fractional objectives, such as
the personalized pricing setting in [22].
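As an aside (our own numerical illustration, again assuming the Roots.jl package), the equalizing price can be computed with a one-line root search on the difference of the two branches:

    using Roots

    μ, σ = 1.0, 0.3
    branch1(p) = σ^2 / (p*(μ - p)) + μ/p    # from the case z ⩾ p
    branch2(p) = σ^2 / (μ - p)^2 + 1        # from the case z ⩽ p
    # branch1 dominates as p → 0⁺ and branch2 dominates as p → μ⁻,
    # so the difference changes sign on (0, μ) and the bracket below is valid
    p_star = find_zero(p -> branch1(p) - branch2(p), (1e-6, μ - 1e-6))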

3.3. Numerical bounds


We next show how the strong dual problem (7) can be reduced to a semidefinite programming problem for
the univariate setting. Consequently, the tight bounds can always be obtained numerically as the solution
to a (computationally tractable) optimization problem. The numerical experiments have been conducted
in the Julia programming language, using the MOSEK solver [43] together with the Julia packages SumOf-
Squares.jl and PolyJuMP.jl [65].

Assume the objective function 𝑔(𝑥) is piecewise polynomial and the moment constraints are described
by the traditional power moments. Then the dual problem can be reduced further to a semidefinite pro-
gramming problem by applying standard DRO techniques discussed in, for example, [8, 9, 50]. In particular,
the univariate moment problem reduces to solving a semidefinite program, provided that the dual-feasible
set is semi-algebraic; that is, the dual constraint involves checking whether certain polynomial functions
are nonnegative on intervals described by the event set Ξ and the support Ω. It is well known that a uni-
variate polynomial is nonnegative if, and only if, it is a sum of squares. A classic result then states that
the semi-infinite constraint in the dual-feasible set of (7), with the support Ω a possibly infinite interval,
can be reduced to a set of linear matrix inequalities (LMIs) of polynomial size in the number of moments
𝑚 (see, e.g., [8, 9, 44]). Since 𝑔(𝑥) is piecewise polynomial, the support of the dual constraints in (7) can
be subdivided into subintervals so that these constraints can be reduced to a set of LMIs. Generalized
moment information can be included if these moments are described by piecewise polynomials in the dual
problem. Examples of piecewise polynomial objective functions are the indicator function 𝑔(𝑥) = 1[𝑐,∞) (𝑥)
and the stop-loss function 𝑔(𝑥) = max{𝑥 − 𝑐, 0}. Both of these functions have several relevant applications
in, e.g., finance, insurance, and inventory control.
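To make this concrete, here is a minimal SumOfSquares.jl sketch (our own illustration; it assumes a working MOSEK license and solves the mean-variance dual (13) for 𝑔(𝑥) = 𝑥 and Ξ = {𝑋 ⩾ 𝑡}):

    using SumOfSquares, DynamicPolynomials, MosekTools

    μ, σ, t = 0.0, 1.0, -1.0                   # an instance with t < μ

    model = SOSModel(Mosek.Optimizer)
    @variable(model, λ[0:3])
    @polyvar x
    M = λ[0] + λ[1]*x + λ[2]*x^2               # the dual polynomial M(x)

    @constraint(model, λ[0] + λ[1]*μ + λ[2]*(σ^2 + μ^2) <= 0)
    @constraint(model, M >= 0, domain = @set t - x >= 0)               # branch x ⩽ t
    @constraint(model, M - (x - λ[3]) >= 0, domain = @set x - t >= 0)  # branch x ⩾ t
    @objective(model, Min, λ[3])
    optimize!(model)
    println(objective_value(model))            # ≈ μ + σ²/(μ − t) = 1.0 here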
In Figure 2 we provide numerical bounds for the conditional probability ℙ(𝑋 ⩾ 𝑐 | 𝑋 ⩾ 𝑡), which
corresponds to the piecewise function 𝑔(𝑥) = 1[𝑐,∞) (𝑥). We assume that the ground truth is given by a
uniform distribution with support [0, 5]. We determine the bounds for three types of ambiguity sets, in
which the number of available moments varies between 𝑚 = 2, 4, and 6. Obviously, the bounds become
sharper when more moment information is included, but in addition, the uninformative solution becomes
less prominent because the size of the ambiguity set reduces. Nevertheless, the uninformative solution
becomes more apparent again as the size of Ξ reduces, as already noted in Section 3.1.
Even if the ambiguity sets are augmented with structural properties, the resulting dual problem can
still be reduced to a semidefinite optimization problem. Consider the setting in which P0 contains only
symmetric distributions. Using the generator class consisting of symmetric pairs of Dirac measures, the
dual problem (9) reduces to

        inf_{𝜆0,…,𝜆𝑚+1}   𝜆𝑚+1
        s.t.              ∑_{𝑗=0}^{𝑚} 𝜆𝑗𝑞𝑗 ⩽ 0,        (23)
                          ∑_{𝑗=0}^{𝑚+1} 𝜆𝑗 (ℎ𝑗(𝜇 − 𝑥) + ℎ𝑗(𝜇 + 𝑥)) ⩾ 𝑔(𝜇 − 𝑥)1Ξ(𝜇 − 𝑥) + 𝑔(𝜇 + 𝑥)1Ξ(𝜇 + 𝑥),  ∀𝑥 ⩾ 0.

[Figure 2: Tight bounds for the conditional tail probability ℙ(𝑋 ⩾ 𝑐 | 𝑋 ⩾ 𝑡) for ambiguity sets matching the uniform distribution on [0, 5]; panels (a) Ξ = {𝑋 ⩾ 0.25}, (b) Ξ = {𝑋 ⩾ 0.5}, (c) Ξ = {𝑋 ⩾ 1.25}, each showing the bounds for 𝑚 = 2, 4, 6 moments against the Uniform(0, 5) ground truth.]

Likewise, if we consider symmetric, unimodal distributions (which can be generated by the convex
combination of a Dirac measure 𝛿𝜇 and rectangular/uniform distributions, i.e., 𝛿[𝜇−𝑧,𝜇+𝑧], 𝑧 > 0), the dual
problem becomes

        inf_{𝜆0,…,𝜆𝑚+1}   𝜆𝑚+1
        s.t.              ∑_{𝑗=0}^{𝑚} 𝜆𝑗𝑞𝑗 ⩽ 0,
                          ∑_{𝑗=0}^{𝑚+1} 𝜆𝑗 ∫_{𝜇−𝑥}^{𝜇+𝑥} ℎ𝑗(𝑧) d𝑧 ⩾ ∫_{𝜇−𝑥}^{𝜇+𝑥} 𝑔(𝑧)1Ξ(𝑧) d𝑧,  ∀𝑥 ⩾ 0,        (24)
                          ∑_{𝑗=0}^{𝑚+1} 𝜆𝑗 ℎ𝑗(𝜇) ⩾ 𝑔(𝜇)1Ξ(𝜇).

Observe that the resulting integral transforms are still piecewise polynomial in 𝑥. As a consequence, we
can reformulate both (23) and (24) as semidefinite programming problems.
[Figure 3: Tight bounds for the conditional expectation 𝔼ℙ[𝑋 | 𝑋 ⩾ 𝑡] for ambiguity sets matching the moments and properties of the standard normal distribution; panels (a) 𝑚 = 2, (b) 𝑚 = 4, (c) 𝑚 = 6, each comparing moment-based, symmetric, and symmetric-unimodal bounds against the Normal(0, 1) ground truth.]

Figure 3 shows the results for different structural assumptions. Notice that the addition of structural
information significantly sharpens the bounds, independently of the available moment information. How-
ever, even though the bounds are sharpened with additional information, the bounds still diverge for
𝑡 ⩾ 𝜇 when only mean-variance information is available. Still, it is obvious from the figures that this
conservatism of the bounds can be mitigated significantly by adding additional moment information.

4. Distributionally robust optimization with side information


In this section, we lay the groundwork for a generalized moment-based framework for contextual dis-
tributionally robust stochastic optimization. Section 4.1 introduces moment-based contextual DRO and a
class of ambiguity sets that lead to tractable problems. In Section 4.2, we provide examples of these am-
biguity sets for which computationally tractable conic optimization problems can be derived. Finally, in
Section 4.3, we present some numerical results for a two-dimensional example.

4.1. Contextual DRO with mean-dispersion information

Given the side information in the form of the event 𝐗 ∈ Ξ, contextual DRO problems can be stated in
general as

        inf_{ν∈V} sup_{ℙ∈P} 𝔼ℙ[𝑓(ν, 𝐗) | 𝐗 ∈ Ξ],        (25)

where 𝑓 is some cost function to be minimized, and ν ∈ V denotes the decision vector with V ⊆ ℝ𝑝 a
closed, convex set. The probability distribution ℙ is the joint measure governing 𝐗. Let 𝐘 ∈ Y ⊆ ℝ𝑛𝐲 be a
random vector that models the outcome variables that affect the decision problem directly, and let 𝐙 ∈ Z ⊆
ℝ𝑛𝐳 be the covariates (or features) that influence the outcome random variables. Assuming that the supports
of 𝐘 and 𝐙 are independent, let 𝐗 = (𝐘, 𝐙) ∈ Y × Z =∶ X. Henceforth the boldface lowercase characters
represent the realizations of the random vectors. Furthermore, the expectation 𝔼ℙ[𝑓(ν, 𝐗) | 𝐗 ∈ Ξ] in (25)
is conditioned primarily on the information given by the covariates 𝐙, with Ξ𝐳 ⊆ Z the information set
built from the information on the covariates 𝐙. Therefore, the side information is described by Ξ ∶= {𝐱 =
(𝐲, 𝐳) ∈ X ∶ 𝐳 ∈ Ξ𝐳 }. This includes the case in which Ξ𝐳 is represented by a singleton, which models a
particular realization of the covariates. No conditional information is normally included about the outcome
variables. Hence, in the remainder of the section, we occasionally use the notation 𝔼[𝑓 (ν, 𝐗) | 𝐙 ∈ Ξ𝐳 ] for
the conditional expectation given the side information.
We next introduce some additional technical notation tailored to this section of the paper. Boldfaced
lowercase characters represent vectors, where the italic character 𝑥𝑘 denotes the 𝑘th element of the vector
𝐱 and 𝐱⊤ denotes its transpose. Except for the random vectors described above, all boldface uppercase
characters represent matrices. For a set S, let conv(S), cl(S) and int(S) denote its convex hull, closure
and interior, respectively. For a proper cone K ⊆ ℝ𝑛 (i.e., a closed, convex and pointed cone with nonempty
interior), the general inequality 𝐱 ≼K 𝐮 is equivalent to the set constraint 𝐮−𝐱 ∈ K, while the strict variant
𝐱 ≺K 𝐮 expresses that 𝐮 − 𝐱 ∈ int(K). A function 𝐡 ∶ ℝ𝑛 ↦ ℝ𝑚 is called K-convex if 𝐡(𝜃𝐱1 + (1 − 𝜃)𝐱2 ) ≼K
𝜃𝐡(𝐱1 ) + (1 − 𝜃)𝐡(𝐱2 ) for all 𝐱1 , 𝐱2 and 𝜃 ∈ [0, 1]. We use K ∗ to denote the dual cone of K, given by
K ∗ = {𝐮 ∶ 𝐮⊤ 𝐱 ⩾ 0, ∀𝐱 ∈ K} with 𝐮⊤ 𝐱 the appropriate inner product. The set 𝕊𝑛+ represents the cone of
symmetric positive semidefinite matrices in ℝ𝑛×𝑛 . Finally, for matrices 𝐀, 𝐁, we use 𝐀 ≼ 𝐁 to abbreviate
the relation 𝐀 ≼𝕊𝑛+ 𝐁.
In order to derive solvable reformulations of (25), we shall impose the following conditions:
(C1) The side information Ξ is a closed, convex set with inf ℙ∈P ℙ(𝑋 ∈ Ξ) > 0, and the support set X = ℝ𝑛 .

(C2) The dispersion is modelled by (D-)convex epigraph functions (see [29, 66]).

(C3) The function 𝑓 (ν, 𝐱) can be represented as

        𝑓(ν, 𝐱) = max_{𝑙∈L} {𝑓𝑙(ν, 𝐱)},

in which L is a set of indices and the auxiliary functions 𝑓𝑙 are of the form

        𝑓𝑙(ν, 𝐱) = 𝐬𝑙(ν)⊤𝐱 + 𝑡𝑙(ν),

where for all 𝑙 ∈ L, 𝐬𝑙(⋅) ∈ ℝ𝑛 and 𝑡𝑙(⋅) ∈ ℝ are some affine functions of ν.

(C4) The ambiguity set P satisfies the Slater condition.

The rationale behind the first part of condition (C1) is twofold: (i) it enables the use of robust optimization
methods to reformulate the model, and (ii) it guarantees that the side event has nonzero measure. There
is no necessity for the second part of the condition (closedness and convexity would suffice), as it merely
serves to facilitate the mathematical exposition in this section.
The second condition states that the dispersion function has an epigraph that can be described through
convex cones. We denote the dispersion function by 𝐝 ∶ ℝ𝑛 ↦ ℝ𝑚 , and assume it admits a D-epigraph that
is conic representable, with D ⊆ ℝ𝑚 a proper cone, meaning that the set {(𝐱, 𝐮) ∈ ℝ𝑛 × ℝ𝑚 ∶ 𝐝(𝐱) ≼D 𝐮}
can be described with conic inequalities, possibly using cones other than D and auxiliary variables. See
[6] for a comprehensive introduction to conic representations. The epigraphic mean-dispersion ambiguity
set can now be defined as

P(µ,σ) = {ℙ ∈ P0 (X) ∶ 𝔼ℙ [𝐗] = µ, 𝔼ℙ [𝐝(𝐗)] ≼D σ}, (26)

where P0 is the set of probability distributions with support X, the vector µ ∈ ℝ𝑛 represents the mean
value of the random vector 𝐗, and σ ∈ ℝ𝑚 is an upper bound on the expected value of the dispersion
measure 𝔼[𝐝(𝐗)]. Although the mean-dispersion ambiguity set P(µ,σ) may seem simple, there are numerous
practically relevant ambiguity sets that can be recovered by selecting appropriate dispersion functions 𝐝(⋅).
For example, setting 𝐝 ≡ 𝟎 yields the mean-support ambiguity set, while setting 𝐝(𝐗) = (𝐗−µ)(𝐗−µ)⊤
enables modeling the correlation structure between the elements of the random vector 𝐗. Other (D-)convex
dispersion measures include the Huber loss function, mean absolute deviations, any norm ‖⋅‖ on ℝ𝑛, and
the other convex dispersion measures that were applicable to Proposition 3. We will elaborate on some of
these dispersion measures in the next subsection.
The third condition states that the objective function 𝑓 is a convex, piecewise-affine function of the
uncertainty 𝐱, for all ν ∈ V, and of the decision vector ν, for all 𝐱 ∈ X. We focus on piecewise-affine
objective functions as these are more than sufficient to capture several interesting models. For example,
they can capture max operators as in the newsvendor model, as well as the Conditional Value-at-Risk,
which is frequently used to optimize financial portfolios for risk-averse investors. The requirement on
the objective function is not strictly necessary and can be relaxed to a much richer class of functions
𝑓(⋅, ⋅) that are convex in ν for all 𝐱, and concave in 𝐱 for all admissible ν. For a detailed discussion on
computational tractability, we refer the interested reader to [66].
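To illustrate the CVaR case (a standard identity, added here for concreteness, not taken from the original text): appending a threshold variable 𝜃 to the decision vector ν, the Conditional Value-at-Risk at level 𝛼 ∈ (0, 1) of a loss ℓ(ν, 𝐱) is obtained by minimizing over 𝜃 the function

𝜃 + (1/(1 − 𝛼)) max{ℓ(ν, 𝐱) − 𝜃, 0} = max{ 𝜃, 𝜃 + (ℓ(ν, 𝐱) − 𝜃)/(1 − 𝛼) },

which is the maximum of two pieces that are affine in 𝐱 and in (ν, 𝜃) whenever ℓ is affine in both arguments, so condition (C3) applies with the index set L enlarged accordingly.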

The dual problem of the inner maximization problem of (25), in the multivariate case, is given by

inf 𝜆3
𝜆0 ,λ1 ,λ2 ,𝜆3

s.t. 𝜆0 + λ⊤1 µ + λ⊤2 σ ⩽ 0, (27)

𝜆0 + λ⊤1 𝐱 + λ⊤2 𝐝(𝐱) ⩾ (𝑓 (ν, 𝐱) − 𝜆3 )1Ξ (𝐱), ∀𝐱 ∈ X,

with 𝜆0 , 𝜆3 ∈ ℝ, λ1 ∈ ℝ𝑛 and λ2 ∈ D∗ . We assume that condition (C4) holds so that supℙ∈P 𝔼ℙ [𝑓 (ν, 𝐗) | 𝐗 ∈
Ξ] is strongly dual to (27). To be more specific, the Slater condition is given by µ ∈ int(X) and 𝐝(µ) ≺D σ.
The semi-infinite constraint in (27) can be reformulated using standard robust optimization methods. This
yields the following result.

Theorem 3 (Contextual DRO with mean-dispersion information). Let ℙ be a member of the mean-
dispersion ambiguity set P(µ,σ) . If conditions (C1)–(C4) hold, then the objective value of the contextual DRO
problem (25) coincides with the optimal value of the semi-infinite LP

inf_{ν,𝜆0,λ1,λ2,𝜆3} 𝜆3
s.t. 𝜆0 + λ1⊤µ + λ2⊤σ ⩽ 0,
𝜆0 + λ1⊤𝐱 + λ2⊤𝐮 ⩾ 0, ∀(𝐱, 𝐮) ∈ C̄,    (28)
𝜆0 + λ1⊤𝐱 + λ2⊤𝐮 + 𝜆3 ⩾ 𝐬𝑙(ν)⊤𝐱 + 𝑡𝑙(ν), ∀(𝐱, 𝐮) ∈ C, ∀𝑙 ∈ L,
ν ∈ V, 𝜆0 ∈ ℝ, λ1 ∈ ℝ𝑛, λ2 ∈ D∗, 𝜆3 ∈ ℝ.

Here,

C ∶= {(𝐱, 𝐮) ∈ ℝ𝑛 × ℝ𝑚 ∶ 𝐱 ∈ Ξ, 𝐝(𝐱) ≼D 𝐮},
C̄ ∶= conv({(𝐱, 𝐮) ∈ ℝ𝑛 × ℝ𝑚 ∶ 𝐱 ∈ Ξ̄, 𝐝(𝐱) = 𝐮}),    (29)

in which Ξ̄ ∶= cl(ℝ𝑛∖Ξ). Moreover, under additional regularity conditions, problem (28) admits a reformulation as a finite-dimensional conic optimization problem.

Proof. The dual problem of the inner maximization problem is given by

inf 𝜆3
𝜆0 ,λ1 ,λ2 ,𝜆3

subject to 𝜆0 + λ⊤1 µ + λ⊤2 σ ⩽ 0, (30)

𝜆0 + λ⊤1 𝐱 + λ⊤2 𝐝(𝐱) ⩾ 1Ξ (𝐱)(𝑓 (ν, 𝐱) − 𝜆3 ), ∀𝐱 ∈ ℝ𝑛 .

By decomposing the semi-infinite constraints using the definition of the indicator function, we obtain two
semi-infinite constraints,

𝜆0 + λ⊤1 𝐱 + λ⊤2 𝐝(𝐱) ⩾ 0, ∀𝐱 ∈ ℝ𝑛 \Ξ,


𝜆0 + λ⊤1 𝐱 + λ⊤2 𝐝(𝐱) + 𝜆3 ⩾ 𝑓 (ν, 𝐱), ∀𝐱 ∈ Ξ,
(31)

in which ℝ𝑛∖Ξ is the complement of Ξ. Since λ2 ∈ D∗ and 𝐝(𝐱) is D-convex by assumption, it holds that
𝜆0 + λ1⊤𝐱 + λ2⊤𝐝(𝐱) is a convex function of 𝐱 (see, e.g., [12, p. 110]) and, a fortiori, continuous in 𝐱. From
a standard continuity argument, it then follows that we are allowed to replace the complement with its
closure, Ξ̄. Since 𝑓(ν, 𝐱) is a convex, piecewise-affine function by condition (C3), (31) is equivalent to

𝜆0 + λ1⊤𝐱 + λ2⊤𝐝(𝐱) ⩾ 0, ∀𝐱 ∈ Ξ̄,
𝜆0 + λ1⊤𝐱 + λ2⊤𝐝(𝐱) + 𝜆3 ⩾ 𝑓𝑙(ν, 𝐱), ∀𝐱 ∈ Ξ, ∀𝑙 ∈ L.

Then, by lifting the nonlinearity in the uncertainty to the uncertainty set, we obtain the robust counterparts

𝜆0 + λ1⊤𝐱 + λ2⊤𝐮 ⩾ 0, ∀(𝐱, 𝐮) ∶ 𝐱 ∈ Ξ̄, 𝐝(𝐱) = 𝐮,


𝜆0 + λ⊤1 𝐱 + λ⊤2 𝐮 + 𝜆3 ⩾ 𝑓𝑙 (ν, 𝐱), ∀(𝐱, 𝐮) ∶ 𝐱 ∈ Ξ, 𝐝(𝐱) = 𝐮, ∀𝑙 ∈ L.
(32)

As the constraints are linear in the uncertain parameters, we can equivalently use the convex hull of the
uncertainty sets

𝜆0 + λ1⊤𝐱 + λ2⊤𝐮 ⩾ 0, ∀(𝐱, 𝐮) ∈ conv({(𝐱, 𝐮) ∶ 𝐱 ∈ Ξ̄, 𝐝(𝐱) = 𝐮}),


𝜆0 + λ⊤1 𝐱 + λ⊤2 𝐮 + 𝜆3 ⩾ 𝑓𝑙 (ν, 𝐱), ∀(𝐱, 𝐮) ∈ conv({(𝐱, 𝐮) ∶ 𝐱 ∈ Ξ, 𝐝(𝐱) = 𝐮}), ∀𝑙 ∈ L,
(33)

From conditions (C1) and (C2), it follows that the convex hull of the uncertainty set for the second set
of robust counterparts is equivalent to C. It is important to note that the convex hull C̄ associated with
the first set of semi-infinite constraints is, by construction, also a convex set. As the robust counterparts
in (33) constitute an (infinite) intersection of halfspaces with respect to the dual variables 𝜆0, λ1, λ2, 𝜆3
and the decision vector ν, it is deduced that the feasible set is convex and (28) is a convex optimization
problem. This in turn implies that problem (28) can be phrased as a conic optimization problem, provided
the convex sets C and C̄ allow for tractable conic reformulations that meet a Slater condition. To
substantiate the second claim, in Appendix B we recast (28) as a finite-dimensional conic program under
these supplementary conditions.
This completes the proof.

The difficulty now lies in reformulating the semi-infinite constraints (or robust counterparts) in the
dual problem by constructing explicit expressions for the convex hulls C and C̄. To address this, robust
optimization techniques for nonlinear types of uncertainty can be used; see, for example, [3, 67] for further
details. It may be insightful to note here that optimization over C is generally computationally tractable,
provided that the event set and epigraph are (tractable) conic representable, but the reformulation of the
semi-infinite constraint that involves the complement typically is not, since it entails optimizing a convex
function over a nonconvex set. To simplify the problem, it is reasonable to consider only polyhedral
event sets Ξ. This assumption, although sacrificing generality, enhances the tractability of the first semi-
infinite constraint. Using De Morgan's laws, the semi-infinite constraint involving Ξ̄ can be decomposed
into optimization over halfspaces, as illustrated below. We consider such an event set in the next subsection. It also seems
noteworthy to mention here that the distribution-free analysis of (28) shares many similarities with the

literature based on uncertainty quantification [28, 29], and distributionally robust convex optimization
[66]. However, in contrast to Theorem 5 in [66], we apply a lifting argument when solving the dual problem,
rather than during the construction of the ambiguity set. In the next subsection, we show that Theorem 3 leads
to computationally tractable models for appropriate choices of P and Ξ.
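To illustrate the De Morgan decomposition mentioned above, suppose Ξ = {𝐱 ∶ 𝐚𝑖⊤𝐱 ⩽ 𝑏𝑖, 𝑖 = 1, … , 𝑘} is a polyhedron with each 𝐚𝑖 ≠ 𝟎. Then

Ξ̄ = cl(ℝ𝑛∖Ξ) = ⋃_{𝑖=1}^{𝑘} {𝐱 ∶ 𝐚𝑖⊤𝐱 ⩾ 𝑏𝑖},

so the single semi-infinite constraint over Ξ̄ splits into 𝑘 robust counterparts, each posed over a halfspace and hence amenable to the same duality-based reformulations as the constraint over Ξ itself.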

4.2. Some examples for mean-dispersion information


For the sake of exposition, we limit our attention to two types of ambiguity sets, which are analogous to
Propositions 1 and 2 in Section 3.1. Further, for the sake of simplicity, we assume that the
event set is defined by a halfspace; that is,

Ξ𝐳 = {𝐳 ∈ ℝ𝑛𝐳 ∶ 𝐜⊤𝐳 ⩽ 𝑐̄},

with 𝐜 ∈ ℝ𝑛𝐳 and 𝑐̄ ∈ ℝ. Since Y = ℝ𝑛𝐲 , Ξ is unrestricted in the outcome space. We have chosen this
specific setup so that, in the remainder of this section, we can obtain computationally tractable conic
reformulations. In this context, “computationally tractable” means that our problems can be formulated
as linear, conic-quadratic or, to a lesser degree, semidefinite programs so that we are able to use mature,
off-the-shelf solvers for conic optimization. The derivations of these conic programs are provided in the
Appendix.
We first construct a Chebyshev-type ambiguity set [19, 64], which allows us to impose conditions on
the covariance matrix of the random vector 𝐗. Let 𝔼[𝐗] = µ denote the mean vector, and define the
dispersion measure as 𝐝(𝐱) = (𝐱 − µ)(𝐱 − µ)⊤ . We identify D with the cone of positive semidefinite
matrices. The Chebyshev ambiguity set then consists of all distributions with mean µ ∈ ℝ𝑛 and covariance
matrix bounded above by Σ ∈ 𝕊𝑛+ . It can be defined as
P(µ,Σ) = {ℙ ∈ P0(ℝ𝑛) ∶ 𝔼ℙ[𝐗] = µ, 𝔼ℙ[(𝐗 − µ)(𝐗 − µ)⊤] ≼ Σ}.

When we consider this ambiguity set in conjunction with the half-space event set, the constraints in (28)
can be described by LMIs. Hence, Theorem 3 yields the following result.

Corollary 1 (Chebyshev ambiguity set). Suppose conditions (C1)–(C4) are satisfied. Let Ξ𝐳 be defined
by a halfspace. Then, for P = P(µ,Σ) , the contextual DRO problem (28) can be reformulated as a semidefinite
optimization problem.
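To make Corollary 1 concrete, the sketch below assembles the resulting semidefinite program for the newsvendor instance of Section 4.3, using the LMIs derived in Appendix B. It is a hypothetical illustration rather than the paper's implementation: the cvxpy/SCS tooling, the instance data, and the centered dual quadratic 𝜆0 + λ1⊤𝐱 + (𝐱 − µ)⊤𝚲(𝐱 − µ), which matches the dispersion function 𝐝(𝐱) = (𝐗 − µ)(𝐗 − µ)⊤, are our assumptions.

```python
import cvxpy as cp
import numpy as np

# Chebyshev contextual DRO for the newsvendor, x = (D, Z); hypothetical instance.
h, p, z0, rho = 1.0, 5.0, 1.0, 0.95
mu = np.array([5.0, 5.0])
Sigma = np.array([[2.25, rho * 1.5], [rho * 1.5, 1.0]])
c, cbar = np.array([0.0, -1.0]), -z0      # event Z >= z0 encoded as c^T x <= cbar

q = cp.Variable(nonneg=True)              # order quantity (the decision nu)
lam0, lam3 = cp.Variable(), cp.Variable()
lam1 = cp.Variable(2)
Lam = cp.Variable((2, 2), PSD=True)       # multiplier of E[(X - mu)(X - mu)^T] <= Sigma
tau = cp.Variable(nonneg=True)            # S-lemma multipliers
chi = cp.Variable(2, nonneg=True)

s = [np.array([-h, 0.0]), np.array([p, 0.0])]   # C(q, D) = max{h(q - D), p(D - q)}
t = [h * q, -p * q]

def lmi(alpha, vec):
    # certificate that alpha + vec^T x + x^T Lam x >= 0 for all x in R^2
    return cp.bmat([[cp.reshape(alpha, (1, 1)), cp.reshape(vec / 2, (1, 2))],
                    [cp.reshape(vec / 2, (2, 1)), Lam]]) >> 0

# expand (x - mu)^T Lam (x - mu) = x^T Lam x - (2 Lam mu)^T x + mu^T Lam mu
shift, off = mu @ Lam @ mu, 2 * (Lam @ mu)

cons = [lam0 + lam1 @ mu + cp.trace(Lam @ Sigma) <= 0,
        # dual function nonnegative on the complement {c^T x >= cbar}
        lmi(lam0 + shift + tau * cbar, lam1 - off - tau * c)]
for l in range(2):
    # dual function plus lam3 majorizes each affine piece on the event {c^T x <= cbar}
    cons.append(lmi(lam0 + shift + lam3 - t[l] - chi[l] * cbar,
                    lam1 - off - s[l] + chi[l] * c))

prob = cp.Problem(cp.Minimize(lam3), cons)
prob.solve(solver=cp.SCS)
print("order quantity:", q.value, "worst-case conditional cost:", lam3.value)
```

Minimizing 𝜆3 jointly over the dual variables and 𝑞 returns both the distributionally robust order quantity and the worst-case conditional cost, in line with Theorem 3.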

Alternatively, the MAD can be used as dispersion measure [52]. Let 𝐦 ∈ ℝ𝑛 represent some center point,
in our case the mean. Assume that we have bounds for the componentwise mean deviations 𝔼[|𝐗 − 𝐦|]
and the pairwise mean deviations 𝔼[|(𝑋𝑖 ± 𝑋𝑗) − (𝑚𝑖 ± 𝑚𝑗)|], which are given by 𝐟 ∈ ℝ𝑛². This information
results in the ambiguity set

P(𝐦,𝐟) = {ℙ ∈ P0(ℝ𝑛) ∶ 𝔼ℙ[𝐗] = 𝐦, 𝔼[|𝑋𝑖 − 𝑚𝑖|] ⩽ 𝑓𝑖,𝑖, ∀𝑖, 𝔼[|(𝑋𝑖 ± 𝑋𝑗) − (𝑚𝑖 ± 𝑚𝑗)|] ⩽ 𝑓𝑖,𝑗, ∀𝑖 ≠ 𝑗}.

For this ambiguity set, the constraints in (28) are representable as linear inequalities. Therefore, Theorem 3
leads to the following result.

Corollary 2 (MAD ambiguity set). Suppose that conditions (C1)–(C4) hold. Let Ξ𝐳 be defined by a half-
space. Then, for P = P(𝐦,𝐟) , the contextual DRO problem (28) can be reformulated as a linear optimization
problem.
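Similarly, the following is a minimal sketch of the linear program behind Corollary 2 for a hypothetical two-dimensional newsvendor instance; the MAD bounds in 𝐟, the solver, and the instance data are our assumptions. The two dualized blocks mirror the LP-duality step carried out in Appendix B.

```python
import cvxpy as cp
import numpy as np

# MAD instance of Theorem 3 for a two-dimensional newsvendor, x = (D, Z).
h, p, z0 = 1.0, 5.0, 1.0
m = np.array([5.0, 5.0])
# rows of Dmat give the affine pieces of d(x) = m0 + Dmat x:
# X1 - m1, X2 - m2, (X1 + X2) - (m1 + m2), (X1 - X2) - (m1 - m2)
Dmat = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
m0 = -Dmat @ m
f = np.array([1.2, 0.8, 1.8, 1.8])        # assumed MAD bounds (the vector f)
c, cbar = np.array([0.0, -1.0]), -z0       # event Z >= z0 as c^T x <= cbar

q = cp.Variable(nonneg=True)
lam0, lam3 = cp.Variable(), cp.Variable()
lam1, lam2 = cp.Variable(2), cp.Variable(4, nonneg=True)
s = [np.array([-h, 0.0]), np.array([p, 0.0])]   # C(q, D) = max{h(q - D), p(D - q)}
t = [h * q, -p * q]

cons = [lam0 + lam1 @ m + lam2 @ f <= 0]

# first counterpart, over the complement {c^T x >= cbar}, dualized as in Appendix B
tau = cp.Variable(nonneg=True)
chip, chim = cp.Variable(4, nonneg=True), cp.Variable(4, nonneg=True)
cons += [lam0 + cbar * tau + (chip - chim) @ m0 >= 0,
         tau * c - Dmat.T @ chip + Dmat.T @ chim == lam1,
         chip + chim == lam2]

# second counterpart, over the event {c^T x <= cbar}, one dual block per piece l
for l in range(2):
    tl = cp.Variable(nonneg=True)
    cpl, cml = cp.Variable(4, nonneg=True), cp.Variable(4, nonneg=True)
    cons += [lam0 + lam3 - t[l] - cbar * tl + (cpl - cml) @ m0 >= 0,
             -tl * c - Dmat.T @ cpl + Dmat.T @ cml == lam1 - s[l],
             cpl + cml == lam2]

prob = cp.Problem(cp.Minimize(lam3), cons)
prob.solve()
print("order quantity:", q.value, "worst-case conditional cost:", lam3.value)
```

Because every constraint is linear in the decision and dual variables, the model remains an LP regardless of the dimension of 𝐱, which is what Corollary 2 asserts.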

The precise mathematical models are relegated to Appendix B. Although we have obtained computation-
ally tractable models, it is not immediately clear how the ambiguity sets and side information interact, or
under which conditions the contextual DRO problem reduces to its robust counterpart, inf ν sup𝐲 𝑓 (ν, 𝐲),
as discussed in Section 3. Related to this, we would like to mention the broad class of nested ambiguity
sets that were introduced in [66], which encompass several distance-based ambiguity sets as special types
of generalized-moment ambiguity sets. The reason for this is that distance-based ambiguity sets can be
defined by a finite number of (conditional) expectation constraints based on generalized moments [16].
The distance-based ambiguity sets are particularly interesting as they provide an explicit way to relate
the “sizes” of ambiguity sets and event sets. This makes it possible to quantify when the contextual DRO
problem becomes “uninformative.” For an excellent discussion on the interplay between a distance-based
ambiguity set based on optimal transport and the size of the event set, see [46]. As our framework applies
to generalized moments, it is possible to extend Theorem 3 to include nested and distance-based ambiguity
sets. Furthermore, as discussed in Section 3, we can obtain tighter bounds by imposing structural prop-
erties on the base ambiguity set P0 . However, both extensions would entail delving into many technical
details. As these might detract from the main focus of this expository section, we leave them to the avid
reader.

4.3. Newsvendor example


To gain a better understanding of our contextual DRO framework, we now provide a numerical illustration
of our models using a well-known problem in OR, namely the (single-item) newsvendor model. In contrast
to the traditional model, we pose the problem in the form 𝔼ℙ [𝑓 (ν, 𝐗) | 𝐙 ∈ Ξ𝐳 ] to allow for the inclusion
of side information. For further insights, please refer to [1] and the references therein, which discuss the
data-driven newsvendor model with feature information.
Suppose that the newsvendor trades in a single product. Before observing the product demand 𝐷, the
newsvendor places an order for 𝑞 units of the product. Once the demand is realized, the newsvendor sells
all available stock. Any unsatisfied demand results in backorders incurring a penalty cost of 𝑝 per unit, and
any leftover inventory is penalized with a holding cost of ℎ per unit. The newsvendor aims to minimize
the total holding and penalty costs, as a function of the order quantity 𝑞 ⩾ 0 and the demand variable 𝐷,

𝐶(𝑞, 𝐷) = ℎ(𝑞 − 𝐷)+ + 𝑝(𝐷 − 𝑞)+


= ℎ(𝑞 − 𝐷) + (ℎ + 𝑝)(𝐷 − 𝑞)+ ,
(34)

with (𝑥)+ ∶= max{𝑥, 0}. It is evident that the cost function 𝐶 is piecewise affine, consistent with condition
(C3). The key distinction from the traditional setting lies in the newsvendor’s ability to observe a covariate

𝑍, which influences the demand variable 𝐷, prior to making the ordering decision. Hence, the contextual
DRO problem can be posed as
inf_{𝑞⩾0} sup_{ℙ∈P(µ,Σ)} 𝔼ℙ[𝐶(𝑞, 𝐷) | 𝑍 ∈ Ξ𝐳].

The primary objective is to minimize the total costs incurred by the newsvendor while incorporating the
information provided about the feature, namely, that 𝑍 ∈ Ξ𝐳 .

[Figure 4 plots the conditional cost 𝔼ℙ[𝐶(𝑞, 𝐷) | 𝑍 ∈ Ξ𝐳] against the order quantity 𝑞 ∈ [2, 10] in four panels, comparing the bound of Corollary 1, the bivariate normal ground truth, and Scarf's bound: (a) Ξ𝐳 = {𝑍 ⩾ 1}, 𝜚 = 0; (b) Ξ𝐳 = {𝑍 ⩾ 1}, 𝜚 = 0.95; (c) Ξ𝐳 = {𝑍 ⩾ 4}, 𝜚 = 0; (d) Ξ𝐳 = {𝑍 ⩾ 4}, 𝜚 = 0.95.]

Figure 4: Newsvendor cost bounds for different order quantities with dependent and independent demand

To account for the dependence between the demand variable 𝐷 and the feature 𝑍, we introduce a co-
variance matrix to construct the mean-variance ambiguity set. We assume the ground truth is given by a

28
bivariate normal distribution, characterized by its mean vector µ = (5, 5) and covariance matrix Σ. The
diagonal elements of this matrix are 𝜎1,1² = 2.25 and 𝜎2,2² = 1, while the off-diagonal elements are determined
by the correlation coefficient 𝜚, representing the dependence between the demand and the feature. We
distinguish between two levels of correlation: no correlation, which is represented by 𝜚 = 0, and high positive
correlation, which is represented by 𝜚 = 0.95. Further, we let the holding cost ℎ = 1, and the penalty cost
𝑝 = 5. As for the side information, we consider two events: {𝑍 ⩾ 1} and {𝑍 ⩾ 4}. To provide context, think
of 𝑍 as the demand for an alternative item with similar traits, which is already partially observed
during the planning period.
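Under the stated bivariate normal ground truth, the "bivariate normal" curves of Figure 4 can be approximated by simple Monte Carlo; the sample size, seed, and grid of order quantities in the sketch below are our own choices.

```python
import numpy as np

# Monte Carlo estimate of the ground-truth conditional cost E[C(q, D) | Z >= z0]
# under the bivariate normal specified above (sigma_1 = 1.5, sigma_2 = 1).
rng = np.random.default_rng(seed=1)
h, p, z0, rho = 1.0, 5.0, 1.0, 0.95
mu = np.array([5.0, 5.0])
Sigma = np.array([[2.25, rho * 1.5], [rho * 1.5, 1.0]])

samples = rng.multivariate_normal(mu, Sigma, size=500_000)
D = samples[samples[:, 1] >= z0, 0]   # keep demand draws consistent with the event

for q in (2.0, 5.0, 8.0):
    cost = h * np.maximum(q - D, 0.0) + p * np.maximum(D - q, 0.0)
    print(f"q = {q}: estimated conditional cost {cost.mean():.3f}")
```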
In Figure 4, the numerical results of the described setting are illustrated, offering several intriguing
insights. The plot includes Scarf’s bound [55] for the zero-correlation setting. Here it is important to note
that this bound is valid only for these settings. This is because if the true underlying distribution is a
bivariate normal distribution, then the demand 𝐷 and feature 𝑍 are independent when 𝜚 = 0. Therefore, a
distributionally robust bound that considers only the demand information is sufficient. For the case with
zero correlation, Scarf’s bound and our bound perform similarly when the event set is of moderate size.
However, as the size of the event set decreases, we observe that Scarf's bound outperforms our bound. The
reason behind this is that the worst-case distribution resulting from optimization over P(µ,Σ) still permits
higher orders of dependency. As a result, the value of 𝑍 continues to have a significant impact on the
bound for the conditional expectation. Furthermore, it becomes evident that the differences between the
conditional-expectation bounds and the true values primarily depend on the size of the event set, rather
than on the correlation coefficient. This is to be expected and aligns with our prior discussions on
this topic in previous sections. Finally, it is essential to keep in mind that the bounds for the conditional
expectations continue to diverge as the side information set {𝑍 ⩾ 𝑧0 } decreases in size. The only practical
way to reduce this conservatism seems to be the inclusion of more information in the ambiguity set.

5. Conclusions and outlook


Let us conclude. This paper presents a novel framework for bounding conditional expectations that, in con-
trast to generalized moment-bound problems, can explicitly incorporate side information into the semi-
infinite formulation. The key idea is to use a simple transformation to reduce the resulting semi-infinite
fractional problem to a semi-infinite LP. The corresponding dual problem closely resembles that of a
generalized moment problem, but it involves an additional constraint that models the conditioning on the
random event. Fortunately, this slight increase in complexity does not seem to significantly affect the
computational tractability of the resulting models. The generalized conditional-bound framework can be used
to obtain univariate bounds for different ambiguity sets and general objectives (for, e.g., pricing) through
the use of primal-dual arguments. Moreover, it serves as the foundation for a moment-based contextual
DRO framework that can be applied to stochastic optimization problems with side information.
We finally mention several potentially interesting avenues for further research. First, it seems of in-

terest to find more applications for the univariate bounds, such as the robust monopoly-pricing problem.
Second, alternative applications for the contextual DRO framework, discussed in Section 4, can be inves-
tigated. Moreover, it would be beneficial to expand our findings to nested ambiguity sets, as this class of
ambiguity includes the distance-based ambiguity sets, which offer a more direct way to answer questions
about when a solution becomes “uninformative,” i.e., for which instances the DRO problem reduces to a
robust optimization problem. Conducting a comprehensive complexity analysis for the nested ambiguity
sets and different types of side information and objective function structures also seems a worthwhile topic
to explore. In conclusion, our proposed framework provides a promising approach for bounding condi-
tional expectations while incorporating side information. We anticipate that the suggested directions for
future research will contribute to the development and applicability of this framework.

Acknowledgments
The author would like to thank Johan van Leeuwaarden, Pieter Kleer and Bas Verseveldt for communicating their
versions of Propositions 1, 2 and 3, which they derived independently through a more direct approach, essentially
solving the primal problem. The author would also like to thank Dick den Hertog for proofreading the first draft of
this manuscript.

A. Proofs

Proof of Lemma 1. Assume, without loss of generality, that the moment constraints are consistent so that P is
nonempty. Otherwise, both problems are infeasible with optimal value −∞, as per the conventional definition. Let
{𝜏̃, {ℙ̃𝑘}𝑘⩾1} and {𝜏∗, {ℙ∗𝑘}𝑘⩾1} be the optimal values, and optimal solutions (maximizing sequences), of (4) and
(8), respectively. We can assume maximizing sequences without loss of generality. If the optimal value is exactly
achieved by an extremal distribution ℙ∗, we can pick a sequence such that lim𝑘→∞ ℙ∗𝑘 = ℙ∗. To prove the result, it is
sufficient to show that
𝜏∗ = lim𝑘→∞ 𝔼ℙ∗𝑘[𝑔(𝑋)1Ξ(𝑋)]/𝔼ℙ∗𝑘[1Ξ(𝑋)] = lim𝑘→∞ 𝔼ℙ̃𝑘[𝑔(𝑋)1Ξ(𝑋)]/𝔼ℙ̃𝑘[1Ξ(𝑋)] = supℙ∈P 𝔼ℙ[𝑔(𝑋)1Ξ(𝑋)]/𝔼ℙ[1Ξ(𝑋)] =∶ 𝜏̃.    (35)

The final equality can be attributed to the optimality of {ℙ̃𝑘}. Note that all values are finite by assumption. Notice
further that

𝔼ℙ[𝑔(𝑋)1Ξ(𝑋)]/𝔼ℙ[1Ξ(𝑋)] ⩽ 𝜏̃,
for all ℙ ∈ P. Rewriting and taking the supremum over P, the inequality above is equivalent to

supℙ∈P 𝔼ℙ[𝑔(𝑋)1Ξ(𝑋) − 𝜏̃1Ξ(𝑋)] ⩽ 0,

which implies 𝜏̃ is feasible to problem (8). Since 𝜏 ∗ is a feasible solution to (8), it holds that

supℙ∈P 𝔼ℙ[𝑔(𝑋)1Ξ(𝑋) − 𝜏∗1Ξ(𝑋)] ⩽ 0.    (36)

Recall that the supremum in (36) is achieved by the sequence {ℙ∗𝑘 }. Rewriting (36), we obtain
lim𝑘→∞ 𝔼ℙ∗𝑘[𝑔(𝑋)1Ξ(𝑋) − 𝜏∗1Ξ(𝑋)] ⩽ 0 ⟺ 𝜏∗ ⩾ lim𝑘→∞ 𝔼ℙ∗𝑘[𝑔(𝑋)1Ξ(𝑋)]/𝔼ℙ∗𝑘[1Ξ(𝑋)],

from which the first identity in (35) follows by optimality of 𝜏 ∗ to problem (8). Since 𝜏̃ is optimal to (4),
𝜏∗ = lim𝑘→∞ 𝔼ℙ∗𝑘[𝑔(𝑋)1Ξ(𝑋)]/𝔼ℙ∗𝑘[1Ξ(𝑋)] ⩽ lim𝑘→∞ 𝔼ℙ̃𝑘[𝑔(𝑋)1Ξ(𝑋)]/𝔼ℙ̃𝑘[1Ξ(𝑋)] = 𝜏̃.    (37)

We next show that (36) also implies that


supℙ∈P 𝔼ℙ[𝑔(𝑋)1Ξ(𝑋)]/𝔼ℙ[1Ξ(𝑋)] ⩽ 𝜏∗.    (38)

That is, the sequence {ℙ∗𝑘 } is also an optimal solution of problem (4). For the sake of contradiction, assume that for
an arbitrary 𝜖 > 0, there exists a sequence of probability measures {ℙ̃ 𝑘 } such that
𝔼ℙ̃𝑘[𝑔(𝑋)1Ξ(𝑋)]/𝔼ℙ̃𝑘[1Ξ(𝑋)] > 𝜏∗ + 𝜖,

as 𝑘 grows sufficiently large. Fixing ℙ̃ 𝑘 for such 𝑘, we obtain

𝔼ℙ̃ 𝑘 [𝑔(𝑋 )1Ξ (𝑋 ) − 𝜏 ∗ 1Ξ (𝑋 )] > 𝜖𝔼ℙ̃ 𝑘 [1Ξ (𝑋 )].

For 𝔼ℙ̃𝑘[1Ξ(𝑋)] = ℙ̃𝑘(𝑋 ∈ Ξ) > 0 and 𝜖 > 0, this inequality contradicts (36). Moreover, if ℙ̃𝑘(𝑋 ∈ Ξ) = 0, the
fractional objective function would be ill-defined, also yielding a contradiction. Hence, it follows from combining
(37) and (38) that the optimal values of (4) and (8) agree, and there exists a sequence of probability distributions that
achieves this optimal value in both problems jointly. This completes the proof.

Proof of Proposition 2. Replacing the second-moment function 𝑥 2 by |𝑥 − 𝜇| and substituting 𝑑 for (𝜎 2 + 𝜇2 ) in (12)
yields the dual problem for supℙ∈P(𝜇,𝑑) 𝔼[𝑋 |𝑋 ⩾ 𝑡],

inf 𝜆3
𝜆0 ,𝜆1 ,𝜆2 ,𝜆3

subject to 𝜆0 + 𝜆1 𝜇 + 𝜆2 𝑑 ⩽ 0,
𝜆0 + 𝜆1 𝑥 + 𝜆2 |𝑥 − 𝜇| ⩾ 0, ∀𝑎 ⩽ 𝑥 < 𝑡,
(39)

𝜆0 + 𝜆1 𝑥 + 𝜆2 |𝑥 − 𝜇| ⩾ 𝑥 − 𝜆3 , ∀𝑡 ⩽ 𝑥 ⩽ 𝑏.

Denote the left-hand sides of the second and third constraints by 𝐷(𝑥) ∶= 𝜆0 + 𝜆1 𝑥 + 𝜆2 |𝑥 − 𝜇|. The function 𝐷(𝑥)
is dual feasible when it is greater than or equal to 0 for 𝑥 < 𝑡 and greater than or equal to 𝑥 − 𝜆3 for 𝑡 ⩽ 𝑥 ⩽ 𝑏.
First, we consider the case 𝑡 < 𝜇. To solve the dual problem, we shall consider three cases for the shape of the dual
function 𝐷(𝑥), as illustrated in Figure 5.
First, we discuss the case where 𝐷(𝑥) is a straight line (i.e., 𝜆2 = 0). If 𝜆1 = 0, then 𝐷1 (𝑥) is a horizontal line, which
is dual feasible because, for a suitable choice of 𝜆3 , this function lies above 1{𝑥⩾𝑡} (𝑥 − 𝜆3 ). Note that for this solution
to satisfy 𝜆0 + 𝜆1 𝜇 + 𝜆2 𝑑 ⩽ 0, the constant 𝜆0 = 0. Since we are minimizing 𝜆3 (or maximizing −𝜆3 ), we push the
function 𝑥 − 𝜆3 upward until it hits the horizontal line, so this choice for the dual function yields 𝑏 as the optimal
objective value. The dual function 𝐷1 (𝑥) cannot have a positive slope because this would imply 𝜆0 + 𝜆1 𝜇 > 0. Note
that a line with negative slope 𝜆1 < 0 will only increase 𝜆3. Hence, the horizontal line with 𝜆0 = 0 is the best feasible
option. A primal solution that attains this value is the distribution with support {(𝑏(𝑑 − 2𝜇) + 2𝜇²)/(𝑑 − 2𝑏 + 2𝜇), 𝑏} and probabilities

𝑝1 = 1 − 𝑑/(2(𝑏 − 𝜇)),  𝑝2 = 𝑑/(2(𝑏 − 𝜇)).

[Figure 5 plots the three candidate dual functions against 1{𝑥⩾𝑡}(𝑥 − 𝜆3), with the points 𝑡−, 𝜇, 𝑏 and the intercept 𝑡 − 𝜆3 marked on the axes.]

Figure 5: 𝐷1(𝑥), 𝐷2(𝑥), and 𝐷3(𝑥)

Using complementary slackness, we argue that the second case, 𝐷2 (𝑥), can be omitted. From Figure 5, observe
that the corresponding primal solution is supported on the values 𝑎, 𝑏. However, we cannot construct a two-point
distribution that, in general, satisfies the moment constraints. Therefore, this second case does not provide a useful
solution from which we can obtain a tight bound.
For the third case (wedge), let 𝐷3(𝑥) coincide with 1{𝑥⩾𝑡}(𝑥 − 𝜆3) at 𝑥 = 𝑡, 𝜇 and 𝑏. Choosing 𝐷3(𝑥) in this
particular way, the dual variables satisfy

𝜆0 = ((𝑡 + 𝜆3)𝜇 − 2𝑡𝜆3)/(2(𝑡 − 𝜇)),  𝜆1 = (𝜆3 + 𝑡 − 2𝜇)/(2(𝑡 − 𝜇)),  𝜆2 = (𝑡 − 𝜆3)/(2(𝑡 − 𝜇)).
Substituting these values into 𝜆0 + 𝜆1 𝜇 + 𝜆2 𝑑 ⩽ 0, we obtain

−𝜆3 + 𝜇 + 𝑑(𝜆3 − 𝑡)/(2(𝜇 − 𝑡)) ⩽ 0,

in which the left-hand side is a decreasing (linear) function of 𝜆3 for 𝑡 < 𝜇 − 𝑑(𝑏 − 𝜇)/(2(𝑏 − 𝜇) − 𝑑). Since we are
minimizing with respect to 𝜆3, we choose 𝜆3 such that equality is attained. Hence, 𝜆3∗ = 𝜇 + 𝑑(𝜇 − 𝑡)/(2(𝜇 − 𝑡) − 𝑑).
The distribution with support {𝑡−, 𝜆3∗} and respective probabilities

𝑝1 = 𝑑/(2(𝜇 − 𝑡)),  𝑝2 = 1 − 𝑑/(2(𝜇 − 𝑡)),
achieves 𝜆3∗ asymptotically. Combining the two feasible cases, and ensuring these bounds are tight by constructing
primal feasible solutions that (asymptotically) achieve these bounds, we arrive at the desired result.
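As a numerical sanity check (the instance below is ours, chosen so that the wedge case applies), the two-point distribution above matches the prescribed moments and attains 𝜆3∗:

```python
import numpy as np

# Check of the wedge-case bound in Proposition 2 (hypothetical instance).
mu, d, t = 5.0, 1.0, 4.0
lam3 = mu + d * (mu - t) / (2 * (mu - t) - d)       # candidate optimal value, 6.0 here

p1 = d / (2 * (mu - t))                              # mass placed just below t
support = np.array([t, lam3])
probs = np.array([p1, 1.0 - p1])

assert np.isclose(probs @ support, mu)               # mean constraint E[X] = mu
assert np.isclose(probs @ np.abs(support - mu), d)   # MAD constraint E|X - mu| = d
# conditioning on {X >= t} removes only the atom at t^-, so E[X | X >= t] -> lam3
print(lam3)
```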

Proof of Proposition 3. In this setting, the dual is given by

inf 𝜆3
𝜆0 ,𝜆1 ,𝜆2 ,𝜆3

subject to 𝜆0 + 𝜆1 𝜇 + 𝜆2 𝜎̄ ⩽ 0,
𝜆0 + 𝜆1 𝑥 + 𝜆2 𝑑(𝑥) ⩾ 0, ∀𝑥 < 𝑡,
(40)

𝜆0 + 𝜆1 𝑥 + 𝜆2 𝑑(𝑥) ⩾ 𝑥 − 𝜆3 , ∀𝑥 ⩾ 𝑡.

Define 𝑉 (𝑥) ∶= 𝜆0 + 𝜆1 𝑥 + 𝜆2 𝑑(𝑥). Analogous to the variance and MAD settings, a candidate solution is the majorant
𝑉 (𝑥) that touches at 𝑥 = 𝑡 − , and is tangent to 𝑥 − 𝜆3 at a point 𝑥0 > 𝑡 that will be determined a posteriori. Using these
insights, we solve the following system of equations to determine 𝑝𝑡, 𝑝𝑥0 and the dual variables 𝜆0, 𝜆1, 𝜆2, 𝜆3 (with 𝑥0
fixed):

𝑝𝑡𝑡 + 𝑝𝑥0𝑥0 = 𝜇,  𝑝𝑡𝑑(𝑡) + 𝑝𝑥0𝑑(𝑥0) = 𝜎̄,
𝜆0 + 𝜆1𝑡 + 𝜆2𝑑(𝑡) = 0,  𝜆0 + 𝜆1𝑥0 + 𝜆2𝑑(𝑥0) = 𝑥0 − 𝜆3,
𝜆1 + 𝜆2𝑑′(𝑥0) = 1,  𝜆0 + 𝜆1𝜇 + 𝜆2𝜎̄ = 0,

where the first line contains the moment constraints, the second and third lines fix the shape of 𝑉(𝑥), and finally,
we assume that the dual constraint 𝜆0 + 𝜆1𝜇 + 𝜆2𝜎̄ ⩽ 0 is tight. The derivative 𝑑′(𝑥) is assumed to be
the right derivative, in order to allow for nondifferentiable dispersion functions. Notice that the resulting dual
solution is always feasible, since 𝑉(𝑥) is convex (𝜆2 ⩾ 0 is a necessary condition for dual feasibility), and therefore,
𝑉(𝑥) ⩾ (𝑥 − 𝜆3)1Ξ(𝑥), ∀𝑥, by the constraints 𝑉′(𝑥0) = 1 and 𝑉(𝑡) = 0. Solving the equations, one obtains

𝑝𝑡 = (𝜎̄𝑥0 − 𝜇𝑑(𝑥0))/(𝑥0𝑑(𝑡) − 𝑡𝑑(𝑥0)),  𝑝𝑥0 = (𝜎̄𝑡 − 𝜇𝑑(𝑡))/(𝑡𝑑(𝑥0) − 𝑥0𝑑(𝑡)),  𝜆3 = (−(𝑡 − 𝜇)(𝑑(𝑥0) − 𝑥0𝑑′(𝑥0)) − 𝜇𝑑(𝑡) + 𝜎̄𝑡)/((𝑡 − 𝜇)𝑑′(𝑥0) − 𝑑(𝑡) + 𝜎̄),

and

𝜆0 = (𝜇𝑑(𝑡) − 𝜎̄𝑡)/((𝑡 − 𝜇)𝑑′(𝑥0) − 𝑑(𝑡) + 𝜎̄),  𝜆1 = (𝜎̄ − 𝑑(𝑡))/((𝑡 − 𝜇)𝑑′(𝑥0) − 𝑑(𝑡) + 𝜎̄),  𝜆2 = (𝑡 − 𝜇)/((𝑡 − 𝜇)𝑑′(𝑥0) − 𝑑(𝑡) + 𝜎̄).


To guarantee strong duality, we choose 𝑥0 such that

𝑥0 = (−(𝑡 − 𝜇)(𝑑(𝑥0) − 𝑥0𝑑′(𝑥0)) − 𝜇𝑑(𝑡) + 𝜎̄𝑡)/((𝑡 − 𝜇)𝑑′(𝑥0) − 𝑑(𝑡) + 𝜎̄) = 𝜆3 ⟺ ((𝑡 − 𝑥0)𝜎̄ + (𝑥0 − 𝜇)𝑑(𝑡) + (𝜇 − 𝑡)𝑑(𝑥0))/((𝑡 − 𝜇)𝑑′(𝑥0) − 𝑑(𝑡) + 𝜎̄) = 0,
and the normalization constraint 𝑝𝑡 + 𝑝𝑥0 = 1 hold. Both conditions are equivalent to

(𝑡 − 𝑥0 )𝜎̄ + (𝑥0 − 𝜇)𝑑(𝑡) + (𝜇 − 𝑡)𝑑(𝑥0 ) = 0.

Consequently, the 𝑥0∗ that follows from solving

(𝜎̄𝑡 − 𝜇𝑑(𝑡))/(𝑡𝑑(𝑥0) − 𝑥0𝑑(𝑡)) + (𝜎̄𝑥0 − 𝜇𝑑(𝑥0))/(𝑥0𝑑(𝑡) − 𝑡𝑑(𝑥0)) = 1,
is optimal. Hence, the claim follows.

Proof of Proposition 4. The class of symmetric pairs of Dirac measures (i.e., 𝛿𝜇−𝑥 , 𝛿𝜇+𝑥 , 𝑥 ⩾ 0) generates the set of
symmetric distributions about 𝜇. From Theorem 1, it follows that the dual problem is given by

inf 𝜆3
𝜆0 ,𝜆1 ,𝜆2 ,𝜆3

subject to 𝜆0 + 𝜆1 𝜇 + 𝜆2 (𝜎 2 + 𝜇2 ) ⩽ 0,
2𝜆0 + 2𝜆1 𝜇 + 2𝜆2 (𝑥 2 + 𝜇2 ) + 𝜆3 1Ξ (𝜇 − 𝑥) + 𝜆3 1Ξ (𝜇 + 𝑥)
(41)

⩾ (𝜇 − 𝑥)1Ξ (𝜇 − 𝑥) + (𝜇 + 𝑥)1Ξ (𝜇 + 𝑥), ∀𝑥 ⩾ 0.

The last constraint can be reduced to

2𝜆0 + 2𝜆1 𝜇 + 2𝜆2 (𝑥 2 + 𝜇2 ) ⩾ −2𝜆3 + 2𝜇, ∀0 ⩽ 𝑥 < 𝜇 − 𝑡,


2𝜆0 + 2𝜆1 𝜇 + 2𝜆2 (𝑥 2 + 𝜇2 ) ⩾ 𝑥 + 𝜇 − 𝜆3 , ∀𝑥 ⩾ 𝜇 − 𝑡.

Notice that 1Ξ(𝜇 + 𝑥) = 1, ∀𝑥 ⩾ 0, since it is assumed that 𝑡 < 𝜇. Define the quadratic function 𝑀sym(𝑥) ∶=
2𝜆0 + 2𝜆1𝜇 + 2𝜆2(𝑥² + 𝜇²). We suggest two possible solutions for the dual problem. The first solution, denoted as
𝑀1sym(𝑥), touches −2𝜆3 + 2𝜇 at 𝑥 = 0 and 𝑥 + 𝜇 − 𝜆3 at 𝜇 − 𝑡. The second solution, 𝑀2sym(𝑥), is a quadratic function
that touches 𝑥 + 𝜇 − 𝜆3 at an optimal point 𝑥0 that is unknown a priori. We further postulate that in both dual
solutions, the constraint 𝜆0 + 𝜆1𝜇 + 𝜆2(𝜇² + 𝜎²) ⩽ 0 is tight. In the interest of space, we omit the figure, but it is
easily verified that the suggested solutions are dual feasible. The corresponding primal solutions follow from
complementary slackness and are the pairs of Dirac measures 𝛿𝜇−𝑥, 𝛿𝜇+𝑥 in which for 𝑥 we substitute the points at
which the dual function coincides with the right-hand sides of the constraints in (41). The dual variables which
correspond to 𝑀1sym(𝑥) are

𝜆1 = (−𝜆0 + (𝑡 − 𝜇)(𝜇² + 𝜎²)/(2(𝑡 − 𝜇)² − 𝜎²))/𝜇,  𝜆2 = (𝜇 − 𝑡)/(2(𝑡 − 𝜇)² − 𝜎²),

yielding

𝜆3∗ = (2(𝑡 − 𝜇)²𝜇 − 𝑡𝜎²)/(2(𝑡 − 𝜇)² − 𝜎²)

as our guess for the optimal value of the dual problem. The proposed solution is feasible for the dual problem since
𝑀1sym(𝑥) is convex (as 𝜆2 ⩾ 0) and tangent to −2𝜆3 + 2𝜇 at 𝑥 = 0, and further some straightforward calculations show
that the derivative of 𝑀1sym(𝑥) at 𝑥 = 𝜇 − 𝑡 is greater than 1, so that 𝑀1sym(𝑥) ⩾ 𝑥 + 𝜇 − 𝜆3∗, ∀𝑥 ⩾ 𝜇 − 𝑡. For the primal

probabilities, it follows from the variance constraint


1 1
(1 − 𝑝)𝜇2 + 𝑝𝑡 2 + 𝑝(2𝜇 − 𝑡)2 = 𝜇2 + 𝜎 2 ,
2 2

𝜎2
𝑝= .
that

(𝑡 − 𝜇)2

Hence,

𝔼[𝑋 | 𝑋 ⩾ 𝑡] = ((𝜎²/(2(𝑡 − 𝜇)²))(2𝜇 − 𝑡) + (1 − 𝜎²/(𝑡 − 𝜇)²)𝜇) / (𝜎²/(2(𝑡 − 𝜇)²) + (1 − 𝜎²/(𝑡 − 𝜇)²)) = (2𝜇(𝑡 − 𝜇)² − 𝜎²𝑡)/(2(𝑡 − 𝜇)² − 𝜎²) = 𝜆3∗,

so that these are the optimal primal-dual solutions by weak duality. For the second case, we determine the candidate
support point 𝑥0 first. From the moment constraints and the fact that the solution should be symmetric about 𝜇, we
find that 𝑥0∗ = 𝜎 and therefore, the primal candidate is given by the distribution ½𝛿𝜇−𝜎 + ½𝛿𝜇+𝜎. The second dual
solution then yields

𝜆1 = −(4𝜆0𝜎 + 𝜇² + 𝜎²)/(4𝜇𝜎),  𝜆2 = 1/(4𝜎),  𝜆3∗ = 𝜇 + 𝜎.

Thus,

𝔼[𝑋 | 𝑋 ⩾ 𝑡] = 𝜇 + 𝑥0∗ = 𝜇 + 𝜎 = 𝜆3∗.

For 𝑡 ⩾ 𝜇, there exists a sequence of measures supported on {𝜇, 𝜇 − 𝑘, 𝜇 + 𝑘} that is feasible in the primal, and for
which the conditional expectation diverges, as 𝑘 → ∞. This feasible sequence is given by
ℙ𝑘 = (1 − 𝜎²/𝑘²)𝛿𝜇 + (𝜎²/(2𝑘²))𝛿𝜇−𝑘 + (𝜎²/(2𝑘²))𝛿𝜇+𝑘.
It is then easily verified that lim𝑘→∞ 𝔼ℙ𝑘 [𝑋 | 𝑋 ⩾ 𝑡] diverges. Combining the cases above, while checking for feasi-
bility of the primal solutions, the claim follows.
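As a numerical sanity check on the first case (the instance is ours and assumes 𝑡 ⩽ 𝜇 − 𝜎), the three-point symmetric primal solution matches the moments and, after removing the mass just below 𝑡, attains 𝜆3∗:

```python
import numpy as np

# Check of the first case in Proposition 4 (hypothetical instance with t <= mu - sigma).
mu, sigma, t = 5.0, 1.0, 3.5
lam3 = (2 * mu * (t - mu)**2 - sigma**2 * t) / (2 * (t - mu)**2 - sigma**2)

p = sigma**2 / (t - mu)**2                   # total mass on the symmetric pair {t, 2mu - t}
support = np.array([t, mu, 2 * mu - t])
probs = np.array([p / 2, 1 - p, p / 2])

assert np.isclose(probs @ support, mu)                   # mean constraint
assert np.isclose(probs @ (support - mu)**2, sigma**2)   # variance constraint
# the atom at t is approached from below, so conditioning keeps only mu and 2mu - t:
cond = (probs[1] * mu + probs[2] * (2 * mu - t)) / (probs[1] + probs[2])
assert np.isclose(cond, lam3)
```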

Proof of Proposition 5. Symmetric, unimodal distributions with mode 𝜇 can be generated by rectangular/uniform
distributions, possibly including a Dirac measure at 𝜇 (i.e., 𝛿[𝜇−𝑧,𝜇+𝑧] , 𝑧 ⩾ 0). From Theorem 1, it follows that the dual
problem is given by

inf 𝜆3
𝜆0 ,𝜆1 ,𝜆2 ,𝜆3

subject to 𝜆0 + 𝜆1 𝜇 + 𝜆2 (𝜎 2 + 𝜇2 ) ⩽ 0,
∫_{𝜇−𝑥}^{𝜇+𝑥} (𝜆0 + 𝜆1𝑧 + 𝜆2𝑧²) d𝑧 ⩾ ∫_{𝜇−𝑥}^{𝜇+𝑥} (𝑧 − 𝜆3)1Ξ(𝑧) d𝑧, ∀𝑥 > 0,    (42)
𝜆0 + 𝜆1𝜇 + 𝜆2𝜇² ⩾ 𝜇 − 𝜆3.

After computing the integral on the left-hand side of the penultimate constraint, one obtains
(2/3)𝑥(3𝜆0 + 3𝜇(𝜆1 + 𝜆2𝜇) + 𝜆2𝑥²) ⩾ ∫_{𝜇−𝑥}^{𝜇+𝑥} (𝑧 − 𝜆3)1Ξ(𝑧) d𝑧, ∀𝑥 > 0.

For the right-hand side, we distinguish two cases so that we can split the semi-infinite constraint into two sets,
resulting in the system of inequalities
(2/3)𝑥(3𝜆0 + 3𝜇(𝜆1 + 𝜆2𝜇) + 𝜆2𝑥²) ⩾ 2𝑥(𝜇 − 𝜆3), ∀0 < 𝑥 < 𝜇 − 𝑡,
(2/3)𝑥(3𝜆0 + 3𝜇(𝜆1 + 𝜆2𝜇) + 𝜆2𝑥²) ⩾ −½(𝑡 − 𝑥 − 𝜇)(𝑡 + 𝑥 − 2𝜆3 + 𝜇), ∀𝑥 ⩾ 𝜇 − 𝑡.    (43)
Again, we can make an educated guess for an optimal dual solution, and using weak duality, prove optimality by
constructing a matching primal solution using the complementary slackness property. The dual solution is now
characterized by a third-order polynomial function, 𝑀uni(𝑥) ∶= (2/3)𝑥(3𝜆0 + 3𝜇(𝜆1 + 𝜆2𝜇) + 𝜆2𝑥²). We show through
primal-dual reasoning that there are merely two feasible options for the extremal distribution. Using this insight, we
optimize the primal problem directly by plugging in the candidate form of the extremal distribution. Notice that, as
𝑀uni(𝑥) needs to be convex in order to be dual feasible, there cannot exist a tangent point on the interval [0, 𝜇 − 𝑡)
because otherwise 𝑀uni(𝑥) would need to intersect 2𝑥(𝜇 − 𝜆3). Furthermore, there can exist only one tangent point
𝑥 = 𝑥0∗ at which 𝑀uni(𝑥) coincides with the quadratic function −½(𝑡 − 𝑥 − 𝜇)(𝑡 + 𝑥 − 2𝜆3 + 𝜇), as 𝑀uni(𝑥) is a cubic
function. By complementary slackness, the corresponding extremal distribution is then given by the mixture of a
Dirac measure at 𝜇 and a uniform distribution on the interval [𝜇 − 𝑥0∗, 𝜇 + 𝑥0∗], for the first case, or a uniform
distribution on [𝜇 − √3𝜎, 𝜇 + √3𝜎] for the second. Indeed, the latter uniform distribution is the only one that is
feasible for the primal problem. From these observations, it follows that the primal problem can be reduced to a finite-dimensional
(nonconvex) optimization problem. The objective function of the primal problem can be rewritten as
𝔼[𝑋 | 𝑋 ⩾ 𝑡] = ((1 − 𝑝)𝜇 + 𝑝∫_{𝑡}^{𝜇+𝑥0} (𝑧/(2𝑥0)) d𝑧) / ((1 − 𝑝) + 𝑝∫_{𝑡}^{𝜇+𝑥0} (1/(2𝑥0)) d𝑧) = (4𝑥0𝜇 − 𝑝(𝑡 + 𝑥0 − 𝜇)(𝑡 − 𝑥0 + 𝜇)) / (4𝑥0 − 2𝑝(𝑡 + 𝑥0 − 𝜇)).    (44)

From the variance constraint


(1 − 𝑝)𝜇² + 𝑝((𝜇 − 𝑥0)² + (𝜇 − 𝑥0)(𝜇 + 𝑥0) + (𝜇 + 𝑥0)²)/3 = 𝜎² + 𝜇²,

it follows that 𝑝 = 3𝜎²/𝑥0². In order to be a probability, it should hold that 𝑝 ⩽ 1, and hence 𝑥0 ⩾ √3𝜎. The variable 𝑝
can be eliminated from (44), yielding the optimization problem

max_{𝑥0} { (4𝜇𝑥0³ − 3𝜎²(−𝜇 + 𝑡 + 𝑥0)(𝜇 + 𝑡 − 𝑥0)) / (4𝑥0³ − 6𝜎²(−𝜇 + 𝑡 + 𝑥0)) ∶ 𝑥0 ⩾ √3𝜎 }.    (45)

From standard arguments, it follows that the maximum of (45) must be attained at a critical point of the objective
function or at the boundary of the feasible region. To arrive at the first case, we need to solve the first-order condition

6𝜎 2 𝑥02 (3(𝑡 − 𝜇)2 − 𝑥02 ) + 9𝜎 4 (𝜇 − 𝑡 − 𝑥0 )2 = 0,


which is a depressed quartic equation. It can be shown, after some tedious algebra, that there exists only one real-
valued solution greater than √3𝜎 for the range of 𝑡 as given in the assertion. The second case corresponds
to the boundary of the feasible region, for which 𝑝 = 1. As a consequence, the optimal tangent point is 𝑥0∗ = √3𝜎. It is
easy to verify that this solution yields the second case. Finally, for the third case, 𝑡 ⩾ 𝜇, we construct a maximizing
sequence {ℙ𝑘 } for which 𝔼[𝑋 | 𝑋 ⩾ 𝑡] diverges. To this end, consider

ℙ𝑘 = (1 − 3𝜎²/𝑘²)𝛿𝜇 + (3𝜎²/𝑘²)𝛿[𝜇−𝑘,𝜇+𝑘],

which is feasible for the primal problem. Then

𝔼ℙ𝑘[𝑋 | 𝑋 ⩾ 𝑡] = (∫_{𝑡}^{𝜇+𝑘} (𝑧/(2𝑘)) d𝑧) / (∫_{𝑡}^{𝜇+𝑘} (1/(2𝑘)) d𝑧) = ½(𝑡 + 𝜇 + 𝑘) ⟶ ∞ as 𝑘 → ∞,

hence resulting in the third case. Combining the cases above completes the proof.

Proof of Proposition 6. The primal can be equivalently stated as

sup_{ℙ∈P(𝜇,𝜎)} 𝔼[1{𝑋⩾𝑧} | 𝑋 ⩾ 𝑝],

which is (weakly) dual to

inf 𝜆3
𝜆0 ,𝜆1 ,𝜆2 ,𝜆3

subject to 𝜆0 + 𝜆1 𝜇 + 𝜆2 (𝜎 2 + 𝜇2 ) ⩽ 0, (46)

𝜆0 + 𝜆1 𝑥 + 𝜆2 𝑥 2 ⩾ (1{𝑥⩾𝑧} (𝑥) − 𝜆3 )1Ξ (𝑥), ∀𝑥 ⩾ 0.

The right-hand side of the constraint is equal to 0 for 𝑥 < 𝑝, to −𝜆3 for 𝑝 ⩽ 𝑥 < 𝑧, and to 1 − 𝜆3 for 𝑥 ⩾ 𝑧. We discuss
three cases, in the order of their appearance in the claim. The first dual solution, 𝑀1(𝑥), corresponds to a convex
quadratic function that touches (1{𝑥⩾𝑧}(𝑥) − 𝜆3)1Ξ(𝑥) at 𝑧 and 𝑥0, where the latter point lies between 𝑝 and 𝑧. The
primal probabilities, dual variables, and the support point 𝑥0 follow from solving

𝑝𝑥0 + 𝑝𝑧 = 1, 𝑝𝑥0 𝑥0 + 𝑝𝑧 𝑧 = 𝜇, 𝑝𝑥0 𝑥02 + 𝑝𝑧 𝑧 2 = 𝜇2 + 𝜎 2 ,


𝜆0 + 𝜆1 𝑥0 + 𝜆2 𝑥02 = −𝜆3 , 𝜆0 + 𝜆1 𝑧 + 𝜆2 𝑧 2 = 1 − 𝜆3 ,
𝜆1 + 2𝜆2 𝑥0 = 0, 𝜆0 + 𝜆1 𝜇 + 𝜆2 (𝜇2 + 𝜎 2 ) = 0,

yielding as solution

𝑝𝑥0 = (𝑧 − 𝜇)²/(𝜎² + (𝑧 − 𝜇)²),  𝑝𝑧 = 𝜎²/(𝜎² + (𝑧 − 𝜇)²),  𝑥0 = 𝜇 + 𝜎²/(𝜇 − 𝑧),

and

𝜆0 = (𝑧 − 𝜇)(𝜇²(𝑧 − 𝜇) − 𝜎²(𝜇 + 𝑧))/(𝜎² + (𝑧 − 𝜇)²)²,  𝜆1 = 2(𝑧 − 𝜇)(𝜇² + 𝜎² − 𝜇𝑧)/(𝜎² + (𝑧 − 𝜇)²)²,  𝜆2 = (𝑧 − 𝜇)²/(𝜎² + (𝑧 − 𝜇)²)²,  𝜆3 = 𝜎²/(𝜎² + (𝑧 − 𝜇)²).

Indeed, by weak duality, this gives the best possible bound since

𝔼[1{𝑋⩾𝑧}(𝑋)] = 𝑝𝑧 = 𝜎²/(𝜎² + (𝑧 − 𝜇)²) = 𝜆3.

For the second case, let 𝑀2(𝑥) denote a quadratic function that touches at 𝑥 = 𝑧 and some point 𝑥0, like 𝑀1(𝑥), but
additionally agrees with (1{𝑥⩾𝑧}(𝑥) − 𝜆3)1Ξ(𝑥) at 𝑥 = 𝑝. Again, we can use a similar set of conditions, as described
above, to find the optimal primal and dual solutions, but now with a three-point distribution with 𝑥0 = (𝜇² + 𝜎² − 𝜇𝑧)/(𝜇 − 𝑧)
(which follows from the conditions). This leads to the second bound. For brevity, we omit the detailed calculations.
Finally, for the third case, 𝑀3(𝑥) is a quadratic function that touches at 𝑧 and a point 0 ⩽ 𝑥 ⩽ 𝑝. The same set of
calculations leads to the third upper bound, which is equal to the constant 1. We combine the cases above in such a
way that the primal distributions are feasible. This completes the proof.
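As a quick sanity check on the first case (the instance is ours, with 𝑝 ⩽ 𝑥0 so that conditioning removes no mass), the two-point distribution above satisfies the moment constraints and attains the bound:

```python
import numpy as np

# Check of the first-case bound in Proposition 6 (hypothetical instance).
mu, sigma, z = 5.0, 1.0, 7.0
pz = sigma**2 / (sigma**2 + (z - mu)**2)     # candidate bound, 0.2 here
x0 = mu + sigma**2 / (mu - z)                # second support point, 4.5 here

support = np.array([x0, z])
probs = np.array([1 - pz, pz])
assert np.isclose(probs @ support, mu)                        # mean constraint
assert np.isclose(probs @ support**2, mu**2 + sigma**2)       # second-moment constraint
# with p <= x0 both atoms survive the conditioning, so P(X >= z | X >= p) = pz
print(pz)
```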

B. Conic reformulations
Proof of the second claim in Theorem 3. We will focus on the second semi-infinite constraint (the first constraint can
be dealt with analogously), which can equivalently be written as the collection of robust counterparts

𝜆0 + λ⊤1 𝐱 + λ⊤2 𝐮 + 𝜆3 ⩾ 𝐬𝑙 (ν)⊤ 𝐱 + 𝑡𝑙 (ν), ∀(𝐱, 𝐮) ∈ C, ∀𝑙 ∈ L. (47)

Let “cl” denote the closure of a set. We next generate a proper cone from the uncertainty set as follows. Define
K ∶= cl({(𝐱, 𝐮, 𝑤) ∈ ℝ𝑛 × ℝ𝑚 × ℝ ∶ (𝐱/𝑤, 𝐮/𝑤) ∈ C, 𝑤 > 0}), to which the cone K∗ is dual. We henceforth assume
that K and its dual cone K∗ are representable as tractable cones. The semi-infinite constraint (47) is satisfied if, and

only if, the optimal value of

inf { (λ1 − 𝐬𝑙(ν))⊤𝐱 + λ2⊤𝐮 ∶ (𝐱, 𝐮, 1) ∈ K }    (48)

is greater than, or equal to, 𝑡𝑙 (ν) − 𝜆0 − 𝜆3 . Suppose there exists a strictly feasible solution to this problem. Then, by
conic duality, the strong dual of (48) is given by

sup − 𝑤𝑙
s.t. λ1 − 𝐬𝑙 (ν) − 𝐚𝑙 = 𝟎,
λ2 − 𝐛𝑙 = 𝟎,
(49)

(𝐚𝑙 , 𝐛𝑙 , 𝑤𝑙 ) ∈ K ∗ .

Therefore, the semi-infinite constraint (47) is satisfied if, and only if, there exist solutions (𝐚𝑙 , 𝐛𝑙 , 𝑤𝑙 ) ∈ K ∗ , for all 𝑙 ∈ L,
such that the constraints in (49) are satisfied and −𝑤𝑙 is not less than 𝑡𝑙 (ν) − 𝜆0 − 𝜆3 . Using this dual characterization,

we can rewrite (28) in the following way:

inf 𝜆3

s.t. 𝜆0 + λ⊤1 µ + λ⊤2 σ ⩽ 0,


𝜆0 + λ1⊤𝐱 + λ2⊤𝐮 ⩾ 0, ∀(𝐱, 𝐮) ∈ C̄,
𝜆0 + 𝜆3 − 𝑡𝑙(ν) − 𝑤𝑙 ⩾ 0, ∀𝑙 ∈ L,
λ1 − 𝐬𝑙(ν) − 𝐚𝑙 = 𝟎, ∀𝑙 ∈ L,    (50)
λ2 − 𝐛𝑙 = 𝟎, ∀𝑙 ∈ L,
(𝐚𝑙, 𝐛𝑙, 𝑤𝑙) ∈ K∗, ∀𝑙 ∈ L,
ν ∈ V, 𝜆0 ∈ ℝ, λ1 ∈ ℝ𝑛 , λ2 ∈ D∗ , 𝜆3 ∈ ℝ.

If C̄ is also conic representable and satisfies a similar Slater condition, an analogous argument enables us to refor-
mulate the first semi-infinite constraint, reducing (50) to a finite-dimensional conic optimization problem. Hence,
the second claim follows.

We use the following result to derive the LMI reformulations for the Chebyshev ambiguity set P(µ,Σ) .

Lemma 2 (S-Lemma, [49]). Consider two quadratic functions of 𝐱 ∈ ℝ𝑛 , 𝑞𝑖 (𝐱) = 𝐱⊤ 𝐂𝑖 𝐱 + 2𝐜⊤𝑖 𝐱 + 𝑐̄𝑖 , 𝑖 = 0, 1, with
𝑞1 (𝐱) > 0 for some 𝐱. Then
𝑞0 (𝐱) ⩾ 0 ∀𝐱 ∶ 𝑞1 (𝐱) ⩾ 0
if, and only if, there exists 𝜏 ⩾ 0 such that

[ 𝑐̄0, 𝐜0⊤ ; 𝐜0, 𝐂0 ] − 𝜏 [ 𝑐̄1, 𝐜1⊤ ; 𝐜1, 𝐂1 ] ∈ 𝕊+𝑛+1.

Proof of Corollary 1. By the S-Lemma, we have that

𝜆0 + λ1⊤𝐱 + 𝐱⊤𝚲𝐱 ⩾ 0, ∀𝐱 ∶ 𝐜⊤𝐱 ⩾ 𝑐̄ ⟺ ∃𝜏 ⩾ 0 ∶ [ 𝜆0 + 𝜏𝑐̄, ½(λ1 − 𝜏𝐜)⊤ ; ½(λ1 − 𝜏𝐜), 𝚲 ] ≽ 𝟎.

Analogously, for the second semi-infinite constraint,

𝜆0 + (λ1 − 𝐬𝑙(ν))⊤𝐱 + 𝐱⊤𝚲𝐱 + 𝜆3 − 𝑡𝑙(ν) ⩾ 0, ∀𝐱 ∶ 𝐜⊤𝐱 ⩽ 𝑐̄ ⟺ ∃𝜒𝑙 ⩾ 0 ∶ [ 𝜆0 + 𝜆3 − 𝑡𝑙(ν) − 𝜒𝑙𝑐̄, ½(λ1 − 𝐬𝑙(ν) + 𝜒𝑙𝐜)⊤ ; ½(λ1 − 𝐬𝑙(ν) + 𝜒𝑙𝐜), 𝚲 ] ≽ 𝟎.
These LMIs yield the following semidefinite programming problem:

inf 𝜆3
s.t. 𝜆0 + λ1⊤µ + ⟨𝚲, 𝚺⟩ ⩽ 0,
[ 𝜆0 + 𝜏𝑐̄, ½(λ1 − 𝜏𝐜)⊤ ; ½(λ1 − 𝜏𝐜), 𝚲 ] ≽ 𝟎,
[ 𝜆0 + 𝜆3 − 𝑡𝑙(ν) − 𝜒𝑙𝑐̄, ½(λ1 − 𝐬𝑙(ν) + 𝜒𝑙𝐜)⊤ ; ½(λ1 − 𝐬𝑙(ν) + 𝜒𝑙𝐜), 𝚲 ] ≽ 𝟎, ∀𝑙 ∈ L,
ν ∈ V, 𝜆0 ∈ ℝ, λ1 ∈ ℝ𝑛, 𝚲 ∈ 𝕊𝑛+, 𝜆3 ∈ ℝ, 𝜏 ∈ ℝ+, χ ∈ ℝ+|L|,

where ⟨⋅, ⋅⟩ denotes the trace inner product.

Proof of Corollary 2. The model with MAD information follows from defining separate constraints for the positive
and negative parts of the absolute value terms. This yields

inf 𝜆3
ν,𝜆0 ,λ1 ,λ2 ,𝜆3

s.t. 𝜆0 + λ⊤1 𝐦 + λ⊤2 𝐟 ⩽ 0,


𝜆0 + λ1⊤𝐱 + λ2⊤𝐮 ⩾ 0, ∀(𝐱, 𝐮) ∶ 𝐜⊤𝐱 ⩾ 𝑐̄, 𝐮 ⩾ ±𝐝(𝐱),
𝜆0 + λ1⊤𝐱 + λ2⊤𝐮 + 𝜆3 ⩾ 𝐬𝑙(ν)⊤𝐱 + 𝑡𝑙(ν), ∀(𝐱, 𝐮) ∶ 𝐜⊤𝐱 ⩽ 𝑐̄, 𝐮 ⩾ ±𝐝(𝐱), ∀𝑙 ∈ L,
ν ∈ V, 𝜆0 ∈ ℝ, λ1 ∈ ℝ𝑛, λ2 ∈ ℝ+𝑛², 𝜆3 ∈ ℝ,

where 𝐝(𝐱) = 𝐦0 + 𝐃𝐱 describes the affine functions 𝑋𝑖 − 𝑚𝑖 and (𝑋𝑖 ± 𝑋𝑘 ) − (𝑚𝑖 ± 𝑚𝑘 ), and the inequalities 𝐮 ⩾ ±𝐝(𝐱)
hold elementwise. Let us focus on the first semi-infinite constraint. It can be rewritten as
𝜆0 + min{ λ1⊤𝐱 + λ2⊤𝐮 ∶ 𝐜⊤𝐱 ⩾ 𝑐̄, 𝐮 ⩾ ±𝐝(𝐱) } ⩾ 0.

Using the fact that 𝐝 is an affine, vector-valued function of 𝐱, standard LP duality yields a finite-dimensional linear
reformulation since

𝜆0 + min{ λ1⊤𝐱 + λ2⊤𝐮 ∶ 𝐜⊤𝐱 ⩾ 𝑐̄, 𝐮 ⩾ ±𝐝(𝐱) } ⩾ 0
⟺ 𝜆0 + max{ 𝑐̄𝜏 + χ+⊤𝐦0 − χ−⊤𝐦0 ∶ 𝜏𝐜 − 𝐃⊤χ+ + 𝐃⊤χ− = λ1, χ+ + χ− = λ2, 𝜏 ∈ ℝ+, χ+, χ− ∈ ℝ+𝑛² } ⩾ 0
⟺ ∃𝜏 ∈ ℝ+, χ+, χ− ∈ ℝ+𝑛² ∶ 𝜆0 + 𝑐̄𝜏 + χ+⊤𝐦0 − χ−⊤𝐦0 ⩾ 0, 𝜏𝐜 − 𝐃⊤χ+ + 𝐃⊤χ− = λ1, χ+ + χ− = λ2.

A similar argument applies to the other semi-infinite constraint. Therefore, all robust counterparts can be rewritten
in terms of linear inequalities, yielding the result.

References
[1] Ban, G.-Y. and Rudin, C. (2019). The big data newsvendor: Practical insights from machine learning. Operations
Research, 67(1):90–108.
[2] Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. (2013). Robust solutions of
optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357.
[3] Ben-Tal, A., Den Hertog, D., and Vial, J.-P. (2015). Deriving robust counterparts of nonlinear uncertain inequal-
ities. Mathematical Programming, 149(1-2):265–299.
[4] Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. (2009). Robust Optimization, volume 28. Princeton University
Press, Princeton, NJ.
[5] Ben-Tal, A. and Nemirovski, A. (1998). Robust convex optimization. Mathematics of Operations Research,
23(4):769–805.
[6] Ben-Tal, A. and Nemirovski, A. (2001). Lectures on Modern Convex Optimization: Analysis, Algorithms, and
Engineering Applications. SIAM, Philadelphia.
[7] Bertsimas, D. and Kallus, N. (2020). From predictive to prescriptive analytics. Management Science, 66(3):1025–
1044.

39
[8] Bertsimas, D. and Popescu, I. (2002). On the relation between option and stock prices: A convex optimization
approach. Operations Research, 50(2):358–374.
[9] Bertsimas, D. and Popescu, I. (2005). Optimal inequalities in probability theory: A convex optimization approach.
SIAM Journal on Optimization, 15(3):780–804.
[10] Birge, J. R. and Dulá, J. H. (1991). Bounding separable recourse functions with limited distribution information.
Annals of Operations Research, 30(1):277–298.
[11] Birge, J. R. and Wets, R. J.-B. (1987). Computing bounds for stochastic programming problems by means of a
generalized moment problem. Mathematics of Operations Research, 12(1):149–162.
[12] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, Cambridge, UK.
[13] Charnes, A. and Cooper, W. W. (1962). Programming with linear fractional functionals. Naval Research Logistics
Quarterly, 9(3-4):181–186.
[14] Chebyshev, P. L. (1874). Sur les valeurs limites des intégrales. Journal de Mathématiques Pures et Appliquées,
12:177–184.
[15] Chen, H., Hu, M., and Perakis, G. (2022). Distribution-free pricing. Manufacturing & Service Operations Man-
agement, 24(4):1939–1958.
[16] Chen, Z., Sim, M., and Xiong, P. (2020). Robust stochastic optimization made easy with RSOME. Management
Science, 66(8):3329–3339.
[17] Cozman, F. G. (1999). Calculation of posterior bounds given convex sets of prior probability measures and
likelihood functions. Journal of Computational and Graphical Statistics, 8(4):824–838.
[18] de Klerk, E., Kuhn, D., and Postek, K. (2020). Distributionally robust optimization with polynomial densities:
Theory, models and algorithms. Mathematical Programming, 181:265–296.
[19] Delage, E. and Ye, Y. (2010). Distributionally robust optimization under moment uncertainty with application
to data-driven problems. Operations Research, 58(3):595–612.
[20] Dinkelbach, W. (1967). On nonlinear fractional programming. Management Science, 13(7):492–498.
[21] Dupačová, J. (1987). The minimax approach to stochastic programming and an illustrative application. Stochas-
tics: An International Journal of Probability and Stochastic Processes, 20(1):73–88.
[22] Elmachtoub, A. N., Gupta, V., and Hamilton, M. L. (2021). The value of personalized pricing. Management
Science, 67(10):6055–6070.
[23] Esteban-Pérez, A. and Morales, J. M. (2022). Distributionally robust stochastic programs with side information
based on trimmings. Mathematical Programming, 195(1-2):1069–1105.
[24] Giannakopoulos, Y., Poças, D., and Tsigonias-Dimitriadis, A. (2020). Robust revenue maximization under min-
imal statistical information. In International Conference on Web and Internet Economics, pages 177–190, Cham,
Switzerland. Springer.
[25] Gorissen, B. L. (2015). Robust fractional programming. Journal of Optimization Theory and Applications,
166(2):508–528.
[26] Gorissen, B. L., Blanc, H., den Hertog, D., and Ben-Tal, A. (2014). Deriving robust and globalized robust solutions
of uncertain linear programs with general convex uncertainty sets. Operations Research, 62(3):672–679.

40
[27] Han, S., Tao, M., Topcu, U., Owhadi, H., and Murray, R. M. (2015). Convex optimal uncertainty quantification.
SIAM Journal on Optimization, 25(3):1368–1387.
[28] Hanasusanto, G. A., Roitch, V., Kuhn, D., and Wiesemann, W. (2015). A distributionally robust perspective on
uncertainty quantification and chance constrained programming. Mathematical Programming, 151(1):35–62.
[29] Hanasusanto, G. A., Roitch, V., Kuhn, D., and Wiesemann, W. (2017). Ambiguous joint chance constraints under
mean and dispersion information. Operations Research, 65(3):751–767.
[30] Hettich, R. and Kortanek, K. O. (1993). Semi-infinite programming: Theory, methods, and applications. SIAM
Review, 35(3):380–429.
[31] Isii, K. (1962). On sharpness of Tchebycheff-type inequalities. Annals of the Institute of Statistical Mathematics,
14(1):185–197.
[32] Ji, R. and Lejeune, M. A. (2021). Data-driven optimization of reward-risk ratio measures. INFORMS Journal on
Computing, 33(3):1120–1137.
[33] Karlin, S. and Studden, W. J. (1966). Tchebycheff Systems: With Applications in Analysis and Statistics, volume 15.
John Wiley and Sons, New York.
[34] Kemperman, J. H. (1968). The general moment problem: A geometric approach. The Annals of Mathematical
Statistics, 39(1):93–122.
[35] Lasserre, J. B. (2001). Global optimization with polynomials and the problem of moments. SIAM Journal on
Optimization, 11(3):796–817.
[36] Lavine, M. (1991). Sensitivity in Bayesian statistics: The prior and the likelihood. Journal of the American
Statistical Association, 86(414):396–399.
[37] Liu, Y., Meskarian, R., and Xu, H. (2017). Distributionally robust reward-risk ratio optimization with moment
constraints. SIAM Journal on Optimization, 27(2):957–985.
[38] Mallows, C. (1956). Generalizations of Tchebycheff’s inequalities. Journal of the Royal Statistical Society: Series
B (Methodological), 18(2):139–168.
[39] Mallows, C. L. and Richter, D. (1969). Inequalities of Chebyshev type involving conditional expectations. The
Annals of Mathematical Statistics, 40(6):1922–1932.
[40] Markov, A. (1884). On certain applications of algebraic continued fractions. Ph.D. thesis (in Russian), St Peters-
burg.
[41] Marshall, A. W. and Olkin, I. (1960). Multivariate Chebyshev inequalities. The Annals of Mathematical Statistics,
pages 1001–1014.
[42] Mohajerin Esfahani, P. and Kuhn, D. (2018). Data-driven distributionally robust optimization using the Wasser-
stein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115–166.
[43] MOSEK (2019). MOSEK optimization software. Technical report, version 9.1.9.,
http://docs.mosek.com/9.1/toolbox/index.html.
[44] Nesterov, Y. et al. (1997). Structure of non-negative polynomials and optimization problems. Technical report,
Université Catholique de Louvain, Center for Operations Research.
[45] Nguyen, V. A., Zhang, F., Blanchet, J., Delage, E., and Ye, Y. (2020). Distributionally robust local non-parametric
conditional estimation. Advances in Neural Information Processing Systems, 33:15232–15242.

[46] Nguyen, V. A., Zhang, F., Blanchet, J., Delage, E., and Ye, Y. (2021). Robustifying conditional portfolio decisions
via optimal transport. arXiv preprint arXiv:2103.16451.
[47] Parrilo, P. A. (2000). Structured semidefinite programs and semialgebraic geometry methods in robustness and
optimization. Ph.D. thesis, California Institute of Technology.
[48] Parrilo, P. A. (2003). Semidefinite programming relaxations for semialgebraic problems. Mathematical Program-
ming, 96(2):293–320.
[49] Pólik, I. and Terlaky, T. (2007). A survey of the S-lemma. SIAM Review, 49(3):371–418.
[50] Popescu, I. (2005). A semidefinite programming approach to optimal-moment bounds for convex classes of
distributions. Mathematics of Operations Research, 30(3):632–657.
[51] Popescu, I. (2007). Robust mean-covariance solutions for stochastic optimization. Operations Research, 55(1):98–
112.
[52] Postek, K., Ben-Tal, A., Den Hertog, D., and Melenberg, B. (2018). Robust optimization with ambiguous stochas-
tic constraints under mean and dispersion information. Operations Research, 66(3):814–833.
[53] Rogosinski, W. W. (1958). Moments of non-negative mass. Proceedings of the Royal Society of London. Series A.
Mathematical and Physical Sciences, 245(1240):1–27.
[54] Roos, E., Brekelmans, R., Van Eekelen, W., Den Hertog, D., and Van Leeuwaarden, J. S. (2022). Tight tail
probability bounds for distribution-free decision making. European Journal of Operational Research, 299(3):931–
944.
[55] Scarf, H. E. (1958). A min-max solution of an inventory problem. In Arrow, K. J., Karlin, S., and Scarf, H. E.,
editors, Studies in the Mathematical Theory of Inventory and Production. Stanford University Press, Stanford, CA.
[56] Shapiro, A. (2001). On duality theory of conic linear problems. In Goberna, M. A. and Lopez, M. A., editors,
Semi-Infinite Programming: Recent Advances, pages 135–165. Kluwer Academic Publishers, Amsterdam.
[57] Shapiro, A. (2006). Worst-case distribution analysis of stochastic programs. Mathematical Programming,
107(1):91–96.
[58] Shapiro, A., Dentcheva, D., and Ruszczyński, A. (2009). Lectures on Stochastic Programming: Modeling and
Theory. SIAM, Philadelphia, 3d edition.
[59] Shapiro, A. and Kleywegt, A. (2002). Minimax analysis of stochastic problems. Optimization Methods and
Software, 17(3):523–542.
[60] Smith, J. E. (1995). Generalized Chebychev inequalities: Theory and applications in decision analysis. Operations
Research, 43(5):807–825.
[61] Srivastava, P. R., Wang, Y., Hanasusanto, G. A., and Ho, C. P. (2021). On data-driven prescriptive analytics with
side information: A regularized Nadaraya-Watson approach. arXiv preprint arXiv:2110.04855.
[62] Stieltjes, T.-J. (1894). Recherches sur les fractions continues. Annales de la Faculté des Sciences de Toulouse:
Mathématiques, 8(4):1–122.
[63] Stieltjes, T.-J. (1895). Recherches sur les fractions continues. Annales de la Faculté des Sciences de Toulouse:
Mathématiques, 9(1):5–47.
[64] Vandenberghe, L., Boyd, S., and Comanor, K. (2007). Generalized Chebyshev bounds via semidefinite program-
ming. SIAM Review, 49(1):52–64.

[65] Weisser, T., Legat, B., Coey, C., Kapelevich, L., and Vielma, J. P. (2019). Polynomial and moment optimization
in Julia and JuMP. https://pretalx.com/juliacon2019/talk/QZBKAU/.
[66] Wiesemann, W., Kuhn, D., and Sim, M. (2014). Distributionally robust convex optimization. Operations Research,
62(6):1358–1376.
[67] Yanıkoğlu, İ., Gorissen, B. L., and den Hertog, D. (2019). A survey of adjustable robust optimization. European
Journal of Operational Research, 277(3):799–813.
[68] Žáčková, J. (1966). On minimax solutions of stochastic linear programming problems. Časopis pro Pěstovánı́
Matematiky, 91(4):423–430.

