Practical Guidance For Bayesian Inference in Astronomy
10 February 2023
ABSTRACT
In the last two decades, Bayesian inference has become commonplace in astronomy. At the same time, the choice of algorithms,
terminology, notation, and interpretation of Bayesian inference varies from one sub-field of astronomy to the next, which can
lead to confusion for both those learning and those familiar with Bayesian statistics. Moreover, the choice varies between the
astronomy and statistics literature, too. In this paper, our goal is two-fold: (1) provide a reference that consolidates and clarifies
terminology and notation across disciplines, and (2) outline practical guidance for Bayesian inference in astronomy. Highlighting
both the astronomy and statistics literature, we cover topics such as notation, specification of the likelihood and prior distributions,
inference using the posterior distribution, and posterior predictive checking. It is not our intention to introduce the entire field of
Bayesian data analysis – rather, we present a series of useful practices for astronomers who already have an understanding of the
Bayesian "nuts and bolts" and wish to increase their expertise and extend their knowledge. Moreover, as the field of astrostatistics
and astroinformatics continues to grow, we hope this paper will serve as both a helpful reference and as a jumping off point for
deeper dives into the statistics and astrostatistics literature.
Key words: astrostatistics – computational methods – parallax
1 INTRODUCTION
Over the past two decades, Bayesian inference has become increasingly popular in astronomy. On NASA’s Astrophysics Data System (ADS), a search using “keyword:statistical” and “abs:bayesian” yields 2377 refereed papers, and shows exponential growth since the year 2000, with over 237 papers in 2021.
Bayesian analyses have become popular in astronomy due to several key advantages over traditional methods. First, an estimate of the posterior distribution of model parameters provides a more complete picture of parameter uncertainty, joint parameter uncertainty, and parameter relationships given the model, data, and prior assumptions than traditional methods. Second, the interpretation of Bayesian probability intervals is often closer to what scientists desire, and is an appealing alternative to point estimates with confidence intervals which often rely on the sampling distribution of the estimator. Third, Bayesian analysis easily allows for marginalization over nuisance parameters, incorporation of measurement uncertainties through measurement error models, and inclusion of incomplete data such as missing and censored data. Fourth, astronomers often have prior knowledge about allowable and realistic ranges of parameter values (e.g., through physical theories and previous observations/experiments) which can naturally be included in prior distributions and thereby improve the final inference.
Importantly, in addition to the aforementioned advantages of the Bayesian approach, efficient and increased computing power, along with easy-to-use or out-of-the-box algorithms, have brought Bayesian methodology to astronomers in convenient practical forms (e.g., emcee (Foreman-Mackey et al. 2013), Rstan (Stan Development Team 2020), PyStan (Riddell et al. 2017), PyMC3 (Salvatier et al. 2016), BUGS (Lunn et al. 2000), NIMBLE (de Valpine et al. 2017), and JAGS (Plummer et al. 2003)).
Interestingly, the surge in popularity of Bayesian statistics comes in spite of the fact that Bayesian methods are rarely taught in undergraduate astronomy and physics programs, and have only recently been introduced at a basic level in astronomy graduate courses (Eadie et al. 2019b,a). Some challenges faced by both new and seasoned users of Bayesian inference are the varied notation, terminology, interpretation, and choice of algorithms available in the astronomy and statistics literature.
★ E-mail: gwen.eadie@utoronto.ca
Table 1. Bayesian models for inferring (1) a star’s distance parameter 𝑑 from its parallax measurement 𝑦, assuming a Gaussian distribution for its true parallax 𝜛 (top panel), and (2) a star cluster’s distance parameter 𝑑cluster from the parallax measurement of 𝑛 stars 𝒚 (bottom panel).
Top panel, posterior on distance:
$$p(d \mid y, \sigma_\varpi) \propto d^2 \exp\left[-\frac{d}{L} - \frac{(y - 1/d)^2}{2\sigma^2}\right]$$
Bottom panel, sampling density / likelihood (𝑛 parallax measurements):
$$p(\boldsymbol{y} \mid d_{\rm cluster}) = \prod_{i=1}^{n} N(y_i \mid 1/d_{\rm cluster}, \sigma_i^2)$$
Note: In both the upper and lower box, 𝜎 and 𝜎𝑖 : 𝑖 = 1, 2, . . . , 𝑛 are assumed to be known. Throughout, 𝑁(𝑦 | 𝑎, 𝑏) denotes the Gaussian density function of 𝑦 with mean 𝑎 and variance 𝑏.
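As a rough illustration of how the entries of Table 1 translate into code, the sketch below evaluates the Gaussian log-likelihood for a single parallax, the unnormalized log-posterior on distance from the top panel, and the joint log-likelihood for 𝑛 cluster stars from the bottom panel. The function names, the length scale `L_scale`, the grid, and all numerical values are assumptions made for illustration (parallax in mas, distance in kpc); none of them are taken from the paper.

```python
import numpy as np

def log_likelihood(y, sigma, d):
    """Log of N(y | 1/d, sigma^2): Gaussian sampling density of a parallax."""
    resid = y - 1.0 / d
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * (resid / sigma) ** 2

def log_posterior_distance(d, y, sigma, L):
    """Unnormalized log-posterior: log[d^2 exp(-d/L - (y - 1/d)^2 / (2 sigma^2))]."""
    d = np.atleast_1d(np.asarray(d, dtype=float))
    out = np.full(d.shape, -np.inf)      # zero posterior density for d <= 0
    pos = d > 0
    out[pos] = (2.0 * np.log(d[pos]) - d[pos] / L
                - 0.5 * ((y - 1.0 / d[pos]) / sigma) ** 2)
    return out

def cluster_log_likelihood(y, sigma, d_cluster):
    """Sum over stars of log N(y_i | 1/d_cluster, sigma_i^2)."""
    y = np.asarray(y, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return np.sum(log_likelihood(y, sigma, d_cluster))

# Hypothetical inputs: one measured parallax, its uncertainty, and a prior scale
y_obs, sigma_obs, L_scale = 1.2, 0.3, 1.0
grid = np.linspace(0.05, 10.0, 2000)     # distance grid (kpc), an arbitrary choice
logp = log_posterior_distance(grid, y_obs, sigma_obs, L_scale)
d_map = grid[np.argmax(logp)]            # grid approximation to the posterior mode
```

Working with log-densities avoids numerical underflow when many stars are multiplied together in the cluster likelihood.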
Carlin & Louis 2008; Berger & Wolpert 1988; Casella & Berger 2002). However, in astronomy, it is not unusual to see phrases such as “the likelihood function of the data given the model parameters”, which might be misconstrued as treating the likelihood function as a function of data. Finally, we note that the likelihood function of 𝜽 is not a probability density function of 𝜽.
In Table 2, we summarize common notation for the likelihood found in both the statistics and astronomy literature, which ranges from being very explicit (e.g., 𝑓𝑌(𝒚|𝜽)) to quite simplified (e.g., 𝑃(𝒚|𝑀)). We note that a subtlety sometimes missed in astronomy is the difference between 𝑝 and 𝑃. In statistics, 𝑓 and 𝑝 are often used for probability density functions (pdf) of continuous random variables, and 𝑃 or 𝑃𝑟 are used to denote probabilities of discrete events (probability mass functions (pmf) are an exception, and are often denoted using 𝑓, 𝑝, or 𝑃). A capital 𝐹 is usually reserved for the cumulative distribution function (cdf).
Table 2 (surviving row): $\mathcal{L}(\boldsymbol{\theta}; \boldsymbol{y})$ or $\mathcal{L}(\boldsymbol{\theta})$ | explicit notation for the likelihood with argument 𝜽 | specific statistics topics, e.g., maximum likelihood estimation
Determining what the likelihood function should be in a given astronomy problem can be challenging, and care must be taken to choose an appropriate sampling distribution. The likelihood is often taken to be a product of independent and identically distributed (i.i.d.) Gaussian random variables with known variance. While this choice is sometimes plausible, there are also many cases in which it is inappropriate, and has a material effect on inference. For instance, when describing the brightness of a high energy source, a discrete distribution such as the Poisson distribution is usually more appropriate than a Gaussian. In other cases, uncertainty in the variance of the data might lead us to use a likelihood function based on the 𝑡-distribution or another non-Gaussian parametric family. A further consideration is whether the data being modeled are collected as a function of space or time, in which case the assumption of exchangeability (that data can be reordered without affecting the likelihood) is generally unwarranted. In these cases, an expanded model that includes correlation among observations should be considered.
Best practice includes all non-negligible contributors to the measurement process in the likelihood function. For example, it is important to account for substantial truncation and censoring issues when present, because these can strongly influence parameter inference in some cases (Rubin 1976; Eadie et al. 2021). Other common issues to check for and address are measurement uncertainty, correlated errors, measurement bias, sampling bias, and missing data. There are a number of valuable references in the statistics literature on these topics (Rubin 1976; Little & Rubin 2019). We recommend writing down enough mathematical details to uniquely determine the likelihood by defining (algebraically) not only the physical process of interest but also the sampling/measurement process that generated the data.
2.2.1 Parallax Example: the likelihood function
The Gaia spacecraft has measured parallaxes for over a billion stars (Gaia et al. 2018; Collaboration et al. 2018). These parallaxes have been shown empirically, through simulations (Holl et al. 2012; Lindegren et al. 2012), to follow a normal distribution with mean equal to the true underlying parallax 𝜛 so that
$$p(y \mid \varpi) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - \varpi)^2}{2\sigma^2}\right] \tag{5}$$
or equivalently,
$$y \mid \varpi, \sigma \sim N(\varpi, \sigma^2) \tag{6}$$
where 𝑦 is the measured parallax and 𝜎 is the associated (assumed known) measurement uncertainty (Hogg 2018). The parameter of interest is the distance 𝑑, so we rewrite Equation 6 as
$$p(y \mid d) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - 1/d)^2}{2\sigma^2}\right] \tag{7}$$
or equivalently,
$$y \mid d, \sigma \sim N(1/d, \sigma^2). \tag{8}$$
We note that a similar Gaussian model assumption is widely applicable to various sub-fields in observational astronomy such as detecting exoplanets by radial velocity (RV) (Danby 1988; Mayor et al. 2011; Pepe et al. 2011; Fischer et al. 2013; Butler et al. 2017) or by transit (Konacki et al. 2003; Alonso et al. 2004; Dragomir et al. 2019), inferring the true brightness of a source (Tak et al. 2017), or estimating the Hubble constant (Hubble 1929). This is because the statistical details are analogous: the observation is measured with Gaussian measurement error, the estimated measurement uncertainty 𝜎 is treated as a known constant, and the mean model can be written as a deterministic function of other parameters, e.g., 𝜛 = 1/𝑑 in Equation 6. On the other hand, as mentioned already, in each new setting it is important to carefully consider which model is most appropriate;
Gaussian-based models are (i) sometimes misused, (ii) overused, or (iii) sometimes inappropriate.
2.3 Prior Distributions
The prior probability distribution, or the prior, captures our initial knowledge about the model parameters before we have seen the data. Priors may assign higher probability (or density) to some values of the model parameters over others. Priors are often categorized into two classes: informative and non-informative. The former type summarizes knowledge gained from previous studies, theoretical predictions, and/or scientific intuition. The latter type attempts to include as little information as possible about the model parameters. Informative priors can be conjugate (Diaconis & Ylvisaker 1979), mixtures of conjugate priors (Dalal & Hall 1983), scientifically motivated (Tak et al. 2018; Lemoine 2019), based on previous data, or in the case of empirical Bayes, based on the data at hand (often called data-driven priors) (Carlin & Louis 2000; Maritz 2018). Non-informative priors can be improper or “flat”, weakly-informative, Jeffreys priors (Tuyl et al. 2008), or other reference distributions. Conjugate priors are sometimes defined to be non-informative.
One popular choice for a non-informative prior is an improper prior, that is, a prior that is not a probability distribution and in particular does not integrate to one. Good introductions to improper priors are available in the statistics literature (Gelman et al. 2013, 2017). An example of an improper prior is a flat prior on an unbounded range, e.g., Unif(0, ∞) or Unif(−∞, ∞). When an improper prior has been adopted, it is imperative to check whether the resulting posterior is a proper probability distribution before making any inference. Without posterior propriety the analysis has no probability interpretation. Empirical checks may not be sufficient; posterior samples may not reveal any evidence of posterior impropriety, forming a seemingly reasonable distribution even when the posterior is actually improper (Hobert & Casella 1996; Tak et al. 2018).
Research on quantifying prior impact is active (e.g., effective prior sample size Clarke 1996; Reimherr et al. 2014; Jones et al. 2020) as is the discussion on choosing a prior in the context of the likelihood (Reimherr et al. 2014; Gelman et al. 2017; Jones et al. 2020).
In astronomy, there is a tendency for scientists to adopt non-informative prior distributions, perhaps because informative priors are perceived as too subjective or because there is a lack of easily quantifiable information about the parameters in question. However, all priors provide some information about the likely values of the model parameter(s), even a “flat” prior. Notably, a flat prior is non-flat after a transformation. For instance, in our example (Section 2.3.1) a “non-informative” uniform prior distribution on the parallax of a star is actually quite informative in terms of distance (third panel, Figure 1). Thus, we recommend carefully considering what direct or indirect information is available about the value of a parameter before resorting to default or non-informative priors; in astronomy, we usually have at least a little information about the range of allowed or physically reasonable values. Even when a non-informative prior does seem appropriate, checking that the chosen distribution is consistent with known physical constraints is essential.
Complete descriptions and mathematical forms of prior distributions, including the values of hyperparameters defining these distributions, help promote reproducibility and open science. Unfortunately, a recent meta-analysis of the astronomical literature showed that prior definitions are often incomplete or unstated (Tak et al. 2018), making it difficult for others to interpret results.
To summarize, good practices in the context of priors are: (1) choosing informative priors when existing knowledge is available, (2) choosing priors with caution if there is no prior knowledge, (3) testing the influence of alternative priors (see the discussion of sensitivity analyses in Section 2.4), and (4) explicitly specifying the chosen prior distributions for clarity and reproducibility.
2.3.1 Parallax Example: choosing a prior
A naive choice of prior on the true parallax 𝜛 is 𝑝(𝜛) ∝ constant, an improper prior that assigns equal density to all values of 𝜛 in (0, +∞). A straightforward way to be more informative and proper is to instead define a truncated uniform prior, where 𝜛 is uniformly distributed on the interval (𝜛min, 𝜛max) so that
$$p(\varpi) \propto \begin{cases} \text{constant} & \varpi_{\min} < \varpi < \varpi_{\max} \\ 0 & \text{otherwise} \end{cases}, \tag{9}$$
or equivalently,
$$\varpi \sim \mathrm{Unif}(\varpi_{\min}, \varpi_{\max}) \tag{10}$$
Here, 𝜛min and 𝜛max are hyperparameters set by the scientist (e.g., using some physically-motivated cutoff for 𝜛min and the minimum realistic distance to the star for 𝜛max). Thus the prior in Equation 9 can be regarded as weakly-informative because some physical knowledge is reflected in the bounds. Similarly, we could instead define a
[Figure 1: four-panel schematic; surviving panel labels: Gaussian, Non-Gaussian, Uniform (distance), Physically-motivated]
Figure 1. Bayesian inference of the distance 𝑑 = 1/𝜛 to a star based on the measured parallax 𝑦. Far left: The likelihood for parallax 𝜛 is normal with variance
assumed known. Center left: A transformation of parameters from 𝜛 to 𝑑 gives a non-normal PDF. Note that a non-negativity constraint was applied to the
distribution of 𝑑. Center right: We highlight three possible priors 𝑝 (𝑑) over the distance: uniform in parallax 𝜛 = 1/𝑑 (blue), uniform in distance 𝑑 (red), and
a physically-motivated prior (Bailer-Jones et al. 2018) (orange). Far right: The posteriors that correspond to each of the three priors.
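The point that a prior uniform in parallax is far from uniform in distance can be checked by simulation. In the sketch below, parallaxes are drawn from Unif(𝜛min, 𝜛max) and transformed to 𝑑 = 1/𝜛; by the change-of-variables rule the implied density is 𝑝(𝑑) = 𝑝(𝜛)|d𝜛/d𝑑| ∝ 1/𝑑² on (1/𝜛max, 1/𝜛min). The bounds `plx_min` and `plx_max`, the sample size, and the check distances are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical prior bounds on the parallax (mas)
plx_min, plx_max = 0.1, 10.0

# Draw from the "flat" prior on parallax and transform to distance (kpc)
plx = rng.uniform(plx_min, plx_max, size=200_000)
dist = 1.0 / plx

def implied_distance_prior(d):
    """Density of d = 1/plx when plx ~ Unif(plx_min, plx_max); proportional to 1/d^2."""
    d = np.asarray(d, dtype=float)
    inside = (d > 1.0 / plx_max) & (d < 1.0 / plx_min)
    return np.where(inside, 1.0 / ((plx_max - plx_min) * d**2), 0.0)

# Compare the analytic CDF of d with the empirical one at a few distances
d_check = np.array([0.2, 0.5, 1.0, 5.0])
cdf_analytic = (plx_max - 1.0 / d_check) / (plx_max - plx_min)
cdf_empirical = np.array([(dist <= d).mean() for d in d_check])

# Analytically, (plx_max - 1) / (plx_max - plx_min), about 0.91, of the prior
# mass lies within 1 kpc even though the allowed range extends to 10 kpc.
frac_within_1kpc = (dist < 1.0).mean()
```

With these assumed bounds, most of the prior mass sits at small distances, which is the sense in which the "flat" parallax prior is informative about 𝑑.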
[Figure 2: x-axis Distance [kpc]; y-axis Probability [normalized]; annotations: Mode, Median]
Figure 2. The three posterior distributions corresponding to each prior distribution shown in Figure 1: uniform in parallax prior (blue curve), uniform in distance
prior (red curve), and physically-motivated prior (orange curve). Also shown is one summary statistic for each posterior: the mode (blue dotted line), the median
(red dotted-dashed line), and the 90% credible interval (orange dashed lines and shaded region).
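Summaries like those marked in Figure 2 can be approximated on a grid. The sketch below evaluates three unnormalized posteriors on distance, using a prior uniform in parallax (𝑝(𝑑) ∝ 1/𝑑²), a prior uniform in distance, and a prior of the form 𝑑² exp(−𝑑/𝐿) as in Table 1, then normalizes each numerically and reports the mode, median, and an equal-tailed 90% credible interval. The measurement `y`, uncertainty `sigma`, length scale `L`, and bounds `d_min`, `d_max` are assumed values, not those used to make the figure, and the equal-tailed interval is only one of several possible interval definitions.

```python
import numpy as np

# Hypothetical measurement: parallax y (mas) with known uncertainty sigma (mas)
y, sigma = 0.5, 0.2
L = 1.5                      # assumed length scale (kpc) of the physically motivated prior
d_min, d_max = 0.01, 20.0    # assumed truncation bounds on distance (kpc)

d = np.linspace(d_min, d_max, 20_000)
dd = d[1] - d[0]
log_like = -0.5 * ((y - 1.0 / d) / sigma) ** 2

log_priors = {
    "uniform in parallax": -2.0 * np.log(d),
    "uniform in distance": np.zeros_like(d),
    "physically motivated": 2.0 * np.log(d) - d / L,
}

def summarize(log_post):
    """Mode, median and equal-tailed 90% interval from a gridded log-posterior."""
    post = np.exp(log_post - log_post.max())   # subtract the max for stability
    post /= post.sum() * dd                    # normalize to unit area on the grid
    cdf = np.cumsum(post) * dd
    lo, med, hi = np.interp([0.05, 0.5, 0.95], cdf, d)
    return {"mode": d[np.argmax(post)], "median": med, "ci90": (lo, hi)}

summaries = {name: summarize(log_like + lp) for name, lp in log_priors.items()}
for name, s in summaries.items():
    print(f"{name:22s} mode={s['mode']:.2f}  median={s['median']:.2f}  "
          f"90% CI=({s['ci90'][0]:.2f}, {s['ci90'][1]:.2f})")
```

Because the uniform-in-distance posterior has a heavy tail toward large 𝑑, its median and interval depend noticeably on the assumed `d_max`, which is one reason to report the prior bounds explicitly.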
From Bayes’ theorem, the posterior is
$$p(d \mid y) \propto p(y \mid d)\, p(d). \tag{14}$$
For the three priors discussed previously, this corresponds to the following posteriors:
$$\text{Equation 9} \Rightarrow p(d \mid y) \propto \frac{1}{d^2} \exp\left[-\frac{(y - 1/d)^2}{2\sigma^2}\right] \tag{15}$$
$$\text{Equation 11} \Rightarrow p(d \mid y) \propto \exp\left[-\frac{(y - 1/d)^2}{2\sigma^2}\right] \tag{16}$$
$$\text{Equation 13} \Rightarrow p(d \mid y) \propto d^2 \exp\left[-\frac{d}{L} - \frac{(y - 1/d)^2}{2\sigma^2}\right], \tag{17}$$
for 𝑑min < 𝑑 < 𝑑max (and 0 otherwise). While none of these have analytic solutions for point estimates or credible intervals, they can be computed numerically. Approximations to these three posterior distributions are shown in Figures 1 and 2. In this illustration, though each resulting posterior distribution is right-skewed, the shape is notably different for each considered prior distribution.
2.4.2 Extended Example: inferring the distance to a cluster of stars
We now extend our example to infer the distance to a cluster of stars, based on the collection of parallax measurements of each individual star. Assuming that there are 𝑛 stars located at approximately the same distance 𝑑cluster and that the measured parallaxes
[Figure 3: axis labels Parallax [mas] and Number of Objects; legend entries: Measured, Individual Measurements, Posterior]
Figure 3. Extended Example for Open Cluster M67. An extension of the example shown in Figure 1 illustrating how to infer the distance to an open cluster
(M67) based on parallax measurements of many stars. Top: Parallax measurements (gray) for likely cluster members (based on proper motions), sorted by their
observed signal-to-noise ratio 𝜛obs/𝜎𝜛. Bottom: The joint likelihood (gray) and posterior (blue) for the cluster parallax 𝜛cluster = 1/𝑑cluster as more and more
stars are added to our analysis. The (Gaussian) prior distribution on the cluster’s parallax is illustrated in the narrow left panel. When there is only a small number
of stars, the location of the prior has a substantial impact on the posterior. However, as more stars are added, the information from the data dominates.
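The behavior described in this caption, where the prior matters for a handful of stars and is overwhelmed as more are added, can be reproduced with simulated data. Because the likelihood is Gaussian in the cluster parallax 𝜛cluster with known 𝜎𝑖, and the prior on 𝜛cluster is also Gaussian, the posterior is Gaussian with a precision-weighted mean. In the sketch below the true parallax, the prior mean and width, the uncertainties, and the function name are all invented for illustration; they are not the M67 values or the authors' code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated stand-in for cluster members (all values hypothetical, in mas)
plx_true = 1.15
n_stars = 200
sigma_i = rng.uniform(0.05, 0.5, size=n_stars)   # known per-star uncertainties
y = rng.normal(plx_true, sigma_i)                 # simulated measured parallaxes

# Gaussian prior on the cluster parallax, deliberately offset from the truth
prior_mean, prior_sd = 2.0, 0.3

def gaussian_posterior(y, sigma, prior_mean, prior_sd):
    """Conjugate update for a common mean with known, unequal variances."""
    prec = 1.0 / prior_sd**2 + np.sum(1.0 / sigma**2)
    mean = (prior_mean / prior_sd**2 + np.sum(y / sigma**2)) / prec
    return mean, np.sqrt(1.0 / prec)

for n in (1, 5, 20, 200):
    m, s = gaussian_posterior(y[:n], sigma_i[:n], prior_mean, prior_sd)
    # 1/m is a plug-in distance, not the posterior mean of the distance itself
    print(f"n={n:3d}  posterior parallax = {m:.3f} +/- {s:.3f} mas  "
          f"(1/mean ~ {1.0 / m:.3f} kpc)")
```

The posterior is written in terms of parallax because that is where the model is conjugate; a posterior on 𝑑cluster would follow by transforming draws of the parallax.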
performed. The latter is typically more relevant in science and more closely coincides with sensitivity analyses.
Good practices outlined in this section can be summarized as (1) using multiple ways to summarize the posterior inference, (2) quantitatively and graphically checking the posterior distribution (e.g., using posterior predictive checks, Q-Q plots), and (3) providing evidence that diagnostic checks were completed.
2.5.1 Extended Example: posterior predictive checks
We investigate the validity of our model for the distance to a cluster of stars by computing the posterior predictive distribution for the observed stellar parallaxes. While in this case the posterior predictive can be written in closed form (since it is a Gaussian distribution), we also approximate it by simulating values of 𝑑cluster from the posterior and then subsequently simulating values for the predicted parallax measurements 𝜛pred,𝑖 given 𝑑cluster.
In Figure 4, we compare both the distribution and quantiles estimated for the simulated dataset and the observed dataset via a density and Q-Q plot respectively. While there are differences, especially in the tails of the distribution, overall the cluster model reproduces most of the observed properties of the data. It would be worth investigating whether these differences persist under different models, for example, a model in which the distance to each star is not assumed to be identical, or a model in which measurement uncertainty is not assumed to be known exactly.
2.6 Conclusion
We hope that this article has identified, clarified, and illuminated fundamental Bayesian inference notation and techniques from the statistics literature, and in particular, has made a case for fully specifying the model, posterior predictive checking, and the use of underused aids such as the Q-Q plot. In summary, we highlight sound practices for conducting Bayesian inference in astronomy as follows:
• Be explicit about notation, and use appropriate terminology for the interpretation of concepts such as the likelihood and credible intervals, which will help interdisciplinary collaboration and reproducibility.
• Describe the likelihood as a function of the parameters, given the data.
• Use informative priors whenever possible and justified. Carefully consider what direct or indirect information is available about the parameters.
• Use non-informative priors carefully, and assess their properties under parameter transformations.
• Test the sensitivity of the posterior distribution to different prior distributions.
• Fully specify the Bayesian model in terms of the likelihood, prior, and posterior, and provide open-source code whenever possible.
• Perform posterior predictive checks of the model, using visualizations such as Q-Q plots where appropriate.
• Strive to include all non-negligible contributors to the measurement process.
We hope that there is a continued growth of interdisciplinary collaborations between astronomers and statisticians in the future. Data
from cutting-edge telescopes such as the Vera Rubin Observatory, the James Webb Space Telescope, and many others, have the potential to drive the field of astronomy, but this new information is best understood in the context of existing knowledge and careful statistical inference. Bayesian inference provides a framework in which this type of analysis and discovery can occur. Areas of astronomy where prior information and non-Gaussian based likelihoods are common can especially benefit from Bayesian methods, for example X-ray and gamma-ray astronomy.
Bayesian inference is a broad topic, and many subtopics were not covered in this article. Ultimately, we hope that this article not only serves as a useful resource, but will also be the inception for a series of more specific papers on Bayesian methods and techniques in astronomy and physics.
[Figure 4: axis label Predicted Parallax [mas]; percentile markers 1%, 5%, 20%, 50%, 80%, 95%, 99%]
Figure 4. Quantile-Quantile (Q-Q) Plot. This figure demonstrates a way to perform posterior predictive checking for the model shown in Figure 3. Top: The distribution of parallax measurements from the data (gray) and simulated values from the posterior predictive (light blue). The posterior mean is indicated using the dashed dark blue line. The distributions appear relatively consistent with each other by eye, but a quantile-quantile (Q-Q) plot is more informative and suggests otherwise. Bottom: The Q-Q plot of the quantiles from the posterior predictive simulated parallax data (𝑥-axis) and of the observed parallaxes (𝑦-axis). If the real and simulated data followed the same distribution, then the quantiles would lie on the one-to-one line. However, strong discrepancies are apparent below ∼ 10th percentile and above ∼ 70th percentile.
REFERENCES
Akeret J., Refregier A., Amara A., Seehars S., Hasner C., 2015, Journal of Cosmology and Astroparticle Physics, 2015, 043
Alonso R., et al., 2004, The Astrophysical Journal Letters, 613, L153
Astraatmadja T. L., Bailer-Jones C. A. L., 2016a, ApJ, 832, 137
Astraatmadja T. L., Bailer-Jones C. A. L., 2016b, ApJ, 833, 119
Bailer-Jones C. A. L., 2015, PASP, 127, 994
Bailer-Jones C. A. L., Rybizki J., Fouesneau M., Mantelet G., Andrae R., 2018, AJ, 156, 58
Beaumont M. A., 2019, Annual Review of Statistics and Its Application, 6, 379
Beaumont M. A., Cornuet J.-M., Marin J.-M., Robert C. P., 2009, Biometrika, 96, 983
Berger J. O., Wolpert R. L., 1988, The likelihood principle. IMS Lecture Notes-Monograph Series Vol. 6, Institute of Mathematical Statistics
Blei D. M., Ng A. Y., Jordan M. I., 2003, Journal of Machine Learning Research, 3, 993
Brooks S., Gelman A., Jones G., Meng X.-L., 2011, Handbook of Markov chain Monte Carlo. CRC Press
Butler R. P., et al., 2017, The Astronomical Journal, 153, 208
Carlin B. P., Louis T. A., 2000, Bayes and empirical Bayes methods for data analysis. Texts in Statistical Science Vol. 88, Chapman & Hall/CRC Boca Raton
Carlin B. P., Louis T. A., 2008, Bayesian methods for data analysis. CRC Press
Casella G., Berger R. L., 2002, Statistical inference, second edn. Duxbury Pacific Grove, CA
Clarke B., 1996, Journal of the American Statistical Association, 91, 173
Collaboration G., et al., 2018, yCat, pp I–345
Craiu R. V., Rosenthal J. S., 2014, Annual Review of Statistics and Its Application, 1, 179
Dalal S., Hall W., 1983, Journal of the Royal Statistical Society: Series B (Methodological), 45, 278
Danby J., 1988, Willmann-Bell, 1988. 2nd ed., rev. & enl.
Diaconis P., Ylvisaker D., 1979, The Annals of Statistics, pp 269–281
Dragomir D., et al., 2019, The Astrophysical Journal Letters, 875, L7
Eadie G., et al., 2019a, in Bulletin of the American Astronomical Society. p. 233 (arXiv:1909.11714)
Eadie G., et al., 2019b, in Canadian Long Range Plan for Astronomy and Astrophysics White Papers. p. 10 (arXiv:1910.08857), doi:10.5281/zenodo.3756019
Eadie G. M., Webb J. J., Rosenthal J. S., 2021, arXiv e-prints, p. arXiv:2108.13491
Fischer D. A., Marcy G. W., Spronck J. F., 2013, The Astrophysical Journal Supplement Series, 210, 5
Foreman-Mackey D., Hogg D. W., Lang D., Goodman J., 2013, Publications of the Astronomical Society of the Pacific, 125, 306
Gabry J., Mahr T., 2019, bayesplot: Plotting for Bayesian Models, https://mc-stan.org/bayesplot