Data Analysis in LifeCourseEpi - Article
© The Author(s) 2021. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
https://doi.org/10.1093/aje/kwab087
Advance Access publication: March 29, 2021
Practice of Epidemiology
Initially submitted July 2, 2020; accepted for publication March 23, 2021.
Life-course epidemiology is useful for describing and analyzing complex etiological mechanisms for disease
development, but existing statistical methods are essentially confirmatory, because they rely on a priori model
specification. This limits the scope of causal inquiries that can be made, because these methods are suited
mostly to examine well-known hypotheses that do not question our established view of health, which could lead
to confirmation bias. We propose an exploratory alternative. Instead of specifying a life-course model prior to
data analysis, our method infers the life-course model directly from the data. Our proposed method extends the
well-known Peter-Clark (PC) algorithm (named after its authors) for causal discovery, and it facilitates including
temporal information for inferring a model from observational data. The extended algorithm is called temporal
PC. The obtained life-course model can afterward be perused for interesting causal hypotheses. Our method
complements classical confirmatory methods and guides researchers in expanding their models in new directions.
We showcase the method using a data set encompassing almost 3,000 Danish men followed from birth until age
65 years. Using this data set, we inferred life-course models for the role of socioeconomic and health-related
factors on development of depression.
Abbreviations: CPDAG, completed partially directed acyclic graph; DAG, directed acyclic graph; PC, Peter-Clark; TPC, temporal
Peter-Clark; TPDAG, temporal partially directed acyclic graph.
Life-course epidemiology facilitates modeling risk factors as they develop and aggregate over the life course (1). Such a perspective is both useful and necessary in order to understand the etiology of complex chronic diseases such as cardiovascular disease (2), diabetes (3), and mental disorders (4–6). However, it is not obvious exactly how the theoretical life-course framework should be operationalized into study designs facilitating empirical life-course analysis (7).

Currently, empirical life-course studies either 1) rely on traditional statistical exposure-outcome models (for example regression models) that only address a single chosen outcome at a time, or 2) use joint models that try to describe development over the entire life course at once, for example by use of structural equation models or path analysis (8, 9). In the former case, the life-course perspective is not really utilized, except for interpreting the results, and the models are therefore essentially simplistic risk-factor analyses that give little insight into life-course disease development (10). In the latter case, the models are more consistent with the life-course perspective, but model building largely relies on elaborate theories of cause and effect (11). Such models need to describe both temporal and cross-sectional relationships among the variables, and thus require extensive prior knowledge.

The reliance on a priori model specification has several shortcomings. First, it limits the scope of topics that can be studied using life-course analysis, given that a comprehensive body of prior knowledge is needed. Second, even when studying supposedly well-known phenomena, the confirmatory nature of the methodology poses a risk of reproducing existing biases and limits the ability to uncover new etiological mechanisms. To a large extent, traditional life-course analysis methods only facilitate quantification of mechanisms that we already consider well-established. An exception is approaches that perform model selection,
Am J Epidemiol. 2021;190(9):1898–1907
1900 Petersen et al.
for each pair of nodes $(X_i, X_j)$, the algorithm searches for so-called separating sets $S$ such that $X_i \perp\!\!\!\perp X_j \mid S$. If such an $S$ exists, the edge between $X_i$ and $X_j$ is removed. The algorithm terminates when the smallest possible separating set, or no separating set, is found for each pair of variables $(X_i, X_j)$. Afterward, a complete set of orientation rules can be applied to obtain a CPDAG (step 3). These rules rely primarily on the fact that selection variables create special independence structures that make it possible to recover some v-structures, as well as the assumption of acyclicity. Web Appendix 1 and Web Figures 1–2 (available at https://doi.org/10.1093/aje/kwab087) provide an example that shows how the PC

(untestable) assumptions. Assumption S1 relates to the probability distribution of the data, and therefore it is a statistical assumption. Assumptions C1–C3 are, on the other hand, causal; they refer to the data-generating mechanism, which we cannot observe directly.

TEMPORAL PC FOR LIFE-COURSE DISCOVERY

We propose an extension of the PC algorithm that accommodates life-course data where the same individuals are followed over time, and where variables have a known partial temporal ordering into time periods. As examples, we
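The skeleton search described on this page (removing the edge between two nodes once a separating set is found, starting from the smallest candidate sets) can be illustrated with a minimal Python sketch. It is not the authors' implementation and not the causalDisco API. The function `pc_skeleton`, the oracle `chain_oracle`, and the three-variable chain X1 → X2 → X3 are hypothetical names and toy inputs chosen for illustration. In practice the oracle would be replaced by a statistical conditional-independence test.

```python
from itertools import combinations


def pc_skeleton(nodes, indep):
    """Skeleton phase of a PC-style search (illustrative sketch).

    nodes: list of variable names.
    indep: callable (x, y, S) -> bool that returns True when x and y are
           judged conditionally independent given the set S. Here it is a
           hypothetical oracle; with real data it would be a statistical test.
    Returns the adjacency sets and the separating sets that were found.
    """
    # Start from the complete undirected graph.
    adj = {v: set(nodes) - {v} for v in nodes}
    sepset = {}
    size = 0
    # Increase the conditioning-set size until no node has enough neighbours.
    while any(len(adj[x] - {y}) >= size for x in nodes for y in adj[x]):
        for x in nodes:
            for y in sorted(adj[x]):  # sorted() copies, so adj[x] may shrink
                # Candidate separating sets are drawn from x's other neighbours.
                for S in combinations(sorted(adj[x] - {y}), size):
                    if indep(x, y, set(S)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(S)
                        break
        size += 1
    return adj, sepset


def chain_oracle(x, y, S):
    """Toy oracle for the chain X1 -> X2 -> X3: only X1 and X3 separate, given X2."""
    return {x, y} == {"X1", "X3"} and "X2" in S


nodes = ["X1", "X2", "X3"]
adj, sepset = pc_skeleton(nodes, chain_oracle)
print(adj)
print(sepset)
```

Because candidate sets are tried in order of increasing size, the first separating set found for a pair is also a smallest one, which matches the termination rule stated in the text.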
Data-Driven Life-Course Epidemiology 1901
Figure 2. Temporal partially directed acyclic graph resulting from using the temporal Peter-Clark (TPC) algorithm in a simulated data example. The placement of nodes into columns represents their periods such that X1 and X2 were measured in childhood, X3 and X4 in youth, and X5 and X6 in old age.

defined as follows:

$$
f_k(Z_k) =
\begin{cases}
\alpha \cdot Z_k, & \text{if } Z_k \text{ is binary} \\
s(Z_k), & \text{if } Z_k \text{ is numeric,}
\end{cases}
$$

where $s$ is a cubic spline. Due to lack of symmetry, we need to consider models in both directions. We then test the hypotheses

$$
H_{0a}: M_{0a} = M_{1a} \quad \text{and} \quad H_{0b}: M_{0b} = M_{1b}
$$
Data
Results
ψ decreases. The retention thus measures the percentage of edges that are not newly introduced in the TPDAG between 2 consecutive sparsity levels. We see that 96.88% of the edges present in the TPDAG with sparsity $10^{-6}$ are retained from the previous sparsity level. For all other pairs of graphs, the retention rate is 100%, which means that no new edges are added when ψ is reduced by a factor 10. Thus, when ψ is reduced, not many new edges are introduced in the TPDAGs. This implies that conclusions from small values of ψ are generally retained for larger values of ψ.
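The retention rate is computed as (d_total − d_new)/d_total, where d_total is the number of edges in a graph and d_new is the number of those edges that are absent from the graph at the previous sparsity level. The following Python sketch shows the arithmetic on made-up edge sets. The function name `retention_rate`, the helper `undirected`, and the toy edges are hypothetical, and the sketch assumes that edges are compared as unordered pairs, ignoring orientation, which the text does not specify.

```python
def retention_rate(edges_row, edges_row_above):
    """Share of edges in one graph that already existed in the previous graph.

    Implements (d_total - d_new) / d_total, where d_total counts the edges in
    `edges_row` and d_new counts those not present in `edges_row_above`.
    """
    d_total = len(edges_row)
    if d_total == 0:
        raise ValueError("retention is undefined for a graph with no edges")
    d_new = len(edges_row - edges_row_above)
    return (d_total - d_new) / d_total


def undirected(*pairs):
    """Represent edges as unordered pairs (assumption: orientation is ignored)."""
    return {frozenset(p) for p in pairs}


# Toy graphs at two consecutive sparsity levels (not the Metropolit results).
above = undirected(("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "E"))
row = undirected(("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"))

rate = retention_rate(row, above)
print(f"{rate:.2%}")  # prints 75.00%
```

In this toy case one of the four edges (A-D) is new, so three of four edges are retained. A retention rate of 100% corresponds to d_new = 0.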
Figure 5. Temporal partially directed acyclic graph for the Metropolit data (Danish men born in 1953) for ψ = 0.00001. Nodes are ordered
in columns according to time from left to right: birth (orange), childhood (purple), youth (green), adulthood (blue), and early old age (red). The
edges are colored according to the first time period that they refer to.
to the one in the row above.
e The retention rate is computed as $(d_{\text{total}} - d_{\text{new}})/d_{\text{total}}$.

We have proposed a method extending the PC algorithm, temporal PC, that produces life-course models from an observed data set. We have implemented TPC in the causalDisco R package (20), and we hope it will find practical use in life-course epidemiology. The TPC algorithm was used for generating new hypotheses in an application concerning development of depression. However, it requires strong causal assumptions in order to be interpretable.

A strength of the TPC algorithm is that it considers information from the whole life course jointly and allows for exploratory model building. This facilitates building global models for the whole life course that can provide empirical evidence about presence or absence of causal links between

Moreover, in the oracle setting there is no difference between the skeletons constructed by TPC and the original PC algorithm; the temporal independence constraints utilized in the former will be available in the data already. However, when conditional independencies are estimated from observed data, the choice of skeleton-construction method can make a difference, because the TPC algorithm places more trust in temporally induced conditional independencies than conditional independencies inferred from data. We consider this to be an attractive feature of the TPC algorithm. This feature is not present in an alternative suggestion for how to incorporate background information originally
In the application, we noted that the life course was modeled retrospectively. This means that the resulting models should be thought of as descriptions of data-generating mechanisms rather than tools for designing interventions on, for example, children, because the variables are conditional on being alive at age 65 years. They are thus not representative of all children in the relevant background population. A natural first step to overcome this limitation will be to incorporate right censoring and an absorbing state (death) into the conditional-independence testing procedure, for example, by using methods for survival analysis with competing risks (36). This would also make it possible to

Health, University of Copenhagen, Copenhagen, Denmark (Merete Osler).

This work was funded by the Independent Research Fund Denmark (grant 8020-00031B).

Data availability statement: The data used in the example (temporal PC on simulated data) can be reproduced using the replication code available in Web Appendix 5. The data used in the application (development of depression in Danish men) are not available online for replication because they cannot be anonymized. Researchers interested in gaining access to the data may contact the Public Health Database at the Department of Public Health, University of