Water Quality Statistical Analysis 1425848451

Ecological Indicators 118 (2020) 106684


Ecological Indicators
journal homepage: www.elsevier.com/locate/ecolind

An analysis of the sample size requirements for acceptable statistical power in water quality monitoring for improvement detection
Christopher Wellen a,*, Philippe Van Cappellen b, Larissa Gospodyn a, Janis L. Thomas c, Mohamed N. Mohamed d

a Land and Water Resources Research Group, Department of Geography and Environmental Studies, Ryerson University, Toronto, Ontario, Canada
b Ecohydrology Research Group, Department of Earth Sciences, University of Waterloo, Waterloo, Ontario, Canada
c Environmental Monitoring and Reporting Branch, Ontario Ministry of the Environment, Conservation, and Parks, Toronto, Ontario, Canada
d Canadian Centre for Inland Waters, Environment and Climate Change Canada, Burlington, Ontario, Canada

ARTICLE INFO

Keywords: Statistical power; Phosphorus; Nitrogen; SRP; Hypothesis test; Water quality

ABSTRACT

Many water quality managers seek to demonstrate reductions in pollutants after a remedial program or policy change of some sort is implemented, but there is little information in the literature to help guide the extent of water quality sampling that is required to be confident that a change has occurred. Statistical power refers to the likelihood of avoiding a Type II error in hypothesis testing. It is critical to examine statistical power levels to ensure results are not unduly influenced by insufficient quantity of data. This study presents the first published record, to the best of our knowledge, on sample size requirements to achieve acceptable levels of statistical power in hypothesis testing of annual water quality (nutrients) in streams. We examined 13 temperate agricultural watersheds spanning a gradient of size from 11 to 16,000 km² using data synthesized from long-term flow and water quality records. We found that achieving commonly accepted levels of statistical power (0.8) after reductions of 20% in load or flow-weighted mean concentration (FWMC) required an inordinate quantity of data (50–250 years for load, 10–120 years for FWMC), while achieving statistical power of 0.8 after reductions of 80% of load or FWMC required very little data (2–4 years for FWMC, 2–7 years for load). Load reductions of 40% required a range of 8–50 years of data depending on analyte, while FWMC reductions of 40% required 3–10 years of total phosphorus (TP) data, 5–25 years for soluble reactive phosphorus (SRP), and 2–6 years for nitrate (NO3). We examined relationships among times to achieve statistical power and a number of common landscape descriptors (discharge, baseflow index, basin size, concentration-discharge slope) and found no discernable relationships for either TP or SRP, whereas catchments with higher baseflow indices were found to have lower data requirements for achieving statistical power of 0.8 for NO3. We also show through subsampling experiments that higher frequency sampling tended to reduce data requirements to achieve acceptable statistical power, though these gains diminish as the sample frequency increases. The information presented will help those tasked with watershed monitoring to design appropriate sampling regimes to ensure adequate data are obtained to detect change.

1. Introduction

The United States Environmental Protection Agency estimates that nearly half of monitored lakes, rivers, and estuaries are in poor biological condition (USEPA, 2016), with nutrients, primarily phosphorus in freshwaters, being the main driver of impaired water quality. Phosphorus impairment is rising in the United States (USEPA, 2016). In the European Union, only 40% of surface waters are in good ecological or chemical status, with nutrient enrichment being highlighted as a key impairment of water quality (European Environment Agency, 2018). Nonpoint sources, largely agricultural but also inclusive of urban, have been highlighted in both Europe and North America as the primary source of nutrients to waterways (USEPA, 2016; European Environment Agency, 2018). Many conservation measures, variously referred to as best management practices, beneficial management practices, or good management practices, have been devised to control the loss of nutrients and other pollutants from agricultural landscapes to streams (McDowell et al., 2016). Governments all over the world have spent billions of dollars on programs to encourage the adoption of conservation measures by farmers. Reductions of nonpoint source


* Corresponding author.
E-mail address: christopher.wellen@ryerson.ca (C. Wellen).

https://doi.org/10.1016/j.ecolind.2020.106684
Received 9 September 2019; Received in revised form 15 May 2020; Accepted 29 June 2020
Available online 23 July 2020
1470-160X/ © 2020 Published by Elsevier Ltd.
C. Wellen, et al. Ecological Indicators 118 (2020) 106684

pollutants at the plot and field scale have been demonstrated routinely (Cherry et al., 2008; McDowell et al., 2016). However, programs aimed at empirically demonstrating and quantifying the effectiveness of beneficial management practices at the watershed scale often struggle to detect changes to water quality after conservation measures have been adopted (Cherry et al., 2008; Tomer and Locke, 2011).

There have been many hypotheses proposed to explain the seeming disconnect between the demonstrated field scale efficacy of beneficial management and the elusive watershed scale efficacy of the same. One of the most prominent explanations is that controlling a single field is tractable, while controlling an entire watershed is generally not, especially in agricultural areas (Cherry et al., 2008). There has been limited research into using small, paired agricultural basins to evaluate the watershed scale effects of land management (Loftis et al., 2001; King et al., 2008). While these studies did emphasize the statistical advantages of the paired basin approach, it is often not possible to pair large basins and certainly not possible to manage the land use over their extent. Other work has emphasized that effects of single fields are quickly lost over the extent of even a small basin simply due to the confounding factors associated with the other fields in the watershed. Smart et al. (2015) were able to encourage farmers to adopt beneficial management to reduce erosion across a watershed that is 8218 km² and were able to detect a 38% reduction of sediment load. However, demonstrating this reduction required a data record spanning nearly 40 years. Examples of reduced nutrient loads demonstrated with empirical data are quite rare.

Several studies have attributed such difficulties in detecting nutrient load reductions to watershed nutrient legacies that buffer any changes to nutrient losses at the farm scale (Cherry et al., 2008; Tomer and Locke, 2011; Yang et al., 2013; Van Meter and Basu, 2015; Stuart, 2017). Undoubtedly, legacies and time lags play an important role in determining stream response to land management changes. Van Meter and Basu (2015) have estimated the time lag between changes to the landscape and stream nitrate concentrations at between 8 and 15 years for a watershed of around 8 km²; longer lags are likely in larger watersheds. In part because of complications due to lag effects, many approaches to evaluate conservation measures rely on process-based models (Cherry et al., 2008; Tomer and Locke, 2011; Stuart, 2017; Yang et al., 2013). However, it remains desirable to have tangible evidence substantiating the effects of beneficial management to address concerns arising from the uncertainty of modelling approaches (Wellen et al., 2015). Indeed, many programs expect to demonstrate efficacy with empirical data.

Despite the expectation that significant watershed scale investments should result in demonstrable improvements in water quality, little research has been dedicated to defining a priori the data requirements of an experiment to detect watershed scale differences in water quality after implementation of conservation measures. Past work on statistical power and sample size requirements in the context of water quality has typically focused on trend analyses (e.g. Irvine et al., 2012). While previous research has emphasized the advantages of paired basin experiments (Loftis et al., 2001; Bishop et al., 2005; King et al., 2008), most watershed scale evaluations of conservation measures are performed on large basins, where replication is not possible (Tomer and Locke, 2011). A recent analysis by the Northeast-Midwest Institute suggested that a hypothesis test of monthly total phosphorus concentrations would require roughly 10 years of monitoring data to have confidence in the results of hypothesis tests (Betanzo et al., 2015). However, these detection times varied widely, suggesting that certain aspects of the landscape may result in faster (or slower) detection times. This could be an important factor to consider in designing monitoring approaches. Many issues remain, including the role of uncertainties introduced by sampling and analytic error, the role of catchment size, and the possibility of identifying a priori appropriate basins for a monitoring program dedicated to empirically detecting changes due to conservation measures.

This study fills these gaps in the literature by estimating sample sizes required for detecting watershed scale changes brought about by conservation measures in agricultural watersheds. Here we posit that hypothesis tests of differences in means are the most appropriate tests to answer questions typically asked of watershed intervention programs. This is due to the nature of programs that seek to measure a difference (usually a reduction) in mean watershed loading to a water body after a campaign of management intervention (e.g., Annex 4 Objectives and Targets Task Team, 2015). We acknowledge that trend analyses are often adopted by watershed scientists for many valid statistical reasons (Meals et al., 2011). However, trend analyses are not designed to quantify reductions of mass inputs, which is the most typical question asked by watershed managers and policy analysts. We therefore focus on statistical hypothesis tests, specifically the Student’s t-test. This test has been used to evaluate watershed management (e.g., Smart et al., 2015), and many other impact studies use a hypothesis test to search for the effects of management practices (e.g., Desrosiers et al., 2006).

Due to the widely acknowledged difficulties in detecting watershed scale improvements after improvements to land management (Cherry et al., 2008), we assume that it is more pressing to understand the propensity for committing Type II errors. We therefore conduct an analysis of statistical power to determine sample sizes required to reduce the probability of a Type II error to acceptable levels. Statistical power refers to the conditional probability that one will reject a null hypothesis of no difference between two groups (H0) given that the null hypothesis is indeed false, and given other specifications such as the sample size and the degree of sample variability. Using subsampling and a Monte-Carlo approach, we also address the uncertainty introduced by different sampling regimes and analytic error. Finally, we examine whether commonly available landscape metrics co-vary significantly with the sample sizes (annual loads) we estimate. The simple case of an instantaneous reduction was assumed, and it was also assumed that a simple t-test would be used to detect differences before and after the implemented change. These assumptions are simplistic. However, the effect of a transitory phase between intervention and the full realization of that intervention would be to increase the time needed to accumulate sufficient sample sizes. Further, most policy questions posed to watershed monitoring data seek to establish whether watershed loading has declined, possibly resulting from interventions or to meet a prescribed threshold. We therefore present in this paper meaningful (under)estimates of sample sizes, in units of years of load data required to detect changes, to aid planning of watershed monitoring strategies.

The overall objectives of this study were to: i) synthesize a long-term daily record of watershed scale discharge and nutrient loading at a number of watersheds dominated by agricultural land cover spanning a range of basin sizes from 11 km² to 18,000 km², and ii) extract the necessary statistical parameters to quantify statistical power from these long-term synthetic time series. We posited three cases:

i) Case I assumes that loads can be measured continuously without error;
ii) Case II assumes that uncertainty is contributed by sampling, though not by analytic error; and
iii) Case III assumes that uncertainty is contributed by both sampling and analytic error of concentration and of flow.

The details of the power analysis and sample size estimation, as well as how each case was analyzed, are presented below. We apply this approach to records of total phosphorus, soluble reactive phosphorus, and nitrate. We chose these three analytes due to their widespread presence in monitoring programs, their management importance, and because they are subject to varying transport regimes (Moatar and Meybeck, 2005).


2. Methodology

2.1. Statistical power calculations

A statistical power calculation requires four pieces of information: i) an estimate of the population variance (σ²); ii) an estimate of the effect size, or difference between population means (δ = μ1 − μ2); iii) an estimate of the sample size (n); and iv) the significance level of the test (α). The way these pieces of information are related depends on the test to be employed. We assume that a two-sample t-test will be used to quantify the differences in mean annual loading rates of total and soluble reactive phosphorus before and after a conservation measures ‘treatment’ is applied. We assume that changes to the watershed result in instantaneous changes to the mean annual watershed load. We point out that a t-test characterizes power per group. We assumed that the groups represented years before and years after a change had taken place, and that the years before were sufficient to meet data requirements. In cases where the stream had not been sufficiently characterized before change had taken place, the total data requirements would increase.

It is critical to keep in mind the difference between the time required to detect a change, and the time required for a change in the environment to manifest. While in practice the two are always intertwined, in theory the first arises solely from the variability inherent in the monitoring data and the size of the change one wishes to detect (Zar, 1999), while the second is due to the nature of the system under study and is typically influenced by biogeochemical turnover rates and hydrological flowpath lengths (e.g. Van Meter and Basu, 2015). Statistical power is typically used to assess the sizes of samples required to detect a specific change assuming the change has already occurred and refers to the probability that a failure to reject the null hypothesis is not a true Type II error (Zar, 1999). While statistical hypothesis testing is always done with Type I errors in mind, it is important to ensure the probability of a Type II error is low so that we may trust the results of a statistical hypothesis test regardless of its results. Ensuring adequate statistical power requires sufficient data to detect the ‘signal in the noise’ (Zar, 1999). A t-test (or similar test, such as a simple ANOVA) may be the only viable option in cases where analysis is to be performed in a before/after design, as is often the case with change detection studies where replicability is difficult, i.e. in large basins with mixed land use (Tomer and Locke, 2011). Note that t-tests are concerned solely with detecting statistical differences, and do not attribute that difference to any intervention or confounding factor. In the case of only two groups with equal variance, a t-test will give the same results as an ANOVA (Zar, 1999).

Given these assumptions, we wish to quantify the probability of a Type II error when conducting a hypothesis test. For the analysis presented in this paper, we assumed for simplicity that a t-test of difference between two means would be employed. The null hypothesis (H0) is an assumption of equality of means between two groups, and the research hypothesis (H1) is a finding of differences between means:

$$ H_0: \mu_1 = \mu_2 \qquad (1) $$

$$ H_1: \mu_1 \neq \mu_2 \qquad (2) $$

with μ1 and μ2 being defined as the population means to be tested, in this case the mean annual load or FWMC before and after conservation measures have been implemented.

The test statistic is calculated as:

$$ t = \frac{\delta}{\sqrt{\dfrac{2 s_p^2}{n}}} \qquad (3) $$

where δ refers to the difference in means (δ = μ1 − μ2), n refers to the sample size, and s_p refers to the pooled sample standard deviation (assumed to be an estimate of the population standard deviation σ).

The critical value of a t-test is calculated as:

$$ t_{\beta(1),\nu} = \frac{\delta}{\sqrt{\dfrac{2 s_p^2}{n}}} - t_{\alpha,\nu} \qquad (4) $$

where δ, n and s_p are the same as in Eq. (3), α refers to the probability of committing a Type I error, ν to the degrees of freedom of the test (ν = n − 1), and β to the probability of committing a Type II error. The null hypothesis is rejected when the t value exceeds the critical t value. We assumed that the t-test used would be two-tailed, allowing for the possibility that factors outside the control of management (climate, major land use change) may result in increases or decreases of water quality indicators. We also assumed a two-sample, independent-samples t-test. Statistical power is defined as 1 − β. All computations of power and sample size were conducted with MATLAB’s function sampsizepwr and checked for consistency with G*Power (Faul et al., 2007).

2.2. Dataset

The datasets used in this investigation were multi-decadal time series of daily loads of total phosphorus, soluble reactive phosphorus, and nitrate. These data were required in order to robustly apply subsampling approaches. The sub-daily timescale records needed to synthesize reliable long-term daily or annual fluxes from simple aggregation approaches do exist at some research sites (e.g. Moatar and Meybeck, 2005), but they are not available for many critical management areas. Due to these inherent data limitations, in most cases long-term daily records of water quality are synthesized with some form of statistical approach that fills gaps (e.g. Moatar and Meybeck, 2005; Robertson et al., 2018; Lathrop et al., 2019). For an analysis of statistical power, it is particularly important that reliable estimates of the variability of load and concentration across several decades are used. The data used and the synthesis procedure used to generate the long-term data needed are described below.

Thirteen catchments of varying basin size in the Great Lakes region of Canada and the United States were selected for the power analysis and sample size estimation. Their characteristics are presented in Table 1 and their locations given in Fig. 1. The watersheds are primarily agricultural basins, with a minimal agricultural land cover of 66% with the exception of Redhill Creek, a predominantly urban watershed. All sites were subject to intensive monitoring of baseflows and event flows of water chemistry. Water quality data were collected by the Healthy Lake Huron campaign (http://www.healthylakehuron.ca/index.php; Pine River); the Ontario Ministry of the Environment, Conservation and Parks (Long et al., 2015, Grindstone and Redhill Creeks; Provincial Water Quality Monitoring Network, Grand River; Nutrient Monitoring program, all other sites except Maumee); and Heidelberg University (http://www.heidelberg.edu/academiclife/distinctive/ncwqr/data/guide; Maumee River). All discharge data were supplied by the Water Survey of Canada (https://www.canada.ca/en/environment-climate-change/services/water-overview/quantity/monitoring/survey.html).

Some sites had exceptionally long water quality records (up to 69 years, e.g. Grand and Maumee Rivers). At these sites, there have been long term trends in water quality that have been influenced by multiple changes to land use (Maumee, Baker et al., 2014) or by improvements in point source loading (Grand River, Van Meter and Basu, 2017). In the interest of having a comparable data record to other sites with shorter water quality records, we decided to use only the most recent water quality data from these sites. To select the starting date of the water quality data, a Mann-Kendall trend analysis was performed on the concentrations for the entire period of record for each site. The oldest measurement was omitted, and the trend test run again. This was repeated until there was no significant trend at the 95% confidence level (p > 0.05). This was performed on the water quality record for the Grand and the Maumee Rivers.

Eight covariates were calculated for the basins to help explain the


Table 1
List of Sites.

| Site Name | Size (km²) | Agriculture (percent cover) | Number of water quality samples used | Largest flow percentile sampled | Year range of used samples | Year range of daily flows | Total years of flow record |
|---|---|---|---|---|---|---|---|
| Pine River | 29.3 | 88 | 115 | 99% | 2012–2015 | 1974–2014 | 32 |
| Nissouri Creek | 31 | 86 | 53 | 97% | 2003–2009 | 1949–2017 | 69 |
| Silver Creek | 11 | 96 | 50 | 91% | 2003–2009 | 2002–2017 | 15 |
| Middle Maitland Creek | 46 | 87 | 56 | 90% | 2003–2009 | 1954–2017 | 64 |
| South Thames | 30 | 95 | 50 | 94% | 2003–2009 | 1957–2017 | 60 |
| Little Ausable River | 64 | 95 | 45 | 62% | 2003–2009 | 2007–2016 | 9 |
| Redhill Creek | 63 | 13* | 82 | 99% | 2010–2012 | 1978–2014 | 28 |
| Nineteen Creek | 26 | 95 | 51 | 92% | 2003–2009 | 1952–2017 | 54 |
| Blyth Brook | 18 | 64** | 54 | 95% | 2003–2009 | 1985–2017 | 25 |
| Venison Creek | 44 | 90 | 321 | 99% | 2014–2018 | 1967–2015 | 35 |
| Grindstone Creek | 88 | 56 | 70 | 94% | 2010–2012 | 1965–2014 | 49 |
| Grand River | 5200 | 72 | 120 | 99% | 1995–2012 | 1947–2015 | 68 |
| Maumee River | 16,409 | 80 | 5016 | 99% | 2005–2015 | 1975–2014 | 40 |

* 85% urban.
** 30% forested.
Land use data obtained from SOLRIS, 2008.

trends observed in required sample sizes. They are listed in Tables 3–5. Baseflow indices (BFI) were calculated for all sites as the ratio of baseflow to total flow. Baseflow separation was done using the HydRun toolbox (Tang and Carey, 2017). Specific discharge was calculated as the ratio of average flow to basin area, and expressed in mm/yr.

Fig. 1. Characteristics of catchments in the Great Lakes region of Canada and the United States used in the analysis.

Table 2
Brief description of the cases.

| Analysis Case | Description | Sources of uncertainty |
|---|---|---|
| Case I | Synthesized daily loads used as though they were perfect. Loads simply summed to annual loads. | None |
| Case II | Beale annual loads calculated from once per 7 day or once per 30 day subsamples. | Sampling |
| Case III | Beale annual loads calculated from once per 7 day or once per 30 day subsamples with random error added to the flows and concentrations. | Sampling and Data Error |

Table 3
Correlation coefficients between years to detect a change in load and FWMC at a power of 0.8 for Total Phosphorus and various landscape descriptors and hydrometrics. All bolded correlations have p < 0.1, with asterisk* indicating p < 0.05. Columns give the years to detect a change in load or in FWMC for a reduction of 20%, 40%, or 80%.

| Covariate | Load, 20% | Load, 40% | Load, 80% | FWMC, 20% | FWMC, 40% | FWMC, 80% |
|---|---|---|---|---|---|---|
| Specific discharge | −0.25 | −0.24 | −0.22 | −0.47 | −0.52 | −0.46 |
| Discharge | −0.37 | −0.39 | −0.19 | −0.45 | −0.42 | −0.11 |
| Log discharge | −0.33 | −0.35 | −0.18 | −0.36 | −0.37 | 0.01 |
| BFI | −0.44 | −0.44 | −0.50 | −0.09 | −0.12 | −0.05 |
| Basin size | −0.37 | −0.39 | −0.19 | −0.44 | −0.41 | −0.11 |
| Log size | −0.26 | −0.29 | −0.12 | −0.24 | −0.25 | 0.11 |
| CQ slope model | 0.06 | 0.05 | 0.02 | 0.41 | 0.37 | 0.53 |
| r² model | 0.02 | 0.04 | −0.01 | 0.28 | 0.29 | 0.55 |

Table 4
Correlation coefficients between years to detect a change in load and FWMC at a power of 0.8 for SRP and various landscape descriptors and hydrometrics. No correlations had p < 0.1. Columns give the years to detect a change in load or in FWMC for a reduction of 20%, 40%, or 80%.

| Covariate | Load, 20% | Load, 40% | Load, 80% | FWMC, 20% | FWMC, 40% | FWMC, 80% |
|---|---|---|---|---|---|---|
| Specific discharge | −0.02 | −0.02 | −0.10 | −0.01 | −0.01 | 0.02 |
| Discharge | −0.29 | −0.28 | −0.29 | −0.32 | −0.33 | −0.42 |
| Log discharge | −0.27 | −0.27 | −0.25 | −0.34 | −0.35 | −0.35 |
| BFI | −0.43 | −0.44 | −0.40 | −0.34 | −0.34 | −0.38 |
| Basin size | −0.29 | −0.28 | −0.29 | −0.32 | −0.33 | −0.42 |
| Log size | −0.24 | −0.24 | −0.21 | −0.32 | −0.33 | −0.33 |
| CQ slope model | −0.03 | −0.03 | 0.05 | 0.09 | 0.07 | 0.02 |
| r² model | 0.11 | 0.12 | 0.05 | 0.18 | 0.18 | 0.20 |

Table 5
Correlation coefficients between years to detect a change in load and FWMC at a power of 0.8 for Nitrate and various landscape descriptors and hydrometrics. All bolded correlations have p < 0.1, with asterisk* indicating p < 0.05. Columns give the years to detect a change in load or in FWMC for a reduction of 20%, 40%, or 80%.

| Covariate | Load, 20% | Load, 40% | Load, 80% | FWMC, 20% | FWMC, 40% | FWMC, 80% |
|---|---|---|---|---|---|---|
| Specific discharge | 0.27 | 0.26 | 0.32 | −0.12 | −0.12 | 0.01 |
| Discharge | 0.10 | 0.11 | −0.01 | 0.44 | 0.43 | 0.01 |
| Log discharge | −0.06 | −0.07 | −0.21 | 0.38 | 0.38 | 0.01 |
| BFI | −0.58* | −0.58* | −0.69* | −0.32 | −0.39 | 0.01 |
| Basin size | 0.10 | 0.11 | −0.01 | 0.44 | 0.43 | 0.01 |
| Log size | −0.12 | −0.13 | −0.26 | 0.38 | 0.38 | 0.01 |
| CQ slope model | 0.35 | 0.33 | 0.47 | −0.05 | −0.04 | 0.01 |
| r² model | 0.57* | 0.54 | 0.50 | 0.13 | 0.21 | 0.01 |

2.3. Synthesis of concentrations

The objective of the concentration synthesis was to create a time series of daily loads which realistically reproduced observed patterns of variability. Concentrations on days with no samples were estimated by fitting the following regression model:

$$ \operatorname{Log}(C) = \beta_1 \operatorname{Log}(Q) + \beta_2 \sin\left(\frac{2\pi\,DOY}{365}\right) + \beta_3 \cos\left(\frac{2\pi\,DOY}{365}\right) + \beta_4 \qquad (5) $$

where C refers to the concentration of a particular nutrient species at a particular instant in time (μg/L), Q to the instantaneous discharge (m³/s), DOY to the day of the year, Log to the base-10 logarithm, and β_n to coefficients estimated through calibration. This model is analogous to one of the models in the LOADEST software package and is intended to account for seasonality as well as concentration-discharge dynamics (Runkel et al., 2004). Eq. (5) was fit to the concentration data for each site using MATLAB’s regress command. The trigonometric seasonal terms were only retained if at least one coefficient was significantly different from zero. The model fit was generally acceptable overall. The average coefficient of determination (r²) was 0.43, commonly cited as acceptable for estimating loads (Moatar and Meybeck, 2005). The details of the model fits are presented in the electronic supplementary material (Tables ESM1–ESM3). Bias correction was done during back-transformation using:

$$ \hat{C} = 10^{\operatorname{Log}(\hat{C})} \times e^{2.65 \times RMSE^2} \qquad (6) $$

where Ĉ is the predicted concentration, Log denotes the base-10 logarithm, and RMSE refers to the root-mean squared error of the concentration model (Long et al., 2015).

The calibrated coefficients were used with long time series of flows to synthesize long term records. At Pine River, long term daily flow records were estimated by regression with daily flows from a downstream station (WSC gauge 02FD001; Upstream = 0.092 × Downstream + 0.077, r² = 0.8). The daily discharges were used to estimate long term concentration and load records. Note that only complete years were used for the analysis.

Flow weighted mean concentrations (FWMC) were calculated for many analyses. These were calculated as the total load over a time period divided by the total flow during that time period. FWMC values are less sensitive to variability in discharge and are often used as an alternative to loading values. We tested all annual synthetic datasets for conformance to the assumptions of a t-test (normality, lack of serial autocorrelation). While all sites were normal, one site had a weak autocorrelation for SRP and NO3 (r < 0.4, p < 0.05). Note that our power calculations do not address any statistical non-detects in our dataset. However, measured concentrations were nearly always much higher than the detection limits of the analytical methods employed. We present time series plots of an example of the raw data for Blyth Creek as Fig. ESM 1, and annual loads and FWMC synthesized time series for all catchments as Figures ESM 2–15.

2.4. Analysis procedure for cases I, II, and III

The three cases examined for the power analysis and sample size estimation were intended to capture progressively more detail about sampling, and were progressively more complex. In all cases, we examine data requirements only due to interannual variability. The time required for change to manifest is not accounted for. The results of our analyses can be thought of either as an instantaneous change, or as a change long after any transient period has passed. The detection times we show assume no lag time; all detection times should be interpreted as underestimates of time to detect changes in environmental processes.

In Case I, it was assumed that loading could be measured continuously without error, and that the synthesized daily load time series was a perfect record of these loads. Case I represents an approximation of the best possible monitoring situation achievable. This was operationalized by simply taking the model estimated loads generated by applying Eqs. (5) and (6) and extracting the parameters necessary for a power analysis and sample size determination (σ, μ). We then estimated the number of years it would take to reach a power of 0.8, the most common power convention.

In Case II, it was assumed that sampling would be conducted weekly or monthly, but that data error was not significant. Weekly sampling was simulated by dividing the entire period of record into weekly increments. One day within each weekly increment of the loading record was chosen at random to represent the sample of that week. Load estimates were calculated for each year with Beale’s estimator (Maccoux et al., 2016):

$$ L_d = Q \, \frac{\bar{l}}{\bar{q}} \left( \frac{1 + \dfrac{1}{n} \dfrac{S_{lq}}{\bar{l}\,\bar{q}}}{1 + \dfrac{1}{n} \dfrac{S_{q^2}}{\bar{q}^2}} \right) \qquad (7) $$

$$ S_{lq} = \frac{1}{n-1} \left( \sum_{i=1}^{n} q_i l_i - n\,\bar{q}\,\bar{l} \right) \qquad (8) $$

$$ S_{q^2} = \frac{1}{n-1} \left( \sum_{i=1}^{n} q_i^2 - n\,\bar{q}^2 \right) \qquad (9) $$

Monthly sampling proceeded similarly, with the exception that the


period of record was divided into monthly increments. Statistical
parameters needed for the analysis (σ, μ) were taken from the annual
time series estimated with Eqs. (7)–(9). This procedure was repeated
100 times, with a power analysis and sample size determination being
performed on each realization of the annual loading time series.
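The weekly subsampling and Beale estimate described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' MATLAB code: the function names `weekly_sample_days` and `beale_annual_load` are hypothetical, Q in Eq. (7) is assumed here to be the total annual flow (so the result is an annual load), and the synthetic flow series is invented for the demonstration.

```python
import random


def weekly_sample_days(n_days, rng, step=7):
    """Pick one random day index from each consecutive block of `step` days."""
    return [rng.randrange(start, min(start + step, n_days))
            for start in range(0, n_days, step)]


def beale_annual_load(daily_loads, daily_flows, sample_days):
    """Beale ratio estimate of annual load, following the form of Eqs. (7)-(9).

    Assumption: Q in Eq. (7) is taken as the total flow over the year, which is
    known from the continuous discharge record; loads are known only on the
    sampled days.
    """
    n = len(sample_days)
    if n < 2:
        raise ValueError("need at least two sampled days")
    l = [daily_loads[d] for d in sample_days]
    q = [daily_flows[d] for d in sample_days]
    l_bar = sum(l) / n
    q_bar = sum(q) / n
    # Eq. (8): covariance of sampled flow and load
    s_lq = (sum(qi * li for qi, li in zip(q, l)) - n * q_bar * l_bar) / (n - 1)
    # Eq. (9): variance of sampled flow
    s_qq = (sum(qi * qi for qi in q) - n * q_bar ** 2) / (n - 1)
    # Bias-correction factor in Eq. (7)
    correction = (1 + s_lq / (n * l_bar * q_bar)) / (1 + s_qq / (n * q_bar ** 2))
    return sum(daily_flows) * (l_bar / q_bar) * correction


# Hypothetical one-year record: lognormal flows, concentration rising with flow.
rng = random.Random(0)
flows = [rng.lognormvariate(0.0, 1.0) for _ in range(365)]
loads = [0.05 * f ** 1.3 for f in flows]
estimates = [beale_annual_load(loads, flows, weekly_sample_days(365, rng))
             for _ in range(100)]  # 100 realizations, as in Case II
print("true annual load:", sum(loads))
print("mean of 100 weekly-sampled estimates:", sum(estimates) / len(estimates))
```

Repeating the subsampling many times, as in the loop at the end, gives one annual load per realization; the spread of those estimates is the sampling uncertainty that Case II adds on top of interannual variability.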
In Case III, it was assumed that sampling would be conducted
weekly or monthly, and that data error is significant. Weekly and
monthly sampling was simulated with the same procedure used in Case
II. However, before Eqs. (7)–(9) were used to estimate the annual loads,
both the sub-sampled concentrations and flows were perturbed by a
multiplicative error of 10%. This value was chosen as it is a realistic
value of analytical and flow errors in optimal conditions (Dickinson,
1967; Sauer and Meyer, 1992; James and Roulet, 2006), which include
a valid stream rating curve and optimal lab techniques and equipment.
Note that this approach assumed there was no bias in this error. As with
Case II, the statistical parameters needed for the power analysis and
sample size determination were extracted from each realization of the
annual time series. Table 2 lists the three cases with a brief description.
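As a rough illustration of the Case III perturbation, the sketch below applies an unbiased multiplicative error to sub-sampled concentrations and flows before loads are computed. The text specifies only a multiplicative error of 10% with no bias; the zero-mean Gaussian form, the function names `perturb` and `perturbed_loads`, and the example values are assumptions made for this sketch.

```python
import random


def perturb(values, rng, rel_error=0.10):
    """Multiply each value by (1 + e), with e drawn from N(0, rel_error).

    Assumption: the error is zero-mean Gaussian with a standard deviation of
    10%; the source states only that the error is multiplicative and unbiased.
    """
    return [v * (1.0 + rng.gauss(0.0, rel_error)) for v in values]


def perturbed_loads(concentrations, flows, rng, rel_error=0.10):
    """Instantaneous loads (concentration x flow) after perturbing both inputs.

    Unit conversion factors are omitted for brevity.
    """
    noisy_c = perturb(concentrations, rng, rel_error)
    noisy_q = perturb(flows, rng, rel_error)
    return [c * q for c, q in zip(noisy_c, noisy_q)]


# Hypothetical weekly subsample of concentrations (ug/L) and flows (m3/s).
rng = random.Random(0)
conc = [40.0, 55.0, 120.0, 35.0]
flow = [1.2, 2.5, 9.8, 0.9]
print(perturbed_loads(conc, flow, rng))
```

Because the concentration and flow errors are drawn independently, the perturbed load carries a combined relative error of roughly 14% under these assumptions, which is then propagated through the Beale estimate for each realization.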

3. Results

3.1. Case I

The number of years required to achieve a statistical power of 0.8 for TP, SRP, and NO3 loads and FWMC values is presented in Fig. 2 for reductions of 20%, 40%, and 80%. We remind readers that these totals assume the baseline had been characterized already; these sample size requirements refer to years needed after a change had been made to the watershed. It is immediately clear that detecting small (20%) changes in loading or FWMC requires decades or longer for all analytes, and is likely to be impractical in most settings. Detecting moderate (40%) changes in load also requires decades for TP and SRP, while for NO3 the time requirements are shorter (< 10 years median). For moderate changes to stream chemistry, detecting changes to FWMC may be far more practical than changes to load, with median detection times on the order of 3–5 years for TP and NO3, while SRP still took significant amounts of time to detect changes (median of 10 years, ranging to 25 years). Detecting large changes (80%) requires only between 3 and 6 years for load and between 2 and 4 years for FWMC.

Fig. 2. Boxplots of years of annual samples required for TP, SRP, and NO3 load (left column) and flow-weighted mean concentration (right column). The rows refer to a 20% reduction of mean (top), a 40% reduction of mean (middle), and an 80% reduction of mean (bottom).
Fig. 2 reflects significant variability across catchments for a given robustness of these patterns. We note that no correlations were sig-
analyte and a given parameter – e.g., detection times for moderate nificant for SRP.
changes to SRP FWMC ranged from 4 to 26 years. We present an ana- Regarding NO3, the landscape metrics were slightly more ex-
lysis of the correlations between landscape drivers and detection times planatory. The ability to detect changes to NO3 load were consistently
in Tables 3–5, below. There were relatively few significant variables, negatively correlated with the baseflow index, showing that watersheds
and the correlations which were significant were not strong (r2 ~ 0.5). with more baseflow had a lower detection time for NO3. The slope of
While we hypothesized that baseflow index and specific discharge the C-Q relationship was weakly positively correlated with time to
would influence time to detection for TP and SRP, they explained detect large changes to load (p < 0.1, r ~ 0.5), but not to changes in
variability in detection time of TP load only for reductions of 80%. FWMC. The positive slope indicates that more episodic concentration-
Specific discharge explained significant variability for reductions of TP discharge relationships result in longer detect times, presumably from
FWMC of 40%. Taken together, these suggest that sites with more higher annual variability of export. Taken together for all analytes,
surface water (less baseflow and higher discharge per area) tend to have these results suggest that changes to nutrient export can be first de-
longer detection times, presumably from more episodic export patterns tected in basins with less episodic flows and less episodic concentration-
resulting in more annual variability. The slope and fit of the model discharge relationships, though this relationship is not strong and not
positively correlated with detection times for FWMC, suggesting that consistent.
basins with more episodically driven concentrations tended to have It is possible that our ability to detect these relationships is limited
longer detection times. However, we note that none of the correlations by our modest sample size (n = 13). According to Zar (1999), the
were significant for more than one reduction, calling into question the minimum estimate of Pearson’s r we would be able to detect with

n = 13 is 0.7, for an r2 of 0.49. To be able to detect relationships of a more subtle nature (e.g., r of < 0.5) we would need at least 29 catchments. It is quite possible that the metrics in Tables 3–5 would be significant predictors of detect times with larger sample sizes, but if so they explain less than half of the variability in detect times.

3.2. Case II

Case II seeks to understand what the effects of common stream sampling methodologies are on the sample requirements to achieve a given statistical power. The sampling protocols tested are sampling once per 7 days, and sampling once per 30 days.

The results of four scenarios are presented in Fig. 3. They represent the time to a power of 0.8 for TP and NO3 FWMC taken from the catchments with the greatest and least required sample sizes for those respective analytes. These are presented as extreme cases. Note that the envelopes shown for the weekly and monthly sample sizes depict the 95% confidence interval generated from the various subsamples.

Fig. 3. Sample size-power relationships for flow-weighted concentrations at two analytes at four catchments: TP at Maumee River (a), TP at Nineteen Creek (b), NO3 at Venison Creek (c), and NO3 at Silver Creek (d). Note that these correspond to the shortest (a, c) and longest (b, d) times to achieve a statistical power of 0.8 for TP and NO3. Also note that the colored envelopes denote the 95% confidence intervals for the specified detect times under sampling intervals of once per 7 days (blue) and once per 30 days (green).

Fig. 3 shows that non-continuous sampling typically introduces uncertainty into the time to detect at 0.8 power, and that this uncertainty is typically larger with coarser sampling intervals. While moving from monthly to weekly sampling did improve detect times for TP, moving from weekly to continuous sampling resulted in only modest gains in detect times. For NO3, moving from monthly to weekly to continuous sampling resulted in no appreciable reduction in detect times. It should be noted that under sparser sampling regimes, it is possible to underestimate the variability of the FWMC, and hence arrive at a shorter estimate of detect time than would be obtained with continuous sampling.

3.3. Case III

Case III seeks to understand the effects of measurement error on detect times, and how it interacts with sampling error. To estimate this, the catchment with median detect times for both TP and NO3 (South Thames) was selected for an in-depth analysis. Thames had a detect time of 6 years for a 40% reduction of TP FWMC and a detect time of 3 years for a reduction of 40% of NO3 FWMC at 0.8 power in Case I. Fig. 4 presents the results of a subsampling experiment where no measurement error was present, measurement error was added as 10% error of concentrations, 10% error in flows, and 10% error in both concentration and flow. These scenarios were repeated for both weekly and monthly sampling. In the case of TP, the years required to reach 0.8 power did not increase appreciably due to measurement error under weekly sampling, but did increase modestly (8 years to 9 years for TP and 5 years to 7 years for NO3) due to measurement error under monthly sampling. Taken together, these results show that when measurement error is limited to optimal levels (10%), the effect on detect

7
C. Wellen, et al. Ecological Indicators 118 (2020) 106684

Fig. 4. Required sample sizes in years to achieve power of 0.8 for TP (a) and NO3 (b) FWMC at the Middle Thames station. These sampling scenarios are under
weekly (1–4) and monthly (5–8) sampling. In order, 1 denotes no error, 2 denotes 10% error in concentration measurements, 3 denotes 10% error in flow mea-
surements, and 4 denotes 10% error in both concentration and flow measurements.

times is modest. The effect of coarser sampling intervals is greater or net nitrogen inputs and reduction in in-stream nitrogen concentrations
comparable. in various subbasins of a large basin with urban and agricultural inputs.
Muenich et al. (2016) have investigated the effects of legacy soil
4. Discussion phosphorus in reductions of phosphorus concentrations in a large river.
Their estimates were that it would take roughly 80 years for the effect
This paper sought to scope the data requirements to detect changes of nutrient input reductions to fully manifest as reductions in the
in annual nutrient loads and flow weighted concentrations, neglecting stream.
any transient period. While the assumptions adopted in this work are We do caution that more research is needed to fully quantify the
quite simplistic, a number of important conclusions for planning water range of lag times for landscape change to manifest in the streams that
quality monitoring for change detection were arrived at. We address a drain those landscapes. The estimates of detection times for flow
number of issues in this discussion, specifically: i) a contrast of detect weighted mean concentration we present in this work for the various
times with actual time for a change to take place; ii) the specific im- nutrient species did vary. However, referring to Fig. 3, it is clear that for
plications for monitoring; and iii) a recommendation of other methods both TP and NO3 at monthly sampling, the maximum lag times of
of change detection to supplement those derived from empirical data 22 years (TP at Nineteen Creek) and 15 years (NO3 at Silver Creek) are
gathering. of the same order if not smaller than the lag times estimated above. It is
likely that detect times would serve to add a significant amount of time
4.1. Which takes longer – lag times due to real world dynamics or statistical to the overall time needed to establish changes in water quality, though
power? may well be shorter than the actual lag time for changes to manifest.

The estimates in data requirements we present here have nothing to 4.2. Implications for monitoring
do with the timescales required for a change to actually manifest.
Should hypothesis tests be used to detect and quantify changes to water The chief motivation of this study was to provide concrete estimates
quality, they will ideally be initiated after the full magnitude of the of data requirements to detect change in nutrient water quality at the
effects of landscape change on water quality has manifested. The effects watershed scale. Doing so will allow us to set reasonable expectations
of statistical power in delaying conclusions of efficacy would in practice for monitoring programs. There were a number of findings from this
be added to lag times for change to manifest. We propose that in cases work that can inform future monitoring. Perhaps the most pertinent
where the data requirements we present here are small compared to the finding is the significant time required to detect changes – between 3
time lags for landscape change to manifest as changes to water quality, and 10 years for TP flow weighted mean concentrations of a reduction
the effects of power may not be profound. In cases where the statistical of 40%, and between 10 and 30 years for TP loads of a 40% reduction
power requirements are greater than lag times, the time required to (Fig. 2). We also note that in cases where changes on the order of 80%
gather sufficient data may not be a major factor in successfully obser- reductions in stream concentration were plausible in a short period of
ving change. time (e.g. improvements to pesticide management; Kreuger and
In practice, even after significant management intervention it is Nilsson, 2001; Todd and Struger, 2014), even short monitoring cam-
possible to see little to no change in watershed scale water quality paigns are sufficient to detect changes in flow weighted concentrations
(Cherry et al., 2008). Time lags between landscape change and water (2–3 years). When interpreting our results vis-a vis changes to pesticide
quality change have not been well quantified, though there has been concentrations, it should be noted that we did not need to account for
some work on this topic. Recent research suggests that there can be non-detects in our dataset, whereas when working with pesticide data it
significant time lags between landscape and water quality changes (Van is likely that one would encounter non-detects.
Meter and Basu, 2015). For instance, Van Meter and Basu (2017) The findings of Case I show that when detecting change, flow-
compared variability in catchment scale nitrogen inputs and outputs weighted mean concentrations have far more power than loads. This is
over long time scales to estimate 10–34 years lag between reduction in largely due to the confounding influence on loads of precipitation

8
C. Wellen, et al. Ecological Indicators 118 (2020) 106684

variability. We do not elaborate on the utility of flow weighted con- nutrient loading or FWMC. We consider two here: trend analyses and
centrations over loads here, as it has been established in the literature process-based models.
elsewhere (Annex 4 Nutrients Task Team, 2015; Betanzo et al., 2015). First, trend analyses are ubiquitous in watershed science (e.g.,
We also establish in the Case I work that, in the case of total and Stammler et al., 2017), and provide a much more familiar way of
soluble reactive phosphorus, common hydrometric or geomorphic analyzing watershed data then hypothesis tests of differences in means.
landscape metrics were unable to predict detect times. The variability Changes to water quality do not typically manifest as step changes, but
in detect times for a particular effect size – the signal – varies inversely often as more gradual trends (Van Meter and Basu, 2015, 2017).
with the degree of variability – the noise (Eq. (4)). The factors that lead However, the use of trend analysis to demonstrate compliance with
to variability in catchment loading or flow-weighted concentration policy goals is problematic for three reasons.
have been enumerated in the literature. One of the most prominent is Policy questions are typically framed in ways that strongly suggest a
the degree of chemostasis, defined as the insensitivity of chemical hypothesis test of differences between group central tendencies. For
concentrations to changes in discharge (Godsey et al., 2009; Basu et al., instance, in the Lake Erie context there is great interest in whether total
2011). Chemostasis may be quantified as the slope of a concentration- phosphorus loadings have declined by 40% with respect to a particular
discharge relationship in log space (Godsey et al., 2009), which is one baseline (Annex 4 Nutrients Task Team, 2015), while in a particular
of the terms estimated in Eq. (5). Another critical factor is the wa- Wyoming case study it was desired to show that a reduction of sus-
tershed size, as larger watersheds tend to ‘integrate’ signals from many pended sediment concentrations had occurred after a program of ran-
headwater source areas and are characterized by lower variability geland management improvement was implemented (Smart et al.,
(Creed et al., 2015). We show in Tables 3–5 little evidence to suggest 2015). Second, it may be desired for a trend analysis to be used to
that chemostasis or basin size is related to the detect times we obtained demonstrate that management is ‘on track’ to achieve a desired re-
for N or P species. duction. For instance, given a declining trend of 2% per year in FWMC,
The lack of significant landscape predictors for detect times may a 40% reduction would be realized in 20 years if this present trend
imply that finding ‘sentinel watersheds’ – where detect times are lower continues. However, simply assuming this hypothetical trend will con-
than others – is not possible using the commonly available metrics in tinue into the future is an extrapolation beyond what a simple statistical
Tables 3–5. However, the sample size we were working with was not analysis is able to provide. Finally, like any statistical analysis, a trend
large (13 catchments), and we do show that only relationships with an analysis is also subject to constraints of statistical power, and natural
r2 of roughly 0.5 or higher would have been detected. Future work with variability diminishes power. A power analysis on the ability to detect
more catchments will be needed to examine whether the commonly trends is warranted on these grounds. It is possible that, like differences
available metrics presented in Tables 3–5 are able to identify catch- between groups, detecting trends takes decades of observations. While
ments with shorter detect times, though we do not expect any very trend analyses are a more natural way to analyze watershed data than
strong bivariate relationships. We also caution that catchments with hypothesis tests, hypothesis tests of a difference in central tendency are
earlier detectable change are likely those with less variable flows and a natural fit for the policy questions typically asked of water quality
nutrient concentrations. The dynamics that dampen their variability improvement.
(e.g. higher baseflow index, slower response to precipitation) may also There is a second alternative to hypothesis tests of differences in
give these catchments different magnitudes of types of response than central tendency when establishing that water quality has been influ-
others. enced by management interventions: process-based watershed models
The results of Case II shows that the effects of sampling frequency (Cherry et al., 2008). Typically, the model is calibrated with full
are significant and cannot be ignored. Interestingly, weekly sampling knowledge of the detailed land management in the area, including
was found to result in detect times nearly as short as continuous data, beneficial practices, and a scenario is run where the beneficial practices
implying that the benefits from sampling more intensive than weekly are removed. The difference in water quality between the current and
are not significant. While this at first might seem to undermine the case counterfactual scenarios is an estimate of the efficacy of the beneficial
for event sampling, a closer consideration of the timescale of our sub- management practices. This approach is followed by a number of pro-
sampling shows this is not the case. Our subsampling was done on a grams, including the Conservation Effects Assessment Program (Tomer
daily timescale, and assumed that each data point was an accurate and Locke, 2011) and the Great Lakes Agricultural Stewardship In-
depiction of the average conditions of that day. It has been shown that itiative (Yang et al., 2013, 2017). This counterfactual approach relies
stream chemistry and flow dynamics change significantly on the order on the assumptions that i) the process model is an adequate re-
of hours (e.g. Long et al., 2014, 2015). In order to obtain an adequate presentation of the watershed as a whole, and ii), that the representa-
representation of a daily load, sampling must target events (Long et al., tions of the individual BMPs in the model are adequate. However, the
2015). counterfactual approach is not based on extrapolation of statistical
The results of Case III shows that the role of measurement error is techniques. Assuming the models are developed with proper attention
much less significant than that of sampling frequency, provided that the to best practices (which Wellen et al. (2015) found to be the case in the
measurement errors are not systematic. For instance, very high peak minority of cases), this simulation based approach may be the most
flows above the rating curve of a stream are actually extrapolated and rigorous way to avoid the delays due to statistical power and to long lag
subject to very high uncertainty, and likely systematic error. times (Muenich et al., 2016; Van Meter and Basu, 2017) required to
deliver empirical evidence of achieving management targets.
4.3. The need for alternatives to empirical demonstration of nutrient
reductions 5. Conclusions

The work in this paper has demonstrated that it may not always be This study presents the first published results on the sample sizes
feasible to demonstrate small or moderate nutrient reductions in a required to achieve acceptable levels of statistical power in hypothesis
timely manner with monitoring data alone due to the high data re- tests of annual nutrient water quality in streams. We present evidence
quirements. This conclusion has been advocated elsewhere (e.g., that for pollutant reductions of 20%, the data required to achieve ac-
Cherry, 2008; Tomer and Locke, 2011), though it is typically arrived at ceptable levels of statistical power were quite onerous, taking decades
by considering lags for change on the landscape to manifest in a stream. to centuries, whereas for pollutant reductions of 80%, the data required
This study is the first published study to show that statistical power may for similar levels of power were only in the range of 2–7 years. For
play a strong role in this as well. There is a need for assessment methods pollutant reductions of 40%, we showed that using FWMC instead of
other than those focused on demonstrating a particular change of load results in much shorter times to acceptable levels of statistical

9
C. Wellen, et al. Ecological Indicators 118 (2020) 106684

power. For pollutant reductions of 40%, times to achieve acceptable Baker, D.B., Confesor, R., Ewing, D.E., Johnson, L.T., Kramer, J.W., Merry, B.J., 2014.
levels of statistical power were generally less than the estimates avail- Phosphorus loading to Lake Erie from the Maumee, Sandusky and Cuyahoga rivers:
the importance of bioavailability. J. Great Lakes Res. 40, 502–517. https://doi.org/
able for lag times in the literature. We found no discernable relation- 10.1016/j.jglr.2014.05.001.
ships between landscape descriptors and time to acceptable power for Basu, N.B., Thompson, S.E., Rao, P.S.C., 2011. Hydrologic and biogeochemical func-
total phosphorus or SRP. Catchments with higher baseflow indices were tioning of intensively managed catchments: a synthesis of top-down analyses. Water
Resour. Res. 47 (10), 1–12. https://doi.org/10.1029/2011WR010800.
found to have lower data requirements for achieving statistical power Betanzo, E., Choquette, A., Reckhow, K., Hayes, L., Hagen, E., Argue, D., & Cangelosi, A.
of 0.8 for NO3, suggesting catchments with high baseflow indices may (2015). Water Data to Answer Urgent Water Policy Questions : Monitoring design,
allow for acceptable statistical power before those with lower baseflow available data, and filling data gaps for determining the effectiveness of agricultural
management practices ... Water Data to Answer Urgent Water Policy Questions:
indices. We also show through subsampling experiments that higher Monitoring desi. https://doi.org/10.13140/RG.2.1.1102.5684.
frequency sampling tended to reduce data requirements to achieve Bishop, P.L., Hively, W.D., Stedinger, J.R., Rafferty, M.R., Lojpersberger, J.L., Bloomfield,
acceptable statistical power. However, there were generally not sig- J.A., 2005. Multivariate analysis of paired watershed data to evaluate agricultural
best management practice effects on stream water phosphorus. J. Environ. Quality 34
nificant gains in statistical power when intensifying from weekly to
(3), 1087–1101. https://doi.org/10.2134/jeq2004.0194.
continuous sampling. Finally, we show through Monte Carlo and sub- Cherry, K.A., Shepherd, M., Withers, P.J.A., Mooney, S.J., 2008. Assessing the effec-
sampling that the effects of sampling frequency on statistical power are tiveness of actions to mitigate nutrient loss from agriculture: a review of methods.
much stronger than those of data error. The information presented Sci. Total Environ. 406 (1–2), 1–23. https://doi.org/10.1016/j.scitotenv.2008.07.
015.
herein will help those undertaking watershed monitoring for stream Creed, I.F., McKnight, D.M., Pellerin, B.A., Green, M.B., Bergamaschi, B.A., Aiken, G.R.,
water quality changes to design appropriate sampling regimes to ensure Stackpoole, S.M., 2015. The river as a chemostat: fresh perspectives on dissolved
adequate statistical power. organic matter flowing down the river continuum. Can. J. Fish. Aquat. Sci. 72 (8),
1272–1285. https://doi.org/10.1139/cjfas-2014-0400.
Desrosiers, M., Planas, D., Mucci, A., 2006. Short-term responses to watershed logging on
CRediT authorship contribution statement biomass mercury and methylmercury accumulation by periphyton in boreal lakes.
Can. J. Fish. Aquat. Sci. 1745, 1734–1745. https://doi.org/10.1139/F06-077.
Dickinson, W. (1967). Accuracy of discharge determinations. Fort Collins, Colorado.
Christopher Wellen: Conceptualization, Methodology, Software, Faul, F., Erdfelder, E., Lang, A.-G., Buchner, A., 2007. G*Power 3: a flexible statistical
Formal analysis, Investigation, Writing - original draft, Writing - review power analysis program for the social, behavioral, and biomedical sciences. Behavior
& editing. Philippe Van Cappellen: Supervision. Larissa Gospodyn: Res. Methods 39, 175–191.
Godsey, S.E., Kirchner, J.W., Clow, D.W., 2009. Concentration-discharge relationships
Software, Data curation. Janis L. Thomas: Project administration, reflect chemostatic characteristics of US catchments. Hydrol. Process. 23 (13),
Writing - review & editing. Mohamed N. Mohamed: 1844–1864. https://doi.org/10.1002/hyp.7315.
Conceptualization, Resources, Writing - review & editing, Supervision, Irvine, K.M., Manlove, K., Hollimon, C., 2012. Power analysis and trend detection for
water quality monitoring data: An application for the Greater Yellowstone Inventory
Funding acquisition.
and Monitoring Network. Natural Resource Report NPS/GRYN/NRR—2012/556.
National Park Service, Fort Collins, Colorado.
Declaration of Competing Interest James, A., Roulet, N., 2006. Investigating the applicability of end-member mixing ana-
lysis (EMMA) across scale: a study of eight small, nested catchments in a temperate
forested watershed. Water Resour. Res. 42 (8).
The authors declare that they have no known competing financial King, K.W., Smiley, P.C., Baker, B.J., Fausey, N.R., 2008. Validation of paired watersheds
interests or personal relationships that could have appeared to influ- for assessing conservation practices in the Upper Big Walnut Creek watershed, Ohio.
J. Soil Water Conserv. 63 (6), 380–395. https://doi.org/10.2489/jswc.63.6.380.
ence the work reported in this paper. Kreuger, J., Nilsson, E., 2001. Catchment scale risk-mitigation experiences- key issues for
reducing pesticide transport to surface waters. British Crop Protection Council
Acknowledgements Symposium Proceeding NO. 78: Pesticide Behaviour in Soil and Water, (78),
319–324. Retrieved from http://www.slu.se/Documents/externwebben/cen-
trumbildningar-projekt/ckb/ovrigt/BCPC_Symposium_78.pdf.
This work was funded by the Canada-Ontario Agreement, through Lathrop, T.R., Bunch, A.R., Downhour, M.S. (2019). Regression models for estimating
the Ontario Ministry of the Environment, Conservation, and Parks. Data sediment and nutrient concentrations and loads at the Kankakee River, Shelby,
Indiana. Shelby, Indiana.
for Pine River were collected by the Ausable-Bayfield Conservation Loftis, J.C., MacDonald, L.H., Streett, S., Iyer, H.K., Bunte, K., 2001. Detecting cumulative
Authority, Maitland Valley Conservation Authority, Saugeen Valley watershed effects: the statistical power of pairing. J. Hydrol. 251 (1–2), 49–64.
Conservation Authority, and St. Clair Region Conservation Authority. https://doi.org/10.1016/S0022-1694(01)00431-0.
Long, T., Wellen, C., Arhonditsis, G., Boyd, D., 2014. Evaluation of stormwater and
Funding for collecting these data was provided by the Ontario Ministry
snowmelt inputs, land use and seasonality on nutrient dynamics in the watersheds of
of Agriculture, Food, and Rural Affairs through the Canada Ontario Hamilton Harbour, Ontario, Canada. J. Great Lakes Res., 40(4). https://doi.org/10.
Agreement respecting the Great Lakes, the Ministry of the Environment, 1016/j.jglr.2014.09.017.
Conservation, and Parks, and Environment and Climate Change Long, T., Wellen, C., Arhonditsis, G., Boyd, D., Mohamed, M., O’Connor, K., 2015.
Estimation of tributary total phosphorus loads to Hamilton Harbour, Ontario,
Canada. The views expressed in this publication are the views of the Canada, using a series of regression equations. J. Great Lakes Res., 41(3). https://doi.
authors and do not necessarily reflect the views of any of the funding org/10.1016/j.jglr.2015.04.001.
organizations. All codes used in the generation of this work can be Maccoux, M., Dove, A., Backus, S., Dolan, D., 2016. Total and soluble reactive 493
phosphorus loadings to Lake Erie: a detailed accounting by year, basin, country, and
obtained from the corresponding author upon request. tributary. J. Great Lakes Res. 42, 1151–1165.
McDowell, R.W., Dils, R.M., Collins, A.L., Flahive, K.A., Sharpley, A.N., Quinn, J., 2016. A
Appendix A. Supplementary data review of the policies and implementation of practices to decrease water quality
impairment by phosphorus in New Zealand, the UK, and the US. Nutr. Cycl.
Agroecosyst. 104 (3), 289–305. https://doi.org/10.1007/s10705-015-9727-0.
Supplementary data to this article can be found online at https:// Meals, D.W., Spooner, J., Dressing, S.A., Harcum, J.B., 2011. Statistical Analysis for
doi.org/10.1016/j.ecolind.2020.106684. Monotonic Trends Introduction. Fairfax, VA.
Van Meter, K., Basu, N., 2017. Time lags in watershed-scale nutrient transport: an ex-
ploration of dominant controls Time lags in watershed-scale nutrient transport: an
References exploration of dominant controls. Environ. Res. Lett. 12, 084017.
Moatar, F., Meybeck, M., 2005. Compared performances of different algorithms for es-
timating annual nutrient loads discharged by the eutrophic River Loire. Hydrological,
Agency, E. E. (2018). European waters Assessment of status and pressures 2018. European
444(September 2004), 429–444. https://doi.org/10.1002/hyp.5541.
waters Assessment of status and pressures 2018. Copenhagen, Denmark. https://doi.
Muenich, R.L., Kalcic, M., Scavia, D., 2016. Evaluating the impact of legacy P and agri-
org/doi:10.2800/303664.
cultural conservation practices on nutrient loads from the Maumee River Watershed.
Agency, U. S. E. P. (2016). Preoperative local staging of colorectal cancer patients with
Environ. Sci. Technol. 50 (15), 8146–8154. https://doi.org/10.1021/acs.est.
MDCT. National Rivers and Streams Assessment 2008-2009: A Collaborative Survey.
6b01421.
Washington, DC. Retrieved from http://www.epa.gov/national‐aquatic‐resource‐-
Robertson, D.M., Hubbard, L.E., Lorenz, D.L., Sullivan, D.J., 2018. A surrogate regression
surveys/nrsa.
approach for computing continuous loads for the tributary nutrient and sediment
Annex 4 Objectives and Targets Task Team. (2015). Recommended Phosphorus Loading
monitoring program on the Great Lakes. J. Great Lakes Res. 44 (1), 26–42. https://
Targets for Lake Erie.
doi.org/10.1016/j.jglr.2017.10.003.

10
C. Wellen, et al. Ecological Indicators 118 (2020) 106684

Runkel, R., Crawford, C., Cohn, T., 2004. Load estimator (LOADEST): a FORTRAN pro- after a cosmetic pesticides ban. Challenges 5 (1), 138–151. https://doi.org/10.3390/
gram for estimating constituent loads in streams and rivers. challe5010138.
Sauer, V., Meyer, R., 1992. Determination of Error in Individual Discharge Measurements. Tomer, M.D., Locke, M.A., 2011. The challenge of documenting water quality benefits of
Smart, A.J., Clay, D., Stover, R.G., Parvez, M., Reitsma, K., Janssen, L., Mousel, E., 2015. conservation practices: a review of USDA-ARS’s conservation effects assessment
Persistence wins: long-term agricultural conservation outreach pays off. J. Extens., project watershed studies. Water Sci. Technol. 64 (1), 300–310. https://doi.org/10.
53(2). Retrieved from http://www.joe.org/joe/2015april/rb6.php. 2166/wst.2011.555.
Southern Ontario Land Resource Information System (SOLRIS) Land Use Data. Toronto, Van Meter, K.J., Basu, N.B., 2015. Catchment legacies and time lags: A parsimonious
Ontario: The Ontario Ministry of Natural Resources, 2008. watershed model to predict the effects of legacy storage on nitrogen export. PLoS
Stammler, K.L., Taylor, W.D., Mohamed, M.N., 2017. Long-term decline in stream total ONE 10 (5), 1–16. https://doi.org/10.1371/journal.pone.0125971.
phosphorus concentrations: a pervasive pattern in all watershed types in Ontario. J. Wellen, C., Kamran-Disfani, A.-R., Arhonditsis, G.B., 2015. Evaluation of the current state
Great Lakes Res. 43 (5), 930–937. https://doi.org/10.1016/j.jglr.2017.07.005. of distributed watershed nutrient water quality modeling. Environ. Sci. Technol. 49
Stuart, V., 2017. Watershed Evaluation of Beneficial Management Practices (WEBs): (6). https://doi.org/10.1021/es5049557.
Managing our Land and Protecting our Water Through Long-Term Watershed-Scale Yang, W., Liu, Y., Simmons, J., Oginskyy, A., McKague, K., 2013. SWAT Modelling of
Research: Final Report (2004–2013). Ottawa, ON. Agricultural BMPs and Analysis of BMP Cost Effectiveness in the Gully Creek
Tang, W., Carey, S.K., 2017. HydRun: a MATLAB toolbox for rainfall – runoff analysis. Watershed. Guelph, ON.
Hydrol. Process. 31, 2670–2682. https://doi.org/10.1002/hyp.11185. Zar, J., 1999. Biostatistical Analysis. Prentice Hall.
Todd, A., Struger, J., 2014. Changes in acid herbicide concentrations in urban streams
