Design and Analysis of Multistage Group Testing Surveys with Application

to Detecting and Estimating Prevalence of Transgenic Corn in México

by

Osval Antonio Montesinos López

A DISSERTATION

Presented to the Faculty of


The Graduate College at the University of Nebraska
In Partial Fulfillment of Requirements
For the Degree of Doctor of Philosophy

Major: Statistics and Agronomy

Under the Supervision of Professors Kent M. Eskridge and P. Stephen Baenziger

Lincoln, Nebraska
May, 2014
UMI Number: 3615944

All rights reserved

INFORMATION TO ALL USERS


The quality of this reproduction is dependent upon the quality of the copy submitted.

In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if material had to be removed,
a note will indicate the deletion.

Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author.
Microform Edition © ProQuest LLC.
All rights reserved. This work is protected against
unauthorized copying under Title 17, United States Code

ProQuest LLC.
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106 - 1346
Design and Analysis of Multistage Group Testing Surveys with Application

to Detecting and Estimating Transgenic Corn in México

Osval Antonio Montesinos López, Ph.D.


University of Nebraska, 2014
Advisor: Kent M. Eskridge and P. Stephen Baenziger
Group testing is a cost-efficient technique first proposed by Dorfman (1943) to detect
soldiers with syphilis during World War II. Instead of performing a diagnostic test on
each individual, the blood of s individuals is pooled and a single diagnostic test is
applied to each pool. Assuming a perfect diagnostic test, if a pool tests positive, at least
one of the s individuals in the pool is positive, while if a pool tests negative, all s
individuals in the pool are free of the disease. When a pool tests positive, the blood of
each individual is re-tested to identify the individuals with syphilis. However, if the
purpose of the analysis is only to estimate prevalence, it is not necessary to re-test the
members of a positive group (Remund et al., 2001; Hernández-Suárez et al., 2008).
When positive individuals are rare, this method produces significant savings of time and
resources, with reported reductions in the number of required diagnostic tests of more
than 80% (Remund et al., 2001).

Group testing has been used to estimate the prevalence of diseases or to classify
individual animals, plants, or humans (Dodd et al., 2002; Remlinger et al., 2006;
Verstraeten et al., 1998; Peck, 2006; Hernández-Suárez et al., 2008) and has motivated
the development of sampling methods and regression models for group testing data. In
the case of sampling methods for group testing, only methods for sample size
determination under simple random sampling (SRS) have been developed. Group testing
regression models have been well developed under SRS and under two-stage sampling.
Under two-stage sampling, however, these models assume that the samples of clusters
(primary sampling units) and individuals (secondary sampling units) are taken under
SRS. In many applications this assumption is violated because the primary sampling
units are selected with probability proportional to size (PPS) rather than SRS. An
additional problem is that PPS sampling can produce an informative sampling process,
where the response variable is correlated with the probability of selection even after
conditioning on the model covariates (Pfeffermann et al., 2006). Informative sampling
can produce substantial biases when traditional estimation methods are used in either
group testing or non-group testing applications.

One recent application of group testing used to estimate the prevalence of rare traits
under a complex survey structure was described in Piñeyro-Nelson et al. (2009). They
used group testing to estimate the presence of transgenic corn in México. However, due
to the lack of an appropriate methodology, they analyzed the data ignoring the complex
sampling structure, with the likely consequence of producing inefficient and possibly
biased estimates. For this reason, the present work develops sampling designs and
regression group testing methods for complex surveys, and these methods are used to
estimate the prevalence of transgenic corn in México.

DEDICATION

To my loved family

ACKNOWLEDGMENTS

My gratitude goes first to God for my life, my health, and my family.

A deep appreciation to my advisor, Kent M. Eskridge, for efficiently mentoring and
guiding me during my research and studies. I thank him for treating me as a friend and
for giving me the freedom and support to pursue my research ideas and writing. Thanks
for his mentoring, feedback, and advice during the development of this project. It has
been a great privilege for me to work under his supervision.

Likewise, I thank Dr. Stephen Baenziger (co-advisor) for all that I learned from him,
and the members of my supervisory committee, Drs. Dong Wang, Tom Hoegemeyer,
and Drew Tyre, for the help, suggestions, and criticisms that contributed to the
improvement of this work. Special thanks to Dr. Jose Crossa for suggesting that I come
to the University of Nebraska for my Ph.D. program and for his help and assistance
during my research and studies.

I am grateful to my parents, Lourdes López Morales and Abelardo Montesinos Crúz,
for their love and encouragement, and for teaching me that with devotion to work and
creativity we can achieve our dreams.

Although I have not included their names in this acknowledgment, I feel profoundly
thankful to those individuals who diligently and generously have shared their knowledge
and talent with me and others through instructional means. That includes professors,
book authors, scientists, and non-professional individuals from whom I acquired
knowledge that I believe has shaped me as a person and as a researcher.

Last but not least, thanks are given to the Universidad de Colima and to the Programa
de Mejoramiento del Profesorado (PROMEP) of México for the economic support
provided during my studies, and to the University of Nebraska-Lincoln for having been
the home of my doctorate.

Table of Contents
Chapter 1: Introduction ....................................................................................................... 1
1.1 Group testing to detect transgenic corn in México .................................................. 1
1.2 Problem Statement ................................................................................................... 4
1.3 Research Objectives .................................................................................................. 6
1.4 References ................................................................................................................ 7
Chapter 2: Optimal sample sizes for group testing in two stage sampling ....................... 10
2.1 Introduction ............................................................................................................. 10
2.2 Random logistic model for individual testing ......................................................... 13
2.3 Random logistic model for group testing................................................................ 14
2.4 Approximate marginal variance of the proportion.................................. 15
2.5 Optimal sample size assuming equal cluster size ................................................... 18
2.5.1 Minimizing variance subject to a budget constraint ........................................ 18
2.5.2 Minimizing the budget to obtain a certain width of the confidence interval ... 19
2.5.3 Minimizing the budget to obtain a certain power ............................................ 20
2.6 Behavior of the optimal sample size for equal cluster sizes ................................... 22
2.7 Correction factor for unequal cluster sizes ............................................................. 22
2.8 Comparison of the relative efficiency and its Taylor approximation ..................... 24
2.9 Estimating the proportion of transgenic plants––An example ................................ 25
2.10 Tables for determining sample size ...................................................................... 29
2.11 Discussion and conclusions .................................................................................. 30
2.12 References ............................................................................................................. 31
Appendix 2.A. ................................................................................................................... 41
Appendix 2.B. ................................................................................................................... 42
Appendix 2.C. ................................................................................................................... 43
Appendix 2.D. ................................................................................................................... 44
Chapter 3: A regression group testing model for a two-stage survey under informative
sampling for detecting and estimating the presence of transgenic corn ........................... 45
3.1 Introduction ............................................................................................................. 45
3.2 Sampling design and generalized linear model ...................................................... 49
3.3 Probability of selection ........................................................................................... 52

3.4 Level 1 scaling methods.......................................................................................... 53


3.5 Incorporating the weights in the PML .................................................................... 54
3.5.1 Sandwich estimator of the standard errors...................................................... 55
3.6 Simulation study ..................................................................................................... 56
3.7 Results ..................................................................................................................... 58
3.7.1 Simulation with a covariate at the individual level ......................................... 62
3.8 Application .............................................................................................................. 62
3.9 Conclusions ............................................................................................................. 66
3.10 References ............................................................................................................. 68
Appendix 3.A. ................................................................................................................... 82
Appendix 3.B. ................................................................................................................... 83
Chapter 4: Three-stage optimal sampling plans for group testing data given a pool size 84
4.1 Introduction ............................................................................................................. 84
4.2. Random logistic model for individual testing ........................................................ 85
4.3 Random logistic model for group testing................................................................ 86
4.4 Approximate marginal variance of the proportion.................................................. 87
4.5 Sample sizes without constraints ............................................................................ 90
4.5.1. How to obtain a certain width of the confidence interval ............................... 90
4.5.2 How to obtain a certain power ........................................................................ 91
4.6 Optimal sample sizes under constraints for a given pool size ................................ 92
4.6.1 Minimizing variance subject to a budget constraint ........................................ 92
4.6.2 Minimizing the budget to obtain a certain width of the confidence interval ... 93
4.6.3 Minimizing the budget to obtain a certain power ............................................ 94
4.7 Tables for determining the optimal sample size assuming equal locality and field
sizes ............................................................................................................................... 95
4.8 Adjusting for unequal locality and field sizes......................................................... 96
4.9 A numerical example for estimating the presence of transgenic maize.................. 98
4.10 Discussion and conclusions ................................................................................ 102
4.11 References ........................................................................................................... 102
Appendix 4.A. ................................................................................................................. 108
Appendix 4.B. ................................................................................................................. 110

Appendix 4.C. ................................................................................................ 111


Appendix 4.D. ................................................................................................................. 113
Appendix 4.E. ................................................................................................................. 114
Chapter 5: Optimal sampling plans for multistage group testing data with optimal pool
size .................................................................................................................................. 116
5.1 Introduction ........................................................................................................... 116
5.2 Optimal sample sizes for two stages ..................................................................... 117
5.2.1. Tables for sample size determination for a two stage sampling process...... 119
5.3 Optimal sample sizes for three stages ................................................................... 121
5.3.1. Tables for sample size determination for a three stage sampling process ... 123
5.4 Conclusions ........................................................................................................... 124
5.5 References ............................................................................................................. 125
Appendix 5.A. ................................................................................................................. 135
Appendix 5.B. ................................................................................................ 137
Appendix 5.C. ................................................................................................................. 138
Chapter 6: General conclusions ...................................................................................... 139
6.1 Future research ...................................................................................................... 141
6.2. References ............................................................................................................ 142

List of Tables
Table 2. 1. Cluster size distribution used for calculating relative efficiency. ................... 35
Table 2. 2. Optimal sample sizes (m and g) for group testing for two stages given a pool
size that minimize the variance of the proportion (π̂). ............................................. 36
Table 2. 3. Optimal sample sizes (m and g) for confidence interval estimation using
group testing in two stages given a pool size. ........................................................... 37
Table 2. 4. Optimal sample sizes (m and g) for power estimation using group testing in
two stages given a pool size. ..................................................................................... 38
Table 3. 1. Simulation means and standard deviations using 6 clusters. .......................... 72
Table 3. 2. Simulation means and standard deviations using 12 clusters. ........................ 73
Table 3. 3. Simulation means and standard deviations using 24 clusters. ........................ 74
Table 3. 4. Comparison using three different elementary units size selected under SRS. 75
Table 3. 5. Comparison of NLMIXED vs GLIMMIX for weighting methods 3, 4 and 5.
................................................................................................................................... 76
Table 3. 6. Comparison of informative sampling at both levels, at the cluster level and
non-informative......................................................................................................... 77
Table 3. 7. Simulation means and standard deviations (std) of the model with a covariate
at the individual level (β0 = -4.7598, β1 = 0.8290 and σb = 0.9820 true values).
................................................................................................................................... 78
Table 3. 8. Population data with two regions, eight fields (clusters) and two strata per
field (levels of fertility). ............................................................................................ 79
Table 3. 9. Sample resulting by getting two fields within each region under PPS and
doing stratified sampling within each field at a fixed sample size of six plants per
stratum under SRS. ................................................................................................... 79
Table 3. 10. Sample prepared in terms of pools for analysis. ........................................... 80
Table 3. 11. NLMIXED statements for two-stage group testing regression under PPS... 81
Table 4. 1. Optimal sample sizes (, ) given the pool size (s=10, 20) for group testing
in three stages for five values of the variance between fields................................ 105
Table 4. 2. Optimal sample sizes (, ) given the pool size (s=10, 20) for group testing
in three stages for five values of c3/c1. ................................................................ 106

Table 4. 3. Number of pools comprised of leaf samples from two localities in Oaxaca,
Mexico (2004). ........................................................................................................ 107
Table 5. 1. R code for obtaining optimum values of g and s for a two-stage survey with
group testing............................................................................................................ 127
Table 5. 2. Optimal sample sizes (g, s) for group testing in two stages for three
combinations of  and . .................................................................................... 128
Table 5. 3. Optimal sample sizes (g, s) for group testing in two stages for three values of
c2/c1. ..................................................................................................................... 129
Table 5. 4. Relative cost RC(g, s) = C(g∗, s)/C(g∗, s∗) for two-stage
sampling. ................................................................................................................. 130
Table 5. 5. R code for obtaining optimum values of m, g, and s for a three-stage survey
with group testing. .................................................................................................. 131
Table 5. 6. Optimal sample sizes (m, g, s) for group testing in three stages for three
combinations of  and . .................................................................................... 132
Table 5. 7. Optimal sample sizes (m, g, s) for group testing in three stages for three
values of c2/c1. ..................................................................................................... 133
Table 5.8. Relative cost RC(m, g, s) = C(m∗, g∗, s = 10)/C(m∗, g∗, s∗) for three
stage sampling. ........................................................................................................ 134

List of Figures
Figure 2. 1. Ratio of the number of clusters and number of pools per cluster. ................ 39
Figure 2.2. Relative efficiency of unequal versus equal cluster sizes as a function of the
intraclass correlation ρ for four distributions of cluster size: a) uniform, b) unimodal,
c) bimodal, and d) positively skewed distribution. ................................................... 40
Figure 5.1. Var(π̂) for a two stage survey as a function of pool size (s). ....................... 126

Chapter 1: Introduction

1.1 Group testing to detect transgenic corn in México

México is one of the most biologically diverse countries in the world, providing a rich
treasure of information about our ecosphere. México is the center of origin and
diversification of many food crops of major commercial importance such as dry beans,
corn, tomatoes, squash, chili, cocoa, vanilla, avocado and many others (Otero, 2007). But
there are many challenges to implementing standards to protect this natural heritage.

One of the most important challenges concerns possible gene flow through
outcrossing between transgenic crops, their landraces, and wild relatives. Specifically,
the effects of transgenic maize outcrossing with traditional maize landraces and wild
relatives such as Tripsacum and teosinte are virtually unknown (Hernández-Suárez et
al., 2008).

The presence of transgenes in some maize landraces in México was confirmed in


2001 by studies sponsored by the Mexican government. Quist and Chapela (2001; 2002)
reported the presence of genes from Bacillus thuringiensis (Bt), in maize landraces in the
Sierra Juarez region of the Mexican State of Oaxaca. Bt is a soil bacterium used to create
maize that is resistant to some insects. The results of this research motivated the Ministry
of Agriculture, Livestock, Rural Development, Fisheries and Food in México (SAGARPA)
to promote more research to verify the presence of transgenes in native maize in México.
Herrera (2002) reported that 44% of the samples studied tested positive for the presence
of the Cauliflower Mosaic Virus (CaMV-35S) promoter. The CaMV-35S promoter is
used to create transgene insertion cassettes for maize herbicide or insect resistance. The
ETC group (ETC group, 2003a, b) reported that 49% of the samples evaluated in this
region were positive for the CaMV-35S promoter at the protein level. In contrast,
Ortiz-García et al. (2005a; 2005b), who sampled maize landraces in the same region of
Oaxaca State, failed to detect transgenes, also termed genetically modified organisms
(GMOs). Two recent studies in México showed the presence of GMOs in the southeast
and west-central regions of the country. One study found that 3.1% and 1.8% of the
studied samples tested positive for

the presence of transgenic maize (Dyer et al., 2009). Another study found evidence of
the presence of transgenic maize in 1.1% of samples using polymerase chain reaction
(PCR) and 0.89% using the Southern blot test (Piñeyro-Nelson et al., 2009). Also,
Carreón-Herrera (2011) reported that 36% of the studied samples tested positive for the
presence of transgenes (with the CaMV-35S promoter) in native maize in the Sierra
Nororiental and Mixteca Baja regions of the State of Puebla. Based on this work, there
is little doubt that transgenes are present in native maize in México and have likely
spread to other areas.

A probable pathway of transgene spread is through imported transgenic grain that may
be experimentally planted by small-scale farmers. Indeed, small-scale farmers are known
to occasionally plant Diconsa seeds (Diconsa S. A. de C. V.) adjacent to their local
landraces. Cross-pollination can occur between modern cultivars and landraces that
flower at the same time and grow near each other. Farmers save and trade seed, some of
which may be transgenic, and thus the cycle of gene flow is repeated and transgenes
spread further. Novel alleles introduced by gene flow may or may not persist in recipient
populations depending on: (1) whether gene flow is a one-time or recurrent event, (2) the
rate of gene flow, and (3) whether the novel allele is locally detrimental, beneficial, or
neutral, and on the size of the recipient population. These principles apply to both
conventional genes and transgenes (CCA, 2004).

When testing for GMOs, two distinct statistical activities should be emphasized. The
first is determining the optimal sampling strategy and sample size to be used when
sampling plants from a field (Cleveland et al., 2005; Ortiz-García et al., 2005c). The
second is determining the sample preparation and testing method to be used in the
laboratory (Remund et al., 2001). The sensitivity and specificity of the tests are important
factors that may affect the rates of false negative and false positive results (Remund et
al., 2001). Usually it will be necessary to collect a large number of plants from a
reference population located in the region of interest, since the frequency of GMO is
likely to be very low. Because laboratory tests are expensive, it is not feasible to analyze
all individual plants (Hernández–Suárez, et al., 2008). There are several testing plans for

reducing the number of samples to be analyzed (Montgomery, 2002). Group testing of


pooled plant samples is one such method (Remund et al., 2001).

Group testing consists of performing a diagnostic test on the pooled material of s
individuals instead of performing a test for each of the s individuals. Federer (1991)
reported that the group testing method can be used when: (1) the trait is discrete, i.e., it
can be measured as presence or absence, or as some countable quantity; (2) the
proportion of positives (e.g., GMOs) is relatively small; and (3) pooling the samples
does not alter the characteristics of the individual samples.
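As a brief illustration (a sketch, not code from this dissertation), the expected number of tests per individual under Dorfman's two-stage plan is 1/s + [1 − (1 − p)^s]: one pooled test shared by s individuals, plus s retests whenever the pool is positive. The values p = 0.005 and s = 10 below are illustrative only:

```python
# Expected number of tests per individual under Dorfman two-stage
# group testing: one test per pool of size s, plus s individual
# retests whenever the pool is positive (perfect test assumed).
def dorfman_tests_per_individual(p: float, s: int) -> float:
    prob_pool_positive = 1.0 - (1.0 - p) ** s
    return 1.0 / s + prob_pool_positive

# With a rare trait (p = 0.5%) and pools of 10 individuals, about
# 0.149 tests per individual are needed on average, i.e. roughly an
# 85% reduction relative to testing every individual.
print(dorfman_tests_per_individual(p=0.005, s=10))
```

Savings of this size for rare traits are the same mechanism behind the >80% reductions reported by Remund et al. (2001).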

The group testing technique of Dorfman (1943) can be very effective for reducing
the number of diagnostic tests (Federer, 1991). However, a major disadvantage of the
Dorfman group testing plan applied to testing for the presence of transgenic corn is that
it is sensitive to the dilution effects that arise when the group is formed. This is
particularly true for large group sizes, where the transgenic kernels in the pool can be
diluted below the sensitivity of the analysis, which causes the rate of false negatives to
increase. However, incorporating the sensitivity and specificity of the diagnostic test
into the probability of a positive pool can reduce the problems associated with false
negatives and false positives.
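With sensitivity Se and specificity Sp, the probability that a pool tests positive becomes Se·θ + (1 − Sp)·(1 − θ), where θ = 1 − (1 − p)^s is the probability that the pool contains at least one positive individual. A minimal sketch (the Se and Sp values are hypothetical, not estimates from any assay discussed here):

```python
def prob_pool_positive(p: float, s: int, sensitivity: float = 1.0,
                       specificity: float = 1.0) -> float:
    """Probability that a pool of s individuals tests positive when the
    individual prevalence is p, allowing for an imperfect diagnostic test."""
    theta = 1.0 - (1.0 - p) ** s      # P(pool contains at least one positive)
    return sensitivity * theta + (1.0 - specificity) * (1.0 - theta)

# Illustrative values only: Se = 0.95, Sp = 0.99 are hypothetical.
perfect = prob_pool_positive(0.01, 10)
imperfect = prob_pool_positive(0.01, 10, sensitivity=0.95, specificity=0.99)
print(f"perfect test: {perfect:.4f}, imperfect test: {imperfect:.4f}")
```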

Survey designs for pooled data are available for choosing the appropriate number of
pools given a specific pool size. However, these methods assume that the distribution of
GMOs in the population is uniform and that the selected sample size is small relative to
the reference population. Specifically, acceptance sampling and testing are based on the
binomial distribution (Kay and Van den Eede, 2001; Kay and Paoletti, 2002;
USDA/GIPSA, 2000a; USDA/GIPSA, 2000b). In reality, however, the distribution of
GMOs in the reference population is not uniform, and selecting the required sample
size under simple random sampling is not practical.
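Under the binomial model just mentioned, the number of plants needed to detect at least one GMO with a given confidence has a simple closed form, since P(at least one positive among n) = 1 − (1 − p)^n. The sketch below uses illustrative numbers, not values from the cited standards:

```python
import math

def seeds_for_detection(p: float, confidence: float) -> int:
    """Smallest n with P(at least one GMO among n sampled plants) >= confidence,
    under the binomial model: solve 1 - (1 - p)^n >= confidence for n."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

# Detecting a 0.1% GMO frequency with 95% confidence requires
# roughly three thousand plants, illustrating why individual
# laboratory tests quickly become infeasible.
print(seeds_for_detection(p=0.001, confidence=0.95))
```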

Usually, researchers use a two- or three-stage sampling process with stratification to
obtain the plants to be studied. For example, with a three-stage sampling process, the
target population is divided into regions, a sample of localities is taken in each region, a
sample of fields is taken in each locality, and a sample of plants is taken in each field.

Then the plants of each field are randomly assigned to pools, and a diagnostic test is used
to classify each pool as positive or negative for transgenes. For this reason, a survey
design methodology should be used for choosing the appropriate optimal values of
number of localities, fields and pools of size s per field.

There are methods for the analysis of pooled data for one- and two-stage sampling
plans, but they assume that the probabilities of selection are equal. However, when the
sampling process uses probability proportional to size or is informative, i.e., the
response variable is related to the probabilities of selection even after conditioning on
model covariates, the currently available group testing methods can produce severely
biased results because the sample model differs from the population model. For the
design or analysis of a three-stage sampling process for group testing, nothing has been
reported in the literature for either non-informative or informative sampling. For this
reason, I will develop methods for the design (two and three stages) and analysis (two
stages) of group testing data under informative and non-informative sampling to
guarantee precision and small bias in the parameter estimates.

1.2 Problem Statement

From 2001 to 2013, at least 10 surveys have been conducted for detecting and
quantifying the presence of transgenic corn in México. Most of the surveys are designed
in two or three stages with stratification. For example, when a researcher wants to
quantify the presence of transgenic corn in a specific locality, the sampling frame is
usually composed of a list of fields in this locality (this list represents the primary
sampling units, PSU), and a sample of m fields is selected from the total N fields in the
sampling frame with probability proportional to size (PPS) or with simple random
sampling (SRS). Then a sample of n_i plants is selected under SRS from field i,
i = 1, 2, ..., m. Next, the n_i plants of each field are randomly formed into g_i pools of
size s, where n_i = g_i s. Finally, each pool is subjected to a diagnostic laboratory test
for detecting the presence or absence of a specific promoter of transgenic corn [e.g.,
Cauliflower Mosaic Virus (CaMV-35S), Bacillus thuringiensis (Bt), etc.]. After the
classification of each pool as either positive or negative for transgenic material, the
researcher needs to estimate the

proportion of transgenic plants. However, since group testing is used and the sampling
plan is in two stages with unequal probabilities of selection, there is no clear estimation
methodology. For this reason, most researchers only report the percentage of positive
pools per field and the percentage of fields with the presence of transgenic corn.
Unfortunately, under this approach they ignore the sampling structure, and these
estimates are not useful for the entire population of fields. A new method of analysis is
needed for the detection and estimation of the presence of transgenic corn. The problem
of design and analysis is more complex when researchers want to quantify the presence
of transgenes for a state composed of many localities, because in this case they must
first take a sample of localities and then apply the same sampling process described
above within each locality.
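For a single field analyzed in isolation under SRS with a perfect test, the usual group testing estimate of the proportion inverts P(positive pool) = 1 − (1 − p)^s at the observed fraction of positive pools. The sketch below is illustrative only; applying it while ignoring the multistage structure and unequal selection probabilities is exactly the practice this work sets out to improve on:

```python
def naive_prevalence_estimate(y_positive: int, g: int, s: int) -> float:
    """Single-stage group testing estimate of individual prevalence:
    invert P(pool positive) = 1 - (1 - p)^s using the observed
    fraction of positive pools y/g (perfect diagnostic test assumed)."""
    theta_hat = y_positive / g
    return 1.0 - (1.0 - theta_hat) ** (1.0 / s)

# 3 positive pools out of 50 pools of size 10 (hypothetical counts)
print(round(naive_prevalence_estimate(3, 50, 10), 5))
```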

Design methods for group testing that have been used so far for detecting transgenes
calculate the required sample size under the binomial model but ignore the two-stage
sampling process and group testing. Recently, group testing methods under SRS have
been developed to calculate the required sample size for a given power or precision, but
they have not been used much (Yamamura and Hino, 2007; Hernández-Suárez et al.,
2008). These methods, developed for SRS, provide analytical expressions or
computational solutions for calculating the appropriate number of pools (g) for a given
pool size (s), but these sample size values are not valid when the sampling process is
performed in two or more stages.
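For reference, under SRS with a perfect test, a delta-method approximation gives Var(p̂) ≈ [1 − (1 − p)^s](1 − p)^(2−s)/(g s²), so the number of pools g for a desired confidence-interval half-width follows directly. This is a generic sketch of that calculation, not the exact expressions of the methods cited above, and the inputs are illustrative:

```python
import math

def pools_needed(p: float, s: int, half_width: float, z: float = 1.96) -> int:
    """Approximate number of pools of size s so that a z-level CI for p has
    the requested half-width, using the delta-method variance of the group
    testing estimator under SRS with a perfect diagnostic test:
        Var(p_hat) ~= [1 - (1-p)^s] * (1-p)^(2-s) / (g * s^2)."""
    var_times_g = (1.0 - (1.0 - p) ** s) * (1.0 - p) ** (2 - s) / s ** 2
    return math.ceil(z ** 2 * var_times_g / half_width ** 2)

# Pools of 10, anticipated prevalence around 1%, +/- 0.5% precision
print(pools_needed(p=0.01, s=10, half_width=0.005))
```

With s = 1 the formula reduces to the familiar binomial sample size z²p(1 − p)/d², which is a useful sanity check on the approximation.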

Methods for the analysis of multistage group testing surveys should take into
account the possibility that the sampling process is informative. Informative sampling is
likely to occur if the fields or localities are sampled using PPS and the size of the field
or locality is related to the presence of transgenic corn. This is likely since most
commercial corn fields are very large and have a higher probability that the target
transgene will be found compared to smaller noncommercial fields. Ignoring the
informative process and using conventional analysis has been shown to severely bias
the parameter estimates, since the sample distribution differs from that of the
population. For this reason, it is important to use appropriate methodology for optimal
sampling and analysis of multistage group testing surveys that takes into account the
complex structure
of the sampling process, the informative process and the sensitivity and specificity of the
diagnostic test. However, such methodology is not currently available.

In this dissertation, I will derive optimal sampling plans for group testing in two and
three stages under a budget constraint. In the context of two-stage sampling, I will obtain
the optimal values of the number of fields (clusters) ($m$) and pools per cluster ($g$), given
a pool size ($s$), the cost of enrolling a field in the study ($c_2$), the cost of enrolling a pool of
size $s$ in the study ($c_1$), and the total available budget ($C$). I will also consider the dual
problem of finding $m$ and $g$ to minimize total cost subject to a restriction on the variance
of the estimated proportion, $V(\hat{\pi})$. In the three-stage sampling process, I will obtain the
optimal values of the number of localities ($l$), fields per locality ($m$) and pools per field
($g$) given $s$, $c_1$, $c_2$, $C$ (or $V(\hat{\pi})$), and the cost of enrolling a locality in the study ($c_3$). I
will also develop methods to obtain the optimal pool size ($s$) and sampling plan
simultaneously.

Also, I will develop methods of analysis for group testing data under two-stage sampling with
unequal selection probabilities and informative sampling. To incorporate the
unequal selection probabilities and weights, I will use a pseudo-maximum likelihood
(PML) approach. The PML approach uses appropriate weights to estimate the
population likelihood equation that would be obtained in the case of a census. However, for
mixed models the PML approach is challenging, since the different sampling weights for
each stage appear nonlinearly in different places in the PML estimating equations. With
pooled data, appropriate incorporation of the weights is more difficult because
individual-level information is not available, only that of each pool. Also, use of the
mixed model PML approach is not straightforward, since we may need to use scaled
weights: using raw weights when the sampling process is informative often produces
more bias than ignoring the weights altogether.

1.3 Research Objectives

Throughout this study, the following research objectives will be addressed:


1. To develop methods for the design of two and three stage group testing sample
   surveys under equal and unequal sample sizes per stage, assuming a non-informative
   sampling process.
2. To develop methods for the analysis of two stage group testing sample surveys
under probability proportional to size and presence of informative sampling.
3. To apply the methods to detect and estimate the prevalence of transgenic corn
   plants in México.

1.4 References

Carreón-Herrera, N. I. (2011). Detección de transgenes en variedades nativas de maíz en


dos regiones del estado de Puebla. Unpublished doctoral dissertation, Colegio de
Postgraduados, Campus Puebla, Puebla, México. Available at:
http://www.biblio.colpos.mx:8080/xmlui/bitstream/handle/10521/439/Carreon_Her
rera_NI_MC_EDAR_2011.pdf?sequence=1
CCA. (2004). Maíz y biodiversidad: los efectos del maíz transgénico en México:
conclusiones y recomendaciones. Comisión para la Cooperación Ambiental.
Retrieved July 29, 2007, from:
http://www.cec.org/maize/index.cfm?varlan=espanol.
Cleveland, D. A., Soleri, D., Aragón, C. F., Crossa, J. and Gepts, P. (2005). Detecting
(trans)gene flow to landraces in centers of crop origin: lessons from the case of
maize in México. Environmental Biosafety Research, 4: 197-208.
Dyer, G.A., Serratos-Hernández, J.A., Perales, H.R., Gepts, P., Piñeyro-Nelson, A.,
Chavez, A., Salinas-Arreortua, N., Yúnez-Naude, A., Taylor, J.E. and Alvarez-
Buylla, E.R. (2009). Dispersal of transgenes through maize seed systems in
México. PLoS ONE 4(5):e5734.
Dodd, R., E. Notari, S. Stramer. (2002). Current prevalence and incidence of infectious
disease markers and estimated window-period risk in the American Red Cross
donor population. Transfusion 42: 975-979.
ETC Group (Action Group on Erosion, Technology and Concentration) (2003a).
Contamination of Bt Genetically Modified Maize in México Much Worse Than
Feared. ETC Group, México City, México. www.etcgroup.org.
ETC Group (Action Group on Erosion, Technology and Concentration) (2003b). Open
Letter from International Civil Society Organizations on Transgenic Contamination
in the Centers of Origin and Diversity. ETC Group, México City, México
www.etcgroup.org.
Herrera L. A.; Joffre y G., A.; Álvarez, M. A.; Huerta, E.; Arriaga, L.; Soberón, J.; Ortiz,
S.; Aldama, A.; Cotero, M. A.; Serratos, J. A. (2002). Primer informe sobre análisis
de la presencia de maíz transgénico en Oaxaca y Puebla. Secretaría de Agricultura,
Ganadería, Desarrollo Rural, Pesca y Alimentación/ Dirección General de Sanidad
Vegetal/Comité Intersecretarial de Bioseguridad de Organismos Genéticamente
Modificados. Unpublished, 20 pp.
Piñeyro-Nelson, A., van Heerwaarden, J., Perales, H.R., Serratos-Hernández, J.A.,
Rangel, A. et al. (2009). Transgenes in Mexican maize: molecular evidence and
methodological considerations for GMO detection in landrace populations. Mol.
Ecol. 18: 750-761.
Dorfman, R. (1943). The detection of defective members of large populations. The
Annals of Mathematical Statistics 14 (4): 436-440.
Federer, W. (1991). Statistics and society: Data collection and interpretation. New York:
Marcel Dekker.
Pfeffermann, D., Da Silva Moura, F.A., and Do Nascimento Silva, P.L. (2006). Multi-
level modelling under informative sampling. Biometrika, 93, (4): 943-959.

Hernández-Suárez, C.M., Montesinos-López, O. A., McLaren, G. and Crossa, J. (2008).


Probability models for detecting transgenic plants. Seed Science Research, 18: 77-
89.
Kay, S. and Paoletti, C. (2002). Sampling strategies for GMO detection and/or
quantification. Institute for Health and Consumer Protection (IHCP) Food
Products Unit: GMO Food and Environment, 4(2).
Kay, S. and Van den Eede, G. (2001). The limits of GMO detection. Nature
Biotechnology 19: 405.
Ortiz-García, S., Ezcurra, E., Schoel, B., Acevedo, F., Soberón, J. and Snow, A.A.
(2005a). Absence of detectable transgenes in local landraces of maize in Oaxaca,
México (2003-2004). Proc. Natl. Acad. Sci. USA, 102:12338–12343.
Ortiz-García, S., Ezcurra, E., Schoel, B., Acevedo, F., Soberón, J. and Snow, A. A.
(2005b), Correction. Proc. Natl. Acad. Sci. USA, 102: 18242.
Ortiz-García, S., Ezcurra, E., Schoel, B., Acevedo, F., Soberón, J. and Snow, A.A.
(2005c). Reply to Cleveland et al.'s "Detecting (trans)gene flow to landraces in
centers of crop origin: lessons from the case of maize in México." Environ.
Biosafety Res., 4: 209-215.
Otero–Arnaiz, A. (2007). La importancia de tener una Red de Monitoreo (ambiental) de


OGM en México. Dirección General de Investigación en Ordenamiento Ecológico y
Conservación de los Ecosistemas. Trabajo presentado en el Primer Taller de
Monitoreo de Organismos Genéticamente Modificados. Retrieved January 26,
2010, from:
http://www2.ine.gob.mx/bioseguridad/descargas/1ertallermonitoreo_adriana_otero.
pdf
Peck, C. (2006). Going after BVD. Beef, 42: 34-44.
Quist, D. and Chapela, I.H. (2001). Transgenic DNA introgressed into traditional maize
landraces in Oaxaca, México. Nature, 414: 541–543.
Quist, D. and Chapela, I. H. (2002). Quist and Chapela reply. Nature, 416: 602.
Remlinger, K., J. Hughes-Oliver, S. Young, R. Lam. (2006). Statistical design of pools
using optimal coverage and minimal collision. Technometrics 48:133-143.
Remund, K.M., Dixon, D.A., Wright, D.L., Holden, L.R. (2001). Statistical
considerations in seed purity testing for transgenic traits. Seed Science Research
11:101-120.
Suslow, T.V., Thomas B. R. Bradford K. J. (2002). Biotechnology Provides New Tools
for Plant Breeding. Agricultural biotechnology in California series publication
8043. (http://sbc.ucdavis.edu/files2/18120.pdf).
USDA/GIPSA (2000a) Sampling for the detection of biotech grains. Washington, DC,
United States Department of Agriculture. Retrieved July 7, 2010, from:
http://archive.gipsa.usda.gov/biotech/sample2.htm.
USDA/GIPSA (2000b) Practical application of sampling for the detection of biotech
grains. Washington, DC, United States Department of Agriculture. Retrieved
July 7, 2010, from: http://archive.gipsa.usda.gov/biotech/sample1.htm.
Verstraeten, T., B. Farah, L. Duchateau, R. Matu. (1998). Pooling sera to reduce the cost
of HIV surveillance: a feasibility study in a rural Kenyan district. Tropical
Medicine and International Health 3:747-750.
Yamamura, K. and Hino, A. (2007). Estimation of the proportion of defective units by
using group testing under the existence of a threshold of detection.
Communications in Statistics - Simulation and Computation, 36: 949-957.
Chapter 2: Optimal sample sizes for group testing in two stage sampling
Abstract
Optimal sample sizes under a budget constraint for estimating a proportion in a two-stage
sampling process have been derived using individual testing. However, when group
testing is used, these optimal sample sizes are not appropriate. In this study, optimal
sample sizes at the cluster and individual levels are derived for group testing. First,
optimal allocations of clusters and individuals are obtained under the assumption of equal
cluster sizes. Second, we obtain the relative efficiency (RE) of unequal versus equal
cluster sizes when estimating the average population proportion $\tilde{\pi}$. By multiplying the
number of clusters obtained assuming equal cluster sizes by the inverse of the RE, we
adjust the sample size required in the context of unequal cluster sizes. We also show the
adjustments needed for the correct allocation of clusters and individuals in order to
estimate the required budget and achieve a certain power or precision.

Key words: group testing, optimal sample size, relative efficiency, power, precision.

2.1 Introduction

Group testing is becoming increasingly popular because it can substantially reduce the
required number of diagnostic tests compared to individual testing. Dorfman (1943)
proposed the original group testing method in which g pools of size s are randomly
formed from a sample of n individuals selected from the population using simple random
sampling (SRS). Dorfman’s method has been extended in many ways. For example, there
are group testing regression models for fixed effects, for mixed effects, for multiple-disease
group testing data, with imperfect diagnostic tests [with sensitivity ($S_e$) or
specificity ($S_p$) less than 1, or with dilution effects], and non-parametric group testing
methods, among others (Zhang et al., 2013; Chen et al., 2009; Hernández-Suárez et al.,
2008; Yamamura and Hino, 2007).
Group testing methods have been used to detect diseases in potential donors (Dodd et
al., 2002), to detect drugs (Remlinger et al., 2006), to estimate and detect the prevalence
of human (Verstraeten et al., 1998), plant (Tebbs and Bilder, 2004) and animal (Peck,
2006) diseases, to detect and estimate the presence of transgenic plants (Hernández-
Suárez et al., 2008; Yamamura and Hino, 2007), to solve problems in information theory
(Wolf, 1985) and even in science fiction (Bilder, 2009). When individuals are not nested
within clusters, the issue of the number of pools the sample should have to achieve a
certain power or precision for estimating the proportion of interest $\tilde{\pi}$ has been solved
(Yamamura and Hino, 2007; Hernández-Suárez et al., 2008; Montesinos-López et al.,
2010, 2011). In practice, however, populations often have a multilevel structure with
individuals nested within clusters that may themselves be nested within higher order
clusters. For example, in the detection of transgenic corn in México, sampled plants are
nested in fields, which are nested in geographical areas. For such surveys, at least two
stages may arise, and outcomes within the same cluster tend to be more alike than
outcomes from different clusters. To account for such correlated outcomes, more clusters
are needed to achieve the same precision as SRS which generates outcomes that are
independent (Moerbeek, 2006).

Multistage surveys are often justified because it is difficult or impossible to obtain a


sampling frame or list of individuals, or it may be too expensive to take an SRS. For
example, it would not be possible to take an SRS of corn plants in México due to travel
costs between fields. Instead of an SRS, multistage or cluster sampling methods
would typically be employed in this situation. Sampling units of two or more sizes are
used where larger units called clusters or primary sampling units (PSUs) are selected
using a probability sampling design. Then some or all of the smaller units, called
secondary sampling units (SSUs) are selected from each PSU in the sample. In the
example of sampling for transgenic corn, PSU=field and SSU=plant. This design would
be less expensive to implement than a SRS of individuals due to the reduction in travel
costs. Also cluster sampling does not require a list of households or persons for the entire
country. Instead, a list is constructed for the PSUs selected to be in the sample (Lohr,
2008).
In a non group testing context, optimal sample size gives the most precise estimate of
the proportion of interest and the largest test power or precision given a fixed sampling
budget (Van Breukelen et al., 2007). It can also be defined as the cheapest sample size
that gives a certain power or precision of the estimate of interest (Van Breukelen et al.,
2007). It is less costly to sample a few clusters with many individuals per cluster than
many clusters with just a few individuals per cluster because sampling in an already
selected cluster may be less expensive than sampling in a new cluster (Moerbeek et al.,
2000). However, simulation studies in a non-group testing context indicate that it is is
more important to have a larger number of clusters than a larger number of individuals
per cluster (Maas and Hox, 2004). In a group testing context, no work has been published
on the optimal sample size in two stage sampling given a specified sampling budget.
Thus new methods are needed to determine the required number of clusters and pools
per cluster given a certain budget for obtaining a desired precision for estimating the
proportion of interest using group testing.

Often, optimal sample size calculations for multistage sampling assume
equal cluster sizes (equal number of individuals per cluster). However, in practice, there
are large discrepancies in cluster sizes, and ignoring this imbalance in cluster size could
have a major impact on the power and precision required for the parameter estimates. For
this reason, sample size formulas have to be adjusted for varying cluster sizes. One
approach used to compensate for this loss of efficiency is to develop correction factors to
convert the variance of equal cluster size into the variance of the unequal cluster size
(Moerbeek et al., 2001; Van Breukelen et al., 2007, 2008; Candel et al., 2010). This
correction factor is normally constructed as the inverse of the relative efficiency (RE) that
is calculated as the ratio of the variances of the parameter of interest of equal versus
unequal cluster sizes. This RE concept has been used in mixed-effects models for
continuous and binary data to study loss of efficiency due to varying cluster sizes in a
non-group testing context for the estimation of fixed parameters and for variance
components (Van Breukelen et al., 2007, 2008; Candel et al., 2008, 2010). In the group
testing framework, the RE concept has not been used to adjust optimal sample sizes
under the assumption of equal cluster sizes.

In this study, we obtain optimal sample sizes in two stages in a group testing context
using a multilevel logistic group testing model where we assume clusters are randomly
sampled from a large number of clusters. First, under the assumption that cluster sizes do
not vary, we derive analytical expressions for the optimal allocation of clusters and
individuals under a budget constraint. These analytical expressions were derived by
linearization using a first-order marginal quasi-likelihood to approach the multilevel
logistic group testing model. Although equal sample sizes per cluster are generally
optimal for parameter estimation, they are rarely feasible. For this reason, we derived an
approximate formula for the relative efficiency of unequal versus equal cluster sizes for
adjusting the required sample sizes for estimating the proportion in a group testing
context. The approximate RE obtained is a function of the mean, the variance of cluster
size and the intraclass correlation. The proposed expressions are also useful to estimate
the budget required to achieve a certain power or precision when the goal is to achieve a
confidence interval of a certain width or to obtain a prespecified power for a given
hypothesis.

This article is organized as follows: Section 2.2 gives the random logistic model for
individual testing. Section 2.3 describes the random logistic model for group testing.
Section 2.4 provides an approximate marginal variance of the proportion. In Section 2.5,
we derive the optimal sample size assuming equal cluster size. Section 2.6 describes the
behavior of the optimal sample sizes obtained from equal cluster sizes. Section 2.7 gives
a correction factor for unequal cluster sizes. Section 2.8 makes a comparison of the
relative efficiency and its Taylor approximation. Section 2.9 gives an example for
estimating the proportion of transgenic plants. Section 2.10 provides tables for sample
size determination, and finally Section 2.11 contains the discussion and conclusions.

2.2 Random logistic model for individual testing


In the context of individual testing, the standard random logistic model is obtained by
conditioning on all fixed and random effects and assuming that the responses $y_{ij}$ are
independent and Bernoulli distributed with probabilities $\pi_i$, and that these probabilities
are not related to any covariate (Moerbeek et al., 2001a). Then the linear predictor using
a logit link is equal to

$$\eta_i = \mathrm{logit}(\pi_i) = \ln\!\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + b_i \qquad (1)$$

where $\eta_i$ is the linear predictor, formed from a fixed part ($\beta_0$) and a random part
($b_i$), with the $b_i$ iid Gaussian with mean zero and variance $\sigma_b^2$. Therefore, equation (1) can
be written in terms of the probability of a positive individual as

$$\pi_i = \pi_i(\beta_0, \sigma_b^2) = \left[1 + \exp\{-(\beta_0 + b_i)\}\right]^{-1} \qquad (2)$$

The mixed logit model for binary responses can be written as the probability $\pi_i$ plus
a level-1 residual denoted $e_{ij}$:

$$y_{ij} = \pi_i + e_{ij}$$

where $e_{ij}$ has zero mean and variance $\mathrm{Var}(y_{ij} \mid b_i) = \pi_i(1-\pi_i)$ [Goldstein (1991, 2003);
Rodriguez and Goldman (1995); Candy (2000); Moerbeek et al. (2001); Candel et al.
(2010); Skrondal and Rabe-Hesketh (2007)]. This model is widely used for the estimation
of optimal sample sizes when the variance components are assumed known (Goldstein,
1991, 2003; Rodriguez and Goldman, 1995; Candy, 2000; Moerbeek et al., 2001a).
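
The data-generating process in equations (1) and (2) is easy to simulate, which is a useful check on the notation. The following sketch is our own illustration (the parameter values and function name are arbitrary, not taken from the dissertation):

```python
import math
import random

random.seed(1)

def simulate_fields(m, n_per_field, beta0, sigma_b):
    """Simulate the two-level logistic model: eta_i = beta0 + b_i with
    b_i ~ N(0, sigma_b^2), pi_i = 1/(1+exp(-eta_i)), and n_per_field
    Bernoulli(pi_i) plant outcomes within each field i."""
    data = []
    for _ in range(m):
        b_i = random.gauss(0.0, sigma_b)                      # field effect
        pi_i = 1.0 / (1.0 + math.exp(-(beta0 + b_i)))         # equation (2)
        y = [1 if random.random() < pi_i else 0 for _ in range(n_per_field)]
        data.append((pi_i, y))
    return data

fields = simulate_fields(m=50, n_per_field=20, beta0=-3.0, sigma_b=0.5)
overall = sum(sum(y) for _, y in fields) / (50 * 20)   # close to E(pi_i)
```

Because the field effect $b_i$ is shared by all plants in a field, outcomes within a field are correlated, which is exactly why two-stage designs need more clusters than an SRS of the same size.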

2.3 Random logistic model for group testing

Suppose that, within the $i$th field, each plant is randomly assigned to one of the $g_i$ pools,
and let $y_{ijk} = 0$ if the $k$th plant in the $j$th pool of field $i$ is negative and $y_{ijk} = 1$ otherwise,
for $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, g_i$ and $k = 1, 2, \ldots, s_{ij}$, where $s_{ij}$ is the pool size. Note that $y_{ijk}$ is not
observed, except when the pool size is 1. Thus, define the random binary variable $z_{ij}$ that
takes the value $z_{ij} = 1$ if the $j$th pool in field $i$ tests positive and $z_{ij} = 0$ otherwise.
Therefore, the two-level generalized linear mixed model (Breslow and Clayton, 1993;
Rabe-Hesketh, 2006) for the response $z_{ij}$ is exactly the same as that given for individual
testing in equation (1). Conditional on the random effect $b_i$, the statuses of pools within
field $i$ are independent, and assuming that the statuses of pools from different fields are
independent, the probability that the $j$th pool in field $i$ tests positive is given as

$$P(z_{ij} = 1 \mid b_i) = S_e + (1 - S_e - S_p)\prod_{k=1}^{s_{ij}} (1 - \pi_{ijk}) \qquad (3)$$

where $S_e$ and $S_p$ denote the sensitivity and specificity of the diagnostic test, respectively.
$S_e$ and $S_p$ are assumed constant and close to 1 (Chen et al., 2009). For simplicity in
planning the required sample, we will assume an equal pool size, $s$, in all clusters; under
this assumption equation (3) reduces to

$$P(z_{ij} = 1 \mid b_i) = \pi_i^{s} = S_e + k(1 - \pi_i)^{s} \qquad (4)$$

where $k = (1 - S_e - S_p)$. The mixed group testing logit model for binary responses can
be written as the probability $\pi_i^{s}$ plus a level-1 residual denoted $e_{ij}^{s}$:

$$z_{ij} = \pi_i^{s} + e_{ij}^{s} \qquad (5)$$

where $\pi_i^{s}$ is as given in equation (4) and $e_{ij}^{s}$ has zero mean and variance
$\mathrm{Var}(z_{ij} \mid b_i) = \pi_i^{s}(1 - \pi_i^{s})$. Now let $\theta = (\beta_0, \sigma_b^2)$ denote the vector of all estimable parameters. The
multilevel likelihood is calculated for each level of nesting. First, the conditional
likelihood for pool $j$ in field $i$ is given by

$$L_{ij}(\theta \mid b_i) = (\pi_i^{s})^{z_{ij}}\,(1 - \pi_i^{s})^{1 - z_{ij}} \qquad (6)$$

By multiplying the conditional likelihood (equation 6) by the density of $b_i$ and
integrating out the random effects, we get the marginal (unconditional) likelihood

$$L(\theta \mid \mathbf{z}) = \prod_{i=1}^{m} \int \prod_{j=1}^{g_i} L_{ij}(\theta \mid b_i)\, f(b_i)\, db_i$$

where $f(b_i)$ is the density function of $b_i$. Unfortunately, this unconditional likelihood is
intractable. There are various ways of approximating the marginal likelihood function.
Two of them are: (1) to use integral approximations such as Gaussian quadrature; and (2)
to linearize the nonlinear part using a Taylor series expansion (TSE) (Moerbeek et al.,
2001a; Breslow and Clayton, 1993). The marginal form of the generalized linear mixed
model (GLMM) is of interest here, since it expresses the variance as a function of the
marginal mean.
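
Equation (4) can be evaluated directly. The following sketch is our own illustration of the conditional pool-positivity probability (not code from the dissertation); note that with a perfect test it reduces to the classical group testing probability that the pool contains at least one positive plant:

```python
def pool_positive_prob(pi, s, se=0.99, sp=0.99):
    """P(z_ij = 1 | b_i) from equation (4): Se + k*(1-pi)**s, where
    k = 1 - Se - Sp.  With a perfect test (Se = Sp = 1) this reduces to
    1 - (1-pi)**s; with pi = 0 it reduces to the false positive rate 1-Sp."""
    k = 1.0 - se - sp
    return se + k * (1.0 - pi) ** s

# a perfect test recovers the classical group testing probability
p_perfect = pool_positive_prob(0.01, 10, se=1.0, sp=1.0)     # 1 - 0.99**10
# an imperfect test blends detection with the assay error rates
p_imperfect = pool_positive_prob(0.01, 10, se=0.95, sp=0.98)
```

The boundary behavior is a handy sanity check: at $\pi_i = 0$ the pool still tests positive with probability $1 - S_p$, which is what makes specificity matter in low-prevalence surveys.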

2.4 Approximate marginal variance of the proportion


The marginal model can be fitted by integrating the random effects out of the log-likelihood
and maximizing the resulting marginal log-likelihood or, alternatively, by
using an approximate method based on TSE (Breslow and Clayton, 1993). Next, $\pi_i^{s}$ is
approximated using a first-order TSE around $b_i = 0$, as

$$\pi_i^{s} \approx \pi_i^{s}\big|_{b_i=0} + \frac{\partial \pi_i^{s}}{\partial b_i}\bigg|_{b_i=0}(b_i - 0)$$

$$\pi_i^{s} \approx \tilde{\pi}^{s} - sk(1-\tilde{\pi})^{s-1}\,\tilde{\pi}(1-\tilde{\pi})\,b_i \qquad (7)$$

where $\tilde{\pi}^{s} = \pi_i^{s}\big|_{b_i=0} = S_e + k\left(1 - [1+\exp(-\beta_0)]^{-1}\right)^{s}$ and
$\tilde{\pi} = \pi_i\big|_{b_i=0} = [1+\exp(-\beta_0)]^{-1}$, since the $b_i$ are independent and identically
distributed (iid) and we use the fact that

$$\frac{\partial \pi_i}{\partial b_i} = \frac{\partial \pi_i}{\partial \eta_i}\frac{\partial \eta_i}{\partial b_i} = \pi_i(1-\pi_i)
\qquad \text{and} \qquad \frac{\partial \pi_i^{s}}{\partial \pi_i} = -sk(1-\pi_i)^{s-1}$$

Now, by substituting equation (7) in equation (5), we can approximate equation (5)
by

$$z_{ij} \approx \tilde{\pi}^{s} - sk(1-\tilde{\pi})^{s-1}\,\tilde{\pi}(1-\tilde{\pi})\,b_i + e_{ij}^{s} \qquad (8)$$

Therefore, the approximate marginal variance, based on a first-order TSE, of the
response of a pool is equal to

$$\mathrm{Var}_T(z_{ij}) \approx \left[sk(1-\tilde{\pi})^{s-1}\right]^2\left[\tilde{\pi}(1-\tilde{\pi})\right]^2\sigma_b^2 + \tilde{\pi}^{s}(1-\tilde{\pi}^{s})$$

where the variance of $e_{ij}^{s}$ was approximated by $\tilde{\pi}^{s}(1-\tilde{\pi}^{s})$. Note that
$\bar{z} = \frac{1}{mg}\sum_{i=1}^{m}\sum_{j=1}^{g} z_{ij}$ is a moment estimator of $E(\pi_i^{s})$ and its variance is equal to

$$\mathrm{Var}_T(\bar{z}) \approx \frac{\left[sk(1-\tilde{\pi})^{s-1}\right]^2\left[\tilde{\pi}(1-\tilde{\pi})\right]^2\sigma_b^2}{m}
+ \frac{\tilde{\pi}^{s}(1-\tilde{\pi}^{s})}{mg} \qquad (9)$$

Recall that we will select a sample of $m$ fields, assuming that the same number of
pools per field will be obtained, i.e., $g_i = \bar{g} = g$. Since the probability of success is not a
constant over trials but varies systematically from field to field, the parameter $\pi_i$ is a
random variable with a probability distribution. Therefore, it is reasonable to work with
the expected value of $\pi_i$ across fields to determine sample size. To approximate $E(\pi_i)$ we
take advantage of the relationship between $E(\bar{z})$ and $E(\pi_i^{s})$:

$$E(\bar{z}) = E(\pi_i^{s}) = E\left(S_e + k(1-\pi_i)^{s}\right) = E(S_e) + E\left(k(1-\pi_i)^{s}\right) = S_e + kE(A) \qquad (10)$$

where $A = (1-\pi_i)^{s}$. Using a first-order TSE around $b_i = 0$, we can approximate $A$ as

$$A \approx A\big|_{b_i=0} + \frac{\partial A}{\partial b_i}\bigg|_{b_i=0}(b_i - 0)$$

$$A \approx \tilde{A} - s(1-\tilde{\pi})^{s-1}\,\tilde{\pi}(1-\tilde{\pi})\,b_i \qquad (11)$$

where $\tilde{A} = A\big|_{b_i=0} = \left(1 - [1+\exp(-\beta_0)]^{-1}\right)^{s} = (1-\tilde{\pi})^{s}$ and we use the fact that

$$\frac{\partial \pi_i}{\partial b_i} = \pi_i(1-\pi_i) \qquad \text{and} \qquad \frac{\partial A}{\partial \pi_i} = -s(1-\pi_i)^{s-1}$$

Then

$$E(A) \approx \tilde{A}$$

But using a first-order TSE we can also obtain $\left(1 - E(\pi_i)\right)^{s} \approx (1-\tilde{\pi})^{s} = \tilde{A}$, and so

$$E(A) \approx \left(1 - E(\pi_i)\right)^{s}$$

That is, we approximate $E(A) = E\left[(1-\pi_i)^{s}\right]$ by $\left[1 - E(\pi_i)\right]^{s}$. This implies that
$E(\pi_i^{s}) \approx S_e + k\left(1 - E(\pi_i)\right)^{s}$, and since $\bar{z}$ is an estimator of $E(\pi_i^{s})$, an estimator
of $E(\pi_i)$ can be obtained from

$$S_e + k\left(1 - E(\pi_i)\right)^{s} \approx \bar{z}$$

Therefore, an estimator of $E(\pi_i)$ is

$$\widehat{E(\pi_i)} \approx 1 - \left(\frac{\bar{z} - S_e}{k}\right)^{1/s}$$

The variance of this estimator, $\mathrm{Var}\big(\widehat{E(\pi_i)}\big)$, can be approximated from the variance of $\bar{z}$
(equation 9) with a first-order TSE of the function $Q(x) = 1 - \left(\frac{x - S_e}{k}\right)^{1/s}$ around
$x = E(\pi_i^{s})$. After some algebra we get

$$\mathrm{Var}\left(\widehat{E(\pi_i)}\right) \approx \left[\frac{\partial Q(x)}{\partial x}\bigg|_{x = E(\pi_i^{s})}\right]^2 \mathrm{Var}_T(\bar{z})$$

where $\frac{\partial Q(x)}{\partial x} = -\frac{1}{sk}\left(\frac{x - S_e}{k}\right)^{\frac{1}{s}-1}$. However, since $E(\pi_i^{s})$ does not have a closed
exact form, we replace it by $\tilde{\pi}^{s}$ and obtain

$$\mathrm{Var}\left(\widehat{E(\pi_i)}\right) = V(\hat{\pi}) \approx \frac{\sigma_b^{2*}}{m} + \frac{V(z)}{mg}
= \frac{\left(\sigma_b^{2*} + V(z)\right)\left[(g-1)\rho + 1\right]}{mg} \qquad (12)$$

where $\sigma_b^{2*} = \left[\tilde{\pi}(1-\tilde{\pi})\right]^2\sigma_b^2$,
$V(z) = \dfrac{\tilde{\pi}^{s}(1-\tilde{\pi}^{s})}{s^2 k^2 (1-\tilde{\pi})^{2(s-1)}}$,
$\tilde{\pi}^{s} = S_e + k(1-\tilde{\pi})^{s}$, and
$\rho = \sigma_b^{2*}/\left[\sigma_b^{2*} + V(z)\right]$ is the intraclass correlation coefficient that measures the amount
of variance between clusters (fields).
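
To make equation (12) concrete, the sketch below (our own code, with illustrative parameter values) computes $\sigma_b^{2*}$, $V(z)$ and $\rho$, and verifies numerically that the two equivalent forms of equation (12) agree:

```python
def var_pi_hat(pi, sigma2_b, s, g, m, se=0.99, sp=0.99):
    """Approximate Var(pi_hat) from equation (12) for m clusters,
    g pools per cluster and pool size s."""
    k = 1.0 - se - sp
    pi_s = se + k * (1.0 - pi) ** s                   # pool-level probability
    sigma2_b_star = (pi * (1.0 - pi)) ** 2 * sigma2_b
    v_z = pi_s * (1.0 - pi_s) / (s**2 * k**2 * (1.0 - pi) ** (2 * (s - 1)))
    rho = sigma2_b_star / (sigma2_b_star + v_z)       # intraclass correlation
    v1 = sigma2_b_star / m + v_z / (m * g)
    v2 = (sigma2_b_star + v_z) * ((g - 1) * rho + 1) / (m * g)
    assert abs(v1 - v2) < 1e-12   # the two forms in equation (12) agree
    return v1

v = var_pi_hat(pi=0.01, sigma2_b=0.25, s=10, g=15, m=20)
```

The design-effect form on the right of (12) makes the role of $\rho$ explicit: for a fixed total number of pools, larger $\rho$ inflates the variance unless more clusters are sampled.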

2.5 Optimal sample size assuming equal cluster size

2.5.1 Minimizing variance subject to a budget constraint


Now assume we have a fixed sampling budget for estimating the average population
proportion $\tilde{\pi}$. The question of interest is how to allocate clusters ($m$) and pools per cluster
($g$) to estimate the proportion $\tilde{\pi}$ with minimum variance, subject to the budget constraint

$$C = c_1 gm + c_2 m \qquad (c_l > 0,\; g, m \geq 2,\; l = 1, 2) \qquad (13)$$

where $C$ is the total sampling budget available, $c_1$ is the cost of obtaining a pool of $s$
plants from a field, and $c_2$ is the cost of obtaining a cluster. The optimal allocation of
units can be obtained using Lagrange multipliers. By combining equations (12) and (13), we
obtain the Lagrangean

$$L(m, g, \lambda) = V(\hat{\pi}) + \lambda\left[C - (c_1 gm + c_2 m)\right] \qquad (14)$$

where $V(\hat{\pi})$, given by equation (12), is the objective function that will be minimized with
respect to $m$ and $g$, subject to the constraint given in equation (13), and $\lambda$ is the Lagrange
multiplier. The partial derivatives of equation (14) with respect to $\lambda$, $g$ and $m$ are

$$\frac{\partial L}{\partial \lambda} = 0 = C - (c_1 gm + c_2 m); \quad \text{then } m = \frac{C}{c_2 + g c_1}$$

$$\frac{\partial L}{\partial g} = 0 = -\frac{V(z)}{g^2 m} - \lambda c_1 m; \quad \text{then } \lambda = -\frac{V(z)}{g^2 m^2 c_1}$$

$$\frac{\partial L}{\partial m} = 0 = -\frac{\left[\tilde{\pi}(1-\tilde{\pi})\right]^2\sigma_b^2}{m^2} - \frac{V(z)}{m^2 g} - \lambda(c_1 g + c_2)$$

Solving these equations results in the optimal values for $m$ and $g$ (see Appendix 2.A):

$$m = \frac{C}{c_2 + g c_1}, \qquad \text{where } g = \frac{\sqrt{c_2 V(z)}}{\sqrt{c_1}\,\tilde{\pi}(1-\tilde{\pi})\,\sigma_b} \qquad (15)$$

First, we calculate the number of pools per field, $g$, rounded to the nearest
integer. Using this value, we calculate the number of fields to sample, $m$, rounded up to
the nearest integer. Note that equation (15) is a generalization of the optimal sample sizes for
continuous data in two-level sampling given by Brooks (1955) and Cochran (1977).
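
The closed-form allocation in equation (15) translates directly into code. The sketch below is our own illustration, with made-up costs $c_1$, $c_2$ and budget $C$ (it also enforces the $g, m \geq 2$ constraint from equation (13)):

```python
import math

def optimal_two_stage(pi, sigma2_b, s, c1, c2, C, se=0.99, sp=0.99):
    """Optimal pools per cluster g and clusters m from equation (15):
    g = sqrt(c2*V(z)) / (sqrt(c1)*pi*(1-pi)*sigma_b) rounded to the
    nearest integer, then m = C/(c2 + g*c1) rounded up."""
    k = 1.0 - se - sp
    pi_s = se + k * (1.0 - pi) ** s
    v_z = pi_s * (1.0 - pi_s) / (s**2 * k**2 * (1.0 - pi) ** (2 * (s - 1)))
    sigma_b_star = pi * (1.0 - pi) * math.sqrt(sigma2_b)
    g = max(2, round(math.sqrt(c2 * v_z / c1) / sigma_b_star))
    m = max(2, math.ceil(C / (c2 + g * c1)))
    return m, g

# e.g. pool cost c1 = 2, field (cluster) cost c2 = 50, budget C = 2000
m, g = optimal_two_stage(pi=0.01, sigma2_b=0.25, s=10, c1=2, c2=50, C=2000)
```

As expected from (15), a higher cluster cost $c_2$ relative to the pool cost $c_1$ pushes the design toward fewer clusters with more pools each.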

2.5.2 Minimizing the budget to obtain a certain width of the confidence interval
Many times the researcher is interested in choosing the number of clusters and pools per
cluster to minimize the total budget, $C$, subject to obtaining a specified width ($\omega$) of the
confidence interval (CI) of the proportion of interest. Assuming that the distribution of $\hat{\pi}$ is
approximately normal with mean $\tilde{\pi}$ and a fixed variance $\mathrm{Var}_T(\hat{\pi})$, the
$(1-\alpha)100\%$ Wald confidence interval of $\tilde{\pi}$ is given by $\hat{\pi} \mp z_{1-\alpha/2}\sqrt{\mathrm{Var}_T(\hat{\pi})}$, where $z_{1-\alpha/2}$
is the $1-\alpha/2$ quantile of the standard normal distribution. Therefore, the observed width
of the CI is $W = 2 z_{1-\alpha/2}\sqrt{\mathrm{Var}_T(\hat{\pi})}$, and since we specified the required width of
the CI to be $\omega$, this implies that $V(\hat{\pi}) = \omega^2/(4 z_{1-\alpha/2}^2)$. Here the optimization problem is to
minimize the sampling budget given in equation (13) under the
condition that $V(\hat{\pi})$ (equation 12) is fixed. That is, we want to minimize $C = c_1 gm + c_2 m$
subject to $V(\hat{\pi}) = V_0$. Again using Lagrange multipliers, the corresponding Lagrangean
is

$$L(m, g, \lambda) = c_1 gm + c_2 m + \lambda\left[V(\hat{\pi}) - V_0\right]$$

Now the partial derivatives of $L$ with respect to $\lambda$, $g$ and $m$ are

$$\frac{\partial L}{\partial \lambda} = 0 = \frac{\sigma_b^{2*}}{m} + \frac{V(z)}{mg} - V_0; \quad
\text{then } m = \left[\left[\tilde{\pi}(1-\tilde{\pi})\right]^2\sigma_b^2 + \frac{V(z)}{g}\right]\Big/ V_0$$

$$\frac{\partial L}{\partial g} = 0 = c_1 m - \lambda\frac{V(z)}{g^2 m}; \quad
\text{then } \lambda = \frac{c_1 g^2 m^2}{V(z)}$$

$$\frac{\partial L}{\partial m} = 0 = c_1 g + c_2 - \frac{\lambda}{m^2}\left[\left[\tilde{\pi}(1-\tilde{\pi})\right]^2\sigma_b^2 + \frac{V(z)}{g}\right]$$

Solving these equations for the optimal values gives (see Appendix 2.B):

$$m = \left[\left[\tilde{\pi}(1-\tilde{\pi})\right]^2\sigma_b^2 + \frac{V(z)}{g}\right]\Big/ V_0,
\qquad \text{where } g = \frac{\sqrt{c_2 V(z)}}{\sqrt{c_1}\,\tilde{\pi}(1-\tilde{\pi})\,\sigma_b} \qquad (16)$$

Note that the number of pools per cluster, $g$, required when we minimize cost
subject to $V(\hat{\pi}) = V_0$ is the same as when minimizing $V(\hat{\pi})$ (equation 14) subject to a budget
constraint. However, the expression for obtaining the required number of clusters, $m$, is
different. In this case the value $V_0 = \omega^2/(4 z_{1-\alpha/2}^2)$ is substituted into equation (16), and
the expression for the required number of clusters is

$$m = \frac{4 z_{1-\alpha/2}^2}{\omega^2}\left[\left[\tilde{\pi}(1-\tilde{\pi})\right]^2\sigma_b^2 + \frac{V(z)}{g}\right]$$

Another way of obtaining the same solution to this problem is given in Appendix 2.C.

It is useful to consider the problem without a budget constraint. For a fixed number
of pools per cluster ($g$), with CI width $\omega$, we can get the required number of clusters,
$m$, by setting

$$2 z_{1-\alpha/2}\sqrt{\frac{\left[\tilde{\pi}(1-\tilde{\pi})\right]^2\sigma_b^2}{m} + \frac{V(z)}{mg}} = \omega$$

and solving for $m$. The required $m$ is equal to

$$m = \frac{4 z_{1-\alpha/2}^2}{\omega^2}\left[\left[\tilde{\pi}(1-\tilde{\pi})\right]^2\sigma_b^2 + \frac{V(z)}{g}\right] \qquad (17)$$

Equation (17) is the same expression as derived in equation (16) for the
required number of clusters when minimizing the total budget subject to a variance
constraint. However, equation (17) produces the optimal allocation of clusters, $m$, only when
we substitute $g = \dfrac{\sqrt{c_2 V(z)}}{\sqrt{c_1}\,\tilde{\pi}(1-\tilde{\pi})\,\sigma_b}$ into equation (17).
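
For a fixed $g$, equation (17) gives the clusters required for a CI of total width $\omega$. The sketch below is our own illustration (parameter values are illustrative, not from the dissertation):

```python
import math
from statistics import NormalDist

def clusters_for_ci_width(pi, sigma2_b, s, g, width, alpha=0.05,
                          se=0.99, sp=0.99):
    """Required number of clusters m from equation (17) so that the
    (1-alpha) Wald CI for the proportion has total width `width`,
    for a fixed number of pools per cluster g."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    k = 1.0 - se - sp
    pi_s = se + k * (1.0 - pi) ** s
    v_z = pi_s * (1.0 - pi_s) / (s**2 * k**2 * (1.0 - pi) ** (2 * (s - 1)))
    sigma2_b_star = (pi * (1.0 - pi)) ** 2 * sigma2_b
    m = (4.0 * z**2 / width**2) * (sigma2_b_star + v_z / g)
    return math.ceil(m)

m = clusters_for_ci_width(pi=0.01, sigma2_b=0.25, s=10, g=15, width=0.01)
```

Because the between-cluster term $[\tilde{\pi}(1-\tilde{\pi})]^2\sigma_b^2$ does not shrink with $g$, widening each cluster's pool count eventually stops helping and only more clusters can narrow the interval further.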

2.5.3 Minimizing the budget to obtain a certain power


Assume a threshold is defined a priori, and our main interest is to test ¶ : .2 = .2 vs
¶ : .2 > .2 . For example, the European Union (Anonymous, 2003) requires that the
proportion of genetically modified (GM) seed impurities be lower than 0.005 in a seed
lot. Here the issue of interest is to determine a sampling plan (i.e. m and g) budget
21

required for this test to have a specified power (1 − ¸) and significance level ¬, when
δ = |.2 − .2 | . For performing a test with a type I error rate of α and a type II error rate
of ¸ when .2 = .2 under ¶ , the following is required to hold:
]J¯ = (./ − .2 )/ªºx (.
»)
 and ]J¼ = (.
/ − .2 )/ªºx (.
»)


Here ºx (.
»)
 is the variance of .
/ but under the value of the null hypothesis. Both
]J¯ and ]J¼ have a standard normal distribution since the variance components are
assumed known. According to Cochran (1977) and Moerbeek (2000), these equation
result in the relation:

-2 = (l
|›|„
~ƒ² œl~ƒ½ )
„
, (18)

If we change the alternative hypothesis to ¶ : .2 < .2 equation (18) is still valid, but
if we change to a two-sided tests ¶ : .2 ≠ .2 , ]J¯ in equation (18) is replaced by ]J¯/, .
This is because we want the required budget for this test to have the specified power
(1 − ¸) and significance level ¬ when δ = |.2 − .2 |

Similarly, we may be interested in minimizing the total budget required to obtain a specified power $(1-\gamma)$. This implies that $V(\hat{\pi})$ is a fixed quantity, equal to equation (18). Therefore, the problem is exactly the same as minimizing the budget to obtain a certain width of the confidence interval, but with a value of $V_0$ equal to equation (18), since we want to minimize $C = c_1 g m + c_2 m$ subject to $V(\hat{\pi}) = V_0$. Therefore the optimal allocations of clusters and pools per cluster are also given by equation (16), but using equation (18) in place of $V_0$, using $V(\pi_0) = \frac{p_0(1-p_0)\left(\frac{S_e - p_0}{S_e + S_p - 1}\right)^{2(1-s)/s}}{s^2(S_e + S_p - 1)^2}$ in place of $V(\pi)$, and using $\pi_0$ in place of $\pi$, where $p_0 = S_e + (1 - S_e - S_p)(1-\pi_0)^s$, since these values need to be calculated under the null hypothesis. This implies that

$$m = \left\{[\pi_0(1-\pi_0)]^2\sigma_b^2 + \frac{V(\pi_0)}{g}\right\}\frac{(z_{1-\alpha} + z_{1-\gamma})^2}{|\delta|^2} \quad \text{and} \quad g = \sqrt{\frac{c_2}{c_1}}\,\frac{\sqrt{V(\pi_0)}}{\pi_0(1-\pi_0)\,\sigma_b}.$$

Again, assume there is no budget constraint and the number of pools per cluster, $g$, is given. We can solve for the required number of clusters, $m$, to achieve a power level $(1-\gamma)$ for a desired $\delta$. To get the required $m$ we set $\widehat{Var}_0(\hat{\pi}) = \frac{|\delta|^2}{(z_{1-\alpha} + z_{1-\gamma})^2}$ and solve for $m$. Therefore, solving for $m$ from equation (18) indicates that the required number of clusters ($m$) is equal to:

$$m = \left\{[\pi_0(1-\pi_0)]^2\sigma_b^2 + \frac{V(\pi_0)}{g}\right\}\frac{(z_{1-\alpha} + z_{1-\gamma})^2}{|\delta|^2} \qquad (19)$$

Here, too, equation (19) is the same as the expression for $m$ obtained from equation (16), but with $V_0 = \frac{|\delta|^2}{(z_{1-\alpha} + z_{1-\gamma})^2}$. For this reason, equation (19) produces optimal values only if we use $g = \sqrt{\frac{c_2}{c_1}}\,\frac{\sqrt{V(\pi_0)}}{\pi_0(1-\pi_0)\,\sigma_b}$.
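The power calculation in equations (18) and (19) can be sketched the same way (function names ours; the z-values are hard-coded for a one-sided $\alpha = 0.05$ and 90% power, and $V(\pi_0)$ is the delta-method pool-level variance evaluated at the null proportion):

```python
def pool_variance(pi, s, Se, Sp):
    # Delta-method pool-level variance V(pi) at the given proportion
    p = Se + (1 - Se - Sp) * (1 - pi) ** s
    ratio = (Se - p) / (Se + Sp - 1)
    return p * (1 - p) * ratio ** (2 * (1 - s) / s) / (s ** 2 * (Se + Sp - 1) ** 2)

def clusters_for_power(pi0, g, delta, sigma2_b, s, Se, Sp,
                       z_alpha=1.645, z_gamma=1.282):
    """Equation (19): clusters m for a one-sided test of H0: pi = pi0
    with type I error rate alpha and power 1 - gamma at distance delta."""
    v0 = delta ** 2 / (z_alpha + z_gamma) ** 2    # equation (18)
    between = (pi0 * (1 - pi0)) ** 2 * sigma2_b
    within = pool_variance(pi0, s, Se, Sp) / g
    return (between + within) / v0

# Values from the Section 2.9 power example, with g = 47 pools per cluster
m = clusters_for_power(pi0=0.0013, g=47, delta=0.002,
                       sigma2_b=0.5184, s=10, Se=0.999, Sp=0.997)
print(round(m, 2))   # about 9.2, rounded up to 10 clusters
```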

2.6 Behavior of the optimal sample size for equal cluster sizes

Figure 2.1a presents graphs that demonstrate the behavior of the optimal sample size for equal cluster sizes for values of $\sigma_b^2$ ranging from 0.25 to 1.05. Most of the time the optimal sample size requires fewer clusters ($m$) than pools per cluster ($g$), since the ratio $m/g$ is usually less than 1. However, for values of $\sigma_b^2 \ge 0.65$ and $\pi > 0.04$, $m/g > 1$, and more clusters ($m$) than pools per cluster ($g$) are required. Figure 2.1a illustrates that when the variability between clusters, $\sigma_b^2$, is greater than the variability within clusters, $V(\pi)$, more clusters than pools per cluster are needed when the remaining parameters are fixed.

Figure 2.1b illustrates the behavior of the ratio $m/g$ as a function of the cost of enrolling clusters in the study, $c_2$. As $c_2$ increases, the ratio $m/g$ decreases, which is expected since the cost of including a cluster increases relative to the cost of enrolling pools, which does not change. Figure 2.1c shows that the number of clusters, $m$, decreases as the expected width ($\omega$) of the CI increases, which makes sense, since a narrow expected width ($\omega$) of the CI implies that the estimation process is more precise, and vice versa. In Figure 2.1d we can see that the required number of clusters, $m$, increases when a larger power is required.

2.7 Correction factor for unequal cluster sizes



Although equal cluster sizes are optimal for estimating the proportion of interest, they are rarely encountered in practice. Variation in the actual size of the clusters (fields, localities, hospitals, schools, etc.), non-response, and dropout of individuals (among others) generate unequal cluster sizes in the study (Van Breukelen et al., 2007). Cluster size variation increases bias and produces a considerable loss of power and precision in the parameter estimates. For this reason, we will calculate the relative efficiency of unequal versus equal cluster sizes in order to adjust the optimal sample sizes of the last section, which were derived under the assumption of equal cluster sizes. The relative efficiency of equal versus unequal cluster sizes for the estimator of the proportion of interest, $RE(\hat{\pi})$, is defined as:

$$RE(\hat{\pi}) = \frac{\widehat{Var}\left(\hat{\pi} \mid D_{equal}\right)}{\widehat{Var}\left(\hat{\pi} \mid D_{unequal}\right)} \qquad (20)$$

where $\widehat{Var}(\hat{\pi} \mid D_{equal})$ denotes the variance of the proportion estimator given a design with equal cluster sizes, and $\widehat{Var}(\hat{\pi} \mid D_{unequal})$ denotes the same quantity for an unequal cluster size design, but with the same number of clusters $m$ and the same total number of pools ($N = \sum_{i=1}^{m} g_i$) as the equal cluster size design. Then $RE(\hat{\pi})$ is equal to:

$$RE(\hat{\pi}) = \frac{\left(\sigma_b^{2*} + V(\pi)/\bar{g}\right)/m}{\left[\sum_{i=1}^{m}\left(\sigma_b^{2*} + V(\pi)/g_i\right)^{-1}\right]^{-1}} = \frac{\bar{g} + \alpha}{\bar{g}}\;\frac{1}{m}\sum_{i=1}^{m}\frac{g_i}{g_i + \alpha} \qquad (21)$$

where $\sigma_b^{2*} = [\pi(1-\pi)]^2\sigma_b^2$ and $\alpha = V(\pi)/\sigma_b^{2*}$. Note that equation (21) is equal to the RE of equal versus unequal cluster sizes in cluster randomized and multicenter trials given by Van Breukelen et al. (2007) to recover the loss of power when estimating treatment effects using a linear model. Here we use the RE to repair the loss of power or precision when estimating the proportion using a random logistic model for group testing. Since our RE has the same form as that derived by Van Breukelen et al. (2007), we use their approach to obtain a Taylor series approximation of equation (21), expressing the RE as a function of the intraclass correlation $\rho$ and the mean and standard deviation of cluster size. It is important to point out that equation (21) is expressed in terms of pools instead of individuals, as in the formula of Van Breukelen et al. (2007). Therefore, we assume that the cluster sizes $g_i$ ($i = 1, 2, \ldots, m$) are realizations of a random variable $U$ having mean $\mu_g$ and standard deviation $\sigma_g$. According to Van Breukelen et al. (2007), equation (21) can be considered a moment estimator of

$$RE(\hat{\pi}) = \frac{\bar{g} + \alpha}{\bar{g}}\,E\!\left(\frac{U}{U + \alpha}\right) \qquad (22)$$

If we define $\lambda = \mu_g/(\mu_g + \alpha)$ and denote the coefficient of variation of the random variable $U$ by $CV = \sigma_g/\mu_g$, then by using derivations similar to those reported by Van Breukelen et al. (2007, pp. 2601-2602; see Appendix D), we obtain the following second-order Taylor approximation of the expectation part of equation (22): $E\!\left(\frac{U}{U + \alpha}\right) \approx \lambda\left[1 - CV^2\lambda(1-\lambda)\right]$.
The second-order Taylor approximation of equation (21) is then:

$$RE(\hat{\pi})_T \approx 1 - CV^2\lambda(1-\lambda) \qquad (23)$$

It is evident that $RE(\hat{\pi})_T$ does not depend on the number of clusters $m$, but rather on the distribution of cluster sizes (mean and variance) and the intraclass correlation. When $\sigma_b^{2*} \to 0$ (and thus $\rho \to 0$) or $\sigma_b^{2*} \to \infty$ (and thus $\rho \to 1$), we have $RE \to 1$. For $0 < \sigma_b^{2*} < \infty$ (and thus $0 < \rho < 1$), we can see that $RE < 1$, implying that equal cluster sizes are optimal. For practical purposes, we will denote $RE(\hat{\pi})_T = RE_T$. To correct for the loss of efficiency due to the assumption of equal cluster sizes, one can simply divide the number of clusters ($m$) given in equation (15) or (16) by the expected RE resulting from equation (23). It is also evident that the number of clusters will then increase the budget to $C^* = C/RE_T$, whereas the optimal number of pools per cluster ($g$) does not change.
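The correction for unequal cluster sizes is a single line of arithmetic. A minimal sketch (function name ours), using the cluster-size mean 177, standard deviation 81.5 and ratio $\alpha = 5$ that appear in the worked example of Section 2.9:

```python
def taylor_re(mu_g, sd_g, alpha):
    """Equation (23): second-order Taylor approximation of the relative
    efficiency; alpha is the ratio V(pi) / sigma_b^{2*}."""
    cv = sd_g / mu_g                 # coefficient of variation of cluster size
    lam = mu_g / (mu_g + alpha)      # lambda = mu_g / (mu_g + alpha)
    return 1 - cv ** 2 * lam * (1 - lam)

re_t = taylor_re(177, 81.5, 5)
m_star = 4.8042 / re_t               # divide m by RE_T to restore full efficiency
print(round(re_t, 4), round(m_star, 4))   # 0.9943 4.8316
```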

2.8 Comparison of the relative efficiency and its Taylor approximation

To compare the RE of equation (21) with its Taylor approximation (Eq. 23), computations were performed for four cluster size distributions: uniform, unimodal, bimodal, and positively skewed. Three different cluster sizes, $n_1, n_2, n_3$, with frequencies $f_1, f_2, f_3$, were evaluated (see Table 2.1). For each of the four distributions, both REs [Eq. (21) and its Taylor approximation, Eq. (23)] were computed and plotted as a function of the intraclass correlation (using values from 0.0 to 0.3).
Figure 2.2 shows that for the four distributions (uniform, unimodal, bimodal and positively skewed), the RE drops from 1 at $\rho = 0$ to a minimum at a value of $\rho$ somewhere between 0.05 and 0.1, and then increases, returning to 1 at $\rho = 1$. Lower RE values are observed when there is more cluster size variation (as in the case of the bimodal distribution, with values of $CV > 0.70$). Comparing the four distributions in Figure 2.2, the positively skewed distribution gives the highest RE, followed by the unimodal, uniform and bimodal distributions. These results are in line with those reported by Van Breukelen et al. (2007, 2008) and Candel et al. (2010) for cluster randomized trials with normal and binary data in a non-group-testing context.

Figure 2.2 also shows that the Taylor approximation (equation 23, denoted as $RE_T$) of the RE given in equation (21) is acceptable in most cases. However, it is clearly affected by the distribution of the cluster sizes, the number of clusters, the number of pools per cluster, and the value of the intraclass correlation.
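The curves in Figure 2.2 can be reproduced directly from equations (21) and (23). The sketch below (function names ours) evaluates both for the uniform distribution of Table 2.1, assuming the intraclass correlation is defined as $\rho = \sigma_b^{2*}/(\sigma_b^{2*} + V(\pi))$, so that $\alpha = (1-\rho)/\rho$:

```python
def exact_re(sizes, alpha):
    """Equation (21): exact RE of unequal versus equal cluster sizes."""
    m = len(sizes)
    g_bar = sum(sizes) / m
    return (g_bar + alpha) / g_bar * sum(g / (g + alpha) for g in sizes) / m

def taylor_re(sizes, alpha):
    """Equation (23): Taylor approximation built from the mean and SD."""
    m = len(sizes)
    mu = sum(sizes) / m
    var = sum((g - mu) ** 2 for g in sizes) / m
    lam = mu / (mu + alpha)
    return 1 - (var / mu ** 2) * lam * (1 - lam)

# Uniform distribution from Table 2.1: sizes 4, 22, 40 with 6 clusters each
sizes = [4] * 6 + [22] * 6 + [40] * 6
alpha = (1 - 0.1) / 0.1      # intraclass correlation rho = 0.1
print(round(exact_re(sizes, alpha), 3), round(taylor_re(sizes, alpha), 3))
# 0.861 0.908
```

Both values are below 1, and the Taylor approximation sits above the exact RE for this configuration, consistent with the behavior described for Figure 2.2.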

2.9 Estimating the proportion of transgenic plants––An example

Next we illustrate how to obtain the optimal allocation of fields and pools for minimizing the variance (using equation 15), how to estimate the budget required for a desired CI width, and how to estimate the budget required to obtain a certain power (using equation 16).
Carreón-Herrera et al. (2011) collected corn grain in 14 localities in the Sierra Nororiental and 22 localities in the Mixteca Baja, in the State of Puebla, México, collecting a total of 58 kg of grain. Forty-seven samples were obtained from farmers and 11 from DICONSA stores. Of the 58 samples, 36 had white seed, 10 yellow, 8 blue and 4 red. They used PCR to detect the promoter of Cauliflower Mosaic Virus (CaMV-35S), which indicates the presence of transgenic corn. For each sample they reported the percentage of the CaMV-35S promoter in the sample. The standard of 0.01% was used as the lower limit of reference for detection of the CaMV-35S promoter. The reported percentages of the CaMV-35S promoter varied between 0.01% and 0.25%. However, Landavazo et al. (2006) reported a lower value (a median of 0.000012% over the 5 fields studied) for the percentage of the CaMV-35S promoter in a study conducted in the
neighboring state of Oaxaca. Assuming that we want to conduct another study in this region of Puebla, we can take the expected proportion of transgenic plants to be the midpoint of the reported range, $\pi = \frac{(0.01 + 0.25)/2}{100} = 0.0013$, while the variance between clusters is obtained as $\sigma_b^2 = (range/6)^2$. For
binomial data, the range relevant to the six-sigma approximation is the difference between the maximum and minimum plausible logit (Stroup, 2012). Since we know the lowest ($\pi_L = 0.00000012$) and highest ($\pi_H = 0.0025$) plausible probabilities, we can calculate the logits

$$\eta_L = \log\left(\frac{\pi_L}{1-\pi_L}\right) = \log\left(\frac{0.00000012}{1-0.00000012}\right) = -6.9208 \quad \text{and} \quad \eta_H = \log\left(\frac{\pi_H}{1-\pi_H}\right) = \log\left(\frac{0.0025}{1-0.0025}\right) = -2.6010,$$

so the range is equal to $range = -2.6010 + 6.9208 = 4.3198$. Therefore, $\sigma_b^2 \cong (4.3198/6)^2 = 0.5184$. Based on a literature review, we decided to use a pool size of $s = 10$ plants per pool, with $S_e = 0.999$ and $S_p = 0.997$. The total budget for the study is $C = 20{,}000$; $c_2 = 850$ is the cost of enrolling a field in the study, and $c_1 = 70$ is the cost of enrolling a pool composed of $s = 10$ plants. Next we obtain the required sample sizes for minimizing the variance, for achieving a certain width of the CI, and for obtaining a certain power.

Minimizing the variance. Computing $p = S_e + (1 - S_e - S_p)(1-\pi)^s = 0.999 + (1 - 0.999 - 0.997)(1 - 0.0013)^{10} = 0.01587$ and

$$V(\pi) = \frac{p(1-p)\left(\frac{S_e - p}{S_e + S_p - 1}\right)^{2(1-s)/s}}{s^2(S_e + S_p - 1)^2} = \frac{(0.01587)(1-0.01587)\left(\frac{0.999 - 0.01587}{0.999 + 0.997 - 1}\right)^{2(1-10)/10}}{10^2(0.999 + 0.997 - 1)^2} = 0.000161$$

results in

$$g = \sqrt{\frac{850}{70}}\,\frac{\sqrt{0.000161}}{0.0013(1-0.0013)\sqrt{0.5184}} = 47.32856 \approx 47$$

$$m = \frac{C}{c_2 + g c_1} = \frac{20000}{850 + (47.32856)(70)} = 4.8042 \approx 5$$

This means that we need to select five fields at random from the population of fields, with 47 pools in each field. Thus the total number of plants to select in each field is $g \times s = 47 \times 10 = 470$ plants, which will be allocated at random to form the 47 pools.
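The whole chain of calculations above can be reproduced in a few lines. This sketch (variable names ours) recomputes $p$, $V(\pi)$, $g$ and $m$ from the stated inputs:

```python
import math

# Inputs from the Puebla example
pi, s, Se, Sp = 0.0013, 10, 0.999, 0.997
sigma2_b, C, c1, c2 = 0.5184, 20000, 70, 850

p = Se + (1 - Se - Sp) * (1 - pi) ** s                 # pool-positive probability
V = (p * (1 - p) * ((Se - p) / (Se + Sp - 1)) ** (2 * (1 - s) / s)
     / (s ** 2 * (Se + Sp - 1) ** 2))                  # pool-level variance V(pi)

g = math.sqrt(c2 / c1) * math.sqrt(V) / (pi * (1 - pi) * math.sqrt(sigma2_b))
m = C / (c2 + g * c1)                                  # from C = c1*g*m + c2*m

print(round(p, 5), round(V, 6), round(g, 2), round(m, 2))
```

Any small differences from the hand calculation are due only to rounding of intermediate values.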

Now, if the cluster sizes are unequal, how do we compensate for the loss of efficiency due to varying cluster sizes? Assume that the mean and standard deviation of the cluster sizes are $\mu_g = 177$ and $\sigma_g = 81.5$, respectively. Then $CV = \frac{81.5}{177} = 0.4605$ and $\alpha = V(\pi)/\sigma_b^{2*} = 5$, so $\lambda = 177/(177 + 5) = 0.9725$. Therefore, $RE_T = 1 - (0.4605^2)(0.9725)(1 - 0.9725) = 0.9943$, and for practical purposes, adjustment for unequal cluster sizes is not needed. However, to illustrate the method, full efficiency can be restored by taking $m^* = \frac{4.8042}{0.9943} = 4.8316 \approx 5$ clusters with $g = 47$ pools, and the new total budget will increase to $C^* = \frac{20000}{0.9943} = 20114.65$.

Specified CI width. Now suppose that the researcher requires a 95% confidence interval estimate, with a desired width for the proportion of transgenic plants equal to $\omega_0 = (\hat{\pi}_U - \hat{\pi}_L) \le \omega = 0.0025$. Therefore, $z_{1-0.05/2} = 1.96$ and $V_0 = \omega^2/(4z_{1-\alpha/2}^2) = \frac{0.0025^2}{4(1.96)^2} = 0.000000407$. Using the same values of $s$, $S_e$, $S_p$, $\sigma_b^2$, $\pi$, $c_1$ and $c_2$ as given for minimizing the variance, equation (16) gives

$$g = \sqrt{\frac{850}{70}}\,\frac{\sqrt{0.000161}}{0.0013(1-0.0013)\sqrt{0.5184}} = 47$$

while the number of clusters is equal to:

$$m = \left[\left\{0.0013(1-0.0013)\right\}^2(0.5184) + \frac{0.000161}{47}\right]\Big/\,0.000000407 = 10.5802 \approx 11$$

Since the value of $g$ does not change, we need 470 plants per field, but now we need 11 fields to reach the required width of the 95% CI. However, this sample size is only valid for equal cluster sizes. If needed, adjustment for unequal cluster sizes is carried out via $m^* = m/RE_T$. Therefore the budget has to be equal to $C = (47)(11)(70) + 11(850) = 45540$.


This implies that the required total budget for obtaining a 95% CI for estimating the proportion ($\pi$) with a desired width of 0.0025 is 2.264 times larger than the previous budget (20114.65).
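A sketch of the same arithmetic (variable names ours), rounding the number of fields up before computing the budget:

```python
import math

z, w = 1.96, 0.0025
V0 = w ** 2 / (4 * z ** 2)                       # target variance of pi-hat
between = (0.0013 * (1 - 0.0013)) ** 2 * 0.5184  # [pi(1-pi)]^2 * sigma_b^2
m = (between + 0.000161 / 47) / V0               # g = 47 pools, V(pi) = 0.000161
m_fields = math.ceil(m)
budget = 47 * m_fields * 70 + m_fields * 850     # C = c1*g*m + c2*m
print(m_fields, budget)                          # 11 45540
```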

Now we determine the required number of clusters when there is no budget constraint, assuming $g = 10$ pools per cluster. Using equation (17) and the same values of $\omega$, $s$, $\alpha$, $S_e$, $S_p$, $\sigma_b^2$, and $\pi$ that were given for minimizing the variance, we have

$$m = \frac{4(1.96)^2}{0.0025^2}\left[\left\{0.0013(1-0.0013)\right\}^2(0.5184) + \frac{0.000161}{10}\right] = 41.7783 \approx 42$$

This implies that we need a sample of 42 clusters, each containing 10 pools, assuming equal cluster sizes. With unequal cluster sizes, and assuming the same mean and standard deviation of cluster sizes, we need $m^* = \frac{41.7783}{0.9943} = 42.0178 \approx 43$ clusters. Of course, in this case, the total budget will be higher than the previously specified budget.

Specified power. Now suppose that we need to know the budget and sample size required for testing $H_0: \pi = 0.0013$ vs $H_a: \pi > 0.0013$ at an $\alpha = 0.05$ significance level with power $(1-\gamma) = 0.9$ (90%) for detecting $\delta \ge 0.002$, using the same parameter values ($s$, $S_e$, $S_p$, $\sigma_b^2$, $c_1$, and $c_2$) as before. Then $V_0 = \frac{0.002^2}{(1.645 + 1.282)^2} = 0.0000004671$. Since $V(\pi_0) = V(\pi) = 0.000161$ and $\pi_0 = \pi$, then

$$g = \sqrt{\frac{850}{70}}\,\frac{\sqrt{0.000161}}{0.0013(1-0.0013)\sqrt{0.5184}} \approx 47$$

and the required number of clusters is equal to:

$$m = \left[\left\{0.0013(1-0.0013)\right\}^2(0.5184) + \frac{0.000161}{47}\right]\Big/\,0.0000004671 = 9.2136 \approx 10$$

Here, too, we need 470 plants per field, but now we need 10 fields to reach the required power of 90%. To compensate for unequal cluster sizes, and assuming the same mean and standard deviation of the cluster sizes ($\mu_g = 177$ and $\sigma_g = 81.5$), we have to multiply $m = 9.2136$ by the correction factor $(1/0.9943)$, which gives $m^* = \frac{9.2136}{0.9943} = 9.2664 \approx 10$ clusters. Here the number of clusters remains the same due to rounding, but this is not always the case.

The required budget is $C = (10)(47)(70) + 10(850) = 41400$, which implies that the required total budget is 2.058 times larger than the budget for minimizing the variance of the proportion (20114.65). However, this case guarantees a power of 90% for $\delta \ge 0.002$.
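The budget for the power specification can be checked the same way (variable names ours; the $RE_T$ correction of 0.9943 is applied before rounding up):

```python
import math

z_alpha, z_gamma, delta = 1.645, 1.282, 0.002
V0 = delta ** 2 / (z_alpha + z_gamma) ** 2       # equation (18)
between = (0.0013 * (1 - 0.0013)) ** 2 * 0.5184
m = (between + 0.000161 / 47) / V0               # g = 47 pools, V(pi0) = 0.000161
m_fields = math.ceil(m / 0.9943)                 # divide by RE_T, then round up
budget = m_fields * 47 * 70 + m_fields * 850
print(m_fields, budget)                          # 10 41400
```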

Now consider the problem without a budget constraint, with 10 pools per cluster ($g$). Solving for the required number of clusters ($m$) using the same values of $s$, $S_e$, $S_p$, $\sigma_b^2$, $\alpha$, $(1-\gamma)$, and $\delta = \pi - \pi_0$ as above gives

$$m = \frac{(1.645 + 1.282)^2}{0.002^2}\left[\left\{0.0013(1-0.0013)\right\}^2(0.5184) + \frac{0.000161}{10}\right] = 44.6386$$

This means that to perform the study we need 45 clusters with 10 pools per cluster if the cluster sizes are equal, and $m^* = \frac{44.6386}{0.9943} = 44.8945 \approx 45$ clusters with unequal cluster sizes.

2.10 Tables for determining sample size

This section contains tables that help to calculate the optimal sample size. Table 2.2 gives the optimal allocation of clusters and pools when the goal is to estimate the proportion ($\pi$) with minimum variance using group testing with pool size $s = 10$. The cost function is $C = c_1 g m + c_2 m$, with $C = 10000$, six values of $\sigma_b^2 = 0.15, 0.25, 0.45, 0.65, 0.85, 1.05$, three values of $c_1 = 50, 100, 200$, and $c_2 = 800$. To illustrate how to use Table 2.2, assume that the proportion of interest is $\pi = 0.035$ and that the variance between clusters is $\sigma_b^2 = 0.25$. Assume the researcher estimates the cost of enrolling clusters in the study as $c_2 = 800$, the cost of enrolling pools of size $s = 10$ as $c_1 = 100$, and the total budget for conducting the study as $C = 10000$. Since in this case $c_1 = 100$, we refer to the second subsection of the table. We find the value of $\pi = 0.035$ in the first column and the value of $\sigma_b^2 = 0.25$ in columns four and five. At the intersection of $\pi = 0.035$ (first column) and $\sigma_b^2 = 0.25$ (columns four and five) are the optimal number of pools per cluster ($g = 11$) and number of clusters ($m = 6$) required.

Table 2.3 gives the optimal allocations of clusters ($m$) and pools per cluster ($g$) for estimating $\pi$ with a certain width of the confidence interval under the cost function $C = c_1 g m + c_2 m$, with $c_1 = 50$, $c_2 = 800$ and significance level $\alpha = 0.05$. Three values of $\sigma_b^2 = 0.15, 0.25, 0.5$ form the three subsections of Table 2.3. To illustrate, assume that $\pi = 0.035$, $\sigma_b^2 = 0.25$, $c_1 = 50$, $c_2 = 800$, and the desired width of the confidence interval is equal to $\omega = 0.05$. The optimal $m$ and $g$ are obtained where the value of $\pi = 0.035$ (first column) intersects the $\omega = 0.05$ columns in the second subsection, corresponding to $\sigma_b^2 = 0.25$. Therefore the optimal numbers of pools per cluster ($g$) and clusters ($m$) are 16 and 4, respectively.

Table 2.4 should be used when testing a hypothesis, that is, when we want to test $H_0: \pi = \pi_0$ against $H_a: \pi > \pi_0$. The table gives, for six values of power $(0.70, 0.75, 0.80, 0.85, 0.90, 0.95)$, significance level $\alpha = 0.05$, pool size $s = 10$, $c_1 = 50$, $c_2 = 800$, $\sigma_b^2 = 0.5$, $\delta = 0.01, 0.03, 0.05$, and ten values of $\pi_0$ from 0.005 to 0.095 in increments of 0.01, the optimal allocation of clusters ($m$) and pools per cluster ($g$) under the cost function $C = c_1 g m + c_2 m$. To illustrate, assume that $\pi_0 = 0.035$, $\sigma_b^2 = 0.5$, $c_1 = 50$, $c_2 = 800$, the desired power is $1-\gamma = 0.8$, the significance level is $\alpha = 0.05$, and $\delta = 0.03$. Using the second subsection ($\delta = 0.03$), we find the value of $\pi_0 = 0.035$ (first column) and the $1-\gamma = 0.8$ columns; at the point where they intersect we find the required number of pools per cluster ($g = 11$) and number of clusters ($m = 7$) to achieve a power of 80%.

2.11 Discussion and conclusions

In the present paper, we derived optimal sample sizes for group testing in a two-stage sampling process under a budget constraint. We assumed that the budget for enrolling individuals and clusters in the study is fixed and that the variance components are known. The optimal sample sizes were derived using Lagrange multipliers and produced formulae similar to those of Brooks (1955), Cochran (1977, p. 285), and Moerbeek et al. (2000), which are based on minimizing the error variance. This optimal allocation of clusters and pools was derived assuming equal cluster sizes, which is a good approximation when financial resources are scarce. However, in practice the equality of cluster sizes is rarely satisfied, so we derived a correction factor (the inverse of the relative efficiency) to adjust the optimal sample sizes obtained under equal cluster sizes. It is important to point out that this correction factor does not affect the number of required pools per cluster ($g$), but only the number of required clusters ($m$) and the total budget ($C$).

To determine the optimal sample sizes for equal or unequal cluster sizes, we start by specifying the needed power or precision; we can then calculate $V(\hat{\pi})$ as well as the needed budget ($C$), and later obtain the optimal number of clusters ($m$) and pools per cluster ($g$). This is important since the researcher will usually plan his/her research in terms of power or precision under a budget constraint. The examples given show how the researcher can estimate the budget needed to reach the desired power or precision for the parameter estimate. However, it should be noted that these are approximate optimal sample sizes, since they were derived assuming that the estimated proportion ($\hat{\pi}$) is approximately normally distributed.

If sample sizes for precision or power are required without a budget constraint, expressions (17) and (19) can be used, respectively. However, these sample sizes are not optimal, since the value of $g$ is determined by the researcher according to his/her beliefs.

Also, remember that our optimal sample sizes were derived using a first-order Taylor series expansion (TSE) approach under the assumption that the variance components are known. For this reason, the optimal sample sizes are expected to be slightly biased, since in many Monte Carlo simulations for estimating fixed and random effects and determining optimal sample sizes for cluster randomized trials, this method produces biased results (Goldstein and Rasbash, 1996; Moerbeek et al., 2001; Moerbeek and Maas, 2005; Candel et al., 2010). Nevertheless, these approximate sample sizes should be reasonable and can be calculated easily, although further study of the performance of the proposed optimal sample sizes is required.

2.12 References

Anonymous (2003). Regulation (EC) 1829 of the European Parliament and the European Council of 22 September 2003 on genetically modified food and feed. Official Journal of the European Union L 268.
Bilder, C. (2009). Human or Cylon? Group Testing on Battlestar Galactica. Chance
22(3):46-50.
Breslow, N. E., & Clayton, D. G. (1993). Approximate inference in generalized linear
mixed models. Journal of the American Statistical Association, 88(421): 9-25.
Brooks, S.H. (1955). The estimation of an optimum subsampling number. Journal of the American Statistical Association 50:398-415.

Candel, M.J.J.M., Van Breukelen, G.J.P., Kotova, L., & Berger, M.P.F. (2008).
Optimality of unequal cluster sizes in multilevel studies with small sample sizes.
Communications in Statistics: Simulation and Computation 37:222-239.
Candel, M. J., & Van Breukelen, G. J. (2010). Sample size adjustments for varying
cluster sizes in cluster randomized trials with binary outcomes analyzed with
second-order PQL mixed logistic regression. Statistics in medicine, 29(14): 1488.
Candy, S.G. (2000). The application of generalized linear mixed models to multi-level
sampling for insect population monitoring. Environmental and Ecological Statistics
7(3):217-238.
Carreón-Herrera, N. I. (2011). Detección de transgenes en variedades nativas de maíz en dos regiones del estado de Puebla. Unpublished doctoral dissertation, Colegio de Postgraduados, Campus Puebla, Puebla, Mexico. Available at: http://www.biblio.colpos.mx:8080/xmlui/bitstream/handle/10521/439/Carreon_Herrera_NI_MC_EDAR_2011.pdf?sequence=1
Chen, P., Tebbs, J., and Bilder, C. (2009). Group Testing Regression Models with Fixed
and Random Effects. Biometrics 65(4):1270-1278.
Cochran, W.G. (1977). Sampling Techniques (3rd ed.). New York: Wiley.
Dodd, R. Y., Notari, E. 4., & Stramer, S. L. (2002). Current prevalence and incidence of
infectious disease markers and estimated window-period risk in the American Red
Cross blood donor population. Transfusion, 42(8): 975-979.
Dorfman, R. (1943). The detection of defective members of large populations. The
Annals of Mathematical Statistics 14(4):436-440.
Goldstein, H. (1991). Nonlinear multilevel models, with an application to discrete
response data. Biometrika 78(1):45-51.
Goldstein, H., Rasbash, J. (1996). Improved approximations for multilevel models with
binary responses. Journal of the Royal Statistical Society, Soc. A 159(3):505-513.
Goldstein, H. (2003). Multilevel Statistical Models (3rd ed.). London: Edward Arnold.
Hernández-Suárez, C.M., Montesinos-López, O.A., McLaren, G., & Crossa, J. (2008).
Probability models for detecting transgenic plants. Seed Science Research 18:77-
89.
Landavazo Gamboa, D. A., Calvillo Alba, K. G., Espinosa Huerta, E., González Morelos,
L., Aragón Cuevas, F., Torres Pacheco, I., ... & Mora Avilés, M. A. (2006).
Caracterización molecular y biológica de genes recombinantes en maíz criollo de
Oaxaca. Agricultura técnica en México, 32(3): 267-279.

Lohr, S.L. (2008). Coverage and sampling. In E.D. de Leeuw, J.J. Hox, & D.A. Dillman
(eds.). International Handbook of Survey Methodology. New York, NY: Lawrence
Erlbaum Associates.
Maas, C.J.M., Hox, J.J. (2004). Robustness issues in multilevel regression analysis.
Statistica Neerlandica 58:127-137.
Moerbeek, M. (2006). Power and money in cluster randomized trials: when is it worth
measuring a covariate? Statistics in Medicine 25(15):2607-2617.
Moerbeek, M., van Breukelen, G. J., & Berger, M. P. (2000). Design issues for
experiments in multilevel populations. Journal of Educational and Behavioral
Statistics, 25(3): 271-284.
Moerbeek, M., Van Breukelen, G. J., & Berger, M. P. (2001a). Optimal experimental
designs for multilevel logistic models. Journal of the Royal Statistical Society:
Series D (The Statistician), 50(1):17-30.
Moerbeek, M., van Breukelen, G. J. P., & Berger, M. P. F. (2001b), “Optimal
experimental Designs for Multilevel Models with Covariates,”Communications in
Statistics, Theory and Methods, 30:2683–2697.
Montesinos-López, O.A., Montesinos-López, A., Crossa, J., Eskridge, K., & Hernández-
Suárez, C.M. (2010). Sample size for detecting and estimating the proportion of
transgenic plants with narrow confidence intervals. Seed Science Research 20:123-
136.
Montesinos-López, O.A., Montesinos-López, A., Crossa, J., Eskridge, K., & Sàenz-
Casas, R.A. (2011). Optimal sample size for estimating the proportion of transgenic
plants using the Dorfman model with a random confidence interval. Seed Science
Research 21(3):235-246.
Mood, A.M., Graybill, F.A., & Boes, D.C. (1974). Introduction to the Theory of
Statistics (3rd Edition). McGraw-Hill.
Peck, C. (2006). Going after BVD. Beef 2006 42:34-44.
Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data.
Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(4):
805-827.
Remlinger, K., Hughes-Oliver, J., Young, S., & Lam, R. (2006). Statistical design of
pools using optimal coverage and minimal collision. Technometrics 48:133-143.
Rodríguez, G., & Goldman, N. (1995). An assessment of estimation procedures for
multilevel models with binary responses. Journal of the Royal Statistical Society.
Series A 158(1):73-89.

Skrondal, A., & Rabe-Hesketh, S. (2007). Redundant overdispersion parameters in


multilevel models for categorical responses. Journal of Educational and Behavioral
Statistics 32:419-430.
Stroup, W. W. (2012). Generalized linear mixed models: Modern concepts, methods and
applications. CRC Press.
Tebbs, J., & Bilder, C. (2004). Confidence interval procedures for the probability of
disease transmission in multiple-vector-transfer designs. Journal of Agricultural,
Biological, and Environmental Statistics 9(1):79-90.
Van Breukelen, G.J.P., Candel, M.J.J.M., & Berger, M.P.F. (2007). Relative efficiency
of unequal versus equal cluster sizes in cluster randomized and multicenter trials.
Statistics in Medicine 26:2589-2603.
Van Breukelen, G.J.P., Candel, M.J.J.M., & Berger, M.P.F. (2008). Relative efficiency
of unequal cluster sizes for variance component estimation in cluster randomized
and multicentre trials. Statistical Methods in Medical Research 17:439-458.
Verstraeten, T., Farah, B., Duchateau, L., & Matu, R. (1998). Pooling sera to reduce the
cost of HIV surveillance: a feasibility study in a rural Kenyan district. Tropical
Medicine and International Health 3:747-750.
Wolf, J. (1985). Born-again group testing: multi access communications. IEEE
Transactions on Information Theory 31(2):185-191.
Yamamura, K., & A. Hino. (2007). Estimation of the proportion of defective units by
using group testing under the existence of a threshold of detection.
Communications in Statistics - Simulation and Computation 36:949-957.
Zhang, B., Bilder, C., & Tebbs, J. (2013). Regression analysis for multiple-disease group testing data. Statistics in Medicine 32(28):4954-4966.

Table 2.1. Cluster size distributions used for calculating relative efficiency.

                         Cluster sizes       Cluster frequencies
Distribution             n1   n2   n3        f1   f2   f3       CV
m = 18, ḡ = 22, s = 10
Uniform                   4   22   40         6    6    6       0.668
Unimodal                  4   22   40         2   14    2       0.386
Bimodal                   4   22   40         8    2    8       0.771
Positively skewed         8   26   44         8    6    4       0.643
m = 48, ḡ = 20, s = 10
Uniform                   5   20   35        16   16   16       0.612
Unimodal                  5   20   35         8   32    8       0.433
Bimodal                   5   20   35        22    4   22       0.718
Positively skewed        10   24   42        24   16    8       0.583

f1 = number of clusters of size n1 (small), f2 = number of clusters of size n2 (medium), f3 = number of clusters of size n3 (large); CV = coefficient of variation. Two numbers of clusters were studied: m = 18 with an average of ḡ = 22 pools per cluster, and m = 48 with an average of ḡ = 20 pools per cluster. In both cases, the pool size was s = 10.

Table 2.2. Optimal sample sizes (g and m) for group testing in two stages, given a pool size, that minimize the variance of the proportion ($\hat{\pi}$).

c1 = 50 and c2 = 800
        σb²=0.15   σb²=0.25   σb²=0.45   σb²=0.65   σb²=0.85   σb²=1.05
π        g   m      g   m      g   m      g   m      g   m      g   m
0.005   65   3     50   4     37   4     31   5     27   5     24   5
0.015   32   5     25   5     19   6     15   7     14   7     12   8
0.025   25   5     19   6     14   7     12   8     10   8      9   8
0.035   21   6     16   7     12   8     10   8      9   9      8   9
0.045   19   6     15   7     11   8      9   8      8   9      7   9
0.055   17   6     14   7     10   8      8   9      7   9      7   9
0.065   17   7     13   7     10   8      8   9      7   9      6   9
0.075   16   7     12   8      9   8      8   9      7   9      6  10
0.085   15   7     12   8      9   9      7   9      6   9      6  10
0.095   15   7     12   8      9   9      7   9      6   9      6  10

c1 = 100 and c2 = 800
0.005   46   2     35   3     26   3     22   4     19   4     17   4
0.015   23   4     18   4     13   5     11   6     10   6      9   7
0.025   17   4     13   5     10   6      8   7      7   7      7   7
0.035   15   5     11   6      9   7      7   7      6   8      6   8
0.045   13   5     10   6      8   7      6   7      6   8      5   8
0.055   12   5     10   6      7   7      6   8      5   8      5   8
0.065   12   6      9   6      7   7      6   8      5   8      4   9
0.075   11   6      9   6      6   7      5   8      5   8      4   9
0.085   11   6      8   7      6   7      5   8      5   8      4   9
0.095   11   6      8   7      6   8      5   8      4   9      4   9

c1 = 200 and c2 = 800
0.005   32   2     25   2     19   3     16   3     14   3     12   4
0.015   16   3     12   4      9   4      8   5      7   5      6   5
0.025   12   4      9   4      7   5      6   6      5   6      5   6
0.035   10   4      8   5      6   5      5   6      4   6      4   7
0.045    9   4      7   5      5   6      5   6      4   7      4   7
0.055    9   4      7   5      5   6      4   7      4   7      3   7
0.065    8   5      6   5      5   6      4   7      3   7      3   8
0.075    8   5      6   5      5   6      4   7      3   7      3   8
0.085    8   5      6   6      4   6      4   7      3   7      3   8
0.095    8   5      6   6      4   6      4   7      3   7      3   8

With pool size (s = 10) and cost function $C = c_1 g m + c_2 m$, with C = 10000 and c2 = 800, for ten values of π, three values of c1, and six values of σb².

Table 2.3. Optimal sample sizes (g and m) for confidence interval estimation using group testing in two stages, given a pool size.

σb² = 0.15
        ω=0.01     ω=0.03     ω=0.05     ω=0.07     ω=0.09     ω=0.11
π        g   m      g   m      g   m      g   m      g   m      g   m
0.005   65   3     65   2     65   2     65   2     65   2     65   2
0.015   32  16     32   2     32   2     32   2     32   2     32   2
0.025   25  35     25   4     25   2     25   2     25   2     25   2
0.035   21  61     21   7     21   3     21   2     21   2     21   2
0.045   19  93     19  11     19   4     19   2     19   2     19   2
0.055   17 133     17  15     17   6     17   3     17   2     17   2
0.065   17 171     17  19     17   7     17   4     17   3     17   2
0.075   16 221     16  25     16   9     16   5     16   3     16   2
0.085   15 278     15  31     15  12     15   6     15   4     15   3
0.095   15 333     15  37     15  14     15   7     15   5     15   3
σb² = 0.25
0.005   50   4     50   2     50   2     50   2     50   2     50   2
0.015   25  22     25   3     25   2     25   2     25   2     25   2
0.025   19  50     19   6     19   2     19   2     19   2     19   2
0.035   16  89     16  10     16   4     16   2     16   2     16   2
0.045   15 134     15  15     15   6     15   3     15   2     15   2
0.055   14 189     14  21     14   8     14   4     14   3     14   2
0.065   13 255     13  29     13  11     13   6     13   4     13   3
0.075   12 331     12  37     12  14     12   7     12   5     12   3
0.085   12 406     12  46     12  17     12   9     12   6     12   4
0.095   12 487     12  55     12  20     12  10     12   7     12   5
σb² = 0.5
0.005   35   7     35   2     35   2     35   2     35   2     35   2
0.015   18  35     18   4     18   2     18   2     18   2     18   2
0.025   13  86     13  10     13   4     13   2     13   2     13   2
0.035   11 154     11  18     11   7     11   4     11   2     11   2
0.045   10 237     10  27     10  10     10   5     10   3     10   2
0.055   10 327     10  37     10  14     10   7     10   5     10   3
0.065    9 446      9  50      9  18      9  10      9   6      9   4
0.075    9 565      9  63      9  23      9  12      9   7      9   5
0.085    8 725      8  81      8  29      8  15      8   9      8   6
0.095    8 873      8  97      8  35      8  18      8  11      8   8

With pool size (s = 10) and cost function $C = c_1 g m + c_2 m$ subject to $V(\hat{\pi}) = \omega^2/(4z_{1-\alpha/2}^2)$, with c1 = 50, c2 = 800 and significance level α = 0.05, for ten values of π, six values of the expected width of the CI (ω), and three values of σb².

Table 2.4. Optimal sample sizes (ï and ð) for power estimation using group testing in
two stages given a pool size.
ž = 0.01
1−¸ 1−¸ 1−¸ 1−¸ 1−¸
1 − ¸ =0.70
.2
=0.75 =0.80 =0.85 =0.90 =0.95
g m g m g m g m g m g m
0.005 35 2 35 3 35 3 35 3 35 4 35 5
0.015 18 11 18 13 18 15 18 17 18 20 18 25
0.025 13 27 13 30 13 35 13 40 13 48 13 61
0.035 11 47 11 54 11 62 11 72 11 86 11 108
0.045 10 73 10 83 10 96 10 111 10 132 10 167
0.055 10 100 10 115 10 132 10 153 10 182 10 230
0.065 9 137 9 157 9 180 9 209 9 249 9 314
0.075 9 173 9 198 9 228 9 265 9 315 9 398
0.085 8 222 8 254 8 292 8 339 8 404 8 511
0.095 8 268 8 306 8 351 8 409 8 487 8 615
δ = 0.03
0.005 35 2 35 2 35 2 35 2 35 2 35 2
0.015 18 2 18 2 18 2 18 2 18 3 18 3
0.025 13 3 13 4 13 4 13 5 13 6 13 7
0.035 11 6 11 6 11 7 11 8 11 10 11 12
0.045 10 9 10 10 10 11 10 13 10 15 10 19
0.055 10 12 10 13 10 15 10 17 10 21 10 26
0.065 9 16 9 18 9 20 9 24 9 28 9 35
0.075 9 20 9 22 9 26 9 30 9 35 9 45
0.085 8 25 8 29 8 33 8 38 8 45 8 57

0.095 8 30 8 34 8 39 8 46 8 55 8 69
δ = 0.05
0.005 35 2 35 2 35 2 35 2 35 2 35 2
0.015 18 2 18 2 18 2 18 2 18 2 18 2
0.025 13 2 13 2 13 2 13 2 13 2 13 3
0.035 11 2 11 3 11 3 11 3 11 4 11 5
0.045 10 3 10 4 10 4 10 5 10 6 10 7
0.055 10 4 10 5 10 6 10 7 10 8 10 10
0.065 9 6 9 7 9 8 9 9 9 10 9 13
0.075 9 7 9 8 9 10 9 11 9 13 9 16
0.085 8 9 8 11 8 12 8 14 8 17 8 21
0.095 8 11 8 13 8 15 8 17 8 20 8 25
With pool size (s = 10), cost function C = c1m + c2mg subject to V(π̂) = δ²/(z_{1−α} + z_{1−γ})², with c1 = 50, c2 = 800, σ_b² = 0.5 and significance level α = 0.05. For ten values of π̃, six values of power (1 − γ), and three values of δ.

[Figure 2.1: four panels (a–d). Legends show five values of σ_b² (0.25, 0.45, 0.65, 0.85, 1.05) in panels a, c and d, and five values of c1 (25, 100, 175, 250, 325) in panel b; axis labels are given in the caption below.]
Figure 2.1. Ratio of the number of clusters and number of pools per cluster.
a) Ratio of the number of clusters and number of pools per cluster, m/g, as a function of the proportion (π̃), for C = 10000, c1 = 250, c2 = 800, s = 10, S_e = 0.98, S_p = 0.96 and five different values of σ_b²; b) Ratio of m/g as a function of c2, for C = 10000, σ_b² = 0.5, π̃ = 0.05, s = 10, S_e = 0.98, S_p = 0.96 and several values of c1; c) Required number of clusters, m, as a function of the desired confidence interval width (ω), for c1 = 50, c2 = 1600, π̃ = 0.05, s = 10, S_e = 0.98, S_p = 0.96, and five different values of σ_b²; d) Required number of clusters, m, as a function of the desired power (1 − γ), for c1 = 50, c2 = 1600, π̃ = 0.04, π0 = 0.015, s = 10, S_e = 0.98, S_p = 0.96, α = 0.05 and five different values of σ_b².

[Figure 2.2: four panels (a–d) plotting relative efficiency (RE and RE_t) against the intraclass correlation (ρ) for (m = 18, g = 22) and (m = 48, g = 20).]
Figure 2.2. Relative efficiency of unequal versus equal cluster sizes as a function of the intraclass correlation (ρ) for four distributions of cluster size: a) uniform, b) unimodal, c) bimodal, and d) positively skewed distribution.



Appendix 2.A. Derivation of the optimal solution for minimizing V(π̂) subject to C = c1m + c2mg (ci > 0, m, g ≥ 2, i = 1, 2).

By combining Eq. (12) and (13), we obtain the Lagrangean

L(m, g, λ) = L = V(π̂) + λ[C − (c1m + c2mg)]   (14)

where V(π̂) = [π̃(1 − π̃)]²σ_b²/m + V(p)/(mg) and λ is the Lagrange multiplier. The partial derivatives of Eq. (14) with respect to λ, m and g are

∂L/∂λ = 0 = C − (c1m + c2mg); then m = C/(c1 + c2g)

∂L/∂g = 0 = −V(p)/(mg²) − λc2m; then λ = −V(p)/(c2m²g²)

∂L/∂m = 0 = −[π̃(1 − π̃)]²σ_b²/m² − V(p)/(m²g) − λ(c1 + c2g)

⟺ [V(p)/(c2m²g²)](c1 + c2g) = [π̃(1 − π̃)]²σ_b²/m² + V(p)/(m²g), since λ = −V(p)/(c2m²g²)

⟺ V(p)c1 + V(p)c2g = c2g²[π̃(1 − π̃)]²σ_b² + c2gV(p)

⟺ V(p)c1 = c2g²[π̃(1 − π̃)]²σ_b²

⟺ g = √{c1V(p)/(c2[π̃(1 − π̃)]²σ_b²)}
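The optimal allocation just derived, g = √{c1V(p)/(c2[π̃(1 − π̃)]²σ_b²)} with m = C/(c1 + c2g), can be checked numerically. The Python sketch below uses illustrative values only; in particular the pool-level variance term V(p) is set to an arbitrary constant, not a value computed from this chapter's formulas.

```python
import math

# Assumed illustrative inputs (not values from the dissertation)
c1, c2 = 50.0, 800.0      # cost per cluster and cost per pool
C = 10000.0               # total budget
pi = 0.05                 # prevalence pi-tilde
sigma2_b = 0.5            # between-cluster variance of the random effect
Vp = 2.0                  # assumed pool-level variance component V(p)

A = (pi * (1 - pi)) ** 2 * sigma2_b   # first-stage variance term

# Optimal pools per cluster: g = sqrt(c1*V(p) / (c2*A))
g = math.sqrt(c1 * Vp / (c2 * A))
# Clusters from the budget constraint: m = C / (c1 + c2*g)
m = C / (c1 + c2 * g)

# The minimized variance V(pi-hat) = A/m + V(p)/(m*g)
V = A / m + Vp / (m * g)
print(g, m, V)

# Sanity check: perturbing g while respecting the budget cannot lower V
for g_alt in (0.5 * g, 2 * g):
    m_alt = C / (c1 + c2 * g_alt)
    assert A / m_alt + Vp / (m_alt * g_alt) >= V
```

The final check exercises the optimality condition directly: any other g that exhausts the same budget yields a variance at least as large.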

Appendix 2.B. Derivation of the optimal solution for minimizing C = c1m + c2mg subject to V(π̂) = V0.

By combining Eq. (12) and (13), we obtain the Lagrangean

L(m, g, λ) = L = c1m + c2mg + λ[V(π̂) − V0]   (14)

where V(π̂) = [π̃(1 − π̃)]²σ_b²/m + V(p)/(mg). Now the partial derivatives of L with respect to λ, m and g are

∂L/∂λ = 0 = [π̃(1 − π̃)]²σ_b²/m + V(p)/(mg) − V0; then m = {[π̃(1 − π̃)]²σ_b² + V(p)/g}/V0

∂L/∂g = 0 = c2m − λV(p)/(mg²); then λ = c2m²g²/V(p)

∂L/∂m = 0 = c1 + c2g − (λ/m²){[π̃(1 − π̃)]²σ_b² + V(p)/g}

⟺ c1 + c2g = [c2g²/V(p)]{[π̃(1 − π̃)]²σ_b² + V(p)/g}, since λ = c2m²g²/V(p)

⟺ c1 + c2g = [c2g²/V(p)][π̃(1 − π̃)]²σ_b² + c2g

⟺ c1 = [c2g²/V(p)][π̃(1 − π̃)]²σ_b²

⟺ g = √{c1V(p)/(c2[π̃(1 − π̃)]²σ_b²)}

Appendix 2.C. Alternative derivation of the optimal solution for minimizing C = c1m + c2mg subject to V(π̂) = V0.

If the sampling budget is C, the allocation of units as given in equation (15) results in a minimal value of V(π̂), which in terms of cost is equal to:

V(π̂) = [(c1 + gc2)/C]{[π̃(1 − π̃)]²σ_b² + V(p)/g}   (C1)

where g = √{c1V(p)/(c2[π̃(1 − π̃)]²σ_b²)}. The solution to this problem can be derived directly since this budget C is also the minimal budget to obtain that particular value of V(π̂). If there were a smaller budget with other allocations yielding the same V(π̂), then our allocation (15) given C would not be optimal. This is true because the variable g appears in equation (C1) in the same manner as in equation (12), so that the optimal g of equation (15) is also the value of g which will minimize the cost of the sample if the variance of the estimate of the proportion (π̃) is fixed. Thus it also minimizes the cost–variance product (Brooks, 1955). Thus, if a value of V(π̂) equal to V0 = ω²/(4z²_{1−α/2}) is required, the minimal budget to obtain this V(π̂) follows by setting V(π̂) [as given in equation (C1)] equal to V0 = ω²/(4z²_{1−α/2}). Solving for budget C gives C = (c2g + c1){[π̃(1 − π̃)]²σ_b² + V(p)/g}/V0, and finally the corresponding optimal allocation of units follows from equation (15). Since m = C/(c1 + gc2), substituting C = (c2g + c1){[π̃(1 − π̃)]²σ_b² + V(p)/g}/V0 we obtain m = {[π̃(1 − π̃)]²σ_b² + V(p)/g}/V0 and g = √{c1V(p)/(c2[π̃(1 − π̃)]²σ_b²)}.
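The minimal-budget solution can likewise be verified numerically: for a target variance V0 = ω²/(4z²_{1−α/2}), the achieved variance at the resulting (m, g) equals V0 exactly. All inputs below, including the pool-level variance term V(p), are illustrative assumptions rather than values from the dissertation.

```python
import math
from statistics import NormalDist

# Assumed illustrative inputs (not values from the dissertation)
c1, c2 = 50.0, 800.0
pi, sigma2_b = 0.05, 0.5
Vp = 2.0                  # assumed pool-level variance component V(p)
omega, alpha = 0.05, 0.05 # desired CI width and significance level

z = NormalDist().inv_cdf(1 - alpha / 2)
V0 = omega ** 2 / (4 * z ** 2)            # target variance

A = (pi * (1 - pi)) ** 2 * sigma2_b
g = math.sqrt(c1 * Vp / (c2 * A))         # same optimal g as Appendix 2.A
Cmin = (c2 * g + c1) * (A + Vp / g) / V0  # minimal budget
m = Cmin / (c1 + c2 * g)                  # equals (A + V(p)/g)/V0

# The achieved variance equals the target
assert abs(A / m + Vp / (m * g) - V0) < 1e-9
print(Cmin, m, g)
```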

Appendix 2.D. Taylor series approximation (Eq. 23) of the RE in equation (Eq. 21) given by Van Breukelen et al. (2007)

Taylor series approximation (23) is derived from the RE of equation (21) in four steps.

Step 1. Let the cluster sizes be independent realizations of a random variable U with expectation μ_U and standard deviation σ_U. Equation (19) is a moment estimator of

RE(π̂) = [(ḡ + λ)/ḡ] E[U/(U + λ)]   (D1)

where λ = (1 − ρ)/ρ ≥ 0 and ḡ is the mean cluster size.

Step 2. Define e = (U − μ_U); then the last term in (D1) can be written as:

E[U/(U + λ)] = E[(μ_U + e)/(μ_U + λ + e)] = E{[(μ_U + e)/(μ_U + λ)] · 1/[1 + e/(μ_U + λ)]}

The last term is a Taylor series [Mood et al. (1974), p. 533, equation (34)]:

1/[1 + e/(μ_U + λ)] = Σ_{k=0}^∞ [−e/(μ_U + λ)]^k

if −(μ_U + λ) < e < (μ_U + λ) to ensure convergence. Since e = U − μ_U and λ ≥ 0, this convergence condition will be satisfied, except for a small probability P(U > 2μ_U + λ) for strongly positively skewed cluster size distributions combined with large ρ (= small λ). Thus we have:

E[U/(U + λ)] = E{[(μ_U + e)/(μ_U + λ)] Σ_{k=0}^∞ [−e/(μ_U + λ)]^k}   (D2)

Step 3. If we ignore all terms e^k with k > 2, and rearrange terms in (D2), we will have

E[U/(U + λ)] = h[1 − CV² h(1 − h)]   (D3)

where h = μ_U/(μ_U + λ) ∈ (0, 1], assuming ḡ = μ_U, and CV = σ_U/μ_U is the coefficient of variation of the random variable U.

Step 4. Plugging (D3) into (D1) gives:

RE(π̂) ≈ 1 − CV² h(1 − h)   (D4)

Remark. Ignoring in (D2) only those e^k terms with k > 4 instead of 2 will give

RE(π̂) ≈ 1 − (1 − h)[h CV² − h² CV³ skew + h³ CV⁴ (kurt + 3)]   (D5)

where skew and kurt are the skewness and kurtosis of the cluster size distribution, that is, skew = the 3rd central moment of U divided by σ_U³, and kurt = the 4th central moment of U divided by σ_U⁴, minus 3 (see, for example, Mood et al., 1974, p. 76).
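Approximation (D4) can be compared with the exact expectation for any cluster-size distribution whose moments are available. The sketch below uses a two-point cluster-size distribution chosen purely for illustration; it is not one of the distributions analyzed in the dissertation.

```python
# Checking approximation (D4) against the exact expectation for a simple
# discrete cluster-size distribution (values chosen for illustration).
sizes = [20, 40]          # two equally likely cluster sizes
probs = [0.5, 0.5]
rho = 0.05                # intraclass correlation (assumed)
lam = (1 - rho) / rho     # lambda = (1 - rho)/rho

mu = sum(p * u for p, u in zip(probs, sizes))
var = sum(p * (u - mu) ** 2 for p, u in zip(probs, sizes))
cv2 = var / mu ** 2       # squared coefficient of variation
h = mu / (mu + lam)

# Exact RE = ((mu + lam)/mu) * E[U/(U + lam)]
exact = (mu + lam) / mu * sum(p * u / (u + lam) for p, u in zip(probs, sizes))
approx = 1 - cv2 * h * (1 - h)      # Eq. (D4)
print(exact, approx)
assert abs(exact - approx) < 0.01   # close for modest CV and rho
```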

Chapter 3: A regression group testing model for a two-stage survey under informative sampling for detecting and estimating the presence of transgenic corn
Abstract

Group testing regression methods are effective for estimating and classifying binary
responses and can substantially reduce the required number of diagnostic tests. For this
reason, these methods have been used for the detection and estimation of transgenic corn
in México. However, there is no appropriate methodology when the sampling process is
complex and informative. In these cases researchers often ignore the stratification and
weights which can severely bias the estimates of the population parameters. In this paper
we developed group testing regression models for the analysis of surveys conducted
using two stages with unequal selection probabilities and informative sampling. Weights
are incorporated into the likelihood function using the pseudo-likelihood approach. A
simulation study demonstrates that the proposed model considerably reduces the bias in
estimation compared to other methods that ignore the weights. Finally, we give an
example to clarify the use of the proposed method and SAS code for the analysis.

Key words: complex survey, group testing, informative sampling, transgenic corn.

3.1 Introduction

Group testing is a technique for screening samples for an attribute when samples are
grouped into pools (or batches), and each pool is tested for presence of the attribute
where all samples in the pool are cleared of having the attribute if the pool tests negative. When the proportion of samples with the attribute is less than 10%, group testing is very attractive because it produces significant savings in the number of diagnostic tests required and time expended, and helps to preserve the anonymity of tested subjects. Group testing
was first used by Dorfman (1943) for detecting soldiers with syphilis during World War

II and has been used to estimate the prevalence of a wide variety of diseases in humans,
animals and plants (Verstraeten et al., 1998; Cardoso et al., 1998; Kacena et al., 1998a,
1998b; Chen, 2009; Muñoz-Zanzi et al., 2000 and Bilder et al., 2004). It has also been
used for analyzing biomarker data (Zhang et al., 2012), detecting drugs (Xie, 2001),
solving problems in information theory (Wolf, 1985) and even in science fiction (Bilder,
2009).
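The magnitude of these savings is easy to illustrate with Dorfman's original two-stage scheme: test each pool of s samples, then retest the members of positive pools individually. Assuming a perfect test and independent samples with prevalence π, the expected number of tests per sample is 1/s + 1 − (1 − π)^s. The sketch below, with an assumed prevalence of 1%, searches for the pool size minimizing this quantity.

```python
# Expected tests per sample under Dorfman two-stage pooling, assuming a
# perfect diagnostic test and independent samples (illustrative sketch).
def dorfman_tests_per_sample(prev, s):
    """1 pool test per s samples, plus s retests when the pool is positive."""
    return 1.0 / s + 1.0 - (1.0 - prev) ** s

prev = 0.01  # 1% prevalence (assumed)
best = min(range(2, 51), key=lambda s: dorfman_tests_per_sample(prev, s))
print(best, dorfman_tests_per_sample(prev, best))
# At low prevalence the expected cost is well below one test per sample
assert dorfman_tests_per_sample(prev, best) < 0.25
```

At 1% prevalence the optimal pool size is around eleven, and fewer than one test in five is needed per sample, which is the kind of saving that motivates pooling rare attributes such as transgenes.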

Group testing regression methods are available for fixed and mixed (fixed + random)
effects. Farrington (1992) proposed a very simple model for group testing with fixed
effects. Vansteelandt et al. (2000) proposed a model with covariates that uses only initial
group responses for estimation. The group testing model of Xie (2001) and that of
Vansteelandt et al. (2000) allow the inclusion of retesting information of subsets of
positive pools. Chen et al. (2009) proposed four goodness-of-fit tests for group testing regression models. However, these three papers are only valid when no random effects are included in the linear predictor. For this reason, Chen et al. (2009) give group testing
regression models for imperfect diagnostic tests (with sensitivity and specificity less than
1) with fixed and random effects, that produce the most accurate estimates when the
sampling process is in clusters. More recently, McMahan et al. (2013) also provide group
testing regression models for mixed effects in the presence of dilution effects. Delaigle
and Meister (2011) and Delaigle and Hall (2012) give non-parametric group testing
regression models.

All the group testing regression methods developed so far are based on the
assumption that selection probabilities are the same for all clusters and individuals, and
weights are not required for the estimation process. Thus these methods are only valid
when clusters are of the same size and simple random samples of clusters and individuals
are taken. Also, they do not take into account stratification at the cluster or individual
levels. In the non-group testing context for two-level linear (or linear mixed) and
generalized linear mixed models, Graubard and Korn (1996), Korn and Graubard (2003),
Pfeffermann et al. (1998), Grilli and Pratesi (2004), and Rabe-Hesketh and Skrondal
(2006) have discussed proper use of sampling weights. However, there is no work on

incorporation of sampling weights in group testing regression models. For example, for
detecting and estimating the presence of transgenic corn, researchers use complex
sampling schemes (eg. two or three stages with clusters and stratification and unequal
selection probabilities) for collecting plants in the field, and to save money and time, they
use group testing diagnostic tests on samples of s plants (Piñeyro-Nelson et al., 2009).
However, due to the lack of an appropriate methodology for analyzing data they ignore
the complex sampling design, which violates the basic assumptions underlying multilevel
models.

Use of appropriate group testing methodology for a complex survey can result in
substantial savings without a significant loss of precision and can be used for estimating
the prevalence of a rare attribute such as transgenic corn or human diseases.

A sampling process is informative when the sampling probabilities are related to the
values of the outcome variable after conditioning on the model covariates (Pfeffermann,
2006). For example, assume a two-stage sampling design is used for estimating the
prevalence of transgenic corn in México with fields as the primary sampling units and
plants as secondary sampling units. If fields were sampled with a probability that is
proportional to field size (PPS), the sample of fields will tend to contain mostly large
fields, and if field size is related to the prevalence but is not included among the model
covariates, the sample of fields will not accurately represent the fields in the population
and the sampling is informative (Pfeffermann, 2006). In the context of estimating
transgenic corn in México, this makes sense because most commercial corn fields are
larger and more likely to contain transgenic corn than noncommercial corn fields. More
examples of an informative sampling scheme being ignored in the inference process can
be found in Kasprzyk et al. (1989), Skinner et al. (1989) and Pfeffermann (1993, 1996).
In general terms, informative sampling results when the probability density of the sample
data is different from the density for the population before sampling. Ignoring the
sampling process in such cases may yield severely biased estimates of population model
parameters, possibly leading to false inferences.

In theory, the effect of sample selection can be controlled by including all of the
design covariates. However, this is often not practical, because the design variables may
not be available or known or because there may be too many of them, making fitting and
validation of such models formidable (Pfeffermann and Sverchkov, 2007). One approach
for dealing with informative sampling that commonly produces good results, is to include
design (sampling) weights to account for unequal selection probabilities. Because the
weights are incorporated in the likelihood function, this approach is called pseudo-
maximum likelihood (PML). Another approach for dealing with this problem is the
sample model; it consists of extracting the model for the sample data given the selected
sample (Pfeffermann, 2006). However, with this approach it is sometimes not possible to
extract the probability density function (pdf) for the sample data. For this reason, the
PML approach is still the most popular approach and produces good results.

Before the PML approach, estimating logistic regression model parameters and
standard errors for complex sample survey data was based on weighted least squares
(WLS) estimation as proposed by Grizzle et al. (1969). The WLS estimation method was
originally programmed for logistic regression in the GENCAT software package (Landis
et al., 1976) and remains available as an option in programs such as SAS PROC
CATMOD. However, Binder (1981, 1983) presented the PML framework for fitting
logistic regression and other generalized linear models to complex sample survey data as
a technique for estimating model parameters. The PML approach to parameter estimation
was combined with a linearized estimator of the variance-covariance matrix for the
parameter estimates, which accounted for complex sample design features. Further
development and evaluation of the PML approach was presented in Roberts et al. (1987),
Morel (1989), and Skinner et al. (1989). The PML approach is now the standard method
for logistic regression modeling in all of the major software systems that support the
analysis of complex sample survey data (Heeringa et al., 2010).

The prominent feature of this approach is that it utilizes the sampling weights to
estimate the likelihood equations that would have been obtained in the case of a census.
However, for mixed models, the PML approach needs the sampling weights for sampled

elements (level 1) and clusters (level 2). Because the level 1 and level 2 weights appear in
separate places within the PML estimator function, it is not sufficient to know the product
of the level 1 and level 2 weights, as happens in conventional analyses. Also, level 1
weights have to be scaled to produce precise estimates of the variance components. For
this reason, some rescaling methods have been proposed. Pfeffermann et al. (1998) and
Korn and Graubard (2003), in the context of linear mixed models, point out that scaling
the weights at level 1 produces estimates of the variance components (particularly the
random-intercept variance) with little bias even in small samples.

The goal of this paper is to generalize the group testing methodology to surveys
conducted in two stages with stratification and different cluster sizes when the sampling
is informative. We solve this problem by using the PML approach, by incorporating
sampling weights at both levels to estimate the population likelihood equations that
would have been obtained in the case of a census. The article is organized as follows.
Section 3.2 presents the sampling design and generalized linear model. Section 3.3
describes the probability of selection under stratification at the cluster and individual
levels. Section 3.4 gives the level 1 scaling methods. Section 3.5 shows how to
incorporate the weights in the pseudo-maximum likelihood (PML). Section 3.6 presents
the details of the simulation study we performed. Section 3.7 presents the results and
discussions. Section 3.8 contains an application of the proposed method and section 3.9
gives the conclusions.

3.2 Sampling design and generalized linear model

Suppose that a whole population of M clusters [level 2 units, primary sampling units or fields] with N_i elementary units [level 1 units, subjects or plants] is to be sampled following a two-stage sampling scheme. At the first stage m < M fields are selected with inclusion probabilities π_i (i = 1, 2, …, M) which are correlated with the cluster random effect (b_i). At the second stage, n_i plants are selected within the ith field with probabilities π_{j|i} (j = 1, 2, …, N_i) that may be correlated with the outcomes after conditioning on the regressors x_ij. Since the cluster random effect and the response variable are viewed as random under the model, so are the selection probabilities under informative sampling. The unconditional sample inclusion probabilities are then π_ij = π_{j|i}π_i.

Given that group testing will be used, plants must be assigned to pools in some way where each pool is tested for a transgene. Suppose that the n_i plants from the ith field are randomly assigned to one of the g_i pools such that there are s_ij plants in pool j from field i. Further, let y_ijk = 1 if the kth plant in the jth pool of field i is transgenic and y_ijk = 0 otherwise, for i = 1, 2, …, m, j = 1, 2, …, g_i and k = 1, 2, …, s_ij. Since we are using group testing and we will only observe the response of each pool, define the random variable z_ij = 1 if the jth pool in the ith field tests positive for transgenes and z_ij = 0 otherwise. Therefore the two-level generalized linear mixed model for the response z_ij can be specified with the linear predictor of a generalized linear mixed model (Breslow and Clayton, 1993; Rabe-Hesketh, 2006)

η_ijk = β0 + β1x_ij + b_i   (1)

Here β0 is the intercept, x_ij is a covariate associated with fixed effects at the individual level and β1 is the slope; b_i is the random effect of the ith field or cluster, which is Gaussian iid with mean zero and variance σ_b². The conditional distribution of y_ijk is Bernoulli(π_ijk), and assuming the logit link function ln[π_ijk/(1 − π_ijk)] gives

π_ijk = π_ijk(β0, β1, σ_b) = exp(η_ijk)/[1 + exp(η_ijk)]   (2)

Chen et al. (2009) assumed that, conditional on the random effect b_i, the probability of a positive pool taking into account the sensitivity (S_e) and specificity (S_p) of the diagnostic test is given as

P(z_ij = 1|b_i) = S_e + (1 − S_e − S_p)∏_{k=1}^{s_ij}(1 − π_ijk)   (3)

S_e is the probability of a positive test given that a plant is transgenic (i.e., the ability of a test to correctly identify transgenic plants). S_p is the probability of a negative test given that the plant is not transgenic (i.e., the ability of the test to correctly identify non-transgenic plants). S_e and S_p are assumed constant and near 1. Now let θ = (β0, β1, σ_b)

denote the vector of all estimable parameters. The multilevel likelihood is calculated for each level of nesting. First, the conditional likelihood for pool j in field i is given by

L_ij(θ|b_i) = [P(z_ij = 1|b_i)]^{z_ij}[1 − P(z_ij = 1|b_i)]^{1−z_ij}   (4)

where P(z_ij = 1|b_i) is defined in Eq. 3. Next, to obtain the independent contribution of a field to the likelihood, the field-level random effects are integrated out:

L_i(θ) = ∫_{−∞}^{∞} ∏_{j=1}^{g_i} L_ij(θ|b_i) f(b_i)db_i,

where f(b_i) is the N(0, σ_b²) density, with the final likelihood being the product of field likelihoods:

L = ∏_{i=1}^{m} L_i(θ)

Finally, combining the expression for all the fields (clusters), the overall marginal likelihood is

L = ∏_{i=1}^{m} ∫_{−∞}^{∞} ∏_{j=1}^{g_i} L_ij(θ|b_i) f(b_i)db_i   (5)
Maximum Likelihood Estimation (MLE) has been used to find the estimate of θ, but first adaptive Gauss–Hermite quadrature (Pinheiro and Bates, 2000) is used to approximate the integral in (5) and then the approximated log-likelihood is maximized with respect to θ using the Newton–Raphson procedure or related algorithms (Chen et al., 2009). This approach works well when the vector of random effects is small. The MLE of θ can also be obtained using a Monte Carlo Expected Maximization (MCEM) algorithm (McCulloch, 1997).
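A minimal numeric sketch of Eqs. (2), (3) and (5) for a single field follows. The parameter values, covariates and pool responses are assumed for illustration, and the integral in (5) is approximated with ordinary (non-adaptive) Gauss–Hermite quadrature, which is simpler than the adaptive rule used in practice.

```python
import numpy as np

# Assumed illustrative values (not estimates from the dissertation)
beta0, beta1, sigma_b = -4.46, 0.8, 0.99
Se, Sp = 0.98, 0.96          # sensitivity and specificity
x_ij = [0.0, 1.0, 0.5]       # one covariate value per pool
s_ij = [10, 10, 10]          # pool sizes
z_ij = [0, 1, 0]             # observed pool test results for one field

def p_pool_positive(b):
    """Eq. (3): P(z_ij = 1 | b_i) for each pool in the field."""
    probs = []
    for x, s in zip(x_ij, s_ij):
        eta = beta0 + beta1 * x + b                  # Eq. (1)
        pi = np.exp(eta) / (1 + np.exp(eta))         # Eq. (2)
        probs.append(Se + (1 - Se - Sp) * (1 - pi) ** s)
    return np.array(probs)

def cond_lik(b):
    """Product over pools of Eq. (4), given the random effect b_i."""
    p = p_pool_positive(b)
    z = np.array(z_ij)
    return np.prod(p ** z * (1 - p) ** (1 - z))

# Eq. (5) for one field: integrate over b_i ~ N(0, sigma_b^2) with
# Gauss-Hermite quadrature (change of variable b = sqrt(2)*sigma_b*t).
nodes, weights = np.polynomial.hermite.hermgauss(30)
L_i = sum(w * cond_lik(np.sqrt(2) * sigma_b * t)
          for t, w in zip(nodes, weights)) / np.sqrt(np.pi)
print(L_i)
assert 0 < L_i < 1
```

As π_ijk → 0 (a field with essentially no transgenic plants), Eq. (3) collapses to S_e + (1 − S_e − S_p) = 1 − S_p, the false-positive rate, which is a useful sanity check on the implementation.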

When survey data have been collected under a complex sample design,
straightforward application of MLE procedures is no longer possible, for several reasons.
First, the probabilities of selection of each cluster or individual are generally no longer
equal. Sampling weights are thus required to estimate the finite population values of
logistic regression model parameters. Second, the stratification and clustering of complex
sample observations violates the assumption of independence of observations that is
crucial to the standard MLE approach for estimating the sampling variances of the model
parameters and choosing a reference distribution for the likelihood ratio test statistic
(Heeringa et al., 2010). Also, when the sampling weights are related to the values of the

model’s outcome variable after conditioning on the model covariates, sampling is


informative and the observed outcomes are no longer representative of the population
outcomes. Thus the appropriate model for the sample data is different from the model for
the finite population (Pfeffermann and Sverchkov, 2009).

3.3 Probability of selection

As mentioned earlier, when the sampling design is informative, the MLEs of the parameters of interest obtained by maximizing the likelihood function given in equation 5 may be seriously biased. For this reason, it is of paramount importance to incorporate design weights in the likelihood function. Considering two strata [i.e., M = M1 + M2] at the cluster level, and that m_h clusters from each stratum are sampled with probabilities that are proportional to their sizes N_hi (number of units in the ith cluster of the hth stratum), the probability of selection of a cluster is

π_hi = m_h N_hi / Σ_{i=1}^{M_h} N_hi   (6)
Also, assume that within each cluster the individuals are classified into two strata [i.e., n_i = n_{i1*} + n_{i2*}], h* = 1, 2, and that a number of units n_{ih*} is subsequently sampled from each stratum of each cluster, which implies that the probabilities of selection are

π_{j|ih*} = n_{ih*}/N_{ih*}   (7)

Such designs are self-weighting in the sense that all units have the same unconditional probability of selection,

π_{h*ij} = π_{j|ih*}π_hi = (n_{ih*}/N_{ih*})(m_h N_hi / Σ_{i=1}^{M_h} N_hi)
The “raw” design weights are obtained as the inverse of the probabilities of selection (w_hi = 1/π_hi, w_{j|ih*} = 1/π_{j|ih*} and w_{h*ij} = 1/π_{h*ij}). However, these “raw” weights need to be scaled to be used under a mixed model approach to avoid significant bias in the parameter estimates (Pfeffermann et al., 1998). For this reason, some scaling methods have been proposed. In general, most scaling methods produce better estimates than unweighted analyses. However, for the purposes of this research, we concentrate our attention on the three scaling methods that are reported as providing the least biased estimates in general. Due to the two-stage sampling process, we will have scaled weights for the two levels.
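A toy numeric example of the selection probabilities (6) and (7) and the resulting raw weights follows; the cluster counts and sizes are assumed for illustration only.

```python
# Toy numeric check of Eqs. (6)-(7) and the raw design weights
# (cluster counts and sizes are assumed for illustration).
N_h = [40, 35, 25]        # sizes N_hi of the M_h = 3 clusters in stratum h
m_h = 2                   # clusters sampled from this stratum
pi_hi = [m_h * N / sum(N_h) for N in N_h]        # Eq. (6), PPS selection

N_ih = 40                 # units in substratum h* of a sampled cluster
n_ih = 8                  # units drawn from that substratum
pi_j_given_i = n_ih / N_ih                        # Eq. (7)

# Unconditional inclusion probability and "raw" weights
pi_ij = [pi_j_given_i * p for p in pi_hi]
w_cluster = [1 / p for p in pi_hi]
w_within = 1 / pi_j_given_i
print(pi_hi, pi_ij, w_within)
assert abs(sum(pi_hi) - m_h) < 1e-12   # PPS probabilities sum to m_h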

3.4 Level 1 scaling methods

Pfeffermann et al. (1998) and Korn and Graubard (2003) showed that scaling of the weights is very important to obtain estimators with little bias even in small samples. However, they also state that scaling is not relevant for cluster weights, since multiplying the log-likelihood by a constant does not change the PML estimates (it simply inflates the information matrix by that constant). In contrast, the effect of scaling the level 1 weights on the small sample behavior of the PML estimator is vital (Grilli and Pratesi, 2004). The most popular types of scaling are method A (or type 2) and method B (or type 1) (Pfeffermann et al., 1998; Grilli and Pratesi, 2004; Rabe-Hesketh, 2006) and method D (Rabe-Hesketh, 2006). These three scaling methods are used in the simulation study performed in Section 3.6. At level 1 (elementary units) under method A (type 2), the scaled weight is obtained as

w*_{j|ih*} = w_{j|ih*}/w̄_{j|ih*}   (8)

where w̄_{j|ih*} = (Σ_j w_{j|ih*})/n_i and n_i is the number of sample units in cluster i. With this scaling method, the new within-cluster weights add up to the cluster sample size, Σ_j w*_{j|ih*} = n_i.
The scaled weight for method B (type 1) for level 1 is given by

w*_{j|ih*} = w_{j|ih*} n_i*/Σ_j w_{j|ih*}   (9)

where n_i* is the effective cluster sample size for cluster i, n_i* = (Σ_j w_{j|ih*})²/Σ_j w²_{j|ih*}. With this scaling method, the new within-cluster weights add up to the effective cluster sample size, Σ_j w*_{j|ih*} = n_i*.
Simulations in Pfeffermann et al. (1998) suggest that method B works better than method A for informative weights. Such a scaling factor has also been used by Clogg and Eliason (1987) in a different context. Instead of scaling the level 1 weights, Graubard and Korn (1996) suggested a ‘method D’ which does not use any weights at level 1. This method D scales the cluster weights as:

w*_i = (Σ_{j=1}^{n_i} w_{j|ih*}) w_i,

and the level 1 weights are w*_{j|ih*} = 1. This method seems appealing for pooled samples because we are mixing the material of s individuals, which implies that the weight of the pool is not required. Korn and Graubard (2003) pointed out that moment estimators of the variance components using these weights are approximately unbiased under non-informative sampling at level 1. The three methods described have an intuitive meaning, but do not always produce good results (Pfeffermann et al., 1998). Also, it is important to recall that we have conditional weights (w_{j|ih*}) at the individual level; however, since we are pooling the material of s plants per pool, the weight for each pool can be formed in three ways: using the average weight of the s individuals forming the pool, using the individual weights directly, or using the sum of the s individual weights as the pool weight.
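The three scaling methods can be sketched for a single cluster's raw conditional weights (values assumed for illustration); the assertions verify the sum properties stated above (method A weights sum to n_i, method B weights to n_i*).

```python
# Sketch of the three level 1 scaling methods for one cluster's raw
# conditional weights (values assumed for illustration).
w = [2.0, 2.0, 5.0, 5.0, 8.0]     # raw within-cluster weights w_{j|ih*}
n_i = len(w)

# Method A (type 2): divide by the mean weight; scaled weights sum to n_i
w_bar = sum(w) / n_i
wA = [wj / w_bar for wj in w]

# Method B (type 1): scaled weights sum to the effective sample size n_i*
n_star = sum(w) ** 2 / sum(wj ** 2 for wj in w)
wB = [wj * n_star / sum(w) for wj in w]

# Method D: no level 1 weights; the cluster weight absorbs sum(w)
w_cluster_raw = 3.0                # assumed cluster weight w_hi
wD_cluster = sum(w) * w_cluster_raw
wD = [1.0] * n_i

assert abs(sum(wA) - n_i) < 1e-12
assert abs(sum(wB) - n_star) < 1e-12
print(wA, wB, n_star, wD_cluster)
```

Note that n_i* ≤ n_i always, with equality only when all raw weights in the cluster are equal, which is why method B downweights clusters with very unequal weights.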

3.5 Incorporating the weights in the PML

The PML approach is required when the sampling mechanism is informative. However,
incorporating the weights in the likelihood is complicated by the fact that the population
log-likelihood is not a simple sum of elementary unit contributions, but rather a function
of sums across level 2 and level 1 units. In addition, the implementation of the PML
approach requires knowing the inclusion probabilities at both levels. Using only second
level weights or only first level weights may yield poor results (Grilli and Pratesi, 2004).
The weighted pseudo-likelihood taking into account stratification at both levels is equal
to
L = ∏_{h=1}^{2} ∏_{i=1}^{m_h} [∫_{−∞}^{∞} ∏_{h*=1}^{2} ∏_{j=1}^{g_i} [L_ij(θ|b_i)]^{w*_{j|ih*}} f(b_i)db_i]^{w_hi}

where w*_{j|ih*} is the scaled weight for pool j in stratum h, field i and substratum h*, and w_hi is the field weight in stratum h. Here the weights enter the log-pseudolikelihood as if they were frequency weights, representing the number of times that each unit should be replicated to estimate the likelihood that would have been obtained in a census. Also, it is clear from the form of the likelihood that we cannot simply use one set of weights based on the overall inclusion probabilities but must use separate weights at each level, which implies that the self-weighting property of multistage designs is lost. The log-pseudolikelihood is given as

ℓ(θ) = Σ_{h=1}^{2} Σ_{i=1}^{m_h} w_hi ℓ_{2hi}(y_{(2i)}; θ)   (11)

where ℓ_{2hi}(y_{(2i)}; θ) = log ∫_{−∞}^{∞} exp{Σ_{h*=1}^{2} Σ_{j=1}^{g_i} w*_{j|ih*} log[L_ij(θ|b_i)]} f(b_i)db_i.

Maximization of the weighted log-likelihood (11) involves computing several integrals


that do not have a closed-form solution, so a numerical approximation technique is
required. A standard solution to this problem is provided by using Gaussian quadrature
(e.g., Rabe-Hesketh et al., 2002; 2005; 2006). However, since this method is based on a
summation over an appropriate set of points it is only efficient when the dimensionality
of the integrals is low. The NLMIXED procedure of SAS (SAS Institute, 1999) is a
general procedure for fitting nonlinear random effects models using adaptive Gaussian
quadrature. For this reason, it will be implemented for maximizing the expression (11).
Another very important point is that inserting the weights in the log-likelihood implies
the use of a design consistent estimator of the population score function.
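The structure of the log-pseudolikelihood, with scaled pool-level weights acting as powers inside the integral and field weights multiplying the log of the integral, can be sketched as follows. All weights, data and parameter values are assumed for illustration, and the model uses no covariate.

```python
import numpy as np

# Sketch of the log-pseudolikelihood (11) for one stratum with two fields;
# all weights, data and parameter values are assumed for illustration.
beta0, sigma_b, Se, Sp, s = -4.46, 0.99, 0.98, 0.96, 10
nodes, wq = np.polynomial.hermite.hermgauss(30)

def pool_prob(b):
    pi = 1 / (1 + np.exp(-(beta0 + b)))
    return Se + (1 - Se - Sp) * (1 - pi) ** s     # Eq. (3), no covariate

def field_loglik(z, w1):
    """log of the inner weighted integral: pool weights w1 act as powers."""
    vals = []
    for t, wk in zip(nodes, wq):
        p = pool_prob(np.sqrt(2) * sigma_b * t)
        loglik = sum(wj * (zj * np.log(p) + (1 - zj) * np.log(1 - p))
                     for zj, wj in zip(z, w1))
        vals.append(wk * np.exp(loglik))
    return np.log(sum(vals) / np.sqrt(np.pi))

# Two fields with pool responses z, scaled pool weights w1, field weights w2
fields = [([0, 0, 1], [1.1, 0.9, 1.0], 2.5),
          ([0, 0, 0], [1.0, 1.0, 1.0], 4.0)]
ell = sum(w2 * field_loglik(z, w1) for z, w1, w2 in fields)
print(ell)
assert ell < 0     # a log-pseudolikelihood of probabilities is negative
```

The field weight w2 multiplies the whole field contribution outside the log-integral, while the pool weights w1 sit inside the exponent; this is exactly why knowing only the product of level 1 and level 2 weights is not sufficient.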

The NLMIXED procedure of SAS (SAS Institute, 2011) has various optimization techniques to carry out the maximization. The default, used in the simulations in Section 3.6, is a dual quasi-Newton algorithm using the Cholesky factor of an approximate Hessian (SAS Institute, 2011). Though the NLMIXED procedure does not include an option for PML estimation, Grilli and Pratesi (2004) show how to insert the level 1 and level 2 weights in the likelihood, as explained in the application (Section 3.8).

3.5.1 Sandwich estimator of the standard errors


The asymptotic covariance matrix of the maximum likelihood estimator (f) is given as
ÀEº(f) =  J  J (12)

Here  is the expected Fisher information and

ˆℓ(A; f) ˆℓ(A; f)
 ≡  —
ˆf ˆf  d Á
56

The expected Fisher information \mathcal{I} is estimated by the observed Fisher information
at the maximum likelihood estimates. This is why the sandwich does not collapse: since the
pseudo-likelihood is not exactly the distribution of the population responses, \mathcal{I}
and \mathcal{J} need not coincide. The estimator \hat{\mathcal{J}} is obtained by exploiting
the fact that the pseudo-likelihood is a sum of independent cluster contributions, so that

\frac{\partial \ell(y;\theta)}{\partial \theta} = \sum_{h} \sum_{i} w_{i}^{(h)} \frac{\partial \ell_{i}^{(h)}\left(y_{i}^{(h)};\theta\right)}{\partial \theta} \equiv \sum_{h} \sum_{i} s_{i}^{(h)}(\theta)

We then estimate \mathcal{J} by:

\hat{\mathcal{J}} = \sum_{h} \frac{m_h}{m_h - 1} \sum_{i} s_{i}^{(h)}\left(\hat{\theta}\right) s_{i}^{(h)}\left(\hat{\theta}\right)^{\top}

where s_{i}^{(h)} is the weighted score vector of the top-level unit (cluster) i in stratum h,
and m_h is the number of sampled clusters in stratum h. The sandwich
estimator described in this section has been implemented in NLMIXED of SAS 9.3.
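The cluster-sum structure of the sandwich can be illustrated numerically. The following is a toy sketch (made-up score vectors and information matrix, not output from the SAS implementation) of forming Ĵ from per-cluster weighted scores and combining it with the observed information as in Eq. (12):

```python
import numpy as np

# Toy per-cluster weighted score vectors s_i^(h)(theta_hat): one row per
# top-level unit (cluster), grouped by stratum h.
scores = {
    1: np.array([[0.3, -0.1], [-0.2, 0.4]]),               # stratum 1: m_1 = 2 clusters
    2: np.array([[0.1, 0.2], [-0.3, -0.1], [0.2, -0.2]]),  # stratum 2: m_2 = 3 clusters
}

# J_hat = sum_h [m_h / (m_h - 1)] * sum_i s_i s_i^T
J = sum((s.shape[0] / (s.shape[0] - 1)) * (s.T @ s) for s in scores.values())

# Observed information at the PML estimate (toy positive-definite value)
I = np.array([[5.0, 0.5], [0.5, 2.0]])

Iinv = np.linalg.inv(I)
cov = Iinv @ J @ Iinv          # sandwich covariance, Eq. (12)
se = np.sqrt(np.diag(cov))     # robust (design-based) standard errors
```

If the pseudo-likelihood were the exact population likelihood, J would estimate the same matrix as I and the sandwich would collapse to the usual inverse information.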

3.6 Simulation study

A Monte Carlo experiment was carried out to assess the performance of PML estimation
and the sandwich estimator under group testing. This experiment reflects the two-stage
scheme explained in section 2. First, the finite population values with dichotomous
responses were generated from the two-level superpopulation model with linear
predictor η_ij = β₀ + b_i, i = 1, 2, …, M, where b_i ~ N(0, σ_b²); response variable
y_ij | b_i ~ Bernoulli(π_ij), j = 1, 2, …, N_i; and logit link logit(π_ij) = η_ij. We used
β₀ = −4.4631 and σ_b = 0.9888 as our true model parameter values. Therefore, we simulated
the individual responses y_ij according to a Bernoulli distribution with mean
π_ij = 1/(1 + exp(−β₀ − b_i)). The number of clusters (level 2 units) that form the finite
population is M = 300. These clusters were stratified into two strata by generating a normal
random variable x_i ~ N(0, 1), independent of y_ij, such that cluster i was assigned to
stratum 1 if |x_i| > 1 and to stratum 2 otherwise. This stratification of clusters resulted in 83
clusters belonging to stratum 1 and 217 to stratum 2. The size of each cluster (N_i) was
determined by N_i = 350 exp(u_i), with u_i generated from N(0, σ_b²), truncated below at
−0.1σ_b and above at 0.3σ_b. Therefore, the values of N_i in our finite population have a
mean of 389.89 and a range between 317 and 472 individuals. We adopted an informative
sampling process at both levels. For this reason, m clusters were selected with a
probability proportional to a 'measure of size' M_i, i.e., π_i = m M_i / Σ_{i=1}^{M} M_i, where
the measure M_i was determined in the same way as N_i but with u_i replaced by b_i, the
random effect at level 2. Also, the individuals in each cluster were partitioned into two
individual-level strata such that if exp(1.6 + 0.1 y_ij + ε_ij) > 5.73, the individual was
assigned to stratum 1; otherwise it was assigned to stratum 2, where
ε_ij ~ Gamma(1, 0.16). Simple random samples of size 0.5 n_i were selected from each of
the two strata. The variable M_i was used in place of the variable N_i (in Eq.
6), and stratification at the individual level was performed, in order to simulate a sampling
process that is informative at both levels. It is important to point out that for an
experiment that is informative only at level 2 (cluster level), stratification at level 1
(individual level) is not required. If instead we desire a process that is not informative,
we need to use N_i = 350 exp(u_i) in place of M_i (Eq. 6) for calculating the probability
of selection of each cluster, and stratification at the individual level is again not required.
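The finite-population generation just described can be sketched as follows. This is a simplified numpy illustration (not the dissertation's SAS code): clipping stands in for the truncated normal, the seed is arbitrary, and only one cluster's responses are drawn.

```python
import numpy as np

rng = np.random.default_rng(1)
M, beta0, sigma_b = 300, -4.4631, 0.9888

# Level-2 random effects and cluster sizes (truncated lognormal multiplier;
# clipping is a simple stand-in for truncation here).
b = rng.normal(0.0, sigma_b, M)
u = np.clip(rng.normal(0.0, sigma_b, M), -0.1 * sigma_b, 0.3 * sigma_b)
N = np.round(350 * np.exp(u)).astype(int)          # cluster sizes, roughly 317..471

# Stratify clusters on an independent design variable
x = rng.normal(0.0, 1.0, M)
stratum = np.where(np.abs(x) > 1, 1, 2)

# Informative PPS: the measure of size depends on the random effect b_i
m = 6
Msize = np.clip(350 * np.exp(b),
                350 * np.exp(-0.1 * sigma_b), 350 * np.exp(0.3 * sigma_b))
pi = m * Msize / Msize.sum()                       # cluster selection probabilities
w_cluster = 1.0 / pi                               # raw level-2 weights

# Bernoulli responses within one cluster (illustration for cluster 0)
p0 = 1.0 / (1.0 + np.exp(-(beta0 + b[0])))
y0 = rng.binomial(1, p0, N[0])
```

Because `Msize` is built from b_i, clusters with larger random effects (and hence higher prevalence) are oversampled, which is exactly what makes the design informative at level 2.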

Our simulation represents a situation where the design variables are strongly
associated with the responses ()B ), which is common, since sampling probabilities
depend on observed design variables. For example, in a field survey for detection and
estimation of genetically modified corn plants, sampling fields at stage 1 could be
stratified by irrigation (yes/no) or producer type (small or commercial), while sampling
plants at stage 2, strata could be based on plant or soil characteristics (e.g., moisture
levels, spatial heterogeneity, fertility levels, etc.) which would correlate with the plant
level residuals.

To gain a clear understanding of the role of weighting methods in the accuracy of the
results, six estimation methods were used for each simulated data set: (1) unweighted
maximum likelihood, (2) PML using raw weights at the cluster level, (3) PML using raw
weights at both levels, (4) PML using raw weights at the cluster level and scaling method
A at the individual level, (5) PML using raw weights at the cluster level and scaling
method B at the individual level, and (6) PML using scaling method D, which uses weights
only at the cluster level.

The two-stage sampling design for the finite population described above proceeded
as follows: 100 plants were selected from each cluster using simple random sampling (50
from stratum 1 and 50 from stratum 2), and we used three different cluster samples: 6 (2
from stratum 1 and 4 from stratum 2), 12 (4 from stratum 1 and 8 from stratum 2) and 24
(8 from stratum 1 and 16 from stratum 2). We also compared three sample sizes at the
individual level (40, 80 and 120 plants per cluster) with 24 clusters (8 from stratum 1 and 16
from stratum 2). In addition, we compared the estimates obtained using GLIMMIX and
NLMIXED of SAS 9.3 for 100 plants per cluster and 24 clusters. For each combination
of level 1 (plants) and level 2 (fields or clusters) samples, we simulated 600 data sets and
estimated parameters using the proposed weighting methods. We observed that the
sampling fraction at the cluster level was 0.25 for stratum 1 and 0.75 for stratum 2.
Computations were mostly performed in NLMIXED of SAS 9.3.

3.7 Results

The results of the simulations with and without covariates are given in Tables 3.1 to 3.5,
where the true values of the parameters are β₀ = −4.4631 and σ_b = 0.9888. We
report the mean and standard deviation of the parameter estimates resulting from the
600 simulations.

For a sample of six clusters, we can see in Table 3.1 that ignoring the weights at both
levels (method 1) produces considerable overestimation of the β₀ parameter and
underestimation of the second-level standard deviation (σ_b). Method 3, using raw weights
at both levels, underestimated the fixed parameter β₀ and significantly overestimated the
second-level standard deviation (σ_b). However, using only raw weights at the cluster
level and not weighting at the individual level (method 2) overestimated β₀ and
underestimated σ_b, but to a lesser degree than ignoring both sets of weights (method 1).
Scaling the weights produces better results than method 1 (ignoring the weights) and
method 3 (raw weights at both levels). Method 2 and the three scaled methods 4, 5 and 6
produce the least biased results of all methods: the estimates of β₀ were very close to
the true value even with a small sample of clusters, although σ_b was still underestimated. In
general, under the six methods studied, group testing with pool sizes of 10 and
5 produced results almost identical to individual testing.

In Table 3.2, with m = 12, the behavior of methods 1 and 3 is very similar to that
observed in Table 3.1, with only a slight improvement in the estimated second-level
standard deviation. Method 2 now produced estimates of the fixed parameter β₀ that are
very close to the true value; however, σ_b is slightly underestimated
with pooled data. Again, method 3 produced the worst results, with a large
underestimation of β₀ and an overestimation of σ_b. Methods 4 and 5 produced a
slight underestimation of β₀ and σ_b, but the estimate of the second-level standard
deviation is better than when using 6 clusters (Table 3.1). Method 6 also produced a slight
overestimation of β₀ and a slight underestimation of σ_b. Here it is clear that the scaled
methods produced the least biased results and that the group testing method
produced results similar to those of individual testing.

In Table 3.3, we observe behavior similar to Tables 3.1 and 3.2 for the six
weighting methods. However, due to the larger sample of clusters, the results were
generally less biased, which was expected because the cluster sample increased by 24
compared to Table 3.2. As with 12 clusters (Table 3.2), the best results with 36 clusters
(Table 3.3) were produced by the scaled methods 4, 5 and 6. Method 4 is
the best, but not by much compared to methods 5 and 6. Here also, pooled samples
produced results nearly identical to individual testing. With a sample of 48 clusters,
there was no clear improvement in the parameter estimates compared to a sample of
36 clusters (Table 3.3; data not shown). Also, in this table we can observe that the
ratio of the standard deviation of the parameter estimates in the simulation to the average
standard error converges to 1 as the sample size increases, which indicates that the sandwich
estimator is correct. Based on these results, scaled weights decrease the bias in the
estimation of both parameters compared to no scaling; however, even when the weights
are scaled, the results remain biased, though to a much lesser degree. Also, the group
testing method produces results as precise as those of individual testing, but with the
advantage that we can considerably reduce the number of required diagnostic tests.

Tables 3.1-3.3 illustrate that the behavior of the parameter estimates
(β₀ and σ_b) improves with an increasing number of clusters, as expected. Now, for
a fixed sample of clusters (Table 3.4, m = 24), we study the behavior of the parameter
estimates (β₀ and σ_b) for three different sample sizes at the individual level (n_i). We can
see that both parameters are considerably biased, even with individual testing, when the
number of individuals per cluster was n_i = 40. A considerable improvement
occurs when the number of individuals per cluster is n_i = 80. In this scenario the
estimate of β₀ is close to the true value, but the estimate of the second-level standard
deviation is still biased with group testing, and the problem is even worse with a pool size
of s = 10. Finally, using 120 individuals per cluster produces less biased results than using
40 or 80, but the second-level standard deviation is still underestimated when using group
testing.

SAS procedures NLMIXED and GLIMMIX can both be used to perform the analysis. In
Table 3.5, we compare the results of these two procedures for incorporating weights into
the pseudo-likelihood for weighting methods 3, 4 and 5. Weights at the individual level are
not required for methods 1, 2 and 6, so these methods were not compared. In Table 3.5
we can see that the estimates of the fixed effects are almost identical in both procedures,
with slightly better results in GLIMMIX; however, the estimate of the variance
component is a little better using NLMIXED. In general terms, both procedures
perform well.

In Table 3.6, we compare the estimates under four types of sampling: informative at both
levels, informative at the field (cluster) level, informative at the individual level, and non-
informative. The goal of this comparison is to make clear when the inclusion of weights
at both levels is required and when it is not.

We can see that when the sampling process is informative at both levels, method 1
produces a serious overestimation of the β₀ parameter and an underestimation of the
second-level standard deviation. Including raw weights at both levels (method 3)
underestimated β₀ and overestimated σ_b. Including only the raw cluster weights at the
cluster level improves the estimates, but they still have considerable bias. However,
including scaled weights at the individual level and raw weights at the cluster level produced
results that are less biased and very close to the true values when the sampling process is
informative (at any level); method 4 produced the best results, though not very different
from those of method 5 (Table 3.6).

When the design is informative only at the cluster level, method 1 produced highly
biased results (serious overestimation of β₀ and underestimation of σ_b). Using the raw
weights at both levels is not recommended because the estimates are still highly biased
(Table 3.6). When only the raw cluster weights (method 2) are included, there is still
considerable bias, while using scaled weights at the individual level and raw weights at the
cluster level (methods 4 and 5) produced almost identical results with small bias. This
implies that scaled weights are preferred.

When the design is informative only at the individual level, using the raw weights is not a
good choice since the results are still highly biased; however, methods 1, 2, 4, 5 and 6
produce results with small bias (Table 3.6). This can be attributed to the way the
informative sampling process was induced at each level. On the subject of
regression for binary responses in two-stage sampling, Grilli and Pratesi (2004) reported
that when the sampling process is informative at the individual level, it is important to
incorporate scaled weights; however, in their simulation they used β₀ = 0 and a second-
level standard deviation of 0.632 with the probit link.

Finally, when the sampling process is not informative, from Table 3.6 we see that
method 3 (raw weights at both levels) again underestimates β₀ and overestimates σ_b.
However, methods 1, 2, 4, 5 and 6 produced estimates of both parameters very close to
the true values of β₀ and σ_b, which corroborates that when the sampling process is not
informative the weights are not required. In general, method 4 produced the best results.

3.7.1 Simulation with a covariate at the individual level


Results so far have been based on a model without covariates. In this section we performed
additional simulations to study the performance of the model with a covariate at the
individual level. The model is the same as model (1) described in section 6, except that
a covariate at the individual level was included, generated from a normal
distribution with mean zero and variance 0.64; the values β₀ = −4.7598, β₁ =
0.8290 and σ_b = 0.9820 were used (code for the analysis is given in Appendix 3.A).
We can see in Table 3.7 that the use of scaled weights (methods 4, 5 and 6) is effective
in removing the bias due to informative sampling. However, it is important to point out
that the results are somewhat more biased than in the no-covariate case. In general, the
overall performance of the scaled weights is satisfactory.

3.8 Application

To illustrate the implementation of the analysis using NLMIXED of SAS 9.3, we present
data simulated for a finite population of eight clusters within two regions (strata), with
two substrata in each field (which could be the levels of fertility (FL) in the field). Although
on a smaller scale, these population data represent a typical population in which PPS and
stratified sampling can be used for selecting the study units (Table 3.8). This population has
a total of 163 plants distributed in 16 subgroups derived by combining region, field and FL
levels. The total number of plants per region, field and subgroup is also given in Table
3.8.

Suppose that a stratified sample of two fields within each region is selected and, within
each field, a stratified sample with a fixed size of six plants per stratum is taken
(π_i = 2N_i/N, π_j|is = 6/N_is), where N = Σ_{i=1}^{M} N_i. Table 3.9 shows the sample
that resulted from this sampling procedure. Note that the total sample size is equal
to 48 plants, obtained by multiplying 2 regions × 2 fields × 2 FL × 6 plants. First
we explain how the field raw weights are calculated. From Table 3.8 we can see that
the total number of plants in region 1 is N = 84, while the total numbers of plants in
clusters 2 and 3 are N₂ = 18 and N₃ = 21, respectively. Therefore, the selection
probabilities are π₂ = 2(18)/84 = 0.4286 and π₃ = 2(21)/84 = 0.5, and the corresponding
sampling weights are w₂ = 1/0.4286 = 2.33 and w₃ = 1/0.5 = 2.0. The remaining weights
for the other fields are calculated in a similar manner.
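These field-weight calculations can be reproduced in a few lines (values taken from Table 3.8; a sketch, not part of the dissertation's SAS workflow):

```python
# Region 1 totals from Table 3.8: 84 plants; clusters 2 and 3 have 18 and 21.
N_region, n_sampled = 84, 2

weights = {}
for field, N_i in ((2, 18), (3, 21)):
    pi_i = n_sampled * N_i / N_region   # PPS inclusion probability
    weights[field] = 1 / pi_i           # raw field (cluster) weight

# weights[2] = 2.33 and weights[3] = 2.0, matching the worked example
```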

The conditional raw weights for each plant in the first row of Table 3.9,
corresponding to region 1, field 2 and FL = 1, are calculated as follows. Here N₁₂₁, the
total number of plants in the combination of region 1, field 2 and FL 1, is equal to 10
(see Table 3.8). Therefore, π_j|121 = 6/10 = 0.6 and w_j|121 = 1/0.6 = 1.67. For the
conditional weight of the second row in Table 3.9, N₁₂₂ = 8 (corresponding to field 2,
region 1 and FL 2). Therefore, π_j|122 = 6/8 = 0.75 and w_j|122 = 1/0.75 = 1.33. To verify
that the calculations of the raw weights are correct, it is important to calculate the raw
unconditional weight of each individual, which is the product of the raw cluster weight
and the conditional raw weight (w_ij = w_i × w_j|i); the sum of the
individual total weights should be exactly (or very close to) the total number of individuals
in the population. In this example, each calculated weight given in Table 3.9 applies to six
individuals in the sample, since the six plants that appear in each row of Table 3.9 have
the same weight. This means that for the sum of the unconditional raw weights
(w_ij) to be 163 (the total number of individuals in the population given in Table
3.8), the unconditional raw weight for each row in Table 3.9 should be multiplied by 6
(last column of Table 3.9), since our total sample size is 48 individuals. Therefore, by
summing the last column of Table 3.9, we verified that the sum is exactly
163. For more details on calculating raw weights, the reader should consult Lohr (2010),
since the calculation of these weights is done using conventional methods.
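As a quick check of the product rule w_ij = w_i × w_j|i, the field 2 numbers can be verified in code (a toy sketch; only field 2 is shown, so the grand total of 163 over all four sampled fields is not reproduced here):

```python
# Field 2 (region 1): raw cluster weight and conditional plant weights.
w_i = 1 / (2 * 18 / 84)                 # = 2.33, field-level weight
w_cond = [10 / 6] * 6 + [8 / 6] * 6     # 6 plants per FL stratum: 1.67 and 1.33

# Unconditional weights: product of cluster and conditional weights.
w_uncond = [w_i * w for w in w_cond]

# Within a field, the unconditional weights sum to w_i * N_i
# (here 2.33 * 18 = 42); over the whole sample the sum is 163.
total_field2 = sum(w_uncond)
```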

Since raw conditional weights are not the best option, as was previously observed,
we now show how to scale the weights. For scaling method A (aw_j|is), we first
obtain the average of the raw conditional weights (w_j|is) in each cluster; then we divide
each conditional weight by this average. For example, for field 2, the average conditional
raw weight is equal to (6 × 1.67 + 6 × 1.33)/12 = 1.5 (calculated as a weighted average,
since the 6 elements of each stratum in the cluster have the same weight). Therefore, the
scaled weight using this method for the first conditional weight (row 1 in Table 3.9) is
aw_j|121 = 1.67/1.5 = 1.11. The corresponding conditional scaled A weight for the second
row is aw_j|122 = 1.33/1.5 = 0.89. The conditional scaled A weights for the other
observations in the sample are obtained in exactly the same way. One way to check that
these scaled conditional A weights are correct is that the sum of the scaled conditional A
weights in each cluster must equal the obtained sample size in that cluster (in
this case 12; for field 2 this is [6(1.11) + 6(0.89)] = 12), and the sum of all
the scaled weights must equal the total sample size (in this case, 48).
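The method A computation can be written out directly (field 2 values; an illustrative sketch):

```python
# Scaling method A for field 2: divide each raw conditional weight by the
# (weighted) average of the conditional weights in the cluster.
w_cond = [10 / 6] * 6 + [8 / 6] * 6    # raw conditional weights: 1.67 and 1.33
avg = sum(w_cond) / len(w_cond)        # 18 / 12 = 1.5
aw = [w / avg for w in w_cond]         # scaled A weights: 1.11 and 0.89
```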

To obtain the conditional scaled B weights, we first calculate the sum of all the
conditional raw weights in each cluster (Σ_j w_j|is), then obtain
λ_i = Σ_j w²_j|is / Σ_j w_j|is and, finally, the scaled B weights as
bw_j|is = w_j|is / λ_i. For cluster two, Σ_j w_j|2s = 6 × 1.67 +
6 × 1.33 = 18 and Σ_j w²_j|2s = 6 × 1.67² + 6 × 1.33² = 27.35, so λ₂ = 27.35/18 =
1.52. Therefore, bw_j|121 = 1.67/1.52 = 1.10, while bw_j|122 = 1.33/1.52 = 0.88. Finally, the scaled D
weights are obtained as w_i^D = w_i Σ_j w_j|is. For field 2, this is equal to w₂^D = 6 ×
1.67 × 2.33 + 6 × 1.33 × 2.33 = 42. Scaling method D is calculated in the same way for
the other clusters.
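Methods B and D can likewise be checked in a few lines (field 2 values, mirroring the worked example; the exact sum of squares is 27.33, which the text rounds to 27.35 after rounding the weights, and λ₂ comes out as 1.52 either way):

```python
# Scaling methods B and D for field 2 (raw conditional weights 1.67 and 1.33,
# raw cluster weight 2.33).
w_i = 84 / 36                                    # raw cluster weight = 2.33
w_cond = [10 / 6] * 6 + [8 / 6] * 6

# Method B: divide each conditional weight by lambda_i = sum(w^2) / sum(w)
lam = sum(w * w for w in w_cond) / sum(w_cond)   # = 1.52
bw = [w / lam for w in w_cond]                   # = 1.10 and 0.88

# Method D: a single cluster-level weight, w_i times the sum of conditional weights
dw_cluster = w_i * sum(w_cond)                   # 2.33 * 18 = 42
```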

Now we have the complete sample (48 plants) and its corresponding weights. Since
we will use group testing to classify the plants (positive or negative), suppose that we
form pools of size 3 at random in each cluster. For simplicity, assume that from each
row (containing 6 plants) in Table 3.9 we form two pools: the first three plants go to pool 1,
the second three to pool 2, and so on. Since we are forming the pools with elements of the
subgroup that resulted from the combination of region, field and FL, we will get exactly
the same estimates whether we use the average of the three weights that form each pool or
the individual weights. However, since pools of size s can be formed at random with the
elements of each cluster, the weights in each pool are not always the
same. Therefore, in Table 3.10 we present the data in terms of pools, with
the average weights for each method. Here, of course, we assume that the
diagnostic test used to classify each pool is perfect (S_e = S_p = 1). In Table 3.10 we show
how to arrange the data resulting from any two-stage stratified cluster survey for analysis
using group testing.
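The random pooling step can be sketched as follows (toy statuses and weights, not the Table 3.9 sample; with a perfect test a pool is positive exactly when at least one member is positive):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cluster sample: 12 plants with 0/1 transgenic status and scaled weights.
status  = np.array([0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0])
weights = np.array([1.11] * 6 + [0.89] * 6)

# Form pools of size s = 3 at random within the cluster.
s = 3
perm = rng.permutation(len(status))
pools = perm.reshape(-1, s)                 # 4 pools of 3 plant indices
y_pool = status[pools].max(axis=1)          # pool responses (perfect test)
w_pool = weights[pools].mean(axis=1)        # average weight per pool
```

Because pooling is random within the cluster, the three plants in a pool may carry different weights, which is why the analysis uses the average weight per pool.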

For analysis, the data should be prepared as in Table 3.10. That is, we need a column
for region, a column for field (cluster), a column for FL, a column for the pool number
(from 1 to 16 in this case), a column for the binary response of each pool (yp), a column
for the level 1 raw conditional weight by pool (wijp), a column for the level 2 raw weight
(wip) and three more columns for the scaled weights in terms of pools (awp, bwp and
dwp). All different weights given in Table 3.10 are in terms of pools because the average
of the three weights is used. Note that the finite population contains 8 clusters (fields 1 to
8), but in this sample only 4 out of the 8 were selected: 2, 3, 5 and 6. We are interested in
estimating the marginal probability of a particular transgene being present in the whole
finite population and in the probability of this transgene being present in each cluster.

Table 3.11 gives the SAS NLMIXED code to perform the analysis with the
information in Table 3.10. Note we form a data set in SAS using the names of the input
variables as in the columns in Table 3.10 (Region, field, FL, pool, yp, wijp, wip, awp,
bwp and dwp).

The relevant output of this code is shown in two tables. The first table is called
Parameter Estimates (β̂₀ = −2.5057, denoted as the b_0 estimate, and σ̂_b =
0.6230, denoted as the sd estimate). Therefore, the expected proportion of transgenic corn for
the average field, π(p_i | b_i = 0), is π̂ = 1/(1 + exp(−β̂₀)) = 1/(1 + exp(2.5057)) = 0.0755. This
means that the estimated probability of finding transgenic plants in the whole population
is 7.55%. According to Breslow and Clayton (1993), we can approximate the marginal
estimate of the proportion of transgenic plants in the entire population as
π̂ = 1/(1 + exp(−β̂₀ / √(1 + 0.346 σ̂_b²))) = 0.0869. This means that
the estimated marginal proportion of transgenic plants in the whole population is 8.69%. The
second table, called Blups per field, contains the predicted proportions for each field.
The predicted values are 0.112327 for cluster 2; 0.08450 for cluster 3; 0.05999 for cluster
5; and 0.05863 for cluster 6. Since these are Blups, they should be interpreted as the
predicted probability that a particular transgene is present in each field. We can see that
field 2 has the highest probability of transgene presence (0.112327) and
field 6 the lowest (0.05863). The program in Table 3.11 runs these
results using weighting method 5, since we put the scaled B weights (bwp) in the
augmented log-likelihood and the raw cluster weights (wip) in the replicate
statement. However, to run the analysis with a different weighting method
given in section 7, simply replace bwp with the appropriate weight. For example, with
method 3 (raw weights at both levels) you need to keep wip and replace bwp by wijp.
One limitation of the code in Table 3.11 is that it does not take into account any
covariates (see Appendix 2.A for SAS code with covariates).
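The two proportion calculations above can be reproduced directly; the constant 0.346 is the squared attenuation factor (16√3/(15π))² from the Breslow-Clayton logit approximation (a sketch of the arithmetic, not SAS output):

```python
import math

b0_hat, sd_hat = -2.5057, 0.6230   # NLMIXED estimates from the application

# Conditional proportion for the average field (b_i = 0): inverse logit
p_cond = 1 / (1 + math.exp(-b0_hat))

# Breslow-Clayton (1993) approximation to the marginal proportion:
# shrink the intercept by sqrt(1 + 0.346 * sigma_b^2) before the inverse logit
p_marg = 1 / (1 + math.exp(-b0_hat / math.sqrt(1 + 0.346 * sd_hat ** 2)))

# p_cond is about 0.0755 and p_marg about 0.0869, matching the text
```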

3.9 Conclusions

In this paper, we present a generalization of the mixed regression group testing
methodology to a complex two-stage survey with stratification and clusters of
different sizes, when the sampling process is informative. The estimation process was
performed using the average weights per pool for simplicity, which implies that the pools
should be randomly formed inside each cluster. Our results are in line with those reported
by Pfeffermann et al. (1998, 2006) for a normal response, and by Grilli and Pratesi (2004)
and Rabe-Hesketh and Skrondal (2006) for binary outcomes in the non-group testing
context. We found that when the sampling process is informative, weights at both levels
should be included. However, scaled weights are needed, because using the raw
weights produces more bias than ignoring the weights altogether. Also, it is important to
point out that if the sampling process is not informative, the weights at both levels should
be ignored and the analysis can be performed using any of the previously developed
packages for mixed group testing regression models. The NLMIXED and
GLIMMIX code given in this paper allows running the analysis with the six proposed
weighting methods. From a practical point of view, if you get very similar results when
ignoring the weights and when using the three scaled weighting methods (4, 5 and 6), you should
choose method 1 (ignoring the weights), because this indicates that your sampling process is
not informative.

Also, it is important to stress that when covariates are not included in the linear
predictor, the results using group testing (with pool sizes of 5 and 10) are almost
the same as those using individual testing. This means that in this application the group
testing regression method is as precise as individual regression, implying that
group testing can be a useful approach for conducting complex surveys with small pool
sizes (≤ 10) when the pools are formed within each cluster. Also, including covariates at the
individual level produced results very similar to those obtained without pooling (Table
3.7). However, more simulations need to be performed to see how well this methodology
works with larger pool sizes and more covariates.

Although individual weights can be included, this is not relevant when pools are
formed with members of the same cluster, since the two approaches produce the same results.
We always recommend forming pools with members from the same cluster, since the
log-likelihood function needs the information per cluster. For this reason, a correct
analysis with group testing under a two-stage informative sampling process requires the
pools, their corresponding outcomes (positive or negative) and the raw weights at both
levels (one cluster weight and the conditional weights of the individuals forming a pool).
The raw weights at the individual level then need to be scaled to produce weighting
methods 4, 5 and 6, and finally the NLMIXED or GLIMMIX code given in Table 3.11
and Appendix 2.B can be used to perform the analysis. The resulting output produces an
estimate of β₀ that can be used to estimate the marginal proportion of the characteristic
of interest (as shown in the application). The code also produces Blups (predicted
proportions), allowing researchers to obtain estimates not only for the whole population
but for each cluster as well. Finally, the methodology developed here can be used to
estimate any binary response under a complex informative sampling process. The overall
utility of our estimation approach is that it can save considerable resources when
group testing is used in conjunction with complex sampling designs.

3.10 References

Bilder, C.R. (2009). Human or Cylon? Group testing on Battlestar Galactica. Chance,
22(3):46-50.

Binder, D.A. (1981). On the Variances of Asymptotically Normal Estimators from


Complex Surveys. Survey Methodology, 7:157–170.

Binder, D.A. (1983). On the Variances of Asymptotically Normal Estimators from


Complex Surveys. International Statistical Review, 51:279–292.

Breslow, N.E., Clayton, D.G. (1993). Approximate Inference in Generalized Linear


Mixed Models. Journal of the American Statistical Association, 88 (421): 9–25.

Cardoso, M., Koerner, K., Kubanek, B. (1998). Mini-pool screening by nucleic acid
testing for hepatitis B virus, hepatitis C virus, and HIV: preliminary results.
Transfusion, 38:905–907.

Chen, P., Tebbs, J., and Bilder, C. (2009). Group Testing Regression Models with Fixed
and Random Effects. Biometrics, 65:1270-1278.

Clogg, C.C., and Eliason, S.C. (1987). Some common problems in log-linear analysis.
Sociological Methods and Research, 16:8-44.

Delaigle, A., and Hall, P. (2012). Nonparametric regression with homogeneous group
testing data. Annals of Statistics, 40:131-158.

Delaigle, A., and Meister, A. (2011). Nonparametric Regression Analysis for Group
Testing Data. Journal of the American Statistical Association, 106:640-650.

Dorfman, R. (1943). The Detection of Defective Members of Large Populations. The


Annals of Mathematical Statistics, 14:436-440.

Farrington, C.P. (1992). Estimating prevalence by group testing using generalized linear
model. Statistics in Medicine, 11:1591-1597.

Graubard, B., Korn, E. (1996). Modeling the sampling design in the analysis of health
surveys. Statist. Meth. Med. Res., 5:263–281.

Grilli, L., Pratesi, M. (2004). Weighted estimation in multi-level ordinal and binary
models in the presence of informative sampling designs. Survey Methodology,
30:93–103.

Grizzle, J.E., Starmer, C.F., and Koch, G.G. (1969). Analysis of Categorical Data by
Linear Models. Biometrics, 25:489–504.

Heeringa, S., West, B.T, and Berglund, P.A. (2010). Applied survey data analysis. Boca
Raton: Taylor & Francis.

Kacena, K., Quinn, S., Howell, M., Madico, G., Quinn, T., Gaydos, C. (1998). Pooling
urine samples for ligase chain reaction screening for genital Chlamydia trachomatis
infection in asymptomatic women. Journal of Clinical Microbiology, 36:481–485.

Kasprzyk, D., Duncan, G.J., Kalton, G. and Singh, M.P. (eds.) (1989). Panel Surveys.
New York: John Wiley & Sons, Inc.

Korn, E., Graubard, B. (2003). Estimating the variance components by using survey data.
J. Roy. Statist. Soc. Ser. B65 (Part 1), 175–190.

Landis, J.R., Stanish, W.M., Freeman, J.L., and Koch, G.G. (1976). A Computer Program
for the Generalized Chi-Square Analysis of Categorical Data Using Weighted Least
Squares (GENCAT). Computer Programs in Biomedicine, 6:196–231.

Lohr, S. L. (2010). Sampling: design and analysis. Thomson Brooks/Cole.

McCulloch, C.E. (1997). Maximum likelihood algorithms for generalized linear mixed
models. Journal of the American Statistical Association, 92(437):162-170.

McMahan, C., Tebbs, J., and Bilder, C. (2013). Regression models for group testing data
with pool dilution effects. Biostatistics, 14(2):284-298.

Morel, G. (1989). Logistic Regression under Complex Survey Designs. Survey


Methodology, 15:203–223.

Muñoz-Zanzi, C. A., Johnson, W. O., Thurmond, M. C., & Hietala, S. K. (2000). Pooled-
sample testing as a herd-screening tool for detection of bovine viral diarrhea virus
persistently infected cattle. Journal of veterinary diagnostic investigation, 12(3):
195-203.

Pawitan, Y. (2001). In All Likelihood: Statistical Modelling and Inference Using


Likelihood. Oxford University Press, New York.

Pfeffermann, D. and Sverchkov, M. (2009). Inference under informative sampling. In:


Handbook of Statistics 29B; Sample Surveys: Inference and Analysis. D.
Pfeffermann and C.R. Rao (eds.). North Holland. pp. 455-487.

Pfeffermann, D. (1993). The Role of Sampling Weights when Modeling Survey Data.
International Statistical Review, 61:317-337.

Pfeffermann, D., Skinner, C.J., Holmes, D.J., Goldstein, H., Rasbash, J. (1998).
Weighting for unequal selection probabilities in multi-level models. J. Roy. Statist.
Soc., Ser. B, 60:23–56.

Pfeffermann, D. and Sverchkov, M. (2007). Small area estimation under informative


probability sampling of areas and within the selected areas. Journal of the American
Statistical Association, 102, (480):1427-1439.

Pfeffermann, D., Da Silva Moura, F.A., and Do Nascimento Silva, P.L. (2006). Multi-
level modelling under informative sampling. Biometrika, 93(4):943-959.

Pinheiro, J.C. and Bates, D.M. (2000). Mixed-Effects Models in S and SPLUS. New
York, USA: Springer.

Piñeyro-Nelson, A., van Heerwaarden, J., Perales, H.R., Serratos-Hernández, J.A.,


Rangel, A. et al. (2009). Transgenes in Mexican maize: molecular evidence and
methodological considerations for GMO detection in landrace populations. Mol.
Ecol. 18:750-761.

Rabe-Hesketh, S. and Skrondal, A. (2006). Multilevel modelling of complex survey data. Journal of the Royal Statistical Society, Series A, 169:805–827.

Roberts, G., Rao, J.N.K., and Kumar, S. (1987). Logistic Regression Analysis of Sample
Survey Data. Biometrika, 74:1–12.

SAS Institute. (2011). SAS 9.3 Output Delivery System: User's Guide. SAS Institute.

Skinner, C.J., Holt, D., and Smith, T.M.F. (1989). Analysis of Complex Surveys. New
York: John Wiley & Sons, Inc.

Tebbs, J. and Bilder, C. (2004). Confidence interval procedures for the probability of disease transmission in multiple-vector-transfer designs. Journal of Agricultural, Biological, and Environmental Statistics, 9(1):79–90.

Vansteelandt, S., Goetghebeur, E., and Verstraeten, T. (2000). Regression models for
disease prevalence with diagnostic tests on pools of serum samples. Biometrics,
56:1126–1133.

Verstraeten, T., Farah, B., Duchateau, L., and Matu, R. (1998). Pooling sera to reduce the cost of HIV surveillance: a feasibility study in a rural Kenyan district. Tropical Medicine and International Health, 3:747–750.

Wolf, J. (1985). Born-again group testing: multiaccess communications. IEEE Transactions on Information Theory, 31(2):185–191.

Xie, M. (2001). Regression analysis of group testing samples. Statistics in Medicine, 20:1957–1969.

Zhang, Z., Liu, A., Lyles, R.H., and Mukherjee, B. (2012). Logistic regression analysis of
biomarker data subject to pooling and dichotomization. Statist. Med., 31:2473–
2484.

Table 3.1. Simulation means and standard deviations using 6 clusters.

                                    Weighting method
 s   Parameter       1        2        3        4        5        6
 1   β0 Mean     -3.3659  -4.4265  -5.1901  -4.5078  -4.4936  -4.4328
     σ Mean       0.8846   0.9926   1.699    0.9582   0.9433   0.9984
     β0 Std       0.5633   1.2444   2.1899   1.1884   1.1731   1.3031
     σ Std        0.4698   0.807    1.5986   0.7563   0.7467   0.9163
     β0 Std/SE    1.1168   1.6880   1.8224   1.6691   1.6605   1.7112
     σ Std/SE     1.2021   1.4907   1.7402   1.4736   1.4562   1.5048
     Nsim            600      600      600      600      600      600
 5   β0 Mean     -3.402   -4.425   -5.177   -4.509   -4.4921  -4.4224
     σ Mean       0.8135   0.9191   1.6532   0.8823   0.8608   0.9047
     β0 Std       0.5481   1.2187   2.1382   1.165    1.1383   1.2424
     σ Std        0.4608   0.8063   1.5639   0.7627   0.7291   0.8485
     β0 Std/SE    1.1028   1.6955   1.7895   1.6741   1.6686   1.7319
     σ Std/SE     1.2685   1.4268   1.7374   1.3828   1.3609   1.4755
     Nsim            600      600      599      600      600      600
10   β0 Mean     -3.4497  -4.4248  -5.1949  -4.5151  -4.5008  -4.4228
     σ Mean       0.7364   0.8143   1.6337   0.7928   0.7731   0.799
     β0 Std       0.5303   1.18     2.1393   1.1333   1.1184   1.2117
     σ Std        0.4875   0.7923   1.58     0.7512   0.7437   0.8547
     β0 Std/SE    1.1168   1.7119   1.8122   1.6731   1.6661   1.7431
     σ Std/SE     1.2021   1.2971   1.7512   1.2369   1.2077   1.3333
     Nsim            600      600      600      600      600      600

Simulation means and standard deviations (std) of point estimators of the intercept (β0 = −4.4631 true value) and the second-level standard deviation (σ = 0.9944 true value). Cluster sample m = 6 (2 from stratum 1 and 4 from stratum 2) under PPS. Elementary unit sample size = 100 (50 from stratum 1 and 50 from stratum 2) under SRS. Pool size (s). Nsim denotes the number of simulations performed. Std/SE denotes the ratio of the standard deviation of parameter estimates in the simulation to the average standard errors. Method 1 unweighted maximum likelihood. Method 2 PML using raw weights at the cluster level, Method 3 PML using raw weights at both levels, Method 4 PML using raw weights at the cluster level and scaling method A at the individual level, Method 5 PML using raw weights at the cluster level and scaling method B, and Method 6 PML using method D with weights at the cluster level.

Table 3.2. Simulation means and standard deviations using 12 clusters.

                                    Weighting method
 s   Parameter       1        2        3        4        5        6
 1   β0 Mean     -3.3396  -4.4148  -5.0757  -4.5075  -4.4932  -4.4008
     σ Mean       0.9929   1.0302   1.6524   1.0054   0.9916   1.0152
     β0 Std       0.3525   0.8888   1.5492   0.8582   0.8457   0.8872
     σ Std        0.2851   0.5714   1.1885   0.5441   0.5351   0.5723
     β0 Std/SE    0.9838   1.4901   1.5058   1.4585   1.4528   1.5201
     σ Std/SE     1.0666   1.3963   1.4932   1.3605   1.3516   1.4223
     Nsim            600      600      599      600      600      600
 5   β0 Mean     -3.3742  -4.4134  -5.0789  -4.5078  -4.4933  -4.4016
     σ Mean       0.9114   0.9601   1.6253   0.932    0.9175   0.9443
     β0 Std       0.3411   0.8829   1.5378   0.8517   0.8391   0.8963
     σ Std        0.2871   0.5961   1.1837   0.5711   0.5614   0.6312
     β0 Std/SE    0.9861   1.4910   1.4177   1.4593   1.4538   1.5279
     σ Std/SE     1.0618   1.3665   1.4711   1.3342   1.3260   1.4059
     Nsim            600      600      599      600      600      600
10   β0 Mean     -3.418   -4.4114  -5.0783  -4.5095  -4.4946  -4.3994
     σ Mean       0.829    0.8666   1.5934   0.8449   0.8276   0.8433
     β0 Std       0.3324   0.8724   1.5154   0.8412   0.8284   0.8867
     σ Std        0.318    0.6256   1.1597   0.5945   0.5853   0.6668
     β0 Std/SE    0.9949   1.4812   1.4136   1.4482   1.4438   1.5155
     σ Std/SE     1.0726   1.2917   1.4705   1.2463   1.2341   1.3407
     Nsim            600      600      599      600      600      600

Simulation means and standard deviations (std) of point estimators of the intercept (β0 = −4.4631 true value) and the second-level standard deviation (σ = 0.9944 true value). Cluster sample m = 12 (4 from stratum 1 and 8 from stratum 2) under PPS. Elementary unit sample size = 100 (50 from stratum 1 and 50 from stratum 2) under SRS. Pool size (s). Nsim denotes the number of simulations performed. Std/SE denotes the ratio of the standard deviation of parameter estimates in the simulation to the average standard errors. Method 1 unweighted maximum likelihood. Method 2 PML using raw weights at the cluster level, Method 3 PML using raw weights at both levels, Method 4 PML using raw weights at the cluster level and scaling method A at the individual level, Method 5 PML using raw weights at the cluster level and scaling method B, and Method 6 PML using method D with weights at the cluster level.

Table 3.3. Simulation means and standard deviations using 36 clusters.

                                    Weighting method
 s   Parameter       1        2        3        4        5        6
 1   β0 Mean     -3.3529  -4.362   -4.8967  -4.46    -4.4484  -4.3549
     σ Mean       1.0187   1.0032   1.5317   0.983    0.9712   0.9929
     β0 Std       0.1709   0.4276   0.6816   0.4143   0.4093   0.4347
     σ Std        0.1349   0.2602   0.5198   0.2451   0.24     0.2635
     β0 Std/SE    0.8134   1.1188   1.1322   1.1152   1.1153   1.1377
     σ Std/SE     0.9048   1.0673   1.1126   1.0597   1.0568   1.0821
     Nsim            600      600      600      600      600      600
 5   β0 Mean     -3.3823  -4.3608  -4.8995  -4.4605  -4.4485  -4.3542
     σ Mean       0.9398   0.9545   1.5111   0.9312   0.9179   0.9452
     β0 Std       0.1656   0.4244   0.6779   0.4111   0.4061   0.4316
     σ Std        0.1408   0.2712   0.5207   0.2579   0.2531   0.2741
     β0 Std/SE    0.8170   1.1201   1.1327   1.1171   1.1169   1.1394
     σ Std/SE     0.9604   1.0831   1.1169   1.0832   1.0802   1.0951
     Nsim            600      600      600      600      600      600
10   β0 Mean     -3.428   -4.3632  -4.9066  -4.4665  -4.4538  -4.3563
     σ Mean       0.8637   0.8854   1.4927   0.8657   0.848    0.872
     β0 Std       0.1611   0.4192   0.6727   0.4059   0.4013   0.4269
     σ Std        0.156    0.2947   0.5199   0.278    0.2775   0.3066
     β0 Std/SE    0.8296   1.1194   1.1317   1.1148   1.1160   1.1417
     σ Std/SE     0.9836   1.1042   1.1176   1.0893   1.0955   1.1445
     Nsim            600      600      600      600      600      600

Simulation means and standard deviations (std) of point estimators of the intercept (β0 = −4.4631 true value) and the second-level standard deviation (σ = 0.9944 true value). Cluster sample m = 36 (12 from stratum 1 and 24 from stratum 2) under PPS. Elementary unit sample size = 100 (50 from stratum 1 and 50 from stratum 2) under SRS. Pool size (s). Nsim denotes the number of simulations performed. Std/SE denotes the ratio of the standard deviation of parameter estimates in the simulation to the average standard errors. Method 1 unweighted maximum likelihood. Method 2 PML using raw weights at the cluster level, Method 3 PML using raw weights at both levels, Method 4 PML using raw weights at the cluster level and scaling method A at the individual level, Method 5 PML using raw weights at the cluster level and scaling method B, and Method 6 PML using method D with weights at the cluster level.

Table 3.4. Comparison using three different elementary unit sample sizes (40, 80 and 120) selected under SRS.

                      Sample size 40             Sample size 80             Sample size 120
                    Weighting methods          Weighting methods          Weighting methods
 s   Parameter     4       5       6          4       5       6          4        5        6
 1   β0 Mean    -4.450  -4.429  -4.367     -4.483  -4.468  -4.388     -4.4727  -4.4614  -4.3645
     σ Mean      0.969   0.948   0.998      1.002   0.988   1.020      1.0093   0.9981   1.0159
     β0 Std      0.609   0.599   0.641      0.536   0.528   0.561      0.5301   0.5232   0.5413
     σ Std       0.390   0.385   0.417      0.333   0.326   0.353      0.3213   0.3144   0.339
     Nsim          600     600     600        600     600     600         600      600      600
 5   β0 Mean    -4.426  -4.403  -4.342     -4.477  -4.462  -4.381     -4.477   -4.466   -4.367
     σ Mean      0.841   0.809   0.880      0.933   0.916   0.951      0.9577   0.945    0.966
     β0 Std      0.602   0.592   0.637      0.530   0.522   0.557      0.5225   0.516    0.535
     σ Std       0.452   0.454   0.477      0.351   0.345   0.380      0.3307   0.324    0.352
     Nsim          600     600     600        600     600     600         600      600      600
10   β0 Mean    -4.413  -4.389  -4.322     -4.477  -4.461  -4.376     -4.483   -4.471   -4.371
     σ Mean      0.690   0.648   0.734      0.847   0.825   0.849      0.888    0.873    0.892
     β0 Std      0.591   0.579   0.627      0.526   0.517   0.554      0.517    0.511    0.529
     σ Std       0.531   0.527   0.548      0.394   0.390   0.442      0.359    0.353    0.385
     Nsim          600     600     600        600     600     600         600      600      600

Simulation means and standard deviations (std) of point estimators of the intercept (β0 = −4.4631 true value) and the second-level standard deviation (σ = 0.9944 true value). Cluster sample m = 24 (8 from stratum 1 and 16 from stratum 2) under PPS. Pool size (s). Nsim denotes the number of simulations performed. Method 4 PML using raw weights at the cluster level and scaling method A at the individual level, Method 5 PML using raw weights at the cluster level and scaling method B, and Method 6 PML using method D with weights at the cluster level.

Table 3.5. Comparison of NLMIXED vs GLIMMIX for weighting methods 3, 4 and 5.

                    Weighting method 3      Weighting method 4      Weighting method 5
 s   Parameter    NLMIXED   GLIMMIX       NLMIXED   GLIMMIX       NLMIXED   GLIMMIX
 1   β0 Mean      -4.9005   -4.9678       -4.4477   -4.486        -4.4352   -4.4738
     σ Mean        1.5425    1.5458        0.9843    0.9597        0.9707    0.9475
     β0 Std        0.8762    0.9205        0.5102    0.5208        0.5041    0.5139
     σ Std         0.6712    0.7042        0.3088    0.3094        0.3066    0.3027
     Nsim             596       600           600       600           600       600
 5   β0 Mean      -4.9036   -4.9703       -4.4464   -4.4855       -4.3141   -4.4727
     σ Mean        1.5187    1.5239        0.9203    0.903         0.9329    0.8885
     β0 Std        0.8442    0.9147        0.5051    0.5157        0.5213    0.5088
     σ Std         0.6535    0.7065        0.3343    0.326         0.3452    0.3211
     Nsim             599       600           600       600           600       600
10   β0 Mean      -4.9188   -4.9751       -4.4493   -4.4872       -4.3162   -4.4739
     σ Mean        1.5089    1.508         0.8443    0.8276        0.8485    0.8095
     β0 Std        0.8692    0.9066        0.4993    0.5104        0.5131    0.5034
     σ Std         0.6755    0.7024        0.3679    0.3621        0.3784    0.3588
     Nsim             600       600           600       600           600       600

Simulation means and standard deviations (std) of point estimators of the intercept (β0 = −4.4631 true value) and the second-level standard deviation (σ = 0.9944 true value). Cluster sample m = 24 (8 from stratum 1 and 16 from stratum 2) under PPS. Elementary unit sample size = 100 (50 from stratum 1 and 50 from stratum 2) under SRS. Pool size (s). Nsim denotes the number of simulations performed. Method 3 PML using raw weights at both levels, Method 4 PML using raw weights at the cluster level and scaling method A at the individual level, and Method 5 PML using raw weights at the cluster level and scaling method B.

Table 3.6. Comparison of informative sampling at both levels, at the cluster level, at the individual level, and non-informative sampling.

                                                 Weighting method
Sampling scheme    s   Parameter      1        2        3        4        5        6
Informative at     5   β0 Mean    -3.3646  -4.3682  -4.9311  -4.4645  -4.4519  -4.3626
both levels            σ Mean      0.9456   0.9621   1.5347   0.9367   0.9225   0.9541
                       β0 Std      0.2267   0.5387   0.8731   0.5214   0.515    0.551
                       σ Std       0.1846   0.3516   0.6673   0.3343   0.3295   0.3567
                  10   β0 Mean    -3.41    -4.367   -4.936   -4.4665  -4.4533  -4.3605
                       σ Mean      0.8658   0.8809   1.5137   0.8561   0.838    0.8633
                       β0 Std      0.2192   0.5345   0.8682   0.5172   0.5108   0.5475
                       σ Std       0.2072   0.3847   0.6658   0.3691   0.3667   0.4063
Informative at     5   β0 Mean    -3.3582  -4.367   -4.9401  -4.4615  -4.4489  -4.3563
the cluster            σ Mean      0.9378   0.964    1.5528   0.9383   0.9238   0.9477
level                  β0 Std      0.2158   0.5411   0.9204   0.5229   0.5163   0.5483
                       σ Std       0.1704   0.3462   0.7023   0.3302   0.3258   0.359
                  10   β0 Mean    -3.4028  -4.3672  -4.946   -4.4652  -4.4522  -4.3562
                       σ Mean      0.858    0.8857   1.5344   0.8625   0.8449   0.8611
                       β0 Std      0.2093   0.5357   0.9133   0.5173   0.5105   0.5435
                       σ Std       0.1852   0.3822   0.6996   0.3649   0.3617   0.4077
Informative at     5   β0 Mean    -4.3527  -4.3653  -4.9111  -4.4617  -4.4488  -4.3578
the individual         σ Mean      0.8974   0.9084   1.5165   0.8717   0.8529   0.8939
level                  β0 Std      0.3071   0.3177   0.4891   0.3108   0.3075   0.3142
                       σ Std       0.3014   0.3045   0.3797   0.329    0.335    0.3189
                  10   β0 Mean    -4.3579  -4.3698  -4.9178  -4.4703  -4.4568  -4.3628
                       σ Mean      0.8357   0.8428   1.4994   0.8106   0.7871   0.8289
                       β0 Std      0.3013   0.3118   0.4855   0.3047   0.3013   0.3089
                       σ Std       0.3173   0.3245   0.3842   0.3411   0.35     0.3366
Non-informative    5   β0 Mean    -4.4326  -4.4432  -5.0532  -4.527   -4.441   -4.4394
                       σ Mean      0.9032   0.9124   1.7172   1.0108   0.8722   0.9046
                       β0 Std      0.3124   0.3207   0.5256   0.3541   0.3337   0.3199
                       σ Std       0.3404   0.3544   0.4039   0.3546   0.3971   0.3548
                  10   β0 Mean    -4.4336  -4.4443  -5.0511  -4.5315  -4.4458  -4.4405
                       σ Mean      0.909    0.9194   1.7652   1.0299   0.8759   0.9092
                       β0 Std      0.3177   0.326    0.5352   0.3623   0.3402   0.3252
                       σ Std       0.3696   0.3845   0.4289   0.3856   0.4282   0.3872

Simulation means and standard deviations (std) of point estimators of the intercept (β0 = −4.4631 true value) and the second-level standard deviation (σ = 0.9944 true value). Cluster sample m = 24 (8 from stratum 1 and 16 from stratum 2) under PPS. Elementary unit sample size = 100 (50 from stratum 1 and 50 from stratum 2) under SRS. Pool size (s). 600 simulations were performed for each scenario. Method 1 unweighted maximum likelihood. Method 2 PML using raw weights at the cluster level, Method 3 PML using raw weights at both levels, Method 4 PML using raw weights at the cluster level and scaling method A at the individual level, Method 5 PML using raw weights at the cluster level and scaling method B, and Method 6 PML using method D with weights at the cluster level.

Table 3.7. Simulation means and standard deviations (std) of the model with a covariate at the individual level (β0 = −4.7598, β1 = 0.8290 and σ = 0.9820 true values).

                                    Weighting method
 s   Parameter       1        2        3        4        5        6
 1   β0 Mean     -3.6895  -4.6399  -5.2553  -4.7382  -4.7253  -4.6362
     β1 Mean      0.8444   0.8223   0.8391   0.8225   0.8222   0.8266
     σ Mean       0.9494   0.9731   1.568    0.9395   0.9267   0.9722
     β0 Std       0.2301   0.5562   0.9252   0.538    0.5298   0.556
     β1 Std       0.1297   0.199    0.2064   0.2004   0.1993   0.2002
     σ Std        0.1806   0.3389   0.6829   0.3263   0.3191   0.3513
     Nsim            600      600      600      600      600      600
 5   β0 Mean     -3.7669  -4.678   -5.9855  -4.684   -4.6393  -4.6657
     β1 Mean      0.9157   0.8185   0.8073   0.8278   0.8276   0.8228
     σ Mean       0.8888   0.9227   0.7719   0.9127   0.8822   0.8817
     β0 Std       0.249    0.5762   0.5789   0.5751   0.5765   0.5789
     β1 Std       0.2589   0.4308   0.4075   0.436    0.4352   0.4335
     σ Std        0.1853   0.3514   0.4974   0.3519   0.3947   0.4264
     Nsim            600      600      600      600      600      600
10   β0 Mean     -3.8508  -4.7389  -6.0572  -4.7451  -4.7061  -4.7233
     β1 Mean      0.9309   0.7341   0.7198   0.7359   0.737    0.7252
     σ Mean       0.8084   0.8445   0.7068   0.8028   0.7964   0.7901
     β0 Std       0.3006   0.6063   0.5992   0.6092   0.6069   0.6128
     β1 Std       0.4554   0.734    0.7109   0.7493   0.75     0.7447
     σ Std        0.2064   0.3904   0.4948   0.4257   0.425    0.4688
     Nsim            600      600      600      600      600      600

Cluster sample m = 24 (8 from stratum 1 and 16 from stratum 2) under PPS. Elementary unit sample size = 100 (50 from stratum 1 and 50 from stratum 2) under SRS. Pool size (s). Nsim denotes the number of simulations performed. Method 1 unweighted maximum likelihood. Method 2 PML using raw weights at the cluster level, Method 3 PML using raw weights at both levels, Method 4 PML using raw weights at the cluster level and scaling method A at the individual level, Method 5 PML using raw weights at the cluster level and scaling method B, and Method 6 PML using method D with weights at the cluster level.

Table 3.8. Population data with two regions, eight fields (clusters) and two strata per field
(levels of fertility).
Region  Field  FL  Binary response (y)                N_ijh  N_ij  N_i
1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 13
1 1 2 0 0 0 0 0 0 0 0 0 . . . . 9 22
1 2 1 0 0 0 1 0 0 0 0 1 0 . . . 10
1 2 2 0 0 0 0 0 0 0 0 . . . . . 8 18
1 3 1 0 0 0 0 0 0 0 0 . . . . . 8
1 3 2 0 0 0 0 0 1 0 0 0 0 0 0 0 13 21
1 4 1 0 0 0 0 0 0 0 0 0 0 0 0 . 12
1 4 2 0 0 0 0 0 0 0 0 0 0 0 . . 11 23 84
2 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 13
2 5 2 0 0 0 0 0 0 . . . . . . . 6 19
2 6 1 0 0 0 0 0 0 0 0 0 0 0 . . 11
2 6 2 0 0 0 0 0 0 0 0 0 . . . . 9 20
2 7 1 0 0 0 0 0 0 0 . . . . . . 7
2 7 2 0 0 0 0 0 0 0 0 0 0 0 0 . 12 19
2 8 1 0 0 0 0 0 0 0 0 . . . . . 8
2 8 2 0 0 0 0 0 0 0 0 0 0 0 0 0 13 21 79

Total plants                                             163   163  163

The binary response of each plant is y; N_ijh denotes the total number of plants per combination of region, field and fertility level (FL); N_ij denotes the total number of plants per field; and N_i denotes the total number of plants per region.

Table 3.9. Sample resulting by getting two fields within each region under PPS and doing
stratified sampling within each field at a fixed sample size of six plants per stratum under
SRS.
Region  Field  FL   wip    wijp   wip×wijp   Awp    bwp    dwp    Response per plant (y)   Σ(wip×wijp)
1 2 1 2.33 1.67 3.89 1.11 1.10 42 0 0 1 0 0 1 23.33
1 2 2 2.33 1.33 3.11 0.89 0.88 42 0 0 0 0 0 0 18.67
1 3 1 2.00 1.33 2.67 0.76 0.72 42 0 0 0 0 0 0 16.00
1 3 2 2.00 2.17 4.33 1.24 1.17 42 0 0 0 1 0 0 26.00
2 5 1 2.08 2.17 4.50 1.37 1.20 39.5 0 0 0 0 0 0 27.03
2 5 2 2.08 1.00 2.08 0.63 0.56 39.5 0 0 0 0 0 0 12.47
2 6 1 1.98 1.83 3.62 1.10 1.09 39.5 0 0 0 0 0 0 21.72
2 6 2 1.98 1.50 2.96 0.90 0.89 39.5 0 0 0 0 0 0 17.78

Table 3.10. Sample prepared in terms of pools for analysis.


Region Field FL pool yp wijp wip Awp bwp dwp
1 2 1 1 1 1.67 2.33 1.11 1.10 42.00
1 2 1 2 1 1.67 2.33 1.11 1.10 42.00
1 2 2 3 0 1.33 2.33 0.89 0.88 42.00
1 2 2 4 0 1.33 2.33 0.89 0.88 42.00
1 3 1 5 0 1.33 2.00 0.76 0.72 42.00
1 3 1 6 0 1.33 2.00 0.76 0.72 42.00
1 3 2 7 0 2.17 2.00 1.24 1.17 42.00
1 3 2 8 1 2.17 2.00 1.24 1.17 42.00
2 5 1 9 0 2.17 2.08 1.37 1.20 39.50
2 5 1 10 0 2.17 2.08 1.37 1.20 39.50
2 5 2 11 0 1.00 2.08 0.63 0.56 39.50
2 5 2 12 0 1.00 2.08 0.63 0.56 39.50
2 6 1 13 0 1.83 1.98 1.10 1.09 39.50
2 6 1 14 0 1.83 1.98 1.10 1.09 39.50
2 6 2 15 0 1.50 1.98 0.90 0.89 39.50
2 6 2 16 0 1.50 1.98 0.90 0.89 39.50
This sample resulted from getting two fields within each region using PPS and stratified sampling within each field at a fixed sample size of 6 plants per stratum under SRS. PPS means probability proportional to size; SRS means simple random sampling.
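The pooling step that turns the plant-level sample of Table 3.9 into the pool-level data of Table 3.10 can be sketched in a few lines. This is a minimal illustration (the function and variable names are ours, not the dissertation's), shown only for the two Region 1, Field 2 cells and a perfect diagnostic test.

```python
# Sketch: form pools of size s within each (region, field, fertility level)
# cell and score a pool positive if it contains at least one positive plant.
# Six plants per cell yield two pools of size 3, as in Table 3.10.

def make_pools(plants, s):
    """plants: list of (region, field, fl, y) tuples; returns pool records."""
    pools = []
    cells = {}
    for region, field, fl, y in plants:          # group plants by cell
        cells.setdefault((region, field, fl), []).append(y)
    for (region, field, fl), ys in sorted(cells.items()):
        for start in range(0, len(ys), s):       # cut each cell into pools
            chunk = ys[start:start + s]
            yp = int(any(chunk))                 # perfect test: any positive
            pools.append({"region": region, "field": field,
                          "fl": fl, "yp": yp})
    return pools

# Hypothetical plant responses mirroring the first two cells of Table 3.9
sample = [(1, 2, 1, y) for y in [0, 0, 1, 0, 0, 1]] + \
         [(1, 2, 2, y) for y in [0, 0, 0, 0, 0, 0]]
pools = make_pools(sample, s=3)
```

The four resulting pool responses match the yp column of the first four rows of Table 3.10.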

Table 3.11. NLMIXED statements for two-stage group testing regression under PPS.
proc nlmixed data=surveypool qpoints=10 cfactor=10000 empirical;
parms b_0=-3.0 sd=1;
s=3; Se=1; Sp=1;
bounds sd >= 0;
prd=1;
do i = 1 to s;
eta_0=b_0+u1; pi_0=1/(1+exp(-eta_0));
prd=prd*(1 - pi_0);
end;
ppool=Se - (Se+Sp-1)*prd; /* P(pool positive); reduces to 1-prd for a perfect test */
*Conditional log likelihood;
if (yp=1) then zz=ppool;
else if (yp=0) then zz=1-ppool;
if (zz>1e-8) then ll=log(zz);
else ll=-1e100;
*Augmented log likelihood;
loglink=bwp*ll; /*level one weights*/
model yp ~ general(loglink);
random u1 ~ normal([0],[sd*sd]) subject=cluster;
replicate wip; /*level two weights*/
estimate 'bo' b_0;
estimate 'sd' sd;
ods output ParameterEstimates=betasnn10 ConvergenceStatus=CS10;
predict pi_0 out=pi_0;
run;
proc means data=pi_0;
by cluster; var Pred;
output out=Bpre(drop = _TYPE_ _FREQ_);
run;
proc print data=Bpre(where=(_STAT_='MEAN')); run;
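The likelihood logic of the NLMIXED programming statements can be mirrored outside SAS. The following Python sketch (our names, illustrative inputs) computes the same weighted per-pool contribution, conditional on a given value of the random intercept.

```python
# Conditional on u1, a pool of s plants tests positive with probability
# ppool = Se - (Se + Sp - 1) * (1 - pi0)^s; the pool's contribution to the
# pseudo-log-likelihood is the level-1 weight times the Bernoulli log term.
# Parameter values here are illustrative, not fitted values.
import math

def pool_loglik(yp, b0, u1, s, bwp, Se=1.0, Sp=1.0):
    pi0 = 1.0 / (1.0 + math.exp(-(b0 + u1)))    # per-plant probability
    prd = (1.0 - pi0) ** s                      # all s plants negative
    ppool = Se - (Se + Sp - 1.0) * prd          # P(pool tests positive)
    zz = ppool if yp == 1 else 1.0 - ppool
    ll = math.log(zz) if zz > 1e-8 else -1e100  # guard against log(0)
    return bwp * ll                             # level-1 (pool) weight

contrib_pos = pool_loglik(yp=1, b0=-3.0, u1=0.0, s=3, bwp=1.10)
contrib_neg = pool_loglik(yp=0, b0=-3.0, u1=0.0, s=3, bwp=1.10)
```

At low prevalence a negative pool is far more likely than a positive one, so its log-likelihood contribution is much closer to zero.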

Appendix 3.A. NLMIXED code for regression group testing for a complex survey in two stages using average weights and one covariate at the individual level. Here we use the conditional weights under method A.
proc nlmixed data=poollisto qpoints=10 cfactor=10000 empirical;
parms b_0=-4.7 b_1=0.8 sd=1; k=10;
bounds sd >=0;
array XI_ind[*] XI_ind1-XI_ind10;
prd=1;
do i = 1 to k;
eta_0=b_0 + b_1*XI_ind[i] + u1*sd;
pi_0=1/(1+exp(-eta_0));
prd = prd *(1 - pi_0) ; end;
ppool= 1 - prd;
if (ypool=1) then zz=ppool;
else if (ypool=0) then zz=1-ppool;
if (zz>1e-8) then ll=log(zz); else ll=-1e100;
*Augmented log likelihood;
loglink=awpool*ll; /*inclusion of level 1 weights */
model ypool ~ general(loglink);
random u1 ~ normal([0],[1]) subject=cluster; /*Cluster identifies the level-2 units*/
replicate ww2pool;/*inclusion of level 2 weights */
estimate 'bo' b_0;
estimate 'b1' b_1;
estimate 'sd' sd;
ods output ParameterEstimates=betasnn10 ConvergenceStatus=CS10;
run;

Appendix 3.B. GLIMMIX code for regression group testing for a complex survey in two stages using average weights and no covariate. Here we use the conditional weights under method B.
data surveypoolN;
set surveypool;
do sub=1 to floor(wip); /*Inclusion of level 2 weights*/
clusternew=cluster*10000+sub;
output; end; run;

proc sort data=surveypoolN out=surveypoolN;
by replicate clusternew; run;
proc glimmix data= surveypoolN method=quad(qpoints=10);
class clusternew;
model ypool(event='1')= /solution dist=binary;
random intercept / subject=clusternew;
weight bwp; /*Inclusion of level 1 weights*/
prd=1; s=3; Se=1; Sp=1;
do i = 1 to s;
p1=exp(_linp_)/(1 + exp(_linp_));
prd=prd*(1 - p1);
end;
ppool=Se - (Se+Sp-1)*prd; /* P(pool positive), i.e., Se + (1-Se-Sp)*prd */
_MU_ = ppool;
output out=BlupsField pred(blup ilink)=predFieldp
lcl(blup ilink)=predFieldLp ucl(blup ilink)=predFieldUp;
run;
proc sql stimer;
create table psublups as select distinct cluster, predFieldp,
predFieldLp,predFieldUp, wip from BlupsField;
quit;

data blupfield; set psublups; s=3;
predField=1-(1-predFieldp)**(1/s);
predFieldL=1-(1-predFieldLp)**(1/s);
predFieldU=1-(1-predFieldUp)**(1/s);
drop predFieldp predFieldLp predFieldUp;
run;
proc print data=blupfield; run;

Chapter 4: Three-stage optimal sampling plans for group testing data given a pool size

Abstract
In surveys, sample size planning is important in achieving precise estimates at a low cost.
However, this issue is not adequately addressed for group testing data obtained from a
three-stage sampling process. In this study, we obtained the optimal allocation of
localities (l), fields (m) and pools per field (g) in a three-stage group testing survey for a
given pool size (s). These optimal values were obtained under the assumption of equal
locality and field sizes. To handle the unequal sample size case, we derived the relative
efficiency (RE) of unequal versus equal locality and field sizes to estimate the proportion.
By multiplying the sample of localities and fields obtained assuming equal cluster size by
the inverse of the corresponding REs, we adjusted the sample size required in the context
of unequal localities and field sizes. We also show the adjustments needed for correctly
allocating localities and fields in order to estimate the required budget and achieve a
certain power or precision.

Key words: three-stage, group testing, optimal sample size, relative efficiency, power,
precision.

4.1 Introduction

In this chapter we extend the optimal group testing sampling plans for two stages obtained in chapter 2 to a three-stage sampling process. Surveys conducted to estimate the proportion of transgenic maize in México have a three-stage structure, where: (1)
researchers take a sample of localities (primary sampling units=PSU) of the frame of
localities; (2) in each locality they take a sample of fields from the frame of fields
(secondary sampling unit=SSU); (3) in each field they take a sample of plants
(elementary sampling units); (4) with the sample of plants per field, they form pools of
size (s); and (5) a diagnostic test is performed for each pool (Piñeyro-Nelson et al., 2009).
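The five steps above can be sketched as a toy simulation. All frame sizes, sample sizes and the prevalence below are hypothetical, and the diagnostic test is taken as perfect; the sketch only illustrates the nesting of localities, fields, plants and pools.

```python
# Illustrative three-stage group testing survey: sample localities, then
# fields within localities, then plants within fields; form pools of size S
# and test each pool (perfect test: positive iff any member is positive).
import random

random.seed(1)
L_FRAME, M_FRAME, N_PLANTS, S = 50, 20, 60, 10  # hypothetical frame sizes
l, m = 5, 4                                      # localities, fields sampled
p_positive = 0.01                                # hypothetical prevalence

localities = random.sample(range(L_FRAME), l)            # step (1)
pool_results = []
for loc in localities:
    fields = random.sample(range(M_FRAME), m)            # step (2)
    for fld in fields:
        plants = [int(random.random() < p_positive)      # step (3)
                  for _ in range(N_PLANTS)]
        for r in range(0, N_PLANTS, S):                  # step (4): pooling
            pool = plants[r:r + S]
            z = int(any(pool))                           # step (5): testing
            pool_results.append(z)

n_pools = len(pool_results)  # l * m * (N_PLANTS // S) pools in total
```

With these settings, 5 × 4 × 6 = 120 pool-level tests replace 5 × 4 × 60 = 1,200 individual tests, which is the cost argument behind group testing.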

This multi-stage sampling is common in large national surveys because it is the most
cost-efficient sampling design where the population of interest consists of
subpopulations, also called clusters, that are used for selection. In practice, multistage
samples are preferred, because the interviewing or testing costs are greatly reduced if
these individuals are geographically or organizationally grouped. Such sample designs
reflect the organization of the natural and social worlds. Also, multistage surveys do not
require a list of all elementary units, since the sample is selected in stages, often taking
into account the hierarchical (nested) structure of the population. However, multistage
sampling design leads to dependent observations, and failing to deal with this properly in
the statistical analysis may lead to erroneous inferences.

In this chapter, section 4.2 gives the random logistic model used for individual testing.
Section 4.3 describes the random logistic model used for group testing. Section 4.4
provides an approximate marginal variance of the proportion. In section 4.5, we derive
sample sizes without constraints. Section 4.6 gives the optimal samples (l, m, g) given a
pool size (s) under constraints and assuming equal cluster size, while section 4.7 provides
tables for sample size determination assuming equal cluster sizes. Section 4.8 gives
adjustments for unequal locality and field sizes. Section 4.9 gives an example for
estimating the proportion of transgenic plants, and finally the discussion and conclusions
are presented in section 4.10.

4.2. Random logistic model for individual testing

In the context of individual testing, the standard random logistic model is obtained by conditioning on all fixed and random effects. The responses y_ijk are independent and Bernoulli distributed with probabilities π_ij, assuming that these probabilities are not related to any covariate (Moerbeek et al., 2001a). Thus the linear predictor using a logit link is equal to

η_ij = logit(π_ij) = ln[π_ij/(1 − π_ij)] = β0 + u_i + v_ij    (1)

where η_ij is the linear predictor, formed from a fixed part (β0) and two random parts (u_i and v_ij), with u_i ~ N(0, σ_u²) and v_ij ~ N(0, σ_v²). We also assume that the random components are mutually independent. Therefore, equation (1) can be written in terms of the probability of a positive individual as

π_ij = π_ij(β0, σ_u, σ_v) = [1 + exp(−(β0 + u_i + v_ij))]⁻¹    (2)

The mixed logit model for binary responses can be written as the probability π_ij plus a level-1 residual e_ijk:

y_ijk = π_ij + e_ijk    (3)

where e_ijk has zero mean and variance Var(y_ijk | u_i, v_ij) = π_ij(1 − π_ij) (Goldstein, 1991, 2003; Goldstein and Rasbash, 1996; Breslow and Clayton, 1993; Rodriguez and Goldman, 1995; Candy, 2000; Moerbeek et al., 2001a; Candel et al., 2010; Skrondal and Rabe-Hesketh, 2007). This model is widely used for estimating optimal sample sizes since the variance components are assumed to be known (Goldstein, 1991, 2003; Rodriguez and Goldman, 1995; Candy, 2000; Moerbeek et al., 2001a).
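The model of equations (1)-(3) can be exercised numerically. The sketch below (illustrative parameter values, not estimates from the survey) draws one locality effect and one field effect and then simulates plant-level Bernoulli responses.

```python
# Sketch of the random logistic model for individual testing: conditional on
# the random effects u_i and v_ij, each plant is Bernoulli(pi_ij).
# Parameter values are illustrative only.
import math
import random

def pi_ij(beta0, u_i, v_ij):
    """Equation (2): inverse logit of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-(beta0 + u_i + v_ij)))

random.seed(7)
beta0, sigma_u, sigma_v = -4.4631, 1.0, 0.5
u = random.gauss(0.0, sigma_u)     # locality effect u_i ~ N(0, sigma_u^2)
v = random.gauss(0.0, sigma_v)     # field effect v_ij ~ N(0, sigma_v^2)
p = pi_ij(beta0, u, v)
y = [int(random.random() < p) for _ in range(1000)]  # level-1 responses
```

Averaging the simulated y values recovers the conditional probability p for that locality-field combination, which is exactly the conditional Bernoulli structure the model assumes.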

4.3 Random logistic model for group testing

Suppose that, within field j, each plant in the ith locality is randomly assigned to one of the g_ij pools. Let y_ijrk = 1 if the kth plant in the rth pool in the jth field in locality i is positive and y_ijrk = 0 otherwise, for i = 1, 2, …, l, j = 1, 2, …, m_i, r = 1, 2, …, g_ij and k = 1, 2, …, s_ijr, with s_ijr the pool size. Note that y_ijrk is not observed, except when the pool size is 1. Let z_ijr = I(∑_k y_ijrk > 0) be the indicator variable of whether the rth pool inside field j in locality i is positive (z_ijr = 1) or negative (z_ijr = 0). Thus we only observe the random binary variable z_ijr, which takes the value z_ijr = 1 if the rth pool in field j and locality i tests positive, and z_ijr = 0 otherwise. Conditional on the random effects u_i and v_ij, pools within field j and locality i are independent. Therefore, the probability that the rth pool in field j and locality i tests positive is given as

π_ij^p = P(z_ijr = 1 | u_i, v_ij) = S_e + (1 − S_e − S_p) ∏_{k=1}^{s_ijr} (1 − π_ijk)

where S_e and S_p denote the sensitivity and specificity of the diagnostic test, respectively. S_e and S_p are assumed constant and close to 1 (Chen et al., 2009). For simplicity, in planning the required sample size we will assume an equal pool size, s, in all fields. This is a reasonable assumption for sample size determination, and π_ij^p reduces to

π_ij^p = P(z_ijr = 1 | u_i, v_ij) = S_e + k(1 − π_ij)^s    (4)

where k = 1 − S_e − S_p. The mixed group testing logit model for binary responses can be written as the probability π_ij^p plus a level-1 residual e_ijr^p:

z_ijr = π_ij^p + e_ijr^p    (5)

where π_ij^p is as given in equation (4) and e_ijr^p has zero mean and variance Var(z_ijr | u_i, v_ij) = π_ij^p(1 − π_ij^p). Now let θ = (β0, σ_u, σ_v) denote the vector of all estimable parameters. The multilevel likelihood is calculated for each level of nesting. First, the full conditional likelihood, ignoring the constant, for pool r in field j and locality i is given by

L_ijr(θ | u_i, v_ij) = [π_ij^p]^{z_ijr} [1 − π_ij^p]^{1 − z_ijr}    (6)

By multiplying the conditional likelihood (equation 6) by the densities of u_i and v_ij and integrating out the random effects, we get the marginal (unconditional) likelihood

L(θ | z) = ∏_{i=1}^{l} ∫ { ∏_{j=1}^{m_i} ∫ [ ∏_{r=1}^{g_ij} L_ijr(θ | u_i, v_ij) ] f(v_ij) dv_ij } f(u_i) du_i,

where f(u_i) is the density function of u_i and f(v_ij) is the density function of v_ij.
Unfortunately, this unconditional likelihood is intractable. There are various ways of
approximating the marginal likelihood function. Two of them are: (1) to use integral
approximations such as Gaussian quadrature; and (2) to linearize the non-linear part using
Taylor series expansion (TSE) (Moerbeek et al., 2001a; Breslow and Clayton, 1993). The
marginal form of the generalized linear mixed model (GLMM) is of interest here, since it
expresses the variance as a function of the marginal mean.
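Equation (4) is easy to check numerically. The helper below is a sketch (the function name and inputs are ours, not the dissertation's); with S_e = S_p = 1 it reduces to the familiar probability 1 − (1 − π)^s that a pool of s plants contains at least one positive.

```python
# Equation (4): probability that a pool of size s tests positive, given the
# per-plant probability pi and an assay with sensitivity Se and specificity
# Sp. Example inputs are illustrative.

def pool_positive_prob(pi, s, Se=1.0, Sp=1.0):
    """P(z_ijr = 1 | u_i, v_ij) = Se + k * (1 - pi)^s, with k = 1 - Se - Sp."""
    k = 1.0 - Se - Sp
    return Se + k * (1.0 - pi) ** s

# Perfect test: reduces to 1 - (1 - pi)^s
p_perfect = pool_positive_prob(0.01, s=10)
# Imperfect test: a truly negative pool still tests positive at rate 1 - Sp
p_imperfect = pool_positive_prob(0.01, s=10, Se=0.98, Sp=0.99)
```

Note the boundary behaviour: when pi = 0 the pool still tests positive with probability 1 − S_p, the false-positive rate of the assay.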

4.4 Approximate marginal variance of the proportion

The marginal model can be fitted by integrating the random effects out of the log-likelihood and maximizing the resulting marginal log-likelihood or, alternatively, by using an approximate method based on TSE (Breslow and Clayton, 1993). Next, π_ij^p is approximated using a first-order TSE around u_i = 0 and v_ij = 0, as

π_ij^p ≈ π_ij^p |_(u_i=v_ij=0) + (∂π_ij^p/∂u_i)|_(u_i=v_ij=0) (u_i − 0) + (∂π_ij^p/∂v_ij)|_(u_i=v_ij=0) (v_ij − 0)

π_ij^p ≈ π̃^p − ks(1 − π̃)^(s−1) π̃(1 − π̃) u_i − ks(1 − π̃)^(s−1) π̃(1 − π̃) v_ij    (7)

where π̃^p = π_ij^p |_(u_i=v_ij=0) = S_e + k(1 − [1 + exp(−β0)]⁻¹)^s and π̃ = π_ij |_(u_i=v_ij=0) = [1 + exp(−β0)]⁻¹, since u_i and v_ij are independent, and we use the fact that

∂π_ij^p/∂u_i = (∂π_ij^p/∂π_ij)(∂π_ij/∂η_ij)(∂η_ij/∂u_i), with ∂η_ij/∂u_i = ∂η_ij/∂v_ij = 1, ∂π_ij/∂η_ij = π_ij(1 − π_ij) and ∂π_ij^p/∂π_ij = −ks(1 − π_ij)^(s−1)

Now, by substituting equation (7) in equation (5), we can approximate equation (5) by

z_ijr ≈ π̃^p − ks(1 − π̃)^(s−1) π̃(1 − π̃) u_i − ks(1 − π̃)^(s−1) π̃(1 − π̃) v_ij + e_ijr^p    (8)

Therefore, the approximate marginal variance based on a first-order TSE of the responses of a pool is equal to

Var_M(z_ijr) ≈ [ks(1 − π̃)^(s−1)]² [π̃(1 − π̃)]² (σ_u² + σ_v²) + π̃^p(1 − π̃^p)

where the variance of e_ijr^p was approximated by π̃^p(1 − π̃^p). Note that z̄ = (∑_{i=1}^{l} ∑_{j=1}^{m} ∑_{r=1}^{g} z_ijr)/(lmg) is a moment estimator of E(π_ij^p) and its variance is equal to

Var_M(z̄) ≈ [ks(1 − π̃)^(s−1)]² [π̃(1 − π̃)]² σ_u²/l + [ks(1 − π̃)^(s−1)]² [π̃(1 − π̃)]² σ_v²/(lm) + π̃^p(1 − π̃^p)/(lmg)    (9)

Recall that we will select a sample of l localities and m fields, assuming that the same number of pools per field will be obtained, i.e., g = ḡ. Since the probability of success is not constant over trials but varies systematically from field to field, the parameter π_ij is a random variable with a probability distribution. Therefore, it is reasonable to work with the expected value of π_ij across localities and fields to determine sample size. To approximate E(π_ij) we take advantage of the relationship between z̄ and E(π_ij^p):

E(z̄) = E(π_ij^p) = E(S_e + k(1 − π_ij)^s) = S_e + kE((1 − π_ij)^s) = S_e + kE(Q)    (10)

where Q = (1 − π_ij)^s. Using a first-order TSE around u_i = 0 and v_ij = 0, we can approximate Q as

Q ≈ Q|_(u_i=v_ij=0) + (∂Q/∂u_i)|_(u_i=v_ij=0) (u_i − 0) + (∂Q/∂v_ij)|_(u_i=v_ij=0) (v_ij − 0)

Q ≈ Q̃ − s(1 − π̃)^(s−1) π̃(1 − π̃) u_i − s(1 − π̃)^(s−1) π̃(1 − π̃) v_ij    (11)

where Q̃ = Q|_(u_i=v_ij=0) = (1 − [1 + exp(−β0)]⁻¹)^s = (1 − π̃)^s, and we use the fact that ∂Q/∂u_i = (∂Q/∂π_ij)(∂π_ij/∂η_ij)(∂η_ij/∂u_i), with ∂π_ij/∂η_ij = π_ij(1 − π_ij), ∂η_ij/∂u_i = ∂η_ij/∂v_ij = 1 and ∂Q/∂π_ij = −s(1 − π_ij)^(s−1). Then

E(Q) ≈ Q̃

Also, doing a first-order TSE we can obtain (1 − E(π_ij))^s ≈ (1 − π̃)^s = Q̃, and so

E(Q) ≈ (1 − E(π_ij))^s

That is, we approximate E(Q) = E[(1 − π_ij)^s] by [1 − E(π_ij)]^s. This implies that E(π_ij^p) ≈ S_e + k(1 − E(π_ij))^s, and since z̄ is an estimator of E(π_ij^p), an estimator of E(π_ij) can be obtained from

S_e + k(1 − Ê(π_ij))^s ≈ z̄

Therefore, an estimator of E(π_ij) is

Ê(π_ij) ≈ 1 − ((z̄ − S_e)/k)^(1/s) = 1 − ((S_e − z̄)/(S_e + S_p − 1))^(1/s)

Œ
The variance of this estimator, (.‹/ ), can be approximated from the variance of ]
̅
~

(.)B ) of the function (”) = 1 − G K


` Ž J• ‚
€
(Equation 9) with a first order TSE around .

After some algebra we get:


,
ˆ(” )
Œ
-W(.‹/ )Y ≈ – — ˜ -x (]̅)
ˆ” •dGHu K
Ic

~
tq(•̂ )  Ž J• ‚ J 
= G € K € = b€(JH)‚ƒ~. However, since (.)B ) doesn’t have a close
 `
t• b
Where

exact form, we replace this by .2 ` and we obtain

$$\mathrm{Var}\left(\widehat{E}(p_{ij})\right) = \mathrm{Var}(\hat{p}) \approx \frac{\sigma_l^{2*}}{l} + \frac{\sigma_f^{2*}}{lm} + \frac{V(\delta)}{lmg} \qquad (12)$$

where $\sigma_l^{2*} = \left[\bar{p}(1-\bar{p})\right]^2\sigma_l^2$, $\sigma_f^{2*} = \left[\bar{p}(1-\bar{p})\right]^2\sigma_f^2$,
$$V(\delta) = \frac{(S_e-\bar{\pi}^*)^{2/s-2}\,\bar{\pi}^*(1-\bar{\pi}^*)}{s^2\,(S_e+S_p-1)^{2/s}},$$
and $\bar{\pi}^* = S_e + (1-S_e-S_p)(1-\bar{p})^{s}$.
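The components of Equation (12) can be evaluated directly. The following sketch uses our own helper names (not from the chapter) and reproduces the value $V(\delta) \approx 0.0000522$ used later in Section 4.9:

```python
# Sketch of V(delta) and Var(p-hat) from Eq. (12); helper names are ours.

def v_delta(p, s, Se, Sp):
    """Pool-level variance component V(delta)."""
    pi_star = Se + (1 - Se - Sp) * (1 - p) ** s
    return ((Se - pi_star) ** (2 / s - 2) * pi_star * (1 - pi_star)
            / (s ** 2 * (Se + Sp - 1) ** (2 / s)))

def var_phat(p, s, Se, Sp, sl2, sf2, l, m, g):
    """Var(p-hat) = sigma_l^2*/l + sigma_f^2*/(l m) + V(delta)/(l m g)."""
    c = (p * (1 - p)) ** 2  # [p(1-p)]^2 scales both variance components
    return c * sl2 / l + c * sf2 / (l * m) + v_delta(p, s, Se, Sp) / (l * m * g)

print(round(v_delta(0.0024, 50, 0.999, 0.997), 7))  # ≈ 0.0000522
```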

4.5 Sample sizes without constraints

4.5.1. How to obtain a certain width of the confidence interval


Assume the researcher is interested in choosing the number of localities, given the
number of fields (m), pools per field (g) and pool size (s), to obtain a specified width
($\omega$) of the confidence interval (CI) of the proportion. Assuming that the distribution of $\hat{p}$
is approximately normal with mean $\bar{p}$ and a fixed variance $\mathrm{Var}(\hat{p})$, the $(1-\alpha)100\%$
Wald confidence interval of $\bar{p}$ is given by $\hat{p} \mp z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{p})}$, where $z_{1-\alpha/2}$ is
the $1-\alpha/2$ quantile of the standard normal distribution. Therefore, the observed width
of the CI is equal to $w = 2z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{p})}$. The quantity $z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{p})}$ (added to and
subtracted from the observed proportion $\hat{p}$) in the CI is W/2 (where W is the
full width of the CI; W or W/2 can be set a priori by the researcher depending on the
desired precision). Therefore, if the researcher wants a CI of full width $\omega$,
we can obtain the required number of localities, l, by setting
$$2z_{1-\alpha/2}\sqrt{\frac{\left[\bar{p}(1-\bar{p})\right]^2\sigma_l^2}{l} + \frac{\left[\bar{p}(1-\bar{p})\right]^2\sigma_f^2}{lm} + \frac{V(\delta)}{lmg}} = \omega$$
and solving for l. The required number of localities, l, is equal to:
$$l = \frac{4z_{1-\alpha/2}^2}{\omega^2}\left[\left[\bar{p}(1-\bar{p})\right]^2\sigma_l^2 + \frac{\left[\bar{p}(1-\bar{p})\right]^2\sigma_f^2}{m} + \frac{V(\delta)}{mg}\right] \qquad (13)$$
Recall that equation (13) is useful for obtaining the required number of localities
given a number of fields, pools per field and a pool size. However, this sample size (Eq.
13) is not optimal.
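Equation (13) can be evaluated with a few lines of code. This is a sketch with our own function name; note that hand calculations that round $V(\delta)$ and other intermediate quantities, as in the worked examples later in the chapter, may differ slightly from the unrounded value returned here:

```python
# Sketch of Eq. (13): localities needed for a full CI width omega, given m, g, s.
def localities_for_ci_width(omega, z, p, s, Se, Sp, sl2, sf2, m, g):
    pi_star = Se + (1 - Se - Sp) * (1 - p) ** s
    vd = ((Se - pi_star) ** (2 / s - 2) * pi_star * (1 - pi_star)
          / (s ** 2 * (Se + Sp - 1) ** (2 / s)))
    c = (p * (1 - p)) ** 2
    core = c * sl2 + c * sf2 / m + vd / (m * g)  # bracketed term of Eq. (13)
    return 4 * z ** 2 * core / omega ** 2

# By construction, plugging the (unrounded) l back into the CI width
# 2*z*sqrt(Var(p-hat)) recovers omega exactly.
l = localities_for_ci_width(0.005, 1.96, 0.0024, 50, 0.999, 0.997, 0.57, 0.77, 2, 10)
```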

4.5.2 How to obtain a certain power


Assume we are interested in testing $H_0: \bar{p} = \bar{p}_0$ vs $H_1: \bar{p} > \bar{p}_0$. For example, the
European Union (Anonymous, 2003) requires that the proportion of genetically modified
(GM) seed impurities in a seed lot be lower than 0.005. Here, given a number of fields
(m), pools per field (g) and pool size (s), we want to reach a power of $(1-\gamma)$ with
significance level $\alpha$, when $\delta = |\bar{p} - \bar{p}_0|$. To perform a test with a type I error rate $\alpha$ and
a type II error rate $\gamma$, the following must hold:
$$z_{1-\alpha} = (\hat{p} - \bar{p}_0)\big/\sqrt{\mathrm{Var}(\hat{p})_0} \qquad \text{and} \qquad z_{1-\gamma} = (\bar{p} - \hat{p})\big/\sqrt{\mathrm{Var}(\hat{p})_0}$$

Here $\mathrm{Var}(\hat{p})_0$ is the variance of $\hat{p}$ computed using the value of the null hypothesis. Both $z_{1-\alpha}$ and
$z_{1-\gamma}$ have a standard normal distribution since the variance components are assumed
known. According to Cochran (1977) and Moerbeek et al. (2000), this results in the
relation:
$$V_0 = \frac{\delta^2}{(z_{1-\alpha}+z_{1-\gamma})^2} \qquad (14)$$
If we change the alternative hypothesis to $H_1: \bar{p} < \bar{p}_0$, equation (14) is still valid, but
not if the alternative is $H_1: \bar{p} \neq \bar{p}_0$, in which case $z_{1-\alpha}$ needs to be replaced by $z_{1-\alpha/2}$ in
Equation (14).

Then, given a certain number of fields, pools per field and pool size, what is the
required number of localities, l, needed to achieve a power level $(1-\gamma)$ for a desired $\delta$?
To obtain the required l, we solve for l from
$$\frac{\left[\bar{p}_0(1-\bar{p}_0)\right]^2\sigma_l^2}{l} + \frac{\left[\bar{p}_0(1-\bar{p}_0)\right]^2\sigma_f^2}{lm} + \frac{V(\delta_0)}{lmg} = \frac{\delta^2}{(z_{1-\alpha}+z_{1-\gamma})^2}.$$
Therefore, by solving for l, the required number of localities (l) is equal to:
$$l = \frac{(z_{1-\alpha}+z_{1-\gamma})^2}{|\delta|^2}\left[\left[\bar{p}_0(1-\bar{p}_0)\right]^2\sigma_l^2 + \frac{\left[\bar{p}_0(1-\bar{p}_0)\right]^2\sigma_f^2}{m} + \frac{V(\delta_0)}{mg}\right] \qquad (15)$$
Equation (15) gives the required number of localities given the number of fields (m),
number of pools per field (g) and pool size (s), but this sample size is not optimal.
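A sketch of Equation (15) in the same style (our own function name; $V(\delta_0)$ is recomputed internally at $\bar p_0$, and rounding of intermediate values by hand may give slightly different figures):

```python
# Sketch of Eq. (15): localities needed to detect delta with power 1-gamma,
# given m, g and pool size s; function and argument names are ours.
def localities_for_power(delta, z_alpha, z_gamma, p0, s, Se, Sp, sl2, sf2, m, g):
    pi0 = Se + (1 - Se - Sp) * (1 - p0) ** s
    vd0 = ((Se - pi0) ** (2 / s - 2) * pi0 * (1 - pi0)
           / (s ** 2 * (Se + Sp - 1) ** (2 / s)))
    c = (p0 * (1 - p0)) ** 2
    core = c * sl2 + c * sf2 / m + vd0 / (m * g)  # bracketed term of Eq. (15)
    return (z_alpha + z_gamma) ** 2 * core / delta ** 2
```

With the (unrounded) l returned here, $\delta/\sqrt{V_0}$ equals $z_{1-\alpha}+z_{1-\gamma}$ by construction.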

4.6 Optimal sample sizes under constraints for a given pool size

4.6.1 Minimizing variance subject to a budget constraint


Assume that we have a fixed sampling budget for estimating the average population
proportion $\bar{p}$. The question of interest is: what is the optimal allocation of localities (l),
fields (m) and pools per field (g), given a pool size (s), for estimating the proportion with
minimum variance, subject to the following budget constraint?
$$C = c_1lmgs + c_2lmg + c_3lm + c_4l \qquad (c_i > 0;\ l, m, g \ge 2;\ i = 1, 2, 3, 4) \qquad (16)$$
where C is the total sampling budget available, $c_1$ is the cost of sampling and measuring a
plant in an already sampled field, $c_2$ is the cost of testing a pool of size s, $c_3$ is the cost of
sampling and measuring a field, and $c_4$ is the cost of sampling and measuring a
locality. The values of $c_3$ and $c_4$ are average values, since at the time of planning the
survey, it is not known which localities and fields per locality will be sampled, and the
travel times are not the same for all localities and fields per locality. The budget C and
costs $c_1$, $c_2$, $c_3$ and $c_4$ are given in dollars but can be changed to any other currency.
Optimal allocation of the units can be performed using Lagrange multipliers. By
combining equations (12) and (16), we obtain the Lagrangean
$$L(l, m, g, \lambda) = \mathrm{Var}(\hat{p}) + \lambda\left[C - (c_1lmgs + c_2lmg + c_3lm + c_4l)\right] \qquad (17)$$
where $\mathrm{Var}(\hat{p})$, given by equation (12), is the objective function that will be minimized with
respect to l, m and g, given a pool size (s), subject to the constraint given in Equation (16),
and $\lambda$ is the Lagrange multiplier. The partial derivatives of Eq. (17) with respect to $\lambda$, l, m
and g are
$$\frac{\partial L}{\partial \lambda} = C - (c_1lmgs + c_2lmg + c_3lm + c_4l) = 0; \text{ then } l = \frac{C}{mgsc_1 + mgc_2 + mc_3 + c_4}$$
$$\frac{\partial L}{\partial g} = -\frac{V(\delta)}{lmg^2} - \lambda lm(sc_1+c_2) = 0; \text{ then } \lambda = -\frac{V(\delta)}{l^2m^2g^2(sc_1+c_2)}$$
$$\frac{\partial L}{\partial m} = -\frac{\sigma_f^{2*}}{lm^2} - \frac{V(\delta)}{lm^2g} - \lambda\left[lg(sc_1+c_2) + lc_3\right] = 0$$
$$\frac{\partial L}{\partial l} = -\frac{\sigma_l^{2*}}{l^2} - \frac{\sigma_f^{2*}}{l^2m} - \frac{V(\delta)}{l^2mg} - \lambda\left[mg(sc_1+c_2) + mc_3 + c_4\right] = 0$$

By solving these equations, we obtain the optimal values for l, m and g (see
Appendix 4.A):
$$l = \frac{C}{mgsc_1 + mgc_2 + mc_3 + c_4}, \quad \text{where} \quad m = \sqrt{\frac{c_4\,\sigma_f^{2*}}{c_3\,\sigma_l^{2*}}}, \qquad g = \sqrt{\frac{c_3\,V(\delta)}{(sc_1+c_2)\,\sigma_f^{2*}}} \qquad (18)$$
First, we calculate the number of pools per field, g, and then we can calculate the
number of fields per locality, m. Using these values, we calculate the number of localities
to be sampled, l. Each of g, m and l should be rounded up to the nearest integer to maintain
the hierarchical structure, and set equal to 2 if the computed value is less than 2. Note
that equation (18) is a generalization of the optimal sample size for continuous data in
three-level sampling given by Cochran (1977).
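The allocation in Equation (18) can be sketched as follows (our own function name; each value is rounded up with a floor of 2, as prescribed above). With the inputs of the Section 4.9 example it reproduces l = 5, m = 2, g = 3:

```python
from math import sqrt, ceil

# Sketch of the optimal allocation in Eq. (18); names are ours.
def optimal_allocation(C, c1, c2, c3, c4, p, s, Se, Sp, sl2, sf2):
    pi_star = Se + (1 - Se - Sp) * (1 - p) ** s
    vd = ((Se - pi_star) ** (2 / s - 2) * pi_star * (1 - pi_star)
          / (s ** 2 * (Se + Sp - 1) ** (2 / s)))
    sf2s = (p * (1 - p)) ** 2 * sf2  # sigma_f^2*
    sl2s = (p * (1 - p)) ** 2 * sl2  # sigma_l^2*
    g = max(2, ceil(sqrt(c3 * vd / ((s * c1 + c2) * sf2s))))
    m = max(2, ceil(sqrt(c4 * sf2s / (c3 * sl2s))))
    l = max(2, ceil(C / (m * g * s * c1 + m * g * c2 + m * c3 + c4)))
    return l, m, g

# Section 4.9's inputs: C=20000, c1=10, c2=35, c3=300, c4=500, p=0.0024, s=50.
print(optimal_allocation(20000, 10, 35, 300, 500, 0.0024, 50, 0.999, 0.997, 0.57, 0.77))
# → (5, 2, 3)
```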

4.6.2 Minimizing the budget to obtain a certain width of the confidence interval
So far, the allocation of units minimizing $\mathrm{Var}(\hat{p})$ has been derived under the condition that
the sampling and measuring budget is fixed at a certain value. However, the researcher
often wants the allocation of units that minimizes the sampling and measuring
budget while obtaining a specified width ($\omega$) of the confidence interval (CI) of the
proportion. The solution to this optimization problem is the same as minimizing the total
budget subject to a variance constraint. The variance constraint is obtained from the
$(1-\alpha)100\%$ Wald confidence interval of $\bar{p}$, $\hat{p} \mp z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{p})}$ (given in section 4.5.1);
since the width of the CI is equal to $w = 2z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{p})}$, and since we specified the
required width of the CI to be $\omega$, this implies that $\mathrm{Var}(\hat{p}) = \omega^2/4z_{1-\alpha/2}^2$. Therefore, the
optimization problem is to minimize the sampling budget as given in Equation (16) under
the condition that $\mathrm{Var}(\hat{p}) = \omega^2/4z_{1-\alpha/2}^2$ is fixed.
That is, we want to minimize $C = c_1lmgs + c_2lmg + c_3lm + c_4l$ subject to $\mathrm{Var}(\hat{p}) = V_0$. Again, using Lagrange multipliers,
the corresponding Lagrangean is:
$$L(l, m, g, \lambda) = c_1lmgs + c_2lmg + c_3lm + c_4l + \lambda\left[\mathrm{Var}(\hat{p}) - V_0\right].$$
Now the partial derivatives of L with respect to $\lambda$, l, m and g are
$$\frac{\partial L}{\partial \lambda} = \frac{\sigma_l^{2*}}{l} + \frac{\sigma_f^{2*}}{lm} + \frac{V(\delta)}{lmg} - V_0 = 0; \text{ then } l = \left[\sigma_l^{2*} + \frac{\sigma_f^{2*}}{m} + \frac{V(\delta)}{mg}\right]\Big/V_0$$
$$\frac{\partial L}{\partial g} = lm(sc_1+c_2) - \lambda\frac{V(\delta)}{lmg^2} = 0; \text{ then } \lambda = \frac{l^2m^2g^2(sc_1+c_2)}{V(\delta)}$$
$$\frac{\partial L}{\partial m} = lg(sc_1+c_2) + lc_3 - \frac{\lambda}{lm^2}\left[\sigma_f^{2*} + \frac{V(\delta)}{g}\right] = 0$$
$$\frac{\partial L}{\partial l} = mg(sc_1+c_2) + mc_3 + c_4 - \lambda\left[\frac{\sigma_l^{2*}}{l^2} + \frac{\sigma_f^{2*}}{l^2m} + \frac{V(\delta)}{l^2mg}\right] = 0$$
By solving these equations, we have that the optimal values are (see Appendix 4.B):
$$l = \left[\sigma_l^{2*} + \frac{\sigma_f^{2*}}{m} + \frac{V(\delta)}{mg}\right]\Big/V_0, \quad \text{where} \quad m = \sqrt{\frac{c_4\,\sigma_f^{2*}}{c_3\,\sigma_l^{2*}}}, \qquad g = \sqrt{\frac{c_3\,V(\delta)}{(sc_1+c_2)\,\sigma_f^{2*}}} \qquad (19)$$
+ (19)
5 .

Note that the number of fields per locality, m, and the number of pools per field, g,
required when we minimize the budget subject to -(./) = - (Equation 19) are the same
as when minimizing -(./) subject to a budget constraint (Equation 18). However, the
expression for obtaining the required number of localities, l, is different. In this case, the
value of - = «, /4]J¯/,
,
is substituted into Equation (19) and the expression for the
„
1l~ƒ²/„  (JH
 )V„ †„
required number of localities is D = [T.2(1 − .2)V, OÄ, + ]. Another
TH š(›)
³„ r rq
+

way of obtaining the same solution to this problem is given in Appendix 4. C.
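A sketch of the locality formula in Equation (19), with our own function name; by construction, plugging the unrounded l back into Equation (12) returns exactly the target variance $V_0$:

```python
# Sketch of Eq. (19): given m and g (e.g., from Eq. 18), the number of
# localities that attains a target variance V0; names are ours.
def localities_for_variance(V0, p, s, Se, Sp, sl2, sf2, m, g):
    pi_star = Se + (1 - Se - Sp) * (1 - p) ** s
    vd = ((Se - pi_star) ** (2 / s - 2) * pi_star * (1 - pi_star)
          / (s ** 2 * (Se + Sp - 1) ** (2 / s)))
    c = (p * (1 - p)) ** 2
    return (c * sl2 + c * sf2 / m + vd / (m * g)) / V0
```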

4.6.3 Minimizing the budget to obtain a certain power


Assume a threshold is defined a priori, and our main interest is to test $H_0: \bar{p} = \bar{p}_0$ vs
$H_1: \bar{p} > \bar{p}_0$. We want to determine a sampling plan (i.e., l, m and g given a pool size) for
minimizing the budget required for this test to have a specified power $(1-\gamma)$ and
significance level $\alpha$, when $\delta = |\bar{p} - \bar{p}_0|$. Again, $\mathrm{Var}(\hat{p})_0$ is a fixed quantity and equal to
equation (14), since we want to minimize the total budget to obtain a specified power
$(1-\gamma)$. Therefore, we want to minimize $C = c_1lmgs + c_2lmg + c_3lm + c_4l$ subject to
$\mathrm{Var}(\hat{p}) = V_0$. The optimal allocation of fields and pools per field is also given in equation
(19), but using equation (14) to get $V_0$. Here,
$$V(\delta_0) = \frac{(S_e-\bar{\pi}_0^*)^{2/s-2}\,\bar{\pi}_0^*(1-\bar{\pi}_0^*)}{s^2\,(S_e+S_p-1)^{2/s}}$$
is used in place of $V(\delta)$, and $\bar{p}_0$ in place of $\bar{p}$, where $\bar{\pi}_0^* = S_e + (1-S_e-S_p)(1-\bar{p}_0)^{s}$.
This implies that
$$l = \frac{(z_{1-\alpha}+z_{1-\gamma})^2}{|\delta|^2}\left[\left[\bar{p}_0(1-\bar{p}_0)\right]^2\sigma_l^2 + \frac{\left[\bar{p}_0(1-\bar{p}_0)\right]^2\sigma_f^2}{m} + \frac{V(\delta_0)}{mg}\right],$$
where $m = \sqrt{\dfrac{c_4\,\sigma_{f0}^{2*}}{c_3\,\sigma_{l0}^{2*}}}$ and $g = \sqrt{\dfrac{c_3\,V(\delta_0)}{(sc_1+c_2)\,\sigma_{f0}^{2*}}}$, with $\sigma_{l0}^{2*} = \left[\bar{p}_0(1-\bar{p}_0)\right]^2\sigma_l^2$ and
$\sigma_{f0}^{2*} = \left[\bar{p}_0(1-\bar{p}_0)\right]^2\sigma_f^2$.
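The variance target of Equation (14) is one line of arithmetic. A sketch with our own function name, using the values that appear later in Section 4.9 ($\delta = 0.003$, $\alpha = 0.05$, power 0.90):

```python
# Sketch of Eq. (14): the variance target implied by the desired power.
def v0_for_power(delta, z_alpha, z_gamma):
    return delta ** 2 / (z_alpha + z_gamma) ** 2

print(v0_for_power(0.003, 1.645, 1.282))  # ≈ 0.000001051
```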

4.7 Tables for determining the optimal sample size assuming equal locality and field
sizes

This section contains tables (Tables 4.1 and 4.2) that help to calculate the optimal sample
size assuming equal locality and field sizes, given a pool size (s). These two tables can be
used to minimize the variance given a budget constraint or to minimize the budget given
a variance constraint. For example, assume that we want optimal values of l, m and g to
minimize $\mathrm{Var}(\hat{p})$ given a pool size (s = 20) and a budget constraint equal to C = 20000.
Also assume that after a literature review, we estimate $c_1 = 10$, $c_2 = 35$, $c_3 = 400$,
$c_4 = 1200$, $\sigma_l^2 = 0.25$, $\sigma_f^2 = 0.15$, $S_e = S_p = 0.95$, and $\bar{p} = 0.01$. This implies that
$c_2/c_1 = 3.5$, $c_3/c_1 = 40$ and $c_4/c_1 = 120$. Then, looking at Table 4.1 at the intersection
of the value $\bar{p} = 0.01$ (first column) and the value $\sigma_f^2 = 0.15$ (second
column) for $S_e = S_p = 0.95$ and s = 20, we get the required optimal values of fields per
locality (m = 2, column 3) and pools per field (g = 9, column 4). Finally, using the optimal
values of m and g and the costs, we can calculate the optimal number of localities as:
$$l = \frac{C}{mgsc_1 + mgc_2 + mc_3 + c_4} = \frac{20000}{(2)(9)(20)(10) + (2)(9)(35) + (2)(400) + 1200} = 3.21 \approx 4$$

Table 4.1 can also be used to calculate the optimal allocation of l, m and g given a pool
size (s) to minimize the budget, C, subject to a variance constraint equal to $V_0 = \omega^2/4z_{1-\alpha/2}^2$
(to get a CI width of $\omega = 0.015$ at 95% confidence); then $z_{1-\alpha/2} = 1.96$, which
implies that $V_0 = \frac{(0.015)^2}{4(1.96)^2} = 0.00001464$. Assume the same values of $c_1$, $c_2$, $c_3$, $c_4$, $\sigma_l^2$,
$\sigma_f^2$, $S_e$, $S_p$, $\bar{p}$ and s = 20. Then the optimal values of fields per locality and pools per field
are 2 and 9, respectively (using Table 4.1 exactly as above). However, now the optimal
number of localities is calculated as:
$$l = \left[\left[\bar{p}(1-\bar{p})\right]^2\sigma_l^2 + \frac{\left[\bar{p}(1-\bar{p})\right]^2\sigma_f^2}{m} + \frac{V(\delta)}{mg}\right]\Big/V_0 = \left[\left[0.01(1-0.01)\right]^2(0.25) + \frac{\left[0.01(1-0.01)\right]^2(0.15)}{2} + \frac{V(\delta)}{(2)(9)}\right]\Big/0.00001464 = 6.23 \approx 7$$

If we wish to calculate the optimal values for a given power, we can also use Table
4.1 and the last formula to calculate the required number of localities, but using the $V_0$
calculated with Equation (14). Table 4.2 should be used exactly as Table 4.1, assuming a
given pool size; the only difference is that now there are five options for the ratios $c_3/c_1$
and $c_4/c_1$, and only one value of $\sigma_f^2 = 0.20$.

4.8 Adjusting for unequal locality and field sizes

In practice, unequal cluster sizes (localities and fields) are the rule. Cluster size variation
increases bias and produces a considerable loss of power and precision in the parameter
estimates. For this reason, we will calculate the relative efficiency of unequal versus
equal cluster sizes (localities and fields) for adjusting the optimal sample size derived
under the assumption of equal locality and field sizes. The definition of the relative
efficiency of equal versus unequal cluster sizes is:
$$RE(\hat{p}) = \frac{\mathrm{Var}\left(\hat{p}_{equal}\right)}{\mathrm{Var}\left(\hat{p}_{unequal}\right)} \qquad (20)$$

where $\mathrm{Var}\left(\hat{p}_{equal}\right)$ denotes the variance of the proportion estimator given a design
with equal cluster sizes, and $\mathrm{Var}\left(\hat{p}_{unequal}\right)$ denotes a similar value for an unequal
cluster size design, but with the same number of localities (l), fields (m) and the same
total number of pools ($N = \sum_{i=1}^{l}\sum_{j=1}^{m} g_{ij}$) as with the equal cluster size design. Given
that it is possible to have variability in locality and field sizes, an $RE(\hat{p})$ will be calculated
for localities ($RE(\hat{p})_L$) and another for fields ($RE(\hat{p})_F$) to incorporate both variabilities.
To derive $RE(\hat{p})_L$, we assume that only localities have variation in size, whereas to
derive $RE(\hat{p})_F$, we assume that only fields have size variation. If we wish to calculate
the total relative efficiency, it should be equal to $RE(\hat{p}) = RE(\hat{p})_L \times RE(\hat{p})_F$.

Assuming only localities have different sizes, and using Equation (20), $RE(\hat{p})_L$ is
equal to:
$$RE(\hat{p})_L = \frac{\left(\sigma_l^{2*} + \dfrac{\sigma_c^{2*}}{\bar{m}}\right)\Big/l}{\left[\sum_{i=1}^{l}\left(\sigma_l^{2*} + \dfrac{\sigma_c^{2*}}{m_i}\right)^{-1}\right]^{-1}} = \frac{1}{l}\sum_{i=1}^{l}\frac{(\bar{m}+\lambda_L)\,m_i}{\bar{m}\,(m_i+\lambda_L)} \qquad (21)$$
where $\sigma_l^{2*} = \left[\bar{p}(1-\bar{p})\right]^2\sigma_l^2$, $\sigma_f^{2*} = \left[\bar{p}(1-\bar{p})\right]^2\sigma_f^2$, $\sigma_c^{2*} = \sigma_f^{2*} + V(\delta)/\bar{g}$ and $\lambda_L = \sigma_c^{2*}/\sigma_l^{2*}$.
Now, assuming unequal field sizes in locality i, the relative efficiency of locality i
($RE_i$) is:
$$RE_i = \frac{\left(\sigma_f^{2*} + \dfrac{V(\delta)}{\bar{g}}\right)\Big/m_i}{\left[\sum_{j=1}^{m_i}\left(\sigma_f^{2*} + \dfrac{V(\delta)}{g_{ij}}\right)^{-1}\right]^{-1}} = \frac{1}{m_i}\sum_{j=1}^{m_i}\frac{(\bar{g}+\lambda_F)\,g_{ij}}{\bar{g}\,(g_{ij}+\lambda_F)}$$
where $\lambda_F = V(\delta)/\sigma_f^{2*}$.
Therefore, the average relative efficiency of the l localities, assuming unequal field
sizes, is equal to
$$RE(\hat{p})_F = \frac{1}{l}\sum_{i=1}^{l}\frac{1}{m_i}\sum_{j=1}^{m_i}\frac{(\bar{g}+\lambda_F)\,g_{ij}}{\bar{g}\,(g_{ij}+\lambda_F)} \qquad (22)$$

Note that Equations (21) and (22) can be considered moment estimators of
$$E\left[\frac{(\bar{m}+\lambda_L)\,M_L}{\bar{m}\,(M_L+\lambda_L)}\right] \quad \text{and} \quad E\left[\frac{(\bar{g}+\lambda_F)\,M_F}{\bar{g}\,(M_F+\lambda_F)}\right],$$
respectively, assuming that the locality and field sizes
($m_i$, $i = 1, 2, \ldots, l$; $g_{ij}$, $j = 1, 2, \ldots, m_i$) are realizations of two random variables $M_L$ (with
mean $\mu_L$ and standard deviation $\tau_L$) and $M_F$ (with mean $\mu_F$ and standard deviation $\tau_F$),
respectively. Equations (21) and (22) are therefore equal to the equations derived for
obtaining the RE of equal versus unequal cluster sizes in cluster randomized and
multicenter trials given by Van Breukelen et al. (2007) for recovering the loss of power
when estimating treatment effects using a linear model. Here we use RE to repair the loss
of power or precision when estimating the proportion using a random logistic model for
group testing. Define $\theta_k = \mu_k/(\mu_k+\lambda_k)$ and the coefficient of variation of the
random variable $M_k$ by $CV_k = \tau_k/\mu_k$, with $k = L, F$. Since $RE(\hat{p})_L$ and $RE(\hat{p})_F$ were
expressed in the form derived by Van Breukelen et al. (2007, pp. 2601-2602; see Appendix 4.D),
a second-order Taylor series approximation of Equations (21) and (22) can be
obtained. For localities, this is equal to
$$RE(\hat{p})_L^{T} \approx 1 - CV_L^2\,\theta_L(1-\theta_L) \qquad (23)$$
And for fields, it is equal to:
$$RE(\hat{p})_F^{T} \approx 1 - CV_F^2\,\theta_F(1-\theta_F) \qquad (24)$$
This is possible because the expectation part of the moment estimators of Equations
(21) and (22) is equal to
$$E\left(\frac{M_k}{M_k+\lambda_k}\right) \approx \theta_k\left[1 - CV_k^2\,\theta_k(1-\theta_k)\right], \qquad k = L, F.$$
$RE(\hat{p})_L^{T}$ and $RE(\hat{p})_F^{T}$ do not depend on the number of localities and fields, respectively, but rather on
the distribution of cluster sizes (mean and variance of locality and field sizes) and the intraclass
correlations. Note that $CV_F = 0$ means that the fields are of equal size and $RE(\hat{p})_F^{T} = 1$; in this
case only the sample of localities should be adjusted. Also, to correct for the loss of
efficiency due to the assumption of equal locality sizes, one simply divides the number of
localities (l) by the expected RE resulting from Equation (23). The adjustment for
unequal field sizes is the same, but using Equation (24). For practical purposes, we will
denote $RE(\hat{p})_L^{T} = RE_L^{T}$ and $RE(\hat{p})_F^{T} = RE_F^{T}$.

4.9 A numerical example for estimating the presence of transgenic maize

In 2004, a study was conducted to detect the presence of genetically modified maize
plants in farmers’ fields in two localities (7 and 11) in the Sierra Juárez region of the
Mexican state of Oaxaca (see Table 4.3). Thirty fields were sampled in each locality, and
300 leaves were collected from plants randomly chosen throughout each field. Six pools
of 50 leaves each were formed from the 300 leaves. DNA was extracted from each pool
sample and the presence of CaMV-35S sequences was determined by polymerase chain
reaction (PCR) (see Table 4.3) (Piñeyro-Nelson et al., 2009).

Assuming that we wish to conduct another study in this region of Oaxaca, we can
estimate the parameters ($\bar{p}$, $\sigma_l^2$, $\sigma_f^2$) required for calculating the optimal sample size using
the information given in Table 4.3, since the authors only reported the total number of
positive fields and pools per locality. Assuming a non-informative sampling process, we
estimated the parameters ($\bar{p}$, $\sigma_l^2$, $\sigma_f^2$) by fitting model (1) to these data, taking into
account the pooled data (see in Appendix 4.E the Glimmix code used in the estimation
process). The resulting estimates were $\bar{p} = 0.0024$, $\sigma_l^2 = 0.57$ and $\sigma_f^2 = 0.77$. After
performing a literature review, we decided to use $S_e = 0.999$, $S_p = 0.997$, and
C = 20,000 (total budget for the study); $c_1 = 10$ is the cost of enrolling a plant in
the study, $c_2 = 35$ is the cost of each diagnostic test, $c_3 = 300$ the cost of enrolling a
field in the study, and $c_4 = 500$ the cost of enrolling a locality in the study. Next we
illustrate how we obtained the optimal sample sizes.

For minimizing the variance. Once again, we assume a pool size of s = 50. Then
$$\bar{\pi}^* = 0.999 + (1 - 0.999 - 0.997)(1 - 0.0024)^{50} = 0.115755$$
and
$$V(\delta) = \frac{(0.999 - 0.115755)^{2/50-2}\,(0.115755)(1 - 0.115755)}{50^2\,(0.999 + 0.997 - 1)^{2/50}} = 0.0000522.$$
Therefore
$$g = \sqrt{\frac{c_3\,V(\delta)}{(sc_1+c_2)\,\sigma_f^{2*}}} = \frac{\sqrt{300}\,\sqrt{0.0000522}}{\sqrt{(50)(10)+35}\;(0.0024)(1-0.0024)(0.8775)} = 2.576 \approx 3,$$
$$m = \sqrt{\frac{c_4}{c_3}}\,\frac{\sigma_f}{\sigma_l} = \sqrt{\frac{500}{300}}\;\frac{0.8775}{0.7550} = 1.50 \approx 2,$$
$$l = \frac{C}{mgsc_1 + mgc_2 + mc_3 + c_4} = \frac{20000}{(2)(3)(50)(10) + (2)(3)(35) + (2)(300) + 500} = 4.64 \approx 5.$$
This means that we need to select five localities at random from the population of
localities, two fields at random from each selected locality, and three pools per selected
field. Thus the total number of plants to use in the study should be
$5 \times 2 \times 3 \times 50 = 1500$ plants, i.e., 150 plants per selected field.

Now, if locality and field sizes are unequal, how do we compensate for the loss of
efficiency due to varying cluster sizes? Assume the mean and standard deviation of
locality and field sizes are $\mu_k = 177$ and $\tau_k = 81.5$, respectively, with $k = L, F$. Then,
taking the mean field size as $\bar{g} = 177$,
$$CV_L = \frac{81.5}{177} = 0.4605, \qquad \lambda_L = \frac{\sigma_f^{2*} + V(\delta)/177}{\sigma_l^{2*}} = \frac{\left[0.0024(1-0.0024)\right]^2(0.77) + \frac{0.0000522}{177}}{\left[0.0024(1-0.0024)\right]^2(0.57)} = 1.44,$$
so $\theta_L = 177/(177 + 1.44) = 0.9919$. Therefore $RE_L^{T} = 1 - (0.4605^2)(0.9919)(1 - 0.9919) = 0.9983$.
Thus efficiency for localities can be restored by taking $l = \frac{4.64}{0.9983} = 4.6479 \approx 5$ localities.
Now for fields, $CV_F = \frac{81.5}{177} = 0.4605$ and
$$\lambda_F = \frac{V(\delta)}{\sigma_f^{2*}} = \frac{0.0000522}{\left[0.0024(1-0.0024)\right]^2(0.77)} = 11.83,$$
so $\theta_F = 177/(177 + 11.83) = 0.9374$. Therefore $RE_F^{T} = 1 - (0.4605^2)(0.9374)(1 - 0.9374) = 0.9876$.
Thus efficiency for fields can be restored by taking $m = \frac{1.50}{0.9876} = 1.519 \approx 2$ fields.
.çèéÞ

For a desired CI width. Now suppose that the researcher requires a 95% confidence
interval estimate, with a desired width for the proportion of transgenic plants equal
to $w = (\bar{p}_U - \bar{p}_L) \le \omega = 0.005$. Therefore $z_{1-0.05/2} = 1.96$ and
$$V_0 = \frac{\omega^2}{4z_{1-\alpha/2}^2} = \frac{0.005^2}{4(1.96)^2} = 0.000001627,$$
assuming the same values of s, $S_e$, $S_p$, $\sigma_l^2$, $\sigma_f^2$, $\bar{p}$, $c_4$, $c_3$, $c_2$ and $c_1$ given for minimizing the
variance. Using Equation (19), we obtain
$$g = \sqrt{\frac{300\,(0.0000522)}{\left((50)(10)+35\right)\left[0.0024(1-0.0024)(0.8775)\right]^2}} \approx 3, \qquad m = \sqrt{\frac{500}{300}}\;\frac{0.8775}{0.7550} = 1.50 \approx 2,$$
while the number of localities is equal to:
$$l = \left[\left[0.0024(1-0.0024)\right]^2(0.57) + \left[0.0024(1-0.0024)\right]^2(0.77)/2 + \frac{0.0000522}{(2)(3)}\right]\Big/0.000001627 = 9.3669 \approx 10$$
Since m and g do not change, we need 150 plants per field, 2 fields per locality and
10 localities to reach the required width (0.005) of the 95% CI. Now the budget is two
times larger than that obtained for minimizing the variance given a budget constraint.
However, this sample size is only valid for equal cluster sizes. If adjustments need to be
made for unequal locality sizes and field sizes, they can be carried out by $l^* = l/RE_L^{T}$ and
$m^* = m/RE_F^{T}$, respectively.

Now assume that we wish to determine the required number of localities without a
budget constraint, assuming 2 fields per locality, a pool size of 50 and g = 10 pools per
field. Using Equation (13) and assuming the same values of $\omega$, $\alpha$, $S_e$, $S_p$, $\sigma_l^2$, $\sigma_f^2$ and $\bar{p}$ that
were given for minimizing the variance, we have
$$l = \frac{4(1.96)^2}{0.005^2}\left[\left[0.0024(1-0.0024)\right]^2(0.57) + \left[0.0024(1-0.0024)\right]^2(0.77)/2 + \frac{0.0000522}{(2)(10)}\right] = 5.623 \approx 6$$
This implies that we need a sample of 6 localities, 2 fields per locality and 10 pools of
size 50 per field. These values do not change, assuming unequal locality and field sizes
with the same mean and standard deviations.

For a desired power. Now suppose that we need to know the budget and sample size
required for testing $H_0: \bar{p} = 0.0024$ vs $H_1: \bar{p} > 0.0024$ at a $\alpha = 0.05$ significance
level with a power $(1-\gamma) = 0.9$ (90%) for detecting $\delta = 0.003$, using the same
parameters (s, $S_e$, $S_p$, $\sigma_l^2$, $\sigma_f^2$, $c_4$, $c_3$, $c_2$ and $c_1$) as in the example for minimizing the variance.
Then
$$V_0 = \frac{\delta^2}{(z_{1-\alpha}+z_{1-\gamma})^2} = \frac{0.003^2}{(1.645 + 1.282)^2} = 0.000001051.$$
Since $V(\delta_0) = V(\delta) = 0.0000522$ and $\bar{p}_0 = \bar{p}$, then $g \approx 3$ and $m = 1.50 \approx 2$ as before,
while the number of localities is equal to:
$$l = \left[\left[0.0024(1-0.0024)\right]^2(0.57) + \left[0.0024(1-0.0024)\right]^2(0.77)/2 + \frac{0.0000522}{(2)(3)}\right]\Big/0.000001051 = 14.5 \approx 15$$
Here, again, we need 150 plants per field, 2 fields per locality and 15 localities to
reach the required power of 90%. This means that the required budget is three times
larger than that obtained for minimizing the variance given a budget constraint. To
compensate for the unequal locality and field sizes, and assuming the same mean and
standard deviation of the sizes ($\mu_k = 177$ and $\tau_k = 81.5$, for $k = L, F$), we have to multiply
the numbers of localities and fields obtained by the correction factors $1/RE_L^{T}$ and $1/RE_F^{T}$, respectively.

Now assume that we decide to use 10 pools per field (g), 2 fields per locality and
a pool size of 50, without a budget constraint, and use the same values of
$S_e$, $S_p$, $\sigma_l^2$, $\sigma_f^2$, $\alpha$, $(1-\gamma)$ and $\delta = \bar{p} - \bar{p}_0$ as above. Then, using Equation (15), the required
number of localities, l, is equal to:
$$l = \frac{(1.645 + 1.282)^2}{0.003^2}\left[\left[0.0024(1-0.0024)\right]^2(0.57) + \frac{\left[0.0024(1-0.0024)\right]^2(0.77)}{2} + \frac{0.0000522}{(2)(10)}\right] = 8.7 \approx 9$$
This means that to perform the study, we need 9 localities, 2 fields per locality and 10
pools per field of size 50. These values do not change, assuming unequal locality and
field sizes with the same mean and standard deviation.

4.10 Discussion and conclusions

In this paper, we derived optimal sample sizes for group testing in a three-stage sampling
process under a budget or variance constraint. Given a pool size (s) and using Lagrange
multipliers, we derived formulae to produce the optimal allocation of localities (l), fields
(m) and pools per field (g). Although these formulae are similar to those derived by
Cochran (1977; p. 285), they assume that all localities and fields are of the same size.
However, in practice, this assumption is rarely satisfied; for this reason, we derived
correction factors (inverse of the relative efficiency) to adjust the optimal sample sizes
for unequal locality and field sizes. We also show examples of how to calculate the
optimal values of l, m and g using these formulae when we wish to obtain a certain
precision (width of the confidence interval) or a specified power.

If sample sizes for precision or power without a budget constraint are required,
Equations (13) and (15) can be used for precision and power, respectively. However,
these sample sizes are not optimal, since the values of m, g and s are then fixed by the
researcher in a non-optimal way.

There are two important aspects that should be taken into account, given that our
optimal sample sizes were derived using a first-order TSE approach under the assumption
that the variance components are known. First, the optimal sample sizes will be slightly
biased, based on Monte Carlo simulations (Goldstein and Rasbash, 1996; Moerbeek et al.,
2001; Moerbeek and Maas, 2005; Candel and Van Breukelen, 2010). Second, we assumed a relatively
simple covariance structure for deriving the optimal sample sizes. These approximate
sample sizes should be reasonable and can be calculated easily. However, further study
on the performance of the proposed optimal sample sizes is required.

4.11 References

Anonymous. (2003). Regulation (EC) 1829 of the European Parliament and the European
Council of 22 September 2003 on genetically modified food and feed. Official
Journal of the European Union L 268.

Breslow, N. E., & Clayton, D. G. (1993). Approximate inference in generalized linear


mixed models. Journal of the American Statistical Association, 88(421):9-25.

Candel, M. J., & Van Breukelen, G. J. (2010). Sample size adjustments for varying
cluster sizes in cluster randomized trials with binary outcomes analyzed with second-
order PQL mixed logistic regression. Statistics in Medicine, 29(14):1488.

Candy, S. G. (2000). The application of generalized linear mixed models to multi-level


sampling for insect population monitoring. Environmental and Ecological Statistics
7(3):217-238.

Chen, P., Tebbs, J., and Bilder, C. (2009). Group Testing Regression Models with Fixed
and Random Effects. Biometrics 65(4):1270-1278.

Cochran, W. G. (1977). Sampling techniques. New York, Wiley. 3rd ed.

Goldstein, H. (1991). Nonlinear multilevel models, with an application to discrete


response data. Biometrika 78(1):45-51.

Goldstein, H. (2003). Multilevel Statistical Models. Third Edition. London, Edward


Arnold.

Goldstein, H., & Rasbash, J. (1996). Improved approximations for multilevel models
with binary responses. Journal of the Royal Statistical Society, Soc. A 159(3):505-
513.

Moerbeek, M., & Maas, C. J. (2005). Optimal experimental designs for multilevel
logistic models with two binary predictors. Communications in Statistics—Theory
and Methods 34(5):1151-1167.

Moerbeek, M., van Breukelen, G. J. P., & Berger, M. P. F. (2001b). Optimal


experimental Designs for Multilevel Models with Covariates. Communications in
Statistics, Theory and Methods, 30:2683–2697.

Moerbeek, M., van Breukelen, G. J., & Berger, M. P. (2000). Design issues for
experiments in multilevel populations. Journal of Educational and Behavioral
Statistics, 25(3):271-284.

Moerbeek, M., Van Breukelen, G. J., & Berger, M. P. (2001a). Optimal experimental
designs for multilevel logistic models. Journal of the Royal Statistical Society: Series
D (The Statistician), 50(1):17-30.

Mood, A. M., Graybill, F. A., & Boes, D. C. (1974). Introduction to the Theory of
Statistics (3rd Edition). McGraw-Hill.

Piñeyro-Nelson, A., van Heerwaarden, J., Perales, H. R., Serratos-Hernández, J. A.,


Rangel, A., Hufford, M. B., Gepts, P., Garay-Arroyo, A., Rivera-Bustamante, R., &
Álvarez-Buylla, E. R. (2009). Transgenes in Mexican maize: molecular evidence and
methodological considerations for GMO detection in landrace populations. Mol. Ecol.
18(4):750-761.

Rabe-Hesketh, S., & Skrondal, A. (2006). Multilevel modelling of complex survey data.
Journal of the Royal Statistical Society: Series A (Statistics in Society), 169(4):805-
827.

Rodríguez, G., & Goldman, N. (1995). An assessment of estimation procedures for


multilevel models with binary responses. Journal of the Royal Statistical Society.
Series A 158(1):73-89.

Skrondal, A., & Rabe-Hesketh, S. (2007). Redundant overdispersion parameters in


multilevel models for categorical responses. Journal of Educational and Behavioral
Statistics 32:419-430.

Van Breukelen, G. J. P., Candel, M. J. J. M., & Berger, M. P. F. (2007). Relative


efficiency of unequal versus equal cluster sizes in cluster randomized and multicenter
trials. Statistics in Medicine 26:2589-2603.

Table 4.1. Optimal sample sizes (m, g) given the pool size (s = 10, 20) for group testing in
three stages, for five values of $\sigma_f^2$.

                              p̄
       σ_f²   m  0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1
S_e = S_p = 0.90
s=10   0.05   2   41   25   20   17   16   15   14   14   14   13
       0.1    2   29   18   14   12   11   11   10   10   10    9
       0.15   2   23   15   12   10    9    9    8    8    8    8
       0.2    2   20   13   10    9    8    7    7    7    7    7
       0.25   2   18   11    9    8    7    7    6    6    6    6
s=20   0.05   2   19   13   11   10   10   10   10   10   11   12
       0.1    2   14    9    8    7    7    7    7    7    8    8
       0.15   2   11    8    6    6    6    6    6    6    6    7
       0.2    2   10    7    6    5    5    5    5    5    6    6
       0.25   2    9    6    5    5    4    4    5    5    5    5
S_e = S_p = 0.95
s=10   0.05   2   32   21   17   15   14   13   13   12   12   12
       0.1    2   23   15   12   11   10    9    9    9    8    8
       0.15   2   19   12   10    9    8    8    7    7    7    7
       0.2    2   16   11    9    8    7    7    6    6    6    6
       0.25   2   15   10    8    7    6    6    6    5    5    5
s=20   0.05   2   16   12   10    9    9    9    9    9    9   10
       0.1    2   11    8    7    6    6    6    6    6    7    7
       0.15   2    9    7    6    5    5    5    5    5    5    6
       0.2    2    8    6    5    5    4    4    4    4    5    5
       0.25   2    7    5    4    4    4    4    4    4    4    4
S_e = S_p = 0.98
s=10   0.05   2   28   19   16   14   13   12   12   11   11   11
       0.1    2   20   14   11   10    9    9    8    8    8    8
       0.15   2   16   11    9    8    8    7    7    7    6    6
       0.2    2   14   10    8    7    7    6    6    6    6    6
       0.25   2   12    9    7    6    6    6    5    5    5    5
s=20   0.05   2   15   11    9    9    8    8    8    8    8    9
       0.1    2   10    8    7    6    6    6    6    6    6    6
       0.15   2    8    6    5    5    5    5    5    5    5    5
       0.2    2    7    5    5    4    4    4    4    4    4    4
       0.25   2    7    5    4    4    4    4    4    4    4    4
Ten values of the proportion ($\bar{p}$), two values of $\sigma_l^2$ (0.25 and 0.5), $c_2/c_1 = 3.5$, $c_3/c_1 = 40$,
$c_4/c_1 = 120$ and three combinations of $S_e$ and $S_p$.

Table 4.2. Optimal sample sizes (m, g) given the pool size (s = 10, 20) for group testing in
three stages, for five values of $c_3/c_1$.

                              p̄
      c_3/c_1  m  0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1
S_e = S_p = 0.90
s=10   15     2   12    8    6    5    5    5    4    4    4    4
       25     2   16   10    8    7    6    6    6    5    5    5
       35     2   19   12    9    8    7    7    7    6    6    6
       45     2   22   13   11    9    8    8    8    7    7    7
       55     2   24   15   12   10    9    9    8    8    8    8
s=20   15     2    6    4    3    3    3    3    3    3    3    4
       25     2    8    5    4    4    4    4    4    4    4    5
       35     2    9    6    5    5    5    5    5    5    5    6
       45     2   10    7    6    5    5    5    5    6    6    6
       55     2   11    8    7    6    6    6    6    6    7    7
S_e = S_p = 0.95
s=10   15     2   10    7    5    5    4    4    4    4    4    4
       25     2   13    8    7    6    6    5    5    5    5    5
       35     2   15   10    8    7    7    6    6    6    6    6
       45     2   17   11    9    8    7    7    7    6    6    6
       55     2   19   13   10    9    8    8    7    7    7    7
s=20   15     2    5    4    3    3    3    3    3    3    3    3
       25     2    6    5    4    4    3    3    3    4    4    4
       35     2    8    5    5    4    4    4    4    4    4    5
       45     2    9    6    5    5    5    5    5    5    5    5
       55     2   10    7    6    5    5    5    5    5    5    6
S_e = S_p = 0.98
s=10   15     2    9    6    5    4    4    4    4    4    3    3
       25     2   11    8    6    6    5    5    5    5    4    4
       35     2   13    9    8    7    6    6    6    5    5    5
       45     2   15   10    9    8    7    7    6    6    6    6
       55     2   16   11    9    8    8    7    7    7    7    6
s=20   15     2    5    3    3    3    2    2    2    2    3    3
       25     2    6    4    4    3    3    3    3    3    3    3
       35     2    7    5    4    4    4    4    4    4    4    4
       45     2    8    6    5    5    4    4    4    4    4    5
       55     2    9    6    5    5    5    5    5    5    5    5
Ten values of the proportion ($\bar{p}$), $\sigma_f^2 = 0.2$, $\sigma_l^2 = 0.5$, three combinations of $S_e$ and $S_p$,
$c_2/c_1 = 3.5$, and $c_4/c_1 = 25, 55, 85, 115, 145$.

Table 4.3. Number of pools comprised of leaf samples from two localities in Oaxaca,
Mexico (2004).

Locality   Field   Positive pools      Locality   Field   Positive pools
7          6       1                   11         7       1
7          8       1                   11         17      1
7          11      1                   11         19      1
7          15      1
7          17      2
7          25      1
7          27      1
7          30      3
With a positive 35S PCR band, based on 30 fields per locality and 6 pools per field
composed of 50 maize leaves.

Appendix 4.A. Derivation of the optimal solution for minimizing V(π̂) subject to C = c1·lmgs + c2·lmg + c3·lm + c4·l (ci > 0; l, m, g ≥ 2; i = 1, 2, 3, 4) given the pool size (s)

By combining Eq. (12) and (16), we obtain the Lagrangean

  L(l, m, g, λ) = V(π̂) + λ[C − (c1·lmgs + c2·lmg + c3·lm + c4·l)]   (A1)

where

  V(π̂) = [π(1−π)]²σa²/l + [π(1−π)]²σb²/(lm) + V(D)/(lmg) = σa²*/l + σb²*/(lm) + V(D)/(lmg)

and λ is the Lagrange multiplier. The partial derivatives of (A1) with respect to λ, g, m and l are:

  ∂L/∂λ = C − (c1·lmgs + c2·lmg + c3·lm + c4·l) = 0; then l = C/(c1·mgs + c2·mg + c3·m + c4).

  ∂L/∂g = −V(D)/(lmg²) − λ·lm(c1·s + c2) = 0; then λ = −V(D)/[l²m²g²(c1·s + c2)].

  ∂L/∂m = −σb²*/(lm²) − V(D)/(lm²g) − λ[lg(c1·s + c2) + l·c3] = 0. Substituting λ,

  σb²*/(lm²) + V(D)/(lm²g) = V(D)/(lm²g) + c3·V(D)/[lm²g²(c1·s + c2)]

  ⟺ σb²* = c3·V(D)/[g²(c1·s + c2)]

  ⟺ g = √{c3·V(D)/[(c1·s + c2)·[π(1−π)]²σb²]}.

  ∂L/∂l = −σa²*/l² − σb²*/(l²m) − V(D)/(l²mg) − λ[mg(c1·s + c2) + c3·m + c4] = 0. Substituting λ and multiplying through by l²,

  σa²* + σb²*/m + V(D)/(mg) = V(D)/(mg) + c3·V(D)/[mg²(c1·s + c2)] + c4·V(D)/[m²g²(c1·s + c2)].

Since c3·V(D)/[g²(c1·s + c2)] = σb²*, the terms σb²*/m and V(D)/(mg) cancel on both sides, leaving

  σa²* = c4·V(D)/[m²g²(c1·s + c2)].

Substituting g² = c3·V(D)/[(c1·s + c2)σb²*] then gives m² = c4·σb²*/(c3·σa²*), that is,

  m = √[c4·σb²/(c3·σa²)].
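The closed-form allocations derived above are easy to evaluate numerically. The sketch below is our own illustration (the function names are ours, not part of the dissertation's code): it computes V(D) from the same expression used in the R code of Chapter 5 and then evaluates the optimal g, m and budget-constrained l.

```python
import math

def v_d(p, s, se, sp):
    """V(D) for a pool of size s (same expression as the R function VdN)."""
    r = se + sp - 1.0
    inv = (1.0 - p) ** (-s)                      # (1 - pi)^(-s)
    return ((1 - p) ** 2 / (r ** 2 * s ** 2)) * (se * inv - r) * ((1 - se) * inv + r)

def three_stage_allocation(p, s, se, sp, c1, c2, c3, c4, sig_a2, sig_b2, budget):
    """Optimal g and m from Appendix 4.A, and l from the budget constraint."""
    sb2_star = (p * (1 - p)) ** 2 * sig_b2       # sigma_b^2*
    g = math.sqrt(c3 * v_d(p, s, se, sp) / ((c1 * s + c2) * sb2_star))
    m = math.sqrt(c4 * sig_b2 / (c3 * sig_a2))   # [pi(1-pi)]^2 factor cancels
    l = budget / (c1 * m * g * s + c2 * m * g + c3 * m + c4)
    return l, m, g

l, m, g = three_stage_allocation(p=0.04, s=10, se=0.95, sp=0.90,
                                 c1=20, c2=10, c3=600, c4=1600,
                                 sig_a2=0.3, sig_b2=0.25, budget=20000)
```

With these inputs (the same ones used for the R example in Table 5.5), rounding m and g up reproduces the m = 2 and g = 8 reported there.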

Appendix 4.B. Derivation of the optimal solution for minimizing C = c1·lmgs + c2·lmg + c3·lm + c4·l subject to V(π̂) = V0

By combining Eq. (12) and (16), we obtain the Lagrangean

  L(l, m, g, λ) = (c1·lmgs + c2·lmg + c3·lm + c4·l) + λ[V(π̂) − V0]   (B1)

where V(π̂) = σa²*/l + σb²*/(lm) + V(D)/(lmg), as in Appendix 4.A. The partial derivatives of L with respect to λ, l, g and m are:

  ∂L/∂λ = σa²*/l + σb²*/(lm) + V(D)/(lmg) − V0 = 0; then
  l = [[π(1−π)]²σa² + [π(1−π)]²σb²/m + V(D)/(mg)]/V0.

  ∂L/∂g = lm(c1·s + c2) − λ·V(D)/(lmg²) = 0; then λ = l²m²g²(c1·s + c2)/V(D).

  ∂L/∂m = l[g(c1·s + c2) + c3] − λ[σb²*/(lm²) + V(D)/(lm²g)] = 0. Substituting λ,

  lg(c1·s + c2) + l·c3 = lg²(c1·s + c2)σb²*/V(D) + lg(c1·s + c2)

  ⟺ c3 = g²(c1·s + c2)σb²*/V(D)

  ⟺ g = √{c3·V(D)/[(c1·s + c2)·[π(1−π)]²σb²]}.

  ∂L/∂l = mg(c1·s + c2) + c3·m + c4 − λ[σa²*/l² + σb²*/(l²m) + V(D)/(l²mg)] = 0. Substituting λ,

  mg(c1·s + c2) + c3·m + c4 = m²g²(c1·s + c2)σa²*/V(D) + mg²(c1·s + c2)σb²*/V(D) + mg(c1·s + c2).

Using g²(c1·s + c2)σb²*/V(D) = c3, the middle term on the right reduces to c3·m, leaving

  c4 = m²g²(c1·s + c2)σa²*/V(D)

  ⟺ m = √[c4·σb²/(c3·σa²)].

Thus the optimal values of m and g are identical to those obtained in Appendix 4.A; only the expression for the optimal number of localities differs.

Appendix 4.C. An alternative way of obtaining the optimal values of m and g given a pool size and a budget or variance constraint

In section 4.6 we derived optimal values for localities (l), fields (m) and pools per field (g) given a pool size (s) using Lagrange multipliers. We found that minimizing the variance given a budget constraint or minimizing the budget given a variance constraint produces the same optimal values for m and g due to duality; only the expression for the optimal allocation of localities differs. For this reason, following Cochran (1977), we obtain the same solution by minimizing the product of the variance of interest and the budget constraint. This means that the minimization problem is the same as minimizing the product:

  Q(m, g) = V(π̂)(c1·lmgs + c2·lmg + c3·lm + c4·l)

  Q(m, g) = [σa²* + σb²*/m + V(D)/(mg)][c4 + c3·m + mg(c1·s + c2)]   (C1)

(the number of localities l cancels). The Cauchy–Schwarz inequality (Cochran 1977, p. 77) is

  (Σ xi²)(Σ yi²) − (Σ xi·yi)² = Σi Σj>i (xi·yj − xj·yi)² ≥ 0   (C2)

Therefore

  (Σ xi²)(Σ yi²) ≥ (Σ xi·yi)²   (C3)

Making x1 = σa*, x2 = σb*/√m, x3 = √[V(D)/(mg)], y1 = √c4, y2 = √(c3·m), y3 = √[mg(c1·s + c2)], we can express (C1) in the form (C3):

  [σa²* + σb²*/m + V(D)/(mg)][c4 + c3·m + mg(c1·s + c2)] ≥ {σa*√c4 + σb*√c3 + √[V(D)(c1·s + c2)]}²   (C4)

The product Q(m, g) will be minimized provided that equality holds in Equation (C4), which occurs when the ratio yi/xi is constant. Setting the equality and expanding both sides, we find that

  √c4/σa* = √(c3·m)/(σb*/√m) = a constant   (C5)

and

  √(c3·m)/(σb*/√m) = √[mg(c1·s + c2)]/√[V(D)/(mg)]   (C6)

From (C5),

  √c4/σa* = m√c3/σb* ⟹ m = (σb*/σa*)√(c4/c3) = √[c4·σb²/(c3·σa²)],

and from (C6),

  m√c3/σb* = mg√(c1·s + c2)/√V(D) ⟹ g = √[c3·V(D)]/[σb*√(c1·s + c2)] = √{c3·V(D)/[(c1·s + c2)·[π(1−π)]²σb²]}.

If we wish to minimize the variance given a budget constraint, the optimal allocation of localities can be obtained by solving the budget constraint for l, and we get l* = C/(c1·mgs + c2·mg + c3·m + c4). However, if we wish to minimize the budget given a variance constraint (V0), the optimal number of localities can be obtained by solving the variance constraint for l, and we get l* = [[π(1−π)]²σa² + [π(1−π)]²σb²/m + V(D)/(mg)]/V0. Thus we obtain the same solutions as by using Lagrange multipliers.
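The equality condition in (C4) can be checked numerically. In this sketch (our own illustration; the parameter values are arbitrary), Q(m, g) evaluated at the closed-form m*, g* attains the Cauchy–Schwarz lower bound, while any other allocation exceeds it:

```python
import math

def q_product(m, g, vd, sa2s, sb2s, c1, c2, c3, c4, s):
    """Q(m, g) = variance x cost per locality, as in (C1)."""
    var = sa2s + sb2s / m + vd / (m * g)
    cost = c4 + c3 * m + m * g * (c1 * s + c2)
    return var * cost

# Arbitrary example values: vd = V(D), sa2s/sb2s = sigma_a^2*/sigma_b^2*
vd, sa2s, sb2s = 0.006, 3e-5, 2.5e-5
c1, c2, c3, c4, s = 20, 10, 600, 1600, 10

m_opt = math.sqrt(c4 * sb2s / (c3 * sa2s))          # from (C5)
g_opt = math.sqrt(c3 * vd / ((c1 * s + c2) * sb2s)) # from (C6)

# Right-hand side of (C4)
bound = (math.sqrt(sa2s * c4) + math.sqrt(sb2s * c3)
         + math.sqrt(vd * (c1 * s + c2))) ** 2
q_star = q_product(m_opt, g_opt, vd, sa2s, sb2s, c1, c2, c3, c4, s)
```

At (m*, g*) the ratio q_star/bound equals one up to floating-point error, confirming that the Cauchy–Schwarz allocation minimizes the product.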

Appendix 4.D. Taylor series approximation (Eq. 23) of the RE in (Eq. 21) given by Van Breukelen et al. (2007)

The Taylor series approximations (24 and 25) are derived from the RE of Equations (22 and 23) in four steps.

Step 1. Let the mj values be independent realizations of a random cluster size M with expectation μm and standard deviation σm. Equations (22 and 23) are moment estimators of

  RE(π̂) = [(m̄ + λ)/m̄]·E[M/(M + λ)]   (F1)

where λ = (1 − ρ)/ρ ≥ 0.

Step 2. Define e = (M − μm); then the last term in (F1) can be written as

  M/(M + λ) = (μm + e)/(μm + λ + e) = [(μm + e)/(μm + λ)]·[1/(1 + e/(μm + λ))].

The last factor is a Taylor series [Mood et al. (1974), p. 533, Equation (34)]:

  1/(1 + e/(μm + λ)) = Σk=0..∞ [−e/(μm + λ)]^k

if −(μm + λ) < e < (μm + λ), to ensure convergence. Since e = M − μm and λ ≥ 0, this convergence condition will be satisfied except for a small probability P(M > 2μm + λ), which arises for strongly positively skewed cluster size distributions combined with large ρ (= small λ). Thus we have

  E[M/(M + λ)] = E{[(μm + e)/(μm + λ)]·Σk=0..∞ [−e/(μm + λ)]^k}   (F2)

Step 3. If we ignore all terms e^k with k > 2 and rearrange terms in (F2), we will have

  E[M/(M + λ)] = λ̃[1 − CV²·λ̃(1 − λ̃)]   (F3)

where λ̃ = μm/(μm + λ) ∈ (0, 1], assuming m̄ = μm, and CV = σm/μm is the coefficient of variation of the random variable M.

Step 4. Plugging (F3) into (F1) gives

  RE(π̂) ≈ 1 − CV²·λ̃(1 − λ̃)   (F4)

Remark. Ignoring in (F2) only those e^k terms with k > 4 instead of k > 2 gives

  RE(π̂) ≈ 1 − (1 − λ̃)[λ̃·CV² − λ̃²·CV³·skew + λ̃³·CV⁴·(kurt + 3)]   (F5)

where skew and kurt are the skewness and kurtosis of the cluster size distribution, that is, skew = the 3rd central moment of M divided by σm³, and kurt = the 4th central moment of M divided by σm⁴, minus 3 (see, for example, Mood et al., 1974, p. 76).
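Steps 1–4 can be verified numerically. The sketch below is our own check (the uniform cluster-size distribution is an arbitrary choice, convenient because its symmetry makes the skewness term in (F5) vanish): it compares the exact value of (F1) with the second-order approximation (F4).

```python
# Exact RE(pi-hat) from (F1) versus the second-order approximation (F4)
sizes = list(range(20, 41))            # cluster sizes M uniform on {20, ..., 40}
n = len(sizes)
mu = sum(sizes) / n                    # mu_m
var = sum((m - mu) ** 2 for m in sizes) / n
cv2 = var / mu ** 2                    # squared coefficient of variation CV^2

rho = 0.10                             # intraclass correlation
lam = (1 - rho) / rho                  # lambda = (1 - rho)/rho
lam_t = mu / (mu + lam)                # lambda-tilde

exact = (1 / lam_t) * sum(m / (m + lam) for m in sizes) / n  # (F1), m-bar = mu_m
approx = 1 - cv2 * lam_t * (1 - lam_t)                       # (F4)
```

For this symmetric distribution the residual error comes only from the fourth-order term of (F5), so the two quantities agree to roughly three decimal places.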

Appendix 4.E. Glimmix code to estimate the proportion and variance components
data corn2004;
input Locality Field pool yp;
cards;
11 1 1 0
11 1 2 0
11 1 3 0
11 1 4 0
11 1 5 0
11 1 6 0
11 2 1 0
11 2 2 0
11 2 3 0
11 2 4 0
11 2 5 0
11 2 6 0
.
.
.
11 29 1 0
11 29 2 0
11 29 3 0
11 29 4 0
11 29 5 0
11 29 6 0
11 30 1 0
11 30 2 0
11 30 3 0
11 30 4 0
11 30 5 0
11 30 6 0
7 1 1 0
7 1 2 0
7 1 3 0
7 1 4 0
7 1 5 0
7 1 6 0
.
.
.
7 29 1 0
7 29 2 0
7 29 3 0
7 29 4 0
7 29 5 0
7 29 6 0
7 30 1 0
7 30 2 1
7 30 3 1
7 30 4 1
7 30 5 0
7 30 6 0
;
proc glimmix data=corn2004;
  class Locality Field;
  model yp(event='1')= / solution dist=binary;
  random intercept / subject=Locality;
  random intercept / subject=Field(Locality);
  /* Convert the plant-level linear predictor (_linp_) into the
     pool-level probability of a positive pool of s = 50 plants:
     _MU_ = 1 - (1 - p1)**s */
  prod=1; s=50;
  do i = 1 to s;
    p1 = exp(_linp_)/(1 + exp(_linp_));
    prod = prod*(1 - p1);
  end;
  _MU_ = 1 - prod;
run;
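The programming statements above convert the plant-level probability p1 into the pool-level probability 1 − (1 − p1)^50. The same computation, extended with the sensitivity/specificity adjustment used elsewhere in this dissertation, can be sketched in Python (our illustration, not part of the SAS program):

```python
def pool_positive_prob(p, s, se=1.0, sp=1.0):
    """Probability that a pool of s plants tests positive.

    With a perfect assay (se = sp = 1) this reduces to 1 - (1 - p)^s,
    exactly what the GLIMMIX do-loop computes for s = 50.
    """
    clean = (1.0 - p) ** s              # pool contains no transgenic plant
    return se * (1.0 - clean) + (1.0 - sp) * clean

mu = pool_positive_prob(0.01, 50)       # perfect test, pools of 50 leaves
```

Even at a plant-level proportion of only 0.01, a pool of 50 leaves is positive almost 40% of the time, which is why pooling is so effective for detection.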

Chapter 5: Optimal sampling plans for multistage group testing data with optimal pool size

Abstract
Planning the sample size for multistage group testing is of paramount importance for achieving precise estimates at a low cost. In chapters 2 and 4, optimal values [of fields (m) and pools per field (g)] and [of localities (l), fields (m) and pools per field (g)] were obtained given a pool size (s). In this study we obtain optimal sample sizes for group testing data in two and three stages while also optimizing the pool size (s). Since it was not possible to obtain analytical solutions, we provide R code to compute the solutions numerically. We found that when the sensitivity and specificity were lower than one, the optimized pool size produced a considerable saving in total cost. However, when the sensitivity and specificity were perfect (equal to 1), no gain was obtained by optimizing the pool size.

Key words: multistage, group testing, sample size, optimal values.

5.1 Introduction

This chapter extends the optimal sampling plans of two and three stages for group testing data, but in place of using a given pool size (s), the pool size is also optimized. Optimizing the pool size (s) is expected to produce a significant reduction in cost compared to sampling plans with a fixed pool size. In the plot of the variance of the proportion, V(π̂), versus pool size (s) (Figure 5.1), we see that it is possible to obtain an optimal pool size when the sensitivity and specificity are less than one. In the case of two-stage sampling we obtain optimal values of fields (m), pools (g) and pool size (s), while in the context of a three-stage sampling process we obtain optimal values of localities (l), fields (m), pools (g) and pool size (s).

This chapter is organized as follows: Section 5.2 gives the optimal sample sizes of fields (m), pools per field (g) and pool size (s) for two-stage sampling assuming equal field sizes. Section 5.2.1 provides tables illustrating sample size determination for a two-stage sampling process. Section 5.3 gives the optimal sample sizes of localities (l), fields (m), pools per field (g) and pool size (s) for three stages assuming equal locality and field sizes. Section 5.3.1 provides tables for sample size determination for a three-stage sampling process. Finally, section 5.4 gives the conclusions.

5.2 Optimal sample sizes for two stages

Recall from chapter 2 that the variance of the proportion (π̂) is equal to

  V(π̂) = [π(1−π)]²σb²/m + V(D)/(mg) = σb²*/m + V(D)/(mg)   (1)

and we want to minimize V(π̂) using the following budget constraint:

  C = c1·mgs + c2·mg + c3·m   (ci > 0; m, g, s ≥ 2; i = 1, 2, 3)   (2)

where C is the total sampling budget available, c1 is the cost of sampling and measuring a plant in an already sampled field, c2 is the cost of testing a pool of size (s), and c3 is the cost of sampling and measuring a field. The value of c3 is an average value, since at the time of planning the survey it is not known which fields will be sampled, and the travel times are not the same for all fields. The budget C and costs c1, c2 and c3 are given in dollars but can be changed to any other currency. As shown in chapter 4 (section 4.6), minimizing V(π̂) subject to fixed C, or C subject to fixed V(π̂), is equivalent to minimizing the product V(π̂)C (see Appendix 4.C of chapter 4; Cochran, 1977). Therefore, using this approach the optimal allocation of units can be obtained by minimizing

  Q(g, s) = V(π̂)(c1·mgs + c2·mg + c3·m)

  Q(g, s) = [σb²* + V(D)/g](c1·gs + c2·g + c3)   (3)

(the number of fields m cancels).

Theorem 1. Q(g, s) is a convex function with respect to g, s. Then there are unique values of g and s that minimize Q(g, s) for g, s ∈ [1, +∞). The optimal values are defined as g*, s* (proof is in Appendix 5.B).

Unfortunately, it is difficult to obtain an analytic solution for the optimal values of g and s by minimizing Q(g, s). For this reason, we solved this system computationally using the nlminb function of R. The R code used to obtain these optimal values is given in Table 5.1. The nlminb function is an optimization algorithm useful for the minimization of general nonlinear functions of n parameters. These parameters may have bounds, often called box constraints. Thus, with this function we can solve the problem

  x* = argmin f(x) subject to L ≤ x ≤ U,

where x ∈ ℝⁿ and f: ℝⁿ → ℝ. Ideally, we would like to assume f(·) is smooth, that is, the function and all its derivatives are continuous. Many statistical problems do not fit this assumption, but are sufficiently well behaved that the function to be minimized is continuous, with first and second derivatives defined and continuous in all but a restricted subset of arguments (Nash and Varadhan, 2011). When the goal is to maximize a function, we minimize the negative of that function. This algorithm is available in the stats package of R. There are other algorithms that can perform the same job (for more information see Nash and Varadhan, 2011).
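As a cross-check on nlminb, the same product Q(g, s) from Equation (3) can be minimized by brute force over an integer grid. The following Python sketch is our own stand-in for the R code (not a replacement for it), using the parameter values of the example in Table 5.1:

```python
def vd2(p, s, se, sp):
    """V(D) of a single pool (same expression as the R function VdN)."""
    r = se + sp - 1.0
    inv = (1.0 - p) ** (-s)
    return ((1 - p) ** 2 / (r ** 2 * s ** 2)) * (se * inv - r) * ((1 - se) * inv + r)

def q2(g, s, p, c1, c2, c3, se, sp, sig_b2):
    """Q(g, s) of Eq. (3): variance-cost product per field."""
    var = (p * (1 - p)) ** 2 * sig_b2 + vd2(p, s, se, sp) / g
    return var * (c1 * g * s + c2 * g + c3)

pars = dict(p=0.04, c1=20, c2=10, c3=600, se=0.95, sp=0.90, sig_b2=0.25)
g_best, s_best = min(((g, s) for g in range(1, 41) for s in range(1, 61)),
                     key=lambda gs: q2(gs[0], gs[1], **pars))
```

The integer minimizer lands next to the continuous optimum found by nlminb (roughly g ≈ 7.3, s ≈ 10 for these inputs), which Table 5.1 reports, after rounding, as g = 8 and s = 10.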

The inputs of the function Opt.GT.2S are the initial values of g and s (called gstar and sstar), the value of the proportion (π) called p, the cost values (c1, c2 and c3), and the values of Se, Sp and σb². These values should be specified at the end of the function call in Opt.GT.2S.Output, as illustrated in Table 5.1A. The output produced by the function given in Table 5.1A is shown in Table 5.1B. First, consider the value of $Conv. If this value is equal to one, the program did not converge and the code needs to be rerun using other starting values for g and s. However, if $Conv has a value of zero, the method converged and produced the optimal values of g and s. These optimal values are given in $s.opts in Table 5.1B, where 8 and 10 are the optimal values of pools per field (g) and pool size (s), respectively.

Next, to find the optimal value of fields, m*, we need to solve the equation for cost or variance, according to what was fixed. For fixed C: m* = C/(c1·gs + c2·g + c3), while for fixed V(π̂), the optimal value is m* = [[π(1−π)]²σb² + V(D)/g]/V0, where V0 is fixed based on whether we wish to obtain a specified width of the CI or a specified power, as in chapter 2 (section 2.5.2) (a detailed explanation of how to calculate m* is given in the next section). If the fields are of unequal size, then the adjustment to approximate the relative efficiency based on a second-order Taylor series expansion (RE(π̂) ≈ 1 − CV²·λ̃(1 − λ̃)) given for unequal cluster sizes in chapter 2 can be used.
5.2.1. Tables for sample size determination for a two stage sampling process
Tables 5.2 and 5.3 should be used to determine the optimal allocations of fields (m), pools per field (g) and pool size (s) to estimate π with minimum variance given a budget constraint, or with minimum budget given a variance constraint (V0) (these optimal values were obtained with the code given in Table 5.1). Table 5.2 can be used to minimize the variance subject to a budget constraint, as follows. For example, assume that we want optimal values of m, g and pool size (s) to minimize V(π̂) given a budget constraint equal to C = 20000. Also assume that after a literature review, we estimate c1 = 20, c2 = 10, c3 = 600, σb² = 0.15, Se = Sp = 0.95, and π = 0.01. This implies that c2/c1 = 0.5 and c3/c1 = 30. Then, from Table 5.2, at the intersection of the value π = 0.01 (first column) and the value σb² = 0.15 (columns 6 and 7) for Se = Sp = 0.95, we get the required optimal values of pools per field (g = 5, column 6) and pool size (s = 29, column 7). Finally, using the optimal values of g and s and the costs, we can calculate the optimal number of fields as:

  m = C/(c1·gs + c2·g + c3) = 20000/[(20)(5)(29) + (10)(5) + 600] = 20000/3550 = 5.63 ≈ 6

Now, if we want to minimize the budget, C, subject to a variance constraint equal to V0 = ω²/(4z²₁₋α/₂) (to get a CI of width ω = 0.009 with 95% confidence), Table 5.2 can also be used. Here z₁₋α/₂ = 1.96, which implies that V0 = (0.009)²/[4(1.96²)] = 0.00000527. Assume the same values of c1, c2, c3, σb², Se, Sp, and π. Then the optimal values of pools per field and

pool size are 5 and 29, respectively (using Table 5.2 exactly as above). However, now the optimal number of fields is calculated as:

  m = [[π(1−π)]²σb² + V(D)/g]/V0 = [[0.01(1 − 0.01)]²(0.15) + 0.000517/5]/0.00000527 = 22.40 ≈ 23.

Also, if we want to test H0: π = 0.01 vs Ha: π > 0.01 with a power (1 − β) = 0.90 (90%) and significance level α = 0.05 when δ = |π − π0| = 0.009, and assuming the same values of c1, c2, c3, σb², Se, Sp, and π, the variance constraint should be equal to:

  V0 = |δ|²/(z₁₋α + z₁₋β)² = |0.009|²/(1.645 + 1.282)² = |0.009|²/(2.927)² = 0.00000945

while the optimal values of pools per field and pool size are again 5 and 29, respectively (using Table 5.2 exactly as above). However, now the optimal number of fields is equal to

  m = [[0.01(1 − 0.01)]²(0.15) + 0.000517/5]/0.00000945 = 12.49 ≈ 13.
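The three sample-size calculations above can be reproduced programmatically. The sketch below is our own check of the arithmetic (variable names chosen for clarity); V(D) is evaluated at the tabled optimum g = 5, s = 29:

```python
import math

p, s, g = 0.01, 29, 5
se = sp = 0.95
c1, c2, c3 = 20, 10, 600

# V(D) at the tabled optimum, then the per-field variance component
r = se + sp - 1.0
inv = (1.0 - p) ** (-s)
vd = ((1 - p) ** 2 / (r ** 2 * s ** 2)) * (se * inv - r) * ((1 - se) * inv + r)
per_field_var = (p * (1 - p)) ** 2 * 0.15 + vd / g

m_budget = 20000 / (c1 * g * s + c2 * g + c3)                 # fixed budget C = 20000
m_ci = per_field_var / (0.009 ** 2 / (4 * 1.96 ** 2))         # CI of width 0.009
m_power = per_field_var / (0.009 ** 2 / (1.645 + 1.282) ** 2) # power 0.90 at delta = 0.009
```

Rounding each quantity up gives the m = 6, m = 23 and m = 13 fields of the three worked examples.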

Table 5.3 should be used exactly as Table 5.2; the only difference is that now there are three options for the ratio c2/c1, and only one combination of Se and Sp.

In Table 5.4, we compare the relative cost between the optimal-s and fixed-s methods based on different proportions (π). The relative cost is defined as RC(m, g, s) = Q(m*, g*, s = s0)/Q(m*, g*, s*), which compares the cost of optimizing the values of m, g given a pool size (s = s0) (Chapter 4, section 4.6) versus optimizing m, g, s (section 5.2). We observe that the smaller the given value of s, the bigger the difference between the two approaches, and the approach that also optimizes the pool size (s) is the best. For example, when π = 0.01 and σb² = 0.05, the procedure that optimizes s (section 5.2) is 3 times better than the procedure that assumes a fixed pool size (s = 5). Even when the pool size is s = 15, the approach that also optimizes the pool size is better. We can also observe in Table 5.4B that for smaller values of Se = Sp, the approach that also optimizes the pool size (s) performs best. This is expected since, according to the proof given in Appendix 5.B, when Se = Sp = 1 the optimal pool size is equal to 1; hence, the lower the values of Se = Sp, the better the performance of the approach that also optimizes the pool size (s). Consequently, for values of Se and Sp close to one, there is only a small gain from using the procedure given in section 5.2 compared to using a given pool size.
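The relative-cost comparison can be reproduced with a small grid search. In this sketch (ours; the cost constants follow the footnote of Table 5.4, with c1 = 1 so that c2/c1 = 4 and c3/c1 = 30):

```python
def rc(p, s_fixed, c1=1.0, c2=4.0, c3=30.0, se=0.90, sp=0.90, sig_b2=0.05):
    """RC = min_g Q(g, s_fixed) / min_{g,s} Q(g, s), on an integer grid."""
    def q(g, s):
        r = se + sp - 1.0
        inv = (1.0 - p) ** (-s)
        vd = ((1 - p) ** 2 / (r ** 2 * s ** 2)) * (se * inv - r) * ((1 - se) * inv + r)
        return ((p * (1 - p)) ** 2 * sig_b2 + vd / g) * (c1 * g * s + c2 * g + c3)

    best_fixed = min(q(g, s_fixed) for g in range(1, 101))
    best_free = min(q(g, s) for g in range(1, 101) for s in range(1, 121))
    return best_fixed / best_free

ratio_small_p = rc(p=0.01, s_fixed=5)
ratio_large_p = rc(p=0.05, s_fixed=5)
```

For π = 0.01, σb² = 0.05 and fixed s = 5 this reproduces approximately the 3.11 in the first cell of Table 5.4A, and the gain shrinks toward 1 as the proportion grows, matching the pattern described above.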

5.3 Optimal sample sizes for three stages

From chapter 4, the variance of the proportion (π̂) is equal to

  V(π̂) = [π(1−π)]²σa²/l + [π(1−π)]²σb²/(lm) + V(D)/(lmg) = σa²*/l + σb²*/(lm) + V(D)/(lmg)   (4)

and we want to minimize V(π̂) using the following budget constraint:

  C = c1·lmgs + c2·lmg + c3·lm + c4·l   (ci > 0; l, m, g, s ≥ 2; i = 1, 2, 3, 4)   (5)

where C is the total sampling budget available, c1 is the cost of sampling and measuring a plant in an already sampled field, c2 is the cost of testing a pool of size (s), c3 is the cost of sampling and measuring a field, and c4 is the cost of sampling and measuring a locality. The values of c3 and c4 are average values, since at the time of planning the survey it is not known which localities and fields per locality will be sampled, and the travel times are not the same for all localities and fields per locality.

In Chapter 4, we derived optimal values of localities, fields and pools per field given a pool size (s). However, obtaining an optimal value of the pool size (s) can produce even more savings than assuming a fixed pool size. Therefore, in this section we obtain the optimal allocation of localities (l), fields (m), pools per field (g) and pool size (s), assuming the budget constraint given in Equation (5). In section 4.6, chapter 4, we show that minimizing V(π̂) subject to fixed C, or C subject to fixed V(π̂), is equivalent to minimizing the product V(π̂)C (see Appendix 4.C of chapter 4; Cochran, 1977). Therefore, the optimal allocation of units can be obtained by minimizing

  Q(m, g, s) = V(π̂)(c1·lmgs + c2·lmg + c3·lm + c4·l)

  Q(m, g, s) = [σa²* + σb²*/m + V(D)/(mg)](c1·mgs + c2·mg + c3·m + c4)   (6)

(the number of localities l cancels).

Theorem 2. Q(m, g, s) is a convex function with respect to m, g, s. Then there are unique values of m, g and s that minimize Q(m, g, s) for m, g, s ∈ [1, +∞). The optimal values are defined as m*, g*, s* (proof is in Appendix 5.C).

Here also it is difficult to obtain an analytic solution for the optimal values of m, g and s by minimizing Q(m, g, s). For this reason, we obtain a computational solution using the nlminb function of R (the R code is given in Table 5.5).

Now the inputs of the function Opt.GT.3S are the initial values of m, g and s (called mstar, gstar and sstar), the value of the proportion (π) called p, the cost values (c1, c2, c3 and c4), and the values of Se, Sp, σa² and σb². These values should be specified in Opt.GT.3S.Output, as illustrated in Table 5.5A. The output produced by this function is shown in Table 5.5B. If the value of $Conv is equal to one, the solution did not converge and we need to rerun the code using other starting values for m, g, and s. Otherwise, if $Conv has a value of zero, the method converged and the output contains the optimal values of m, g and s. These optimal values are given in $s.opts in Table 5.5B, where 2, 8 and 10 are the optimal values of fields (m), pools per field (g) and pool size (s), respectively.

Next, to find the optimal value of localities, l*, we need to solve the equation for cost or variance, according to what was fixed. For fixed C: l* = C/(c1·mgs + c2·mg + c3·m + c4), while for fixed V(π̂), the optimal value is l* = [[π(1−π)]²σa² + [π(1−π)]²σb²/m + V(D)/(mg)]/V0, where V0 is fixed based on whether we wish to obtain a specified width of the CI or a specified power, as in section 4.6 of chapter 4 (a detailed explanation of how to calculate l* is given in the next section). If localities and fields are of unequal sizes, the adjustments to approximate the relative efficiency for localities (RE(π̂)L ≈ 1 − CVL²·λ̃L(1 − λ̃L)) and fields (RE(π̂)F ≈ 1 − CVF²·λ̃F(1 − λ̃F)) based on the second-order Taylor series expansion given in chapter 4 can be used.

5.3.1. Tables for sample size determination for a three stage sampling process
Tables 5.6 and 5.7 should be used to determine the optimal allocations of localities (l), fields (m), pools (g) and pool size (s) to estimate π with minimum variance given a budget constraint, or with minimum budget given a variance constraint (V0) (these optimal values were obtained using the code given in Table 5.5). Table 5.6 can be used to minimize the budget subject to a variance constraint, as follows. Assume we wish to test H0: π = 0.01 vs Ha: π > 0.01 with a power (1 − β) = 0.90 (90%) and significance level α = 0.05 when δ = |π − π0| = 0.01. This implies that the variance constraint should be equal to:

  V0 = |δ|²/(z₁₋α + z₁₋β)² = |0.01|²/(1.645 + 1.282)² = 0.00001167

Assume that after a literature review, the estimates are c1 = 20, c2 = 10, c3 = 600, c4 = 1600, σa² = 0.3, σb² = 0.1, Se = Sp = 0.98, and π = 0.01. This implies that c2/c1 = 0.5, c3/c1 = 30, and c4/c1 = 80. Again, at the intersection (Table 5.6) of the value π = 0.01 (column 1) and the value σb² = 0.10 (column 5) for Se = Sp = 0.98, we get the optimal values of fields (m = 2, column 5), pools per field (g = 10, column 6) and pool size (s = 21, column 7). Then, using the optimal values of m, g and s and the variances, we can calculate the optimal number of localities as:

  l = {[0.01(1 − 0.01)]²(0.3) + [0.01(1 − 0.01)]²(0.1)/2 + V(D)/[(2)(10)]}/0.00001167 = 7.585459 ≈ 8.
Table 5.6 can also be used to calculate the optimal allocations of l, m, g and s to minimize the variance, V(π̂), subject to a budget constraint, C = 20000, assuming the same values of c1, c2, c3, c4, σa², σb², Se, Sp, π, (1 − β), α and δ. The optimal values of fields, pools per field and pool size are the same, i.e., 2, 10 and 21, respectively (using Table 5.6 exactly as above). However, now the optimal number of localities is calculated as:

  l = C/(c1·mgs + c2·mg + c3·m + c4) = 20000/[(20)(2)(10)(21) + (10)(2)(10) + (600)(2) + 1600] = 20000/11400 = 1.75 ≈ 2.
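The budget-constrained allocation of localities can be checked directly (a sketch of ours, using the cost values of this example):

```python
import math

m, g, s = 2, 10, 21                     # tabled optimal values
c1, c2, c3, c4 = 20, 10, 600, 1600
budget = 20000

# Cost of fully surveying one locality: plants + pool tests + fields + locality
cost_per_locality = c1 * m * g * s + c2 * m * g + c3 * m + c4
l = budget / cost_per_locality
```

Rounding up gives the l ≈ 2 localities used in the example.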

Table 5.7 should be used exactly as Table 5.6; the only difference is that now there are three options for the ratio c2/c1, and only one combination of Se and Sp.

In Table 5.8, we compare the relative cost based on different proportions (π). The relative cost is defined as RC(m, g, s) = Q(m*, g*, s = 10)/Q(m*, g*, s*), which compares the cost of optimizing the values of m, g given a pool size (s = 10) (Chapter 4, section 4.6) versus optimizing m, g, s (section 5.3). We observe that for smaller values of Se = Sp and for smaller values of the proportion, the difference between the two optimization approaches is larger, and the best option is to also optimize the value of s (the procedure given in section 5.3). However, for values of Se and Sp close to one, there is only a small gain from using the procedure given in section 5.3. This result makes sense since we proved (Appendix 5.C) that for a perfect diagnostic test (Se = Sp = 1), the optimal pool size is equal to 1. Also, in Table 5.8B, we see that for larger values of the ratio c2/c1 and smaller proportions (π), the difference between the two optimization procedures is larger and, again, the procedure given in section 5.3 is the best (of course, the behavior changes with different values of s).

5.4 Conclusions

The purpose of this study was to obtain optimal sample sizes for two- and three-stage group testing surveys while also optimizing the pool size. When the survey is conducted in two stages, we obtain optimal values of fields (m), pools per field (g) and pool size (s), while when it is conducted in three stages, we obtain optimal values of localities (l), fields (m), pools per field (g) and pool size (s). Optimizing the pool size as well produces a considerable gain when the sensitivity and specificity are imperfect (less than one), but no gain is obtained from optimizing the pool size when the sensitivity and specificity are perfect (equal to one). Since it was not possible to obtain analytical solutions, we provide R code to compute the optimal values in two stages (Table 5.1) and three stages (Table 5.5).

For these reasons, we suggest using this approach when sensitivity and specificity are considerably lower than 1. However, this computational allocation sometimes presents convergence problems, in which case we suggest rerunning the code with new starting values or using the approach that assumes a given pool size.

5.5 References

Cochran, W. G. (1977). Sampling Techniques. 3rd ed. New York: Wiley.

Nash, J. C., and Varadhan, R. (2011). Unifying optimization algorithms to aid software system users: optimx for R. Journal of Statistical Software, 43(9), 1–14.

Xiong, W. (2014). The optimal group size using inverse binomial group testing considering misclassification. Communications in Statistics – Theory and Methods (in press).

[Figure 5.1: four-panel plot of V(π̂) against pool size (s); see caption.]

Figure 5.1. V(π̂) for a two-stage survey as a function of pool size (s). a) For m = 2 fields, g = 10 pools per field, π = 0.04, Sp = 0.90, σb² = 0.3 and 4 values of Se. b) For m = 2 fields, g = 10 pools per field, π = 0.04, Se = 0.90, σb² = 0.3 and 4 values of Sp. c) For m = 2 fields, g = 10 pools per field, Se = 0.95, Sp = 0.90, σb² = 0.3 and 4 values of π. d) For m = 2 fields, g = 10 pools per field, Se = Sp = 1, σb² = 0.3 and 4 values of π.

Table 5.1. R code for obtaining optimum values of g and s for a two-stage survey with group testing.

A)
Opt.GT.2S=function(gstar,sstar,p,c1,c2,c3,Se,Sp,sigma2b)
{
  g=gstar; s=sstar;
  VdN=function(thetav)
  {
    g=thetav[1]; s=thetav[2];
    r=Se+Sp-1;
    par1=(1-p)^2/(r^2*s^2);
    par2=Se*exp(-s*log(1-p))-r;
    par3=(1-Se)*exp(-s*log(1-p))+r;
    Vd=(par1*par2*par3)/(g);            # V(D)/g
    V.b.star=(((p*(1-p))^2)*sigma2b);   # between-field component
    V.p=V.b.star+Vd
    C=(g*s*c1 + g*c2 + c3);             # cost per field
    CX=V.p*C;                           # product Q(g, s)
    return(CX)
  }
  opt.sol=nlminb(c(g,s), VdN, control=list(trace=0, iter.max=10000), lower=1)
  gs=ifelse(opt.sol$par[1]<2,2,ceiling(opt.sol$par[1]))
  ss=ifelse(opt.sol$par[2]<2,2,round(opt.sol$par[2]))
  s.opts=c(gs,ss)
  list(s.opts=s.opts, Conv=opt.sol$convergence)
}
Opt.GT.2S.Output=Opt.GT.2S(gstar=2,sstar=2,p=0.04,c1=20, c2=10, c3=600,
Se=0.95, Sp=0.9, sigma2b=0.25)
Opt.GT.2S.Output

B)
> Opt.GT.2S.Output
$s.opts
[1] 8 10

$Conv
[1] 0

Table 5.2. Optimal sample sizes (g, s) for group testing in two stages for three combinations of Se and Sp.

      σb²=0.05   σb²=0.1   σb²=0.15   σb²=0.20   σb²=0.25   σb²=0.30
Se = Sp = 0.90
π      g  s      g  s      g  s       g  s       g  s       g  s
0.01 7 37 5 37 4 37 4 37 NC NC 3 37
0.02 10 19 7 19 6 19 5 19 5 19 4 19
0.03 11 13 8 13 7 13 6 13 5 13 5 13
0.04 13 10 9 10 8 10 7 10 6 10 6 10
0.05 14 8 10 8 8 8 7 8 7 8 6 8
0.06 15 7 11 7 9 7 8 7 7 7 7 7
0.07 16 6 12 6 10 6 8 6 7 6 7 6
0.08 17 6 12 6 10 6 9 6 8 6 7 6
0.09 18 5 13 5 10 5 9 5 8 5 8 5
0.1 18 5 13 5 11 5 9 5 9 5 8 5
Se = Sp = 0.95
0.01 8 29 6 29 5 29 4 29 4 29 3 29
0.02 10 16 8 16 6 16 5 16 5 16 5 16
0.03 12 11 9 11 7 11 6 11 6 11 5 11
0.04 13 9 10 9 8 9 7 9 6 9 6 9
0.05 15 7 10 7 9 7 8 7 7 7 6 7
0.06 15 6 11 6 9 6 8 6 7 6 7 6
0.07 16 5 12 5 10 5 8 5 8 5 7 5
0.08 17 5 12 5 10 5 9 5 8 5 7 5
0.09 18 5 13 5 10 5 9 5 8 5 7 5
0.1 18 4 13 4 11 4 9 4 8 4 8 4
Se = Sp = 0.98
0.01 10 21 7 21 6 21 5 21 5 21 4 21
0.02 12 12 9 12 7 12 6 12 6 12 5 12
0.03 14 9 10 9 8 9 7 9 7 9 6 9
0.04 15 7 11 7 9 7 8 7 7 7 7 7
0.05 16 6 12 6 10 6 8 6 8 6 7 6
0.06 17 5 12 5 10 5 9 5 8 5 7 5
0.07 18 5 13 5 10 5 9 5 8 5 7 5
0.08 18 4 13 4 11 4 9 4 8 4 8 4
0.09 19 4 13 4 11 4 10 4 9 4 8 4
0.1 19 4 14 4 11 4 10 4 9 4 8 4
Ten values of the proportion (π), six values of σb², c2/c1 = 0.5 and c3/c1 = 30. NC means that the computational solution did not converge.

Table 5.3. Optimal sample sizes (g, s) for group testing in two stages for three values of c2/c1.

      σb²=0.05   σb²=0.1   σb²=0.15   σb²=0.20   σb²=0.25   σb²=0.30
c2/c1 = 2
π      g  s      g  s      g  s       g  s       g  s       g  s
0.01 7 40 5 40 4 40 NC NC 3 40 NC NC
0.02 8 22 6 22 5 22 4 22 4 22 4 22
0.03 10 15 7 15 6 15 5 15 5 15 4 15
0.04 10 12 7 12 6 12 5 12 5 12 5 12
0.05 11 10 8 10 7 10 6 10 5 10 5 10
0.06 12 9 8 9 7 9 6 9 5 9 5 9
0.07 12 8 9 8 7 8 6 8 6 8 5 8
0.08 12 7 9 7 7 7 6 7 6 7 5 7
0.09 13 7 9 7 8 7 7 7 6 7 6 7
0.1 13 6 9 6 8 6 7 6 6 6 6 6
c2/c1 = 6
0.01 6 46 4 46 3 46 3 46 3 46 3 46
0.02 7 26 5 26 4 26 4 26 3 26 3 26
0.03 7 19 5 19 4 19 4 19 4 19 3 19
0.04 8 15 6 15 5 15 4 15 4 15 3 15
0.05 8 13 6 13 5 13 4 13 4 13 4 13
0.06 8 11 6 11 5 11 4 11 4 11 4 11
0.07 9 10 6 10 5 10 5 10 4 10 4 10
0.08 9 9 6 9 5 9 5 9 4 9 4 9
0.09 9 8 7 8 5 8 5 8 4 8 4 8
0.1 9 8 7 8 6 8 5 8 4 8 4 8
c2/c1 = 10
0.01 5 50 4 50 3 50 3 50 3 50 2 50
0.02 6 29 4 29 4 29 3 29 3 29 3 29
0.03 6 21 5 21 4 21 3 21 3 21 3 21
0.04 7 17 5 17 4 17 4 17 3 17 3 17
0.05 7 15 5 15 4 15 4 15 3 15 3 15
0.06 7 13 5 13 4 13 4 13 3 13 3 13
0.07 7 11 5 11 4 11 4 11 4 11 3 11
0.08 8 10 5 10 5 10 4 10 4 10 3 10
0.09 8 9 6 9 5 9 4 9 4 9 3 9
0.1 8 9 6 9 5 9 4 9 4 9 4 9
Ten values of the proportion (π), Se = Sp = 0.90, six values of σb², and c3/c1 = 30. NC means that the computational solution did not converge.

Table 5.4. Relative cost RC(m, g, s) = Q(m*, g*, s = s0)/Q(m*, g*, s*) for two-stage sampling.

        Part A: σb²                      Part B: σb²
π    0.05 0.1 0.15 0.2 0.25          0.05 0.1 0.15 0.2 0.25
s=5                        Se = Sp = 0.88
0.01 3.11 3.02 2.98 2.91 NC 1.78 1.76 1.74 0.00 1.71
0.02 1.93 1.88 1.85 1.84 1.83 1.26 1.25 1.25 1.23 1.24
0.03 1.54 1.52 1.50 1.47 1.47 1.09 1.09 1.09 1.08 1.09
0.04 1.35 1.32 1.31 1.31 1.29 1.03 1.03 1.02 1.03 1.02
0.05 1.23 1.22 1.22 1.20 1.19 1.00 1.00 1.00 1.00 1.00
0.06 1.16 1.15 1.14 1.13 1.14 0.99 0.99 0.99 0.99 0.99
0.07 1.10 1.09 1.09 1.08 1.09 1.00 0.99 1.00 0.99 1.00
0.08 1.06 1.05 1.05 1.05 1.06 1.02 1.02 1.01 1.02 1.01
0.09 1.03 1.04 1.03 1.02 1.03 1.04 1.04 1.04 1.04 1.04
0.1 1.02 1.02 1.01 1.02 1.01 1.08 1.08 1.08 1.08 1.07
s=10                       Se = Sp = 0.92
0.01 1.68 1.65 1.64 1.62 NC 1.57 NC NC 1.52 NC
0.02 1.22 1.21 1.20 1.20 1.20 1.18 1.17 1.16 1.17 1.17
0.03 1.08 1.08 1.07 1.06 1.07 1.07 1.07 1.06 1.06 1.06
0.04 1.02 1.02 1.02 1.02 1.02 1.01 1.01 1.02 1.01 1.02
0.05 1.00 1.00 1.00 1.00 0.99 1.00 0.99 0.99 1.00 0.99
0.06 0.99 0.99 0.99 0.99 1.00 0.99 0.99 1.00 1.00 0.98
0.07 1.00 1.00 1.00 1.00 1.00 0.99 1.00 1.00 1.00 1.01
0.08 1.01 1.01 1.01 1.01 1.02 1.02 1.01 1.02 1.01 1.02
0.09 1.04 1.04 1.03 1.03 1.04 1.04 1.03 1.04 1.03 1.03
0.1 1.08 1.08 1.06 1.07 1.07 1.07 1.06 1.06 1.06 1.06
s=15 3 = ` = 0.96
0.01 1.31 1.29 1.29 1.28 NC 1.34 1.32 1.32 1.30 NC
0.02 1.05 1.05 1.04 1.05 1.05 1.11 1.10 1.10 1.10 1.08
0.03 1.00 1.00 1.00 0.99 1.00 1.04 1.04 1.04 1.03 1.04
0.04 1.00 0.99 0.99 1.00 0.99 1.00 1.01 1.01 1.00 1.01
0.05 1.02 1.03 1.02 1.03 1.01 0.99 0.99 0.99 0.98 1.00
0.06 1.07 1.06 1.06 1.06 1.07 0.99 0.99 0.99 1.00 0.99
0.07 1.14 1.12 1.12 1.11 1.12 1.00 0.99 1.00 1.01 1.00
0.08 1.22 1.20 1.19 1.18 1.19 1.01 1.02 1.00 1.02 1.00
0.09 1.32 1.31 1.29 1.27 1.27 1.03 1.04 1.03 1.03 1.02

0.1 1.47 1.44 1.40 1.41 1.38 1.06 1.06 1.06 1.05 1.04

Ten values of the proportion (p̄) and five values of σ²_b. Part A) c₂ = 4, c₃ = 30, three pool sizes (s = 5, 10, 15), and Se = Sp = 0.90. Part B) c₂ = 4, c₃ = 30, pool size s = 10, and three combinations of Se and Sp. NC means that the computational solution did not converge.

Table 5.5. R code for obtaining optimum values of m, g, and s for a three-stage survey
with group testing.
Opt.GT.3S = function(mstar, gstar, sstar, p, c1, c2, c3, c4, Se, Sp, sigma2a, sigma2b)
{
  VdN = function(thetav)
  {
    m = thetav[1]; g = thetav[2]; s = thetav[3];
    r = Se + Sp - 1;
    # Delta-method variance of the estimated proportion from m*g pools of size s
    par1 = (1 - p)^2/(r^2*s^2);
    par2 = Se*exp(-s*log(1 - p)) - r;
    par3 = (1 - Se)*exp(-s*log(1 - p)) + r;
    Vd = (par1*par2*par3)/(m*g);
    V.a.star = (((p*(1 - p))^2)*sigma2a);       # locality-level component
    V.b.star = (((p*(1 - p))^2)*sigma2b)/(m);   # field-level component
    V.p = V.a.star + V.b.star + Vd;
    C = ((m*g*s)*c1 + (m*g)*c2 + (m)*c3 + c4);  # total survey cost
    CX = V.p*C;                                 # objective: variance times cost
    return(CX)
  }
  opt.sol = nlminb(c(mstar, gstar, sstar), VdN,
                   control = list(trace = 0, iter.max = 10000), lower = 1)
  # Round the continuous solution, forcing at least 2 units at each stage
  ms = ifelse(opt.sol$par[1] < 2, 2, ceiling(opt.sol$par[1]))
  gs = ifelse(opt.sol$par[2] < 2, 2, ceiling(opt.sol$par[2]))
  ss = ifelse(opt.sol$par[3] < 2, 2, round(opt.sol$par[3]))
  list(s.opts = c(ms, gs, ss), Conv = opt.sol$convergence)
}
Opt.GT.3S.Output = Opt.GT.3S(mstar = 2, gstar = 2, sstar = 2, p = 0.04, c1 = 20, c2 = 10,
                             c3 = 600, c4 = 1600, Se = 0.95, Sp = 0.9,
                             sigma2a = 0.3, sigma2b = 0.25)
> Opt.GT.3S.Output
$s.opts
[1] 2 8 10

$Conv
[1] 0
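The R routine above relies on nlminb's continuous optimization followed by rounding. As a cross-check, the same objective (marginal variance times cost) can be minimized directly over integer designs with a brute-force search. The sketch below is illustrative and not part of the dissertation; the search ranges are assumptions.

```python
import math

def opt_gt_3s_grid(p, c1, c2, c3, c4, se, sp, sigma2a, sigma2b,
                   m_max=20, g_max=60, s_max=60):
    """Brute-force integer search over (m, g, s) minimizing V(p_hat) * cost."""
    r = se + sp - 1.0

    def objective(m, g, s):
        # Same objective as the R function VdN in Table 5.5
        e = math.exp(-s * math.log(1 - p))
        vd = (1 - p) ** 2 * (se * e - r) * ((1 - se) * e + r) / (r ** 2 * s ** 2 * m * g)
        va = (p * (1 - p)) ** 2 * sigma2a          # locality component
        vb = (p * (1 - p)) ** 2 * sigma2b / m      # field component
        cost = m * g * s * c1 + m * g * c2 + m * c3 + c4
        return (va + vb + vd) * cost

    candidates = ((m, g, s) for m in range(2, m_max + 1)
                  for g in range(2, g_max + 1)
                  for s in range(2, s_max + 1))
    return min(candidates, key=lambda t: objective(*t))
```

With the inputs used above (p = 0.04, c1 = 20, c2 = 10, c3 = 600, c4 = 1600, Se = 0.95, Sp = 0.90, σ²_a = 0.3, σ²_b = 0.25), the R code reports (m, g, s) = (2, 8, 10); the integer search should land on or near the same design, although the ceiling-based rounding in the R code need not coincide exactly with the integer optimum.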

Table 5.6. Optimal sample sizes (m, g, s) for group testing in three stages for three
combinations of Se and Sp.
σ²_b = 0.05   σ²_b = 0.1   σ²_b = 0.15   σ²_b = 0.20   σ²_b = 0.25
p̄
Se = Sp = 0.90
m g s   m g s   m g s   m g s   m g s

0.01 2 10 37 2 7 37 2 6 37 2 5 37 2 5 37
0.02 2 13 19 2 10 19 2 8 19 2 7 19 2 6 19
0.03 2 16 13 2 11 13 2 9 13 2 8 13 2 7 13
0.04 2 18 10 2 13 10 2 11 10 2 9 10 2 8 10
0.05 2 20 8 2 14 8 2 12 8 2 10 8 2 9 8
0.06 2 21 7 2 15 7 2 13 7 2 11 7 2 10 7
0.07 2 23 6 2 16 6 2 13 6 2 12 6 2 10 6
0.08 2 24 6 2 17 6 2 14 6 2 12 6 2 11 6
0.09 2 25 5 2 18 5 2 15 5 2 13 5 2 11 5

0.1 NC NC NC 2 18 5 2 15 5 2 13 5 2 12 5

Se = Sp = 0.95
0.01 2 11 29 2 8 29 2 6 29 2 6 29 2 5 29
0.02 2 15 16 2 10 16 2 9 16 2 8 16 2 7 16
0.03 2 17 11 2 12 11 2 10 11 2 9 11 2 8 11
0.04 2 19 9 2 13 9 2 11 9 2 10 9 2 9 9
0.05 2 20 7 2 15 7 2 12 7 2 10 7 2 9 7
0.06 2 22 6 2 15 6 2 13 6 2 11 6 2 10 6
0.07 2 23 5 2 16 5 2 13 5 2 12 5 2 10 5
0.08 2 24 5 2 17 5 2 14 5 2 12 5 2 11 5
0.09 2 25 5 2 18 5 2 14 5 2 13 5 2 11 5

0.1 2 26 4 2 18 4 2 15 4 2 13 4 2 12 4

Se = Sp = 0.98
0.01 2 14 21 2 10 21 2 8 21 2 7 21 2 6 21
0.02 2 17 12 2 12 12 2 10 12 2 9 12 2 8 12
0.03 2 20 9 2 14 9 2 12 9 2 10 9 2 9 9
0.04 2 21 7 2 15 7 2 13 7 2 11 7 2 10 7
0.05 2 23 6 2 16 6 2 13 6 2 12 6 2 10 6
0.06 2 24 5 2 17 5 2 14 5 2 12 5 2 11 5
0.07 2 25 5 2 18 5 2 14 5 2 13 5 2 11 5
0.08 2 25 4 2 18 4 2 15 4 2 13 4 2 12 4
0.09 2 26 4 2 19 4 2 15 4 2 13 4 2 12 4
0.1 2 27 4 2 19 4 2 16 4 2 14 4 2 12 4

Ten values of the proportion (p̄), five values of σ²_b, σ²_a = 0.3, c₂ = 0.5, c₃ = 30, and c₄ = 80. NC means that the computational solution did not converge.

Table 5.7. Optimal sample sizes (m, g, s) for group testing in three stages for three values
of c₂.
σ²_b = 0.05   σ²_b = 0.1   σ²_b = 0.15   σ²_b = 0.20   σ²_b = 0.25
p̄   m g s   m g s   m g s   m g s   m g s
c₂ = 2
0.01 2 9 33 2 7 33 2 6 33 2 5 33 2 5 33
0.02 2 12 19 2 8 19 2 7 19 2 6 19 2 6 19
0.03 2 13 14 2 9 14 2 8 14 2 7 14 2 6 14
0.04 2 14 11 2 10 11 2 8 11 2 7 11 2 6 11
0.05 2 15 9 2 10 9 2 9 9 2 8 9 2 7 9
0.06 2 15 8 2 11 8 2 9 8 2 8 8 2 7 8
0.07 2 16 7 2 11 7 2 9 7 2 8 7 2 7 7
0.08 2 16 7 2 12 7 2 10 7 2 8 7 2 7 7
0.09 2 17 6 2 12 6 2 10 6 2 9 6 2 8 6
0.1 2 17 6 2 12 6 2 10 6 2 9 6 2 8 6
c₂ = 6
0.01 2 8 41 2 6 41 2 5 41 2 4 41 2 4 41
0.02 2 9 24 2 6 24 2 5 24 2 5 24 2 4 24
0.03 2 9 18 2 7 18 2 6 18 2 5 18 2 4 18
0.04 2 10 15 2 7 15 2 6 15 2 5 15 2 5 15
0.05 2 10 13 2 7 13 2 6 13 2 5 13 2 5 13
0.06 2 11 11 2 8 11 2 6 11 2 6 11 2 5 11
0.07 2 11 10 2 8 10 2 6 10 2 6 10 2 5 10
0.08 2 11 9 2 8 9 2 7 9 2 6 9 2 5 9
0.09 2 11 8 2 8 8 2 7 8 2 6 8 2 5 8
0.1 2 12 8 2 8 8 2 7 8 2 6 8 2 5 8
c₂ = 10
0.01 2 7 46 2 5 46 2 4 46 2 4 46 2 3 46
0.02 2 7 28 2 5 28 2 5 28 2 4 28 2 4 28
0.03 2 8 21 2 6 21 2 5 21 2 4 21 2 4 21
0.04 2 8 17 2 6 17 2 5 17 2 4 17 2 4 17
0.05 2 9 15 2 6 15 2 5 15 2 5 15 2 4 15
0.06 2 9 13 2 6 13 2 5 13 2 5 13 2 4 13
0.07 2 9 11 2 7 11 2 5 11 2 5 11 2 4 11
0.08 2 9 10 2 7 10 2 6 10 2 5 10 2 4 10
0.09 2 9 10 2 7 10 2 6 10 2 5 10 2 4 10

0.1 2 10 9 2 7 9 2 6 9 2 5 9 2 5 9

Ten values of the proportion (p̄), five values of σ²_b, σ²_a = 0.25, Se = Sp = 0.95, c₃ = 30, and c₄ = 80. NC means that the computational solution did not converge.

Table 5.8. Relative cost RC(m, g, s) = C(m*, g*, s = 10)/C(m*, g*, s*) for three-stage
sampling.
Part A: σ²_b                              Part B: σ²_b
p̄    0.05 0.1 0.15 0.2 0.25              0.05 0.1 0.15 0.2 0.25
Part A: Se = Sp = 0.90 | Part B: c₂ = 2
0.01 1.27 1.28 1.25 1.26 1.24 1.21 1.18 1.17 1.18 1.17
0.02 1.06 1.05 1.05 1.05 1.06 1.04 1.04 1.04 1.03 1.02
0.03 1.00 1.01 1.01 1.00 1.01 1.01 1.01 0.99 1.00 1.01
0.04 1.00 1.00 0.98 1.00 1.00 1.00 0.98 1.01 1.01 1.01
0.05 1.00 1.01 0.99 1.00 1.00 0.99 1.02 1.00 0.99 0.99
0.06 1.02 1.02 1.01 1.02 1.02 1.01 1.00 1.02 1.01 1.01
0.07 1.04 1.04 1.06 1.02 1.05 1.02 1.04 1.02 1.02 1.03
0.08 1.04 1.06 1.03 1.04 1.05 NC 0.99 1.01 1.03 1.04
0.09 1.10 1.10 1.07 1.08 1.08 1.04 NC 1.05 1.04 1.04
0.1 NC 1.12 1.09 1.10 1.09 1.06 1.05 1.06 1.05 1.05

Part A: Se = Sp = 0.95 | Part B: c₂ = 6
0.01 1.12 1.12 1.13 1.10 1.12 1.37 1.32 1.32 1.34 1.33
0.02 1.00 1.01 0.99 0.99 1.00 1.14 1.15 1.13 1.11 1.13
0.03 0.99 1.00 0.98 0.98 0.99 1.06 1.03 1.05 1.04 1.07
0.04 0.99 1.00 0.98 0.98 0.98 1.02 1.01 1.02 1.03 1.01
0.05 1.01 1.01 1.00 1.01 1.02 1.01 1.00 1.00 1.01 1.01
0.06 1.02 1.04 1.02 1.03 1.02 NC 0.98 1.02 0.99 0.99
0.07 1.06 1.05 1.07 1.06 1.06 1.00 1.00 1.00 1.00 1.00
0.08 1.07 1.07 1.04 1.07 1.06 1.00 0.99 0.98 0.99 1.01
0.09 1.05 1.08 1.05 1.04 1.07 1.02 1.01 1.01 1.00 1.03
0.1 1.12 1.12 1.11 1.10 1.12 1.01 1.02 1.01 1.01 1.04

Part A: Se = Sp = 0.98 | Part B: c₂ = 10
0.01 1.03 1.04 1.03 1.03 1.04 1.53 1.49 1.50 1.46 1.51
0.02 1.00 1.00 1.00 0.99 0.99 1.23 1.22 1.17 1.20 1.17
0.03 0.98 1.00 0.98 0.98 1.00 1.11 1.08 1.10 1.11 1.09
0.04 1.01 1.01 1.00 1.01 1.01 1.06 1.04 1.05 1.06 1.04
0.05 1.00 1.03 1.02 1.01 1.02 1.02 1.04 1.01 1.01 1.02
0.06 1.04 1.04 1.05 1.05 1.04 1.00 1.01 1.02 0.98 1.02
0.07 1.03 1.05 1.03 1.02 1.05 1.00 0.99 1.03 0.99 1.02
0.08 1.09 1.08 1.07 1.07 1.08 1.00 1.00 1.00 1.00 1.00
0.09 1.08 1.09 1.09 1.08 1.07 1.00 1.00 0.96 1.00 1.00
0.1 1.08 1.10 1.08 1.08 1.08 1.00 0.98 0.98 1.02 0.98

Ten values of the proportion (p̄) and five values of σ²_b. Part A) σ²_a = 0.3, c₂ = 0.5, c₃ = 30, c₄ = 80, and three combinations of Se and Sp. Part B) σ²_a = 0.25, Se = Sp = 0.95, c₃ = 30, c₄ = 80, and three values of c₂. NC means that the computational solution did not converge.
Appendix 5.A. Derivative of
$$f(\hat{p}; s) = \frac{(1-\bar{p})^2\left[S_e e^{-s\ln(1-\bar{p})} - r\right]\left[(1-S_e)e^{-s\ln(1-\bar{p})} + r\right]}{r^2 s^2}$$
with respect to pool size (s), following the ideas of Xiong (2014)

Throughout, $r = S_e + S_p - 1 > 0$ and $x = -\ln(1-\bar{p}) > 0$, so that $e^{-s\ln(1-\bar{p})} = e^{sx}$. Using the Taylor series expansion $e^{sx} = \sum_{k=0}^{\infty} x^k s^k / k!$, we can write

$$f(\hat{p}; s) = \frac{(1-\bar{p})^2}{r^2 s^2}\left[\sum_{k=0}^{\infty} A(k)s^k - r\right]\left[\sum_{k=0}^{\infty} B(k)s^k + r\right],$$

where $A(k) = S_e x^k/k!$ and $B(k) = (1-S_e)x^k/k!$. Multiplying out the two factors and collecting powers of $s$ gives

$$f(\hat{p}; s) = \frac{(1-\bar{p})^2}{r^2 s^2}\left[\sum_{k=0}^{\infty} D(k)s^k + r\sum_{k=0}^{\infty} C(k)s^k - r^2\right] = \frac{(1-\bar{p})^2}{r^2}\sum_{k=0}^{\infty} E(k)s^{k-2},$$

with

$$D(k) = \sum_{l=0}^{k} A(l)B(k-l), \qquad C(k) = A(k) - B(k) = (2S_e - 1)\frac{x^k}{k!},$$

$$E(k) = D(k) + rC(k) \;\;(k \ge 1), \qquad E(0) = D(0) + rC(0) - r^2 = (S_e - r)(1 - S_e + r) = S_p(1 - S_p).$$

Since $S_e > 0.5$, $C(k) > 0$ for all $k$; $A(k)$, $B(k)$, and hence $D(k)$, are positive; and $S_p(1 - S_p) > 0$. Therefore $E(k) > 0$ for every $k$, i.e., all coefficients of the power series are positive. The first derivative of $f(\hat{p}; s)$ is therefore

$$\frac{\partial f(\hat{p}; s)}{\partial s} = \frac{(1-\bar{p})^2}{r^2}\sum_{k=0}^{\infty}(k-2)E(k)s^{k-3},$$

and the second derivative is

$$\frac{\partial^2 f(\hat{p}; s)}{\partial s^2} = \frac{(1-\bar{p})^2}{r^2}\sum_{k=0}^{\infty}(k-2)(k-3)E(k)s^{k-4} > 0,$$

since $(k-2)(k-3)E(k) \ge 0$ for every integer $k$, with strict inequality for $k \notin \{2, 3\}$. Therefore $f(\hat{p}; s)$ is a convex function of $s$ and has a minimum $s^{*}$, which is the solution of $\partial f(\hat{p}; s)/\partial s = 0$.
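As a numeric illustration of the convexity result (not part of the proof), the sketch below evaluates the per-pool variance factor on a grid of pool sizes and computes its second differences, which should all be positive; the parameter values are the ones used in the example of Table 5.5.

```python
import math

def f_var(s, p=0.04, se=0.95, sp=0.90):
    """Per-pool variance factor f(p_hat; s) from Appendix 5.A."""
    r = se + sp - 1.0
    e = math.exp(-s * math.log(1 - p))   # equals (1 - p)^(-s)
    return (1 - p) ** 2 * (se * e - r) * ((1 - se) * e + r) / (r ** 2 * s ** 2)

# Second differences on a fine grid: positive everywhere when f is convex in s.
grid = [1.0 + 0.1 * k for k in range(600)]      # s in [1, 60.9]
vals = [f_var(s) for s in grid]
second_diffs = [vals[i - 1] - 2 * vals[i] + vals[i + 1]
                for i in range(1, len(vals) - 1)]
```

The grid also exhibits a single interior minimum, the optimal pool size s* for this parameter set.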

Appendix 5.B. Proof of Theorem 1

The function $F(g, s) = \left[\sigma_{b}^{2*} + f(\hat{p}; s)/g\right](c_1 g s + c_2 g + c_3)$ is defined on the set of all pairs of positive numbers $(g, s)$. Its first partials are

$$\frac{\partial F(g,s)}{\partial g} = \left[\sigma_{b}^{2*} + \frac{f(\hat{p};s)}{g}\right](c_1 s + c_2) + (c_1 g s + c_2 g + c_3)\left[-\frac{f(\hat{p};s)}{g^2}\right],$$

$$\frac{\partial F(g,s)}{\partial s} = \left[\sigma_{b}^{2*} + \frac{f(\hat{p};s)}{g}\right](c_1 g) + (c_1 g s + c_2 g + c_3)\,\frac{1}{g}\cdot\frac{(1-\bar{p})^2}{r^2}\sum_{k=0}^{\infty}(k-2)E(k)s^{k-3}$$

by Appendix 5.A. The second derivatives of $F(g, s)$ are

$$\frac{\partial^2 F(g,s)}{\partial g^2} = 2(c_1 s + c_2)\left[-\frac{f(\hat{p};s)}{g^2}\right] + (c_1 g s + c_2 g + c_3)\left[\frac{2f(\hat{p};s)}{g^3}\right] = 2c_3\left[\frac{f(\hat{p};s)}{g^3}\right] > 0,$$

since the terms in $c_1 s + c_2$ cancel, and

$$\frac{\partial^2 F(g,s)}{\partial s^2} = 2c_1\frac{(1-\bar{p})^2}{r^2}\sum_{k=0}^{\infty}(k-2)E(k)s^{k-3} + (c_1 g s + c_2 g + c_3)\,\frac{1}{g}\cdot\frac{(1-\bar{p})^2}{r^2}\sum_{k=0}^{\infty}(k-2)(k-3)E(k)s^{k-4} > 0,$$

since $(k-2)E(k) > 0$ for all $k \ne 2$ and $(k-2)(k-3)E(k) > 0$ for all $k \notin \{2, 3\}$ (see details in Appendix 5.A).

Since all coefficients are positive, $F(g, s)$ is a convex function and attains its minimum at the points $g^{*}$ and $s^{*}$, which solve $\partial F/\partial g = 0$ and $\partial F/\partial s = 0$, respectively. The proof of Theorem 1 is complete.
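A quick numeric spot-check of the convexity claim for one example parameter set (the cost values and σ²_b below come from the Table 5.5 example and are illustrative; this check is not a substitute for the proof):

```python
import math

def f_var(s, p=0.04, se=0.95, sp=0.90):
    """Per-pool variance factor f(p_hat; s) from Appendix 5.A."""
    r = se + sp - 1.0
    e = math.exp(-s * math.log(1 - p))
    return (1 - p) ** 2 * (se * e - r) * ((1 - se) * e + r) / (r ** 2 * s ** 2)

def F(g, s, p=0.04, c1=20.0, c2=10.0, c3=600.0, sigma2b=0.25):
    """Two-stage objective of Theorem 1: (sigma*_b^2 + f/g) * cost."""
    sb_star = (p * (1 - p)) ** 2 * sigma2b
    return (sb_star + f_var(s) / g) * (c1 * g * s + c2 * g + c3)

# Second differences along each coordinate direction over a lattice of designs.
h = 0.5
points = [(g, s) for g in range(2, 41, 2) for s in range(2, 41, 2)]
d2g = [F(g - h, s) - 2 * F(g, s) + F(g + h, s) for g, s in points]
d2s = [F(g, s - h) - 2 * F(g, s) + F(g, s + h) for g, s in points]
```

Positive second differences in both directions at every lattice point are consistent with the positive diagonal Hessian terms derived above.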

Appendix 5.C. Proof of Theorem 2

The function $F(m, g, s) = \left[\sigma_{a}^{2*} + \sigma_{b}^{2*}/m + f(\hat{p}; s)/(mg)\right](c_1 m g s + c_2 m g + c_3 m + c_4)$ is defined on the set of all triples of positive numbers $(m, g, s)$. Its first partials are

$$\frac{\partial F}{\partial m} = \left[\sigma_{a}^{2*} + \frac{\sigma_{b}^{2*}}{m} + \frac{f(\hat{p};s)}{mg}\right](c_1 g s + c_2 g + c_3) + (c_1 m g s + c_2 m g + c_3 m + c_4)\left[-\frac{\sigma_{b}^{2*}}{m^2} - \frac{f(\hat{p};s)}{m^2 g}\right],$$

$$\frac{\partial F}{\partial g} = \left[\sigma_{a}^{2*} + \frac{\sigma_{b}^{2*}}{m} + \frac{f(\hat{p};s)}{mg}\right](c_1 m s + c_2 m) + (c_1 m g s + c_2 m g + c_3 m + c_4)\left[-\frac{f(\hat{p};s)}{m g^2}\right],$$

$$\frac{\partial F}{\partial s} = \left[\sigma_{a}^{2*} + \frac{\sigma_{b}^{2*}}{m} + \frac{f(\hat{p};s)}{mg}\right](c_1 m g) + (c_1 m g s + c_2 m g + c_3 m + c_4)\,\frac{1}{mg}\cdot\frac{(1-\bar{p})^2}{r^2}\sum_{k=0}^{\infty}(k-2)E(k)s^{k-3}$$

by Appendix 5.A. The second derivatives of $F(m, g, s)$ are

$$\frac{\partial^2 F}{\partial m^2} = 2(c_1 g s + c_2 g + c_3)\left[-\frac{\sigma_{b}^{2*}}{m^2} - \frac{f(\hat{p};s)}{m^2 g}\right] + (c_1 m g s + c_2 m g + c_3 m + c_4)\left[\frac{2\sigma_{b}^{2*}}{m^3} + \frac{2f(\hat{p};s)}{m^3 g}\right] = 2c_4\left[\frac{\sigma_{b}^{2*}}{m^3} + \frac{f(\hat{p};s)}{m^3 g}\right] > 0,$$

$$\frac{\partial^2 F}{\partial g^2} = 2(c_1 m s + c_2 m)\left[-\frac{f(\hat{p};s)}{m g^2}\right] + (c_1 m g s + c_2 m g + c_3 m + c_4)\left[\frac{2f(\hat{p};s)}{m g^3}\right] = 2(c_3 m + c_4)\left[\frac{f(\hat{p};s)}{m g^3}\right] > 0,$$

since in each case the terms in the remaining cost coefficients cancel, and

$$\frac{\partial^2 F}{\partial s^2} = 2c_1\frac{(1-\bar{p})^2}{r^2}\sum_{k=0}^{\infty}(k-2)E(k)s^{k-3} + (c_1 m g s + c_2 m g + c_3 m + c_4)\,\frac{1}{mg}\cdot\frac{(1-\bar{p})^2}{r^2}\sum_{k=0}^{\infty}(k-2)(k-3)E(k)s^{k-4} > 0,$$

since $(k-2)E(k) > 0$ for all $k \ne 2$ and $(k-2)(k-3)E(k) > 0$ for all $k \notin \{2, 3\}$ (see details in Appendix 5.A).

Since all coefficients are positive, $F(m, g, s)$ is a convex function and attains its minimum at $m^{*}$, $g^{*}$, and $s^{*}$, which solve $\partial F/\partial m = 0$, $\partial F/\partial g = 0$, and $\partial F/\partial s = 0$, respectively. The proof of Theorem 2 is complete.

Chapter 6: General conclusions

In the present research we propose optimal sampling plans for two- and three-stage
group testing surveys under budget and variance constraints. These optimal sampling plans
were derived under a mixed logistic group testing model (MLGTM). To obtain the
marginal variance of the estimated proportion (p̂), we used a first-order Taylor series
expansion (TSE); that is, the MLGTM was approximated by a first-order TSE. For the
two-stage sampling process, the marginal variance of p̂ was approximated using a first-order
TSE around b_i = 0, where b_i is the random component (field effect), assumed
b_i ~ N(0, σ²_b). For the three-stage sampling process, the marginal variance of p̂ was
approximated using a first-order TSE around a_i = 0 and b_ij = 0, where a_i is the random
effect due to locality, assumed a_i ~ N(0, σ²_a), and b_ij ~ N(0, σ²_b(a)) is the random
component due to field.
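The resulting first-order TSE approximation to the marginal variance mirrors the VdN function coded in Table 5.5 and can be sketched as a short function (parameter names here are illustrative):

```python
import math

def marginal_var_3s(p, s, m, g, se, sp, sigma2a, sigma2b):
    """First-order TSE marginal variance of p_hat for a three-stage design:
    locality component + field component + delta-method group testing component."""
    r = se + sp - 1.0
    e = math.exp(-s * math.log(1 - p))            # (1 - p)^(-s)
    vd = (1 - p) ** 2 * (se * e - r) * ((1 - se) * e + r) / (r ** 2 * s ** 2 * m * g)
    va = (p * (1 - p)) ** 2 * sigma2a             # between-locality variability
    vb = (p * (1 - p)) ** 2 * sigma2b / m         # between-field variability
    return va + vb + vd
```

Note that adding pools per field (g) shrinks only the group testing component, while the locality component is untouched; this asymmetry is what drives the optimal allocation across stages.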

Then, assuming a fixed budget for estimating the average population proportion (p̄),
we minimized the variance of p̂ subject to a budget constraint, and the optimal allocation
of units was obtained using Lagrange multipliers. For two-stage sampling we
assumed a fixed pool size (s) and derived the optimal allocation of fields and pools per
field, while for three-stage sampling we obtained the optimal allocation of
localities, fields, and pools per field.

Next we relaxed the assumption of a fixed pool size and optimized the pool size by
minimizing the product of the variance of the proportion (p̂) and the budget constraint.
Because it was difficult to obtain an analytical solution for the numbers of localities,
fields, and pools per field together with the pool size, we obtained numerical solutions
using the nlminb function in R. Comparing the solutions for a given pool size with those
for an optimized pool size, we found that it is better to also optimize the pool size when
the sensitivity and specificity of the diagnostic test are considerably lower than one.
However, no gain is obtained when the sensitivity and specificity are near one, i.e., close to perfect.

These optimal sample sizes (for two and three stages) were derived under a TSE,
which is expected to give slightly biased parameter estimates. However, they have the
advantage of being simple analytical expressions that investigators can use
when planning their survey research. These optimal sample sizes are also useful for planning
studies with different objectives: a) to minimize the variance of the proportion subject to
a budget constraint; b) to minimize the cost subject to a variance constraint. In case b),
the variance constraint can be chosen either to achieve a specific width of confidence
interval or to achieve a specific power for testing a specific hypothesis.

We also generalized a mixed group testing regression model to a
complex two-stage survey with stratification and clusters of different sizes under an
informative sampling process. This generalization is important because all group testing
regression models developed so far assume that the selection probabilities are the same
for all clusters and individuals, so that weights are not required for estimation. We
showed that when the sampling process is informative and the design structure of the
survey (the weights) is ignored, the resulting estimates exhibit substantial bias. Therefore,
the weights must be included to obtain approximately unbiased estimates. However,
when the sampling process is conducted in two stages, the weights must be
accommodated at both the individual and the cluster level (i.e., at two different positions
in the likelihood function), and the individual-level weights need to be scaled to obtain
precise, approximately unbiased estimates, since including the raw weights produces even worse
results than ignoring the weights.

To estimate the population parameters, we approximated the population likelihood
equations that would have been obtained in a census, incorporating the
sampling weights at the cluster and individual levels using a pseudo-maximum
likelihood (PML) approach. The PML approach is recommended for obtaining precise
parameter estimates when the sampling process is informative. However, the
incorporation of the weights into the PML was challenging, since they are functions of
sums across clusters and individuals and require knowledge of both sets of weights. We also
compared the results of ignoring the sampling weights, using the raw weights, and
using three methods of scaling the sampling weights. We found that if the sampling
process is informative, both the cluster-level and the individual-level weights must be
incorporated. However, scaled weights are necessary at the individual level, because using raw
weights produces considerable bias.
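As an illustration of the kind of scaling involved (a sketch, not the dissertation's SAS implementation), one widely used choice, often called "Method 2" in the multilevel sampling-weights literature, rescales the conditional individual-level weights within each cluster so that they sum to the realized cluster sample size:

```python
def scale_weights_method2(within_cluster_weights):
    """Rescale raw conditional weights w_{j|i} so they sum to the cluster
    sample size n_i: w*_{j|i} = w_{j|i} * n_i / sum_j w_{j|i}."""
    n_i = len(within_cluster_weights)
    total = sum(within_cluster_weights)
    return [w * n_i / total for w in within_cluster_weights]
```

Under non-informative sampling the raw conditional weights within a cluster are equal, and this scaling maps each of them to 1, which is one way to see why ignoring the weights is harmless in that case.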

Since the incorporation of the weights is a difficult task, we provide SAS code that
researchers can use for group testing data from a complex survey under
informative sampling, together with a detailed application showing how the
analysis can be carried out with the SAS code provided.

6.1 Future research

Often group testing data are collected from complex sample surveys. To obtain high-quality
data at minimum cost, it is important to use optimal sampling plans and
regression models for complex surveys under group testing. Optimal sampling with
group testing can produce considerable savings of resources with little loss of
precision. Future work should improve the sampling plans developed here by
using a second-order series expansion to improve the quality of the parameter estimates,
and should evaluate the estimates via simulation.

In addition, much remains to be done regarding group testing regression models for
complex surveys: (1) generalize the group testing models to surveys with three or more
stages, since most practical complex surveys have at least three stages; (2)
explore the generalized estimating equations (GEE) approach for group testing data under
a complex survey, since the use of a GLMM with adaptive Gaussian quadrature is
prohibitive when the number of random effects is large; (3) explore the use of the
Expectation-Maximization (EM) algorithm for multistage group testing problems, which may
make it possible to include individual weights in place of
pooled weights; (4) use Bayesian analysis for parameter estimation with multistage group
testing data; (5) in place of the pseudo-maximum likelihood (PML) approach,
explore the feasibility of using the sample distribution proposed by Pfeffermann et al. (2006) in
a non-group-testing context; and (6) develop practical software for multistage group testing
settings so that researchers with little background in statistics can analyze
multistage group testing data.

6.2. References

Pfeffermann, D., Da Silva Moura, F. A., and Do Nascimento Silva, P. L. (2006). Multi-level
modelling under informative sampling. Biometrika, 93(4), 943-959.
