Designing and Conducting Laboratory Experiments
Why use laboratory experiments?
Test and improve modelling assumptions, especially when human decision
makers are involved
Improve theory
Replication in different contexts improves the area of applicability of a theory,
e.g. the use of psychological theory in behavioral OR experiments
In order to understand the cause-and-effect relationships in a system, one must
deliberately change the input variables to the system and observe the changes in the
system output that these changes to the inputs produce. In other words, experiments
need to be conducted on the system. Observations on a system or process can lead
to theories or hypotheses about what makes the system work, but experiments are
required to demonstrate that these theories are correct.
Categories of experiments
1. Testing and refining existing theory
2. Characterizing new phenomena leading to new theory
3. Testing institutional designs (Strategic options and information available to
participants should match those assumed by the theoretical model.)
The art of designing good experiments (as well as building good analytical models) is
in creating simple environments that capture the essence of the real problem while
abstracting away all unnecessary details. Thus, the first step of doing experimental
work is to start with an interesting theory.
Nuisance Variable(s):
Factors that may influence the experimental response but in which we are
not directly interested (e.g. gender, experience, age...)
Experimental design is the process of planning the experiment so that the appropriate
data will be collected and analyzed by statistical methods, resulting in valid and
objective conclusions. When the problem involves data that are subject to
experimental errors, statistical methods are the only objective approach to analysis.
The three basic principles of experimental design are:
randomization,
replication and
blocking
2. Context:
3. Subject Pool:
4. Incentives:
Course credits
The binary lottery procedure has the advantage of controlling for risk
aversion, but it is less straightforward
Decisions that involve social preferences and risk are affected by real
incentives.
5. Deception:
Common in psychology.
6. Additional Information:
Subject recruitment is most efficiently done through the Internet, and several
recruitment systems have been developed and are freely available for
academic use (e.g. the ORSEE software, developed by Ben Greiner, which can be
accessed at http://www.orsee.org/). Psychology departments often use Sona
Systems (http://www.sona-systems.com), which is not free but is convenient
because it is hosted.
The majority, although not all, of experiments are conducted using a computer
interface (z-Tree, oTree, SoPHIE...).
Experimental Designs
Guidelines for designing experiments
1. Recognition of and the statement of the problem
2. Selection of the response variable
3. Choice of factors, levels and range (design factors, held-constant factors and
allowed-to-vary factors; cause-and-effect diagram)
4. Choice of experimental design
5. Performing the experiment
6. Statistical analysis of the data
7. Conclusions and recommendations (Follow-up runs and confirmation testing)
Will you only estimate the treatment effect? (Average treatment effect, t-test)
How will you achieve randomization? (Between-subjects vs. within-subjects -->
nested or hierarchical models; fixed, random or mixed effects models)
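The between-subjects vs. within-subjects distinction can be sketched in code. A minimal illustration of both randomization schemes (the subject IDs, treatment labels and seed below are made up and not tied to any particular lab package):

```python
import random

def randomize_between_subjects(subject_ids, treatments, seed=42):
    """Between-subjects: each subject sees exactly one treatment.
    Subjects are shuffled, then dealt round-robin so group sizes stay balanced."""
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    ids = list(subject_ids)
    rng.shuffle(ids)
    return {sid: treatments[i % len(treatments)] for i, sid in enumerate(ids)}

def randomize_within_subjects(subject_ids, treatments, seed=42):
    """Within-subjects: every subject sees all treatments, in an
    independently shuffled order to control for order effects."""
    rng = random.Random(seed)
    return {sid: rng.sample(treatments, len(treatments)) for sid in subject_ids}

assignment = randomize_between_subjects(range(8), ["control", "treatment"])
orders = randomize_within_subjects(range(8), ["A", "B", "C"])
```

Fixing the seed is a design choice: it makes the assignment auditable after the session while still being random with respect to the subjects.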
Statistical Power
How large must the sample be?
There is the type I error – namely, the probability of rejecting the null
hypothesis even though it is true.
There is the type II error, which is the probability of not rejecting the null
hypothesis even though the alternative hypothesis is true.
Generally, the more observations that are collected, the more powerful the
test.
Design also involves thinking about and selecting a tentative empirical model
to describe the results. This model is just a quantitative relation between the
response and the important design factors. In many cases a low order
polynomial model will be appropriate (i.e. first order model). A first order
model in two variables is:
y = β₀ + β₁x₁ + β₂x₂ + ε
Higher-order interactions can also be included in the experiments with more than two
factors if necessary. Another widely used model is the second-order model
y = β₀ + β₁x₁ + β₂x₂ + β₁₂x₁x₂ + β₁₁x₁² + β₂₂x₂² + ε
Second-order models are often used in optimization experiments.
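Either model can be fit by ordinary least squares. A minimal sketch for the first-order model on simulated data; the true coefficients 2, 3 and −1 and the noise level are assumed values for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
# Simulate y = 2 + 3*x1 - 1*x2 + noise (coefficients are made up).
y = 2.0 + 3.0 * x1 - 1.0 * x2 + rng.normal(0.0, 0.1, n)

# Design matrix with an intercept column; least squares estimates the betas.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # approximately [2, 3, -1]
```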
Hypothesis testing
A statistical hypothesis is a statement either about the parameters of a probability
distribution or the parameters of a model. The hypothesis reflects
some conjecture about the problem situation. This may be stated formally as
H₀: μ₁ = μ₂
H₁: μ₁ ≠ μ₂
The statement H₀ is called the null hypothesis, and H₁ is called
the alternative hypothesis. The alternative hypothesis specified here is called
a two-sided alternative hypothesis because it would be true if μ₁ < μ₂ or
if μ₁ > μ₂.
To test a hypothesis, we devise a procedure for taking a random sample, computing
an appropriate test statistic, and then rejecting or failing to reject the null
hypothesis H₀ based on the computed value of the test statistic. Part of this
procedure is specifying the set of values of the test statistic that leads to rejection
of H₀. This set of values is called the critical region or rejection region for the
test.
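As a concrete sketch, this procedure for two means with a two-sample t-test; the population means 10 and 12, the common standard deviation, and the group sizes are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample1 = rng.normal(10.0, 2.0, size=50)  # drawn with mu1 = 10 (assumed)
sample2 = rng.normal(12.0, 2.0, size=50)  # drawn with mu2 = 12 (assumed)

# Test H0: mu1 = mu2 against the two-sided alternative H1: mu1 != mu2.
t_stat, p_value = stats.ttest_ind(sample1, sample2)

# The rejection region at significance level alpha = 0.05 is equivalent
# to rejecting whenever the p-value falls below alpha.
reject_h0 = p_value < 0.05
```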
Two kinds of errors may be committed when testing hypotheses. If the null
hypothesis is rejected when it is true, a type I error has occurred. If the null
hypothesis is not rejected when it is false, a type II error has been made. The
probabilities of these two errors are given special symbols.
When assumptions are badly violated, the performance of the t‐test will be affected.
Generally, small to moderate violations of assumptions are not a major concern, but
any failure of the independence assumption and strong indications of nonnormality
should not be ignored. Both the significance level of the test and the ability to detect
differences between the means will be adversely affected by departures from
assumptions. Transformations are one approach to dealing with this problem.
Confidence Intervals
Although hypothesis testing is a useful procedure, it sometimes does not tell the
entire story. It is often preferable to provide an interval within which the value of the
parameter or parameters in question would be expected to lie. These interval
statements are called confidence intervals.
Controlling the impact of the sample size
One way to do this is to consider the impact of sample size on the estimate of the
difference in two means. We know that the 100(1−α)% confidence interval on the
difference in two means is a measure of the precision of estimation of the difference
in the two means. The length of this interval is determined by

t_{α/2, n₁+n₂−2} S_p √(1/n₁ + 1/n₂)
We consider the case where the sample sizes from the two populations are equal, so
that n₁ = n₂ = n. Then the length of the confidence interval is determined by

t_{α/2, 2n−2} S_p √(2/n)
Consequently, the precision with which the difference in the two means is estimated
depends on two quantities: S_p, over which we have no control,
and t_{α/2, 2n−2} √(2/n), which we can control by choosing the sample
size n.
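The shrinking interval length can be computed directly. A sketch, where the pooled standard deviation S_p = 2.0 is an assumed value:

```python
import math
from scipy import stats

def ci_halfwidth(s_p, n, alpha=0.05):
    """Half-length of the 100(1 - alpha)% CI on mu1 - mu2 with equal
    group sizes n: t_{alpha/2, 2n-2} * s_p * sqrt(2/n)."""
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=2 * n - 2)
    return t_crit * s_p * math.sqrt(2.0 / n)

# Precision improves as n grows (s_p = 2.0 is assumed):
widths = [ci_halfwidth(2.0, n) for n in (5, 10, 20, 40)]
```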
Using a hypothesis testing framework
The choice of the sample size and the probability of type II error β are closely
connected. Suppose that we are testing the hypotheses

H₀: μ₁ = μ₂
H₁: μ₁ ≠ μ₂
and that the means are not equal, so that δ = μ₁ − μ₂. The probability of type
II error depends on the true difference in means, δ. A graph of β versus δ for a
particular sample size is called the operating characteristic curve, or OC
curve, for the test. The β error is also a function of the sample size. Generally, for a
given value of δ, the β error decreases as the sample size increases. That is, a
specified difference in means is easier to detect for larger sample sizes than for
smaller ones.
An alternative to the OC curve is a power curve, which typically plots power,
or 1 − β, versus sample size for a specified difference in the means. Some
software packages perform power analysis and will plot power curves.
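Such a curve can also be computed directly from the noncentral t distribution. A sketch for the two-sided two-sample t-test, where the difference δ = 1 and common σ = 2 are assumed values:

```python
from scipy import stats

def power_two_sample_t(delta, sigma, n, alpha=0.05):
    """Power of the two-sided two-sample t-test for a true mean difference
    delta, common standard deviation sigma, and n observations per group."""
    df = 2 * n - 2
    nc = delta / (sigma * (2.0 / n) ** 0.5)  # noncentrality parameter
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df)
    # P(reject H0) = P(|T'| > t_crit) where T' is noncentral t with parameter nc
    return (1.0 - stats.nct.cdf(t_crit, df, nc)) + stats.nct.cdf(-t_crit, df, nc)

# Power rises with the per-group sample size for a fixed difference:
curve = [(n, power_two_sample_t(1.0, 2.0, n)) for n in (5, 10, 20, 40, 80)]
```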
From examining power curves, we observe the following:
We will find it useful to describe the observations from an experiment with a model.
One way to write this model is in the form of a means model:

y_ij = μ_i + ε_ij,  i = 1, 2, ..., α;  j = 1, 2, ..., n

where y_ij is the ijth observation, μ_i is the mean of the ith factor level or treatment,
and ε_ij is a random error component that incorporates all other sources of
variability in the experiment, including measurement, variability arising from
uncontrolled factors, differences between the experimental units (such as test
material) to which the treatments are applied, and the general background noise in
the process (such as variability over time, effects of environmental variables). It is
convenient to think of the errors as having mean zero, so that E(y_ij) = μ_i.
An alternative way is to write the model as an effects model. Writing

μ_i = μ + τ_i,  i = 1, 2, ..., α

so that

y_ij = μ + τ_i + ε_ij,  i = 1, 2, ..., α;  j = 1, 2, ..., n

In this form of the model, μ is a parameter common to all treatments called
the overall mean, and τ_i is a parameter unique to the ith treatment called the ith
treatment effect.
Both the means model and the effects model are linear statistical models; that is,
the response variable y_ij is a linear function of the model parameters. Although
both forms of the model are useful, the effects model is more widely encountered in
the experimental design literature. It has some intuitive appeal in that μ is a constant
and the treatment effects τ_i represent deviations from this constant when the
specific treatments are applied. Both models are also called the one-way or single-
factor analysis of variance (ANOVA) model because only one factor is
investigated. Furthermore, we will require that the experiment be performed in
random order so that the environment in which the treatments are applied (often
called the experimental units) is as uniform as possible. Thus, the experimental
design is a completely randomized design. Our objectives will be to test
appropriate hypotheses about the treatment means and to estimate them. For
hypothesis testing, the model errors are assumed to be normally and independently
distributed random variables with mean zero and variance σ². The
variance σ² is assumed to be constant for all levels of the factor. This implies that
the observations

y_ij ~ N(μ + τ_i, σ²)

and that the observations are mutually independent.
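A completely randomized single-factor experiment can be analyzed with a one-way ANOVA F-test. A minimal sketch on simulated data, where the group means 5, 5 and 7 and the group size of 25 are invented values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Three treatment levels, 25 units each; the third mean is shifted (assumed values).
g1 = rng.normal(5.0, 1.0, size=25)
g2 = rng.normal(5.0, 1.0, size=25)
g3 = rng.normal(7.0, 1.0, size=25)

# One-way ANOVA F-test of H0: all treatment means are equal.
f_stat, p_value = stats.f_oneway(g1, g2, g3)
```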
Fixed or random factor?
The effects model describes two different situations with respect to the treatment
effects. First, the α treatments could have been specifically chosen by the experimenter.
In this situation, we wish to test hypotheses about the treatment means, and our
conclusions will apply only to the factor levels considered in the analysis. The
conclusions cannot be extended to similar treatments that were not explicitly
considered. We may also wish to estimate the model parameters (μ, τ_i, σ²).
This is called the fixed effects model.
Alternatively, the α treatments could be a random sample from a larger
population of treatments. In this situation, we should like to be able to extend the
conclusions (which are based on the sample of treatments) to all treatments in the
population, whether or not they were explicitly considered in the analysis. Here,
the τ_i are random variables, and knowledge about the particular ones investigated
is relatively useless. Instead, we test hypotheses about the variability of the τ_i and
try to estimate this variability. This is called the random effects
model or components of variance model.
y_ij = μ + τ_i + ε_ij
and confidence intervals on the treatment means.
If these assumptions are valid, the analysis of variance procedure is an exact test of
the hypothesis of no difference in treatment means.
In practice, however, these assumptions will usually not hold exactly. Consequently,
it is usually unwise to rely on the analysis of variance until the validity of these
assumptions has been checked. Violations of the basic assumptions and model
adequacy can be easily investigated by the examination of residuals. Examination
of the residuals should be an automatic part of any analysis of variance. If the model
is adequate, the residuals should be structureless; that is, they should contain no
obvious patterns. Through analysis of residuals, many types of model inadequacies
and violations of the underlying assumptions can be discovered.
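For the one-way model, the residuals are simply each observation minus its treatment mean. A minimal sketch (the two small groups below are made-up numbers):

```python
import numpy as np

def residuals_one_way(groups):
    """Residuals e_ij = y_ij - ybar_i: each observation minus the mean
    of its own treatment group."""
    return [np.asarray(g, dtype=float) - np.mean(g) for g in groups]

# Toy data; in practice, plot the residuals against the fitted values
# (the group means) and on a normal probability plot to look for structure.
res = residuals_one_way([[3.0, 4.0, 5.0], [10.0, 12.0, 14.0]])
```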
In general, moderate departures from normality are of little concern in the fixed
effects analysis of variance. An error distribution that has considerably thicker or
thinner tails than the normal is of more concern than a skewed distribution. Because
the F-test is only slightly affected, we say that the ANOVA is robust to the normality
assumption. Departures from normality usually cause both the true significance level
and the power to differ slightly from the advertised values, with the power generally
being lower.
A very common defect that often shows up on normal probability plots is one residual
that is very much larger than any of the others. Such a residual is often called
an outlier. The presence of one or more outliers can seriously distort the analysis of
variance, so when a potential outlier is located, careful investigation is called for.
Frequently, the cause of the outlier is a mistake in calculations or a data coding or
copying error.
In some experiments, we may find that the difference in response between the levels
of one factor is not the same at all levels of the other factors. When this occurs, there
is an interaction between the factors.
Two‐factor interaction graphs such as these are frequently very useful in interpreting
significant interactions and in reporting results to nonstatistically trained personnel.
However, they should not be utilized as the sole technique of data analysis because
their interpretation is subjective and their appearance is often misleading.
There is another way to illustrate the concept of interaction. Suppose that both of our
design factors are quantitative (such as temperature, pressure, time). Then
a regression model representation of the two‐factor factorial experiment could
be written as
y = β₀ + β₁x₁ + β₂x₂ + β₁₂x₁x₂ + ε

where y is the response, the β's are parameters whose values are to be
determined, x₁ is a variable that represents factor A, x₂ is a variable that
represents factor B, and ε is a random error term. The variables x₁ and x₂ are
defined on a coded scale from −1 to +1 (the low and high levels of A and B),
and x₁x₂ represents the interaction between x₁ and x₂.
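With the factors in coded −1/+1 units, this regression model can be fit by least squares. A sketch on a noise-free 2×2 factorial, where the true coefficients (10, 2, 3, 1.5) are invented for illustration:

```python
import numpy as np

# The four runs of a 2x2 factorial in coded -1/+1 units.
runs = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1]], dtype=float)
x1, x2 = runs[:, 0], runs[:, 1]
# Noise-free response from y = 10 + 2*x1 + 3*x2 + 1.5*x1*x2 (assumed values).
y = 10.0 + 2.0 * x1 + 3.0 * x2 + 1.5 * x1 * x2

# Columns: intercept, x1, x2, and the interaction x1*x2.
X = np.column_stack([np.ones(4), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta recovers (10, 2, 3, 1.5); a nonzero last entry signals interaction.
```

Because the coded design columns are orthogonal, the least-squares fit recovers each coefficient independently of the others.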
The 2k factorial design
Introduction
Factorial designs are widely used in experiments involving several factors where it is
necessary to study the joint effect of the factors on a response.
The most important of these special cases is that of k factors, each at only two
levels. These levels may be quantitative, such as two values of temperature,
pressure, or time; or they may be qualitative, such as two machines, two operators,
the "high" and "low" levels of a factor, or perhaps the presence and absence of a
factor. A complete replicate of such a design requires

2 × 2 × ⋯ × 2 = 2^k

observations and is called a 2^k factorial design.
The 2^k design is:
particularly useful in the early stages of experimental work when many factors
are likely to be investigated.
It provides the smallest number of runs with which k factors can be studied in
a complete factorial design.
Consequently, these designs are widely used in factor screening experiments
(where the experiment is intended to discover the set of active factors
from a large group of factors).
It is also easy to develop effective blocking schemes for these designs and to
run them in fractional versions.
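Generating the runs of a 2^k design is straightforward. A minimal sketch in coded −1/+1 units:

```python
from itertools import product

def full_factorial_2k(k):
    """All 2**k runs of a two-level full factorial, in coded -1/+1 units."""
    return [list(run) for run in product([-1, 1], repeat=k)]

design = full_factorial_2k(3)  # 2^3 = 8 runs, from [-1, -1, -1] to [1, 1, 1]
```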
The 2² Design
The general 2^k Design
The methods of analysis that we have presented thus far may be generalized
to the case of a 2^k factorial design with k factors at two levels.
The statistical model for a 2^k design would include k main effects,
C(k, 2) two-factor interactions,
C(k, 3) three-factor interactions, ...,
and one k-factor interaction.
The complete model would contain 2^k − 1 effects for a 2^k design.
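These effect counts follow from the binomial coefficients; a quick check in code:

```python
from math import comb

def effect_counts(k):
    """Number of i-factor effects in a 2**k design: C(k, i) for i = 1..k."""
    return {i: comb(k, i) for i in range(1, k + 1)}

counts = effect_counts(4)     # {1: 4, 2: 6, 3: 4, 4: 1}
total = sum(counts.values())  # 15 == 2**4 - 1
```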
Remember!