Optimal Significance Level and Sample Size in Hypothesis Testing 1 - Tests of Means

Vol. 6, 2021-06
Hugo Hernandez
ForsChem Research, 050030 Medellin, Colombia
hugo.hernandez@forschem.org
doi: 10.13140/RG.2.2.18643.09762
Abstract
There are two very important steps in hypothesis testing that are commonly undervalued:
Selecting the sample size and choosing the significance level. A minimum sample size can be
obtained based on power analysis, but this method requires previously choosing the
significance level. The significance level has been classically considered to be 5% (following the
initial suggestion provided by Ronald Fisher) and while other typical values are sometimes
employed (e.g. 10%, 1%, 0.1%) many practitioners of hypothesis testing do not have a clear,
objective criterion for choosing this value. Considering that the significance level has a direct
influence on the conclusion of the test, it should not be chosen using subjective methods. In
this direction, a new approach for determining the optimal sample size and optimal significance
level during hypothesis testing is presented. The focus in the first of a series of reports about
this topic is discussing the tests of means: The Z-test and Student’s T-test. The optimization of
the tests is done considering a desired Cohen resolution, which can be obtained from the
particular problem conditions or by using default values based on the minimum viability of each
test. The optimal sample size can be found by solving an economical optimization problem,
involving the ratio between the cost of an erroneous decision and the cost of an individual
observation. When this cost ratio is not available, a maximum tolerable test error can be used
for optimizing the sample size. Once the sample size has been defined, the optimal significance
level is found by minimizing the total test error, including both type I and type II errors. The
hypothesis tests are then performed as usual. In order to simplify the interpretation of the
tests, a decision value (D-value) is proposed. Positive D-values represent a positive decision
(rejecting the null hypothesis), and negative D-values represent a negative decision (not
rejecting the null hypothesis). D-values within a critical threshold around zero are considered
inconclusive. Decisions based on D-values are shown to be less fluctuating than those based on
P-values, and therefore, the former can lead to more reliable conclusions.
20/05/2021 ForsChem Research Reports Vol. 6, 2021-06 (1 / 45)

www.forschem.org
Keywords
Decision Value, Hypothesis Testing, Normality, Means, Optimization, P-value, Power, Sample
Size, Significance Level, Statistical Tests, T-test, Test Error, Z-test
1. Introduction
Hypothesis testing is “a statistical analysis that uses sample data to assess two mutually exclusive
theories about the properties of a population.” [1] Those theories, usually denoted as
hypotheses, are associated with a particular research question about the behavior of a subject
population when observed under specific conditions [2]. The two mutually exclusive theories
are denoted as the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$). The null hypothesis
is the default theory, usually associated with the absence of an effect of the particular conditions
considered on the observed behavior of the population. The alternative hypothesis, on the
other hand, is the challenging theory and represents a significant effect of the observation
conditions on the properties of the population.
Statistical tools are required for the validation of hypotheses since sampling data
from a population always introduces uncertainty. Thus, it is necessary to draw a conclusion
using approximate values of the population properties, which are inferred with limited
confidence and accuracy. The confidence level of the test has an important effect on the risk of error
inherently associated with the validation process. Two types of errors are possible:
• Type I errors (false positives): Resulting when the null hypothesis is erroneously
rejected. In other words, the uncertainty in the data sample created an artificial effect
which was considered significant. The probability of occurrence of false positives is the
significance level ($\alpha$), which is related to the confidence level according to the
following expression:
$$\text{confidence level} = 1 - \alpha \tag{1.1}$$
In other words, the confidence level represents the probability of correctly not rejecting
the null hypothesis when it is true.
• Type II errors (false negatives): Resulting when the null hypothesis is erroneously not
rejected (i.e. the alternative hypothesis is erroneously rejected). In this case, the
uncertainty in the data sample does not allow the effect to be clearly observed, and it is
simply confounded with noise. The probability of occurrence of false negatives is
denoted by $\beta$, whereas the power of the test, defined as the probability of correctly
rejecting the null hypothesis when the alternative hypothesis is true, is:
$$\text{power} = 1 - \beta \tag{1.2}$$
$\beta$ is also related to the confidence level of the test, although in a more complex way, as will be
explained in the following Sections. In general, as the confidence level increases, $\beta$ also
increases. If zero false positive errors are obtained in a test (100% confidence), then the
probability of false negatives becomes 100% (0% power). Similarly, if zero false negative
errors are obtained (100% power), the confidence of the test drops to zero (0% confidence).
Clearly, a balance between test confidence and test power is required.
While a compromise between confidence and power exists, it is possible to decrease both
error probabilities simultaneously by increasing the sample size (reducing the uncertainty introduced by
sampling). Thus, the larger the sample size, the lower the probability of erroneous decisions
will be. However, larger sample sizes are usually associated with higher experimental costs (not
only in terms of money but also in terms of time and resources). Therefore, the optimal sample
size will be determined by the following cost-minimization problem:
$$\min_{n} C(n) = n\,C_o + \varepsilon(n)\,C_e \tag{1.3}$$
where $n$ is the size of the sample, $C$ is the total cost of the test, $C_o$ is the cost of performing
a single observation, $\varepsilon$ is the total test error probability, which is a function of $n$, and
$C_e$ is the cost of not obtaining the correct conclusion from the test (including reaching an
erroneous conclusion or not reaching a conclusion at all). These costs, of course, are different
for each particular problem or research question considered.
The behavior of the total test error as a function of sample size depends on the particular test
considered. Thus, in this report, the optimal significance level and sample size in parametric
tests of means will be obtained. The tests considered are the one-tailed and two-tailed Z-tests,
and the paired and unpaired two-sample T-tests (also one-tailed and two-tailed). Other tests of
hypotheses will be covered in upcoming reports.
2. Z-Tests
The Z-test is used to test the mean value of a normal population with known variance. Knowing
the variance of a population is seldom the case in practice. Nevertheless, the one-sample
version of this test is explained in detail in order to clarify the concepts involved in the
optimization of significance level and sample size. Different types of Z-tests can be used,
depending on the nature of the alternative hypothesis: One-tailed (right-tail and left-tail) and
two-tailed Z-tests.

2.1. One-tailed Z-Test
Let us first consider the one-sample right-tail Z-test. The set of hypotheses to be tested is the
following:
$$H_0: \mu = \mu_0 \qquad H_1: \mu > \mu_0 \tag{2.1}$$
where $\mu$ is the true mean value of the property observed in the population, and $\mu_0$ is an
arbitrary reference value (considered as a potential true value of $\mu$). By considering this test, it
is clear that the third possibility ($\mu < \mu_0$) has been previously discarded. This is the case when the
sample average $\bar{x}$ is found to be:
$$\bar{x} > \mu_0 \tag{2.2}$$
In this case, the true mean ($\mu$) can be greater than or even equal to $\mu_0$ (when the difference
is not statistically significant), but it definitely cannot be significantly less than $\mu_0$.
If the null hypothesis were true, the observed sample average should be obtained from a
normal distribution with mean $\mu_0$ and standard deviation given by:
$$\sigma_{\bar{x}} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} \tag{2.3}$$
where $\sigma$ is the known standard deviation of the population of individual observations and $n$ is the sample size.
Since the range of possible values for a normal distribution is $(-\infty, \infty)$, any value observed for
the sample average might have been obtained from the distribution described by the null
hypothesis. However, as the observed value deviates from the hypothetical mean, the
probability of being obtained from the hypothetical distribution decreases. Thus, a confidence
interval of possible sample averages can be constructed in order to determine whether a certain
observed sample average is likely to have been obtained from the hypothetical distribution or not,
considering a confidence level $1-\alpha$. If the observed sample average does not belong to
the confidence interval, it is more likely to have been obtained from an alternative distribution.
The right-tail Z-test of hypothesis is graphically represented in Figure 1. This plot shows the
normal probability density function and unilateral confidence interval (blue region) of the
sample average when the population mean is assumed to be the reference value $\mu_0$, and the
standard deviation is obtained from Eq. (2.3). The red region represents the interval of sample
average values less likely to be obtained from the distribution given by the null hypothesis
(rejection interval).

Figure 1. Probability density distribution of sample average values in a right-tail one-sample Z-test.
Solid blue line: Hypothetical distribution ($\mu = \mu_0$). Dashed red line: Observed sample
average ($\bar{x}$). Blue shade: Unilateral confidence interval for the estimation of $\bar{x}$. Red shade: Null
hypothesis rejection interval.
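The unilateral confidence interval for the distribution of sample average values ($\bar{x}$) is:
$$\bar{x} \in (-\infty, \bar{x}_c] \tag{2.4}$$
where $\bar{x}_c$ is a critical sample average value, such that:
$$P(\bar{x} \le \bar{x}_c) = 1 - \alpha \tag{2.5}$$
The normal distribution of sample average values given by the null hypothesis can be converted
into the standard normal distribution using the following transformation:
$$z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \tag{2.6}$$
Thus, Eq. (2.5) becomes:
$$P\left(z \le \frac{\bar{x}_c - \mu_0}{\sigma/\sqrt{n}}\right) = P(z \le z_c) = 1 - \alpha \tag{2.7}$$
where $z_c$ is the critical value of the Z-distribution. Calculating the probability in Eq. (2.7) using
the cumulative probability function of the standard normal distribution, it can be found that:
$$z_c(\alpha) = \sqrt{2}\,\mathrm{erf}^{-1}(1 - 2\alpha) \tag{2.8}$$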
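As a minimal sketch of the critical-value calculation, the inverse cumulative distribution of the standard normal can be evaluated with the Python standard library, which is equivalent to the inverse error function form. The function names and the numerical inputs below are illustrative assumptions, not part of the original report.

```python
from statistics import NormalDist


def critical_z(alpha: float) -> float:
    """Critical z-score z_c of a right-tail test, so that P(z <= z_c) = 1 - alpha."""
    return NormalDist().inv_cdf(1.0 - alpha)


def critical_sample_average(alpha: float, mu0: float, sigma: float, n: int) -> float:
    """Critical sample average: x_c = mu0 + z_c * sigma / sqrt(n)."""
    return mu0 + critical_z(alpha) * sigma / n ** 0.5


# Hypothetical example: reference mean 10, known sigma 2, 16 observations.
print(critical_z(0.05))
print(critical_sample_average(0.05, mu0=10.0, sigma=2.0, n=16))
```

Any observed sample average above the returned critical value would fall in the rejection interval of the right-tail test.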

We have then two possibilities: i) Rejecting the null hypothesis, or ii) Rejecting the alternative
hypothesis. By rejecting the null hypothesis we are concluding that the observed sample
average is significantly larger than the reference value assumed. The probability of erroneously
rejecting the null hypothesis (if it were truly valid) is then given by the area of the rejection
region (cf. red region in Figure 1):
$$\alpha = P(z > z_c) \tag{2.9}$$
By rejecting the alternative hypothesis we are concluding that the observed sample average is
not significantly larger than the reference value assumed. The probability of erroneously
rejecting the alternative hypothesis depends on the true distribution of the population. Let us
assume that the true population mean is
$$\mu = \mu_1 > \mu_0 \tag{2.10}$$
then, the probability of type II errors will be:
$$\beta = P(\bar{x} \le \bar{x}_c) = P\left(z \le \frac{\bar{x}_c - \mu_1}{\sigma/\sqrt{n}}\right) = P\left(z \le z_c - \delta\sqrt{n}\right) \tag{2.11}$$
where now $z = \frac{\bar{x} - \mu_1}{\sigma/\sqrt{n}}$, and
$$\delta = \frac{\mu_1 - \mu_0}{\sigma} \tag{2.12}$$
is the positive dimensionless difference between hypotheses, also denoted as Cohen distance
[3]. Using the normal cumulative probability function we obtain:
$$\beta = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\left(\frac{z_c - \delta\sqrt{n}}{\sqrt{2}}\right) = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\left(\mathrm{erf}^{-1}(1 - 2\alpha) - \delta\sqrt{\frac{n}{2}}\right) \tag{2.13}$$
Figure 2 shows the graphical representation of both types of error in a one-sample right-tail Z-
test. Changing the significance level will change the probability of both types of errors.
Changing the sample size or the Cohen distance will change only the type II error.
The total error probability ($\varepsilon$) of the right-tail Z-test will then be given by:
$$\varepsilon = \alpha + \beta = \alpha + \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\left(\mathrm{erf}^{-1}(1 - 2\alpha) - \delta\sqrt{\frac{n}{2}}\right) \tag{2.14}$$
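The two error probabilities of the right-tail Z-test can be sketched numerically as follows. The code uses the normal cumulative distribution directly, which is equivalent to the error function form; the function names and the example values of the significance level, Cohen distance and sample size are assumptions chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def type_ii_error(alpha: float, delta: float, n: int) -> float:
    """Type II error of a right-tail Z-test: P(z <= z_c - delta * sqrt(n))."""
    z_c = _STD_NORMAL.inv_cdf(1.0 - alpha)
    return _STD_NORMAL.cdf(z_c - delta * sqrt(n))


def total_error(alpha: float, delta: float, n: int) -> float:
    """Total test error: sum of type I (alpha) and type II error probabilities."""
    return alpha + type_ii_error(alpha, delta, n)


# Hypothetical example: alpha = 5%, Cohen distance 0.5, 16 observations.
print(type_ii_error(0.05, 0.5, 16))
print(total_error(0.05, 0.5, 16))
```

Evaluating `total_error` for several values of `alpha` at a fixed `delta` and `n` reproduces the trade-off between both error types.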

Figure 2. Graphical representation of test errors in a one-sample right-tail Z-test. Blue line:
Distribution given by the null hypothesis ($\mu = \mu_0$). Green line: Distribution given by the
alternative hypothesis ($\mu = \mu_1$). Black line: Critical value. Red shade: Type I errors. Orange
shade: Type II errors.
The effect of $\alpha$ and $\delta\sqrt{n}$ on the total error is illustrated in Figure 3. For each value of $\delta\sqrt{n}$ an
optimal significance level ($\alpha^*$) emerges which results in a minimum total test error ($\varepsilon^*$).
Figure 3. Effect of significance level ($\alpha$) and the term $\delta\sqrt{n}$ on the total error of a one-sample
right-tail Z-test.
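The optimal significance level ($\alpha^*$) can be found by solving:
$$\frac{\partial \varepsilon}{\partial \alpha} = 1 - \exp\left[\left(\mathrm{erf}^{-1}(1 - 2\alpha)\right)^2 - \left(\mathrm{erf}^{-1}(1 - 2\alpha) - \delta\sqrt{\frac{n}{2}}\right)^2\right] = 0 \tag{2.15}$$
resulting in:
$$\mathrm{erf}^{-1}(1 - 2\alpha^*) = \frac{\delta\sqrt{n}}{2\sqrt{2}} \tag{2.16}$$
and therefore:
$$\alpha^* = \frac{1}{2}\left[1 - \mathrm{erf}\left(\frac{\delta\sqrt{n}}{2\sqrt{2}}\right)\right] \tag{2.17}$$
$$\varepsilon^* = 1 - \mathrm{erf}\left(\frac{\delta\sqrt{n}}{2\sqrt{2}}\right) \tag{2.18}$$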
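The closed-form optimum can be checked against a direct numerical minimization of the total error. The sketch below is only an illustration: the grid search, the function names and the example values are assumptions, and the closed form is written with the complementary error function, $1 - \mathrm{erf}(x) = \mathrm{erfc}(x)$.

```python
from math import erfc, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def optimal_alpha(delta: float, n: int) -> float:
    """Closed-form optimal significance level: 0.5 * erfc(delta*sqrt(n) / (2*sqrt(2)))."""
    return 0.5 * erfc(delta * sqrt(n) / (2.0 * sqrt(2.0)))


def minimum_total_error(delta: float, n: int) -> float:
    """Minimum total error: type I and type II errors are equal at the optimum."""
    return 2.0 * optimal_alpha(delta, n)


def total_error(alpha: float, delta: float, n: int) -> float:
    """Total error of the right-tail Z-test for an arbitrary significance level."""
    z_c = _STD_NORMAL.inv_cdf(1.0 - alpha)
    return alpha + _STD_NORMAL.cdf(z_c - delta * sqrt(n))


def grid_search_alpha(delta: float, n: int, steps: int = 5000) -> float:
    """Brute-force check: best alpha on a uniform grid over (0, 0.5]."""
    candidates = [i / (2.0 * steps) for i in range(1, steps + 1)]
    return min(candidates, key=lambda a: total_error(a, delta, n))


# Hypothetical example: Cohen distance 0.5 and 16 observations.
print(optimal_alpha(0.5, 16), grid_search_alpha(0.5, 16))
print(minimum_total_error(0.5, 16))
```

For this example both approaches give a significance level close to 16%, far from the conventional 5%, because the product $\delta\sqrt{n}$ is small.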
Eq. (2.18) indicates that for the right-tail Z-test the minimum total error is obtained when the
probability of type I and type II errors are identical, and that particular value depends on the
sample size and the dimensionless difference between hypotheses. For simplicity, Eq. (2.18) can
be approximated using the following expression:
√ ( √ )
(2.19)
The effect of $\delta\sqrt{n}$ on $\varepsilon^*$ is presented in Figure 4, considering both the exact Eq. (2.18) and the
approximated Eq. (2.19), using different scales for a better comparison.
Figure 4. Effect of the term $\delta\sqrt{n}$ on the optimal total error of a one-sample right-tail Z-test
described by the exact expression (Eq. 2.18, solid red line) and an approximation (Eq. 2.19,
dashed blue line). Top left: Original scales. Top right: Decimal logarithm transformation in $\varepsilon^*$.
Bottom left: Decimal logarithm transformation in $\delta\sqrt{n}$. Bottom right: Decimal logarithm
transformation in both axes.

Performing a test of hypothesis knowing that the minimum test error will be greater than 50%
is not viable. For this test, such limit is given by $\delta\sqrt{n} = 2\sqrt{2}\,\mathrm{erf}^{-1}(0.5) \approx 1.349$. On the other hand,
performing tests where the total test error is expected to be below would be highly
desirable (that is erroneous conclusion out of trials). Such performance can be
achieved for this test when √ . Such improvement in performance can be obtained by
increasing the sample size. However, this will have an additional cost as described by Eq. (1.3).
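The optimization problem can be solved by replacing Eq. (2.18) in Eq. (1.3):
$$\min_{n} C(n) = n\,C_o + C_e\left(1 - \mathrm{erf}\left(\frac{\delta\sqrt{n}}{2\sqrt{2}}\right)\right) \tag{2.20}$$
Assuming that the cost of each observation ($C_o$) and the cost of an erroneous conclusion ($C_e$) are
independent of the sample size, then the solution of the optimal sample size ($n^*$) is given by:
$$\sqrt{n^*}\,\exp\left(\frac{\delta^2 n^*}{8}\right) = \frac{\delta}{2\sqrt{2\pi}}\,\frac{C_e}{C_o} \tag{2.21}$$
This expression leads to an implicit solution for $n^*$, which can be iteratively solved using the
following recurrent expression:
$$n_{k+1} = \frac{8}{\delta^2}\ln\left(\frac{\delta}{2\sqrt{2\pi n_k}}\,\frac{C_e}{C_o}\right) \tag{2.22}$$
where $k$ represents the corresponding iteration. After the iteration converges (considering a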
tolerance of ), the optimal sample size is found by rounding the value obtained to the
closest integer. This expression easily converges when √ . When the iteration does
not converge, different values of can be tested until a minimum value is obtained. When
the resulting value is √ , no optimal sample size is considered as the test becomes
unviable due to the large decision error.
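The iterative solution can be sketched as a fixed-point iteration, cross-checked by exhaustively testing integer sample sizes. This is an illustrative sketch under stated assumptions: the recurrence is written as $n_{k+1} = (8/\delta^2)\ln\left[\delta\,(C_e/C_o) / (2\sqrt{2\pi n_k})\right]$, which follows from setting the derivative of the total cost to zero, and the starting value, tolerance, cost ratio and function names are hypothetical choices.

```python
from math import erfc, log, pi, sqrt


def total_cost(n: int, delta: float, cost_ratio: float) -> float:
    """Total cost in units of the cost of one observation: n + (C_e/C_o) * error(n)."""
    return n + cost_ratio * erfc(delta * sqrt(n) / (2.0 * sqrt(2.0)))


def optimal_n_fixed_point(delta, cost_ratio, n0=2.0, tol=1e-6, max_iter=200):
    """Fixed-point iteration for the continuous optimal sample size.

    Returns None when the iteration leaves the valid domain or does not converge,
    in which case different sample sizes have to be tested directly.
    """
    n = n0
    for _ in range(max_iter):
        arg = cost_ratio * delta / (2.0 * sqrt(2.0 * pi * n))
        if arg <= 1.0:  # the logarithm would give a non-positive sample size
            return None
        n_next = 8.0 / delta ** 2 * log(arg)
        if abs(n_next - n) < tol:
            return n_next
        n = n_next
    return None


def optimal_n_brute_force(delta, cost_ratio, n_max=1000):
    """Integer sample size with the lowest total cost, found by exhaustive search."""
    return min(range(1, n_max + 1), key=lambda n: total_cost(n, delta, cost_ratio))


# Hypothetical example: Cohen distance 1 and a cost ratio of 1000.
n_continuous = optimal_n_fixed_point(1.0, 1000.0)
print(n_continuous, optimal_n_brute_force(1.0, 1000.0))
```

Under this form of the recurrence, the derivative of the iteration map has magnitude $4/(\delta^2 n)$, so the iteration contracts near the solution whenever $\delta^2 n > 4$. The exhaustive search is a simple fallback for the remaining cases.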
Figure 5 shows the effect of Cohen distance ($\delta$) and the relative cost ratio ($C_e/C_o$) on the optimal
sample size and the corresponding test error.
Figure 5. Effect of Cohen distance ($\delta$) and cost ratio ($C_e/C_o$) on the optimal sample size (left
plot), and on the optimal test error (right plot) for a one-sample right-tail Z-test.
If the cost ratio ($C_e/C_o$) for a particular problem is unknown, it can be estimated as the
maximum number of observations or experiments that can be carried out considering the
budget, time or resources available for the test. Alternatively, a maximum total error tolerance
($\varepsilon_{max}$) can be established for the test, corresponding to an equivalent cost ratio:
$$\frac{C_e}{C_o} = \frac{8\sqrt{\pi}\,\mathrm{erf}^{-1}(1 - \varepsilon_{max})\,\exp\left[\left(\mathrm{erf}^{-1}(1 - \varepsilon_{max})\right)^2\right]}{\delta^2} \tag{2.23}$$
where the inverse error function can be approximated by (from Eq. 2.18 and 2.19):
( ) √
(2.24)
Then, a simplified approximation for low error tolerances can be obtained (shown in Figure 6):
( )
( )
(2.25)
Figure 6. Equivalent cost ratio determined from a maximum total error tolerance using Eq.
(2.23) (solid blue line) and Eq. (2.25) (dashed red line).
This equivalence can be used as long as the Cohen distance is known. However, the Cohen
distance is never known a priori because the true mean of the population is unknown. If the
true mean were known, the statistical test would be unnecessary.

In order to solve this issue, let us notice that the test error always decreases as the Cohen
distance increases. Thus, if a minimum acceptable Cohen difference is selected, any true
distance larger than this will be favorable. On the other hand, if the true distance is smaller, the
test error might become unacceptable. In this sense, the minimum acceptable Cohen
difference can be interpreted as the resolution of the test. Below this resolution threshold,
differences in mean values cannot be detected. An analogy can be used to better understand
the Cohen resolution. Let us consider a temperature sensor with a measurement scale of one degree.
Any temperature value observed with this sensor will be rounded to the closest integer (in the
case of digital thermometers) or half degrees (in analog thermometers). Therefore, differences
in temperature smaller than one degree (or half a degree in analog thermometers) cannot be detected.
Based on this analogy, the following Cohen distance based on measurement resolution ($\delta_r$)
can be considered:
$$\delta_r = \frac{r}{\sigma_x} = \frac{r}{\sqrt{\sigma^2 + r^2/12}} \tag{2.26}$$
where $r$ is the measurement resolution of $x$, $\sigma_x$ is the apparent standard deviation of the
measured variable $x$, $\sigma$ is the true standard deviation of $x$, and the measurement uncertainty is
assumed to be uniform in the range of the measurement resolution. Either $\sigma_x$ or $\sigma$ is assumed
to be known, which is a condition of the Z-test. The maximum value possible of $\delta_r$ (determined
in this way) is obtained when $\sigma = 0$, corresponding to $\delta_r = \sqrt{12}$.
When the variability of the observed property is much larger than the measurement
uncertainty, the Cohen distance based on measurement resolution might lead to unreliable
tests, particularly when √ . Thus, a different approach for determining the Cohen
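distance is possible. The minimum sample size required for performing a Z-test is $n = 2$, as the
estimation of the mean based on a single observation is unreliable. Now, based on the lower
limit for test viability ($\delta\sqrt{n} = 2\sqrt{2}\,\mathrm{erf}^{-1}(0.5)$) we can define a Cohen distance based on the minimum sample size as
follows:
$$\delta = \frac{2\sqrt{2}\,\mathrm{erf}^{-1}(0.5)}{\sqrt{2}} = 2\,\mathrm{erf}^{-1}(0.5) \approx 0.954 \tag{2.27}$$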
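The viability limit and the resulting Cohen resolution can be evaluated with a short sketch. It relies on the identity $\varepsilon^* = 2\,\Phi(-\delta\sqrt{n}/2)$, which is equivalent to the error function form of the minimum total error of the one-tailed Z-test. The function names, the default arguments and the 5% tolerance in the example are illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def min_delta_sqrt_n(max_error: float) -> float:
    """Smallest delta*sqrt(n) for which the optimal total error is at most max_error."""
    return 2.0 * NormalDist().inv_cdf(1.0 - max_error / 2.0)


def cohen_resolution(n_min: int = 2, max_error: float = 0.5) -> float:
    """Cohen distance at which a test with n_min observations is just viable."""
    return min_delta_sqrt_n(max_error) / sqrt(n_min)


def min_sample_size(delta: float, max_error: float) -> int:
    """Smallest integer sample size reaching the error tolerance for a given resolution."""
    return ceil((min_delta_sqrt_n(max_error) / delta) ** 2)


print(min_delta_sqrt_n(0.5))        # viability limit of the one-tailed Z-test
print(cohen_resolution())           # resolution based on a minimum sample size of 2
print(min_sample_size(0.5, 0.05))   # hypothetical tolerance of 5% total error
```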
Figure 7 and Figure 8 show the optimal sample size and significance level obtained for this
Cohen resolution based on and considering different cost ratios. The corresponding
empirical models obtained for describing the behavior of the optimal values are presented in
Eq. (2.28) and (2.29).

Figure 7. Optimal sample size of a right-tail Z-test considering (based on

). Solid blue line: Values obtained by iteratively solving Eq. (2.22). Dashed green line:
Empirical model Eq. (2.28).
⟦ ( ( )) ( ( )) ( ( )) ( )⟧
(2.28)
( )
(2.29)
Figure 8. Optimal significance level of a right-tail Z-test considering (based on

). Solid blue line: Values obtained using Eq. (2.17) after numerically optimizing the
sample size. Dashed green line: Empirical model Eq. (2.29).
In any case, a different Cohen resolution value can be assumed in order to determine the
optimal significance level and sample size. The resolution value selected will determine not only
the optimal sample size and significance level of the test, but also the least significant
difference (LSD) between the sample average and the hypothetical mean required to reject the
null hypothesis:
$$LSD = \bar{x}_c - \mu_0 = z_c^*\,\frac{\sigma}{\sqrt{n}} = \frac{\delta\,\sigma}{2} \tag{2.30}$$

Furthermore, the sample size used for the test does not necessarily have to be the optimal one.
However, changing the sample size will affect the optimal significance level of the test,
according to Eq. (2.17).
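For the left-tail Z-test, similar results can be obtained following the previous procedure. In this
case, the set of hypotheses tested is:
$$H_0: \mu = \mu_0 \qquad H_1: \mu < \mu_0 \tag{2.31}$$
whereas the probability of both types of error will be:
$$\alpha = P(z < -z_c) = P(z > z_c) \tag{2.32}$$
$$\beta = \frac{1}{2} + \frac{1}{2}\,\mathrm{erf}\left(\mathrm{erf}^{-1}(1 - 2\alpha) - \delta\sqrt{\frac{n}{2}}\right) = P\left(z > -z_c + \delta\sqrt{n}\right) \tag{2.33}$$
where $\delta$ is again the dimensionless positive difference between hypothetical means, in this
case, $\delta = (\mu_0 - \mu_1)/\sigma$.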
The error probabilities are graphically represented in Figure 9. Only the direction of the plot
changes with respect to Figure 2, but the magnitudes remain identical, due to the symmetry of
the normal distribution. For this same reason, all the results previously obtained for the right-
tail Z-test remain valid for the left-tail Z-test.
Figure 9. Graphical representation of test errors in a one-sample left-tail Z-test. Blue line:
Distribution given by the null hypothesis ($\mu = \mu_0$). Green line: Distribution given by the
alternative hypothesis ($\mu = \mu_1$). Black line: Critical value. Red shade: Type I errors. Orange
shade: Type II errors.
2.2. Two-tailed Z-Test
Let us now consider the case of a two-tailed Z-test. The set of hypotheses to be tested is:
$$H_0: \mu = \mu_0 \qquad H_1: \mu \ne \mu_0 \tag{2.34}$$
This test is used when there is no a priori knowledge of the position of the true mean $\mu$ with
respect to the hypothetical mean $\mu_0$, or simply when the default validity of the null hypothesis
needs to be verified. It can be considered as the simultaneous evaluation of both types of one-
tailed tests. The type I error is now symmetrically distributed at both ends of the bell, as can
be seen in Figure 10.
The probability of both types of error is given in this case by the following expressions
(considering the symmetry of the normal distribution):
$$\alpha = P(z < -z_c) + P(z > z_c) = 2\,P(z > z_c) \tag{2.35}$$
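Figure 10. Graphical representation of test errors in a one-sample two-tailed Z-test. Blue line:
Distribution given by the null hypothesis ($\mu = \mu_0$). Green lines: Distributions given by the
alternative hypotheses ($\mu < \mu_0$ and $\mu > \mu_0$). Black line: Critical value. Red shade: Type I
errors. Orange shade: Type II errors.
$$\beta = \frac{1}{2}\,\mathrm{erf}\left(\frac{z_c}{\sqrt{2}} + \delta\sqrt{\frac{n}{2}}\right) + \frac{1}{2}\,\mathrm{erf}\left(\frac{z_c}{\sqrt{2}} - \delta\sqrt{\frac{n}{2}}\right) = \frac{1}{2}\,\mathrm{erf}\left(\mathrm{erf}^{-1}(1 - \alpha) + \delta\sqrt{\frac{n}{2}}\right) + \frac{1}{2}\,\mathrm{erf}\left(\mathrm{erf}^{-1}(1 - \alpha) - \delta\sqrt{\frac{n}{2}}\right) \tag{2.36}$$
where the Cohen distance is now defined as:
$$\delta = \frac{|\mu_1 - \mu_0|}{\sigma} \tag{2.37}$$
The total test error then becomes:
$$\varepsilon = \alpha + \frac{1}{2}\,\mathrm{erf}\left(\mathrm{erf}^{-1}(1 - \alpha) + \delta\sqrt{\frac{n}{2}}\right) + \frac{1}{2}\,\mathrm{erf}\left(\mathrm{erf}^{-1}(1 - \alpha) - \delta\sqrt{\frac{n}{2}}\right) \tag{2.38}$$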
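and therefore the optimal significance level can be obtained when the partial derivative of the
total error with respect to $\alpha$ is zero:
$$\cosh\left(\delta\sqrt{2n}\,\mathrm{erf}^{-1}(1 - \alpha^*)\right) = \exp\left(\frac{\delta^2 n}{2}\right) \tag{2.39}$$
That is,
$$\alpha^* = 1 - \mathrm{erf}\left(\frac{\cosh^{-1}\left(e^{\delta^2 n/2}\right)}{\delta\sqrt{2n}}\right) \tag{2.40}$$
$$\beta^* = \frac{1}{2}\,\mathrm{erf}\left(\frac{\cosh^{-1}\left(e^{\delta^2 n/2}\right)}{\delta\sqrt{2n}} + \delta\sqrt{\frac{n}{2}}\right) + \frac{1}{2}\,\mathrm{erf}\left(\frac{\cosh^{-1}\left(e^{\delta^2 n/2}\right)}{\delta\sqrt{2n}} - \delta\sqrt{\frac{n}{2}}\right) \tag{2.41}$$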
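The optimum of the two-tailed test can be sketched in terms of the critical z-score, which satisfies $\cosh(z_c\,\delta\sqrt{n}) = \exp(\delta^2 n/2)$ when the derivative of the total error is zero. To avoid overflow of the exponential for large $\delta\sqrt{n}$, the hyperbolic arccosine is rewritten below as $\cosh^{-1}(e^{d^2/2}) = d^2/2 + \ln\left(1 + \sqrt{1 - e^{-d^2}}\right)$ with $d = \delta\sqrt{n}$. The function names and example values are illustrative assumptions.

```python
from math import exp, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def optimal_critical_z(delta: float, n: int) -> float:
    """Critical z-score minimizing the total error of a two-tailed Z-test."""
    d = delta * sqrt(n)
    # acosh(exp(d^2 / 2)) written in an overflow-safe form.
    acosh_term = d * d / 2.0 + log(1.0 + sqrt(1.0 - exp(-d * d)))
    return acosh_term / d


def two_tailed_errors(z_c: float, delta: float, n: int):
    """Type I and type II error probabilities for a given critical z-score."""
    d = delta * sqrt(n)
    alpha = 2.0 * (1.0 - _STD_NORMAL.cdf(z_c))
    beta = _STD_NORMAL.cdf(z_c - d) - _STD_NORMAL.cdf(-z_c - d)
    return alpha, beta


# Hypothetical example: Cohen distance 0.5 and 16 observations.
z_opt = optimal_critical_z(0.5, 16)
alpha_opt, beta_opt = two_tailed_errors(z_opt, 0.5, 16)
print(z_opt, alpha_opt, beta_opt, alpha_opt + beta_opt)
```

For small values of $\delta\sqrt{n}$, as in this example, the optimal type I and type II errors still differ noticeably; they approach each other as $\delta\sqrt{n}$ grows.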
The two-tailed Z-test becomes unviable for √ where . In addition, the

total test error, which is , can be approximated using the following empirical
expression:
√ ( √ )
√
(2.42)
A graphical comparison of the optimal type I, type II and total test error between the one-tailed
and the two-tailed Z-test as functions of $\delta\sqrt{n}$ is presented in Figure 11. These plots show that
the one-tailed Z-test allows reaching lower test errors considering identical test resolutions and
sample sizes. The total test error plot includes a comparison with the empirical model (2.42)
obtained for the two-tailed Z-test. This approximation is suitable when an exact evaluation of
the error function or the hyperbolic arccosine function is not readily available. An additional plot is
included comparing the optimal $\alpha$ and $\beta$ values obtained for the two-tailed Z-test. This graph
clearly shows that although the expressions given by Eq. (2.40) and (2.41) are not equal, they
are practically identical (at least for viable tests, i.e. √ ):
( )
√ ( √ )
√
√
( )
(2.43)
Figure 11. Effect of the term $\delta\sqrt{n}$ on error probabilities for the Z-test. Top left: Optimal type I
error ($\alpha^*$). Top right: Optimal type II error ($\beta^*$). Bottom left: Optimal total test error ($\varepsilon^*$).
Bottom right: Comparison between type I and type II error for the two-tailed Z-test.
Figure 12 compares the optimal critical z-score obtained for each test. It can be observed that
the optimal critical values are practically identical for both types of Z-test considering the same
test resolution and sample size.
Now, the Cohen distance can be chosen either based on the resolution of the measuring
system (Eq. 2.26), or based on the test viability for :
√
(2.44)

Figure 12. Effect of the term $\delta\sqrt{n}$ on the optimal critical z-score ($z_c^*$) for the two-tailed (solid
blue line) and one-tailed Z-tests (dashed green line).
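The sample size optimization problem for the two-tailed Z-test becomes:
$$\min_{n} C(n) = n\,C_o + C_e\left[1 - \mathrm{erf}(u^*) + \frac{1}{2}\,\mathrm{erf}\left(u^* + \delta\sqrt{\frac{n}{2}}\right) + \frac{1}{2}\,\mathrm{erf}\left(u^* - \delta\sqrt{\frac{n}{2}}\right)\right], \qquad u^* = \frac{\cosh^{-1}\left(e^{\delta^2 n/2}\right)}{\delta\sqrt{2n}} \tag{2.45}$$
where $C_o$ is the cost of a single observation and $C_e$ is the cost of an erroneous conclusion.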
The numerical solution of the optimization problem (2.45) is presented in Figure 13. The results
obtained can be fitted using the empirical expression given by Eq. (2.46). From this result, the
optimal significance level for the two-tailed Z-test can be obtained (cf. Figure 14), which can be
approximated using Eq. (2.47).
Figure 13. Optimal sample size of a two-tailed Z-test considering (based on

). Solid blue line: Values obtained by numerically solving Eq. (2.45). Dashed green line:
Empirical model Eq. (2.46).

Figure 14. Optimal significance level of a two-tailed Z-test considering (based

on ). Solid blue line: Values obtained using Eq. (2.40) after numerically optimizing the
sample size. Dashed green line: Empirical model Eq. (2.47).
⟦ ( ( )) ( ( )) ( )⟧
(2.46)
( )
(2.47)
By comparing Figure 7 and Figure 13, one may think that the two-tailed Z-test is more efficient
than the one-tailed Z-test because the optimal sample size and significance level for the former
are smaller assuming the same cost ratio. However, the Cohen resolutions of the tests are
different. Smaller resolutions can be used for the two-tailed Z-test (for example using the
Cohen distance based on measurement resolution) resulting in different optimal sample sizes
and significance levels. However, for low cost ratio values the test may become unviable with
total test error probabilities larger than . For that reason, a safe Cohen resolution
( ) is proposed.
2.3. Practical Considerations
In some cases it is necessary to analyze a data sample which was not necessarily obtained
considering an optimal sample size (from the minimal cost point of view, according to Eq. 1.3).
In these cases, the optimal significance level (from the minimal test error point of view) can be
obtained using Eq. (2.17) for the one-tailed Z-test or Eq. (2.43) for the two-tailed Z-test.
Using a fixed significance value ($\alpha$) as a decision criterion, independently of the sample size, will
affect the equivalent resolution of the test ($\delta_{eq}$), for the one-tailed Z-test:
$$\delta_{eq} = \frac{2\sqrt{2}\,\mathrm{erf}^{-1}(1 - 2\alpha)}{\sqrt{n}} \tag{2.48}$$
and the two-tailed Z-test:
$$\delta_{eq} \approx \frac{2\sqrt{2}\,\mathrm{erf}^{-1}(1 - \alpha)}{\sqrt{n}} \tag{2.49}$$
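The dependence of the equivalent resolution on sample size can be sketched as follows. Both functions assume that the critical z-score sits halfway between the hypothetical means, $z_c = \delta\sqrt{n}/2$, which is exact for the optimal one-tailed test and only approximate for the two-tailed test. The function names and the 5% significance level in the example are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def equivalent_resolution_one_tailed(alpha: float, n: int) -> float:
    """Equivalent Cohen resolution of a one-tailed Z-test: 2 * z_c / sqrt(n)."""
    return 2.0 * NormalDist().inv_cdf(1.0 - alpha) / sqrt(n)


def equivalent_resolution_two_tailed(alpha: float, n: int) -> float:
    """Approximate equivalent resolution of a two-tailed Z-test (z_c from alpha/2 per tail)."""
    return 2.0 * NormalDist().inv_cdf(1.0 - alpha / 2.0) / sqrt(n)


# Hypothetical example: fixed 5% significance level and increasing sample sizes.
for n in (4, 16, 64, 256):
    print(n, equivalent_resolution_one_tailed(0.05, n), equivalent_resolution_two_tailed(0.05, n))
```

The resolution shrinks with $1/\sqrt{n}$, so a fixed significance level implicitly changes which differences the test treats as relevant as more data are collected.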
These effects are represented in Figure 15 considering a fixed significance level of . For a
fixed significance level and the same data sample, the resolution of a one-tailed Z-test will be
better (lower) than the resolution of a two-tailed Z-test. This change in test resolution will also
have an influence on the stability of the test conclusion. Let us consider that a sample is
obtained from a normal distribution with a mean value of and a standard deviation of .
The data will be obtained sequentially and after obtaining a new data point both one-tailed and
two-tailed Z-tests are performed again, considering as null hypothesis: . The -values
obtained are shown in Figure 16 for a particular random sample with a maximum size of .
This example clearly shows how the conclusion of the test easily changes just by changing the
sample size.
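A sequential experiment of this kind can be sketched with the standard library. The true mean, standard deviation, hypothetical mean, maximum sample size, random seed and the 5% decision threshold used below are all illustrative assumptions and are not the values used in the original example.

```python
import random
from math import sqrt
from statistics import NormalDist


def sequential_p_values(data, mu0: float, sigma: float):
    """P-values of right-tail and two-tailed Z-tests after each new observation.

    Returns a list of tuples (n, p_one_tailed, p_two_tailed).
    """
    std_normal = NormalDist()
    results = []
    running_sum = 0.0
    for n, x in enumerate(data, start=1):
        running_sum += x
        z = (running_sum / n - mu0) / (sigma / sqrt(n))
        p_one = 1.0 - std_normal.cdf(z)
        p_two = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        results.append((n, p_one, p_two))
    return results


# Hypothetical setup: true mean 0.3, sigma 1, null hypothesis mean 0, 100 points.
rng = random.Random(2021)
sample = [rng.gauss(0.3, 1.0) for _ in range(100)]
history = sequential_p_values(sample, mu0=0.0, sigma=1.0)

# Count how often the two-tailed conclusion flips at a fixed 5% level.
decisions = [p_two < 0.05 for _, _, p_two in history]
flips = sum(1 for a, b in zip(decisions, decisions[1:]) if a != b)
print(flips)
```

Counting the number of flips for different seeds gives a rough idea of how unstable the conclusion can be when a fixed significance level is applied to a growing sample.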
Figure 15. Effect of sample size ($n$) on the equivalent test resolution ($\delta_{eq}$) for the Z-test
considering a fixed significance level. Dashed green line: One-tailed Z-test. Solid blue line:
Two-tailed Z-test.
If an optimal, sample size-based significance level is considered, a more stable conclusion can
be obtained. In addition, following the idea of using a decision criterion based on the sign of the
value obtained [4], a $D$-value (decision value) metric is proposed:
(2.50)

where $\alpha^*$ is given by Eq. (2.17) for the one-tailed Z-test or Eq. (2.43) for the two-tailed Z-test,
and $P$ is the $P$-value obtained for the test.
Figure 16. Effect of sample size ($n$) on the $P$-value for the Z-test by sequentially including
additional random data points in the analysis. Dashed green line: One-tailed Z-test. Solid blue
line: Two-tailed Z-test. Dashed red line: Fixed significance level.
Positive $D$-values represent a positive conclusion (the null hypothesis is rejected), whereas
negative $D$-values represent a negative conclusion (the null hypothesis is not rejected). The $D$-values
obtained using the same sequential random sample considered in Figure 16 are
presented in Figure 17. The behavior of the $D$-value is more stable than that of the $P$-value, and
unless the $D$-value is very close to zero, no sudden change in the conclusion is expected as the
sample size increases for large sample sizes. For smaller samples, which have an increased
uncertainty, larger fluctuations may result in different conclusions. By using the $D$-values, the
test conclusion is obtained with a fixed resolution, independent of the sample size. On the
other hand, the differences in $D$-values between the one-tailed and the two-tailed Z-tests are
due to the different resolution value considered for each case. Since the resolution of the
one-tailed Z-test based on the minimum sample size is smaller, it will be able to detect smaller differences than the
two-tailed Z-test.
The -value limit obtained for large samples will be denoted as . This value may change
from sample to sample due to the randomness of the sampling process. Using a Monte Carlo
approach, considering different random samples and different true Cohen differences ( ) the
following empirical expressions are obtained for the average and standard deviation values of
:
〈 〉 ( )( )
(2.51)
( )
(2.52)

Figure 17. Effect of sample size ($n$) on the $D$-value for the Z-test by sequentially including
additional random data points in the analysis. Dashed green line: One-tailed Z-test. Solid blue
line: Two-tailed Z-test. Dashed red line: Decision limit.
The Monte Carlo data obtained considering 25 replicates for each true distance are compared
to the empirical expressions in Figure 18.
Figure 18. Monte Carlo simulation for determining the -value limit in Z-tests. Left plot:
Average limit 〈 〉. Right plot: Standard deviation .
From Eq. (2.51) it can be concluded that 〈 〉 when . That is, can be
interpreted as equivalent to the measurement scale of the test, while is the minimum
difference detected by the test. In addition, from Eq. (2.51) it can be shown that the average
limiting -value obtained for the two-tailed Z-test 〈 〉 is related to the average limiting -
value obtained for the one-tailed Z-test 〈 〉 by:
〈 〉 〈 〉
(2.53)
( ( ) )( )
(2.54)
Also, solving for Eq. (2.51) becomes:

√ 〈 〉
(2.55)
and replacing in Eq. (2.52) yields:
〈 〉 √〈 〉
〈 〉 √〈 〉
(2.56)
For the minimum difference detected by the test ( ), the standard deviation in is
for the one-tailed Z-test and for the two-tailed Z-test. Considering
a minimum Cohen difference similar to a two-tailed test (because the fluctuation is relevant in
both directions) we find that the true distribution is at the resolution limit of the tests when
| | or | | . Approximating such differences to and respectively, the
following rules of thumb can be used for analyzing $D$-values of Z-tests:
For a one-tailed Z-test:
• If $D > 0.1$ then the null hypothesis can be rejected
• If $D < -0.1$ then the null hypothesis cannot be rejected
• If $|D| \le 0.1$ then the test is inconclusive because it is at the limit of its resolution
($\delta_r = 0.9539$)
For a two-tailed Z-test:
• If $|D| \le 0.2$ then the test is inconclusive because it is at the limit of its resolution
($\delta_r = 1.6259$)
When the test is at the resolution limit, the sign of the $D$-value might change depending on the
particular data sample used and on the sample size considered. Thus, it is better to consider it
inconclusive. In order to draw a definitive conclusion, a different Cohen resolution should be
selected, for example, the Cohen resolution based on the sensor resolution limit. In
this case, considering any arbitrary Cohen resolution $\delta_r$, the rule of thumb becomes:
For a Z-test considering a Cohen resolution $\delta_r$:
• If $|D| \le D_{cr}$ then the test is inconclusive because it is at the limit of its resolution ($\delta_r$)
where
$$D_{cr} = \frac{1 + 6.45\,\delta_r \left(1 + \delta_r/2\right)}{100}$$
(2.57)
Let us recall that a new $D$-value must be calculated (Eq. 2.50) for the selected resolution $\delta_r$,
since the optimal significance level will change, according to Eq. (2.17) for the one-tailed Z-test
or Eq. (2.43) for the two-tailed Z-test.
3. T-Tests
In practice, when a test of means will be performed, the variance or standard deviation of a
population is not known a priori. Thus, the standard deviation must also be inferred from the
data, involving additional uncertainty in the test. A correction to this additional uncertainty was
considered by William Sealy Gosset [5] leading to Student’s T distribution and the
corresponding Student’s T test of means. There are different types of T-tests including the one-
sample T-test, the two-sample paired T-test, the two-sample unpaired homoscedastic T-test
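and the two-sample unpaired heteroscedastic T-test. The test statistic ($t$) is determined using
specific expressions for each case, as indicated in Table 1.
Table 1. T-test statistic ($t$) and degrees of freedom ($\nu$) for each type of T-test available
T-test | Test statistic | Degrees of freedom
One-sample | $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$ | $\nu = n - 1$
Two-sample paired | $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{s_d/\sqrt{n}}$ | $\nu = n - 1$
Two-sample unpaired homoscedastic | $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{s_p \sqrt{1/n_1 + 1/n_2}}$ | $\nu = n_1 + n_2 - 2$
Two-sample unpaired heteroscedastic | $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ | $\nu = \dfrac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$
• The subscripts 1 and 2 denote the number of each sample for the two-sample tests.
• $\Delta_0$ is the hypothetical difference between the mean values of both populations.
• $s_d$ represents the standard deviation of the difference between paired data values.
• The homoscedastic standard deviation is $s_p = \sqrt{\dfrac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}$.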

In addition, the determination of the $P$-value requires an additional parameter: the degrees of
freedom for the estimation of the standard deviation ($\nu$). The corresponding values of the
degrees of freedom for each type of T-test are also included in Table 1.
Once the test statistic and the degrees of freedom have been determined from the sample(s),
the test procedure is identical in all cases. Thus, we will consider the one-sample T-test as a
representative example for the optimization of the test.
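The four cases of Table 1 can be sketched in base R as follows. This is a minimal illustration assuming the standard textbook definitions of each statistic; the function name `t_stat_df` and its arguments are hypothetical and do not come from the paper.

```r
# Sketch: t statistic and degrees of freedom for the T-test variants of Table 1.
# t_stat_df and its arguments are illustrative names, not part of the paper.
t_stat_df <- function(x, y = NULL, d0 = 0,
                      type = c("one.sample", "paired", "pooled", "welch")) {
  type <- match.arg(type)
  if (type == "one.sample") {
    n <- length(x)
    return(c(t = (mean(x) - d0) / (sd(x) / sqrt(n)), df = n - 1))
  }
  if (type == "paired") {
    d <- x - y                       # differences between paired values
    n <- length(d)
    return(c(t = (mean(d) - d0) / (sd(d) / sqrt(n)), df = n - 1))
  }
  n1 <- length(x); n2 <- length(y)
  v1 <- var(x);    v2 <- var(y)
  if (type == "pooled") {            # unpaired, homoscedastic
    sp <- sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return(c(t = (mean(x) - mean(y) - d0) / (sp * sqrt(1 / n1 + 1 / n2)),
             df = n1 + n2 - 2))
  }
  se2 <- v1 / n1 + v2 / n2           # unpaired, heteroscedastic (Welch)
  df  <- se2^2 / ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))
  c(t = (mean(x) - mean(y) - d0) / sqrt(se2), df = df)
}

# Usage with the viscosity data of Table 10
np <- c(789, 781, 650, 719, 703, 767, 752, 646, 756)
fa <- c(737, 797, 751, 930, 667, 806, 784, 762, 689, 741)
print(t_stat_df(np, fa, type = "welch"))
print(t_stat_df(np, fa, type = "pooled"))
```

The heteroscedastic case returns a fractional number of degrees of freedom; the opt.mean.test listing of Section 4.3 rounds this value before using it.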
When designing the test, the true Cohen difference is unknown because the true mean (or
mean difference) is unknown but also because the true standard deviation ($\sigma$) is unknown. The
definition of the Cohen difference for the T-test is similar to that for the Z-test:
| |
{
| |
(3.1)
Thus, again we need to assume a Cohen distance (resolution) in order to obtain the optimal
significance level and sample size for the test.
Considering that Student’s T distribution is also symmetrical, the behavior of the right-tail is
equivalent to that of the left-tail test. So, in general for the one-tailed T-test, the test error
probabilities will be:
( )
(3.2)
( √ )
( √ ) ( ) ( √ )
( )
√ ( )
(3.3)
where $\Gamma$ is the gamma function and ${}_2F_1$ is the Gaussian hypergeometric function.
As $\nu$ increases, Eq. (3.3) approximates to Eq. (2.13), resulting in the same behavior of the Z-test.
Thus, the Z-test can be used as an approximation for estimating the test error probabilities.
Due to the analytical complexity of the expressions obtained during the minimization, a
different approach is proposed. According to the results observed for the Z-tests, the minimum
total test error probability is achieved when $\alpha = \beta$. Thus, the optimal significance level
$\alpha^*$ can be obtained by numerically solving:
( √ )
(3.4)

The result obtained can be approximated using the following empirical function (cf. Figure 19):
( )
[ ( √ ) (( ) ) ]
(3.5)
and therefore:
( )
( √ ) (( ) )
(3.6)
Considering that the T-test requires estimating the standard deviation from the sample, the
minimum sample size required must consider this additional degree of freedom. Thus, using
$\nu = 2$ ($n = 3$), the Cohen resolution based on the minimum viable test will be
$\delta_r = 0.9428$. This value is obtained from the numerical solution when the type II error,
calculated using Eq. (3.3), and the type I error are both set equal to 0.25.
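The equal-error condition can be explored numerically with a short R sketch. It assumes, as one possible reading of the derivation, that the type II error of the one-tailed test is the central T cumulative distribution shifted by $\delta\sqrt{\nu+1}$; under that assumption the condition $\alpha = \beta$ has a closed form by symmetry. The function names are hypothetical, and the empirical coefficients are the one-tailed T-test values that appear in the R listing of Section 4.3.

```r
# Sketch under a stated ASSUMPTION: beta = F_T(t_crit - delta*sqrt(nu + 1); nu),
# i.e. a central T distribution shifted by the standardized difference.
# With alpha = F_T(-t_crit; nu), alpha = beta gives t_crit = delta*sqrt(nu + 1)/2.
alpha_equal_errors <- function(delta, nu) {
  pt(-delta * sqrt(nu + 1) / 2, df = nu)
}

# Empirical approximation using the one-tailed T-test coefficients
# taken from the R listing of Section 4.3
alpha_empirical <- function(delta, nu, c0 = 0.9825, c1 = 0.1876, c2 = 0.1182) {
  exp(-(c0 + c1 * delta * sqrt(nu) + c2 * delta^2 * nu))
}

# Minimum viable one-tailed T-test: nu = 2 with the default resolution 0.9428
print(alpha_equal_errors(0.9428, 2))

# Comparison of both estimates for the design of Section 5.4 (nu = 17)
print(c(equal_errors = alpha_equal_errors(0.9259, 17),
        empirical    = alpha_empirical(0.9259, 17)))
```

Under this assumption, the default resolution of 0.9428 with two degrees of freedom corresponds to both error probabilities being very close to 0.25, and the two estimates stay close to each other for larger $\nu$.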
Figure 19. Optimal significance level as a function of true Cohen distance and degrees of
freedom for one-tailed T-tests. Solid black lines: Values obtained by numerical optimization.
Dashed red lines: Values obtained using approximation (3.5).
By using and Eq. (3.6) in Eq. (1.3), the optimal degrees of freedom for the one-
tailed T-test can be approximately obtained by iteratively solving:
( ) (( ) )
√ ( )
[ ( )]
(3.7)

The resulting values can be approximated (cf. Figure 20) using the following empirical
expression:
⟦ ( ( )) ( ( )) ( ( )) ( ) ⟧
(3.8)
Figure 20. Optimal degrees of freedom of a one-tailed T-test considering

(based on ). Solid blue line: Values obtained by iteratively solving Eq. (3.7). Dashed
green line: Empirical model Eq. (3.8).
While the optimal sample sizes for the one-tailed T-test closely resemble the optimal sample
size for the one-tailed Z-test, the values obtained are slightly larger for the same cost ratio. In
part, this is caused by the slightly smaller Cohen resolution value.
For two-sample paired T-tests, $\nu$ represents the degrees of freedom of each sample ($n - 1$).
For two-sample unpaired T-tests, $\nu$ represents the combined degrees of freedom in both
samples. Given that the variance of each population is unknown during the design of the test, it
is most reasonable to assume homoscedasticity and to consider both samples with the same
size ($n_1 = n_2 = \lceil \nu/2 \rceil + 1$). However, if the data are heteroscedastic, the observed
degrees of freedom will be fewer than those expected during the test design.
The optimal significance level obtained for the one-tailed T-test is compared to the optimal
significance level obtained for the one-tailed Z-test in Figure 21 as a function of the sample size
(degrees of freedom of the sample). Considering the same degrees of freedom, the optimal
significance level of the T-test is always larger than for the Z-test.
On the other hand, the error probabilities for the two-tailed T-test are:
( ) ( ) ( )
(3.9)

Figure 21. Optimal significance level for one-tailed Z-tests (dashed blue line, Eq. 2.17,
) and one-tailed T-tests (solid green line, Eq. 3.5, ) as a function of
the degrees of freedom of the data.
( √ √ )
( √ √ )
( √ ) ( ) ( √ )
( )
√ ( )
( √ ) ( ) ( √ )
( )
√ ( )
(3.10)
Numerically solving for , the following empirical approximation (cf. Figure 22) is
obtained:
( )
[ ( √ ) ( ) ]
(3.11)
For , the Cohen resolution based on the minimum viable test for the two-tailed T-test
will be , obtained from Eq. (3.10) when and are both set equal to . Using
this value and Eq. (3.11) for numerically solving the optimization problem (1.3), the following
empirical expression is obtained for the optimal degrees of freedom (cf. Figure 23):
⟦ ( ( )) ( ( )) ( ( )) ( ) ⟧
(3.12)

Figure 22. Optimal significance level as a function of true Cohen distance and degrees of
freedom for two-tailed T-tests. Solid black lines: Values obtained by numerical optimization.
Dashed red lines: Values obtained using approximation (3.11).
The optimal significance level obtained for the two-tailed T-test is compared to the optimal
significance level obtained for the one-tailed T-test and the two-tailed Z-test as a function of
the degrees of freedom in Figure 24. Considering the same degrees of freedom, the optimal
significance level is much larger for the two-tailed T-test than for the two-tailed Z-test, as a
result of the lower Cohen resolution value employed in the T-test. In addition, since the one-
tailed T-test has an even lower resolution value, it results in even larger optimal significance
levels than those observed in the two-tailed T-test.
Figure 23. Optimal degrees of freedom of a two-tailed T-test considering

(based on ). Solid blue line: Values obtained by numerical optimization. Dashed green
line: Empirical model Eq. (3.12).
Now, in order to analyze and interpret the results of a T-test considering data with $\nu$ degrees of
freedom, the $D$-value of the test can be used. The $D$-value will be determined as follows:
$$D = \frac{\ln \alpha^* - \ln P}{\nu}$$
(3.13)

Figure 24. Optimal significance level for the two-tailed T-test (solid blue line, Eq. 3.11,
) compared to the two-tailed Z-test (dotted red line, Eq. 2.43, ) and
the one-tailed T-test (dashed green line, Eq. 3.5, ) as a function of the degrees of
freedom of the data.
Following a similar approach to the one used for the Z-tests, empirical expressions are obtained
for the average and standard deviation values of the $D$-value limit:
〈 〉 ( )( )
(3.14)
(3.15)
The Monte Carlo data obtained considering 25 replicates for each true distance are compared
to these empirical expressions in Figure 25.
Figure 25. Monte Carlo simulation for determining the $D$-value limit in T-tests. Left plot:
Average limit. Right plot: Standard deviation.
The average limiting $D$-value obtained for the two-tailed T-test is approximately related
to the average limiting $D$-value obtained for the one-tailed T-test considering the same
true Cohen distance by:

〈 〉 〈 〉 ( )( ) 〈 〉
(3.16)
In addition, the standard deviation of the limit $D$-value can be approximated in terms of the
average as:
√〈 〉 √〈 〉
(3.17)
and therefore, at the resolution limit ( ), the standard deviation in is for

the one-tailed T-test and for the two-tailed T-test. Considering a minimum Cohen
difference similar to a two-tailed test, then the true distribution is at the resolution limit of the
tests when | | or | | approximately.
In general, considering any arbitrary Cohen resolution $\delta_r$, the rule of thumb for the decision
based on the T-test becomes:
• If $|D| \le D_{cr}$ then the test is inconclusive because it is at the limit of its resolution ($\delta_r$)
where
$$D_{cr} = \frac{1.75 + 3.75\,\delta_r}{100}$$
(3.18)
4. Algorithm Implementation and Examples
4.1. General Procedure for Optimally Designing Tests of Means
The following procedure is proposed for planning experiments for testing means:
1. Choose between a Z-test (known variance) or a T-test (unknown variance). By default,
the selected test is the T-test because the variance is usually not known a priori.
2. Choose between a one-tailed and a two-tailed test. One-tailed tests offer a lower test
resolution value, providing a better decision capability. By default, the selected test
type will be one-tailed.
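3. Determine the Cohen resolution of the test ($\delta_r$). By default the following values will be
used:
$$\delta_r = \begin{cases} 0.9539 & \text{(one-tailed Z-test)} \\ 1.6259 & \text{(two-tailed Z-test)} \\ 0.9428 & \text{(one-tailed T-test)} \\ 1.159 & \text{(two-tailed T-test)} \end{cases}$$
(4.1)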
A different Cohen resolution can be selected based on the sensor resolution (if the
variance is known) or any other criterion. However, since the calculation of sample size
is based on the default Cohen resolution, the value obtained for other test resolutions
will not correspond to the exact optimum.
4. Determine the cost ratio of the experiment ($CR$), expressed as the total cost of an erroneous
decision divided by the cost of performing a single observation. When these costs are
not known, the cost ratio can be replaced by the maximum number of observations
possible considering the budget, schedule and resource limitations. Alternatively, an
equivalent cost ratio can be estimated from the maximum tolerable total test error
($\varepsilon_{max}$) using Eq. (2.25):
$$CR_{eq} = \frac{5.6848}{\delta_r^2\,\varepsilon_{max}^{1.029}}$$
Considering the classical significance level of 5% used in hypothesis testing, a default
value for $\varepsilon_{max}$ is 10%. Since this approximation is based on the one-tailed Z-test, the
validity of the maximum test error constraint must always be confirmed. If required, a
larger equivalent cost ratio value can be used.
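5. Calculate the optimal degrees of freedom of the test using the following general
expression:
$$\nu^* = ⟦ a_0 + a_1 \ln(CR) + a_2 \left(\ln(CR)\right)^2 + a_3 \left(\ln(CR)\right)^3 + a_4 \left(\ln(CR)\right)^4 ⟧$$
(4.2)
where ⟦ ⟧ denotes rounding to the nearest integer and the coefficients (based on the
corresponding default Cohen resolution) are summarized in Table 2.
For Z-tests the optimal sample size corresponds to the optimal degrees of freedom.
For one-sample and paired T-tests the optimal sample size is $\nu^* + 1$.
For two-sample unpaired T-tests the optimal size of each sample is $\lceil \nu^*/2 \rceil + 1$.
Table 2. Coefficients used in Eq. (4.2) for the calculation of the optimal degrees of freedom for
testing means (values as used in the R implementation of Section 4.3)
Test                a0    a1        a2       a3           a4
One-tailed Z-test   0     -1.7746   1.4742   -9.7032e-2   2.3696e-3
Two-tailed Z-test   0     1.5716    0.1362   -4.527e-3    0
One-tailed T-test   -1    -2.135    1.739    -0.125       3.3e-3
Two-tailed T-test   -1    -1.043    1.314    -0.0876      2.07e-3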
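Steps 4 and 5 can be sketched in a few lines of R. The coefficients are copied from the optsize.mean.test listing of Section 4.3; the helper names (`coef_v`, `equiv_cost_ratio`, `opt_dof`, `sample_size`) are hypothetical. The sketch evaluates the polynomial once and omits the loop that the full function uses to enforce the maximum test error.

```r
# Coefficients a0..a4 of Eq. (4.2), copied from the optsize.mean.test listing
coef_v <- list(
  Z1 = c(0,  -1.7746, 1.4742, -9.7032e-2, 2.3696e-3),  # one-tailed Z-test
  Z2 = c(0,   1.5716, 0.1362, -4.527e-3,  0),          # two-tailed Z-test
  T1 = c(-1, -2.135,  1.739,  -0.125,     3.3e-3),     # one-tailed T-test
  T2 = c(-1, -1.043,  1.314,  -0.0876,    2.07e-3)     # two-tailed T-test
)

# Equivalent cost ratio from the maximum tolerable total test error (Eq. 2.25)
equiv_cost_ratio <- function(dr, emax = 0.1) 5.6848 / (dr^2 * emax^1.029)

# Optimal degrees of freedom: polynomial in ln(CR), rounded (Eq. 4.2)
opt_dof <- function(key, cost_ratio) {
  L <- log(cost_ratio)
  round(sum(coef_v[[key]] * L^(0:4)))
}

# Conversion from degrees of freedom to the size of each sample
sample_size <- function(v, design = c("z", "one.sample", "paired", "unpaired")) {
  design <- match.arg(design)
  switch(design,
         z = v,
         one.sample = v + 1,
         paired = v + 1,
         unpaired = ceiling(v / 2) + 1)
}

v <- opt_dof("T2", 1000)               # two-tailed T-test, cost ratio of 1000
print(c(dof = v, n_per_sample = sample_size(v, "unpaired")))
print(equiv_cost_ratio(0.5587, 0.1))   # equivalent cost ratio for a 10% error
```

For a cost ratio of 1000 and a two-tailed T-test this gives 30 degrees of freedom, that is, 16 observations per sample in an unpaired design, which agrees with the fertilizer example of Section 5.3.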

4.2. General Procedure for Optimally Testing Means
Once the observations for the test have been collected (whether or not the optimal sample size
approach of the previous section was used), the following procedure is suggested for performing
the test of means:
1. Define the set of hypotheses about the means to be tested.
2. Choose the test to be used: Z-test vs. T-test, one-tailed vs. two-tailed.
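3. If the sample size (for each sample) is less than 20, check the normality of the sample. If
the data are not normal, use a suitable monotonic non-linear transformation that
guarantees normality (positive $N$-values [4]).
4. Calculate the average value of the sample(s). The value of the average will determine
the correct direction of the test, for one-tailed tests.
5. For T-tests also calculate the standard deviation of the sample(s).
6. Select the Cohen resolution of the test ($\delta_r$) to be used based on the minimum size
viable test (Eq. 4.1), considering the sensor resolution (divided by the standard
deviation $\sigma$), or any other criterion.
7. For Z-tests, calculate the $z$-score using Eq. (2.6). For T-tests, use the equations given in
Table 1 for calculating the statistic $t$ and the degrees of freedom of the data ($\nu$).
8. Determine the $P$-value of the test as follows.
For Z-tests:
$$P = \begin{cases} \Phi(z) & \text{(left-tailed)} \\ 1 - \Phi(z) & \text{(right-tailed)} \\ 2\left(1 - \Phi(|z|)\right) & \text{(two-tailed)} \end{cases}$$
(4.3)
For T-tests:
$$P = \begin{cases} F_T(t; \nu) & \text{(left-tailed)} \\ 1 - F_T(t; \nu) & \text{(right-tailed)} \\ 2\left(1 - F_T(|t|; \nu)\right) & \text{(two-tailed)} \end{cases}$$
(4.4)
Here $\Phi$ denotes the standard normal cumulative distribution function and $F_T$ the cumulative
Student's T distribution function with $\nu$ degrees of freedom.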
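9. Calculate the $D$-value of the test (Eq. 3.13):
$$D = \frac{\ln \alpha^* - \ln P}{\nu}$$
where $\nu = n$ for Z-tests, and
$$\ln \alpha^* = -\left(c_0 + c_1\,\delta_r \sqrt{\nu} + c_2\,\delta_r^2\,\nu\right)$$
(4.5)
The coefficients ($c_0$, $c_1$, $c_2$) used in Eq. (4.5) are summarized in Table 3. Alternatively, $\alpha^*$ can
be obtained from Eq. (2.17) for the one-tailed Z-test; from Eq. (2.43) for the two-tailed Z-
test; from Eq. (3.5) for the one-tailed T-test; or from Eq. (3.11) for the two-tailed T-test.
10. Calculate the critical $D$-value at the limit of the test resolution ($D_{cr}$):
$$D_{cr} = \begin{cases} \dfrac{1 + 6.45\,\delta_r \left(1 + \delta_r/2\right)}{100} & \text{(Z-tests)} \\ \dfrac{1.75 + 3.75\,\delta_r}{100} & \text{(T-tests)} \end{cases}$$
(4.6)
11. Draw a conclusion from the test:
• If $D > D_{cr}$ then the null hypothesis is rejected.
• If $D < -D_{cr}$ then the null hypothesis is not rejected.
• If $|D| \le D_{cr}$ then the test is inconclusive because it is at the resolution limit ($\delta_r$).
In this case, a different Cohen resolution should be selected.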
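Table 3. Coefficients used in Eq. (4.5) for the calculation of the logarithm of the optimal
significance level for testing means (values as used in the R implementation of Section 4.3)
Test                c0       c1       c2
One-tailed Z-test   1.0898   0.2315   0.1179
Two-tailed Z-test   0.7871   0.1917   0.1202
One-tailed T-test   0.9825   0.1876   0.1182
Two-tailed T-test   0.9577   0.1325   0.0958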
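When only a reported P-value and its degrees of freedom are available, steps 9 to 11 can be applied without the raw data. The following R sketch does so with the coefficients found in the listing of Section 4.3; the function name `d_decision` is hypothetical, and the inputs of the usage example are the values reported in Sections 5.3 and 5.1.

```r
# Sketch: D-value decision from a reported P-value (steps 9 to 11).
# d_decision is an illustrative name; coefficients come from the Section 4.3 listing.
d_decision <- function(P, v, dr, test = c("T", "Z"), tails = 1) {
  test <- match.arg(test)
  k <- if (test == "Z") tails else tails + 2   # 1..4: Z1, Z2, T1, T2
  c0 <- c(1.0898, 0.7871, 0.9825, 0.9577)
  c1 <- c(0.2315, 0.1917, 0.1876, 0.1325)
  c2 <- c(0.1179, 0.1202, 0.1182, 0.0958)
  log_alpha <- -(c0[k] + c1[k] * dr * sqrt(v) + c2[k] * dr^2 * v)
  D <- (log_alpha - log(P)) / v                # D-value of the test
  Dcr <- if (test == "Z") {
    (1 + 6.45 * dr * (1 + dr / 2)) / 100       # critical D for Z-tests
  } else {
    (1.75 + 3.75 * dr) / 100                   # critical D for T-tests
  }
  decision <- if (D > Dcr) {
    "Ho is rejected"
  } else if (D < -Dcr) {
    "Ho is not rejected"
  } else {
    "Inconclusive"
  }
  data.frame(decision, D, Dcr, alpha = exp(log_alpha))
}

# Two-tailed T-test of Section 5.3 (P = 0.04672 with 22 degrees of freedom)
print(d_decision(0.04672, 22, 1.159, test = "T", tails = 2))
# One-tailed Z-test of Section 5.1 (P = 0.0020895 with 32 observations)
print(d_decision(0.0020895, 32, 0.5587, test = "Z", tails = 1))
```

Keeping the decision separate from the classical test makes it easy to re-evaluate a published result under a different Cohen resolution.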
4.3. Algorithm Implementation
The general procedures previously presented have been implemented as functions in R
(https://cran.r-project.org/). The R code for these functions is presented in this Section.
Optimally Designing Tests of Means
optsize.mean.test<-
function(test=c('T','Z'),tails=c(1,2),dr=NULL,emax=0.1,costratio=NULL){
#test is the type of means test (Z or T). Default: T-test.
#tails is the number of tails for the test. Default: One tail.
#dr is the Cohen resolution of the test. If null, default values for each test
#type are used.
#emax is the maximum total error tolerable for the test. By default it is set
#to 10%.
#costratio is the ratio between the cost of an erroneous decision to the cost
#of a single observation.
#If no costratio value is used, an equivalent costratio value is obtained from
#emax, based on the one-tailed Z-test.
#The value v returned is the sample size for Z-tests or the degrees of freedom
#for T-tests.
test<-test[1]
tails<-tails[1]
#Choose test type
if (test=='Z' && tails==1){
ttype<-1
}
if (test=='Z' && tails==2){
ttype<-2
}
if (test=='T' && tails==1){
ttype<-3
}
if (test=='T' && tails==2){
ttype<-4
}
#Calculate default Cohen resolution
ddr=c(0.9539,1.6259,0.9428,1.159)
if (is.null(dr)){
dr<-ddr[ttype]
}
#Calculate equivalent cost ratio value
CReq <- 5.6848/(dr^2*(emax^1.029))
#Calculate log(cost ratio)
if (is.null(costratio) || costratio<CReq){
LCR<-log(CReq)/1.02
} else {
LCR<-log(costratio)/1.02
}
a0<-c(0,0,-1,-1)
a1<-c(-1.7746,1.5716,-2.135,-1.043)
a2<-c(1.4742,0.1362,1.739,1.314)
a3<-c(-9.7032e-2,-4.527e-3,-0.125,-0.0876)
a4<-c(2.3696e-3,0,3.3e-3,2.07e-3)
c0<-c(1.0898,0.7871,0.9825,0.9577)
c1<-c(0.2315,0.1917,0.1876,0.1325)
c2<-c(0.1179,0.1202,0.1182,0.0958)
#Verify maximum test error
testerror<-1
while (testerror>emax){
LCR<-LCR*1.02
#Optimal degrees of freedom
v<-
round(a0[ttype]+a1[ttype]*LCR+a2[ttype]*LCR^2+a3[ttype]*LCR^3+a4[ttype]*LCR^4)
#Estimate optimal significance level and test error
alpha<-exp(-(c0[ttype]+c1[ttype]*dr*sqrt(v)+c2[ttype]*dr^2*v))
testerror<-2*alpha
}
CR=round(exp(LCR))
OUT=data.frame(v,test,tails,dr,CR,alpha,testerror)
return(OUT)
}
Optimally Testing Means
opt.mean.test<-
function(x,y=NULL,mu=0,test=c('T','Z'),tails=c(1,2),dr=NULL,sigma.x=NULL,sigma.y=
NULL){
#x is a vector containing the data sample
#y is an optional vector containing a second data sample
#If the data is paired, do not input separately but use x-y as the first
#argument.
#test is the type of means test (Z or T). Default: T-test. Z-tests require the
#BSDA package.
#tails is the number of tails for the test. Default: One tail.

#dr is the Cohen resolution of the test. If null, default values for each test
#type are used.
#The output includes the D- and P-values of the test, the critical D, the
#optimal alpha, and the test decision.
test<-test[1]
tails<-tails[1]
if (is.null(sigma.x)){
sigma.x<-sd(x)
}
if (is.null(sigma.y)){
sigma.y<-sd(y)
}
#Choose test type and perform classical test for calculating the P-value
if (test=='Z' && tails==1){
ttype<-1
#The package name must be quoted in install.packages()
if (!require(BSDA)) install.packages("BSDA")
library(BSDA)
ST<-z.test(x=x,y=y,alternative="less",mu=mu,sigma.x=sigma.x,sigma.y=sigma.y)
P<-ST$p.value
#Use the tail indicated by the sample average
if (P>0.5){
P<-1-P
}
v<-length(x)+length(y)
}
if (test=='Z' && tails==2){
ttype<-2
if (!require(BSDA)) install.packages("BSDA")
library(BSDA)
ST<-
z.test(x=x,y=y,alternative="two.sided",mu=mu,sigma.x=sigma.x,sigma.y=sigma.y)
P<-ST$p.value
v<-length(x)+length(y)
}
if (test=='T' && tails==1){
ttype<-3
ST<-t.test(x=x,y=y,alternative="less",mu=mu)
P<-ST$p.value
#Use the tail indicated by the sample average
if (P>0.5){
P<-1-P
}
#Degrees of freedom reported by t.test, rounded to an integer
v=round(ST$parameter)
names(v)<-NULL
}
if (test=='T' && tails==2){
ttype<-4
ST<-t.test(x=x,y=y,alternative="two.sided",mu=mu)
P<-ST$p.value
v=round(ST$parameter)
names(v)<-NULL
}
#Calculate default Cohen resolution
ddr=c(0.9539,1.6259,0.9428,1.159)
if (is.null(dr)){
dr<-ddr[ttype]
}
#Calculate the optimal significance level and test error:
c0<-c(1.0898,0.7871,0.9825,0.9577)
c1<-c(0.2315,0.1917,0.1876,0.1325)
c2<-c(0.1179,0.1202,0.1182,0.0958)
La<--(c0[ttype]+c1[ttype]*dr*sqrt(v)+c2[ttype]*dr^2*v)

alpha<-exp(La)
testerror<-2*alpha
#Calculate D-value and critical D-value
D<-(La-log(P))/v
if (test=='Z'){
Dcr<-(1+6.45*dr*(1+dr/2))/100
} else {
Dcr<-(1.75+3.75*dr)/100
}
#Decision
if (D>Dcr){
decision="Ho is rejected"
}
if (D<(-Dcr)){
decision="Ho is not rejected"
}
if (abs(D)<=Dcr){
decision="Inconclusive"
}
OUT=data.frame(decision,D,Dcr,P,alpha,v,test,tails,dr,testerror)
return(OUT)
}
5. Examples
5.1. Testing Top Temperature in a Distillation Column
A PT-100 Class A temperature sensor is used to monitor the temperature at the top of a
continuous distillation column for separating a mixture of benzene and toluene, like the one
described by Jalee and Aparma [6]. The top temperature set point is in order to obtain
benzene with a purity. The sensor has a display resolution of 0.1 °C, and an accuracy of
±0.31 °C at 80 °C. A test of hypothesis is required in order to verify if the measured top
temperature during normal operation guarantees a maximum mean temperature of 80.2 °C. A
maximum test error of 10% is allowed.
In order to design the test, we will first assume that the accuracy of the sensor provides
information about the variance of the data. Thus, a Z-test is selected. Now, since we need to
check if the temperature is not significantly larger than 80.2 °C, a one-tailed Z-test is
chosen. The Cohen resolution of the test will be selected based on the sensor information.
Assuming that the sensor error is uniformly distributed, then:
$$\delta_r = \frac{0.1}{0.31/\sqrt{3}} = 0.5587$$
(5.1)
Using the above information in the optsize.mean.test function in R, the following output is
obtained:

optsize.mean.test('Z',1,dr=0.5587,emax=0.1)
v test tails dr CR alpha testerror
32 Z 1 0.5587 1049 0.04983113 0.09966226
Thus, an optimal sample size of 32 observations is suggested for the test resolution considered
(based on the sensor resolution and accuracy). Notice that this test is optimized for a classical
5% significance level. The 32 observations obtained are summarized in Table 4.
Table 4. Temperature measurements in °C at the top of a benzene-toluene distillation column.

80.5 80.1 80.5 80.6 80.1 80.3 80.2 80.1
80.1 80.5 80.5 80.4 80.0 80.4 80.1 80.4
80.4 80.0 80.0 80.5 80.2 80.3 80.6 80.1
80.3 80.2 80.0 80.3 80.3 80.5 80.3 80.5
The corresponding test results, using the opt.mean.test function in R, are the following:
opt.mean.test(Temp,mu=80.2,test="Z",tails=1,dr=0.5587,sigma.x=0.31/sqrt(3))
decision D Dcr P alpha v test tails dr testerror
Ho is rejected 0.0991159 0.0561029 0.0020895 0.0498311 32 Z 1 0.5587 0.0996623
The results obtained ($D > D_{cr}$) indicate that the mean temperature at the top of the column is
significantly larger than 80.2 °C (considering a 5% significance level), even though the
average value observed (80.29 °C) is apparently within the sensor accuracy. Of course, we have
assumed a priori that the data variance is determined by the sensor accuracy, which might not
necessarily be the case. Thus, the distillation column operating conditions must be improved in
order to increase the purity of benzene obtained in the top.
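The numbers of this example can be checked by hand in base R, without the BSDA package. The sketch below uses the data of Table 4, the arguments of the opt.mean.test call above, and the one-tailed Z-test coefficients from the listing of Section 4.3; the variable names are illustrative.

```r
# Table 4 data: 32 temperature readings at the top of the column
Temp <- c(80.5, 80.1, 80.5, 80.6, 80.1, 80.3, 80.2, 80.1,
          80.1, 80.5, 80.5, 80.4, 80.0, 80.4, 80.1, 80.4,
          80.4, 80.0, 80.0, 80.5, 80.2, 80.3, 80.6, 80.1,
          80.3, 80.2, 80.0, 80.3, 80.3, 80.5, 80.3, 80.5)
mu0   <- 80.2              # hypothesized maximum mean temperature
sigma <- 0.31 / sqrt(3)    # standard deviation assumed from the sensor accuracy
dr    <- 0.5587            # Cohen resolution used in the example
n     <- length(Temp)

z <- (mean(Temp) - mu0) / (sigma / sqrt(n))   # z-score of the sample mean
P <- pnorm(-abs(z))                           # one-tailed P-value

# Optimal significance level, D-value and critical D-value (one-tailed Z-test)
log_alpha <- -(1.0898 + 0.2315 * dr * sqrt(n) + 0.1179 * dr^2 * n)
D   <- (log_alpha - log(P)) / n
Dcr <- (1 + 6.45 * dr * (1 + dr / 2)) / 100

print(round(c(mean = mean(Temp), z = z, P = P, D = D, Dcr = Dcr), 5))
```

The resulting P-value, D-value and critical D-value should land close to those in the function output above, with $D$ clearly larger than $D_{cr}$.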
5.2. Comparing Catalysts Performance
Ren et al. [7] compared different catalysts for synthesizing poly(L-lactic acid) PLA from either
lactic acid or lactide (a dimer of lactic acid). The performance of the reaction considering
different catalysts (p-toluene sulfonic acid TSA, tin (II) chloride SnCl2, and an acid styrene-
divinylbenzene copolymer ion exchange resin A-15) is monitored over reaction time by means
of the weight-average molecular weight of the PLA obtained. The results extracted from a plot
reported by the authors are summarized in Table 5. The pairwise comparison between the
different treatments will be done using paired two-tailed T-tests. Considering that only 7 data
points (6 degrees of freedom) are available for each catalyst, using Eq. (3.5) the optimal
significance level to be used for these tests is (assuming a default Cohen resolution of $\delta_r = 1.159$):
$\alpha^* \approx 0.12$. The critical $D$-value (Eq. 3.18) for these tests is: $D_{cr} \approx 0.061$.
The T-tests require that the sample averages behave normally. This condition can be fulfilled
either by using normally-distributed data, or by considering relatively large sample sizes (larger

than 20, as requested by the central limit theorem [8]). For this particular example, the sample
size is not enough to guarantee a normal behavior of the sample averages. On the other hand,
the pairwise differences between treatments, with the only exception of the TSA-SnCl2 pair,
are not normal as can be observed in Table 6. Thus, a monotonic non-linear variable
transformation might be required before using the T-test. Table 7 summarizes the results of
normality tests for the pairwise differences considering different possible monotonic non-linear
transformations of the molecular weight (the original untransformed molecular weight
$N$-values are also included for comparison).
Table 5. Weight-average molecular weight of PLA as a function of reaction time using different
catalysts [7]
Mw (g/mol) Catalyst
Time (h) No catalyst TSA SnCl2 A-15
2 54 129 221 258
3 121 179 375 429
4 150 267 421 796
5 188 458 438 808
6 246 608 508 850
7 279 650 529 879
8 258 658 517 892
Table 6. Approximated Shapiro-Wilk normality tests [4] of pairwise differences in molecular
weight of PLA obtained using different catalysts. Upper value: Shapiro-Wilk's $P$-value. Lower
value: Normality $N$-value. Green numbers: Normal distribution. Red numbers: Non-normal
distribution.
P-value / N-value   No catalyst   TSA          SnCl2
TSA                 0.1469784
                    -0.5553325
SnCl2               0.0015038     0.3335536
                    -5.13765      0.2641853
A-15                0.0068098     0.1292352    0.007783
                    -3.627253     -0.6839842   -3.493692
Since no single transformation guarantees data normality for all possible pairs, the
transformation providing the largest positive $N$-value will be used for the T-test of each pair.
The pairwise T-test results ($P$- and $D$-values) obtained for the best transformation are
summarized in Table 8. All comparisons indicated a significant difference between catalysts,
with the exception of the pair TSA-SnCl2. So, basically the ion exchange resin A-15 provided a

significant increase in the rate of polymerization of PLA, even greater than the increase
observed for TSA and SnCl2.
Table 7. $N$-values for the pairwise differences using different transformations of the molecular
weight of PLA obtained with different catalysts. Green numbers: Normal distribution. Red
numbers: Non-normal distribution.
Pair                    Mw          Log(Mw)     Sqrt(Mw)    Mw^(1/3)    1/Mw
TSA vs. No catalyst     -0.582773   -2.458134   -0.315826   0.067225    -7.944898
SnCl2 vs. No catalyst   -5.072747   0.560602    0.923501    -0.200281   -3.677582
SnCl2 vs. TSA           0.276166    -0.743061   -0.200467   -0.477680   -1.895365
A-15 vs. No catalyst    -3.640831   0.322006    0.589069    1.015794    -3.687475
A-15 vs. TSA            -0.683362   0.232793    -2.130488   -0.602268   -0.989234
A-15 vs. SnCl2          -3.512910   -1.915714   -3.589495   -3.182006   0.999234
Table 8. Paired two-tailed T-tests comparing the molecular weight of PLA using different
catalysts, using the best transformation for each pair. Green
letters: Positive results (H0 is rejected). Red letters: Negative results (H0 is not rejected).
Pair                    Transformation   P-value    D-value
TSA vs. No catalyst     Mw^(1/3)         4.41E-04   0.936703
SnCl2 vs. No catalyst   Sqrt(Mw)         3.00E-07   2.152405
SnCl2 vs. TSA           Mw               0.875265   -0.328792
A-15 vs. No catalyst    Mw^(1/3)         2.74E-06   1.783613
A-15 vs. TSA            Log(Mw)          2.23E-03   0.666492
A-15 vs. SnCl2          1/Mw             1.97E-04   1.07078
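As a cross-check, the TSA versus SnCl2 entry of Table 8, the only pair tested on the untransformed molecular weight, can be reproduced with base R from the data of Table 5. The sketch assumes the two-tailed T-test coefficients of the Section 4.3 listing and the default Cohen resolution of 1.159; the variable names are illustrative.

```r
# Weight-average molecular weights from Table 5 (reaction times 2 to 8 h)
tsa   <- c(129, 179, 267, 458, 608, 650, 658)
sncl2 <- c(221, 375, 421, 438, 508, 529, 517)

# Paired test: run a one-sample T-test on the pairwise differences
tt <- t.test(tsa - sncl2, mu = 0, alternative = "two.sided")
P  <- tt$p.value
v  <- unname(tt$parameter)          # 6 degrees of freedom

dr <- 1.159                         # default resolution, two-tailed T-test
log_alpha <- -(0.9577 + 0.1325 * dr * sqrt(v) + 0.0958 * dr^2 * v)
D   <- (log_alpha - log(P)) / v     # D-value
Dcr <- (1.75 + 3.75 * dr) / 100     # critical D-value for T-tests

print(round(c(P = P, alpha = exp(log_alpha), D = D, Dcr = Dcr), 6))
```

Since the resulting $D$ is below $-D_{cr}$, the null hypothesis is not rejected for this pair, in line with the conclusion stated for TSA and SnCl2.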

5.3. Testing a New Fertilizer
Based on results published in the scientific literature [9], a farmer decides to test a new
sludge-derived organo-mineral fertilizer. The fertilizer is expected to provide maize crop yields
similar to those of conventional fertilizers, but at a reduced cost. The farmer has a maximum of
1000 plot areas (of 200 m² each) available for experimentation. Using this number as an estimate
of the cost ratio and considering a default Cohen resolution for a two-tailed T-test, the
optsize.mean.test function in R yields:
optsize.mean.test(test='T',tails=2,costratio=1000)
v test tails dr CR alpha testerror
30 T 2 1.159 1000 0.00348456 0.00696912
Assuming equal variances, the farmer then decides to test the new fertilizer in 16 plot areas,
while comparing the performance of the conventional fertilizer in another 16 plots. However, the
supplier of the organo-mineral fertilizer was only able to provide enough material for 10 plots.
So the experiment was run with 22 plots of the conventional fertilizer. The crop yield results
obtained during the trial are summarized in Table 9.
Table 9. Maize crop yields obtained during a trial for the comparison between a new organo-
mineral fertilizer and a conventional fertilizer.
Crop yield Crop yield
Plot # Fertilizer Plot # Fertilizer
(Ton/ha) (Ton/ha)
1 Conventional 54.65 17 Conventional 43.69
2 Organo-mineral 42.32 18 Conventional 47.42
3 Conventional 45.02 19 Organo-mineral 36.17
Since the sample size for the conventional fertilizer is larger than 20, the sample average is
assumed to behave normally [8]. The normality of the organo-mineral fertilizer data, on the
other hand, must be tested due to the small sample size. The approximated Shapiro-Wilk test
of normality [4] for this data set yields a -value of , and -value of , indicating

that data for the organo-mineral fertilizer (and therefore the corresponding sample average)
can also be assumed to behave normally.
The optimal T-test (opt.mean.test) for the two samples (considering a default test resolution of
$\delta_r = 1.159$) yields:
opt.mean.test(Conv,Orgmin,mu=0,test="T",tails=2)
decision D Dcr P alpha v test tails dr testerror
Ho is not rejected -0.06571 0.06096 0.04672 0.01101 22 T 2 1.159 0.022017
Notice that the final degrees of freedom of the test are fewer than originally expected due to
the last-minute change in the design. This change increases the optimal significance level for
this test from 0.0035 to 0.011. Nevertheless, the conclusion obtained is that the crop yields are
not significantly different and therefore, the new organo-mineral fertilizer can be used as a
substitute for the conventional fertilizer.
Figure 26. Normal Q-Q plots for the data shown in Table 9. Left plot: All data plotted as a single
sample (Null hypothesis). Right plot: Both categories plotted independently (Alternative
hypothesis). Blue diamonds: Conventional fertilizer. Red squares: Organo-mineral fertilizer.
Green line: Overall fit to a normal distribution.
A different conclusion might have been obtained if the classical 5% significance level had
been used. However, considering the degrees of freedom available and the test
resolution used, such a significance level is suboptimal. In order to better understand the
conclusion obtained based on the $D$-value, let us consider the normal Q-Q plots in Figure 26.
The left plot shows the null hypothesis where both samples are obtained from the same
normal probability distribution. On the other hand, the right plot shows the alternative
hypothesis where both samples are obtained from different distributions. The lines and
equations represent the best fit of each data set to a normal distribution model. When plotted
independently, different mean values are observed for each category. However, it is not
possible to reject the idea that all data have been obtained from the same distribution (left
plot). Notice how there is not a clear difference in the location of organo-mineral fertilizer
yields with respect to the conventional fertilizer yields. Furthermore, we may also argue that

the difference between the $P$-value and the classical significance level is small, and thus, for
such significance level the test should be declared inconclusive (additional observations might
arbitrarily change the test conclusion). This also suggests that the result obtained is almost at
the limit of the test resolution ( in this case) obtained by using the classical
significance level.
5.4. Surfactant Substitution in Emulsion Polymerization
A polymer latex manufacturer requires the substitution of a nonylphenol ethoxylate (NPEO)

surfactant used in the synthesis of acrylic polymer dispersions. Considering different
commercial alternatives available [10], the manufacturer decided to test a fatty alcohol
ethoxylate (FAEO) recommended by a surfactant supplier. Their main concern is the potential
change in the viscosity of the final dispersion (directly related to the final particle size
distribution). However, they are willing to tolerate true mean viscosity differences of up to
. The commercial batch size of their polymer latex is , with a total estimated
batch cost of . On the other hand, a lab-scale batch of has a total estimated
cost of about . Thus, the cost ratio for the test is about . The latex viscosity
obtained with the NPEO surfactant in the industrial reactor has been historically in
average with a standard deviation of . However, the behavior of the viscosity with the
new surfactant is unknown. Thus, an initial estimate of the test resolution can be used for
designing the test, and a corrected resolution can then be used to analyze the results. The
initial estimate of the test resolution is 0.9259. A two-sample one-tailed T-test is
selected for this evaluation. Two samples are used in order to rule out potential differences
caused by changes in the monomers and lab equipment used during the experiments. In
addition, a one-tailed test was chosen because it is, in general, more efficient than the two-
tailed test considering the same test resolution. Thus, the optimal design of the test yields:
optsize.mean.test(test="T",tails=1,dr=0.9259,costratio=120)
17 T 1 0.9259 120 0.03266932 0.06533864
It was then decided to prepare only 9 batches using the old NPEO surfactant, and 10 batches
using the new FAEO surfactant. The viscosity results obtained are summarized in Table 10.
Since both samples are small (less than 20 elements), they are checked for normality resulting
in the following positive -values: (NPEO) and (FAEO), indicating that
normality can be safely assumed.
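The figures in the design output can be cross-checked with a few lines of R. This is only a sketch: the variable names are hypothetical, and reading the first field as the design degrees of freedom and the last two fields as the optimal significance level and the total test error is an interpretation of the printed line, not something the output labels explicitly.

```r
# Values copied from the design output; the labels are assumed readings.
design_df   <- 17          # first field, read as degrees of freedom
alpha_opt   <- 0.03266932  # read as the optimal significance level (type I error)
total_error <- 0.06533864  # read as the total test error (type I + type II)

# If the total error is the sum of both errors, the type II error follows.
beta_opt <- total_error - alpha_opt

# Degrees of freedom of a two-sample T-test with the batches actually prepared.
n_npeo <- 9
n_faeo <- 10
planned_df <- n_npeo + n_faeo - 2

print(c(alpha = alpha_opt, beta = beta_opt))
print(c(design_df = design_df, planned_df = planned_df))
```

Under this reading both errors are equal, and the 9 + 10 batches give exactly the 17 degrees of freedom requested by the design.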

Table 10. Acrylic polymer latex viscosity obtained using two different surfactants.
NPEO    FAEO
789     737
781     797
650     751
719     930
703     667
767     806
752     784
646     762
756     689
-       741
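The normality check applied to the two samples can be illustrated with base R. The sketch below uses the Shapiro-Wilk test as a stand-in; it is not necessarily the normality test used in this report (see [4]), and the vector names NP and FA are simply assumed to hold the two columns of Table 10.

```r
# Viscosity data transcribed from Table 10 (NP = NPEO, FA = FAEO).
NP <- c(789, 781, 650, 719, 703, 767, 752, 646, 756)
FA <- c(737, 797, 751, 930, 667, 806, 784, 762, 689, 741)

# Shapiro-Wilk test on each sample (a substitute for the report's own method).
# The null hypothesis of this test is that the sample is normally distributed.
normality <- sapply(list(NPEO = NP, FAEO = FA),
                    function(x) shapiro.test(x)$p.value)

print(normality)
```

The p-values from this substitute test need not coincide with the values quoted in the text, since a different normality criterion may have been used there.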
The overall standard deviation of the samples is √ . This
value is used to update the test resolution: 0.78125. The updated resolution is much lower than
the original resolution of the design (0.9259), and therefore, must be used with caution. In fact, the
result obtained with this resolution is inconclusive:
opt.mean.test(NP,FA,mu=0,test="T",tails=1,dr=0.78125)
Inconclusive -0.031670 0.046797 0.10906 0.065674 16 T 1 0.78125 0.131349
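The p-value and the degrees of freedom in this output can be cross-checked with base R. The following is a sketch under two assumptions: that the procedure uses a Welch (unequal variances) one-tailed T-test, and that a pooled estimate is an acceptable way to summarize the overall standard deviation. Neither is confirmed by the printed output itself.

```r
# Viscosity data transcribed from Table 10 (NP = NPEO, FA = FAEO).
NP <- c(789, 781, 650, 719, 703, 767, 752, 646, 756)
FA <- c(737, 797, 751, 930, 667, 806, 784, 762, 689, 741)

# Sample standard deviations and a pooled estimate (assumed pooling rule).
sd_np <- sd(NP)
sd_fa <- sd(FA)
pooled_sd <- sqrt(((length(NP) - 1) * var(NP) + (length(FA) - 1) * var(FA)) /
                    (length(NP) + length(FA) - 2))

# Welch two-sample T-test, one-tailed: is the NPEO mean below the FAEO mean?
# t.test() uses var.equal = FALSE by default, so the degrees of freedom
# follow the Welch-Satterthwaite approximation.
welch <- t.test(NP, FA, alternative = "less", mu = 0)

print(round(c(sd_NPEO = sd_np, sd_FAEO = sd_fa, pooled_sd = pooled_sd), 2))
print(round(c(df = unname(welch$parameter), p_value = welch$p.value), 4))
```

With the Table 10 data, the Welch degrees of freedom fall between 16 and 17 and the one-tailed p-value is close to the 0.10906 shown in the output, which supports this reading. The larger spread of the FAEO sample is what pulls the degrees of freedom below the 17 expected from the design.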
The significance level and total test error increase as a result of the smaller test resolution. In
addition, the lower degrees of freedom obtained (with respect to those expected in the
design) indicate a heteroscedastic behavior between the samples. A simple comparison
between the p-value and the optimal significance level suggests that the null hypothesis
cannot be rejected, but for this test resolution such a conclusion is not necessarily reliable.
However, using the original test resolution considered during the test design (also providing a
lower test error) it is possible to confirm the conclusion that the viscosity is not significantly
affected by the surfactant substitution:
opt.mean.test(NP,FA,mu=0,test="T",tails=1,dr=0.9259)
Ho is not rejected -0.06767 0.05222 0.10906 0.03693 16 T 1 0.9259 0.07387
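The two outputs above differ only in the test resolution, yet they lead to different verdicts. A minimal sketch of the decision rule that appears to be at work is given below. It assumes that the first number in each output is the decision value (D-value) and the second its critical threshold; the function name, the label for the rejection case and the treatment of the exact boundary are hypothetical.

```r
# Classify a test result from its decision value and critical threshold.
# Assumed rule: values inside (-d_critical, d_critical) are inconclusive,
# positive values reject Ho, negative values do not reject Ho.
classify_decision <- function(d_value, d_critical) {
  if (abs(d_value) < d_critical) {
    "Inconclusive"
  } else if (d_value > 0) {
    "Ho is rejected"        # assumed label; not shown in the outputs above
  } else {
    "Ho is not rejected"
  }
}

# Values copied from the two outputs (updated and original resolution).
print(classify_decision(-0.031670, 0.046797))
print(classify_decision(-0.06767, 0.05222))
```

Applied to the two printed results, this rule returns the same verdicts as the outputs: the first decision value lies inside the critical band, while the second is negative and lies outside it.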
6. Conclusion
An objective method for determining the optimal sample size and the optimal significance level
for testing hypotheses about means (Z-tests and T-tests) is proposed in this report. The
method requires defining the resolution of the test, which can be obtained for the particular
case considering either sensor resolutions or process tolerances. In addition, a default
resolution value is proposed for each type of test based on the minimum test viability
condition. The optimal sample size is then obtained by minimizing the total cost of the test,
which is a compromise between the cost of experimentation (proportional to the sample size)
and the cost of an erroneous decision (proportional to the total test error). Since an exact
analytical expression for the optimal size is not available, empirical approximations are
proposed for each test type. The optimal significance level can be found by minimizing the total
test error for a particular sample size considered. For the tests considered in this report, the
optimal condition is found when both the type I and type II errors are identical.
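The cost compromise described above can be sketched numerically for the simplest case. The example below assumes a one-sample right-tail Z-test with known standard deviation, for which the textbook equal-error condition gives a type I error of pnorm(-dr * sqrt(n) / 2) at resolution dr. It uses a brute-force search rather than the empirical approximations proposed in this report, all function names are hypothetical, and it is not expected to reproduce the T-test design values of Section 5.4.

```r
# Total test error (type I + type II) of a right-tail one-sample Z-test
# when the critical value is placed so that both errors are equal.
total_error_z <- function(n, dr) {
  2 * pnorm(-dr * sqrt(n) / 2)
}

# Total cost expressed in units of the cost of one observation:
# n observations plus the cost ratio times the total test error.
relative_cost <- function(n, dr, cost_ratio) {
  n + cost_ratio * total_error_z(n, dr)
}

# Brute-force search over candidate sample sizes (illustrative inputs).
n_grid <- 1:200
costs  <- relative_cost(n_grid, dr = 0.9259, cost_ratio = 120)
n_opt  <- n_grid[which.min(costs)]

# Optimal significance level at that sample size (half of the total error).
alpha_opt <- total_error_z(n_opt, dr = 0.9259) / 2

print(c(n_opt = n_opt, alpha_opt = alpha_opt, cost = min(costs)))
```

A grid search is used here because the cost curve is very flat near its minimum, so neighbouring sample sizes differ only slightly in total cost.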
The interpretation of this approach is simplified by using the concept of decision value
(D-value). The D-value relates the conventional p-value obtained for each test with the optimal
significance level and the sample size (or degrees of freedom in the case of T-tests). Positive
D-values indicate positive results (rejecting the null hypothesis), whereas negative D-values
indicate negative results (rejecting the alternative hypothesis). A critical D-value threshold
around zero is also defined for inconclusive results. This is an unstable region where sampling
errors might lead to different conclusions. However, the decisions based on D-values are in
general less sensitive to sampling errors than the decisions based on p-values.
The codes for the proposed procedures implemented in R language are included, and different
practical examples using these procedures are presented and discussed.
Acknowledgments
This research did not receive any specific grant from funding agencies in the public,
commercial, or not-for-profit sectors.
References
[1] Frost, J. (2020). Hypothesis Testing: An Intuitive Guide for Making Data Driven Decisions.
Statistics by Jim Publishing, State College (PA, USA).
[2] Hernandez, H. (2020). Formulation and Testing of Scientific Hypotheses in the presence of
Uncertainty. ForsChem Research Reports, 5, 2020-01. doi: 10.13140/RG.2.2.36317.97767.
[3] Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Second edition.
Lawrence Erlbaum Associates Publishers, Hillsdale (NJ, USA). Chapter 8.
[4] Hernandez, H. (2021). Testing for Normality: What is the Best Method? ForsChem Research
Reports, 6, 2021-05. doi: 10.13140/RG.2.2.13926.14406.
[5] Student (Gosset, W. S.) (1908). The probable error of a mean. Biometrika, 6(1), 1-25.

[6] Jalee, E. A., & Aparna, K. (2016). Neuro-fuzzy soft sensor estimator for benzene toluene
distillation column. Procedia Technology, 25, 92-99.
[7] Ren, H. X., Ying, H. J., Ouyang, P. K., Xu, P., & Liu, J. (2013). Catalyzed synthesis of poly (l-
lactic acid) by macroporous resin Amberlyst-15 composite lactate utilizing melting
polycondensation. Journal of Molecular Catalysis A: Chemical, 366, 22-29.
[8] Hernandez, H. (2019). Sums and Averages of Large Samples Using Standard
Transformations: The Central Limit Theorem and the Law of Large Numbers. ForsChem
Research Reports, 4, 2019-01. doi: 10.13140/RG.2.2.32429.33767.
[9] Deeks, L. K., et al. (2013). A new sludge-derived organo-mineral fertilizer gives similar crop
yields as conventional fertilizers. Agronomy for sustainable development, 33(3), 539-549.
[10] Fernandez, A. M., Held, U., Willing, A., & Breuer, W. H. (2005). New green surfactants for
emulsion polymerization. Progress in organic coatings, 53(4), 246-255.

20/05/2021 ForsChem Research Reports Vol. 6, 2021-06 (1 / 45)
