
Minimum Sample Size Determination for Generalized Extreme Value Distribution

Article in Communications in Statistics - Simulation and Computation · January 2011
DOI: 10.1080/03610918.2010.530368


2 authors:

Yuzhi Cai, Swansea University
Dominic Hames, HR Wallingford



Journal: Communications in Statistics - Simulation and Computation
Manuscript ID: LSSP-2010-0250
Manuscript Type: Original Paper
Date Submitted by the Author: 07-Aug-2010
Complete List of Authors: Cai, Yuzhi; University of Plymouth, School of Computing and Mathematics
                          Hames, Dominic; HR Wallingford Limited
Keywords: Bootstrapping, Generalized extreme value distribution, Return level, Sample size


URL: http://mc.manuscriptcentral.com/lssp E-mail: comstat@univmail.cis.mcmaster.ca


Minimum sample size determination for generalized extreme value distribution

Yuzhi Cai (a)* and Dominic Hames (b)

(a) University of Plymouth, Plymouth, UK
(b) HR Wallingford, Oxfordshire, UK

Abstract

Sample size determination is an important issue in statistical analysis. Obviously, the larger the sample size is, the better the statistical results we have. However, in many areas such as coastal engineering and environmental sciences, it can be very expensive or even impossible to collect large samples. In this paper, we propose a general method for determining the minimum sample size required for estimating the return levels from a generalized extreme value distribution. Both simulation studies and applications to real data sets show that the method is easy to implement and that the results obtained are very good.

Key words: Bootstrapping, Generalized extreme value distribution, Return level, Sample size.

1 Introduction

Sample size is very important in statistical analysis. However, the sample size required for a study usually depends on factors such as the probability model used and the type of statistical test applied. Various methods for sample size determination have been proposed in the literature. For example, Shieh (2000) used the likelihood ratio test, Self and Mauritsen (1988) and Lubin and Gail (1990) applied the score test, and Bickel and Doksum (2001) and Demidenko (2007) considered the Wald test. Ashour and Shalaby (1983) derived some Bayesian and non-Bayesian estimators for sample size when the underlying distribution is a Weibull distribution, and Ashour et al. (1996) derived these estimators for the Burr type XII failure model. Abd-Elfattah and Bakoban (2003) obtained estimators for sample size in the case of a generalized gamma distribution, and Abd-Elfattah et al. (2007) used the Marcus and Blumenthal (1975) approach to estimate the sample size when the distribution is the Lomax distribution, i.e. the Pareto distribution of the second kind.

* Address for correspondence: Dr Yuzhi Cai, School of Computing and Mathematics, University of Plymouth, Plymouth PL4 8AA, United Kingdom. Email: ycai@plymouth.ac.uk

The generalized extreme value (GEV) distribution has been used widely in many areas because extreme value theory guarantees that group maxima follow a GEV distribution in the limit as the group size tends to infinity. For example, in coastal engineering, we could assume that the maximum annual sea-levels approximately follow the GEV distribution, and hence a GEV model is often fitted to annual maximum data. In this paper, we study sample size determination when the underlying distribution is the GEV distribution, and we assume that the group size is large enough for us to use the GEV model. In reality, however, only limited data sets are available, or it can be very expensive or even impossible to collect a large number of such data. Therefore, it is very important to develop a method for assessing the accuracy of a study or the minimum sample size required for such a study.

Let $x_i$ $(i = 1, \ldots, n)$ be the maximum value of group $i$. It is well known that, under certain regularity conditions, the maximum likelihood estimators (MLEs) of the parameters are normally distributed as $n$ approaches infinity. However, for any finite value of $n$, the distributions of the estimators are unknown, and hence the distributions of any functions of the estimators, such as return levels, are also unknown.

The main purpose of the paper is to develop a method for determining the minimum sample size based on the asymptotic distribution of MLEs. We focus on the return level because it is a very important quantity in coastal engineering and environmental sciences and it provides crucial information in, for example, the design of flood defences. The developed methodology can be easily extended to other models and other quantities of interest. The paper is arranged as follows. In Section 2, we develop the methodology for determining the sample size. Simulation studies are given in Section 3 and applications in Section 4. Some discussion and conclusions are given in Section 5.

2 The method

The GEV distribution function is defined by

\[ F(x; \mu, \sigma, \xi) = \exp\left\{ -\left[ 1 + \xi \left( \frac{x - \mu}{\sigma} \right) \right]^{-1/\xi} \right\} \tag{1} \]

for $1 + \xi \left( \frac{x - \mu}{\sigma} \right) > 0$, where $\mu \in \mathbb{R}$ is the location parameter, $\sigma > 0$ the scale parameter and $\xi \in \mathbb{R}$ the shape parameter. Let $\beta = (\mu, \sigma, \xi)$ be the parameter vector. Suppose we are interested in estimating the $\tau$th quantile of the GEV, which is given by

\[ x = \mu - \frac{\sigma}{\xi} \left[ 1 - (-\ln \tau)^{-\xi} \right], \tag{2} \]

where $0 < \tau < 1$. So, if $\tau = 0.99$, the 0.99 quantile corresponds (in an analysis of annual maxima) to the 100-year return level. We need to decide the minimum sample size for which statistical inferences on the return level are reliable. In what follows we illustrate our method by using the 100-year return level; any other quantity of interest can be dealt with similarly.
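
As an illustration of equation (2), the following minimal Python sketch evaluates the GEV quantile and converts a return period in years to the corresponding probability. The function names `gev_quantile` and `return_period_to_tau` are hypothetical, and the Gumbel branch for a zero shape parameter is an added assumption that the text does not discuss. The parameter values are those of Simulation study 1 in Section 3.

```python
import math


def gev_quantile(tau, mu, sigma, xi):
    """tau-th quantile of the GEV distribution, i.e. the x solving F(x) = tau."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    y = -math.log(tau)
    if xi == 0.0:
        # Assumption: Gumbel limit of equation (2) as xi -> 0.
        return mu - sigma * math.log(y)
    # Equation (2): x = mu - (sigma / xi) * [1 - (-ln tau)^(-xi)]
    return mu - sigma / xi * (1.0 - y ** (-xi))


def return_period_to_tau(years):
    """Annual-maximum return period T corresponds to the 1 - 1/T quantile."""
    return 1.0 - 1.0 / years


level_100 = gev_quantile(return_period_to_tau(100), mu=100.0, sigma=13.0, xi=-0.11)
print(round(level_100, 2))  # 146.93
```

The printed value agrees with the 100-year return level that the paper reports for this model.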
The basic idea of the proposed method is as follows. If the sample size $n$ is large enough, then the maximum likelihood estimator (MLE) of $\beta$ should be approximately normally distributed, and hence so should any function of the MLE, such as the return level. As we only have one observed data set, it would be difficult to check the normality of the MLE of the return level. Therefore, we propose to apply the bootstrap method (resampling with replacement) to the observed data set. Suppose we generate $M$ bootstrap samples, each of size $n$. Then we should expect the MLEs of the return level based on the bootstrap samples to be normally distributed as well. Standard bootstrap methodology guarantees that the MLEs form a random sample from a normal distribution for the return level. In this paper, we use the Shapiro-Wilk test to check the normality of the MLEs of the return level. If we have to reject the null hypothesis that the MLEs are normally distributed, then we may conclude that the sample size $n$ is not large enough at the given significance level. In that case we need to collect more data and apply the above procedure to the enlarged sample. This process should be continued until we fail to reject the null hypothesis. The details of the method are given below.

Let $y_1 = (y_{11}, \ldots, y_{1n})$ be the observed data.

Step 1. Obtain $M - 1$ bootstrap samples (with replacement) from $y_1$, denoted by $y_m = (y_{m1}, \ldots, y_{mn})$, where $m = 2, \ldots, M$.

Step 2. For $m = 1, \ldots, M$, let $x_{mj}^{(1)}$ be the first $j$ values of $y_m$. That is, $x_{mj}^{(1)} = (y_{m1}, \ldots, y_{mj})$, where $j = n_0, \ldots, n$ and $n_0$ is large enough for parameter estimation.

Step 3. For fixed $j$, obtain $K - 1$ bootstrap samples (with replacement) from $x_{mj}^{(1)}$, denoted by $x_{mj}^{(k)}$, where $k = 2, \ldots, K$.

Step 4. Fit a GEV model to each data set $x_{mj}^{(k)}$, and calculate the return level $\ell_{mj}^{(k)}$, $m = 1, \ldots, M$, $j = n_0, \ldots, n$, $k = 1, \ldots, K$.

Step 5. Carry out the Shapiro-Wilk normality test on $\ell_{mj}^{(k)}$ $(k = 1, \ldots, K)$ and record the p-value of the test statistic, denoted by $p_{mj}$, where $m = 1, \ldots, M$ and $j = n_0, \ldots, n$.

Step 6. For $j = n_0, \ldots, n$ calculate

\[ \bar{p}_j = \frac{1}{M} \sum_{m=1}^{M} p_{mj}, \qquad s_j^2 = \frac{1}{M - 1} \sum_{m=1}^{M} (p_{mj} - \bar{p}_j)^2, \]

and construct a $100(1 - \alpha)\%$ confidence interval

\[ \bar{p}_j \pm t^* s_j, \]

where $t^*$ is the critical value obtained from a t-distribution with $M - 1$ degrees of freedom.

Step 7. Let $N$ be the number such that for any $j > N$ we have $\bar{p}_j - t^* s_j > \alpha$. As $\bar{p}_j - t^* s_j$ is the lower limit of the confidence interval, we are $100(1 - \alpha)\%$ confident that the minimum number of samples required for obtaining a properly estimated return level is $N$.
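
The resampling structure of Steps 1 to 7 can be sketched in Python as follows. This is only a sketch under stated assumptions, not the authors' implementation (the paper reports using R). The function names `pvalue_table` and `minimum_sample_size` are hypothetical; the GEV fit and the normality test are passed in as callables so that the example stays self-contained; the critical value `t_crit` is supplied by the caller (for M = 50 the two-sided 95% value with 49 degrees of freedom is about 2.01); and N is read as the smallest j from which the lower limit stays above α up to the largest j, which is one possible reading of Step 7.

```python
import random
import statistics


def pvalue_table(y, n0, M, K, return_level, normality_pvalue, seed=0):
    """Steps 1-5: return {j: [p_1j, ..., p_Mj]} for j = n0, ..., len(y)."""
    rng = random.Random(seed)
    n = len(y)
    # Step 1: the observed series plus M - 1 bootstrap resamples of it.
    series = [list(y)] + [rng.choices(y, k=n) for _ in range(M - 1)]
    table = {j: [] for j in range(n0, n + 1)}
    for y_m in series:
        for j in range(n0, n + 1):
            x1 = y_m[:j]  # Step 2: first j values of y_m
            # Step 3: the subsample itself plus K - 1 resamples of it.
            subsamples = [x1] + [rng.choices(x1, k=j) for _ in range(K - 1)]
            # Step 4: one estimated return level per subsample.
            levels = [return_level(x) for x in subsamples]
            # Step 5: p-value of the normality test on the K return levels.
            table[j].append(normality_pvalue(levels))
    return table


def minimum_sample_size(table, t_crit, alpha=0.05):
    """Steps 6-7: lower limits p_bar_j - t* s_j and the resulting N (or None)."""
    sizes = sorted(table)
    lower = {}
    for j in sizes:
        p_bar = statistics.mean(table[j])
        s_j = statistics.stdev(table[j])  # divisor M - 1, as in Step 6
        lower[j] = p_bar - t_crit * s_j
    N = None
    # Walk back from the largest j while the lower limit stays above alpha.
    for j in reversed(sizes):
        if lower[j] <= alpha:
            break
        N = j
    return N, lower
```

In a real analysis one might pass wrappers around `scipy.stats.genextreme.fit` (whose shape parameter `c` has the opposite sign to ξ as used here) and `scipy.stats.shapiro` as `return_level` and `normality_pvalue`. The cost is M × K × (n − n0 + 1) model fits, so small values of M and K are advisable for a first trial run.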

It is worth mentioning that the standard theory of maximum likelihood was developed for the case when the support of the distribution does not depend on unknown parameters; see, for example, Rao (1993). For the GEV distribution this is not the case, because the support of the distribution must satisfy $1 + \xi(x - \mu)/\sigma > 0$. However, Smith (1985) studied this problem thoroughly and showed that (a) if $\xi > -0.5$, then the MLEs of the model parameters are asymptotically normally distributed; (b) if $-1 < \xi < -0.5$, then the MLEs can be obtained but are not asymptotically normally distributed; and (c) if $\xi < -1$, then it is usually not possible to obtain the MLEs. Note that when $\xi \le -0.5$, the GEV distribution has a very short bounded right tail which, as Coles (2001) pointed out, is rarely encountered in applications of extreme value analysis. Therefore the theoretical limitations of the maximum likelihood method have little effect in practice, and in this paper we focus on the case $\xi > -0.5$. Indeed, all the real data sets considered in this paper have a reasonably long right tail, and hence the developed method can be safely applied.

To investigate the performance of the proposed method, we carried out extensive simulation studies, which are given in the next section.

3 Simulation studies

Simulation study 1

Consider the GEV model given by

\[ F(x; \mu, \sigma, \xi) = \exp\left\{ -\left[ 1 - 0.11 (x - 100)/13 \right]^{1/0.11} \right\} \tag{3} \]

for $1 - 0.11 (x - 100)/13 > 0$. So, the true parameter values are $\mu = 100$, $\sigma = 13$ and $\xi = -0.11$. Furthermore, the 100-year return level is given by

\[ x = 100 + 13 \left[ 1 - (-\ln 0.99)^{0.11} \right] / 0.11 = 146.93. \tag{4} \]

Note that the true parameter values were chosen arbitrarily.

The statistical software R was used to generate a random sample of size $n = 200$ from the above GEV model. In this simulation study, we take $M = K = 50$, $n_0 = 25$ and $\alpha = 0.05$. By applying the developed method to this data set, we obtained the results shown in Figure 1, where the lighter curve shows the average p-values $\bar{p}_j$ against the sample size $j = 25, \ldots, 200$, and the darker curves are the corresponding lower and upper limits of the 95% confidence interval, i.e. $\bar{p}_j \pm t^* s_j$ $(j = 25, \ldots, 200)$. The horizontal line is at $\alpha = 0.05$, and the vertical line is at $N = 40$. It is seen that for $j \ge 40$ all curves are above the horizontal line, indicating that the minimum sample size we could take is $N = 40$ in this case. So if the data represent the maximum annual values of sea levels at a particular site, then 40 years' data would give a good estimate of the return levels in terms of the normality of the MLE.
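
The authors generated their sample in R. As a rough Python stand-in, the sketch below draws GEV variates by inverse-transform sampling, i.e. by plugging uniform random numbers into the quantile formula of equation (2). The function name `sample_gev` and the seed are hypothetical choices made for illustration, and the sketch assumes a non-zero shape parameter.

```python
import math
import random


def sample_gev(n, mu, sigma, xi, seed=None):
    """Draw n GEV variates by inverse-transform sampling (assumes xi != 0)."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < n:
        u = rng.random()  # uniform on [0, 1)
        if u == 0.0:
            continue  # -log(0) is undefined, so skip this edge case
        y = -math.log(u)  # y > 0 because 0 < u < 1
        # Equation (2) evaluated at tau = u.
        draws.append(mu - sigma / xi * (1.0 - y ** (-xi)))
    return draws


# Parameters of model (3); the seed is arbitrary.
data = sample_gev(200, mu=100.0, sigma=13.0, xi=-0.11, seed=1)
# Every draw must respect the support constraint 1 - 0.11 (x - 100) / 13 > 0.
print(all(1.0 - 0.11 * (x - 100.0) / 13.0 > 0.0 for x in data))
```

Because the shape parameter is negative here, the distribution has a finite upper end point at µ − σ/ξ, and the final line checks that no simulated value exceeds it.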

Figure 1: Plot of the average p-values of the Shapiro-Wilk normality test statistics together with a 95% confidence interval in Simulation study 1. [Axes: sample size (horizontal), p-values (vertical).]

Now we compare the two fitted GEV models: one was fitted to the first 40 values of the simulated data, the other to all the data. The true and the estimated values are shown in Table 1. As expected, the standard errors of the MLEs based on the whole data set are smaller than those based on the first 40 values. However, the estimated values are all very similar to the true parameter values. Note that the standard error of the estimated return level was obtained by using the delta method, and that a 95% confidence interval for each parameter can be constructed easily by using the estimated standard errors. The diagnostic plots in Figure 2 also show that the two fitted models are very similar.

                True value   MLE (s.e.), n = 40   MLE (s.e.), n = 200
µ               100.00       99.16 (2.13)         98.75 (0.95)
σ               13.00        12.38 (1.44)         12.10 (0.67)
ξ               -0.11        -0.10 (0.08)         -0.08 (0.05)
Return level    146.93       144.95 (8.02)        145.57 (4.44)

Table 1: True and estimated parameter values together with their standard errors (in brackets).
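
The delta method mentioned above can be sketched as follows: the variance of the estimated return level is approximated by g' V g, where g is the gradient of equation (2) with respect to (µ, σ, ξ) and V is the covariance matrix of the MLEs. The function names are hypothetical. The paper does not report V, so the diagonal matrix below, built from the n = 40 standard errors in Table 1, is purely an illustrative assumption that ignores the covariances; its result should therefore not be expected to reproduce the reported 8.02.

```python
import math


def return_level(mu, sigma, xi, tau=0.99):
    """Equation (2) for xi != 0."""
    y = -math.log(tau)
    return mu - sigma / xi * (1.0 - y ** (-xi))


def return_level_gradient(mu, sigma, xi, tau=0.99):
    """Partial derivatives of equation (2) with respect to (mu, sigma, xi)."""
    y = -math.log(tau)
    d_mu = 1.0
    d_sigma = -(1.0 - y ** (-xi)) / xi
    d_xi = (sigma * (1.0 - y ** (-xi)) / xi ** 2
            - sigma * y ** (-xi) * math.log(y) / xi)
    return [d_mu, d_sigma, d_xi]


def delta_method_se(grad, cov):
    """Standard error sqrt(g' V g) for gradient g and covariance matrix V."""
    size = len(grad)
    var = sum(grad[i] * cov[i][j] * grad[j]
              for i in range(size) for j in range(size))
    return math.sqrt(var)


# Assumption: diagonal V from the n = 40 standard errors, covariances set to 0.
grad = return_level_gradient(99.16, 12.38, -0.10)
cov = [[2.13 ** 2, 0.0, 0.0],
       [0.0, 1.44 ** 2, 0.0],
       [0.0, 0.0, 0.08 ** 2]]
print(delta_method_se(grad, cov))
```

With the full covariance matrix from a fitted model in place of the diagonal stand-in, the same two functions give the delta-method standard error.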

Figure 2: The probability plot and the return level plot of the GEV models fitted to the first N = 40 simulated data (left column), and to the whole data set (right column) respectively in Simulation study 1. On the return level plots, the top and the bottom curves give 95% confidence bands for the return levels. [Panels: Probability Plot (Empirical vs. Model) and Return Level Plot (Return Period vs. Return Level).]

Simulation study 2

In simulation study 1 we only dealt with one simulated data set. In this simulation study, we repeat the above simulation study on 70 independently simulated data sets, each of size 200, so that the average performance of the method can be assessed.

For each simulated data set and for each sample size, we recorded the sample average p-value of the test statistic. The final average p-values are then obtained by averaging over the 70 simulated data sets. Figure 3 shows the average p-values against the sample size and the corresponding 95% confidence interval. The lower limit of the confidence interval at a sample size $N = 40$ is 0.047, and for any sample size $n > 40$, all the curves are above the horizontal line. This suggests that, on average, the performance of the method is very stable over different simulated data sets.

Figure 3: Average p-values over 70 different simulated data sets together with a 95% confidence interval in Simulation study 2. [Axes: sample size (horizontal), p-values (vertical).]

4 Applications

In the applications of the developed method, we consider four real data sets obtained from Reiss and Thomas (2001). The first data set is the maximum annual sea-levels (in meters) in Venice from 1931 to 1981, containing 51 values. The mean sea level at Venice is 0.52 m. In this application we consider the sea levels relative to the mean sea level. The second data set is the Iceland maximum annual wind speed (in meters per second) in the years 1912 to 1992, with measurements for two years missing. We ignore the missing data in this application, hence the data set contains 79 values. The third data set is the maximum annual discharges (in cubic meters per second) of the Harricana River at Amos (Quebec, Canada) from 1915 to 1983, containing 69 values. The fourth data set is the maximum annual De Bilt temperatures (in degrees Celsius) from 1849 to 1981, containing 133 values. The scatter plots of the four data sets are given in Figure 4.

None of the four data sets is long. We would like to estimate the minimum sample size required for making reasonable statistical inferences on the 100-year return level, and hence to see what confidence we can place in these estimates and whether further data are required.

Because the data are annual maxima, it is reasonable to fit a GEV model to each data set, hence the developed method can be used. For each of them, the smallest sample size considered was taken to be 25, because for smaller sample sizes we often have difficulties in fitting a GEV model. The largest sample size was taken to be the total number of observations in each data set. Furthermore, the number of bootstrap samples obtained from each data set and from its subsets with different sample sizes is 50. That is, we let $M = K = 50$. Larger values give very similar results.

Figure 4: Scatter plots of (a) the maximum annual sea-levels in Venice, (b) the maximum annual wind speed in Iceland, (c) the maximum annual discharges of the Harricana River and (d) the maximum annual De Bilt temperature data. [Axes: year (horizontal); (a) sea-levels (m), (b) wind speeds (m/s), (c) discharges (m3/s), (d) temperatures (C) (vertical).]

Sample size       Sea-levels     Wind-speeds    Discharges        Temperatures
Return level 1    1.78 (0.113)   79.53 (2.79)   327.52 (31.100)   36.00 (0.889)
Return level 2    1.78 (0.110)   84.42 (3.13)   326.78 (23.454)   35.99 (0.358)

Table 2: Estimated return levels, where Return level 1 corresponds to the model fitted to data of minimum sample size, while Return level 2 corresponds to the model fitted to all the data. Numbers in brackets are the corresponding standard errors.

Figure 5 shows the average p-values of the Shapiro-Wilk test statistic together with a 95% confidence interval for each data set, where the horizontal lines are at $\alpha = 0.05$, and the vertical lines are at 50, 72, 44 and 44 respectively, indicating the minimum sample size required in each case. Note that the total sample size of the Venice sea-levels is only 51. Figure 5(a) shows that it is difficult to draw a sensible conclusion about the minimum sample size required in this case, so more data need to be collected. Figure 5(b) shows that all three curves are above the horizontal line for sample sizes greater than 71, which suggests that the minimum sample size can be taken as $N = 72$. Both Figure 5(c) and (d) suggest that the minimum sample size can be taken as $N = 44$. Therefore, except for the Venice sea-levels data set, all the data sets are large enough for making good statistical inferences on the return level.

Figure 5: Average p-values and the corresponding 95% confidence interval for (a) the maximum annual sea-levels in Venice, (b) the maximum annual wind speed in Iceland, (c) the maximum annual discharges of the Harricana River and (d) the maximum annual De Bilt temperature data. [Axes: sample size (horizontal), p-values (vertical).]

To compare the model fitted to the data of minimum sample size with the model fitted to the whole data set, we produced Table 2, which shows no significant differences at the 95% level between the two estimated return levels for each data set: one is based on the minimum sample size and the other on the whole data set. This is because the corresponding 95% confidence intervals overlap. Note that the larger standard error in the second row of Table 2 for the Iceland maximum annual wind speed data is caused by the approximate nature of the delta method in calculating return levels for this data set.

Figure 6 shows the return level plots for the sea-level and wind-speed data sets, while Figure 7 shows those for the discharge and temperature data sets. In both figures, the left column corresponds to the minimum sample size, and the right column corresponds to the whole data set. These figures also show no significant differences in the estimated return levels caused by using the minimum sample sizes.

Figure 6: The return level plots of the GEV models fitted to the data of minimum sample sizes (left column), and the return level plots of the GEV models fitted to the whole data sets (right column). The first row is for the Venice sea-level data, and the second row is for the Iceland wind-speed data. [Axes: return period (horizontal), return level (vertical).]

Figure 7: The return level plots of the GEV models fitted to the data of minimum sample sizes (left column), and the return level plots of the GEV models fitted to the whole data sets (right column). The first row is for the discharges of the Harricana River data, and the second row is for the De Bilt temperature data. [Axes: return period (horizontal), return level (vertical).]

Further statistical inferences about the effect of the sample size on the accuracy of the return levels can be carried out. We should expect that, in general, the standard error of the MLE of the return level decreases as the sample size increases. These standard errors tell us how much we gain by increasing the sample size, in other words, how much it would be worth paying to obtain a longer data record. Figure 8 shows the average standard errors of the MLEs of the return level when the sample size increases from the minimum sample size by up to another 300 observations. Those 300 extra observations were simulated from the model fitted to the observed sample of minimum length. The average standard errors are based on 50 independently simulated samples, each of size 300. From these plots, say from Figure 8(b), we see that if we require the standard error of the MLE to be less than 2.0, then we need at least 120 observations instead of 72. That means that an extra 48 observations need to be collected in order to decrease the standard error of the MLE of the return level from 2.79 to 2.0. This would provide useful information for a company deciding whether new data are worth collecting.
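
The interval-overlap argument used for Table 2 can be checked with a few lines of Python. The sketch below builds normal-approximation (Wald) intervals, estimate ± z × standard error, from the values in Table 2 and tests whether each pair overlaps. The use of the normal quantile z ≈ 1.96 and the function names are assumptions, since the paper does not state exactly how its intervals were formed.

```python
from statistics import NormalDist


def wald_interval(estimate, se, level=0.95):
    """Normal-approximation interval: estimate +/- z * se."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * se, estimate + z * se


def intervals_overlap(a, b):
    """True when the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# (estimate, s.e.) pairs from Table 2: minimum sample size vs. all data.
table2 = {
    "Sea-levels": ((1.78, 0.113), (1.78, 0.110)),
    "Wind-speeds": ((79.53, 2.79), (84.42, 3.13)),
    "Discharges": ((327.52, 31.100), (326.78, 23.454)),
    "Temperatures": ((36.00, 0.889), (35.99, 0.358)),
}

overlaps = {
    name: intervals_overlap(wald_interval(*fit_min), wald_interval(*fit_all))
    for name, (fit_min, fit_all) in table2.items()
}
print(overlaps)  # every entry is True
```

Overlapping intervals are a conservative, informal check rather than a formal test of equality, which is consistent with how the comparison is used in the text.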

5 Comments and conclusion

In this paper, we proposed a simple method to determine the minimum sample size required for making statistical inferences on a statistical quantity of interest. The developed methodology is based on the standard theory of maximum likelihood, and we illustrated the method through the estimation of the 100-year return level by using a GEV model. It is worth mentioning that the method proposed in this paper can also be used for other probability models and other quantities of interest, provided that the MLE is normally distributed in large samples. However, we would expect the minimum sample sizes required for making statistical inferences on different quantities to be different. Simulation studies and application results show that the method can be easily implemented and that the fitted models based on the minimum sample size are very similar to the fitted models based on all the available data.

In this paper we have also demonstrated different applications of the developed method. In summary, this method can be used (a) to check whether an existing data set is long enough for making good statistical inference, (b) to decide how many more data need to be collected for making good statistical inference and (c) to assess how much we gain in terms of confidence if we increase the sample size. Therefore, we expect that the developed method can be very useful in practice.

Acknowledgement

We would like to express our sincere thanks to the referees for their very constructive comments and suggestions, which have greatly enhanced the quality and the presentation of the paper.

Figure 8: The plots of the average standard errors of the MLEs of the return level against sample sizes for (a) the maximum annual sea-levels in Venice, (b) the maximum annual wind speed in Iceland, (c) the maximum annual discharges of the Harricana River and (d) the maximum annual De Bilt temperature data. [Axes: sample size (horizontal), std of return levels (vertical).]

References

[1] Abd-Elfattah, AM., Alaboud, FM. and Alharby, AH. (2007). On sample size estimation for Lomax distribution. Australian Journal of Basic and Applied Sciences 1: 373-378.

[2] Abd-Elfattah, AM. and Bakoban, RA. (2003). A study on the estimation of sample size for generalized Gamma distribution. The Scientific Journal for Economic and Commerce, Faculty of Commerce, Ain Shams University 2: 17-28.

[3] Ashour, SK., Abd-Elfattah, AM. and Mahmoud, MR. (1996). Estimation of sample size for the Burr failure model. The Annual Conference in Statistics, Computer Science and Operations Research, ISSR, Cairo University 31: 18-26.

[4] Ashour, SK. and Shalaby, OA. (1983). Estimating sample size with Weibull failure. Math. Operationsforsch. U. Statist. Ser. Statist., Berlin 2: 263-268.

[5] Bickel, PJ. and Doksum, KA. (2001). Mathematical Statistics (2nd edn). Prentice-Hall: Upper Saddle River, NJ.

[6] Coles, S. (2001). An introduction to statistical modelling of extreme values. Springer Series in Statistics.

[7] Demidenko, E. (2007). Sample size determination for logistic regression revisited. Statistics in Medicine 26: 3385-3397.

[8] Lubin, JH. and Gail, MH. (1990). On power and sample size for studying features of the relative odds of disease. American Journal of Epidemiology 131: 552-566.

[9] Marcus, R. and Blumenthal, S. (1975). Estimating population size with exponential failure. Journal of the American Statistical Association 70: 913-922.

[10] Rao, RC. (2002). Linear Statistical Inference and Its Applications (2nd edn). Wiley: New York.

[11] Reiss, RD. and Thomas, M. (2001). Statistical analysis of extreme values. Birkhäuser Verlag: Basel-Boston-Berlin.

[12] Smith, RL. (1985). Maximum likelihood estimation in a class of non-regular cases. Biometrika 72: 67-90.

[13] Shieh, G. (2000). On power and sample size calculations for likelihood ratio tests in generalized linear models. Biometrics 56: 1192-1196.

[14] Self, SG. and Mauritsen, RH. (1988). Power/sample size calculations for generalized linear models. Biometrics 44: 79-86.
Responses to the referees’ comments on the paper
“Minimum sample size determination for generalized extreme value distribution”

Yuzhi Cai, University of Plymouth, UK.
Dominic Hames, HR Wallingford, UK.

We sincerely thank the referees for their very helpful comments and suggestions, according to which the paper has been modified. A separate paragraph has been added at the end of Section 2 to explain why the developed method can be applied to GEV distributions. We have also re-emphasized the asymptotic normality condition in the last section of the paper. An acknowledgement has also been added to the paper to thank the referees for their comments, which have significantly improved the quality and the presentation of the paper. The detailed responses are given below.

Referee’s comments:

The standard theory of maximum likelihood is developed for the case when the distribution domain does not depend on unknown parameters, e.g. C.R. Rao (1993), "Linear Statistical Inference and Its Applications," New York: Wiley. Unfortunately, the domain of the GEV, as defined in equation (1), does, namely, 1 + ξ(x - μ)/σ > 0. The authors should provide theoretical justification of why the MLE is normally distributed in large samples.

Responses:

This is an excellent comment, many thanks! A separate paragraph has been added to the paper to discuss this problem. Smith (1985) studied this problem thoroughly and showed that (a) if ξ > -0.5, then the MLEs of the model parameters are asymptotically normally distributed; (b) if -1 < ξ < -0.5, then the MLEs can be obtained but are not asymptotically normally distributed; and (c) if ξ < -1, then it is usually not possible to obtain the MLEs. Note that when ξ <= -0.5, the GEV distribution has a very short bounded right tail which, as Coles (2001) pointed out, is rarely encountered in applications of extreme value analysis. Therefore the theoretical limitations of the maximum likelihood method have little effect in practice, and in this paper we focus on the case ξ > -0.5. Indeed, all the real data sets considered in this paper have a reasonably long right tail, and hence the developed method can be safely applied.
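
The three cases attributed to Smith (1985) above, together with the parameter-dependent support 1 + ξ(x - μ)/σ > 0 raised by the referee, can be illustrated with a minimal Python sketch. The function names `mle_regime` and `support_bounds` are hypothetical and are not taken from the paper. The boundary values ξ = -0.5 and ξ = -1 are not assigned to a case in the text, so placing them in the less favourable case is an assumption made here only for illustration.

```python
import math


def mle_regime(xi: float) -> str:
    """Classify the GEV shape parameter xi by the cases quoted from Smith (1985).

    Assumption: the boundary values xi = -0.5 and xi = -1 are not covered by
    the text; this sketch conservatively puts them in the less favourable case.
    """
    if xi > -0.5:
        return "regular"       # MLEs asymptotically normally distributed
    if xi > -1.0:
        return "non-regular"   # MLEs obtainable, but not asymptotically normal
    return "unobtainable"      # MLEs usually cannot be obtained


def support_bounds(mu: float, sigma: float, xi: float) -> tuple:
    """Return (lower, upper) bounds of the set {x : 1 + xi*(x - mu)/sigma > 0}."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if xi > 0:
        return (mu - sigma / xi, math.inf)    # bounded below, unbounded right tail
    if xi < 0:
        return (-math.inf, mu - sigma / xi)   # bounded right tail
    return (-math.inf, math.inf)              # xi = 0: constraint holds for all x


if __name__ == "__main__":
    # Illustrative parameter values only; they are not estimates from the paper.
    for xi in (0.2, -0.3, -0.7, -1.5):
        print(xi, mle_regime(xi), support_bounds(0.0, 1.0, xi))
```

A check of this kind could be run on a fitted shape parameter before relying on normal-theory standard errors: only the "regular" case corresponds to the condition ξ > -0.5 under which the paper's method is stated to apply.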

The last section of the paper has also been modified to re-emphasize the importance of the asymptotic normality condition.

The reference C.R. Rao (1993) has also been added to the paper.