Sas Simulation

A Remark on Efficient Simulations in SAS

Author(s): Ilya Novikov


Source: Journal of the Royal Statistical Society. Series D (The Statistician), Vol. 52, No. 1
(2003), pp. 83-86
Published by: Wiley for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/4128171
Accessed: 10-03-2017 04:52 UTC

This content downloaded from 141.211.4.224 on Fri, 10 Mar 2017 04:52:29 UTC
All use subject to http://about.jstor.org/terms
The Statistician (2003)
52, Part 1, pp. 83-86

A remark on efficient simulations in SAS

Ilya Novikov
Gertner Institute for Epidemiology and Health Policy Research, Tel Hashomer, Israel

[Received February 2002. Revised August 2002]

Summary. SAS software is often used for statistical simulations. The paper demonstrates a
simple and effective approach for performing simulations in SAS. The method that is usually
used generates data sets and performs the calculations sequentially in time, using the macro
%DO loop to execute each cycle. The more efficient approach is to generate one large data set
that includes all the individual data sets from each cycle as subsets and to perform the calculations
on each subset in one pass, using the BY command. The paper presents an example
of both methods of simulation, with their SAS codes. Both programs give the same numerical
results, but the BY approach is 80 times faster than the macro %DO approach.

Keywords: Bootstrap; SAS; SAS programming; Simulations

1. Simulations with SAS

The efficient programming of simulations has become an important part of statistical


Among 234 original papers published in Statistics in Medicine in 2000, as many as 81 were
indexed under the subjects 'bootstrap' or 'simulation'. Almost none of them discussed the
computational algorithms of simulation; only one presented a full program (in S-PLUS) and one
showed a part of the SAS program used.
SAS (SAS Institute, 1999) is a very commonly used statistical package. Among the 81 papers
mentioned above, it was used in 16 of the 33 papers that specified the software employed.
Because of its common use, here we describe an approach in SAS that leads to short
simulation programs that can be used in many practically important situations. We shall
present an example. The recommended approach is already known and may be found in the
literature (e.g. Ambrosius and Hui (2000)), but it is not used as widely as it deserves.
There are different ways of organizing simulations in SAS. The usual space minimizing
approach to the mass calculations required is to select or generate an individual data set for each
calculation, to perform the calculations on this data set and to store the results in a
'results' data set. One proceeds in this manner sequentially for all data sets and, finally,
performs a summary analysis on the results data set. This approach can be executed by using the
macro %DO loop, which performs the data generation and analysis for each cycle.
The more efficient time minimizing approach is to generate one large (maybe very large) data
set, which includes all the individual data sets from each cycle as subsets, identified by the serial
number (SERIAL) of the subset. The necessary calculations are then performed only once,
using the SAS construction BY SERIAL (SAS Institute, 1999). This yields the results data set,
which is then submitted to summary calculations.
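
As a minimal sketch of the BY SERIAL pattern described above (not the program of Appendix A), the following generates all samples in one DATA step and analyses them in a single PROC MEANS call. The data set names sim and results, the seed and the sample sizes are illustrative assumptions only.

```sas
* Hypothetical sketch: 1000 samples of 50 standard normal values each;
data sim;
  retain seed 12345;          * illustrative seed, not taken from the paper;
  do serial = 1 to 1000;      * one value of SERIAL per simulated sample;
    do j = 1 to 50;
      call rannor(seed, y);   * same generator call as in Appendix A;
      output;
    end;
  end;
  drop seed;
run;

* One pass over the large data set gives one output record per SERIAL;
proc means data=sim noprint;
  by serial;
  var y;
  output out=results mean=my stderr=sey;
run;
```

Because the records are written in increasing order of SERIAL, no sorting step should be needed before the BY statement.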

Address for correspondence: Ilya Novikov, Biostatistical Unit, Gertner Institute for Epidemiology and Health
Policy Research, Tel Hashomer, 52621, Israel.
E-mail: ilian@gertner.health.gov.il

© 2003 Royal Statistical Society 0039-0526/03/52083

An example is the estimation of the coverage level for a normal-based confidence interval
(sample mean plus or minus 1.96 standard errors) for the expectation of a variable with a non-
normal distribution. Suppose that Y is a variable that with probability q equals 0, and with
probability p (= 1 - q) has a log-normal distribution with parameters m and V. Such problems
appear often in medical applications (Rahme et al., 2000).
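
The quantity that the interval is meant to cover can be written out explicitly. The short derivation below uses the notation of Appendix A (Y = XZ) and assumes, as the simulation code does, that X and Z are independent and that V is the variance of the underlying normal variable.

```latex
\begin{align*}
Y &= XZ, \qquad X \sim \mathrm{Bernoulli}(p), \qquad Z = e^{R}, \quad R \sim N(m, V),\\
E(Y) &= p\,E(Z) = p\,\exp\!\left(m + \tfrac{V}{2}\right),\\
\text{coverage} &= P\!\left(\bar{Y} - 1.96\,\mathrm{SE}(\bar{Y}) \le E(Y) < \bar{Y} + 1.96\,\mathrm{SE}(\bar{Y})\right).
\end{align*}
```

With the parameter values used in Appendix A (p = 0.7, m = 1.5, V = 1), this gives E(Y) = 0.7 exp(2), approximately 5.172.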
Appendix A shows two SAS programs that estimate this coverage level, which give exactly the
same numerical results, but, for 5000 samples with 200 subjects per sample, the usual approach
takes 7 min 2.37 s, whereas the recommended approach takes 5.00 s. The programs were run
in SAS 8.12 on an IBM personal computer with a Pentium III 733 MHz processor and 128
Mbytes of random access memory, operating under Windows 98.
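
The speed-up implied by these two run times can be checked with simple arithmetic:

```latex
\[
\frac{7 \times 60 + 2.37\ \text{s}}{5.00\ \text{s}} = \frac{422.37}{5.00} \approx 84.5 ,
\]
```

which agrees with the factor of roughly 80 quoted in the Summary. Spread over the 5000 cycles, the macro approach takes about 0.084 s per cycle.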

Acknowledgements
Thanks are due to Laurence Freedman (Gertner Institute, Israel) and Phil Gibbs (SAS Institute,
Cary, USA) who encouraged me to write this note.

Appendix A: Example of an SAS program


* Testing coverage probability of a normal-based confidence
interval for a two-component model;
* Y=X*Z, where P(x=1)=pl, z=exp(r) and r~N(m,V);
*-----------------------------------------------------------;

*(a) The usual macro approach;

options nosource nonotes; * == otherwise the log-window may overflow ==;
%macro cover(nobs,nsam,me,va,alpha,p,seed1,seed2);
data c; file print;
* ===== fixing the time of start of the macro approach ========;
time=time(); put 'START MACRO PROCESSING:' time=time16.6; drop time;
* ===== initialization of a summary data set ==========;
my=.; tm=.; sey=.; _type_=.; _freq_=.;
run;
%do i=1 %to &nsam; * ========== main loop ==========;
data a; * ==== generation of one individual data set ========;
pl=&p; m=&me; v=&va; sd=sqrt(v); seedx=&seed1; seedz=&seed2;
keep y truemean;
truemean=pl*exp(m+v/2);
do j=1 to &nobs;
call ranbin(seedx,1,pl,y);
if y=1 then do;
call rannor(seedz,y); y=exp(m+sd*y);
end;
output;
end; *=== end of generation of an individual data set ======;
*=== saving the current seed for obtaining the same seed flow as in the
BY approach ==;
call symput('seed1', trim(right(seedx)));
call symput('seed2', trim(right(seedz)));
run;
proc means data=a noprint; *=== calculations for an individual data set;
var y truemean;
output out=b mean(y truemean)=my tm stderr(y)=sey; *==saving current
results;
run;


proc append base=c force; *==== adding the last results to summary data
set c;
%end; * ===== end of the main loop =====;
data d; file print; * ===== summarizing calculations ===;
set c end=eof;
retain coverage 0 nsam;
coverage=coverage+((my-1.96*sey<=tm)&(my+1.96*sey>tm))/&nsam;
if eof then do;
put 'coverage=' coverage 8.5;
time=time(); * === fixing the time of the end of the macro approach ==;
put 'END MACRO PROCESSING:' time=time16.6;
end;
run;
%mend cover;
%cover(200,5000,1.5,1,0.05,0.7,4635209,3762973);

*(b) The more efficient BY approach;

data para; * ==== setting the parameters for simulations;
file print; *====== fix the starting time for the BY approach =====;
time=time(); put 'START BY PROCESSING:' time=time16.6;
pl=0.7; m=1.5; V=1; NOBS=200; NSAM=5000; seedx=4635209; seedz=3762973;
run;
data a; * generation of a data set with NSAM subsets of NOBS records
in each one;
set para;
retain pl m v seedx seedz nobs nsam;
keep SERIAL j y truemean;
sd=sqrt(V);
truemean=pl*exp(m+v/2); * ====== truemean=mean of y ============;
do SERIAL=1 to nsam; * ======= generation of a large data set =====;
do j=1 to nobs; * ======= generation of a subset ======;
call ranbin(seedx,1,pl,y);
if y=1 then do;
call rannor(seedz,y); y=exp(m+sd*y);
end;
output;
end;
end;
run;

proc means data=a noprint; * ==calculation of means and storing in
data set b;
var y truemean;
output out=b mean(y truemean)=my tm stderr(y)=sey;
by SERIAL;
run;
data c; * ========= summarizing calculation ========;
retain coverage 0 nsam;
file print;
if eof^=1 then set para end=eof;
do while (eofb=0);
set b end=eofb;
coverage=coverage+((my-1.96*sey<=tm)&(my+1.96*sey>tm))/nsam;
if eofb then do;
put 'coverage=' coverage 8.5;
time=time();
put 'END BY PROCESSING:' time=time16.6;
end;
end;
run;


A.1. SAS output
START MACRO PROCESSING: time=12:45:40.340000
coverage=0.92700
END MACRO PROCESSING: time=12:52:42.710000
START BY PROCESSING: time=12:52:42.770000
coverage=0.92700
END BY PROCESSING: time=12:52:47.770000

References

Ambrosius, W. T. and Hui, S. L. (2000) A quality control measure for longitudinal studies with continuous
outcomes. Statist. Med., 19, 1339-1362.
Rahme, E., Joseph, L., Kong, S. X., Watson, D. J. and LeLourier, J. (2000) Gastrointestinal health care resource
use and costs associated with non-steroidal anti-inflammatory drugs versus acetaminophen. Arth. Rheum., 43,
917-924.
SAS Institute (1999) SAS Language Reference, Version 8. Cary: SAS Institute.
