DataStepPDV in Sas

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 22

Understanding Data Step Processing

using PDV
Mohamed Mehatab
Hewlett-Packard Canada
Introduction
Following are the key topics we will cover
 Data Step Processing
 PDV
 Drop/Keep SAS statements
 Drop/Keep dataset options
 Drop/Keep statements VS dataset options
 Where / If differences
 _Null_ SAS statements
 Understanding I/O and Dataset Size
 What is a SAS Data Set?
A table, created in or for SAS, that SAS can recognize
and knows how to process. It is usually created from
datalines in one's code, or as the result of data
extraction/manipulation from either a database, a SAS
dataset, an external raw file or another program

 What is a SAS Data Step?


A programming step used in SAS to perform data
manipulation activities and as a result creates a SAS
dataset
Data Step Processing
 Datastep processing consists of 2 phases.
o Compilation Phase
o Execution Phase
Compilation Phase
 During this phase, each of the statements within
the data step are scanned for syntax errors.
 Descriptor portion of SAS dataset gets created at
the end of compilation phase
 Following are the other 2 objects which get
created at the end of compilation phase
o Input Buffer
o PDV
What is the PDV?
 The Program Data Vector is a logical area of memory
that is created during the data step processing.
 SAS builds a SAS dataset by reading one observation at
a time into the PDV and, unless given code to do
otherwise, writes the observation to a target dataset.
 The program data vector contains two types of variables.
o Permanent (data set and computed variables)
o Temporary (automatic and option defined)
Automatic (_N_ and _ERROR_)
Option defined (e.g., first.by-variable, last.by-
variable, in=variable, end=variable)
Execution Phase
 During execution phase, a dataset's data
portion is created.
Compile Program Compilation Phase

Execution Phase
Initialize Variables to
Missing

Yes
Execute INPUT End of
Next Step
Statement File?
Execute Other
Statements No

Output to SAS Data


Set
Data Flow Process through PDV
Input Buffer

eno ename Dept _N_ _ERROR_

PDV
10 scott Marketing 1 0

eno ename Dept


10 Scott Marketing
20 John Finance
SAS Dataset
30 Sam IT
40 David IT
50 Jordan Sales
Sample Program
Using different SAS statements to perform following
actions
 Selecting Variables
 Drop / Keep
 Selecting Observations
 Where / If
Drop/Keep Statement
 Drop Statement
 It Indicates which variables have to be dropped
from output datasets
 It applies to all the output datasets in a datastep
 All variables listed on the DROP statement are on
the program data vector and available for use in the
current data step until the observation is output.
 Keep Statement
 It Indicates which variables have to be dropped
from output datasets
 Variables not listed in the KEEP statement are on
the program data vector and available for use in the
current data step until the observation is output.
Drop/Keep Data set options
 Drop Data set Option
 The DROP= data set option used on an input data set
specifies the variables not to be read from the data set to
the program data vector (on the way in)
 The DROP= data set option used with an output data set
it lists the variables on the program data vector that are
not to be written to the output data set (on the way out).
 Keep Data set Option
 The KEEP= data set option used on an input data set
lists those variables to be read from the data set to the
program data vector (on the way in).
 The KEEP= data set options used with an output data set
it specifies variables to be written from the program data
vector to the output data set (on the way out).
Drop/Keep Statements
Vs Drop/Keep Dataset Options
 Drop/Keep statements only affect the output datasets
but Drop/Keep data set options affect output as well
as input datasets
 Drop/Keep statements applies to all the output
datasets but Drop/Keep data set options can be used
to drop/ keep only on the specific output datasets
 Drop/Keep statements or Drop/Keep options, when
used together, drop will take effect first
Where/IF Statement Differences
Where IF
Where statement selects the observations If statement selects the observations from
before they are read into PDV the PDV
When the WHERE statement is used with When a subsetting IF is used with a BY
a BY statement, the WHERE statement is statement, the BY statement is processed
executed first. first.

The WHERE statement selects IF statement can select observations based


observations in SAS data sets only so the on the variables created in a datastep
variables created in a datastep cannot be
used with Where
Where statement cannot be used to select If statement can be used to select records
records from external file that contains from external file through PDV
raw data
Where statement improves the efficiency
of SAS programs because SAS is not
required to read all the observations from
i/p dataset
Accessing PDV variables in a where
statement

Subsetting raw data using where statement


_NULL_ SAS Statement
 This SAS statement is used to compile and execute
data step without creating an output dataset.
 This statement is used for debugging datasteps.
 It is used to write data into external files from a SAS
dataset
 It is also used to create macro variables containing
values from the dataset variables using CALL
SYMPUT statement
 Put statement along with _null_ is used to debug any
programming / data errors
Understanding I/O
 I/O is a measurement of the read and write operations
that are performed as data and programs are copied
from a storage device to memory (input) or from
memory to a storage or display device (output).
Understanding I/O
 The amount of data that can be transferred to one
buffer in a single I/O operation is referred to as Page
Size. Commonly referred a Buffer Size.(BUFSIZE)
 Larger page size can reduce execution time by
reducing the number of times SAS has to read from or
write to the storage medium. However, the
improvement in execution time comes at the cost of
increased memory consumption
Dataset Size
proc contents data=sashelp.cars;
run;

Dataset Size(bytes) = pagesize*number


of pages
References
 The Use and Abuse of the Program Data Vector
Jim Johnson, Ephicacy Corporation, North Wales,
PA, USA
http://support.sas.com/resources/papers/proceedings12/255-2012.pdf
Thank You

You might also like