
LECTURE NOTES OF AN INTRODUCTION TO SAS

ARBA MINCH UNIVERSITY

COLLEGE OF NATURAL SCIENCES

DEPARTMENT OF STATISTICS

STATISTICAL COMPUTING II (STAT 2082)

2018/19
AMU, college of natural sciences, Department of Statistics: Lecture Notes on Introduction to SAS

1. INTRODUCTION
1.1. Meaning and functions of SAS
What is SAS?

The Statistical Analysis System (SAS) is software with an integrated set of programs for
manipulating, analyzing, and presenting data. At the heart of SAS is a programming language
composed of statements that specify how data are to be processed and analyzed. The statements
correspond to operations to be performed on the data or instructions about the analysis. A SAS
program consists of a sequence of SAS statements grouped together into blocks, referred to as
“steps.” There are two basic kinds of step: data steps and procedure (proc) steps. A data
step is used to prepare data for analysis. It creates a SAS data set and may reorganize the data
and modify it in the process. A proc step is used to perform a particular type of analysis, or
statistical test, on the data in a SAS data set. To program effectively using SAS, you need to
understand basic concepts about SAS programs and the SAS files that they process. In
particular, you need to be familiar with SAS data sets.

1.2. Starting and Ending SAS

STARTING SAS: Double-click on the SAS icon on the desktop or use the start menu
and find SAS 9.2 (English) as shown below:

Quitting SAS
Statistical Computing-II Stat 2082

• Click on Exit in the File menu as shown in the figure below

• Click on the × button in the upper right corner

• Use Alt + F4
You will see the following warning upon exiting:

1.3. The SAS Environment


When SAS is started, there are five main windows open, namely the Editor, Log, Output,
Results, and Explorer windows. In Figure 1, the Editor, Log, and Explorer windows are visible.
The Results window is hidden behind the Explorer window and the Output window is hidden
behind the Program Editor and Log windows.

At the top, below the SAS title bar, is the menu bar. On the line below that is the tool bar with
the command bar at its left end. The tool bar consists of buttons that perform frequently used
commands. The command bar allows one to type in less frequently used commands. At the
bottom, the status line comprises a message area with the current directory and editor cursor
position at the right. Double-clicking on the current directory allows it to be changed.
Briefly, windows in SAS and the purpose of the main windows are as follows.
i. Editor: The Editor window is for typing in, editing, and running programs. When a SAS
program is run, two types of output are generated: the log and the procedure output, and
these are displayed in the Log and Output windows.
ii. Log: The Log window shows the SAS statements that have been submitted together with
information about the execution of the program, including warning and error messages.
iii. Output: The Output window shows the printed results of any procedures. It is here that
the results of any statistical analyses are shown.
iv. Results: The Results window is effectively a graphical index to the Output window
useful for navigating around large amounts of procedure output. Right-clicking on a
procedure, or section of output, allows that portion of the output to be viewed, printed,
deleted, or saved to file.
v. Explorer: The Explorer window allows the contents of SAS data sets and libraries to be
examined interactively, by double-clicking on them.

Figure 1: Windows in SAS


1.3. The SAS Language/Program


Many software applications are either menu driven, or command driven. SAS is neither. With
SAS, you use statements to write a series of instructions called a SAS program. The program
communicates what you want to do and is written using the SAS language. Hence, learning to
use the SAS language is largely a question of learning the statements that are needed to do the
analysis required and of knowing how to structure them into steps. A SAS program is a sequence of
statements executed in order; each statement gives information or instructions to SAS and
must be appropriately placed in the program. To execute or run a SAS program, highlight or
select the statements you want to execute in the PROGRAM EDITOR, then either:
– click on Submit in the Run menu or

– click on the Submit icon in the Tool Bar


1.4. Rules, Comments and Errors in SAS Statements
There are a few general principles that are useful to know when writing a SAS program. Most SAS
statements begin with a keyword that identifies the type of statement. The enhanced editor
recognizes keywords as they are typed and changes their color to blue. If a word remains red,
this indicates a problem. The word may have been mistyped or is invalid for some other reason.
Rules for SAS Statements
SAS programs consist of SAS statements. A SAS statement has the following important
characteristics:
• It usually begins with a SAS keyword; for example, a DATA statement begins with the keyword DATA
and a PROC statement begins with the keyword PROC.
• Every SAS statement ends with a semicolon (;).
• SAS statements can be in upper- or lowercase.
• Each step in a SAS program should end with a RUN statement.
• Statements can extend over more than one line and there may be more than one statement
per line. However, keeping to one statement per line, as far as possible, helps to avoid errors
and to identify those that do occur.


Comments in SAS
Comments, with any number or type of characters, can be inserted anywhere in your program;
they are ignored by SAS during processing. There are two ways of writing comments in
SAS: a comment can start with an asterisk (*) and end with a semicolon (;), or start with a
slash asterisk (/*) and end with an asterisk slash (*/).
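The two comment styles can be sketched as follows. This is only an illustration; the data set name demo and its values are made up and do not come from the notes.

```sas
* Statement-style comment: starts with an asterisk and ends with a semicolon;

/* Block-style comment: starts with slash asterisk and ends with asterisk slash.
   It can span several lines and may contain semicolons; like this one. */

DATA demo;              /* demo is a hypothetical data set name */
   INPUT x y;           * read two numeric variables;
   DATALINES;
1 2
3 4
;
RUN;
```

The block style is usually the safer choice for commenting out several statements at once, because semicolons inside it do not end the comment.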
Errors in SAS statement
Each time a step is executed, SAS generates a log of the processing activities and the results of
the processing. The SAS log collects messages about the processing of SAS programs and about
any errors that occur. Hence, when your program fails to execute properly (e.g., no output is
displayed), check the LOG window as SAS will give an error message in the LOG window;
typically, the error message is specific enough for you to figure out and fix the error. It is good
practice to always check the LOG window for any errors; SAS sometimes executes program
statements even after an error is encountered (unless you check the LOG window, you will have
the impression that everything is fine).
Some Tips for Preventing and Correcting Errors
Before submitting a program:
1. Check that each statement ends with a semicolon.
2. Check that all opening and closing quotes match.
3. Check any statement that does not begin with a keyword (blue, or navy blue) or a variable
name (black).
1.5. Basic Building Blocks in writing SAS program
As discussed in Section 1.1 above, a SAS program consists of a sequence of SAS statements
grouped together into blocks, referred to as “steps.” There are two basic kinds of step: data
steps and proc steps. A detailed description and the general syntax of each are given below:
i. DATA step
A DATA step begins with a DATA statement and is used to create, read, or modify a SAS data set,
which is then passed to a PROC step for processing. Before data can be
analyzed in SAS, they need to be read into a SAS data set. Creating a SAS data set for
subsequent analysis is the primary function of the data step. The data can be “raw” data or come
from a previously created SAS data set. A data step is also used to manipulate or reorganize the
data. This can range from relatively simple operations (e.g., transforming variables) to more
complex restructuring of the data. In many practical situations, organizing and preprocessing the
data takes up a large portion of the overall time and effort. The power and flexibility of SAS for
such data manipulation is one of its great strengths. We begin by describing how to create SAS
data sets from raw data as an example of a data step.
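General structure/syntax of a DATA step:
DATA dataname ;
INFILE file ; (only when the data are read from an external file)
INPUT variable names ;
DATALINES ; (or CARDS ; only when the data are typed into the program)
... ... ...
... ... ...
... ... ...
;
RUN ;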

It should be noted that a DATA step uses up to four different SAS statements, namely DATA,
INFILE, INPUT, and CARDS (or DATALINES). Each of these statements is described as
follows:
DATA Statement
The DATA statement is usually the first statement in a SAS job. It begins with the word
DATA and is followed by a name that you choose for the data set. Data set names must begin
with a letter or an underscore and can be up to 32 characters in length (see Section 1.6).
INPUT Statement
• Each line of data in a SAS program can be an observation.
• Each value in this observation represents a variable, and the INPUT statement is used to
name these variables.
• The INPUT statement follows the DATA statement.
As long as one blank space is inserted between each variable value in an observation, SAS
reads them as separate values.
CARDS or DATALINES Statement
• When data is entered as an internal part of a SAS program, the CARDS statement
immediately precedes the data lines.
• It is simply entered as CARDS and tells the SAS system that the data follows.
• Note that when a CARDS statement is used, the line length cannot exceed 80 characters.


INFILE Statement
• Data may also be imported from a disk or tape into your SAS program.
• In this case, both the computer operating system and SAS must know where the data can be
found.
• The INFILE statement goes before the INPUT statement.
• It consists of INFILE followed by the file reference name and this identifies the name of the
file to be used. For example, if you are using a file called STUDENTS, the statement would
be: INFILE STUDENTS;
Note that: The CARDS and INFILE statements are not used together and when using the
INFILE statement, the computer's operating system must be told where the data can be
located.
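One common way to tell SAS where the file is located is a FILENAME statement, which associates a file reference name (fileref) with a physical file. The sketch below assumes a hypothetical path, data set name, and file layout; none of these come from the notes.

```sas
/* Associate the fileref STUDENTS with a physical file.
   The path is a hypothetical example; replace it with the real location. */
FILENAME students 'C:\mydata\students.txt';

DATA class;                 /* class is a hypothetical data set name */
   INFILE students;         /* fileref defined by the FILENAME statement above */
   INPUT name $ age;        /* assumed layout: a name followed by an age */
RUN;
```

Alternatively, the quoted path can be written directly in the INFILE statement, as in the examples of Section 1.8.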
Example 1: Create a SAS data set from the weights of a random sample of five patients, measured
before and after medication at a given hospital, as shown in the table below:
Ptnt 1 2 3 4 5
Wgt1 81 71 65 66 59
Wgt2 85 76 79 67 62
/* SAS program */
DATA weight;
INPUT ptnt wgt1 wgt2;
DATALINES;
1 81 85
2 71 76
3 65 79
4 66 67
5 59 62
;
RUN;
This DATA step creates a data set called “weight” (weight.sas7bdat).
• The keyword INPUT gives the names of the 3 variables in the data set.
• The keyword DATALINES (or CARDS) indicates the start of the data values. There are 5 data
lines; thus, there will be 5 observations in this data set.
Note: There are NO semicolons at the end of each line of the data values, but there is a single
semicolon after all the data lines.
• The keyword RUN tells SAS to execute the block of statements.

ii. PROC step


A PROC step begins with a PROC statement and performs a specific task, function, or analysis
on a previously created SAS data set.
General Structure of a PROC Step is:
PROC name data=name of SAS dataset;
... ; (additional statements specific
... ; to the particular procedure)
RUN ;
• The name in the PROC statement is the name of the procedure you want to execute.
• DATA=name of SAS dataset ensures that the procedure uses the SAS data set specified
in the PROC statement. If it is not specified, the procedure will, by default, be executed on
the most recently created SAS data set.
Example: Print the SAS data set named weight created above using the following PROC PRINT step.
PROC PRINT DATA=weight;
TITLE 'Weight Data';
RUN;
• This PRINT procedure displays the contents of the SAS data set “weight” in the OUTPUT
window.
• The option DATA=weight in the PROC statement identifies the SAS data set to be printed.
• The TITLE statement adds a title to the printout. (When no TITLE statement is used, SAS
puts the title “The SAS System” at the top of each page.)
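The same general structure applies to other procedures. As a minimal sketch, the program below recreates the weight data set from Example 1 and summarizes it with PROC MEANS, a standard SAS procedure that is not otherwise discussed in this section; the title text is an assumption chosen for illustration.

```sas
/* Recreate the data set from Example 1 so the sketch is self-contained */
DATA weight;
   INPUT ptnt wgt1 wgt2;
   DATALINES;
1 81 85
2 71 76
3 65 79
4 66 67
5 59 62
;
RUN;

/* Descriptive statistics for the two weight variables */
PROC MEANS DATA=weight;
   VAR wgt1 wgt2;                        /* variables to summarize */
   TITLE 'Summary of patient weights';   /* hypothetical title */
RUN;
```

The VAR statement is an example of the "additional statements specific to the particular procedure" in the general structure.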
1.6. Variable Names and Data Set Names
In writing a SAS program, names must be given to variables and data sets. These can contain
letters, numbers, and underscore characters, and can be up to 32 characters in length, but cannot
begin with a number. SAS data sets are basically tables, with observations contained in the rows
and variables contained in the columns. A SAS data set has the file extension .sas7bdat. Variable
names can be in upper or lower case, or a mixture, but changes in case are ignored. Thus Height,
height, and HEIGHT would all refer to the same variable. When a list of variable names is
needed in a SAS program, an abbreviated form can often be used. A variable list of the form
sex--weight refers to the variables sex and weight and all the variables positioned between them in
the data set. A second form of variable list can be used where a set of variables have names of
the form score1, score2, … score10. That is, there are ten variables with the root score in
common and ending in the digits 1 to 10. In this case, they can be referred to by the variable list
score1-score10 and do not need to be contiguous in the data set. In general, SAS can handle
up to 32,767 variables in a single data set; the number of observations is limited only
by your computer’s capacity.
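Both kinds of variable list can be tried in a short sketch. The data set survey, its variables, and its values are made up for illustration.

```sas
DATA survey;                                   /* hypothetical data set */
   INPUT id sex $ age weight score1-score3;    /* numbered list in INPUT */
   DATALINES;
1 M 23 70 5 6 7
2 F 21 58 8 7 9
;
RUN;

/* Positional list: sex, weight, and every variable stored between them */
PROC PRINT DATA=survey;
   VAR sex--weight;
RUN;

/* Numbered list: score1, score2, score3 */
PROC PRINT DATA=survey;
   VAR score1-score3;
RUN;
```

Here sex--weight selects sex, age, and weight because that is the order in which the variables were created in the data set.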
Rules for SAS Variable Names
• Names can be 32 characters or fewer in length.
• Names must start with an English letter (A to Z) or an underscore (_). Subsequent characters
can be letters, numeric digits (0 to 9), or underscores.
• Names can contain upper- and/or lowercase letters.
• Names cannot contain blanks and other special characters such as %, $, !, #, and @.
• Certain names are reserved for use by SAS, e.g. _N_, _TYPE_ and _NAME_. Similarly,
logical operators such as gt, lt, and, eq should not be used as variable names.
Examples of illegal names:
1000seedwt Does not begin with a letter or underscore
Bodyfat% Contains an illegal character
Weight of cow Contains blank space
Weight_of_cow_before_in_May_1990 Too long


1.7. Data Types and Missing Data in SAS
Data types can be:
• Numeric data type
– Numbers and can have any number of decimal places, algebraic signs, or E notation
– can be subjected to mathematical operations (e.g., can be added/subtracted)
• Character data type
– can contain numbers, letters and/or special characters
– can be up to 32,767 characters long
If a variable contains letters and special characters, it must be character data. However, if it
contains only numbers, then it may be numeric or character. By default, SAS assumes that a
variable is numeric. To tell SAS that a variable is character, add a dollar sign ($) after the name
of the variable in the input statement.


Missing Data
Some entries or fields in your data set may be missing. In a SAS data set, the representation of a
missing entry depends on the type of variable: a missing numeric value is
represented by a single period (.), whereas a missing character value is represented by a blank.
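A minimal sketch of how missing values are entered and stored is given below. The data set name and values are made up. Note that when values are separated by spaces (list input), a missing character value is also typed as a period in the raw data so that SAS does not skip to the next value; it is then stored as a blank.

```sas
DATA missdemo;              /* hypothetical data set */
   INPUT name $ age;
   DATALINES;
Abebe 25
Sara .
. 30
;
RUN;

/* In the printout, the missing age appears as a period
   and the missing name appears as a blank */
PROC PRINT DATA=missdemo;
RUN;
```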
1.8. Reading Data into SAS
Data can be brought into SAS in three ways: by typing the data into the program (not very efficient),
by reading an external data file with the SAS statement infile 'file location'; or by writing a
data-generating program (useful for simulations).
a) Reading Raw Data into SAS
Data can be typed directly in the PROGRAM EDITOR and read into SAS using the
DATALINES or CARDS statement in a DATA step. In this case, actual values can be copied
and pasted from elsewhere (e.g., a website, a text file, a Word document, etc.). This is useful only
for small to medium-sized data sets and can get messy when working with large data sets. There
are a variety of different styles of INPUT statement that can be used to read raw data. The
following are common INPUT specifications.
List Input specification
With list input, the variables are simply listed after the INPUT keyword in the order in which they
appear in the file, and the data are read in that order. It is used when raw data values are separated by spaces.
All missing data must be indicated by a period. If a variable is character, place a $ after the
variable name. For instance, a list input specification looks like: INPUT Name $ City $ Age
Height Weight Sex $; This is the most common way of creating SAS data sets.
Example 1: creating a SAS data set using list input
DATA weight;
INPUT ptnt name $ wgt1 wgt2;
DATALINES;
1 Shaw 101 95
2 Serrano 91 96
3 Nance 95 89
4 Sinha 86 87
5 Henderson 89 82
;
RUN;


Column Input specification


With column input, the columns occupied by each variable are given in the INPUT statement. It is used when the
raw data file does not have delimiters between values (as in many large data sets). In the INPUT statement, for
numeric variables, list the variable name and then the column or range of columns where the variable is
found in the raw data file; for character variables, list the variable name, a dollar sign, and then the
column or range of columns. For instance, a column input specification has the form:
INPUT Name $ 1-10 Age 26-28 Sex $ 35; The advantages of column input are the
ability to skip unwanted variables and the fact that missing values can be left blank. Note: By default, SAS
assumes that a variable is numeric. To declare a variable as a character type, add a dollar sign
($) after the name of the variable in the INPUT statement.
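Column input can be sketched with a small in-stream example. The data set name, column positions, and values below are assumptions made for illustration: name occupies columns 1-10, age columns 12-13, and sex column 15. The data lines must be typed exactly in these columns.

```sas
DATA patients;                          /* hypothetical data set */
   INPUT name $ 1-10 age 12-13 sex $ 15;
   DATALINES;
Shaw       34 M
Serrano    41 F
Nance         M
;
RUN;

PROC PRINT DATA=patients;
RUN;
```

In the third data line, columns 12-13 are left blank, so age is read as missing for that observation without a period being typed.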
b) Importing External Raw Data into SAS
SAS can also import data stored in external files, such as text files or Excel data saved as csv,
using the INFILE statement in a DATA step. The INFILE statement must come after
the DATA statement and before the INPUT statement. The path and filename, enclosed in
quotes, follow the INFILE keyword.
a) Creating SAS files from importing text files using INFILE statement
Example 1: Write and save the following data into a text file (weight.txt) in Notepad. Read in
these data using the INFILE statement to create a SAS data set called “weight”, with the
following 4 variables: ptnt, name, wgt1 wgt2. Also, print out the resulting SAS data set.
1 Shaw 101 95
2 Serrano 91 96
3 Nance 95 89
4 Sinha 86 87
5 Henderson 89 82
DATA weight;
INFILE 'D:\all in ones\Computing-II\weight.txt';
INPUT ptnt name $ wgt1 wgt2;
RUN;
Proc print data=weight; run;
Note: To ensure that your raw data have been read correctly into SAS, check the resulting SAS
data set. You can do this in the following ways:
• Print out the data set using PROC PRINT.
• Check the LOG window to see if the data set was created correctly; it should contain a note such as:
NOTE: The data set WORK.WEIGHT has 5 observations and 4 variables.


Example 2: Write and save the following data into a text file (wgtclub1.txt) in Notepad. Read in
these data using the INFILE statement to create a SAS data set called “wghtclub”, with the
following 4 variables: idno, team, startweight, weightnow. Also, print out the resulting SAS data
set.
1023 red 189 165
1049 yellow 145 124
1219 red 210 192
1246 yellow 194 177
1078 red 127 118
1221 yellow 220 .
1095 blue 135 127
1157 green 155 141
data wghtclub;
infile 'D:\all in ones\Computing-II\wgtclub1.txt';
input idno team $ startweight weightnow;
run;
proc print data=wghtclub; run;
b) Creating SAS files by importing Excel data saved as a csv file
Example 1: The following data refers to the seasonal wheat yield per acre at eight different
locations, all having roughly the same quality soil. The data relate the wheat yield at each
location to the seasonal amount of rainfall and the amount of fertilizer used per acre.
RF(inches) 15.4 18.2 17.6 18.4 24 25.2 30.3 31
Fertilizer amt(pound/acre) 100 85 95 140 150 100 120 80
Wheat yield 46.6 45.7 50.4 66.5 82.1 63.7 75.8 58.9
• Write and save the above data as csv file (wheat.csv) in excel. Read in these data using the
INFILE statement to create a SAS data set called “wheat”, with the following three variables:
RF, Amtfert, wheatyield. Also, print out the resulting SAS data set.
/*SAS Program to import csv file after entering in excel and saving as csv file*/
DATA wheat;
INFILE "D:\all in ones\Computing-II\practice SAS\wheat.csv" dlm=',';
INPUT RF Amtfert wheatyield;
RUN;
Proc print data=wheat; run;
❖ To read data from a SAS data set, rather than from a raw data file, the set statement is used in
place of the infile and input statements. The data step
data wgtclub2;
set wghtclub;
run;
creates a new SAS data set wgtclub2, reading in the data from wghtclub. It is also possible for
the new data set to have the same name as the old one, for example, if the data statement above were replaced
with data wghtclub; this would normally be used in a data step that also modified the data in
some way.
To open the SAS data set itself, follow these steps:
1. Go to the EXPLORER window and double-click on Libraries.
2. Double-click on the library in which the file was created (this is either the specific libref you
assigned for the session or WORK).
3. Once you are within the pertinent library, double-click on the SAS data set you want to
check. This will open the SAS data set in a spreadsheet-type window.
1.9. SAS Files and Data Libraries
Every SAS file is stored in a SAS library, which is a collection of SAS files. A SAS data library
is the highest level of organization for information within SAS. SAS libraries have different
implementations depending on your operating environment, but a library usually corresponds to
the level of organization that your host operating system uses to access and store files.
Depending on the library name that you use when you create a file, you can store SAS files
temporarily or permanently. When you create a SAS data set, it is stored temporarily in a SAS
Data Library. You can store it permanently by assigning a particular location on your computer
to a SAS Data Library.
Storing Temporary SAS Files
If you don't specify a library name when you create a file (or if you specify the library name
Work), the file is stored in the temporary SAS data library. When you end the session, the
temporary library and all of its files are deleted. In general, temporary SAS Files:
– exist only during the current SAS session
– are stored in a special SAS library called WORK
– are automatically erased when you exit SAS
– have one-level names. This is the default case.
Storing files permanently:


To store files permanently in a SAS data library, you specify a library name other than the
default library name Work. To reference a permanent SAS data set in your SAS programs, you
use a two-level name: libref.filename. In the two-level name, libref is the name of the SAS data
library that contains the file, and filename is the name of the file itself. A period separates the
libref and filename.

Figure 3: A SAS file with a two-level name of the form libref.filename
For example, in our sample program, Clinic.Admit is the two-level name for the SAS data set
Admit, which is stored in the library named Clinic. When the sample program creates Clinic.Admit2, it
stores the new Admit2 data set permanently in the SAS library Clinic. Hence, all SAS data sets
have a two-level name of the form libref.filename. Level 1 is the libref, which points
to a particular location, and level 2 is the actual filename, as shown below:

Note that: if the libref is not explicitly stated, by default a temporary library (libref WORK)
will be used. SAS data sets in the WORK library will NOT be permanently saved and will be
erased at the end of the current session.
Example 2: Creating temporary SAS data sets. The following DATA statements are equivalent:
DATA distance;
DATA WORK.distance;
• Although the first version does not use a two-level name, SAS automatically assigns the
two-level name WORK.distance.
• If you want this data set to be stored permanently in the folder c:\student, then use:
LIBNAME cls 'c:\student';
DATA cls.distance;
Example 3: Creating permanent SAS data sets
libname db "D:\all in ones\Computing-II";
data db.wghtclub;
set wghtclub;
run;
In the sample SAS code, the libname statement assigns the libref db to the directory
'D:\all in ones\Computing-II'. Thereafter, a SAS data set name prefixed with 'db.' refers to a data
set stored in that directory. When used in a data statement, the effect is to create a SAS data set
in that directory. The data step reads data from the temporary SAS data set wghtclub and stores it
in a permanent data set of the same name.
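In a later SAS session, the permanent data set can be used again once the libref has been reassigned. The sketch below assumes that db.wghtclub was created as in Example 3 and that the directory still exists; PROC CONTENTS is a standard procedure that lists the variables and attributes of a data set.

```sas
/* Reassign the libref; librefs do not persist between sessions */
LIBNAME db "D:\all in ones\Computing-II";

/* Describe the stored data set (variable names, types, lengths) */
PROC CONTENTS DATA=db.wghtclub;
RUN;

/* Print the stored observations */
PROC PRINT DATA=db.wghtclub;
RUN;
```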
2. DATA MANIPULATION USING SAS
Data manipulation involves creating, formatting and retrieving data sets by SAS. These can be
accomplished through data entered internally or externally. Internal data are lines embedded in
the SAS program, while external data are contained in a separate file. SAS can subset, split,
merge, concatenate, transpose, and aggregate data sets into formats appropriate for subsequent
analysis. Hence, an existing SAS data set can be modified by, for instance, creating new
variables from the original variables, removing or renaming some of the variables, or deleting some
observations from the data set. We first create the original SAS data set, and then create a
new one from it with the necessary modifications using the SET statement:
SET name1 name2 . . . ;
The SET statement tells SAS to read from the enumerated SAS data files name1 name2 . . . and
uses them to build a new SAS data set. Some possible actions for data modification and their
corresponding SAS statements include:
– keep a selection of variables:
KEEP var1 var2 ... ;
– delete a selection of variables:
DROP var1 var2 ... ;
– rename a variable:
RENAME oldname=newname ;
– select only some observations:
WHERE condition ;
2.1. Creating and Modifying Variables
The assignment statement can be used both to create new variables and to modify existing ones.
The statement weightloss=startweight-weightnow; creates a new variable weightloss and sets its
value to the starting weight minus the current weight, and
startweight=startweight * 0.4536; converts the starting weight from pounds to kilograms.
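Assignment statements are written inside a data step. The sketch below uses three of the weight-club records from Section 1.8 as in-stream data; the names wghtclub2 and startkg are hypothetical, and a new variable is created for the converted weight instead of overwriting startweight.

```sas
DATA wghtclub;
   INPUT idno team $ startweight weightnow;
   DATALINES;
1023 red 189 165
1049 yellow 145 124
1221 yellow 220 .
;
RUN;

DATA wghtclub2;                             /* hypothetical name for the modified copy */
   SET wghtclub;
   weightloss = startweight - weightnow;    /* new variable; missing when weightnow is missing */
   startkg = startweight * 0.4536;          /* hypothetical new variable: pounds to kilograms */
RUN;

PROC PRINT DATA=wghtclub2;
RUN;
```

For the third record, weightnow is missing, so weightloss is also missing and SAS writes a note about missing values to the log.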
2.2. Subsetting Data Sets and Deleting Variables
If analysis of a subset of the data is needed, it is often convenient to create a new data set
containing only the relevant observations. This can be achieved using either the subsetting if
statement or the where statement. The subsetting if statement consists simply of the keyword if
followed by a logical condition. Only observations for which the condition is true are included in
the data set being created.
data men;
set survey;
if sex='M';
run;
The statement where sex='M'; has the same form and could be used to achieve the same effect.
The difference between the subsetting if statement and the where statement will not concern
most users, except that the where statement can also be used with proc steps. In addition,
variables can be removed from the data set being created by using the drop or keep statements.
The drop statement names a list of variables that are to be excluded from the data set, and the
keep statement does the converse, that is, it names a list of variables that are to be the only ones
retained in the data set, all others being excluded. So the statement drop x y z; in a data step
results in a data set that does not contain the variables x, y, and z, whereas keep x y z; results in a
data set that contains only those three variables.
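The following sketch shows the where statement in a proc step and the keep statement in a data step. The survey data set and its values are made up for illustration.

```sas
DATA survey;                 /* hypothetical data set */
   INPUT id sex $ age height;
   DATALINES;
1 M 23 172
2 F 21 160
3 M 30 168
;
RUN;

/* WHERE in a proc step: print only the men without creating a new data set */
PROC PRINT DATA=survey;
   WHERE sex='M';
RUN;

/* Subsetting IF plus KEEP in a data step */
DATA men;
   SET survey;
   IF sex='M';               /* only observations with sex='M' are written */
   KEEP id age;              /* only id and age are stored in men */
RUN;
```

The variable sex can still be used in the subsetting if statement even though it is not kept, because keep only controls which variables are written to the new data set.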

Example 1: illustrates subsetting, deleting, and renaming, as shown in the basket2 data
data basket;
input id $ match gender $ height points;
cards;
001 1 M 72 10
001 2 M 72 12
001 3 M 72 8
002 1 F 68 14
004 2 M 69 6
004 1 M 69 2
003 1 F 74 20
;
proc print;
run;
data basket2;
set basket;
where gender='F'; /* subsetting dataset */
drop height; /* delete height variable */
rename id=employee; /* to rename id */
run;
proc print;
run;
2.3. Merging Files
Data for the same subjects may come in different files, and it is often desirable to combine
them into a single data set for analysis. Data sets can be combined vertically (concatenating)
or horizontally (merging).
To combine data sets vertically, use the SET statement and list the data sets you wish
to combine. The data sets are placed in the new data set in the order in which they appear in the
SET statement. Imagine two SAS data sets. The first contains n1 observations and V1 variables, the
second n2 observations and V2 variables. When you set the two data sets, the new data set will
contain n1 + n2 observations and every variable that appears in either data set (at least
max(V1, V2) variables). Variables that are not in both data sets
receive missing values for observations from the data set where the variable is not present.
Merging two data sets in a DATA step combines the variables and
observations horizontally. In the simplest case (a one-to-one merge), if the first data set has n1
observations and V1 variables and the second data set has n2 observations and V2 variables, the
merged data set will have max(n1, n2) observations. Observations not present in the smaller data set are patched with missing values.
The number of variables in the combined data set depends on whether the two data sets share
some variables. If a variable is present in both data sets, its values are taken from the data set
that appears last in the merge list.
To match-merge two or more files, the files must have a common variable, say an ID number.
The files to be merged must first be sorted by the common variable using
PROC SORT. Then the observations are joined by using the MERGE
statement, together with a BY statement naming the common variable, in a separate DATA step that
creates a new file. The IN= data set option enables you
to track the source file of merged records, i.e., to keep only observations from the desired file(s). Make sure
you have the records of your choice from both files. An example will make this clearer.
Example 1: one-to-one match merging of files in SAS
data baskini;
input id $ gender $ height;
cards;
001 M 72
002 F 68
003 F 74
004 M 69
005 F 67
;
run;
data baskage;
input id $ age;
cards;
001 19
002 17
003 18
004 20
005 17
;
run;
To merge the above two data sets (both are already in ascending order of id):
data newbask;
merge baskini baskage;
by id;
run;
proc print data=newbask;
run;
SAS Output of merged datasets
BASKET Data Set
Obs id gender height age
1 001 M 72 19
2 002 F 68 17
3 003 F 74 18
4 004 M 69 20
5 005 F 67 17
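For contrast with the merge above, vertical concatenation with the SET statement can be sketched as follows. This is a minimal illustration only: the data sets teamA and teamB and their values are hypothetical and do not come from the course data.

```sas
/* Hypothetical data sets, invented for illustration only */
data teamA;
input id $ gender $ height;
datalines;
001 M 72
002 F 68
;
run;

data teamB;
input id $ age;
datalines;
003 18
004 20
;
run;

/* Vertical concatenation: teamA observations first, then teamB */
data stacked;
set teamA teamB;
run;

proc print data=stacked;
run;
```

Under these assumptions the data set stacked should hold 2 + 2 = 4 observations and the four variables id, gender, height and age, with age missing for the teamA rows and gender and height missing for the teamB rows.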
Example 2: demographic details from a questionnaire may need to be combined with the results
of laboratory tests. To deal with this situation, the data are read into separate SAS data sets and
then combined using a merge with a unique subject identifier as a key. Assuming the data have
been read into two data sets, demographics and labtests, and that both data sets contain the
subject identifier idnumber, they can be combined as follows:

proc sort data=demographics;
by idnumber;
run;
proc sort data=labtests;
by idnumber;
run;
data combined;
merge demographics (in=indem) labtests (in=inlab);
by idnumber;
if indem and inlab;
run;
In the above sample SAS program, both data sets are first sorted by the matching variable
idnumber. This variable should be of the same type, numeric or character, and same length in
both data sets. The merge statement in the data step specifies the data sets to be merged. The
option in parentheses after the name creates a temporary variable that indicates whether that data
set provided an observation for the merged data set. The by statement specifies the matching
variable. The subsetting if statement specifies that only observations having both the
demographic data and the lab results should be included in the combined data set. Without this,
the combined data set may contain incomplete observations, that is, those where there are
demographic data but no lab results, or vice versa.
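The IN= variables can also be used to keep the incomplete observations and flag them, rather than drop them. The sketch below is a hedged illustration: the miniature demographics and labtests data sets, the variables age and glucose, and the names alldem and haslab are all hypothetical.

```sas
/* Hypothetical miniature versions of the two data sets */
data demographics;
input idnumber age;
datalines;
1 34
2 51
3 46
;
run;

data labtests;
input idnumber glucose;
datalines;
1 5.4
3 6.1
4 5.0
;
run;

/* Both inputs are already in idnumber order; otherwise PROC SORT is needed */
data alldem;
merge demographics (in=indem) labtests (in=inlab);
by idnumber;
if indem;            /* keep every subject that has demographic data */
haslab = inlab;      /* 1 if a lab record was found, 0 otherwise */
run;

proc print data=alldem;
run;
```

Because IN= variables are temporary and are not written to the output data set, the value of inlab is copied into the ordinary variable haslab before the observation is output.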

2.4. Basic Operators in SAS


Very often data are not in the desired form, or calculations based on the data are needed. SAS
allows you to do this by using program statements. Available program statements include
arithmetic operations, IF statements, comparison operators, and various other mathematical and
statistical functions. The tables below list the arithmetic, comparison, and logical operators
available in SAS.
i. Arithmetic operators
Symbol Definition
+ Addition
- Subtraction
* Multiplication
/ Division
** Exponentiation


ii. Comparison operators


Symbol    Mnemonic Equivalent    Definition
=         eq                     Equal to
^=        ne                     Not equal to
>         gt                     Greater than
<         lt                     Less than
>=        ge                     Greater than or equal to
<=        le                     Less than or equal to
iii. Logical operators
Symbol    Mnemonic Equivalent    Definition
&         and                    True if both sides are true
! or |    or                     True if either side is true
^ or ~    not                    True if the quantity following NOT is false
Note: mnemonic operators are operators written with letters instead of symbols. You can use
either form in SAS. Comparison operators are used to compare two quantities, while logical
operators combine conditions and are frequently used in if-then statements.
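As a small sketch of the note above, comparison and logical operators can be combined in a DATA step. The data set scores, its variables, and the pass mark of 50 are hypothetical and are used only for illustration.

```sas
/* Hypothetical scores, invented for illustration only */
data scores;
input name $ math stat;
datalines;
Abel 72 85
Beza 45 58
Chala 90 40
;
run;

data flags;
set scores;
both_pass = (math >= 50 and stat >= 50);    /* symbols with the mnemonic AND */
any_pass  = (math ge 50 or stat ge 50);     /* mnemonic form of >= with OR */
no_pass   = not (math >= 50 or stat >= 50); /* NOT reverses the condition */
run;

proc print data=flags;
run;
```

Each parenthesized condition evaluates to 1 when it is true and 0 when it is false, so the new variables are 0/1 indicators.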
Example 1: How to create new variables using different operators in SAS
data cars;
input manufact : $10. mileage reliable; /* : $10. keeps names longer than 8 characters */
cards;
CHEVROLET 33 5
HONDA 29 5
TOYOTA 30 5
FORD 27 3
DODGE 34 .
FORD 24 1
CHRYSLER 23 3
BUICK 21 3
PLYMOUTH 24 3
CHEVROLET 25 2
PONTIAC . 1
TOYOTA 24 5
HONDA 26 .
FORD 20 3
;
run;
data compop;
set cars;
w = (reliable = 2);
x = (reliable < mileage);
y = (reliable ne 3);
z1 = (reliable in (1,3,5));
z2 = (manufact in ("FORD", "HONDA"));
run;
SAS Output:


2.5. If-then/else statement in SAS


To execute a statement only for those observations that satisfy a certain condition, use:
if expression then statement1;
else statement2;
SAS executes statement1 if the condition specified in the if clause is satisfied. The optional
else statement gives an alternative action (statement2) if the condition is not satisfied.
Example 1: Suppose we want to create a binary dummy variable RELBIN from the cars data set,
defined as:
RELBIN = 1 if RELIABLE <= 3
RELBIN = 0 otherwise
• The corresponding SAS code would then be:
data cars2;
set cars;
if reliable = . then relbin = .; /* SAS treats a missing value as smaller than any number */
else if reliable <= 3 then relbin=1;
else relbin=0;
run;
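The same if-then/else pattern extends to more than two categories by chaining else if statements. The following is a sketch under assumed cut-offs: the data set mpg, the variable mpgclass, and the thresholds 25 and 30 are hypothetical choices made only for illustration.

```sas
/* Hypothetical mileage values and cut-offs, chosen only for illustration */
data mpg;
input manufact $ mileage;
datalines;
FORD 20
HONDA 29
DODGE 34
PONTIAC .
;
run;

data mpg2;
set mpg;
length mpgclass $ 6;                 /* declare the length so "medium" is not truncated */
if mileage = . then mpgclass = " ";  /* leave missing mileage unclassified */
else if mileage < 25 then mpgclass = "low";
else if mileage < 30 then mpgclass = "medium";
else mpgclass = "high";
run;

proc print data=mpg2;
run;
```

The conditions are checked in order and the first one that is true is applied, so each observation falls into exactly one class. Testing for the missing value first matters because SAS treats a missing numeric value as smaller than any number.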
SAS Output


3. THE PROC STEP IN SAS


Once data have been read into a SAS data set, SAS procedures can be used to analyze the data.
The proc step is a block of statements that specify the data set to be analyzed, the
procedure to be used, and any further details of the analysis. The step begins with a proc
statement and ends with a run statement or when the next data or proc step starts. It is
recommended to include a run statement for every proc step. In general, SAS PROCs
(procedures) are used for many purposes, including carrying out statistical analysis (e.g. PROC
REG, PROC MEANS), displaying information about a SAS data set (e.g. PROC CONTENTS,
PROC PRINT), and creating graphs (e.g. PROC PLOT). A PROC must appear after the data
step that creates the SAS data set used in the procedure. A PROC step begins with the word
PROC followed by the name of the specific procedure (e.g. PROC REG). Some PROCs have
options or statements that allow the user to output information into a SAS data set (e.g.
PROC UNIVARIATE, PROC REG). The default data set used by a PROC is the data set created
most recently by a data step or PROC before the current PROC. To change the data set used by a
PROC, use the DATA= option in the PROC statement.
Here are some common statements in a PROC step.
PROC procname <DATA=filename> <option(s)> ;
<required statements> ;
<optional statements> ;
<BY variable-list> ;
<WHERE condition> ;
RUN ;
• The DATA= option is optional (if omitted, SAS will use the most recently created data set).
• The BY statement, which is generally optional, can be used to perform a separate analysis for
each combination of values of the BY variables. PROC SORT, by contrast, requires a BY
statement.
• The WHERE statement, which is optional, can be used to perform the analysis only for a subset
of the data.
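The general form above can be made concrete with a short sketch. The data set club and its values are hypothetical, invented only to show where the DATA= option and the VAR, BY and WHERE statements go.

```sas
/* Hypothetical weight-loss club data, invented for illustration only */
data club;
input name $ team $ weightloss;
datalines;
Almaz red 3.5
Bekele red -1.0
Chaltu blue 2.0
Dawit blue 4.5
Eden blue 0.5
;
run;

proc sort data=club;
by team;                 /* BY processing below needs sorted data */
run;

proc means data=club mean min max;  /* DATA= names the data set explicitly */
var weightloss;          /* statement specific to the analysis */
by team;                 /* one analysis per team */
where weightloss > 0;    /* only members who lost weight */
run;
```

The WHERE condition is applied before the statistics are computed, so the member with a negative weight loss is excluded from the red team's summary.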


3.1. Basic SAS Statements and Procedures


➢ The var Statement
The var statement specifies the variables that are to be processed by the proc step. For example:
proc print data=wghtclub;
var name team weightloss;
run;
restricts the printout to the three variables mentioned, whereas the default would be to print all
variables.
➢ The where Statement
The where statement selects the observations to be processed. The keyword where is followed by
a logical condition and only those observations for which the condition is true are included in the
analysis.
proc print data=wghtclub;
where weightloss > 0;
run;
➢ The by Statement
The by statement is used to process the data in groups. The observations are grouped according
to the values of the variable named in the by statement and a separate analysis is conducted for
each group. To do this, the data set must first be sorted by the by variable.
proc sort data=wghtclub;
by team;
run;
proc means data=wghtclub;
var weightloss;
by team;
run;
➢ The class Statement
The class statement is used with many procedures to name variables that are to be used as
classification variables, or factors. The variables named can be character or numeric variables
and will typically contain a relatively small range of discrete values. There may be additional
options on the class statement, depending on the procedure.
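A minimal sketch of the class statement, using PROC MEANS as one procedure that supports it. The data set club2 and its values are hypothetical and are listed in unsorted order on purpose.

```sas
/* Hypothetical data, invented for illustration only */
data club2;
input name $ team $ weightloss;
datalines;
Chaltu blue 2.0
Almaz red 3.5
Dawit blue 4.5
Bekele red 1.0
;
run;

/* CLASS does not require the data to be sorted by team */
proc means data=club2;
class team;
var weightloss;
run;
```

Unlike the by statement, which prints a separate table for each group, the class statement here produces a single table with one row per team.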


➢ PROC PRINT: prints the observations in a SAS data set, using all or some variables.
➢ PROC SORT: sorts the observations in a SAS data set by one or more variables.
➢ PROC CONTENTS: provides a description of a SAS data set
❑ PROC PRINT
The print procedure prints the observations in a SAS data set, using all or some variables.
proc print data=... ;
run;
• If you do not want all the variables in the data set to be printed, you can use the var
statement to name the variables to be printed.
❑ Example (basket data)
proc print data=basket;
var match points;
run;
❑ PROC SORT
• The sort procedure allows you to sort the observations in a SAS data set by one or more variables.
proc sort data=... out=...;
by ...;
run;
• Specify the name of the data set to be sorted in the data= option.
• The out= option puts the newly sorted version of the data in a new data set. Without
this option, the original data set is replaced by the sorted version.
• proc sort sorts the observations by the variables listed in the by statement.
❑ Example (basket data)
proc sort data=basket out=basketsort;
by gender;
run;
proc print data=basketsort;
run;


❑ PROC CONTENTS
The contents procedure provides a description of a SAS data set. It gives information on a SAS
data set, including the name of the data set, the number of observations, the names of variables,
the type of each variable (numeric-num or character-char), and any labels or formats that have
been assigned to variables.
proc contents data=...;
run;
• You just need to specify the name of the data set you want to describe.
❑ Example (basket data)
proc contents data=basket;
run;

3.2. Basic Statistical Analysis Using SAS


3.2.1. Descriptive Statistics
Before you begin any analysis, you will want to get a feel for your data. What are the mean and
standard deviation of certain variables? Are the data normally distributed? You can use
summary statistics and visual aids, such as histograms and box plots, to help you see the
distribution of your data. If a variable is categorical, we construct a table using the values of the
variable and recording the frequency (count) and perhaps the relative frequency (proportion) of
each value in the data. It is often recommended to employ SAS descriptive statistics
procedures such as PROC FREQ, MEANS, SUMMARY, and UNIVARIATE to check the validity
of the data and to describe the data. Preliminary data analysis also involves plotting graphs to
observe the characteristics of the data. These can be used to check whether the observations have
been collected carefully, or to display the distribution. Many statistics are best viewed
graphically. The SAS procedures PROC CHART, UNIVARIATE, and PLOT allow us to
draw histograms, box plots, and scatter plots. In summary, producing a statistical summary
involves computing descriptive statistics and graphically displaying the data to view some
of its features. Quantitative variables are summarized by measures of center (the mean, which is
the arithmetic average; the median, which is the positional average that cuts an ordered array of
measurements into two equal groups; and the mode, which is the most frequent observation) and
by the standard deviation, which measures how far observations spread from the central value.
Qualitative variables, on the other hand, are summarized by counts or proportions.
a) FREQ procedure to tabulate data for qualitative variables
The FREQ procedure produces one-way to n-way frequency and contingency (cross tabulation)
tables. In addition to summarizing data in the form of a table, PROC FREQ can compute various
statistics for contingency tables to examine the relationship between two classification
variables. For n-way tables, PROC FREQ provides stratified analysis by computing statistics
across, as well as within, strata. To summarize and display data in a one-way table, use the
TABLES statement in the PROC FREQ step to specify the variables to be included in the frequency
counts. These are typically variables that have a limited number of distinct values. General form
of a PROC FREQ step with a TABLES statement to produce a one-way table is:
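A minimal sketch of this form, assuming the usual PROC FREQ syntax, is given in the comment below and is followed by a runnable instance. The data set survey and its values are hypothetical, invented only for illustration.

```sas
/* General form (template, not runnable as written):
   PROC FREQ DATA=SAS-data-set;
      TABLES variable(s);
   RUN;                                               */

/* Hypothetical data, invented for illustration only */
data survey;
input sex $ religion : $12.;   /* : $12. avoids truncating long values */
datalines;
male Orthodox
female Muslim
female Orthodox
male Protestant
female Orthodox
;
run;

proc freq data=survey;
tables sex religion;   /* one one-way table per variable listed */
run;
```

Each variable listed in the TABLES statement without an asterisk gets its own one-way table of frequencies, percentages, and cumulative values.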

Cross tabular Frequency Reports


Furthermore, a two-way, or cross tabular, frequency report analyzes all possible combinations of
the distinct values of two variables. The asterisk (*) operator in the TABLES statement is used to
cross variables. General form of the FREQ procedure to create a cross tabular report/table is:
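A minimal sketch of the cross tabular form, again assuming the usual PROC FREQ syntax. The data set survey2 and its values are hypothetical.

```sas
/* General form (template, not runnable as written):
   PROC FREQ DATA=SAS-data-set;
      TABLES row-variable*column-variable;
   RUN;                                               */

/* Hypothetical data, invented for illustration only */
data survey2;
input sex $ religion : $12.;
datalines;
male Orthodox
female Muslim
female Orthodox
male Protestant
female Orthodox
male Muslim
;
run;

proc freq data=survey2;
tables sex*religion / norow nocol nopercent;  /* show cell counts only */
run;
```

The options after the slash are optional. Without them, each cell also shows the overall, row, and column percentages.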

In a cross tabular report, the values of the first variable in the TABLES statement form the rows
of the frequency table and the values of the second variable form the columns.
b) PROC FREQ step and contingency table
i. PROC FREQ with raw data to calculate chi-square statistic
Recall the discussion of the PROC FREQ step in the section above. Although this section is
about descriptive statistics, additional features of the PROC FREQ step related to
carrying out the chi-square test are discussed here. For example, suppose that for 100 observations
in a SAS data set one we have a categorical variable sex taking the values 0=male and
1=female and a categorical variable religion taking the values 0=Orthodox, 1=Protestant, and
2=Muslim. Then, the statements
proc freq data=one;
tables sex*religion;
run;
record the counts in the six cells of a table with sex indicating the row and religion indicating
the column. In addition to tabulating the data in a two-way table, the association between the
two variables can be assessed by conducting a chi-square test of association, which is carried out
using the chisq option in the tables statement. Hence, the following SAS statements produce the
results of the chi-square test of association between those two variables.
proc freq data=one;
tables sex*religion/chisq;
run;
Example 1: Consider the following data on the gender and satisfaction of 21 randomly selected
AMU 2nd year statistics department students, where f=female and m=male for gender
and Y=yes and N=no for satisfaction with the department.
Gender       f m m m m f f m f m f m m f m f f m m f f
Satisfaction Y Y Y Y N Y N Y Y Y N N N Y Y Y Y Y Y N Y
Then, a) enter the data in SAS, b) describe the data using appropriate summary statistics, and
c) test whether there is any association between the gender of students and satisfaction with the
department at the 5% level of significance.
/* SAS program for creating SAS data set called g2stat (values taken from the table above) */
Data g2stat;
Input Gender $ Satisfaction $;
Datalines;
f Y
m Y
m Y
m Y
m N
f Y
f N
m Y
f Y
m Y
f N
m N
m N
f Y
m Y
f Y
f Y
m Y
m Y
f N
f Y
;
Run;
/* SAS program for cross tabulate data*/
proc freq data=g2stat;
tables Gender*Satisfaction;
run;
/* SAS program to perform chi-square test of association*/
proc freq data=g2stat;
tables Gender*Satisfaction/chisq expected;
run;
ii. PROC FREQ step with tabulated data to calculate chi-square statistic
If the data come to you already tabulated, then you must use the weight statement in proc freq
together with the chisq option in the tables statement to compute the chi-square statistic to carry
out chi-square test of association as shown in examples below.
Example 2: A random sample of 17096 households from a certain city XYZ was asked for their
opinion on early marriage. The results of the survey by sex group are shown below:
Opinion Men Women
No 5550 8232
Yes 1630 1684
Test the hypothesis that opinion on early marriage is independent of the sex group of the
households at the 5% level.
/* SAS program to conduct chi-square test of association*/
data one;
input Opinion $ gender $ count;
cards;
yes men 1630
yes women 1684
no men 5550
no women 8232
;
run;
proc freq data=one;
weight count;
tables Opinion*gender/chisq;
run;
The program uses the weight statement to supply the counts in each of the cells of the 2×2
table formed by gender and opinion, which are stored in the variable count. The output for this
program is shown below. We see that the chi-square test for this table gives a very small P-value
(below .001), which is less than 0.05, so we reject the null hypothesis of no relationship between
the variables gender and opinion on early marriage.
Table 1: SAS output of chi-square test of association for the data

Example 3: A random sample of statistics department students at AMU was asked for their opinion
on proposed core curriculum changes. The results of the survey are shown below:
Class Favoring Opposing
Freshman 140 100
Junior 80 130
Senior 70 110
Enter the data in SAS and test the hypothesis that the opinion on the changes is independent of
class standing at 5% level.
/* SAS program to enter the tabulated data*/
data retabulate;
input class $ Opinion $ count;
cards;
Freshman Favoring 140
Freshman Opposing 100
Junior Favoring 80
Junior Opposing 130
Senior Favoring 70
Senior Opposing 110
;
Run;
/* SAS program to cross tabulate and perform chi-square test of association*/
proc freq data= retabulate;
weight count;
tables class*Opinion/chisq;
run;
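The chi-square statistic compares each observed count with the count expected under independence, which is (row total × column total) / overall total. For the Freshman and Favoring cell above this is 240 × 290 / 630, or about 110.5. As a sketch, the expected counts and each cell's contribution to the statistic can be requested with additional TABLES options. The data set name retab2 is hypothetical; the counts are copied from the example.

```sas
/* Counts copied from the curriculum example in the text */
data retab2;
input class $ opinion $ count;
datalines;
Freshman Favoring 140
Freshman Opposing 100
Junior Favoring 80
Junior Opposing 130
Senior Favoring 70
Senior Opposing 110
;
run;

proc freq data=retab2;
weight count;
/* EXPECTED prints expected counts, CELLCHI2 each cell's chi-square contribution */
tables class*opinion / chisq expected cellchi2 norow nocol nopercent;
run;
```

Cells with large contributions show where the observed counts depart most from independence.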

c) Calculating summary statistics for quantitative variables


a) PROC MEANS step
It is sometimes preferable to record just a few numbers that summarize key features of the
distribution. One common choice is to compute the mean and standard deviation to
summarize data on quantitative variables. We can use proc means to do this in SAS, since the
MEANS procedure displays simple descriptive statistics for the numeric variables in a SAS data
set. When invoked without any options, PROC MEANS produces the number of non-missing
observations, the sample mean, the sample standard deviation, and the sample minimum and
maximum for all numeric variables in the data set. To restrict the calculation of statistics to
specific variables, use the VAR statement in the PROC MEANS step. General form of a PROC
MEANS step is:
PROC MEANS data=SAS-dataset;
Run;

By default, PROC MEANS analyzes every numeric variable in the SAS data set, prints the
statistics N, MEAN, STD, MIN, and MAX and excludes missing values before calculating
statistics. The VAR statement identifies the analysis variables and the CLASS statement in the
MEANS procedure groups the observations of the SAS data set for analysis.
PROC MEANS data=SAS-dataset;
Var SAS quantitative variables;
Class SAS qualitative variables;
Run;
As mentioned, the default output of a PROC MEANS step shows the number of observations, mean,
standard deviation, minimum value, and maximum value of each variable. We can request other
statistics by listing them as options in the PROC MEANS statement. For example, if we require
the mean, standard deviation, median, coefficient of variation, coefficient of skewness, and
coefficient of kurtosis, then we can write:
PROC MEANS mean std median cv skewness kurtosis data=SAS-dataset;
Var SAS variables;
Run;
To control the maximum number of decimal places for PROC MEANS to use in printing results,
use the MAXDEC= option in the PROC MEANS statement. General form of the PROC MEANS
statement with the MAXDEC= option is:
PROC MEANS data=SAS-dataset MAXDEC=number;
Run;
Example 1: Consider the following data on the height (in cm), weight (in kg) and gender group of
a random sample of 10 patients from the Arba Minch general hospital emergency room.
Height 173 179 167 195 173 184 162 169 164 168
Weight 57 58 62 84 64 74 57 55 56 60
Gender F F F M F M F F M M
a) Create a SAS data set named ptdata from the above variables
Data ptdata; /* SAS program to create temporary SAS dataset in SAS*/
Input Height Weight Gender $;
Datalines;
173 57 F
179 58 F
167 62 F
195 84 M
173 64 F
184 74 M
162 57 F
169 55 F
164 56 M
168 60 M
;
run;
b) Compute descriptive summary statistics for weight and height of patients
/*Summarize the patient data*/
Proc Means data=ptdata;
var Height Weight;
Run;
The MEANS Procedure

Variable    N           Mean        Std Dev        Minimum        Maximum
--------------------------------------------------------------------------
Height     10    173.4000000     10.1017050    162.0000000    195.0000000
Weight     10     62.7000000      9.3220169     55.0000000     84.0000000
--------------------------------------------------------------------------
Interpret the above descriptive statistics of the weights and heights of the patients.
c) Compute the descriptive summary statistics for weight and height of patients by gender
To compute descriptive summary statistics of quantitative variables by a grouping variable
using the by statement, the data must first be sorted by the grouping variable. Alternatively,
use the class statement, which does not require sorted data. Both approaches are shown
below.
Proc Sort data=ptdata; /* sorted data replaces the original data ptdata*/
by Gender;
run;
*Print the Data;
Proc Print data=ptdata;
Run;
The descriptive summary statistics for weight and height of patients by gender will be computed
as:
Proc Means data=ptdata;
by Gender;
Var Height Weight;
run;

/* Alternatively, use the class statement (no sorting required) */
Proc Means data=ptdata;
Class Gender;
Var Height Weight;
run;


Interpret the above descriptive statistics of the weights and heights of the patients by gender group.
b) PROC SUMMARY Steps
PROC SUMMARY can also be used to obtain descriptive statistics for a given data set.
It uses the same statements as PROC MEANS; however, PROC SUMMARY does not produce a
printed report by default. In order to produce the report, you need to add the PRINT option as
shown below. Without the PRINT option, PROC SUMMARY is normally used with an OUTPUT
statement, which saves the summary to a SAS data set. Note also that, unlike PROC MEANS, PROC
SUMMARY reports only the number of observations when no VAR statement is given.
PROC SUMMARY data=SAS-dataset print;
Run;
As in the PROC MEANS step, if we want to obtain the mean, standard deviation, median,
coefficient of variation, coefficient of skewness, and coefficient of kurtosis, we may use the
following PROC SUMMARY step.
PROC SUMMARY mean std median cv skewness kurtosis data=SAS-dataset print;
Var SAS variables;
Run;
The VAR statement identifies the analysis variables, and the CLASS statement can also be
included in PROC SUMMARY to group the observations of the SAS data set for
analysis.
Example 1: Consider the above data on the height, weight and gender group of patients for a
random sample of 10 from Arba Minch general hospital emergency room.
a) Compute descriptive summary statistics for weight and height of patients
/*Summarize the patient data*/
Proc summary data=ptdata print;
var Height Weight;
Run;
The SUMMARY Procedure

Variable    N           Mean        Std Dev        Minimum        Maximum
--------------------------------------------------------------------------
Height     10    173.4000000     10.1017050    162.0000000    195.0000000
Weight     10     62.7000000      9.3220169     55.0000000     84.0000000
--------------------------------------------------------------------------
b) Compute the descriptive summary statistics for weight and height of patients by gender
Proc summary data=ptdata print;
Class Gender;
Var Height Weight;
run;
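PROC SUMMARY is often used to save the statistics in a data set instead of printing them. The sketch below shows one way to do this with an OUTPUT statement. The output data set name ptsummary and the variable names mean_height, mean_weight, sd_height and sd_weight are hypothetical; the patient data are copied from the example.

```sas
/* Patient data copied from the example in the text */
data ptdata;
input Height Weight Gender $;
datalines;
173 57 F
179 58 F
167 62 F
195 84 M
173 64 F
184 74 M
162 57 F
169 55 F
164 56 M
168 60 M
;
run;

proc summary data=ptdata nway;   /* NWAY keeps only the by-gender rows */
class Gender;
var Height Weight;
output out=ptsummary             /* hypothetical output data set name */
       mean=mean_height mean_weight
       std=sd_height sd_weight;
run;

proc print data=ptsummary;
run;
```

The output data set also contains the automatic variables _TYPE_ and _FREQ_, where _FREQ_ is the number of observations in each group.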

c) PROC UNIVARIATE Step


The univariate procedure is used to produce descriptive summary statistics for quantitative
variables. It is similar to proc means, but it allows many more descriptive statistics to be
calculated. Features of proc univariate include detail on the extreme values of a variable,
quantiles, several plots to picture the distribution, and a test of whether the data are normally
distributed. Many statistical procedures assume that the data are normally distributed. In
addition to computing descriptive statistics, a PROC UNIVARIATE step can be used to test the
normality of the data, and the SAS program has the form:
PROC UNIVARIATE data=SAS-dataset normal;
Var sas quantitative variables;
Run;
Furthermore, if different plots are required then one may use:
PROC UNIVARIATE data=SAS-dataset normal plot;
Var sas quantitative variables;
Run;
Here the plot option displays a stem-and-leaf plot, a box plot, and a normal probability plot for
the variables.
Example 1: Consider the above data on the height, weight and gender group of patients for a
random sample of 10 from Arba Minch general hospital emergency room.
a) Compute descriptive summary statistics for weight and height of patients
/*Summarize the patient data*/
Proc univariate data=ptdata;
var Height Weight;
Run;

b) Compute the descriptive summary statistics for weight and height of patients by gender

Proc univariate data=ptdata;
Class Gender;
Var Height Weight;
run;
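PROC UNIVARIATE prints several tables for each variable. When only the normality tests are of interest, the output can be restricted with an ODS SELECT statement. This is a sketch that assumes the ODS table name TestsForNormality used by PROC UNIVARIATE; the patient data are copied from the example.

```sas
/* Patient data copied from the example in the text */
data ptdata;
input Height Weight Gender $;
datalines;
173 57 F
179 58 F
167 62 F
195 84 M
173 64 F
184 74 M
162 57 F
169 55 F
164 56 M
168 60 M
;
run;

ods select TestsForNormality;   /* show only the normality tests table */
proc univariate data=ptdata normal;
var Weight;
run;
```

The ODS SELECT statement applies only to the next procedure step, so later procedures print their full output again.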
3.2.1.3 Summarizing data by graphing in SAS
One of the most informative ways of presenting data is via graphs. There are several methods
for obtaining graphs in SAS.
a) Normal Probability Plot/proc univariate step with plot;
As discussed earlier, a proc univariate step with the plot option produces the usual output from
this procedure together with a stem-and-leaf plot, a box plot, and a normal probability plot for the
data. Some statistical procedures require us to assume that the values of some variables are a
sample from a normal distribution. A normal probability plot is a diagnostic that checks the
reasonableness of this assumption. For instance, to create such a plot for the weight of patients in
the above data set ptdata, we use proc univariate with the plot option as shown:
proc univariate data=ptdata plot;
var weight;
run;
This produces, as part of the output, the normal probability plot given below:

Interpret the normal probability plot of the weight of the patients.


b) Histogram/proc univariate step


Furthermore, a proc univariate step can be used to display data in a histogram as shown.
Proc univariate data=ptdata;
var weight;
histogram/normal;
run;
This produces, as part of the output, a histogram with a fitted normal curve.
c) Bar and Pie Charts/PROC GCHART
Sometimes it is useful to show frequencies in a graphical display. With SAS, you have several
options. First, there is a SAS procedure called GCHART, which is part of the SAS/GRAPH
collection of procedures. PROC GCHART is used to produce vertical or horizontal bar charts
and pie charts of categorical variables. General form of the PROC GCHART step:
PROC GCHART DATA=SAS-data-set;
HBAR chart-variable . . . </options>; /* Produce a horizontal bar chart*/
VBAR chart-variable . . . </options>; /*Produce a vertical bar chart*/
PIE chart-variable . . . </options>; /* Produce a pie chart*/
Run;
Example 1: Consider the above data on the height (in cm), weight (in kg) and gender group of
a random sample of 10 patients from the Arba Minch general hospital emergency room.
Then, construct a pie chart and a simple bar chart for the gender of patients.
PROC GCHART DATA=ptdata;
pie gender;
run;
quit;
PROC GCHART DATA=ptdata;
VBAR gender;
run;
quit;

Interpret the simple bar chart of the gender of the patients.

You can create a similar bar chart or pie chart using PROC CHART. The syntax is almost
identical to that of PROC GCHART: you enter the keyword VBAR, HBAR, or PIE followed by one or
more variables for which you want to create a chart.
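A minimal sketch of the PROC CHART alternative, which draws the charts with text characters in the Output window. The patient data are copied from the example.

```sas
/* Patient data copied from the example in the text */
data ptdata;
input Height Weight Gender $;
datalines;
173 57 F
179 58 F
167 62 F
195 84 M
173 64 F
184 74 M
162 57 F
169 55 F
164 56 M
168 60 M
;
run;

proc chart data=ptdata;
vbar Gender;    /* text-based vertical bar chart of the counts */
pie Gender;     /* text-based pie chart of the counts */
run;
```

PROC CHART is part of Base SAS, so this sketch does not depend on SAS/GRAPH being available.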
d) Box plot/proc boxplot
PROC BOXPLOT creates side-by-side box-and-whiskers plots for a continuous
variable, displayed for each level of a categorical variable. The data set must first be sorted by
the categorical variable. The plot statement first lists the continuous variable you wish to display;
the second variable, after the *, is the categorical variable that forms the X-axis categories. A
box-and-whiskers plot displays the mean, median, quartiles, and minimum and maximum
observations for a group.
Example 1: For the data on the height(in cm), weight(in kg) and gender group of patients for a
random sample of 10 from Arba Minch general hospital emergency room, construct a
comparative box plot using PROC boxplot procedure for a variable called weight by gender from
a SAS data set called ptdata.
Proc sort data=ptdata; by gender; run;
proc boxplot data=ptdata;
plot weight*gender;
title "Side-by-Side Boxplot Using Proc Boxplot";
run;

Interpret the above comparative boxplot of the weight of the patients by gender.

You can create similar side-by-side box plots using PROC UNIVARIATE with the plot option
as shown below:
Proc sort data=ptdata; by gender;run;
proc univariate data=ptdata plot;
by gender;
var weight;
title "Side-by-Side Boxplot Using Proc Univariate";
run;
If the by statement is omitted from the above SAS program, a single box plot is produced as
part of the output in addition to the descriptive statistics.
e) Scatter plot / PROC PLOT and PROC GPLOT
A scatter plot of two quantitative variables is a very useful technique when looking for a
relationship between two variables. By a scatter plot we mean a plot of one variable on the y axis
against the other variable on the x axis. PROC PLOT and PROC GPLOT are procedures that
graph one variable against another, producing a scatter plot.

Example 1: For the data on the height(in cm), weight(in kg) and gender group of patients for a
random sample of 10 from Arba Minch general hospital emergency room, construct a simple
scatter plot between weight and height from a SAS data set called ptdata
symbol1 value=dot; /* plot each point as a filled dot */
proc gplot data=ptdata;
plot weight*height;
run;
quit;
produces the simple scatter plot between weight and height as shown:

Interpret the relationship between weight and height of the patients from the scatter plot.
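The same scatter plot can be sketched with PROC PLOT, which is part of Base SAS and draws the plot with text characters. The patient data are copied from the example; the choice of * as the plotting character is arbitrary.

```sas
/* Patient data copied from the example in the text */
data ptdata;
input Height Weight Gender $;
datalines;
173 57 F
179 58 F
167 62 F
195 84 M
173 64 F
184 74 M
162 57 F
169 55 F
164 56 M
168 60 M
;
run;

proc plot data=ptdata;
plot Weight*Height='*';   /* y*x, each point drawn with the character * */
run;
quit;
```

As in PROC GPLOT, the variable before the asterisk goes on the vertical axis and the variable after it on the horizontal axis.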

3.2.2. Inferential statistics


i. Inference about a single population mean/one sample t-test/ PROC TTEST
PROC TTEST allows you to perform tests and compute confidence limits for one sample, paired
observations or two independent samples.
Example 1: A University bookstore claims that, on average, a University student spends
101.75 birr per semester on photocopying lecture notes. A student group investigates this claim by
randomly selecting ten students and recording their lecture note photocopy expenditure per
semester as given:
Expenditure 140 125 150 124 143 170 125 94 127 53

Create a SAS data set and test whether the data support the null hypothesis that the mean
expenditure is 101.75 birr or the alternative that it is different from 101.75 birr.

data studdata;
input expenditure;
cards;
140
125
150
124
143
170
125
94
127
53
;
Run;
proc ttest data=studdata h0=101.75;
var expenditure;
run;
The TTEST Procedure

Statistics
                    Lower CL           Upper CL  Lower CL            Upper CL
Variable       N        Mean     Mean      Mean   Std Dev  Std Dev    Std Dev  Std Err  Minimum  Maximum
expenditure   10      102.04    125.1    148.16    22.169    32.23     58.839   10.192       53      170

T-Tests
Variable      DF   t Value   Pr > |t|
expenditure    9      2.29     0.0477
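As a check on the output above, the t statistic can be reproduced by hand from the printed mean, standard deviation and sample size. This is only the arithmetic of the standard one-sample t-test, with the hypothesized mean 101.75 taken from the H0= option:

```latex
\[
SE(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{32.23}{\sqrt{10}} \approx 10.192,
\qquad
t = \frac{\bar{x} - \mu_0}{SE(\bar{x})} = \frac{125.1 - 101.75}{10.192} \approx 2.29,
\qquad
df = n - 1 = 9 .
\]
```

The two-sided p-value printed by SAS (0.0477) is the probability that a t variable with 9 degrees of freedom exceeds 2.29 in absolute value.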

Interpret the above SAS output.


d) Inference about a Single Population Proportion/ Proc FREQ
Assume p0 is the historically true proportion of some characteristic of interest. A researcher may
wish to test whether the current unknown proportion, p, differs from p0. A test of proportion
checks the null hypothesis H0: p = p0 against one of the alternatives HA: p > p0, HA: p < p0, or HA: p ≠ p0.
Example 1: A university has 5000 students, and it is claimed that 20% of them are left-handed.
To test this claim, a random sample of 250 students is taken, and 38 of them are found to be
left-handed. Test whether the true population proportion of left-handed students in the university
differs from 20%.
/* SAS program to create SAS dataset from given information*/
data oneprop;
input handtype$ cases;
cards;
left 38
right 212
;
run;
proc print data=oneprop;run;
/* SAS program to compute test statistic*/
proc freq data=oneprop order=data;
tables handtype/binomial(p=0.2);
weight cases;
run;
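The large-sample test requested by the BINOMIAL option can be reproduced by hand. The following is a sketch of the arithmetic, assuming the standard error is computed under H0 (the usual form of the one-sample z-test for a proportion):

```latex
\[
\hat{p} = \frac{38}{250} = 0.152,
\qquad
z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}
  = \frac{0.152 - 0.20}{\sqrt{0.20 \times 0.80 / 250}}
  = \frac{-0.048}{0.0253} \approx -1.90 .
\]
```

For a two-sided alternative this z value corresponds to a p-value of about 0.058, which should be compared with the chosen significance level.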

Interpret the above SAS output.


e) Inference about the difference between two population means/ PROC TTEST
i. In the case of Independent-Samples
The independent-samples t-test compares the means of two groups of cases. SAS computes the
t-test for a variable grouped into two levels using PROC TTEST. The data must be set up
differently from the data for a one-sample or paired t-test: the SAS dataset has two variables.
One variable identifies the group to which each observation belongs and must have exactly two
levels; the second variable contains the measured value for each observation. In
PROC TTEST, the CLASS statement identifies the variable that defines group
membership, and the VAR statement lists the variable(s) for which the t-test is to be computed.
Example 1: The following are the changes in blood pressure of patients given one of two
different drugs and followed for 18 months:
Standard 18.2 20.1 17.6 16.8 18.8 19.7 19.1
New 19.4 22.7 19.1 18.4 25.9 20.4 21.7

Create a SAS dataset and test, at the 5% level of significance, whether the mean change in blood
pressure differs between the two types of drug. In the program below, Drugtype is coded as
1 = Standard and 2 = New.
/* SAS program to create SAS dataset*/
data indettest;
input changeBP Drugtype;
cards;
18.2 1
20.1 1
17.6 1
16.8 1
18.8 1
19.7 1
19.1 1
19.4 2
22.7 2
19.1 2
18.4 2
25.9 2
20.4 2
21.7 2
;
run;
/* SAS program to compute independent sample t test*/
proc ttest data= indettest;
var changeBP;
class Drugtype;
run;
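For reference when reading the output, the pooled (equal-variance) t statistic and the folded F statistic for equality of variances take the standard textbook forms below. The subscripts 1 and 2 denote the two drug groups; the symbols are notation introduced here, not SAS output labels.

```latex
\[
s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2},
\qquad
t = \frac{\bar{y}_1-\bar{y}_2}{s_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}},
\qquad
df = n_1+n_2-2 = 7+7-2 = 12,
\]
\[
F' = \frac{\max(s_1^2,\, s_2^2)}{\min(s_1^2,\, s_2^2)} .
\]
```

PROC TTEST reports both the pooled test and the Satterthwaite (unequal-variance) test. The equality-of-variances test indicates which of the two is more appropriate.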

Interpret the above SAS output with respect to the equality of the means and variances of the
change in blood pressure for the two drugs.
ii. In the case of dependent samples/ PROC TTEST
A paired t-test compares the means of the same group at different times (e.g., before and after an
event); paired samples arise when two measurements are taken on the same experimental unit. In
SAS, this is done with PROC TTEST and a PAIRED statement in which the two variables of each
pair are joined with the '*' symbol.
Example 1: The systolic blood pressure (mm Hg) of 10 women is measured at the beginning of a
clinical trial. They then receive a fertility treatment with hormones, and their blood pressure
is measured again during the treatment. The measurements are entered in the program below.

Create a SAS dataset and test whether the mean systolic blood pressure of the women differs
between the two measurements at the 5% level of significance.
/*SAS program to create SAS data set*/
data pairedttest;
input before during ;
cards;
115 128
112 115
107 106
119 128
115 122
138 145
126 132
105 109
104 102
115 117
;
run;
/* SAS program to compute paired t test statistic*/


proc ttest data= pairedttest;
paired during*before;
run;
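The paired t-test is equivalent to a one-sample t-test on the within-pair differences. A minimal sketch of that equivalence is shown below; the data set name pairedttest_diff and the variable diff are hypothetical names introduced here for illustration.

```sas
/* Compute the within-pair difference for each woman */
data pairedttest_diff;            /* hypothetical data set name */
  set pairedttest;
  diff = during - before;         /* same order as in PAIRED during*before */
run;

/* One-sample t-test of H0: mean difference = 0 */
proc ttest data=pairedttest_diff h0=0;
  var diff;
run;
```

The t value, degrees of freedom and p-value from this program should agree with those from the PAIRED statement.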

Interpret the above SAS output with respect to the equality of the mean systolic blood pressure
of the women between the two measurements.
f) Inference about Two Population Proportions/Proc FREQ
PROC FREQ can be used to test the equality of two population proportions. If we have sample
proportions from two independent random samples, a significance test of H0: p1 = p2 against
HA: p1 < p2, HA: p1 > p2, or HA: p1 ≠ p2 can be carried out using PROC FREQ in SAS.
Example 1: In 2001, a poll of 600 people found that 250 supported early marriage. A
2003 poll of 500 people found 250 in support. Carry out a test of significance to see whether the
difference in proportions is statistically significant.
/* SAS program to create SAS dataset from the given information*/
data twoprop;
input row column count; /* row: 1=2001 poll, 2=2003 poll; column: 1=supported, 2=did not support */
cards;
1 1 250
1 2 350
2 1 250
2 2 250
;
Run;
/* SAS program to perform a test and 95% CI for difference in proportions*/
proc freq data= twoprop;
tables row*column/riskdiff chisq;
weight count;
run;
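The test can also be sketched by hand from the poll figures in the example, using the standard pooled two-proportion z-test. For a 2×2 table the Pearson chi-square equals the square of this z statistic.

```latex
\[
\hat{p}_1 = \frac{250}{600} \approx 0.4167,
\qquad
\hat{p}_2 = \frac{250}{500} = 0.5,
\qquad
\hat{p} = \frac{250+250}{600+500} \approx 0.4545,
\]
\[
z = \frac{\hat{p}_1-\hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{600}+\dfrac{1}{500}\right)}}
  \approx \frac{-0.0833}{0.03015} \approx -2.76,
\qquad
\chi^2 = z^2 \approx 7.64 \ \text{with 1 degree of freedom}.
\]
```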

Interpret the above SAS output with respect to the equality of the proportions of people who
supported early marriage in the two years.
g) Correlation and linear regression analysis using SAS
a) Correlation analysis using SAS / PROC CORR
While a scatter plot is a convenient graphical method for assessing whether there is any
relationship between two variables, we would also like to assess the relationship numerically.
The correlation coefficient provides a numerical summary of the degree to which a
linear relationship exists between two quantitative variables, and it can be calculated using
PROC CORR in SAS. To produce correlations only for specific variables, use the VAR
statement within the PROC CORR step to specify the variables to be analyzed.
Example 1: Consider the following data collected to investigate the relationship between income,
age and hours worked per day for households from a given town XYZ.

Income (Y) Age (x1) Hours worked per day (x2)


2841 29 12
1876 21 8
2934 62 10
1552 18 10
3065 40 11
3670 50 11
2005 65 5
3215 44 8
1930 17 8
2010 70 6
3111 20 9
2882 29 9
1683 15 5
1817 14 7
4066 33 12
Then
a) Create a SAS dataset from the variables given above
b) Compute the correlation between income and hours worked per day
c) Compute the correlation matrix among all the variables given above
Solution
(a) /* SAS program for creating SAS data set*/
data MLRdata;
input Income Age Hourworked;
cards;
2841 29 12
1876 21 8
2934 62 10
1552 18 10
3065 40 11
3670 50 11
2005 65 5
3215 44 8
1930 17 8
2010 70 6
3111 20 9
2882 29 9
1683 15 5
1817 14 7
4066 33 12
;
run;
(b) To compute the correlation between only Income and hours worked per day, use the
VAR statement as follows:
proc corr data=mlrdata;
var Income Hourworked;
run;

Test the significance of the correlation coefficient and interpret it from the above SAS output.
In the above SAS program, if the VAR statement is omitted, correlations are computed
between all numeric variables in the data set. To produce correlations only for
specific variables, list the variables to be analyzed in the VAR statement as above.
(c) /* To compute correlation matrix for the data*/
proc corr data=mlrdata;
run;

Test the significance of the correlation coefficients between income and age and between income
and hours worked per day, and interpret those that are significant from the above SAS output.
b) Linear Regression Analysis using SAS/PROC REG


Regression analysis is a statistical technique used to develop a mathematical
equation showing how variables are related; it is also used to predict the value of a response for
given values of the predictors. Linear regression analysis assesses a linear
relationship between two or more variables when the dependent variable is continuous and
approximately normally distributed; the explanatory variables can be continuous, discrete or
categorical. The model is called logistic regression when the dependent variable is
categorical (discussed later) and Poisson regression when the dependent variable is
a count (not discussed here). Either the GLM procedure or the REG procedure can be
used to perform simple and multiple linear regression in SAS. The difference is that PROC REG
allows only quantitative predictor variables, whereas PROC GLM allows both
quantitative and categorical predictor variables. Categorical variables are called class variables in
SAS, and in PROC GLM they are identified in a CLASS statement so that SAS knows to treat them
appropriately. PROC GLM can also handle a wide variety of other linear
models, such as ANOVA.
Example 1: A group of 13 healthy children and adolescents participated in a
psychological study designed to investigate the relationship between age (in years) and average
total sleep time (ATST, in min/24 hr). To obtain a measure of ATST, the total sleep time in minutes
was recorded for each subject on three consecutive nights and averaged. The results are
displayed in the following table.

ATST(Y) 586 461.75 491.1 565 462 532 477.6 515.2 493 528.3 575.9 532.5 530.5
Age(X) 4.4 14 10.1 6.7 11.5 9.6 12.4 8.9 11.1 7.75 5.5 8.6 7.2

a) Create a SAS data set and fit regression equation that relates average total sleep time and
age of children
/* SAS program to create SAS data set*/
Data SLRM;
Input ATST Age;
Datalines;
586 4.4
461.75 14
491.1 10.1
565 6.7
462 11.5
532 9.6
477.6 12.4
515.2 8.9
493 11.1
528.3 7.75
575.9 5.5
532.5 8.6
530.5 7.2
;
Run;
proc reg data = SLRM;
model ATST = Age;
run;
The program above gives the simple linear regression output.

Test the significance of the model and of the coefficient of age at the 5% level of significance,
and interpret the results.
b) Check the assumptions of normality of the error terms and homoscedasticity by constructing a
normal probability plot of the residuals and a plot of the residuals versus the fitted values, and
interpret them.
We now generate a new dataset called OUTREG1 that contains all of the original variables, plus
the predicted value for each observation (PREDICT) and the residual (RESID), to check the
assumptions of normality of the error terms and homoscedasticity.
/* Check the assumptions of normality and homoscedasticity for the fitted model above */
/* Compute the residuals and predicted values of ATST */
proc reg data = SLRM;

model ATST = Age;


OUTPUT OUT=OUTREG1 P=PREDICT R=RESID;
run;
/* to check for Residuals for Normality*/
proc univariate data=outreg1 PLOT NORMAL;
var RESID;
histogram RESID / normal;
qqplot RESID / normal;
run;

Interpret the above SAS outputs with respect to the assumption of normality of the errors.
/* to check for homoscedasticity of error terms*/
symbol1 value=dot; /* plot each residual as a filled dot */
proc gplot data=outreg1;
plot RESID*PREDICT;
run;
quit;

Interpret the above SAS output with respect to the assumption of homoscedasticity of the errors.
c) What is the predicted average total sleep time of individuals aged 3, 4 and 16 years?
/* SAS program to predict ATST for age = 3, 4 and 16 */
data SLRM2;
set SLRM end=last; /* last = 1 when the final observation of SLRM is read */
output; /* write each original observation */
if last then do; /* append the new ages after the last original observation */
age = 3; ATST = .; output; /* ATST is missing, so these rows do not affect the fit */
age = 4; ATST = .; output;
age = 16; ATST = .; output;
end;
run;
proc reg data=SLRM2 alpha=0.05;
model ATST = Age / clm; /* CLM prints predicted values with confidence limits for the mean response */
run; quit;
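If limits for an individual's sleep time, rather than for the mean sleep time at a given age, are wanted, the CLI option can be added to the MODEL statement. The following is a sketch using the SLRM2 data set created above. Note that ages 3 and 16 lie outside the observed age range (4.4 to 14 years), so those predictions are extrapolations and should be read with caution.

```sas
/* P   : observed, predicted and residual values                  */
/* CLM : confidence limits for the mean response                  */
/* CLI : prediction limits for an individual response             */
proc reg data=SLRM2 alpha=0.05;
  model ATST = Age / p clm cli;
run;
quit;
```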
Example 2: Consider the data given above, collected to investigate the relationship between income,
age and hours worked per day for households from a given town XYZ. Then
(a) Fit a multiple linear regression model that relates the income of households to their age and
hours worked per day
(b) Check the assumptions of normality, homoscedasticity and no multicollinearity for the fitted
model in (a)
(c) What is the predicted income of a household with age = 54 and hours worked per day = 4?
Solution
(a) /* To fit multiple linear regression model that relates the income of
households with their age and hours worked per day*/
proc reg data=mlrdata;
model Income=Age Hourworked;
run;

Test the significance of the model and of the individual regression coefficients, and interpret the output.
(b) /* Check the assumptions of normality, homoscedasticity and no
multicollinearity for the fitted model above */
/* Compute the residuals, predicted values of income and VIFs of the model */
proc reg data=mlrdata;
model Income=Age Hourworked/vif;
OUTPUT OUT=OUTREG1 P=PREDICT R=RESID;
run;
The vif option in the model statement requests that variance inflation factors be included in the
output.
/* to check for Residuals for Normality*/
proc univariate data=outreg1 PLOT NORMAL;
var RESID;
histogram RESID / normal;
qqplot RESID / normal;
run;

Interpret the above SAS output with respect to the assumptions of normality and no
multicollinearity among the predictors.
/* to check for homoscedasticity of error terms*/
symbol1 value=dot; /* plot each residual as a filled dot */
proc gplot data=outreg1;
plot RESID*PREDICT;
run;
quit;

Interpret the above SAS output with respect to the assumption of homoscedasticity of the error term.
(c) /* SAS program to predict income for age = 54 and hours worked per day = 4 */
data mlrdata2;
set mlrdata end=last; /* last = 1 when the final observation of mlrdata is read */
output; /* write each original observation */
if last then do;
age = 54; Hourworked = 4; income = .; output; /* append the new observation at the
bottom of the original dataset */
end;
run;
proc reg data= mlrdata2;
model income= age Hourworked/ clm; /* CLM prints the predicted value with confidence
limits for the mean response */
run; quit;
In the above SAS programs, the CLM option tells PROC REG to print the predicted value and
confidence limits for the expected (mean) value of the response for each observation; the CLI
option requests prediction limits for an individual value instead. The ALPHA= option specifies
the significance level for the confidence intervals. By default the level is 0.05. The P option
causes the observed, predicted, and residual values to be printed for each observation (for an
observation with a missing response but non-missing predictors, only the predicted value is
available).
The OUTPUT statement can be used to create a SAS data set that contains all the input data, as
well as predicted values, confidence limits, residuals, and regression diagnostics. The form of the
OUTPUT statement is:
OUTPUT OUT=datasetname keyword=name ;
The keywords are used to specify which values to store in the output data set. Useful keyword
options include PREDICTED (or P) for predicted values, and RESIDUAL (or R) for residual
values. An example of an OUTPUT statement is:
OUTPUT OUT=stats P=pred R=res;

The OUTPUT statement is useful for creating a data set that will be used later by another SAS
procedure (such as PROC PLOT). A BY statement can be used with PROC GLM to obtain
separate analyses of observations in groups defined by the BY variables. When a BY statement
appears, PROC GLM expects the data to be sorted in the order of the BY variables.
Qualitative predictors in linear regression model
In fitting a linear regression, qualitative predictor variables such as gender, marital status or
religion can be included in the model through so-called indicator variables or dummy
variables, which take the values 0 or 1 to identify the categories of the qualitative predictor
variable. In general, for a categorical variable with k levels, you need to create (k − 1) dummy
variables and will have (k − 1) corresponding regression coefficients in the model; the remaining
category serves as the reference category for interpreting the results.
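As a sketch of what this means for a predictor with k = 3 levels (say A, B and C, with C as the reference), the model with two quantitative predictors can be written as follows. The names D_A and D_B are notation introduced here for the two dummy variables.

```latex
\[
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 D_A + \beta_4 D_B + \varepsilon,
\qquad
D_A = \begin{cases} 1 & \text{if level A} \\ 0 & \text{otherwise} \end{cases},
\quad
D_B = \begin{cases} 1 & \text{if level B} \\ 0 & \text{otherwise} \end{cases}.
\]
```

Here $\beta_3$ is the difference in mean response between level A and the reference level C, holding $x_1$ and $x_2$ fixed, and $\beta_4$ is the corresponding difference between level B and level C.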
Example 1: The following data give the plasma total cholesterol level (in mg/100ml),
weight (in kg), age (in years) and frequency of performing exercise per week (no exercise,
exercise sometimes or exercise often) for a sample of 25 patients suffering from
hyperlipoproteinemia before drug therapy.
Total cholesterol(y) Weight(x1) Age(x2) Freq exercises(x3)
354 84 46 No
190 73 20 Some times
405 65 52 Some times
263 70 30 No
451 76 57 Some times
302 69 25 often
288 63 28 often
385 72 36 No
402 79 57 often
365 75 44 No
209 27 24 Some times
290 89 31 Some times
346 65 52 No
254 57 23 No
395 59 60 Some times
434 69 48 often
220 60 34 Some times
374 79 51 Some times
308 75 50 Some times
220 82 34 often
311 59 46 Some times

181 67 23 No
274 85 37 often
303 55 40 No
244 63 30 often
Then, fit multiple regression models that relate total cholesterol level to weight, age and
frequency of performing exercise per week.
/* Fitting a regression model with a categorical predictor in SAS by using either PROC REG or
PROC GLM */
data MLRdataCatpr;
/* Y=cholesterol level, x1=weight, x2=age, x3=frequency of exercise
(the category "Some times" is entered as Esometim) */
input Y X1 X2 X3 $;
datalines;
354 84 46 No
190 73 20 Esometim
405 65 52 Esometim
263 70 30 No
451 76 57 Esometim
302 69 25 often
288 63 28 often
385 72 36 No
402 79 57 often
365 75 44 No
209 27 24 Esometim
290 89 31 Esometim
346 65 52 No
254 57 23 No
395 59 60 Esometim
434 69 48 often
220 60 34 Esometim
374 79 51 Esometim
308 75 50 Esometim
220 82 34 often
311 59 46 Esometim
181 67 23 No
274 85 37 often
303 55 40 No
244 63 30 often
;
run;
proc print data=MLRdataCatpr;
run;
/* Create dummy variable for frequency of exercise to use PROC REG procedure in SAS*/
data MLRdataCatpr2;
set MLRdataCatpr;
if x3='No' then noex=1;
else noex=0;
if x3='Esometim' then somex=1;
else somex=0;
/* Frequency of Exercise=Often used as Reference*/
run;
proc print data=MLRdataCatpr2;run;
/* Mlrmodel By Proc Reg Procedure after Creating Dummies for Categorical Predictor
Frequency of Exercise*/
PROC REG DATA=MLRdataCatpr2;
MODEL Y=X1 X2 NOEX SOMEX; /* often category is taken as reference*/
RUN;
/* MLR MODEL BY PROC GLM PROCEDURE THROUGH CLASS STATEMENT */
PROC GLM DATA=MLRdataCatpr;
CLASS X3;
MODEL Y=X1 X2 X3/SS3;
RUN;
/* PROC GLM performs the ANOVA and automatically provides the information that a TEST
statement would provide */
/* If we like, we can also request the parameter estimates by adding the SOLUTION option to the
MODEL statement */
/* MLRMODEL BY PROC GLM PROCEDURE*/
PROC GLM DATA=MLRdataCatpr;
CLASS X3;
MODEL Y=X1 X2 X3/ SOLUTION;
RUN;
h) Generalized Linear Models using SAS/ Proc GENMOD
The goal of modeling is to find the best-fitting and most parsimonious model to describe the
relationship between an outcome (dependent or response variable) and a set of independent
(predictor or explanatory) variables. The most common example is the usual linear
regression model, where the outcome variable is continuous. In many cases, however, the
outcome variable is discrete or categorical, and generalized linear models are then often used.
Fitting a generalized linear model requires only additional options to specify the probability
distribution and the link function. Several choices of distribution family and link function are available:

Logistic regression is an example of a generalized linear model in which the response variable
is categorical. Binary logistic regression is similar to a linear regression model but is suited to
models where the dependent variable is dichotomous, such as the presence or absence of a
characteristic of interest, often coded as 1 or 0. PROC GENMOD in SAS fits generalized linear
models. Its general form is:
proc genmod data=nameofdataset;
model response = predictors / dist= link= ;
run;
The DIST= option specifies the distribution of the random component: bin for binomial, poi
for Poisson and nor for normal. The LINK= option specifies the link function, for example
identity, logit or probit.
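For the binomial distribution with the logit link, the model being fitted can be written as below. This is the standard form of the binary logistic regression model; $p$ denotes the probability of the modeled response category and $x_1, \dots, x_k$ the predictors.

```latex
\[
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad
p = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k}} .
\]
```

Each $e^{\beta_j}$ is interpreted as the odds ratio associated with a one-unit increase in $x_j$ (or, for a dummy variable, with that category relative to the reference category), holding the other predictors fixed.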
Example 1: Consider the following data collected to model the effect of gender and
educational level on the saving status of households, for a random sample of 20 households from
a certain city XYZ. The dependent variable is the saving status of the household (Y), which is
dichotomous and coded as 0=yes, 1=no. The independent variables are gender (x1), coded as
0=male, 1=female, and educational level (x2), coded as 0=elementary or less, 1=high school and
2=high school and above.
Saving(Y) Gender(x1) Edulev(x2)
0 1 2
1 0 0
0 0 1
0 0 2
1 1 1
0 0 2
0 0 1
1 1 0
0 0 0
1 1 2
0 0 2
1 1 0
0 0 1
0 0 2
0 1 2
1 0 1
1 0 0
0 0 2
0 1 1
0 1 2
Create a SAS dataset, fit a binary logistic regression model to the data, and identify the factors
affecting the saving habit of households in city XYZ.
/*SAS Program for creating SAS Dataset*/


Data savingdata;
Input saving gender edulev;
Datalines;
0 1 2
1 0 0
0 0 1
0 0 2
1 1 1
0 0 2
0 0 1
1 1 0
0 0 0
1 1 2
0 0 2
1 1 0
0 0 1
0 0 2
0 1 2
1 0 1
1 0 0
0 0 2
0 1 1
0 1 2
;
Run;
/*SAS Program to fit multiple binary logistic regression model with proc genmod step*/
proc genmod descending data=savingdata;
class gender edulev;
model saving=gender edulev/link=logit dist=bin;
run;

Test the significance of the individual regression coefficients and identify the factors affecting
the saving habit of households.
You can also fit the multiple binary logistic regression model with PROC LOGISTIC, as shown
below, and obtain results similar to those from the PROC GENMOD program.
/*SAS Program to fit multiple binary logistic regression with proc logistic option*/
proc logistic descending data= savingdata;
class gender edulev;
model saving= gender edulev;
run;
i) ANOVA Models using SAS/ PROC ANOVA or PROC GLM
a) One-way Analysis of Variance
One-way analysis of variance deals with methods for making inferences about the relationship
between a single numeric response variable and a single categorical explanatory variable. For
this we use PROC GLM or PROC ANOVA. A disadvantage of PROC ANOVA is that it is limited
with respect to residual analysis, since it does not allow an OUTPUT statement. Because
checking assumptions is important in a statistical analysis, PROC GLM is preferred.
Example 1: An experiment was conducted to determine the effects of three types of fertilizer on
the first-year growth rate of wheat seedlings. The data are shown below.
Types of fertilizer growth rate
A 11 13 16 10
B 15 17 20 12
C 10 15 13 10
a. Enter the data in SAS and create temporary SAS dataset named oneanova
b. Do the data provide sufficient evidence to indicate a difference in the average growth
rate depending on the type of fertilizer?
c. If the difference in mean growth rate is statistically significant perform pair wise
comparison and draw your conclusions.
(a) /* SAS Program to create SAS dataset */
Data oneanova;
Input growth fert;
Datalines;
11 1
13 1
16 1
10 1
15 2
17 2
20 2
12 2
10 3
15 3
13 3
10 3
; run;
(b) /* SAS program to carry out one-way ANOVA and multiple comparisons */
PROC ANOVA DATA=oneanova;
CLASS fert;
MODEL growth = fert;
MEANS fert / TUKEY CLDIFF;
RUN;
In the above program, the SAS dataset oneanova has a variable fert containing the fertilizer type
(A, B, and C, coded as 1, 2 and 3) and a variable growth containing the first-year growth rate of
the wheat seedlings. Assume that a number of plots were planted with each of the three types of
fertilizer and the first-year growth of the wheat seedlings was measured. The procedure specifies
a one-factor model, testing whether the levels of fert have different effects on the first-year
growth of wheat seedlings.
The MEANS statement:

The MEANS statement computes means and performs tests for the indicated effects. Possible
options include:
BON Bonferroni t tests
SCHEFFE Scheffe's multiple-comparison procedure
T Fisher's LSD (least significant difference) test
TUKEY Tukey's studentized range test
CLM presents the results of the BON, SCHEFFE, and T tests as confidence intervals for
the mean of each level
CLDIFF presents the results of the BON, SCHEFFE, TUKEY and T tests as confidence
intervals for all pairwise differences between means (the default when cell sizes are unequal).
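Because PROC ANOVA has no OUTPUT statement, the residual checks mentioned earlier call for PROC GLM. The following is a minimal sketch for the fertilizer data set oneanova from Example 1; the output data set name anovares and the variable names pred and resid are hypothetical names introduced here.

```sas
/* One-way ANOVA with PROC GLM, saving predicted values and residuals */
proc glm data=oneanova;
  class fert;
  model growth = fert;
  means fert / tukey cldiff;             /* same comparisons as in PROC ANOVA */
  output out=anovares p=pred r=resid;    /* hypothetical output names */
run;
quit;

/* Normality check of the residuals */
proc univariate data=anovares normal;
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;
```

The F test and the Tukey intervals should match those from PROC ANOVA for these balanced data. The second step adds the normality tests and a normal Q-Q plot of the residuals.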
Example 2: The following data are from an experiment to determine the effect of nutrition on the
attention span of elementary school students. A group of 15 students was randomly assigned to
the three meal plans (no breakfast, light breakfast and full breakfast), five students to each
plan. Their attention spans (in minutes) were recorded during a morning reading period.
No breakfast Light breakfast Full break fast
8 14 10
7 16 12
9 12 16
13 17 15
10 11 12
a. Enter the data in SAS
/* SAS Program to enter data/create SAS dataset*/
Data oneanova;
Input attention breakfast;
Datalines;
8 1
7 1
9 1
13 1
10 1
14 2
16 2
12 2
17 2
11 2
10 3
12 3
16 3
15 3
12 3
; run;
b. Do the data provide sufficient evidence to indicate a difference in the average attention
spans depending on the type of breakfast eaten by the students?
/* SAS program to carry out one way ANOVA and multiple comparison*/
PROC ANOVA DATA=oneanova;
CLASS breakfast;
MODEL attention = breakfast;
MEANS breakfast / TUKEY CLDIFF;
RUN;

Interpret the above SAS output with respect to the difference in average attention span
by type of breakfast eaten by the students.
c. If the difference in mean attention span is statistically significant, perform pairwise
comparisons and draw your conclusions.

Identify the pairs of means responsible for rejection of the null hypothesis in the above ANOVA
result.
b) Two-way Analysis of Variance
Two-way analysis of variance deals with methods for making inferences about the relationship
between a single numeric response variable and two categorical explanatory variables. In one-way
ANOVA we discussed the effect of only one factor; we now move on to two-way ANOVA.
Two-way ANOVA deals with two factors, whose levels are arranged along the rows and columns. In
such a situation it is logical to think in terms of the treatment combination received by each
experimental unit. When there is no interaction, each factor contributes to the response without
the influence of the other. In general, the analysis estimates and tests the main effects of the
row and column factors and, where it is included in the model, their interaction.
Example 1: It is of interest to investigate whether four different forms of a standardized reading
test are in fact equivalent. To this end, a total of twenty students is randomly selected from five
different schools (four from each) to take these tests. The heterogeneity in the experimental units
(students) that arises because they come from different schools is removed by blocking. Within a
block (the four students from school j), the treatments (forms of the test) are randomly
assigned. The resulting scores of the students are given as follows:
Form \ School    1    2    3    4    5
1               75   73   59   69   84
2               83   72   56   70   92
3               86   61   53   72   88
4               73   67   62   79   95

a. Create SAS dataset


b. Test whether the average score for four different forms of a standardized reading are equal or
not.
c. Test if blocking factor was important in this experiment to study the mean score in students.
d. If the difference in mean score is statistically significant perform pair wise comparison and
draw your conclusions.
(a) /* SAS Program to enter data/create SAS dataset*/
Data twoanova;
Input mark school form;
Datalines;
75 1 1
73 2 1
59 3 1
69 4 1
84 5 1
83 1 2
72 2 2
56 3 2
70 4 2
92 5 2
86 1 3
61 2 3
53 3 3
72 4 3
88 5 3
73 1 4
67 2 4
62 3 4
79 4 4
95 5 4
; run;
/* SAS Program to perform two-way ANOVA without Interaction and multiple
comparison*/
PROC ANOVA DATA=twoanova;
CLASS school form;
MODEL mark= school form;
MEANS form / TUKEY CLDIFF;
RUN;

Answer questions (b) to (d) above based on the above SAS output.
Example 2: A researcher is interested in the effect of two lifestyle factors (sodium intake in the
diet and smoking habit) on systolic blood pressure. She/he took a random sample of 24
individuals and recorded their smoking habit, the level of sodium in their diet and their systolic
blood pressure (in mmHg). The data are entered in the SAS program below.

• Does the average systolic blood pressure depend on sodium intake and smoking
habit?
• Is there an interaction between sodium intake and smoking habit?
/* SAS Program to enter data and perform two-way ANOVA with interaction */
Data twoanovaint;
Input SBP Sointake smoking;
Datalines;
129 1 1
125 1 1
129 1 1
132 1 1
125 1 1
128 1 1
140 1 2
126 1 2
120 1 2
137 1 2
142 1 2
147 1 2
132 2 1
137 2 1
130 2 1
148 2 1
154 2 1
158 2 1
165 2 2
152 2 2
140 2 2
167 2 2
142 2 2
177 2 2
; run;
PROC ANOVA DATA=twoanovaint;
CLASS Sointake smoking;
MODEL SBP= Sointake smoking Sointake*smoking;
MEANS Sointake smoking / TUKEY CLDIFF;
RUN;
