
Data Sets: Subsetting, Combining and Updating

Subsetting Datasets
Using the SET statement together with a subsetting IF statement, you can easily create a subset of an existing SAS dataset. Consider the following example. An amusement park is collecting data about its train ride. The data below include the time of day, the number of cars on the train, and the number of people on the train. Suppose that at some point only the afternoon rides are to be analyzed. Create a subset of the trains dataset that includes only the rides departing after noon.

DATA trains;
    INPUT Time TIME5. Cars People;
    DATALINES;
10:10 6 21
12:15 10 56
15:30 10 25
11:30 8 34
13:15 8 12
10:45 6 13
20:30 6 32
23:15 6 12
;
RUN;

DATA afternoon;
    SET trains;
    IF time > '12:00'T;    /*keep only obs with times after noon*/
RUN;

PROC PRINT;
    FORMAT time TIME5.;
RUN;

Notice that, in this DATA step, the statement IF time > '12:00'T; has the same effect as writing

IF time > '12:00'T THEN OUTPUT;

or, equivalently,

IF time > '12:00'T THEN OUTPUT afternoon;
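Because OUTPUT can name a dataset, a single pass through trains can also split the data into two subsets. The following is only a sketch of that idea and is not part of the original example; the dataset name morning is an assumption introduced for illustration.

```sas
/* Recreate the trains data so the sketch runs on its own */
DATA trains;
    INPUT Time TIME5. Cars People;
    DATALINES;
10:10 6 21
12:15 10 56
15:30 10 25
11:30 8 34
13:15 8 12
10:45 6 13
20:30 6 32
23:15 6 12
;
RUN;

/* Name both output datasets in the DATA statement, then route each
   observation with an explicit OUTPUT. The name morning is hypothetical;
   it receives every ride at or before 12:00. */
DATA afternoon morning;
    SET trains;
    IF Time > '12:00'T THEN OUTPUT afternoon;
    ELSE OUTPUT morning;
RUN;

PROC PRINT DATA=afternoon;
    FORMAT Time TIME5.;
RUN;

PROC PRINT DATA=morning;
    FORMAT Time TIME5.;
RUN;
```

With the eight observations above, afternoon receives five rides and morning receives three.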

STA4133/5133.01 Page 1 of 9

Combining Datasets
The variables needed for your analysis may come from separate datasets, or your complete set of records may be spread across separate files. In either case, you will need to combine SAS datasets by stacking, interleaving, or merging them. For example, data from longitudinal studies are usually stored in annual files. To perform cross-year studies, you will want to create a single dataset that contains only the variables you need for your analysis from all years of the study. This may require stacking or interleaving existing datasets while renaming and dropping variables.

Using the SET statement


Stacking SAS datasets

Any number of SAS datasets can be stacked, or concatenated, using the SET statement. The syntax is the same as for a single dataset, except that you specify two or more datasets in the SET statement:

DATA new;
    SET dataset1 dataset2 ... datasetn;

Some important notes:
- The input datasets do not have to be sorted, and in most cases the output dataset will not be sorted.
- The total number of records resulting from the SET statement will be the sum of the records in the input datasets.
- The total number of variables resulting will depend on the number of matching variables. If one of the datasets has a variable that is not in another dataset, the observations from that other dataset will have missing values for that variable.

Example. Consider the datasets master and test. The master dataset contains social security number and name, and the test dataset contains social security number and a test score. What is a reasonable way to combine these datasets? It would make sense to match score to name by social security number. We will do this later. For now, let us see what happens if we use the SET statement alone, so that the datasets are simply concatenated.

***Programs to create data sets MASTER and TEST;
DATA MASTER;
    INPUT SS NAME : $9.;
    DATALINES;
123456789 CODY
987654321 SMITH
111223333 GREGORY
222334444 HAMER
777665555 CHAMBLISS
;
DATA TEST;
    INPUT SS SCORE;
    DATALINES;
123456789 100
987654321 67
222334444 92
;

DATA combine;
    SET master test;
RUN;

Examine the output from this dataset using the PROC PRINT statement. Now switch the order of the input datasets in the SET statement and compare the outputs.
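The introduction to this section mentions renaming and dropping variables while stacking annual files. A minimal sketch of how that might look is given below. The datasets survey2001 and survey2002, their variables, and their values are all hypothetical and serve only to illustrate the KEEP=, RENAME=, and IN= dataset options on the SET statement.

```sas
/* Hypothetical annual files: the same income measure is stored under
   different names, and survey2002 carries an extra variable (region). */
DATA survey2001;
    INPUT id income01;
    DATALINES;
1 31000
2 45000
;
RUN;

DATA survey2002;
    INPUT id inc region $;
    DATALINES;
1 32500 N
3 28000 S
;
RUN;

/* Stack the two years, keeping only the needed variables and giving the
   income variable one common name. KEEP= is applied before RENAME=, so
   KEEP= lists the original variable names. The temporary flag in01 is
   used to record which year each observation came from. */
DATA allyears;
    SET survey2001 (KEEP=id income01 RENAME=(income01=income) IN=in01)
        survey2002 (KEEP=id inc      RENAME=(inc=income));
    IF in01 THEN year = 2001;
    ELSE year = 2002;
RUN;

PROC PRINT DATA=allyears;
RUN;
```

Without the RENAME= options, the stacked dataset would contain two separate income variables, each missing for the observations from the other year.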

Interleaving SAS datasets

When datasets are concatenated as in the above example, records are ordered simply by how they arrive. The records from the first dataset listed in the SET statement come first, in the order they have in that input dataset. Records from the dataset listed last in the SET statement come last, again in their input order. In most cases this results in an unsorted dataset, even if the input datasets are ordered. When two input datasets are ordered by some variable and you wish the combined dataset to be ordered by that same variable, it is more efficient to interleave the datasets than to stack them and then sort the result. The syntax for interleaving datasets is the same as that for stacking, with the addition of a BY statement:

DATA new;
    SET dataset1 dataset2 ... datasetn;
    BY variable1 ... variablen;

Some important notes:
- The input datasets must be sorted by the same variables that appear in the BY statement above.
- The output dataset will be sorted according to the BY statement.
- The total number of records resulting from the SET statement will be the sum of the records in the input datasets.
- The total number of variables resulting will depend on the number of matching variables. If one of the datasets has a variable that is not in another dataset, the observations from that other dataset will have missing values for that variable.


Example. Interleave the datasets master and test. Since these datasets have only one variable in common (social security number), the interleaving will be by that variable. The datasets are not sorted by this variable, so they must be sorted before the interleaving.

PROC SORT DATA=master;
    BY ss;
PROC SORT DATA=test;
    BY ss;
DATA combine;
    SET master test;
    BY ss;
RUN;

Examine the output from this dataset using the PROC PRINT statement.

Using the MERGE statement


Combining datasets may involve matching on a particular variable. When this is the case, use the MERGE statement instead of the SET statement to specify the input datasets. The general format of a MERGE statement is:

DATA new;
    MERGE dataset1 dataset2;
    BY variable1 variable2 ... variablen;
RUN;

Important:
- If the two input datasets have variables with the same name (other than the BY variables), the values from the second dataset will overwrite the values from the first dataset. Use the RENAME= option to assign new names to the variables in one of the datasets.
- The two input datasets MUST be sorted in order of the BY variables, and the BY variables must have identical names in both datasets. If they do not have the same name, use the RENAME= option.

There are a variety of ways to merge two datasets using the MERGE statement:
1. One-to-one match merge, with a BY value
2. Non-matches with a BY value
3. Limiting observations by using the IN= option
4. One-to-many match merge

One-to-one match merge with a BY value

The following example merges a file containing demographic data and a file of health statistics into a single file called pophlth. In this case both datasets have exactly the same zip codes but different variables. The file of demographic data has the following observations:

zip     popsize   hhsize
78201   111111    121212
78202   222222    343434
78203   333333    565656

The health file has the following observations:

zip     diab   access
78201   .04    .8
78202   .05    .7
78203   .06    .6

The files will be merged by zip code, and the result will contain ALL variables from both input files.

/*Always remember to sort the input datasets*/
PROC SORT DATA=demog;     /* demography file */
    BY zip;
PROC SORT DATA=health;    /* health stat. file */
    BY zip;
DATA pophlth;
    MERGE demog health;
    BY zip;
RUN;

The new dataset pophlth will have the following observations:

zip     popsize   hhsize   diab   access
78201   111111    121212   .04    .8
78202   222222    343434   .05    .7
78203   333333    565656   .06    .6

Note: Suppose the variable for zip code is named differently in the two input files. If the health dataset uses the name zcode, for example, you can use the RENAME= option to change it in the MERGE statement:

MERGE demog health (RENAME = (zcode = zip));
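The handout does not show the DATA steps that create demog and health, so the sketch below supplies them, using the observations listed above, in order to show the RENAME= note in a complete program. It assumes the health file stores zip code under the name zcode, as in the note; the INPUT statements are an assumption about how the files might be read.

```sas
/* Demography file, as listed in the example above */
DATA demog;
    INPUT zip popsize hhsize;
    DATALINES;
78201 111111 121212
78202 222222 343434
78203 333333 565656
;
RUN;

/* Health file, with the zip code stored under the name zcode */
DATA health;
    INPUT zcode diab access;
    DATALINES;
78201 .04 .8
78202 .05 .7
78203 .06 .6
;
RUN;

/* Each file is sorted by its own zip code variable */
PROC SORT DATA=demog;
    BY zip;
RUN;
PROC SORT DATA=health;
    BY zcode;
RUN;

/* RENAME= is applied as health is read, so the BY statement
   can refer to zip for both input datasets */
DATA pophlth;
    MERGE demog health (RENAME=(zcode=zip));
    BY zip;
RUN;

PROC PRINT DATA=pophlth;
RUN;
```

The health dataset itself is not changed; the new name applies only to the variable as it enters this DATA step.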

Non-matches with a BY value



Suppose in the previous example that the health dataset has a zip code, say 78204, that is not in the dataset demog. The MERGE statement will keep that observation with its health values and set the demographic variables to missing. As an extreme case, consider what happens when no zip codes match. Consider a second health dataset with the following observations:

zip     diab   access
78214   .04    .7
78215   .03    .8
78216   .02    .9

The same code used previously, applied to this health dataset, will generate a new dataset pophlth with the following observations:

zip     popsize   hhsize   diab   access
78201   111111    121212   .      .
78202   222222    343434   .      .
78203   333333    565656   .      .
78214   .         .        .04    .7
78215   .         .        .03    .8
78216   .         .        .02    .9

Limiting observations by using the IN= option

In the previous examples, all of the input records from both files belonged in the new output dataset. With the IN= option, you can specify which observations you wish to keep. Suppose the demog dataset that we used earlier is a subset of a much larger dataset, which contains demographic data for all US zip codes. Let's call the larger dataset Usdemog. When merging demographic and health data by zip code, we may not wish to keep every zip code that appears in either file. Consider the following code, which uses demog and health and keeps only the zip codes that appear in demog:

DATA pophlth;
    MERGE demog (IN=indem) health;
    BY zip;
    IF indem THEN OUTPUT;
RUN;

The IN= option creates the temporary variable indem, which is true only for observations to which the demog file contributed. The output file will contain an observation for each of the three zip codes in the demog dataset, along with the associated variables from the health file. If there is no matching zip code in the health file, the health variables will be set to missing. If you wanted to output records only when there was both a demography AND a health record for the same zip code, you would use the following:

DATA pophlth;
    MERGE demog (IN=indem) health (IN=inhlth);
    BY zip;
    IF indem AND inhlth THEN OUTPUT;
RUN;

You can take advantage of the IN= option to create several output datasets in one DATA step. Suppose you are merging the same demographic and health files and you wish to keep a dataset of records that matched as well as datasets of records that did not match. The code can easily be modified to accomplish this:

DATA phmatch      /*records that matched*/
     unm_hlth     /*health records with no match in demog*/
     unm_dem;     /*demog records with no match in health*/
    MERGE demog (IN=indem) health (IN=inhlth);
    BY zip;
    IF indem THEN DO;
        IF inhlth THEN OUTPUT phmatch;
        ELSE OUTPUT unm_dem;
    END;
    ELSE IF inhlth THEN OUTPUT unm_hlth;
RUN;
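Since the IN= variables are temporary, they are not written to the output dataset. If you want to see afterwards which file contributed to each record, one possibility is to copy the flags into ordinary variables. The sketch below illustrates this; the variable names from_dem and from_hlth, and the health values for zip code 78204, are assumptions made up for the illustration.

```sas
DATA demog;
    INPUT zip popsize hhsize;
    DATALINES;
78201 111111 121212
78202 222222 343434
78203 333333 565656
;
RUN;

/* Hypothetical health file: 78201 is absent and 78204 is extra */
DATA health;
    INPUT zip diab access;
    DATALINES;
78202 .05 .7
78203 .06 .6
78204 .03 .9
;
RUN;

PROC SORT DATA=demog;
    BY zip;
RUN;
PROC SORT DATA=health;
    BY zip;
RUN;

DATA pophlth;
    MERGE demog (IN=indem) health (IN=inhlth);
    BY zip;
    /* indem and inhlth are dropped automatically at output,
       so copy them into variables that are kept (1 = contributed) */
    from_dem  = indem;
    from_hlth = inhlth;
RUN;

PROC PRINT DATA=pophlth;
RUN;
```

A PROC FREQ table of from_dem by from_hlth would then summarize how many zip codes matched and how many came from only one file.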

One-to-many match merge

Sometimes you want to merge several observations in one dataset with a single observation in another dataset. For example, suppose you have data by zip code and you need to get state information for those zip codes. Since each state has multiple zip codes, you will need a one-to-many merge. The difference between this type of merge and the one-to-one merge is not in the code, but in the data. For this reason, it is important to know your datasets before you merge them! If, in fact, your datasets are not in a one-to-many format but in a many-to-many format, your new dataset may not be what you expect it to be!

Consider the following demographic dataset, which has a state identifier:

zip     zipsize   state
27703   111111    12
78202   222222    28
78203   333333    28

Consider the following dataset with access to care information, by state:

state   access
12      0.7
23      0.8
28      0.6

The following code will perform the one-to-many merge (as with any match merge, both datasets must already be sorted by the BY variable, here state):

DATA new;
    MERGE demog access;
    BY state;
RUN;

The new dataset will contain the following observations:

zip     zipsize   state   access
27703   111111    12      0.7
.       .         23      0.8
78202   222222    28      0.6
78203   333333    28      0.6
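One way to "know your datasets" before a one-to-many merge is to check that the BY variable really is unique in the dataset that is supposed to be on the "one" side. The sketch below uses the automatic FIRST. and LAST. variables for this check. It is only an illustration, and the dataset name dup_states is hypothetical.

```sas
/* State-level access file from the example above */
DATA access;
    INPUT state access;
    DATALINES;
12 0.7
23 0.8
28 0.6
;
RUN;

PROC SORT DATA=access;
    BY state;
RUN;

/* With a BY statement, FIRST.state and LAST.state are both 1 only when
   a state value occurs exactly once. Any observation that fails this
   test belongs to a duplicated state and is kept in dup_states. */
DATA dup_states;
    SET access;
    BY state;
    IF NOT (FIRST.state AND LAST.state);
RUN;

PROC PRINT DATA=dup_states;
RUN;
```

If dup_states is empty, each state occurs once in access and the merge with demog is one-to-many as intended. Any observations in dup_states point to a possible many-to-many situation that should be resolved before merging.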

Updating Datasets
You may have a master dataset that needs to be updated periodically. Study data are an example: they need to be updated with corrections, follow-up data, or other new information. The general form of the UPDATE statement is the same as for the MERGE statement, except that it takes exactly two input datasets, the master dataset followed by the transaction dataset:

DATA master;
    UPDATE master transactions;
    BY variable1 ... variablen;
RUN;

Important notes:
- Input datasets must be sorted according to the BY statement.
- The output dataset will be sorted.
- The values of the BY variables must be unique in the master dataset (but not necessarily in the transaction/updates dataset).
- Missing values in the transaction dataset DO NOT replace existing values in the master dataset.

Example. (B 6.8) A hospital maintains a master database with patient information. Each record contains the patient's account number, last name, address, date of birth, sex, insurance code, and the date that the patient's information was last updated. Whenever a patient is admitted to the hospital, a transaction record is created, containing new information and status changes. Some of the patients are new (and are not yet in the master database). The code below illustrates how the master dataset is updated. Because the INPUT statements read several variables from fixed column positions, the data lines must be aligned exactly as shown.

DATA master;
    INPUT Account LastName $ 8-16 Address $ 17-34
          BirthDate MMDDYY10. Sex $ InsCode $ 48-50 @52 LastUpdate MMDDYY10.;
    DATALINES;
620135 Smith    234 Aspen St.     12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave.       04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan  76-C La Vista     .          . MCD 01-15-2003
;
RUN;

DATA transactions;
    INPUT Account LastName $ 8-16 Address $ 17-34
          BirthDate MMDDYY10. Sex $ InsCode $ 48-50 @52 LastUpdate MMDDYY10.;
    DATALINES;
620135 .        .                 .          . HLT 06-15-2003
874329 .        .                 04-24-1954 m .   06-15-2003
235777 Harman   5656 Land Way     01-18-2000 f MCD 06-15-2003
;
RUN;

PROC SORT DATA = transactions;
    BY Account;

* Update patient data with transactions;
DATA master;
    UPDATE master transactions;
    BY Account;
RUN;

PROC PRINT DATA = master;
    FORMAT BirthDate LastUpdate MMDDYY10.;
    TITLE 'Admissions Data';
RUN;
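The note that missing values in the transaction dataset do not replace existing values is the main difference between UPDATE and MERGE. The small sketch below runs both statements on the same pair of datasets so that the outputs can be compared. All dataset names, variables, and values here are hypothetical and were invented for this illustration.

```sas
/* Hypothetical master file */
DATA master_small;
    INPUT id score grade $;
    DATALINES;
1 80 B
2 90 A
;
RUN;

/* Hypothetical transactions: a period marks a value that is not supplied */
DATA trans_small;
    INPUT id score grade $;
    DATALINES;
1 . A
2 95 .
;
RUN;

/* MERGE: values from the second dataset overwrite the first,
   including missing values */
DATA merged;
    MERGE master_small trans_small;
    BY id;
RUN;

/* UPDATE: missing values in the transaction dataset leave the
   master values unchanged */
DATA updated;
    UPDATE master_small trans_small;
    BY id;
RUN;

PROC PRINT DATA=merged;
    TITLE 'Result of MERGE';
RUN;

PROC PRINT DATA=updated;
    TITLE 'Result of UPDATE';
RUN;
```

In merged, the score for id 1 and the grade for id 2 become missing, because the transaction values overwrite the master values. In updated, id 1 keeps its score of 80 with the new grade A, and id 2 keeps its grade A with the new score of 95.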
