
SAS Global Forum 2009

Posters

Paper 197-2009

Merging Data Eight Different Ways


David Franklin, Independent Consultant, New Hampshire, USA
ABSTRACT
Merging data is a fundamental function carried out when manipulating data to bring it into a form for either storage or analysis. The use of the MERGE statement inside a datastep is the most common way this task is done within the SAS language, but there are others. This paper looks at eight possible methods, including the use of the MERGE statement, for a one-to-one or one-to-many merge, introducing the SAS® code needed to combine the data.

INTRODUCTION
Merging variables from one dataset into another is one of the basic data manipulation tasks that a SAS programmer has to do. The most common way to merge data is to use the MERGE statement in the DATA step, but there are seven other ways that can help. First though, some data:

Dataset: PATDATA

    SUBJECT  TRT_CODE
    124263   A
    124264   A
    124265   B
    124266   B

Dataset: ADVERSE

    SUBJECT  EVENT
    124263   HEADACHE
    124266   FEVER
    124266   NAUSEA
    124267   FRACTURE

This data will be used throughout the paper for each method described.
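For readers who want to run the examples, the two datasets listed above could be built with a sketch like the one below. The paper does not state the variable types, so the numeric SUBJECT and the character lengths used here are assumptions; change the INPUT statements if a character key is wanted.

```sas
/* Sketch: build the two sample datasets in the WORK library.      */
/* Assumption: SUBJECT is numeric; the paper does not state types. */
DATA patdata;
  INPUT subject trt_code :$1.;
  DATALINES;
124263 A
124264 A
124265 B
124266 B
;
RUN;

DATA adverse;
  INPUT subject event :$8.;
  DATALINES;
124263 HEADACHE
124266 FEVER
124266 NAUSEA
124267 FRACTURE
;
RUN;
```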

MERGING THE DATA


MERGE IN A DATA STEP

The statement most commonly used to merge data within SAS is the MERGE statement inside a datastep, an example of which is given below:

    DATA alldata0;
      MERGE adverse (in=a) patdata (in=b);
      BY subject;
      IF a;
    RUN;

This method is the most common way of merging data as it gives control over the way the data is merged. In the example, the ADVERSE and PATDATA records are merged by SUBJECT, and only those records that come from the ADVERSE dataset are output. As a result, subject 124267 has no TRT_CODE value in the ALLDATA0 dataset, and subjects 124264 and 124265 are not represented in it at all. Most commonly the datastep is preceded by a call to the SORT procedure to make sure that both ADVERSE and PATDATA are in the same sort order. It is possible to use indexed datasets instead, but the indexes must be defined before the datastep that does the merging of the data.
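The sort step mentioned above might look like the following sketch. It assumes both datasets live in the WORK library and may be sorted in place, neither of which the paper states.

```sas
/* Sort both inputs by the BY variable before the MERGE step. */
PROC SORT DATA=adverse;
  BY subject;
RUN;

PROC SORT DATA=patdata;
  BY subject;
RUN;
```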

MERGE WITH SQL

SAS version 6.07 introduced SQL into SAS, which gave the ability to merge data using the SQL language. To merge our two datasets and get the same result as in the code above, the SQL code would look something like that below:

    PROC SQL;
      CREATE TABLE alldata0 AS
        SELECT a.*, b.trt_code
          FROM adverse a
          LEFT JOIN patdata b
            ON a.subject=b.subject;
    QUIT;

SQL is a well known language that is very good at working with databases and is liked by many who deal with large datasets.

MERGE WITH SET-KEY

Over the years many options have been added to the SET statement, which brings us to the third method for merging data, using the KEY= option as shown in the following example:

    DATA alldata0;
      SET adverse;
      SET patdata KEY=subject /UNIQUE;
      DO;
        IF _IORC_ THEN DO;
          _ERROR_=0;
          trt_code='';
        END;
      END;
    RUN;

Before this example is run, the dataset PATDATA must have an index created on SUBJECT, using the INDEX CREATE statement in the DATASETS procedure, the CREATE INDEX statement in the SQL procedure, or the INDEX= dataset option in a DATA step. The DO block is important: if no match is found, it resets _ERROR_ and sets TRT_CODE to missing. If this is not done, unexpected results may occur, because the TRT_CODE value from the previous successful lookup would otherwise be carried forward.
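One way to create the index that the KEY= lookup needs is sketched below with the DATASETS procedure. It assumes PATDATA is in the WORK library and holds one record per subject, so that a unique index can be built.

```sas
/* Create a simple unique index named SUBJECT on WORK.PATDATA. */
PROC DATASETS LIBRARY=work NOLIST;
  MODIFY patdata;
  INDEX CREATE subject / UNIQUE;
QUIT;
```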

MERGE WITH FORMAT

The fourth method creates a format from the PATDATA dataset and sets the treatment from the created format:

    DATA fmt;
      RETAIN fmtname 'TRT_FMT' type 'C';
      SET patdata;
      RENAME subject=start trt_code=label;
    RUN;

    PROC FORMAT CNTLIN=fmt;
    RUN;

    DATA alldata0;
      SET adverse;
      ATTRIB trt_code LENGTH=$1 LABEL='Treatment Code';
      trt_code=PUT(subject,$trt_fmt.);
    RUN;

In the example a character format TRT_FMT is created from the PATDATA dataset, and then this format is used to set the TRT_CODE variable within the ADVERSE dataset. This method is useful as the data does not have to be sorted or indexed beforehand.

MERGE WITH HASH TABLE

Since version 9.1 another possibility has been available: the use of hash tables. Many papers have been written about this recent feature, how it works, and its use within SAS; references to some notable papers are in the References section below. The code below does the merge required:

    DATA alldata0;
      IF _n_=0 THEN SET patdata;
      IF _n_=1 THEN DO;
        DECLARE HASH _h1 (dataset: "PATDATA");
        rc=_h1.defineKey("SUBJECT");
        rc=_h1.defineData("TRT_CODE");
        rc=_h1.defineDone();
        call missing(SUBJECT,TRT_CODE);
      END;
      SET adverse;
      rc=_h1.find();
      IF rc^=0 THEN trt_code=" ";
      DROP rc;
    RUN;

In the example above, the dataset PATDATA gets loaded into a hash table, then the ADVERSE dataset is read into the datastep and the match is made using the FIND() method.

MERGE WITH ARRAY

A variation on the hash table is to load the dataset with unique records into an array and then do the match, as the following example demonstrates:

    DATA _null_;
      SET sashelp.vtable;
      WHERE libname='WORK';
      WHERE ALSO memname in('PATDATA','ADVERSE');
      CALL SYMPUT('X'||memname,put(nobs,8.));
    RUN;

    DATA alldata0;
      LENGTH trt_code $1;
      ARRAY f[&xpatdata.,2] $6 _TEMPORARY_;
      DO i=1 TO &xpatdata.;
        SET patdata (RENAME=(trt_code=trt_code_dict));
        f[i,1]=PUT(subject,6.);
        f[i,2]=trt_code_dict;
      END;
      DO i=1 TO &xadverse.;
        SET adverse;
        trt_code='';
        DO j=1 TO &xpatdata.;
          IF subject=INPUT(f(j,1),best.) THEN DO;
            trt_code=f[j,2];
            OUTPUT;
          END;
          IF ^MISSING(trt_code) THEN LEAVE;
        END;
        IF MISSING(trt_code) THEN OUTPUT;
      END;
      DROP i j trt_code_dict;
    RUN;

The first datastep finds the number of records within PATDATA and ADVERSE so that the correct number of elements can be set for the array and the correct number of iterations is used when reading the ADVERSE dataset.

The method above does have one surprising feature. The line

    IF subject=INPUT(f(j,1),best.) THEN DO;

where the actual compare is done, can be changed to use any comparison, whether it be an INDEX function or greater than/less than operators.

MERGE WITH MODIFY

The seventh method uses the MODIFY statement. This is an interesting technique, as a loop within a loop is needed because the ADVERSE dataset has multiple records per subject:

    DATA adverse;
      DO p = 1 TO totobs;
        _IORC_ = 0;
        SET patdata POINT=p NOBS=totobs;
        DO WHILE(_IORC_=%SYSRC(_SOK));
          MODIFY adverse KEY=subject;
          SELECT (_IORC_);
            WHEN (%SYSRC(_SOK)) DO;     /*MATCH FOUND*/
              SET patdata POINT=p;
              trtc=trt_code;
              REPLACE;
            END;
            WHEN (%SYSRC(_DSENOM)) DO;  /*NO MATCH FOUND*/
              _ERROR_ = 0;
            END;
            OTHERWISE DO;               /*A MAJOR PROBLEM SOMEWHERE*/
              PUT 'ERROR: _IORC_ = ' _IORC_ / 'PROGRAM HALTED.';
              _ERROR_ = 0;
              STOP;
            END;
          END;
        END;
      END;
      STOP;
    RUN;

Note that this method modifies the existing ADVERSE dataset in place and does not create the ALLDATA0 dataset that the previous methods gave. Because MODIFY cannot add variables to an existing dataset, the variable that receives the treatment code (TRTC in the example) must already exist in ADVERSE. Note also that the dataset ADVERSE must have an index called SUBJECT created before the datastep is run.
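The index on ADVERSE that this method relies on could be created as in the sketch below, here with the SQL procedure. It assumes ADVERSE is in the WORK library. The index is not unique, because a subject can have several adverse events.

```sas
/* Create a simple (non-unique) index named SUBJECT on WORK.ADVERSE. */
/* For a simple index, the index name must match the column name.    */
PROC SQL;
  CREATE INDEX subject ON adverse(subject);
QUIT;
```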

CALL EXECUTE

The last merge presented in this paper is something that I have seen used, and at best it is only good when dealing with small subsets of very large datasets:

    DATA _null_;
      SET patdata;
      CALL EXECUTE("DATA alldat;"||
                   " SET adverse;"||
                   " WHERE subject='"||STRIP(subject)||"';"||
                   " trt_code='"||STRIP(trt_code)||"';"||
                   "PROC APPEND BASE=alldata0 DATA=alldat FORCE;"||
                   "RUN;");
    RUN;

This method uses CALL EXECUTE to add TRT_CODE from PATDATA to ADVERSE by SUBJECT, generating one datastep and one APPEND step per PATDATA record and appending the result each time to the dataset ALLDATA0. Unfortunately this method only produces a dataset with the intersection of data from PATDATA and ADVERSE, but it is something to use occasionally.

CONCLUSION
As shown in the paper, there are a number of methods which can be used to merge data beyond the MERGE statement within a DATA step. No one method is better than another, and the methods shown here are by no means exhaustive. It is only through trying these different methods at your site that you will see the differences in resource efficiency between them.
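One way to compare resource use is to turn on the FULLSTIMER system option around each candidate method and read the timing and memory notes in the log. The sketch below wraps the SQL method as an illustration; it assumes the ADVERSE and PATDATA datasets already exist in the WORK library, and any of the other methods could be substituted.

```sas
/* Write detailed CPU, elapsed-time and memory notes to the log. */
OPTIONS FULLSTIMER;

PROC SQL;
  CREATE TABLE alldata0 AS
    SELECT a.*, b.trt_code
      FROM adverse a
      LEFT JOIN patdata b
        ON a.subject=b.subject;
QUIT;

/* Return to the default, shorter log notes. */
OPTIONS NOFULLSTIMER;
```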

REFERENCES
Getting Started with the DATA Step Hash Object - Jason Secosky and Janice Bloom, SAS Institute Inc., Cary, NC (SAS Global Forum 2007)
How Do I Love Hash Tables? Let Me Count The Ways! - Judy Loren, Independent Consultant, Portland, ME (NESUG 2006)

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:

David Franklin
16 Roberts Road
Litchfield, NH 03052
Tel/Fax: 603-262-9140
Email: 100316.3451@compuserve.com
Web: http://ourworld.compuserve.com/homepages/dfranklinuk

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
