Patterns For Parallel Programming
And so we built them. Multiprocessor workstations, massively parallel supercomputers, a cluster in every department ... and they haven't come. Programmers haven't come to program these wonderful machines. Oh, a few programmers in love with the challenge have shown that most types of problems can be force-fit onto parallel computers, but general programmers, especially professional programmers who "have lives", ignore parallel computers.

And they do so at their own peril. Parallel computers are going mainstream. Multithreaded microprocessors, multicore CPUs, multiprocessor PCs, clusters, parallel game consoles ... parallel computers are taking over the world of computing. The computer industry is ready to flood the market with hardware that will only run at full speed with parallel programs. But who will write these programs?

This is an old problem. Even in the early 1980s, when the "killer micros" started their assault on traditional vector supercomputers, we worried endlessly about how to attract normal programmers. We tried everything we could think of: high-level hardware abstractions, implicitly parallel programming languages, parallel language extensions, and portable message-passing libraries. But after many years of hard work, the fact of the matter is that "they" didn't come. The overwhelming majority of programmers will not invest the effort to write parallel software.
A common view is that you can't teach old programmers new tricks, so the problem will not be solved until the old programmers fade away and a new generation takes over.

But we don't buy into that defeatist attitude. Programmers have shown a remarkable ability to adopt new software technologies over the years. Look at how many old Fortran programmers are now writing elegant Java programs with sophisticated object-oriented designs. The problem isn't with old programmers. The problem is with old parallel computing experts and the way they've tried to create a pool of capable parallel programmers.

And that's where this book comes in. We want to capture the essence of how expert parallel programmers think about parallel algorithms and communicate that essential understanding in a way professional programmers can readily master. The technology we've adopted to accomplish this task is a pattern language. We made this choice not because we started the project as devotees of design patterns looking for a new field to conquer, but because patterns have been shown to work in ways that would be applicable in parallel programming. For example, patterns have been very effective in the field of object-oriented design. They have provided a common language experts can use to talk about the elements of design and have been extremely effective at helping programmers master object-oriented design.
This book contains our pattern language for parallel programming. The book opens with a couple of chapters to introduce the key concepts in parallel computing. These chapters focus on the parallel computing concepts and jargon used in the pattern language as opposed to being an exhaustive introduction to the field.

The pattern language itself is presented in four parts corresponding to the four phases of creating a parallel program:
* Finding Concurrency. The programmer works in the problem domain to identify the available concurrency and expose it for use in the algorithm design.

* Algorithm Structure. The programmer works with high-level structures for organizing a parallel algorithm.

* Supporting Structures. We shift from algorithms to source code and consider how the parallel program will be organized and the techniques used to manage shared data.

* Implementation Mechanisms. The final step is to look at specific software constructs for implementing a parallel program.
The patterns making up these four design spaces are tightly linked. You start at the top (Finding Concurrency), work through the patterns, and by the time you get to the bottom (Implementation Mechanisms), you will have a detailed design for your parallel program.

If the goal is a parallel program, however, you need more than just a parallel algorithm. You also need a programming environment and a notation for expressing the concurrency within the program's source code. Programmers used to be confronted by a large and confusing array of parallel programming environments. Fortunately, over the years the parallel programming community has converged around three programming environments.
* OpenMP. A simple language extension to C, C++, or Fortran to write parallel programs for shared-memory computers.

* MPI. A message-passing library used on clusters and other distributed-memory computers.

* Java. An object-oriented programming language with language features supporting parallel programming on shared-memory computers and standard class libraries supporting distributed computing.

Many readers will already be familiar with one or more of these programming notations, but for readers completely new to parallel computing, we've included a discussion of these programming environments in the appendixes.
In closing, we have been working for many years on this pattern language. Presenting it as a book so people can start using it is an exciting development for us. But we don't see this as the end of this effort. We expect that others will have their own ideas about new and better patterns for parallel programming. We've assuredly missed some important features that really belong in this pattern language. We embrace change and look forward to engaging with the larger parallel computing community to iterate on this language. Over time, we'll update and improve the pattern language until it truly represents the consensus view of the parallel programming community. Then our real work will begin: using the pattern language to guide the creation of better parallel programming environments and helping people to use these technologies to write parallel software. We won't rest until the day sequential software is rare.
ACKNOWLEDGMENTS

We started working together on this pattern language in 1998. It's been a long and twisted road, starting with a vague idea about a new way to think about parallel algorithms and finishing with this book. We couldn't have done this without a great deal of help.

Mani Chandy, who thought we would make a good team, introduced Tim to Beverly and Berna. The National Science Foundation, Intel Corp., and Trinity University have supported this research at various times over the years. Help with the patterns themselves came from the people at the Pattern Languages of Programs (PLoP) workshops held in Illinois each summer. The format of these workshops and the resulting review process was challenging and sometimes difficult, but without them we would have never finished this pattern language. We would also like to thank the reviewers who carefully read early manuscripts and pointed out countless errors and ways to improve the book.

Finally, we thank our families. Writing a book is hard on the authors, but that is to be expected. What we didn't fully appreciate was how hard it would be on our families. We are grateful to Beverly's family (Daniel and Steve), Tim's family (Noah, August, and Martha), and Berna's family (Billie) for the sacrifices they've made to support this project.

Tim Mattson, Olympia, Washington, April 2004
Beverly Sanders, Gainesville, Florida, April 2004
Berna Massingill, San Antonio, Texas, April 2004
Chapter 1. A Pattern Language for Parallel Programming
Section 1.1. INTRODUCTION
Section 1.2. PARALLEL PROGRAMMING
Section 1.3. DESIGN PATTERNS AND PATTERN LANGUAGES
Section 1.4. A PATTERN LANGUAGE FOR PARALLEL PROGRAMMING
Chapter 2. Background and Jargon of Parallel Computing
Section 2.1. CONCURRENCY IN PARALLEL PROGRAMS VERSUS OPERATING SYSTEMS
Section 2.2. PARALLEL ARCHITECTURES: A BRIEF INTRODUCTION
Section 2.3. PARALLEL PROGRAMMING ENVIRONMENTS
Section 2.4. THE JARGON OF PARALLEL COMPUTING
Section 2.5. A QUANTITATIVE LOOK AT PARALLEL COMPUTATION
Section 2.6. COMMUNICATION
Section 2.7. SUMMARY
Chapter 3. The Finding Concurrency Design Space
Section 3.1. ABOUT THE DESIGN SPACE
Section 3.2. THE TASK DECOMPOSITION PATTERN
Section 3.3. THE DATA DECOMPOSITION PATTERN
Section 3.4. THE GROUP TASKS PATTERN
Section 3.5. THE ORDER TASKS PATTERN
Section 3.6. THE DATA SHARING PATTERN
Section 3.7. THE DESIGN EVALUATION PATTERN
Section 3.8. SUMMARY
Chapter 4. The Algorithm Structure Design Space
Section 4.1. INTRODUCTION
Section 4.2. CHOOSING AN ALGORITHM STRUCTURE PATTERN
Section 4.3. EXAMPLES
Section 4.4. THE TASK PARALLELISM PATTERN
Section 4.5. THE DIVIDE AND CONQUER PATTERN
Section 4.6. THE GEOMETRIC DECOMPOSITION PATTERN
Section 4.7. THE RECURSIVE DATA PATTERN
Section 4.8. THE PIPELINE PATTERN
Section 4.9. THE EVENT-BASED COORDINATION PATTERN
Chapter 5. The Supporting Structures Design Space
Section 5.1. INTRODUCTION
Section 5.2. FORCES
Section 5.3. CHOOSING THE PATTERNS
Section 5.4. THE SPMD PATTERN
Section 5.5. THE MASTER/WORKER PATTERN
Section 5.6. THE LOOP PARALLELISM PATTERN
Section 5.7. THE FORK/JOIN PATTERN
Section 5.8. THE SHARED DATA PATTERN
Section 5.9. THE SHARED QUEUE PATTERN
Section 5.10. THE DISTRIBUTED ARRAY PATTERN
Section 5.11. OTHER SUPPORTING STRUCTURES
Chapter 6. The Implementation Mechanisms Design Space
Section 6.1. OVERVIEW
Section 6.2. UE MANAGEMENT
Section 6.3. SYNCHRONIZATION
Section 6.4. COMMUNICATION
Endnotes
Appendix A: A Brief Introduction to OpenMP
Section A.1. CORE CONCEPTS
Section A.2. STRUCTURED BLOCKS AND DIRECTIVE FORMATS
Section A.3. WORKSHARING
Section A.4. DATA ENVIRONMENT CLAUSES
Section A.5. THE OpenMP RUNTIME LIBRARY
Section A.6. SYNCHRONIZATION
Section A.7. THE SCHEDULE CLAUSE
Section A.8. THE REST OF THE LANGUAGE
Appendix B: A Brief Introduction to MPI
Section B.1. CONCEPTS
Section B.2. GETTING STARTED
Section B.3. BASIC POINT-TO-POINT MESSAGE PASSING
Section B.4. COLLECTIVE OPERATIONS
Section B.5. ADVANCED POINT-TO-POINT MESSAGE PASSING
Section B.6. MPI AND FORTRAN
Section B.7. CONCLUSION
Appendix C: A Brief Introduction to Concurrent Programming in Java
Section C.1. CREATING THREADS
Section C.2. ATOMICITY, MEMORY SYNCHRONIZATION, AND THE volatile KEYWORD
Section C.3. SYNCHRONIZED BLOCKS
Section C.4. WAIT AND NOTIFY
Section C.5. LOCKS
Section C.6. OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
Section C.7. INTERRUPTS
Glossary
Bibliography
About the Authors
Index
1.1. INTRODUCTION
Computers are used to model physical systems in many fields of science, medicine, and engineering. Modelers, whether trying to predict the weather or render a scene in the next blockbuster movie, can usually use whatever computing power is available to make ever more detailed simulations. Vast amounts of data, whether customer shopping patterns, telemetry data from space, or DNA sequences, require analysis. To deliver the required power, computer designers combine multiple processing elements into a single larger system. These so-called parallel computers run multiple tasks simultaneously and solve bigger problems in less time.

Traditionally, parallel computers were rare and available for only the most critical problems. Since the mid-1990s, however, the availability of parallel computers has changed dramatically. With multithreading support built into the latest microprocessors and the emergence of multiple processor cores on a single silicon die, parallel computers are becoming ubiquitous. Now, almost every university computer science department has at least one parallel computer. Virtually all oil companies, automobile manufacturers, drug development companies, and special effects studios use parallel computing.
For example, in computer animation, rendering is the step where information from the animation files, such as lighting, textures, and shading, is applied to 3D models to generate the 2D image that makes up a frame of the film. Parallel computing is essential to generate the needed number of frames (24 per second) for a feature-length film. Toy Story, the first completely computer-generated feature-length film, released by Pixar in 1995, was processed on a "renderfarm" consisting of 100 dual-processor machines [PS00]. By 1999, for Toy Story 2, Pixar was using a 1,400-processor system with the improvement in processing power fully reflected in the improved details in textures, clothing, and atmospheric effects. Monsters, Inc. (2001) used a system of 250 enterprise servers, each containing 14 processors, for a total of 3,500 processors. It is interesting that the amount of time required to generate a frame has remained relatively constant; as computing power (both the number of processors and the speed of each processor) has increased, it has been exploited to improve the quality of the animation.
The biological sciences have taken dramatic leaps forward with the availability of DNA sequence information from a variety of organisms, including humans. One approach to sequencing, championed and used with success by Celera Corp., is called the whole genome shotgun algorithm. The idea is to break the genome into small segments, experimentally determine the DNA sequences of the segments, and then use a computer to construct the entire sequence from the segments by finding overlapping areas. The computing facilities used by Celera to sequence the human genome included 150 four-way servers plus a server with 16 processors and 64 GB of memory. The calculation involved 500 million trillion base-to-base comparisons [Ein00].
The SETI@home project [SET, ACK+02] provides a fascinating example of the power of parallel computing. The project seeks evidence of extraterrestrial intelligence by scanning the sky with the world's largest radio telescope, the Arecibo Telescope in Puerto Rico. The collected data is then analyzed for candidate signals that might indicate an intelligent source. The computational task is beyond even the largest supercomputer, and certainly beyond the capabilities of the facilities available to the SETI@home project. The problem is solved with public resource computing, which turns PCs around the world into a huge parallel computer connected by the Internet. Data is broken up into work units and distributed over the Internet to client computers whose owners donate spare computing time to support the project. Each client periodically connects with the SETI@home server, downloads the data to analyze, and then sends the results back to the server. The client program is typically implemented as a screen saver so that it will devote CPU cycles to the SETI problem only when the computer is otherwise idle. A work unit currently requires an average of between seven and eight hours of CPU time on a client. More than 205,000,000 work units have been processed since the start of the project. More recently, similar technology to that demonstrated by SETI@home has been used for a variety of public resource computing projects as well as internal projects within large companies utilizing their idle PCs to solve problems ranging from drug screening to chip design validation.
Although computing in less time is beneficial, and may enable problems to be solved that couldn't be otherwise, it comes at a cost. Writing software to run on parallel computers can be difficult. Only a small minority of programmers have experience with parallel programming. If all these computers designed to exploit parallelism are going to achieve their potential, more programmers need to learn how to write parallel programs.

This book addresses this need by showing competent programmers of sequential machines how to design programs that can run on parallel computers. Although many excellent books show how to use particular parallel programming environments, this book is unique in that it focuses on how to think about and design parallel algorithms. To accomplish this goal, we will be using the concept of a pattern language. This highly structured representation of expert design experience has been heavily used in the object-oriented design community.

The book opens with two introductory chapters. The first gives an overview of the parallel computing landscape and background needed to understand and use the pattern language. This is followed by a more detailed chapter in which we lay out the basic concepts and jargon used by parallel programmers. The book then moves into the pattern language itself.
arithmetic is nonassociative. A good parallel programmer must take care to ensure that nondeterministic issues such as these do not affect the quality of the final answer. Creating safe parallel programs can take considerable effort from the programmer.
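The nonassociativity problem is easy to demonstrate. The sketch below is our own illustration, not from the text; the values are chosen to expose rounding. It sums the same four floats in two orders, as two different parallel schedules might, and gets two different answers:

```java
public class FloatAssoc {
    // Sum in index order, mimicking a sequential reduction.
    static float sumForward(float[] a) {
        float s = 0.0f;
        for (float x : a) s += x;
        return s;
    }

    // Sum in reverse order, mimicking one possible parallel schedule.
    static float sumBackward(float[] a) {
        float s = 0.0f;
        for (int i = a.length - 1; i >= 0; i--) s += a[i];
        return s;
    }

    public static void main(String[] args) {
        // Mixing very large and very small magnitudes exposes the rounding:
        // forward order yields 1.0f, reverse order yields 0.0f.
        float[] a = {1.0e8f, 1.0f, -1.0e8f, 1.0f};
        System.out.println(sumForward(a));
        System.out.println(sumBackward(a));
    }
}
```

A parallel reduction that partitions these values differently across runs can therefore produce run-to-run differences even though every individual operation is correct.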
Even when a parallel program is "correct", it may fail to deliver the anticipated performance improvement from exploiting concurrency. Care must be taken to ensure that the overhead incurred by managing the concurrency does not overwhelm the program runtime. Also, partitioning the work among the processors in a balanced way is often not as easy as the summation example suggests. The effectiveness of a parallel algorithm depends on how well it maps onto the underlying parallel computer, so a parallel algorithm could be very effective on one parallel architecture and a disaster on another.

We will revisit these issues and provide a more quantitative view of parallel computation in the next chapter.
Mensore PLoP (Japan). The proceedings of these workshops [Pat] provide a rich source of patterns covering a vast range of application domains in software development and have been used as a basis for several books [CS95, VCK96, MRB97, HFR99].

In his original work on patterns, Alexander provided not only a catalog of patterns, but also a pattern language that introduced a new approach to design. In a pattern language, the patterns are organized into a structure that leads the user through the collection of patterns in such a way that complex systems can be designed using the patterns. At each decision point, the designer selects an appropriate pattern. Each pattern leads to other patterns, resulting in a final design in terms of a web of patterns. Thus, a pattern language embodies a design methodology and provides domain-specific advice to the application designer. (In spite of the overlapping terminology, a pattern language is not a programming language.)
The Finding Concurrency design space is concerned with structuring the problem to expose exploitable concurrency. The designer working at this level focuses on high-level algorithmic issues and reasons about the problem to expose potential concurrency. The Algorithm Structure design space is concerned with structuring the algorithm to take advantage of potential concurrency. That is, the designer working at this level reasons about how to use the concurrency exposed in working with the Finding Concurrency patterns. The Algorithm Structure patterns describe overall strategies for exploiting concurrency. The Supporting Structures design space represents an intermediate stage between the Algorithm Structure and Implementation Mechanisms design spaces. Two important groups of patterns in this space are those that represent program-structuring approaches and those that represent commonly used shared data structures. The Implementation Mechanisms design space is concerned with how the patterns of the higher-level spaces are mapped into particular programming environments. We use it to provide descriptions of common mechanisms for process/thread management (for example, creating or destroying processes/threads) and process/thread interaction (for example, semaphores, barriers, or message passing). The items in this design space are not presented as patterns because in many cases they map directly onto elements within particular parallel programming environments. They are included in the pattern language anyway, however, to provide a complete path from problem description to code.
task a "slice" of the processor time, the operating system can allow multiple users to use the system as if each were using it alone (but with degraded performance).

Most modern operating systems can use multiple processors to increase the throughput of the system. The UNIX shell uses concurrency along with a communication abstraction known as pipes to provide a powerful form of modularity: Commands are written to accept a stream of bytes as input (the consumer) and produce a stream of bytes as output (the producer). Multiple commands can be chained together with a pipe connecting the output of one command to the input of the next, allowing complex commands to be built from simple building blocks. Each command is executed in its own process, with all processes executing concurrently. Because the producer blocks if buffer space in the pipe is not available, and the consumer blocks if data is not available, the job of managing the stream of results moving between commands is greatly simplified. More recently, with operating systems with windows that invite users to do more than one thing at a time, and the Internet, which often introduces I/O delays perceptible to the user, almost every program that contains a GUI incorporates concurrency.
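The blocking producer/consumer discipline of pipes can be sketched in Java (one of the three environments used in this book) with its piped streams. This is an illustrative analogy of the pipe mechanism, not the UNIX implementation itself:

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class PipeDemo {
    // Send the given lines through a pipe from a producer thread to the
    // calling (consumer) thread, returning what the consumer read.
    static List<String> throughPipe(List<String> lines) throws Exception {
        PipedOutputStream producerEnd = new PipedOutputStream();
        PipedInputStream consumerEnd = new PipedInputStream(producerEnd);

        // Producer: writes its stream of bytes; blocks if the pipe's buffer fills.
        Thread producer = new Thread(() -> {
            try (PrintWriter out = new PrintWriter(producerEnd)) {
                for (String s : lines) out.println(s);
            } // closing the stream signals end-of-input to the consumer
        });
        producer.start();

        // Consumer: reads the stream; blocks whenever no data is available yet.
        List<String> received = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(consumerEnd))) {
            String line;
            while ((line = in.readLine()) != null) received.add(line);
        }
        producer.join();
        return received;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(throughPipe(List.of("one", "two", "three")));
    }
}
```

As with shell pipes, neither side needs any explicit flow-control logic: the blocking reads and writes manage the stream automatically.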
Although the fundamental concepts for safely handling concurrency are the same in parallel programs and operating systems, there are some important differences. For an operating system, the problem is not finding concurrency; the concurrency is inherent in the way the operating system functions in managing a collection of concurrently executing processes (representing users, applications, and background activities such as print spooling) and providing synchronization mechanisms so resources can be safely shared. However, an operating system must support concurrency in a robust and secure way: Processes should not be able to interfere with each other (intentionally or not), and the entire system should not crash if something goes wrong with one process. In a parallel program, finding and exploiting concurrency can be a challenge, while isolating processes from each other is not the critical concern it is with an operating system. Performance goals are different as well. In an operating system, performance goals are normally related to throughput or response time, and it may be acceptable to sacrifice some efficiency to maintain robustness and fairness in resource allocation. In a parallel program, the goal is to minimize the running time of a single program.
Single Instruction, Single Data (SISD). In a SISD system, one stream of instructions processes a single stream of data, as shown in Fig. 2.1. This is the common von Neumann model used in virtually all single-processor computers.

Single Instruction, Multiple Data (SIMD). In a SIMD system, a single instruction stream is concurrently broadcast to multiple processors, each with its own data stream (as shown in Fig. 2.2). The original systems from Thinking Machines and MasPar can be classified as SIMD. The CPP DAP Gamma II and Quadrics Apemille are more recent examples; these are typically deployed in specialized applications, such as digital signal processing, that are suited to fine-grained parallelism and require little interprocess communication. Vector processors, which operate on vector data in a pipelined fashion, can also be categorized as SIMD. Exploiting this parallelism is usually done by the compiler.
Figure 2.2. The Single Instruction, Multiple Data (SIMD) architecture
Multiple Instruction, Single Data (MISD). No well-known systems fit this designation. It is mentioned for the sake of completeness.

Multiple Instruction, Multiple Data (MIMD). In a MIMD system, each processing element has its own stream of instructions operating on its own data. This architecture, shown in Fig. 2.3, is the most general of the architectures in that each of the other cases can be mapped onto the MIMD architecture. The vast majority of modern parallel systems fit into this category.
Figure 2.3. The Multiple Instruction, Multiple Data (MIMD) architecture
The other main class of shared-memory systems is called NUMA (nonuniform memory access). As shown in Fig. 2.5, the memory is shared; that is, it is uniformly addressable from all processors, but some blocks of memory may be physically more closely associated with some processors than others. This reduces the memory bandwidth bottleneck and allows systems with more processors; however, as a result, the access time from a processor to a memory location can be significantly different depending on how "close" the memory location is to the processor. To mitigate the effects of nonuniform access, each processor has a cache, along with a protocol to keep cache entries coherent. Hence, another name for these architectures is cache-coherent nonuniform memory access systems (ccNUMA). Logically, programming a ccNUMA system is the same as programming an SMP, but to obtain the best performance, the programmer will need to be more careful about locality issues and cache effects.
Figure 2.5. An example of the nonuniform memory access (NUMA) architecture
Distributed memory. In a distributed-memory system, each process has its own address space and communicates with other processes by message passing (sending and receiving messages). A schematic representation of a distributed-memory computer is shown in Fig. 2.6.
Figure 2.6. The distributed-memory architecture
Depending on the topology and technology used for the processor interconnection, communication speed can range from almost as fast as shared memory (in tightly integrated supercomputers) to orders of magnitude slower (for example, in a cluster of PCs interconnected with an Ethernet network). The programmer must explicitly program all the communication between processors and be concerned with the distribution of data.
Distributed-memory computers are traditionally divided into two classes: MPP (massively parallel processors) and clusters. In an MPP, the processors and the network infrastructure are tightly coupled and specialized for use in a parallel computer. These systems are extremely scalable, in some cases supporting the use of many thousands of processors in a single system [MSW96, IBM02].

Clusters are distributed-memory systems composed of off-the-shelf computers connected by an off-the-shelf network. When the computers are PCs running the Linux operating system, these clusters are called Beowulf clusters. As off-the-shelf networking technology improves, systems of this type are becoming more common and much more powerful. Clusters provide an inexpensive way for an organization to obtain parallel computing capabilities [Beo]. Preconfigured clusters are now available from many vendors. One frugal group even reported constructing a useful parallel system by using a cluster to harness the combined power of obsolete PCs that otherwise would have been discarded [HHS01].
Hybrid systems. These systems are clusters of nodes with separate address spaces in which each node contains several processors that share memory.

According to van der Steen and Dongarra's "Overview of Recent Supercomputers" [vdSD03], which contains a brief description of the supercomputers currently or soon to be commercially available, hybrid systems formed from clusters of SMPs connected by a fast network are currently the dominant trend in high-performance computing. For example, in late 2003, four of the five fastest computers in the world were hybrid systems [Top].
Grids. Grids are systems that use distributed, heterogeneous resources connected by LANs and/or WANs [FK03]. Often the interconnection network is the Internet. Grids were originally envisioned as a way to link multiple supercomputers to enable larger problems to be solved, and thus could be viewed as a special type of distributed-memory or hybrid MIMD machine. More recently, the idea of grid computing has evolved into a general way to share heterogeneous resources, such as computation servers, storage, application servers, information services, or even scientific instruments. Grids differ from clusters in that the various resources in the grid need not have a common point of administration. In most cases, the resources on a grid are owned by different organizations that maintain control over the policies governing use of the resources. This affects the way these systems are used, the middleware created to manage them, and most importantly for this discussion, the overhead incurred when communicating between resources within the grid.
2.2.3. Summary
We have classified these systems according to the characteristics of the hardware. These characteristics typically influence the native programming model used to express concurrency on a system; however, this is not always the case. It is possible for a programming environment for a shared-memory machine to provide the programmer with the abstraction of distributed memory and message passing. Virtual distributed shared memory systems contain middleware to provide the opposite: the abstraction of shared memory on a distributed-memory machine.
C* in C, CUMULVS, JavaRMI, PRIO, Quake, ABCPL, DAGGER, javaPG, P3L, Quark, ACE, DAPPLE, JAVAR, P4-Linda, QuickThreads, ACT++, Data Parallel C, JavaSpaces, Pablo, Sage++, ADDAP, DC++, JIDL, PADE, SAM, Adl, DCE++, Joyce, PADRE, SCANDAL, Adsmith, DDD, Karma, Panda, SCHEDULE, AFAPI, DICE, Khoros, Papers, SciTL, ALWAN, DIPC, KOAN/Fortran-S, Para++, SDDA, AM, Distributed Smalltalk, LAM, Paradigm, SHMEM, AMDC, DOLIB, Legion, Parafrase2, SIMPLE, Amoeba, DOME, Lilac, Paralation, Sina, AppLeS, DOSMOS, Linda, Parallaxis, SISAL, ARTS, DRL, LiPS, Parallel Haskell, SMI, Athapascan-0b, DSM-Threads, Locust, Parallel C++, SONiC, Aurora, Ease, Lparx, ParC, Split-C, Automap, ECO, Lucid, ParLib++, SR, bb_threads, Eilean, Maisie, ParLin, Sthreads, Blaze, Emerald, Manifold, Parlog, Strand, BlockComm, EPL, Mentat, Parmacs, SUIF, BSP, Excalibur, MetaChaos, Parti, SuperPascal, C*, Express, Midway, pC, Synergy, C**, Falcon, Millipede, pC++, TCGMSG, C4, Filaments, Mirage, PCN, Telegraphos, CarlOS, FLASH, Modula-2*, PCP:, The FORCE, Cashmere, FM, Modula-P, PCU, Threads.h++, CC++, Fork, MOSIX, PEACE, TRAPPER, Charlotte, Fortran M, MpC, PENNY, TreadMarks, Charm, FX, MPC++, PET, UC, Charm++, GA, MPI, PETSc, uC++, Chu, GAMMA, Multipol, PH, UNITY, Cid, Glenda, Munin, Phosphorus, Cilk, GLU, NanoThreads, POET, Vic*, CM Fortran, GUARD, NESL, Polaris, Visifold V-NUS, Code, HAsL, NetClasses++, POOL-T, VPE, Concurrent ML, HORUS, Nexus, POOMA, Win32 threads, Converse, HPC, Nimrod, POSYBL, WinPar, COOL, HPC++, NOW, PRESTO, WWWinda, CORRELATE, HPF, Objective Linda, Prospero, XENOOPS, CparPar, IMPACT, Occam, Proteus, XPC, CPS, ISETL-Linda, Omega, PSDM, Zounds, CRL, ISIS, OOF90, PSI, ZPL, CSP, JADA, Orca, PVM, Cthreads, JADE, P++, QPC++
Fortunately, by the late 1990s, the parallel programming community converged predominantly on two environments for parallel programming: OpenMP [OMP] for shared memory and MPI [Mesb] for message passing.
OpenMP is a set of language extensions implemented as compiler directives. Implementations are currently available for Fortran, C, and C++. OpenMP is frequently used to incrementally add parallelism to sequential code. By adding a compiler directive around a loop, for example, the compiler can be instructed to generate code to execute the iterations of the loop in parallel. The compiler takes care of most of the details of thread creation and management. OpenMP programs tend to work very well on SMPs, but because its underlying programming model does not include a notion of nonuniform memory access times, it is less ideal for ccNUMA and distributed-memory machines.
MPI is a set of library routines that provide for process management, message passing, and some collective communication operations (these are operations that involve all the processes involved in a program, such as barrier, broadcast, and reduction). MPI programs can be difficult to write because the programmer is responsible for data distribution and explicit interprocess communication using messages. Because the programming model assumes distributed memory, MPI is a good choice for MPPs and other distributed-memory machines.
Neither OpenMP nor MPI is an ideal fit for hybrid architectures that combine multiprocessor nodes, each with multiple processes and a shared memory, into a larger system with separate address spaces for each node: The OpenMP model does not recognize nonuniform memory access times, so its data allocation can lead to poor performance on machines that are not SMPs, while MPI does not include constructs to manage data structures residing in a shared memory. One solution is a hybrid model in which OpenMP is used on each shared-memory node and MPI is used between the nodes. This works well, but it requires the programmer to work with two different programming models within a single program. Another option is to use MPI on both the shared-memory and distributed-memory portions of the algorithm and give up the advantages of a shared-memory programming model, even when the hardware directly supports it.
New high-level programming environments that simplify portable parallel programming and more accurately reflect the underlying parallel architectures are topics of current research [Cen]. Another approach more popular in the commercial sector is to extend MPI and OpenMP. In the mid-1990s, the MPI Forum defined an extended MPI called MPI 2.0, although implementations are not widely available at the time this was written. It is a large complex extension to MPI that includes dynamic process creation, parallel I/O, and many other features. Of particular interest to programmers of modern hybrid architectures is the inclusion of one-sided communication. One-sided communication mimics some of the features of a shared-memory system by letting one process write into or read from the memory regions of other processes. The term "one-sided" refers to the fact that the read or write is launched by the initiating process without the explicit involvement of the other participating process.

A more sophisticated abstraction of one-sided communication is available as part of the Global Arrays [NHL96, NHK+02, Gloa] package. Global Arrays works together with MPI to help a programmer manage distributed array data. After the programmer defines the array and how it is laid out in memory, the program executes "puts" or "gets" into the array without needing to explicitly manage which MPI process "owns" the particular section of the array. In essence, the global array provides an abstraction of a globally shared array. This only works for arrays, but these are such common data structures in parallel computing that this package, although limited, can be very useful.
Just as MPI has been extended to mimic some of the benefits of a shared-memory environment, OpenMP has been extended to run in distributed-memory environments. The annual WOMPAT (Workshop on OpenMP Applications and Tools) workshops contain many papers discussing various approaches and experiences with OpenMP in clusters and ccNUMA environments.

MPI is implemented as a library of routines to be called from programs written in a sequential programming language, whereas OpenMP is a set of extensions to sequential programming languages. They represent two of the possible categories of parallel programming environments (libraries and language extensions), and these two particular environments account for the overwhelming majority of parallel computing being done today. There is, however, one more category of parallel programming environments, namely languages with built-in features to support parallel programming. Java is such a language. Rather than being designed to support high-performance computing, Java is an object-oriented, general-purpose programming environment with features for explicitly specifying concurrent processing with shared memory. In addition, the standard I/O and network packages provide classes that make it easy for Java to perform interprocess communication between machines, thus making it possible to write programs based on both the shared-memory and the distributed-memory models. The newer java.nio packages support I/O in a way that is less convenient for the programmer, but gives significantly better performance, and Java 2 1.5 includes new support for concurrent programming, most significantly in the java.util.concurrent.* packages. Additional packages that support different approaches to parallel computing are widely available.
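As a small illustration of the java.util.concurrent support mentioned above (our own sketch, not an example from the text), the following uses a fixed thread pool to divide an independent computation among tasks and combine their partial results:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class SquareSum {
    // Sum of squares of 0..n-1, computed by a pool of worker threads.
    static long parallelSquareSum(int n, int nThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        List<Future<Long>> parts = new ArrayList<>();
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            // Each task handles the indices congruent to id modulo nThreads,
            // so the tasks are completely independent of one another.
            parts.add(pool.submit(() -> {
                long s = 0;
                for (int i = id; i < n; i += nThreads) s += (long) i * i;
                return s;
            }));
        }
        long total = 0;
        for (Future<Long> f : parts) total += f.get(); // blocks until each task finishes
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelSquareSum(1000, 4)); // same result as a sequential loop
    }
}
```

Because the result is an exact integer sum, the answer here is independent of scheduling, unlike the floating-point reductions discussed earlier.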
Although there have been other general-purpose languages, both prior to Java and more recent (for example, C#), that contained constructs for specifying concurrency, Java is the first to become widely used. As a result, it may be the first exposure for many programmers to concurrent and parallel programming. Although Java provides software engineering benefits, currently the performance of parallel Java programs cannot compete with OpenMP or MPI programs for typical scientific computing applications. The Java design has also been criticized for several deficiencies that matter in this domain (for example, a floating-point model that emphasizes portability and more-reproducible results over exploiting the available floating-point hardware to the fullest, inefficient handling of arrays, and lack of a lightweight mechanism to handle complex numbers). The performance difference between Java and other alternatives can be expected to decrease, especially for symbolic or other nonnumeric problems, as compiler technology for Java improves and as new packages and language extensions become available. The Titanium project [Tita] is an example of a Java dialect designed for high-performance computing in a ccNUMA environment.
For the purposes of this book, we have chosen OpenMP, MPI, and Java as the three environments we will use in our examples: OpenMP and MPI for their popularity, and Java because it is likely to be many programmers' first exposure to concurrent programming. A brief overview of each can be found in the appendixes.
Load balance and load balancing. To execute a parallel program, the tasks must be mapped to UEs, and the UEs to PEs. How the mappings are done can have a significant impact on the overall performance of a parallel algorithm. It is crucial to avoid the situation in which a subset of the PEs is doing most of the work while others are idle. Load balance refers to how well the work is distributed among PEs. Load balancing is the process of allocating work to PEs, either statically or dynamically, so that the work is distributed as evenly as possible.
Synchronization. In a parallel program, due to the nondeterminism of task scheduling and other factors, events in the computation might not always occur in the same order. For example, in one run, a task might read variable x before another task reads variable y; in the next run with the same input, the events might occur in the opposite order. In many cases, the order in which two events occur does not matter. In other situations, the order does matter, and to ensure that the program is correct, the programmer must introduce synchronization to enforce the necessary ordering constraints. The primitives provided for this purpose in our selected environments are discussed in the Implementation Mechanisms design space (Section 6.3).
Synchronous versus asynchronous. We use these two terms to qualitatively refer to how tightly coupled in time two events are. If two events must happen at the same time, they are synchronous; otherwise they are asynchronous. For example, message passing (that is, communication between UEs by sending and receiving messages) is synchronous if a message sent must be received before the sender can continue. Message passing is asynchronous if the sender can continue its computation regardless of what happens at the receiver, or if the receiver can continue computations while waiting for a receive to complete.
Race conditions. A race condition is a kind of error peculiar to parallel programs. It occurs when the outcome of a program changes as the relative scheduling of UEs varies. Because the operating system and not the programmer controls the scheduling of the UEs, race conditions result in programs that potentially give different answers even when run on the same system with the same data. Race conditions are particularly difficult errors to debug because by their nature they cannot be reliably reproduced. Testing helps, but is not as effective as with sequential programs: A program may run correctly the first thousand times and then fail catastrophically on the thousand-and-first execution, and then run again correctly when the programmer attempts to reproduce the error as the first step in debugging.
Race conditions result from errors in synchronization. If multiple UEs read and write shared variables, the programmer must protect access to these shared variables so the reads and writes occur in a valid order regardless of how the tasks are interleaved. When many variables are shared or when they are accessed through multiple levels of indirection, verifying by inspection that no race conditions exist can be very difficult. Tools are available that help detect and fix race conditions, such as Thread Checker from Intel Corporation, and the problem remains an area of active and important research [NM92].
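A race and its repair can be sketched in Java (our own illustration of the concept, not an example from the text): two threads each increment a shared counter 100,000 times. The plain int field can lose updates because ++ is a read-modify-write sequence that another thread can interleave with, while the AtomicInteger makes each increment indivisible.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RaceDemo {
    static int plain = 0;                           // unprotected shared variable
    static final AtomicInteger safe = new AtomicInteger(0);

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                plain++;                 // read-modify-write: not atomic, racy
                safe.incrementAndGet();  // atomic: no updates can be lost
            }
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        a.join(); b.join();
        // 'plain' may be anything up to 200000 depending on the schedule;
        // 'safe' is always exactly 200000.
        System.out.println(plain <= 200_000);
        System.out.println(safe.get());
    }
}
```

Note that the racy counter may well print 200000 on many runs, which is exactly the debugging difficulty described above: the error cannot be reliably reproduced.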
Deadlocks. Deadlocks are another type of error peculiar to parallel programs. A deadlock occurs when there is a cycle of tasks in which each task is blocked waiting for another to proceed. Because all are waiting for another task to do something, they will all be blocked forever. As a simple example, consider two tasks in a message-passing environment. Task A attempts to receive a message from task B, after which A will reply by sending a message of its own to task B. Meanwhile, task B attempts to receive a message from task A, after which B will send a message to A. Because each task is waiting for the other to send it a message first, both tasks will be blocked forever. Fortunately, deadlocks are not difficult to discover, as the tasks will stop at the point of the deadlock.
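The message-passing cycle described above can be sketched in Java, with blocking queues standing in for send and receive (the names and structure are our own illustration). Here the cycle is broken because task A sends before it receives; if both tasks called take() first, each would block forever waiting for the other.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DeadlockDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> toA = new ArrayBlockingQueue<>(1);
        BlockingQueue<String> toB = new ArrayBlockingQueue<>(1);

        // If BOTH tasks executed take() before put(), each would block
        // waiting for the other: a cycle of tasks, hence a deadlock.
        Thread a = new Thread(() -> {
            try {
                toB.put("hello from A");      // A sends first...
                toA.take();                   // ...then waits for B's reply
            } catch (InterruptedException ignored) { }
        });
        Thread b = new Thread(() -> {
            try {
                String msg = toB.take();      // B receives first...
                toA.put("reply to: " + msg);  // ...then replies: no cycle
            } catch (InterruptedException ignored) { }
        });
        a.start(); b.start();
        a.join(); b.join();
        System.out.println("completed without deadlock");
    }
}
```

Breaking the symmetry (one side sends first) removes the cycle; imposing a global order on blocking operations is a standard way to make deadlock impossible.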
What happens when we run this computation on a parallel computer with multiple PEs? Suppose that the setup and finalization sections cannot be carried out concurrently with any other activities, but that the computation section could be divided into tasks that would run independently on as many PEs as are available, with the same total number of computation steps as in the original computation. The time for the full computation on P PEs can therefore be given by

Equation 2.2
    Ttotal(P) = Tsetup + Tcompute(1)/P + Tfinalization

Of course, Eq. 2.2 describes a very idealized situation. However, the idea that computations have a serial part (for which additional PEs are useless) and a parallelizable part (for which more PEs decrease the running time) is realistic. Thus, this simple model captures an important relationship.
An important measure of how much additional PEs help is the relative speedup S, which describes how much faster a problem runs in a way that normalizes away the actual running time.

Equation 2.3
    S(P) = Ttotal(1) / Ttotal(P)

A related measure is the efficiency E, which is the speedup normalized by the number of PEs.

Equation 2.4
    E(P) = S(P) / P

Equation 2.5
    E(P) = Ttotal(1) / (P · Ttotal(P))
Ideally, we would want the speedup to be equal to P, the number of PEs. This is sometimes called perfect linear speedup. Unfortunately, this is an ideal that can rarely be achieved because times for setup and finalization are not improved by adding more PEs, limiting the speedup. The terms that cannot be run concurrently are called the serial terms. Their running times represent some fraction of the total, called the serial fraction, denoted γ.

Equation 2.6
    γ = (Tsetup + Tfinalization) / Ttotal(1)

The fraction of time spent in the parallelizable part of the program is then (1 − γ). We can thus rewrite the expression for total computation time with P PEs as

Equation 2.7
    Ttotal(P) = (γ + (1 − γ)/P) · Ttotal(1)

Now, rewriting S in terms of the new expression for Ttotal(P), we obtain the famous Amdahl's law:

Equation 2.8
    S(P) = Ttotal(1) / ((γ + (1 − γ)/P) · Ttotal(1))

Equation 2.9
    S(P) = 1 / (γ + (1 − γ)/P)
Thus, in an ideal parallel algorithm with no overhead in the parallel part, the speedup should follow Eq. 2.9. What happens to the speedup if we take our ideal parallel algorithm and use a very large number of processors? Taking the limit as P goes to infinity in our expression for S yields

Equation 2.10
    S(P) → 1/γ as P → ∞

Eq. 2.10 thus gives an upper bound on the speedup obtainable in an algorithm whose serial part represents γ of the total computation.
These concepts are vital to the parallel algorithm designer. In designing a parallel algorithm, it is important to understand the value of the serial fraction so that realistic expectations can be set for performance. It may not make sense to implement a complex, arbitrarily scalable parallel algorithm if 10% or more of the algorithm is serial, and 10% is fairly common.
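As a concrete illustration of Eq. 2.9 (a small sketch assuming a serial fraction of 10%):

```java
public class Amdahl {
    // Speedup from Amdahl's law (Eq. 2.9) for serial fraction gamma and P PEs.
    static double speedup(double gamma, int p) {
        return 1.0 / (gamma + (1.0 - gamma) / p);
    }
    public static void main(String[] args) {
        double gamma = 0.1; // 10% of the computation is serial
        // Round to two decimal places for display.
        System.out.println("P=10: "  + Math.round(speedup(gamma, 10)  * 100) / 100.0);
        System.out.println("P=100: " + Math.round(speedup(gamma, 100) * 100) / 100.0);
        System.out.println("upper bound: " + (1.0 / gamma)); // Eq. 2.10 limit
    }
}
```

With a 10% serial fraction, 10 PEs yield a speedup of only about 5.26, 100 PEs only about 9.17, and no number of PEs can exceed the bound of 10.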
Of course, Amdahl's law is based on assumptions that may or may not be true in practice. In real life, a number of factors may make the actual running time longer than this formula implies. For example, creating additional parallel tasks may increase overhead and the chances of contention for shared resources. On the other hand, if the original serial computation is limited by resources other than the availability of CPU cycles, the actual performance could be much better than Amdahl's law would predict. For example, a large parallel machine may allow bigger problems to be held in memory, thus reducing virtual memory paging, or multiple processors each with its own cache may allow much more of the problem to remain in the cache. Amdahl's law also rests on the assumption that for any given input, the parallel and serial implementations perform exactly the same number of computational steps. If the serial algorithm being used in the formula is not the best possible algorithm for the problem, then a clever parallel algorithm that structures the computation differently can reduce the total number of computational steps.
It has also been observed [Gus88] that the exercise underlying Amdahl's law, namely running exactly the same problem with varying numbers of processors, is artificial in some circumstances. If, say, the parallel application were a weather simulation, then when new processors were added, one would most likely increase the problem size by adding more details to the model while keeping the total execution time constant. If this is the case, then Amdahl's law, or fixed-size speedup, gives a pessimistic view of the benefits of additional processors.
To see this, we can reformulate the equation to give the speedup in terms of performance on a P-processor system. Earlier in Eq. 2.2, we obtained the execution time for P processors, Ttotal(P), from the execution time of the serial terms and the execution time of the parallelizable part when executed on one processor. Here, we do the opposite and obtain Ttotal(1) from the serial and parallel terms when executed on P processors.

Equation 2.11
    Ttotal(1) = Tsetup + P · Tcompute(P) + Tfinalization

Now, we define the so-called scaled serial fraction, denoted γscaled, as

Equation 2.12
    γscaled = (Tsetup + Tfinalization) / Ttotal(P)

and then

Equation 2.13
    Ttotal(1) = (γscaled + P · (1 − γscaled)) · Ttotal(P)

Rewriting the equation for speedup (Eq. 2.3) and simplifying, we obtain the scaled (or fixed-time) speedup.[1]

[1] This equation, sometimes known as Gustafson's law, was attributed in [Gus88] to E. Barsis.

Equation 2.14
    S(P) = γscaled + P · (1 − γscaled)
This gives exactly the same speedup as Amdahl's law, but allows a different question to be asked when the number of processors is increased. Since γscaled depends on P, the result of taking the limit isn't immediately obvious, but would give the same result as the limit in Amdahl's law. However, suppose we take the limit in P while holding Tcompute and thus γscaled constant. The interpretation is that we are increasing the size of the problem so that the total running time remains constant when more processors are added. (This contains the implicit assumption that the execution time of the serial terms does not change as the problem size grows.) In this case, the speedup is linear in P. Thus, while adding more processors to solve a fixed problem may hit the speedup limits of Amdahl's law with a relatively small number of processors, if the problem grows as more processors are added, Amdahl's law will be pessimistic. These two models of speedup, along with a fixed-memory version of speedup, are discussed in [SN90].
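The contrast with fixed-size speedup can be seen in a small calculation of Eq. 2.14 that holds the scaled serial fraction constant as P grows (an illustration assuming γscaled = 0.1):

```java
public class ScaledSpeedup {
    // Fixed-time (Gustafson) speedup, Eq. 2.14.
    static double scaledSpeedup(double gammaScaled, int p) {
        return gammaScaled + p * (1.0 - gammaScaled);
    }
    public static void main(String[] args) {
        double g = 0.1; // serial fraction of the P-processor run, held constant
        for (int p : new int[]{10, 100, 1000}) {
            System.out.println(p + " -> " + scaledSpeedup(g, p));
        }
    }
}
```

Where Amdahl's law capped the same 10% serial fraction at a speedup of 10, the fixed-time speedup grows without bound, linearly in P.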
2.6. COMMUNICATION
2.6.1. Latency and Bandwidth
A simple but useful model characterizes the total time for message transfer as the sum of a fixed cost plus a variable cost that depends on the length of the message.

Equation 2.15
    Tmessage-transfer(N) = α + N/β

The fixed cost α is called latency and is essentially the time it takes to send an empty message over the communication medium, from the time the send routine is called to the time the data is received by the recipient. Latency (given in some appropriate time unit) includes overhead due to software and network hardware plus the time it takes for the message to traverse the communication medium. The bandwidth β (given in some measure of bytes per time unit) is a measure of the capacity of the communication medium. N is the length of the message.
The latency and bandwidth can vary significantly between systems depending on both the hardware used and the quality of the software implementing the communication protocols. Because these values can be measured with fairly simple benchmarks [DD97], it is sometimes worthwhile to measure values for α and β, as these can help guide optimizations to improve communication performance. For example, in a system in which α is relatively large, it might be worthwhile to try to restructure a program that sends many small messages to aggregate the communication into a few large messages instead. Data for several recent systems has been presented in [BBC+03].
Figure 2.7. Communication without (left) and with (right) support for overlapping
communication and computation. Although UE 0 in the computation on the right still
has some idle time waiting for the reply from UE 1, the idle time is reduced and the
computation requires less total time because of UE 1's earlier start.
Another technique used on many parallel computers is to assign multiple UEs to each PE, so that when one UE is waiting for communication, it will be possible to context switch to another UE and keep the processor busy. This is an example of latency hiding. It is increasingly being used on modern high-performance computing systems, the most famous example being the MTA system from Cray Research [ACC+90].
2.7. SUMMARY
This chapter has given a brief overview of some of the concepts and vocabulary used in parallel computing. Additional terms are defined in the glossary. We also discussed the major programming environments in use for parallel computing: OpenMP, MPI, and Java. Throughout the book, we will use these three programming environments for our examples. More details about OpenMP, MPI, and Java and how to use them to write parallel programs are provided in the appendixes.
3.3 THE DATA DECOMPOSITION PATTERN
3.4 THE GROUP TASKS PATTERN
3.5 THE ORDER TASKS PATTERN
3.6 THE DATA SHARING PATTERN
3.7 THE DESIGN EVALUATION PATTERN
3.8 SUMMARY
Experienced designers working in a familiar domain may see the exploitable concurrency immediately and could move directly to the patterns in the Algorithm Structure design space.
3.1.1. Overview
Before starting to work with the patterns in this design space, the algorithm designer must first consider the problem to be solved and make sure the effort to create a parallel program will be justified: Is the problem large enough and the results significant enough to justify expending effort to solve it faster? If so, the next step is to make sure the key features and data elements within the problem are well understood. Finally, the designer needs to understand which parts of the problem are most computationally intensive, because the effort to parallelize the problem should be focused on those parts.
After this analysis is complete, the patterns in the Finding Concurrency design space can be used to start designing a parallel algorithm. The patterns in this design space can be organized into three groups.
Decomposition Patterns. The two decomposition patterns, Task Decomposition and Data Decomposition, are used to decompose the problem into pieces that can execute concurrently.
Dependency Analysis Patterns. This group contains three patterns that help group the tasks and analyze the dependencies among them: Group Tasks, Order Tasks, and Data Sharing. Nominally, the patterns are applied in this order. In practice, however, it is often necessary to work back and forth between them, or possibly even revisit the decomposition patterns.
Design Evaluation Pattern. The final pattern in this space guides the algorithm designer through an analysis of what has been done so far before moving on to the patterns in the Algorithm Structure design space. This pattern is important because it often happens that the best design is not found on the first attempt, and the earlier design flaws are identified, the easier they are to correct. In general, working through the patterns in this space is an iterative process.
3.1.2. Using the Decomposition Patterns
The first step in designing a parallel algorithm is to decompose the problem into elements that can execute concurrently. We can think of this decomposition as occurring in two dimensions.
The task-decomposition dimension views the problem as a stream of instructions that can be broken into sequences called tasks that can execute simultaneously. For the computation to be efficient, the operations that make up the task should be largely independent of the operations taking place inside other tasks.
The data-decomposition dimension focuses on the data required by the tasks and how it can be decomposed into distinct chunks. The computation associated with the data chunks will only be efficient if the data chunks can be operated upon relatively independently.
Viewing the problem decomposition in terms of two distinct dimensions is somewhat artificial. A task decomposition implies a data decomposition and vice versa; hence, the two decompositions are really different facets of the same fundamental decomposition. We divide them into separate dimensions, however, because a problem decomposition usually proceeds most naturally by emphasizing one dimension of the decomposition over the other. By making them distinct, we make this design emphasis explicit and easier for the designer to understand.
3.1.3. Background for Examples
In this section, we give background information on some of the examples that are used in several patterns. It can be skipped for the time being and revisited later when reading a pattern that refers to one of the examples.
Medical imaging
PET (Positron Emission Tomography) scans provide an important diagnostic tool by allowing physicians to observe how a radioactive substance propagates through a patient's body. Unfortunately, the images formed from the distribution of emitted radiation are of low resolution, due in part to the scattering of the radiation as it passes through the body. It is also difficult to reason from the absolute radiation intensities, because different pathways through the body attenuate the radiation differently.
To solve this problem, models of how radiation propagates through the body are used to correct the images. A common approach is to build a Monte Carlo model, as described by Ljungberg and King [LK98]. Randomly selected points within the body are assumed to emit radiation (usually a gamma ray), and the trajectory of each ray is followed. As a particle (ray) passes through the body, it is attenuated by the different organs it traverses, continuing until the particle leaves the body and hits a camera model, thereby defining a full trajectory. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.
This problem can be parallelized in two ways. Because each trajectory is independent, it is possible to parallelize the application by associating each trajectory with a task. This approach is discussed in the Examples section of the Task Decomposition pattern. Another approach would be to partition the body into sections and assign different sections to different processing elements. This approach is discussed in the Examples section of the Data Decomposition pattern.
Linear algebra
Linear algebra is an important tool in applied mathematics: It provides the machinery required to analyze solutions of large systems of linear equations. The classic linear algebra problem asks, for matrix A and vector b, what values for x will solve the equation

Equation 3.1
    Ax = b

The matrix A in Eq. 3.1 takes on a central role in linear algebra. Many problems are expressed in terms of transformations of this matrix. These transformations are applied by means of a matrix multiplication

Equation 3.2
    C = TA

If T, A, and C are square matrices of order N, matrix multiplication is defined such that each element of the resulting matrix C is

Equation 3.3
    C(i,j) = Σk T(i,k) · A(k,j)

where the subscripts denote particular elements of the matrices. In other words, the element of the product matrix C in row i and column j is the dot product of the ith row of T and the jth column of A. Hence, computing each of the N^2 elements of C requires N multiplications and N − 1 additions, making the overall complexity of matrix multiplication O(N^3).
There are many ways to parallelize a matrix multiplication operation. It can be parallelized using either a task-based decomposition (as discussed in the Examples section of the Task Decomposition pattern) or a data-based decomposition (as discussed in the Examples section of the Data Decomposition pattern).
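The definition in Eq. 3.3 translates directly into a triple loop. The sequential sketch below is our own illustration of that definition:

```java
public class MatMul {
    // C = T * A for square matrices of order n, per Eq. 3.3:
    // C[i][j] is the dot product of row i of T and column j of A.
    static double[][] multiply(double[][] t, double[][] a) {
        int n = t.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;          // N multiplies, N-1 adds per element
                for (int k = 0; k < n; k++)
                    sum += t[i][k] * a[k][j];
                c[i][j] = sum;
            }
        return c;                          // N^2 elements: O(N^3) overall
    }
    public static void main(String[] args) {
        double[][] t  = {{1, 2}, {3, 4}};
        double[][] id = {{1, 0}, {0, 1}};  // identity: T * I should equal T
        double[][] c  = multiply(t, id);
        System.out.println(c[0][0] + " " + c[0][1] + " " + c[1][0] + " " + c[1][1]);
    }
}
```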
Molecular dynamics
Molecular dynamics is used to simulate the motions of a large molecular system. For example, molecular dynamics simulations show how a large protein moves around and how differently shaped drugs might interact with the protein. Not surprisingly, molecular dynamics is extremely important in the pharmaceutical industry. It is also a useful test problem for computer scientists working on parallel computing: It is straightforward to understand, relevant to science at large, and difficult to parallelize effectively. As a result, it has been the subject of much research [Mat94, PH95, Pli95].
The basic idea is to treat a molecule as a large collection of balls connected by springs. The balls represent the atoms in the molecule, while the springs represent the chemical bonds between the atoms. The molecular dynamics simulation itself is an explicit time-stepping process. At each time step, the force on each atom is computed and then standard classical mechanics techniques are used to compute how the force moves the atoms. This process is carried out repeatedly to step through time and compute a trajectory for the molecular system.
The forces due to the chemical bonds (the "springs") are relatively simple to compute. These correspond to the vibrations and rotations of the chemical bonds themselves. These are short-range forces that can be computed with knowledge of the handful of atoms that share chemical bonds. The major difficulty arises because the atoms have partial electrical charges. Hence, while atoms only interact with a small neighborhood of atoms through their chemical bonds, the electrical charges cause every atom to apply a force on every other atom.
This is the famous N-body problem. On the order of N^2 terms must be computed to find these nonbonded forces. Because N is large (tens or hundreds of thousands) and the number of time steps in a simulation is huge (tens of thousands), the time required to compute these nonbonded forces dominates the computation. Several ways have been proposed to reduce the effort required to solve the N-body problem. We are only going to discuss the simplest one: the cutoff method.
The idea is simple. Even though each atom exerts a force on every other atom, this force decreases with the square of the distance between the atoms. Hence, it should be possible to pick a distance beyond which the force contribution is so small that it can be ignored. By ignoring the atoms that exceed this cutoff, the problem is reduced to one that scales as O(N × n), where n is the number of atoms within the cutoff volume, usually hundreds. The computation is still huge, and it dominates the overall runtime for the simulation, but at least the problem is tractable.
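The neighbor-list construction at the heart of the cutoff method can be sketched as follows (our own illustration, not code from any of the simulation programs discussed here): for each atom, every other atom is checked against the cutoff distance.

```java
import java.util.ArrayList;
import java.util.List;

public class NeighborList {
    // O(N^2) neighbor-list construction: for each atom, record every other
    // atom within the cutoff distance.
    static List<List<Integer>> neighborList(double[][] pos, double cutoff) {
        int n = pos.length;
        double cut2 = cutoff * cutoff;       // compare squared distances: no sqrt
        List<List<Integer>> neighbors = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            List<Integer> list = new ArrayList<>();
            for (int j = 0; j < n; j++) {
                if (i == j) continue;
                double dx = pos[i][0] - pos[j][0];
                double dy = pos[i][1] - pos[j][1];
                double dz = pos[i][2] - pos[j][2];
                if (dx * dx + dy * dy + dz * dz <= cut2) list.add(j);
            }
            neighbors.add(list);
        }
        return neighbors;
    }
    public static void main(String[] args) {
        double[][] pos = {{0, 0, 0}, {1, 0, 0}, {5, 0, 0}};
        // Atoms 0 and 1 are within the cutoff of each other; atom 2 is isolated.
        System.out.println(neighborList(pos, 2.0));
    }
}
```

Once built, the list lets the force loop visit only n neighbors per atom instead of all N atoms, which is exactly where the O(N × n) scaling comes from.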
There are a host of details, but the basic simulation can be summarized as in Fig. 3.2.
The primary data structures hold the atomic positions (atoms), the velocities of each atom (velocity), the forces exerted on each atom (forces), and lists of atoms within the cutoff distance of each atom (neighbors). The program itself is a time-stepping loop, in which each iteration computes the short-range force terms, updates the neighbor lists, and then finds the nonbonded forces. After the force on each atom has been computed, a simple ordinary differential equation is solved to update the positions and velocities. Physical properties based on atomic motions are then updated, and we go to the next time step.
There are many ways to parallelize the molecular dynamics problem. We consider the most common approach, starting with the task decomposition (discussed in the Task Decomposition pattern) and following with the associated data decomposition (discussed in the Data Decomposition pattern). This example shows how the two decompositions fit together to guide the design of the parallel algorithm.
Array of Real :: atoms(3, N)      // atomic positions
Array of Real :: velocities(3, N) // velocity of each atom
Array of Real :: forces(3, N)     // force on each atom
Array of List :: neighbors(N)     // atoms within cutoff distance
Forces
The main forces influencing the design at this point are flexibility, efficiency, and simplicity.
Flexibility. Flexibility in the design will allow it to be adapted to different implementation requirements. For example, it is usually not a good idea to narrow the options to a single computer system or style of programming at this stage of the design.
Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization). For a task decomposition, this means we need enough tasks to keep all the PEs busy, with enough work per task to compensate for overhead incurred to manage dependencies. However, the drive for efficiency can lead to complex decompositions that lack flexibility.
Simplicity. The task decomposition needs to be complex enough to get the job done, but simple enough to let the program be debugged and maintained with reasonable effort.
Solution
The key to an effective task decomposition is to ensure that the tasks are sufficiently independent so that managing dependencies takes only a small fraction of the program's overall execution time. It is also important to ensure that the execution of the tasks can be evenly distributed among the ensemble of PEs (the load-balancing problem).
In an ideal world, the compiler would find the tasks for the programmer. Unfortunately, this almost never happens. Instead, it must usually be done by hand based on knowledge of the problem and the code required to solve it. In some cases, it might be necessary to completely recast the problem into a form that exposes relatively independent tasks.
In a task-based decomposition, we look at the problem as a collection of distinct tasks, paying particular attention to
The actions that are carried out to solve the problem. (Are there enough of them to keep the processing elements on the target machines busy?)
Whether these actions are distinct and relatively independent.
As a first pass, we try to identify as many tasks as possible; it is much easier to start with too many tasks and merge them later on than to start with too few tasks and later try to split them.
Tasks can be found in many different places.
In some cases, each task corresponds to a distinct call to a function. Defining a task for each function call leads to what is sometimes called a functional decomposition.
Another place to find tasks is in distinct iterations of the loops within an algorithm. If the iterations are independent and there are enough of them, then it might work well to base a task decomposition on mapping each iteration onto a task. This style of task-based decomposition leads to what are sometimes called loop-splitting algorithms.
Tasks also play a key role in data-driven decompositions. In this case, a large data structure is decomposed and multiple units of execution concurrently update different chunks of the data structure. In this case, the tasks are those updates on individual chunks.
Also keep in mind the forces given in the Forces section:
Flexibility. The design needs to be flexible in the number of tasks generated. Usually this is done by parameterizing the number and size of tasks on some appropriate dimension. This will let the design be adapted to a wide range of parallel computers with different numbers of processors.
Efficiency. There are two major efficiency issues to consider in the task decomposition. First, each task must include enough work to compensate for the overhead incurred by creating the tasks and managing their dependencies. Second, the number of tasks should be large enough so that all the units of execution are busy with useful work throughout the computation.
Simplicity. Tasks should be defined in a way that makes debugging and maintenance simple. When possible, tasks should be defined so they reuse code from existing sequential programs that solve related problems.
After the tasks have been identified, the next step is to look at the data decomposition implied by the tasks. The Data Decomposition pattern may help with this analysis.
Examples
Medical imaging
Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands, if not millions, of trajectories are followed.
It is natural to associate a task with each trajectory. These tasks are particularly simple to manage concurrently because they are completely independent. Furthermore, there are large numbers of trajectories, so there will be many tasks, making this decomposition suitable for a large range of computer systems, from a shared-memory system with a small number of processing elements to a large cluster with hundreds of processing elements.
With the basic tasks defined, we now consider the corresponding data decomposition, that is, we define the data associated with each task. Each task needs to hold the information defining the trajectory. But that is not all: The tasks need access to the model of the body as well. Although it might not be apparent from our description of the problem, the body model can be extremely large. Because it is a read-only model, this is no problem if there is an effective shared-memory system; each task can read data as needed. If the target platform is based on a distributed-memory architecture, however, the body model will need to be replicated on each PE. This can be very time-consuming and can waste a great deal of memory. For systems with small memories per PE and/or with slow networks between PEs, a decomposition of the problem based on the body model might be more effective.
This is a common situation in parallel programming: Many problems can be decomposed primarily in terms of data or primarily in terms of tasks. If a task-based decomposition avoids the need to break up and distribute complex data structures, it will be a much simpler program to write and debug. On the other hand, if memory and/or network bandwidth is a limiting factor, a decomposition that focuses on the data might be more effective. It is not so much a matter of one approach being "better" than another as a matter of balancing the needs of the machine with the needs of the programmer. We discuss this in more detail in the Data Decomposition pattern.
Matrix multiplication
Consider the multiplication of two matrices (C = AB), as described in Sec. 3.1.3. We can produce a task-based decomposition of this problem by considering the calculation of each element of the product matrix as a separate task. Each task needs access to one row of A and one column of B. This decomposition has the advantage that all the tasks are independent, and because all the data that is shared among tasks (A and B) is read-only, it will be straightforward to implement in a shared-memory environment.
The performance of this algorithm, however, would be poor. Consider the case where the three matrices are square and of order N. For each element of C, N elements from A and N elements from B would be required, resulting in 2N memory references for N multiply/add operations. Memory access time is slow compared to floating-point arithmetic, so the bandwidth of the memory subsystem would limit the performance.
A better approach would be to design an algorithm that maximizes reuse of data loaded into a processor's caches. We can arrive at this algorithm in two different ways. First, we could group together the elementwise tasks we defined earlier so the tasks that use similar elements of the A and B matrices run on the same UE (see the Group Tasks pattern). Alternatively, we could start with the data decomposition and design the algorithm from the beginning around the way the matrices fit into the caches. We discuss this example further in the Examples section of the Data Decomposition pattern.
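The cache-reuse idea can be sketched with loop blocking (an illustrative sequential version of the technique; in practice the block size would be tuned to the cache):

```java
public class BlockedMatMul {
    // Block the i/k/j loops so that sub-blocks of T, A, and C are reused
    // while they are resident in cache (a sketch of the idea, not tuned code).
    static double[][] multiply(double[][] t, double[][] a, int blockSize) {
        int n = t.length;
        double[][] c = new double[n][n];
        for (int ii = 0; ii < n; ii += blockSize)
            for (int kk = 0; kk < n; kk += blockSize)
                for (int jj = 0; jj < n; jj += blockSize)
                    // Multiply one pair of blocks; each block's elements are
                    // used blockSize times per load instead of once.
                    for (int i = ii; i < Math.min(ii + blockSize, n); i++)
                        for (int k = kk; k < Math.min(kk + blockSize, n); k++)
                            for (int j = jj; j < Math.min(jj + blockSize, n); j++)
                                c[i][j] += t[i][k] * a[k][j];
        return c;
    }
    public static void main(String[] args) {
        double[][] t  = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        double[][] id = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}}; // T * I == T
        double[][] c  = multiply(t, id, 2);
        System.out.println(c[1][1] + " " + c[2][0]);
    }
}
```

The blocked version performs exactly the same arithmetic as the naive triple loop; only the order of the operations changes, which is what makes it a grouping of the original elementwise tasks rather than a different algorithm.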
Molecular dynamics
Consider the molecular dynamics problem described in Sec. 3.1.3. Pseudocode for this example is shown again in Fig. 3.3.
Before performing the task decomposition, we need to better understand some details of the problem. First, the neighbor_list() computation is time-consuming. The gist of the computation is a loop over each atom, inside of which every other atom is checked to determine whether it falls within the indicated cutoff volume. Fortunately, the time steps are very small, and the atoms don't move very much in any given time step. Hence, this time-consuming computation is only carried out every 10 to 100 steps.
Figure 3.3. Pseudocode for the molecular dynamics example
Int const N // number of atoms
Array of Real :: atoms(3, N)      // atomic positions
Array of Real :: velocities(3, N) // velocity of each atom
Array of Real :: forces(3, N)     // force on each atom
Array of List :: neighbors(N)     // atoms within cutoff distance
Second, the physical_properties() function computes energies, correlation coefficients, and a host of interesting physical properties. These computations, however, are simple and do not significantly affect the program's overall runtime, so we will ignore them in this discussion.
Because the bulk of the computation time will be in non_bonded_forces(), we must pick a problem decomposition that makes that computation run efficiently in parallel. The problem is made easier by the fact that each of the functions inside the time loop has a similar structure: In the sequential version, each function includes a loop over atoms to compute contributions to the force vector. Thus, a natural task definition is the update required by each atom, which corresponds to a loop iteration in the sequential version. After performing the task decomposition, therefore, we obtain the following tasks.
Tasks that find the vibrational forces on an atom
Tasks that find the rotational forces on an atom
Tasks that find the nonbonded forces on an atom
Tasks that update the position and velocity of an atom
A task to update the neighbor list for all the atoms (which we will leave sequential)
With our collection of tasks in hand, we can consider the accompanying data decomposition. The key data structures are the neighbor list, the atomic coordinates, the atomic velocities, and the force vector. Every iteration that updates the force vector needs the coordinates of a neighborhood of atoms. The computation of nonbonded forces, however, potentially needs the coordinates of all the atoms, because the molecule being simulated might fold back on itself in unpredictable ways. We will use this information to carry out the data decomposition (in the Data Decomposition pattern) and the data-sharing analysis (in the Data Sharing pattern).
Known uses
Task-based decompositions are extremely common in parallel computing. For example, the distance geometry code DGEOM [Mat96] uses a task-based decomposition, as does the parallel WESDYN molecular dynamics program [MR95].
Problem
How can a problem's data be decomposed into units that can be operated on relatively independently?
Context
The parallel algorithm designer must have a detailed understanding of the problem being solved. In addition, the designer should identify the most computationally intensive parts of the problem, the key data structures required to solve the problem, and how data is used as the problem's solution unfolds.
After the basic problem is understood, the parallel algorithm designer should consider the tasks that make up the problem and the data decomposition implied by the tasks. Both the task and data decompositions need to be addressed to create a parallel algorithm. The question is not which decomposition to do. The question is which one to start with. A data-based decomposition is a good starting point if the following is true.
The most computationally intensive part of the problem is organized around the manipulation of a large data structure.
Similar operations are being applied to different parts of the data structure, in such a way that the different parts can be operated on relatively independently.
For example, many linear algebra problems update large matrices, applying a similar set of operations to each element of the matrix. In these cases, it is straightforward to drive the parallel algorithm design by looking at how the matrix can be broken up into blocks that are updated concurrently. The task definitions then follow from how the blocks are defined and mapped onto the processing elements of the parallel computer.
Forces
The main forces influencing the design at this point are flexibility, efficiency, and simplicity.
Flexibility. Flexibility will allow the design to be adapted to different implementation requirements. For example, it is usually not a good idea to narrow the options to a single computer system or style of programming at this stage of the design.
Efficiency. A parallel program is only useful if it scales efficiently with the size of the parallel computer (in terms of reduced runtime and/or memory utilization).
Simplicity. The decomposition needs to be complex enough to get the job done, but simple enough to let the program be debugged and maintained with reasonable effort.
Solution
In shared-memory programming environments such as OpenMP, the data decomposition will frequently be implied by the task decomposition. In most cases, however, the decomposition will need to be done by hand, because the memory is physically distributed, because data dependencies are too complex without explicitly decomposing the data, or to achieve acceptable efficiency on a NUMA computer.

If a task-based decomposition has already been done, the data decomposition is driven by the needs of each task. If well-defined and distinct data can be associated with each task, the decomposition should be simple.
When starting with a data decomposition, however, we need to look not at the tasks, but at the central data structures defining the problem, and consider whether they can be broken down into chunks that can be operated on concurrently. A few common examples include the following.
- Array-based computations. Concurrency can be defined in terms of updates of different segments of the array. If the array is multidimensional, it can be decomposed in a variety of ways (rows, columns, or blocks of varying shapes).
- Recursive data structures. We can think of, for example, decomposing the parallel update of a large tree data structure by decomposing the data structure into subtrees that can be updated concurrently.
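As a sketch of the array-based case, the following illustrative code (ours, not from the text) splits a two-dimensional array into rectangular blocks whose dimensions act as granularity knobs; each block could then be updated by a separate task.

```python
# Illustrative sketch: enumerate the rectangular blocks of an n_rows x n_cols
# array. The block dimensions br and bc are the "granularity knobs" that can
# be tuned to match the hardware.

def blocks(n_rows, n_cols, br, bc):
    """Yield ((row start, row end), (col start, col end)) for each block."""
    for r0 in range(0, n_rows, br):
        for c0 in range(0, n_cols, bc):
            yield (r0, min(r0 + br, n_rows)), (c0, min(c0 + bc, n_cols))

chunks = list(blocks(8, 8, 4, 2))
print(len(chunks))  # 2 row bands x 4 column bands

# Every array element lands in exactly one block.
covered = sum((r1 - r0) * (c1 - c0) for (r0, r1), (c0, c1) in chunks)
assert covered == 8 * 8
```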
Regardless of the nature of the underlying data structure, if the data decomposition is the primary factor driving the solution to the problem, it serves as the organizing principle of the parallel algorithm.

When considering how to decompose the problem's data structures, keep in mind the competing forces.

- Flexibility. The size and number of data chunks should be flexible to support the widest range of parallel systems. One approach is to define chunks whose size and number are controlled by a small number of parameters. These parameters define granularity knobs that can be varied to modify the size of the data chunks to match the needs of the underlying hardware. (Note, however, that many designs are not infinitely adaptable with respect to granularity.)
The easiest place to see the impact of granularity on the data decomposition is in the overhead required to manage dependencies between chunks. The time required to manage dependencies must be small compared to the overall runtime. In a good data decomposition, the dependencies scale at a lower dimension than the computational effort associated with each chunk. For example, in many finite difference programs, the cells at the boundaries between chunks, that is, the surfaces of the chunks, must be shared. The size of the set of dependent cells scales as the surface area, while the effort required in the computation scales as the volume of the chunk. This means that the computational effort can be scaled (based on the chunk's volume) to offset overheads associated with data dependencies (based on the surface area of the chunk).
- Efficiency. It is important that the data chunks be large enough that the amount of work to update the chunk offsets the overhead of managing dependencies. A more subtle issue to consider is how the chunks map onto UEs. An effective parallel algorithm must balance the load between UEs. If this isn't done well, some PEs might have a disproportionate amount of work, and the overall scalability will suffer. This may require clever ways to break up the problem. For example, if the problem clears the columns in a matrix from left to right, a column mapping of the matrix will cause problems, as the UEs with the leftmost columns will finish their work before the others. A row-based block decomposition, or even a block-cyclic decomposition (in which rows are assigned cyclically to PEs), would do a much better job of keeping all the processors fully occupied. These issues are discussed in more detail in the Distributed Array pattern.
- Simplicity. Overly complex data decompositions can be very difficult to debug. A data decomposition will usually require a mapping of a global index space onto a task-local index space. Making this mapping abstract allows it to be easily isolated and tested.
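As a sketch of that last piece of advice, the following illustrative code isolates a global-to-local index mapping behind two small functions so the mapping can be tested on its own (the chunk size here is an assumed granularity knob, not a value from the text):

```python
# Illustrative sketch: the global index space [0, n) is split into fixed-size
# chunks, and the mapping between global and (chunk, local) indices is kept
# behind two tiny functions that can be tested in isolation.

CHUNK = 100  # granularity knob

def to_local(global_i):
    """Global index -> (chunk id, local index within the chunk)."""
    return divmod(global_i, CHUNK)

def to_global(chunk_id, local_i):
    """(chunk id, local index) -> global index."""
    return chunk_id * CHUNK + local_i

# The mapping is easy to verify on its own: the two functions must invert
# each other for every index.
for g in (0, 99, 100, 12345):
    assert to_global(*to_local(g)) == g
print("mapping round-trips correctly")
```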
After the data has been decomposed, if it has not already been done, the next step is to look at the task decomposition implied by the data. The Task Decomposition pattern may help with this analysis.
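To make the earlier load-balance discussion concrete, this sketch (ours, with an assumed per-row cost model) compares a block mapping of matrix rows to PEs with a cyclic mapping when the work per row varies, as in the column-clearing example under Efficiency:

```python
# Illustrative sketch: assign matrix rows to PEs either in contiguous blocks
# or cyclically. When the cost per row grows as the computation sweeps across
# the matrix (modeled here as cost r + 1 for row r), the cyclic assignment
# deals cheap and expensive rows evenly across the PEs.

def block_map(n_rows, n_pes):
    """Row -> PE, contiguous blocks of rows."""
    block = (n_rows + n_pes - 1) // n_pes
    return [r // block for r in range(n_rows)]

def cyclic_map(n_rows, n_pes):
    """Row -> PE, rows dealt out round-robin."""
    return [r % n_pes for r in range(n_rows)]

def imbalance(mapping, n_pes):
    """Difference between the most- and least-loaded PE under the cost model."""
    totals = [0] * n_pes
    for r, pe in enumerate(mapping):
        totals[pe] += r + 1
    return max(totals) - min(totals)

n_rows, n_pes = 16, 4
print("block imbalance: ", imbalance(block_map(n_rows, n_pes), n_pes))
print("cyclic imbalance:", imbalance(cyclic_map(n_rows, n_pes), n_pes))
```

Under this cost model the cyclic mapping leaves the PEs far closer to equally loaded than the block mapping.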
Examples
Medical imaging
Consider the medical imaging problem described in Sec. 3.1.3. In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands if not millions of trajectories are followed.

In a data-based decomposition of this problem, the body model is the large central data structure around which the computation can be organized. The model is broken into segments, and one or more segments are associated with each processing element. The body segments are only read, not written, during the trajectory computations, so there are no data dependencies created by the decomposition of the body model.
After the data has been decomposed, we need to look at the tasks associated with each data segment. In this case, each trajectory passing through the data segment defines a task. The trajectories are initiated and propagated within a segment. When a segment boundary is encountered, the trajectory must be passed between segments. It is this transfer that defines the dependencies between data chunks.
On the other hand, in a task-based approach to this problem (as discussed in the Task Decomposition pattern), the trajectories for each particle drive the algorithm design. Each PE potentially needs to access the full body model to service its set of trajectories. In a shared-memory environment, this is easy because the body model is a read-only data set. In a distributed-memory environment, however, this would require substantial startup overhead as the body model is broadcast across the system.

This is a common situation in parallel programming: Different points of view lead to different algorithms with potentially very different performance characteristics. The task-based algorithm is simple, but it only works if each processing element has access to a large memory and if the overhead incurred loading the data into memory is insignificant compared to the program's runtime. An algorithm driven by a data decomposition, on the other hand, makes efficient use of memory and (in distributed-memory environments) less use of network bandwidth, but it incurs more communication overhead during the concurrent part of the computation and is significantly more complex. Choosing which is the appropriate approach can be difficult and is discussed further in the Design Evaluation pattern.
Matrix multiplication
Consider the standard multiplication of two matrices (C = AB), as described in Sec. 3.1.3. Several data-based decompositions are possible for this problem. A straightforward one would be to decompose the product matrix C into a set of row blocks (sets of adjacent rows). From the definition of matrix multiplication, computing the elements of a row block of C requires the corresponding row block of A, but the full B matrix. With such a data decomposition, the basic task in the algorithm becomes the computation of the elements in a row block of C.
An even more effective approach, which does not require the replication of the full B matrix, is to decompose all three matrices into submatrices or blocks. The basic task then becomes the update of a C block, with the A and B blocks being cycled among the tasks as the computation proceeds. This decomposition, however, is much more complex to program; communication and computation must be carefully coordinated during the most time-critical portions of the problem. We discuss this example further in the Geometric Decomposition and Distributed Array patterns.
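A minimal serial sketch of the block decomposition (ours, not the book's code) shows the structure: each (i, j) block of C is updated by accumulating products of the corresponding A and B blocks; in a parallel version, the updates for different blocks would run on different UEs.

```python
# Illustrative sketch: multiply n x n matrices using b x b blocks. The three
# outer loops walk over blocks; the inner loops do the update of one C block
# from one (A block, B block) pair. Each (i0, j0) block update is the "basic
# task" described in the text.

def matmul_blocked(A, B, n, b):
    """Multiply n x n matrices (lists of lists) using b x b blocks."""
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, b):            # row block of C
        for j0 in range(0, n, b):        # column block of C
            for k0 in range(0, n, b):    # paired blocks of A and B
                for i in range(i0, i0 + b):
                    for j in range(j0, j0 + b):
                        s = 0
                        for k in range(k0, k0 + b):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
    return C

n, b = 4, 2
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j for j in range(n)] for i in range(n)]
ref = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
       for i in range(n)]
assert matmul_blocked(A, B, n, b) == ref
print("blocked result matches the direct definition")
```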
One of the features of the matrix multiplication problem is that the ratio of floating-point operations (O(N&#179;)) to distinct data elements (O(N&#178;)) is large. This implies that it is especially important to take into account the memory access patterns to maximize reuse of data from the cache. The most effective approach is to use the block (submatrix) decomposition and adjust the size of the blocks so the problems fit into cache. We could arrive at the same algorithm by carefully grouping together the element-wise tasks that were identified in the Examples section of the Task Decomposition pattern, but starting with a data decomposition and assigning a task to update each submatrix seems easier to understand.
Molecular dynamics
Consider the molecular dynamics problem described in Sec. 3.1.3 and in the Examples section of the Task Decomposition pattern. This problem naturally breaks down into a task decomposition, with a task being an iteration of the loop over atoms in each of the force computation routines.

Summarizing our problem and its task decomposition, we have the following:

- Tasks that find the vibrational forces on an atom
- Tasks that find the rotational forces on an atom
- Tasks that find the nonbonded forces on an atom
- Tasks that update the position and velocity of an atom
- A task to update the neighbor list for all the atoms (which we will leave sequential)
The key data structures are

- An array of atom coordinates, one element per atom
- An array of atom velocities, one element per atom
- An array of lists, one per atom, each defining the neighborhood of atoms within the cutoff distance of the atom
- An array of forces on atoms, one element per atom
An element of the velocity array is used only by the task owning the corresponding atom. This data does not need to be shared and can remain local to the task. Every task, however, needs access to the full array of coordinates. Thus, it will make sense to replicate this data in a distributed-memory environment or share it among UEs in a shared-memory environment.

More interesting is the array of forces. From Newton's third law, the force from atom i on atom j is the negative of the force from atom j on atom i. We can exploit this symmetry to cut the amount of computation in half as we accumulate the force terms. The values in the force array are not used in the computation until the last steps, in which the coordinates and velocities are updated. Therefore, the approach used is to initialize the entire force array on each PE and have the tasks accumulate partial sums of the force terms into this array. After all the partial force terms have been computed, we sum all the PEs' arrays together to produce the final force array. We discuss this further in the Data Sharing pattern.
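The accumulation scheme can be sketched as follows (illustrative code; pair_force stands in for the real physics, and the pair distribution across UEs is an assumption of ours):

```python
# Illustrative sketch: each UE accumulates partial force sums into its own
# local copy of the force array, exploiting Newton's third law so each pair
# (i, j) is computed once and applied with opposite signs to both atoms.
# The local copies are then reduced element-wise into the final force array.

def pair_force(i, j):
    """Hypothetical pairwise force term; a stand-in for the real physics."""
    return (i + 1) * (j + 1)

n_atoms, n_ues = 6, 2
pairs = [(i, j) for i in range(n_atoms) for j in range(i + 1, n_atoms)]

# Each UE gets a share of the pairs and its own zero-initialized force array.
local = [[0.0] * n_atoms for _ in range(n_ues)]
for idx, (i, j) in enumerate(pairs):
    ue = idx % n_ues
    f = pair_force(i, j)
    local[ue][i] += f   # force on atom i from atom j
    local[ue][j] -= f   # Newton's third law: equal and opposite

# Reduction: sum the per-UE arrays into the final force array.
forces = [sum(local[ue][a] for ue in range(n_ues)) for a in range(n_atoms)]

# Internal forces must sum to zero, however the pairs were distributed.
assert abs(sum(forces)) < 1e-9
print(forces)
```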
Known uses
Data decompositions are very common in parallel scientific computing. The parallel linear algebra library ScaLAPACK [Sca, BCC+97] uses block-based decompositions. The PLAPACK environment [vdG97] for dense linear algebra problems uses a slightly different approach to data decomposition. If, for example, an equation of the form y = Ax appears, instead of first partitioning matrix A, the vectors y and x are partitioned in a natural way, and then the induced partition on A is determined. The authors report better performance and easier implementation with this approach.

The data decomposition used in our molecular dynamics example is described by Mattson and Ravishanker [MR95]. More sophisticated data decompositions for this problem that scale better for large numbers of nodes are discussed by Plimpton and Hendrickson [PH95, Pli95].
waiting on one group to finish filling a file before another group can begin reading it), we can satisfy that constraint once for the whole group. If a group of tasks must work together on a shared data structure, the required synchronization can be worked out once for the whole group. If a set of tasks are independent, combining them into a single group and scheduling them for execution as a single large group can simplify the design and increase the available concurrency (thereby letting the solution scale to more PEs).

In each case, the idea is to define groups of tasks that share constraints and simplify the problem of managing constraints by dealing with groups rather than individual tasks.
Solution
Constraints among tasks fall into a few major categories.

- The easiest dependency to understand is a temporal dependency, that is, a constraint on the order in which a collection of tasks executes. If task A depends on the results of task B, for example, then task A must wait until task B completes before it can execute. We can usually think of this case in terms of data flow: Task A is blocked waiting for the data to be ready from task B; when B completes, the data flows into A. In some cases, A can begin computing as soon as data starts to flow from B (for example, pipeline algorithms as described in the Pipeline pattern).
- Another type of ordering constraint occurs when a collection of tasks must run at the same time. For example, in many data-parallel problems, the original problem domain is divided into multiple regions that can be updated in parallel. Typically, the update of any given region requires information about the boundaries of its neighboring regions. If all of the regions are not processed at the same time, the parallel program could stall or deadlock as some regions wait for data from inactive regions.
- In some cases, tasks in a group are truly independent of each other. These tasks do not have an ordering constraint among them. This is an important feature of a set of tasks, because it means they can execute in any order, including concurrently, and it is important to clearly note when this holds.
The goal of this pattern is to group tasks based on these constraints, because of the following.

- By grouping tasks, we simplify the establishment of partial orders between tasks, since ordering constraints can be applied to groups rather than to individual tasks.
- Grouping tasks makes it easier to identify which tasks must execute concurrently.

For a given problem and decomposition, there may be many ways to group tasks. The goal is to pick a grouping of tasks that simplifies the dependency analysis. To clarify this point, think of the dependency analysis as finding and satisfying constraints on the concurrent execution of a program. When tasks share a set of constraints, it simplifies the dependency analysis to group them together.

There is no single way to find task groups. We suggest the following approach, keeping in mind that while one cannot think about task groups without considering the constraints themselves, at this point in the design it is best to do so as abstractly as possible: identify the constraints and group tasks to help resolve them, but try not to get bogged down in the details.
- First, look at how the original problem was decomposed. In most cases, a high-level operation (for example, solving a matrix) or a large iterative program structure (for example, a loop) plays a key role in defining the decomposition. This is the first place to look for grouping tasks. The tasks that correspond to a high-level operation naturally group together.
- At this point, there may be many small groups of tasks. In the next step, we will look at the constraints shared between the tasks within a group. If the tasks share a constraint (usually in terms of the update of a shared data structure), keep them as a distinct group. The algorithm design will need to ensure that these tasks execute at the same time. For example, many problems involve the coordinated update of a shared data structure by a set of tasks. If these tasks do not run concurrently, the program could deadlock.
- Next, we ask if any other task groups share the same constraint. If so, merge the groups together. Large task groups provide additional concurrency to keep more PEs busy and also provide extra flexibility in scheduling the execution of the tasks, thereby making it easier to balance the load between PEs (that is, ensure that each of the PEs spends approximately the same amount of time working on the problem).
- The next step is to look at constraints between groups of tasks. This is easy when groups have a clear temporal ordering or when a distinct chain of data moves between groups. The more complex case, however, is when otherwise independent task groups share constraints between groups. In these cases, it can be useful to merge these into a larger group of independent tasks, once again because large task groups usually make for more scheduling flexibility and better scalability.
Examples
Molecular dynamics
This problem was described in Sec. 3.1.3, and we discussed its decomposition in the Task Decomposition and Data Decomposition patterns. We identified the following tasks:

- Tasks that find the vibrational forces on an atom
- Tasks that find the rotational forces on an atom
- Tasks that find the nonbonded forces on an atom
- Tasks that update the position and velocity of an atom
- A task to update the neighbor list for all the atoms (a single task because we have decided to leave this part of the computation sequential)
Consider how these can be grouped together. As a first pass, each item in the previous list corresponds to a high-level operation in the original problem and defines a task group. If we were to dig deeper into the problem, however, we would see that in each case the updates implied in the force functions are independent. The only dependency is the summation of the forces into a single force array.

We next want to see if we can merge any of these groups. Going down the list, the tasks in the first two groups are independent but share the same constraints. In both cases, coordinates for a small neighborhood of atoms are read and local contributions are made to the force array, so we can merge these into a single group for bonded interactions. The other groups have distinct temporal or ordering constraints and therefore should not be merged.
Matrix multiplication
In the Examples section of the Task Decomposition pattern, we discuss decomposing the matrix multiplication C = AB into tasks, each corresponding to the update of one element in C. The memory organization of most modern computers, however, favors larger-grained tasks such as updating a block of C, as described in the Examples section of the Data Decomposition pattern. Mathematically, this is equivalent to grouping the element-wise update tasks into groups corresponding to blocks, and grouping the tasks this way is well suited to an optimum utilization of system memory.
- Temporal dependencies, that is, constraints placed on the order in which a collection of tasks executes.
- Requirements that particular tasks must execute at the same time (for example, because each requires information that will be produced by the others).
- Lack of constraint, that is, total independence. Although this is not, strictly speaking, a constraint, it is an important feature of a set of tasks, because it means they can execute in any order, including concurrently, and it is important to clearly note when this holds.
The purpose of this pattern is to help find and correctly account for dependencies resulting from constraints on the order of execution of a collection of tasks.
Solution
There are two goals to be met when identifying ordering constraints among tasks and defining a partial order among task groups.

- The ordering must be restrictive enough to satisfy all the constraints so that the resulting design is correct.
- The ordering should not be more restrictive than it needs to be. Overly constraining the solution limits design options and can impair program efficiency; the fewer the constraints, the more flexibility you have to shift tasks around to balance the computational load among PEs.
To identify ordering constraints, consider the following ways tasks can depend on each other.

- First, look at the data required by a group of tasks before they can execute. After this data has been identified, find the task group that creates it, and an ordering constraint will be apparent. For example, if one group of tasks (call it A) builds a complex data structure and another group (B) uses it, there is a sequential ordering constraint between these groups. When these two groups are combined in a program, they must execute in sequence, first A and then B.
- Also consider whether external services can impose ordering constraints. For example, if a program must write to a file in a certain order, then these file I/O operations likely impose an ordering constraint.
- Finally, it is equally important to note when an ordering constraint does not exist. If a number of task groups can execute independently, there is a much greater opportunity to exploit parallelism, so we need to note when tasks are independent as well as when they are dependent.
Regardless of the source of the constraint, we must define the constraints that restrict the order of execution and make sure they are handled correctly in the resulting algorithm. At the same time, it is important to note when ordering constraints are absent, since this will give valuable flexibility later in the design.
Examples
Molecular dynamics
This problem was described in Sec. 3.1.3, and we discussed its decomposition in the Task Decomposition and Data Decomposition patterns. In the Group Tasks pattern, we described how to organize the tasks for this problem in the following groups:

- A group of tasks to find the "bonded forces" (vibrational forces and rotational forces) on each atom
- A group of tasks to find the nonbonded forces on each atom
- A group of tasks to update the position and velocity of each atom
- A task to update the neighbor list for all the atoms (which trivially constitutes a task group)
Now we are ready to consider ordering constraints between the groups. Clearly, the update of the atomic positions cannot occur until the force computation is complete. Also, the nonbonded forces cannot be computed until the neighbor list is updated. So in each time step, the groups must be ordered as shown in Fig. 3.4.

While it is too early in the design to consider in detail how these ordering constraints will be enforced, eventually we will need to provide some sort of synchronization to ensure that they are strictly followed.
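These constraints form a small dependency graph, and any legal schedule for a time step is a topological order of that graph. A sketch (the group names are our shorthand for the groups listed above, not identifiers from the text):

```python
# Illustrative sketch: express the ordering constraints from the text as a
# predecessor map and let a topological sort produce one legal schedule for
# a single time step.

from graphlib import TopologicalSorter  # Python 3.9+

deps = {
    "nonbonded_forces": {"neighbor_list"},   # needs a current neighbor list
    "position_update": {"bonded_forces", "nonbonded_forces"},  # needs forces
    "bonded_forces": set(),
    "neighbor_list": set(),
}

order = list(TopologicalSorter(deps).static_order())
print(order)

# The hard constraints from the text must hold in any legal order.
assert order.index("neighbor_list") < order.index("nonbonded_forces")
assert order.index("bonded_forces") < order.index("position_update")
assert order.index("nonbonded_forces") < order.index("position_update")
```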
- A collection of tasks that can execute concurrently (see the Task Decomposition pattern)
- A data decomposition corresponding to the collection of concurrent tasks (see the Data Decomposition pattern)
- Dependencies among the tasks that must be managed to permit safe concurrent execution

As addressed in the Group Tasks and Order Tasks patterns, the starting point in a dependency analysis is to group tasks based on constraints among them and then determine what ordering constraints apply to groups of tasks. The next step, discussed here, is to analyze how data is shared among groups of tasks, so that access to shared data can be managed correctly.
Although the analysis that led to the grouping of tasks and the ordering constraints among them focuses primarily on the task decomposition, at this stage of the dependency analysis the focus shifts to the data decomposition, that is, the division of the problem's data into chunks that can be updated independently, each associated with one or more tasks that handle the update of that chunk. This chunk of data is sometimes called task-local data (or just local data), because it is tightly coupled to the task(s) responsible for its update. It is rare, however, that each task can operate using only its own local data; data may need to be shared among tasks in many ways. Two of the most common situations are the following.

- In addition to task-local data, the problem's data decomposition might define some data that must be shared among tasks; for example, the tasks might need to cooperatively update a large shared data structure. Such data cannot be identified with any given task; it is inherently global to the problem. This shared data is modified by multiple tasks and therefore serves as a source of dependencies among the tasks.
- Data dependencies can also occur when one task needs access to some portion of another task's local data. The classic example of this type of data dependency occurs in finite difference methods parallelized using a data decomposition, where each point in the problem space is updated using values from nearby points, and therefore updates for one chunk of the decomposition require values from the boundaries of neighboring chunks.

This pattern discusses data sharing in parallel algorithms and how to deal with typical forms of shared data.
Forces
The goal of this pattern is to identify what data is shared among groups of tasks and determine how to manage access to shared data in a way that is both correct and efficient.

Data sharing can have major implications for both correctness and efficiency.

- If the sharing is done incorrectly, a task may get invalid data due to a race condition; this happens often in shared-address-space environments, where a task can read from a memory location before the write of the expected data has completed.
- Guaranteeing that shared data is ready for use can lead to excessive synchronization overhead. For example, an ordering constraint can be enforced by putting barrier operations [1] before reads of shared data. This can be unacceptably inefficient, however, especially in cases where only a small subset of the UEs are actually sharing the data. A much better strategy is to use a combination of copying into local data or restructuring tasks to minimize the number of times shared data must be read.
- Another source of data-sharing overhead is communication. In some parallel systems, any access to shared data implies the passing of a message between UEs. This problem can sometimes be mitigated by overlapping communication and computation, but this isn't always possible. Frequently, a better choice is to structure the algorithm and tasks so that the amount of shared data to communicate is minimized. Another approach is to give each UE its own copy of the shared data; this requires some care to be sure that the copies are kept consistent in value, but can be more efficient.

The goal, therefore, is to manage shared data enough to ensure correctness but not so much as to interfere with efficiency.

[1] A barrier is a synchronization construct that defines a point in a program that a group of UEs must all reach before any of them are allowed to proceed.
Solution
The first step is to identify data that is shared among tasks.

This is most obvious when the decomposition is predominantly a data-based decomposition. For example, in a finite difference problem, the basic data is decomposed into blocks. The nature of the decomposition dictates that the data at the edges of the blocks is shared between neighboring blocks. In essence, the data sharing was worked out when the basic decomposition was done.

In a decomposition that is predominantly task-based, the situation is more complex. At some point in the definition of tasks, it was determined how data is passed into or out of the task and whether any data is updated in the body of the task. These are the sources of potential data sharing.
After the shared data has been identified, it needs to be analyzed to see how it is used. Shared data falls into one of the following three categories.

- Read-only. The data is read but not written. Because it is not modified, access to these values does not need to be protected. On some distributed-memory systems, it is worthwhile to replicate the read-only data so each unit of execution has its own copy.
- Effectively-local. The data is partitioned into subsets, each of which is accessed (for read or write) by only one of the tasks. (An example of this would be an array shared among tasks in such a way that its elements are effectively partitioned into sets of task-local data.) This case provides some options for handling the dependencies. If the subsets can be accessed independently (as would normally be the case with, say, array elements, but not necessarily with list elements), then it is not necessary to worry about protecting access to this data. On distributed-memory systems, such data would usually be distributed among UEs, with each UE having only the data needed by its tasks. If necessary, the data can be recombined into a single data structure at the end of the computation.
- Read-write. The data is both read and written and is accessed by more than one task. This is the general case, and it includes arbitrarily complicated situations in which data is read from and written to by any number of tasks. It is the most difficult to deal with, because any access to the data (read or write) must be protected with some type of exclusive-access mechanism (locks, semaphores, etc.), which can be very expensive.
Two special cases of read-write data are common enough to deserve special mention:

- Accumulate. The data is being used to accumulate a result (for example, when computing a reduction). For each location in the shared data, the values are updated by multiple tasks, with the update taking place through some sort of associative accumulation operation. The most common accumulation operations are sum, minimum, and maximum, but any associative operation on pairs of operands can be used. For such data, each task (or, usually, each UE) has a separate copy; the accumulations occur into these local copies, which are then accumulated into a single global copy as a final step at the end of the accumulation.
- Multiple-read/single-write. The data is read by multiple tasks (all of which need its initial value), but modified by only one task (which can read and write its value arbitrarily often). Such variables occur frequently in algorithms based on data decompositions. For data of this type, at least two copies are needed, one to preserve the initial value and one to be used by the modifying task; the copy containing the initial value can be discarded when no longer needed. On distributed-memory systems, typically a copy is created for each task needing access (read or write) to the data.
Examples
Molecular dynamics
This problem was described in Sec. 3.1.3, and we discussed its decomposition in the Task Decomposition and Data Decomposition patterns. We then identified the task groups (in the Group Tasks pattern) and considered temporal constraints among the task groups (in the Order Tasks pattern). We will ignore the temporal constraints for now and just focus on data sharing for the problem's final task groups:

- The group of tasks to find the "bonded forces" (vibrational forces and rotational forces) on each atom
- The group of tasks to find the nonbonded forces on each atom
- The group of tasks to update the position and velocity of each atom
- The task to update the neighbor list for all the atoms (which trivially constitutes a task group)

The data sharing in this problem can be complicated. We summarize the data shared between groups in Fig. 3.5. The major shared data items are the following.
- The atomic coordinates, used by each group. These coordinates are treated as read-only data by the bonded force group, the nonbonded force group, and the neighbor list update group. This data is read-write for the position update group. Fortunately, the position update group executes alone after the other three groups are done (based on the ordering constraints developed using the Order Tasks pattern). Hence, in the first three groups, we can leave accesses to the position data unprotected or even replicate it. For the position update group, the position data belongs to the read-write category, and access to this data will need to be controlled carefully.
- The force array, used by each group except for the neighbor list update. This array is used as read-only data by the position update group and as accumulate data for the bonded and nonbonded force groups. Because the position update group must follow the force computations (as determined using the Order Tasks pattern), we can put this array in the accumulate category for the force groups and in the read-only category for the position update group.
The standard procedure for molecular dynamics simulations [MR95] begins by initializing the force array as a local array on each UE. Contributions to elements of the force array are then computed by each UE, with the precise terms computed being unpredictable because of the way the molecule folds in space. After all the forces have been computed, the local arrays are reduced into a single array, a copy of which is placed on each UE (see the discussion of reduction in Sec. 6.4.2 for more information).

- The neighbor list, shared between the nonbonded force group and the neighbor list update group. The neighbor list is essentially local data for the neighbor list update group and read-only data for the nonbonded force computation. The list can be managed in local storage on each UE.
Figure 3.5. Data sharing in molecular dynamics. We distinguish between sharing for
reads, read-writes, and accumulations.
- A task decomposition that identifies tasks that can execute concurrently
- A data decomposition that identifies data local to each task
- A way of grouping tasks and ordering the groups to satisfy temporal constraints
- An analysis of dependencies among tasks

It is these four items that will guide the designer's work in the next design space (the Algorithm Structure patterns). Therefore, getting these items right and finding the best problem decomposition is important for producing a high-quality design.

In some cases, the concurrency is straightforward and there is clearly a single best way to decompose a problem. More often, however, multiple decompositions are possible. Hence, it is important, before proceeding too far into the design process, to evaluate the emerging design and make sure it meets the application's needs. Remember that algorithm design is an inherently iterative process, and designers should not expect to produce an optimum design on the first pass through the Finding Concurrency patterns.
Forces
The design needs to be evaluated from three perspectives.

- Suitability for the target platform. Issues such as the number of processors and how data structures are shared will influence the efficiency of any design, but the more the design depends on the target architecture, the less flexible it will be.
- Design quality. Simplicity, flexibility, and efficiency are all desirable, but possibly conflicting, attributes.
- Preparation for the next phase of the design. Are the tasks and dependencies regular or irregular (that is, are they similar in size, or do they vary)? Is the interaction between tasks synchronous or asynchronous (that is, do the interactions occur at regular intervals or at highly variable or even random times)? Are the tasks aggregated in an effective way? Understanding these issues will help choose an appropriate solution from the patterns in the Algorithm Structure design space.
Solution
Before moving on to the next phase of the design process, it is helpful to evaluate the work so far from the three perspectives mentioned in the Forces section. The remainder of this pattern consists of questions and discussions to help with the evaluation.
Suitability for target platform
Although it is desirable to delay mapping a program onto a particular target platform as long as possible, the characteristics of the target platform do need to be considered at least minimally while evaluating a design. Following are some issues relevant to the choice of target platform or platforms.

How many PEs are available? With some exceptions, having many more tasks than PEs makes it easier to keep all the PEs busy. Obviously we can't make use of more PEs than we have tasks, but having only one or a few tasks per PE can lead to poor load balance. For example, consider the case of a Monte Carlo simulation in which a calculation is repeated over and over for different sets of randomly chosen data, such that the time taken for the calculation varies considerably depending on the data. A natural approach to developing a parallel algorithm would be to treat each calculation (for a separate set of data) as a task; these tasks are then completely independent and can be scheduled however we like. But because the time for each task can vary considerably, unless there are many more tasks than PEs, it will be difficult to achieve good load balance.
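This effect is easy to demonstrate. The following sketch (ours, with randomly drawn task times standing in for the Monte Carlo calculations) greedily assigns independent tasks to the least-loaded of four PEs and measures the resulting imbalance:

```python
# Illustrative sketch: schedule independent tasks with widely varying
# runtimes onto n_pes PEs, always giving the next task to the least-loaded
# PE. With one task per PE the finish times diverge badly; with many tasks
# per PE the load evens out.

import random

def makespan_imbalance(times, n_pes):
    """Greedy schedule; return (max PE load - min PE load) / max PE load."""
    loads = [0.0] * n_pes
    for t in times:
        loads[loads.index(min(loads))] += t
    return (max(loads) - min(loads)) / max(loads)

random.seed(42)
n_pes = 4
few = [random.uniform(1, 100) for _ in range(n_pes)]        # 1 task per PE
many = [random.uniform(1, 100) for _ in range(50 * n_pes)]  # 50 tasks per PE

print("imbalance, 1 task/PE: ", round(makespan_imbalance(few, n_pes), 3))
print("imbalance, 50 tasks/PE:", round(makespan_imbalance(many, n_pes), 3))
```

With only one task per PE, the slowest PE is stuck with whatever task it drew; with fifty tasks per PE, the greedy scheduler keeps the loads within a single task's runtime of each other.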
The exceptions to this rule are designs in which the number of tasks can be adjusted to fit the number of PEs in such a way that good load balance is maintained. An example of such a design is the block-based matrix multiplication algorithm described in the Examples section of the Data Decomposition pattern: Tasks correspond to blocks, and all the tasks involve roughly the same amount of computation, so adjusting the number of tasks to be equal to the number of PEs produces an algorithm with good load balance. (Note, however, that even in this case it might be advantageous to have more tasks than PEs. This might, for example, allow overlap of computation and communication.)
How are data structures shared among PEs? A design that involves large-scale or fine-grained data sharing among tasks will be easier to implement and more efficient if all tasks have access to the same memory. Ease of implementation depends on the programming environment; an environment based on a shared-memory model (all UEs share an address space) makes it easier to implement a design requiring extensive data sharing. Efficiency depends also on the target machine; a design involving extensive data sharing is likely to be more efficient on a symmetric multiprocessor (where access time to memory is uniform across processors) than on a machine that layers a shared-memory environment over physically distributed memory. In contrast, if the plan is to use a message-passing environment running on a distributed-memory architecture, a design involving extensive data sharing is probably not a good choice.
For example, consider the task-based approach to the medical imaging problem described in the Examples section of the Task Decomposition pattern. This design requires that all tasks have read access to a potentially very large data structure (the body model). This presents no problems in a shared-memory environment; it is also no problem in a distributed-memory environment in which each PE has a large memory subsystem and there is plenty of network bandwidth to handle broadcasting the large data set. However, in a distributed-memory environment with limited memory or network bandwidth, the more memory-efficient algorithm that emphasizes the data decomposition would be required.
A design that requires fine-grained data sharing (in which the same data structure is accessed repeatedly by many tasks, particularly when both reads and writes are involved) is also likely to be more efficient on a shared-memory machine, because the overhead required to protect each access is likely to be smaller than for a distributed-memory machine.
The exception to these principles would be a problem in which it is easy to group and schedule tasks in such a way that the only large-scale or fine-grained data sharing is among tasks assigned to the same unit of execution.
What does the target architecture imply about the number of UEs and how structures are shared among them? In essence, we revisit the preceding two questions, but in terms of UEs rather than PEs. This can be an important distinction to make if the target system depends on multiple UEs per PE to hide latency. There are two factors to keep in mind when considering whether a design using more than one UE per PE makes sense.
The first factor is whether the target system provides efficient support for multiple UEs per PE. Some systems do provide such support, such as the Cray MTA machines and machines built with Intel processors that utilize hyperthreading. This architectural approach provides hardware support for extremely rapid context switching, making it practical to use in a far wider range of latency-hiding situations. Other systems do not provide good support for multiple UEs per PE. For example, an MPP system with slow context switching and/or one processor per node might run much better when there is only one UE per PE.
The second factor is whether the design can make good use of multiple UEs per PE. For example, if the design involves communication operations with high latency, it might be possible to mask that latency by assigning multiple UEs to each PE so some UEs can make progress while others are waiting on a high-latency operation. If, however, the design involves communication operations that are tightly synchronized (for example, pairs of blocking send/receives) and relatively efficient, assigning multiple UEs to each PE is more likely to interfere with ease of implementation (by requiring extra effort to avoid deadlock) than to improve efficiency.
On the target platform, will the time spent doing useful work in a task be significantly greater than the time taken to deal with dependencies? A critical factor in determining whether a design is effective is the ratio of time spent doing computation to time spent in communication or synchronization: The higher the ratio, the more efficient the program. This ratio is affected not only by the number and type of coordination events required by the design, but also by the characteristics of the target platform. For example, a message-passing design that is acceptably efficient on an MPP with a fast interconnect network and relatively slow processors will likely be less efficient, perhaps unacceptably so, on an Ethernet-connected network of powerful workstations.

Note that this critical ratio is also affected by problem size relative to the number of available PEs, because for a fixed problem size, the time spent by each processor doing computation decreases with the number of processors, while the time spent by each processor doing coordination might stay the same or even increase as the number of processors increases.
Design quality
Keeping these characteristics of the target platform in mind, we can evaluate the design along the three dimensions of flexibility, efficiency, and simplicity.

Flexibility. It is desirable for the high-level design to be adaptable to a variety of different implementation requirements, and certainly all the important ones. The rest of this section provides a partial checklist of factors that affect flexibility.
Is the decomposition flexible in the number of tasks generated? Such flexibility allows the design to be adapted to a wide range of parallel computers.

Is the definition of tasks implied by the task decomposition independent of how they are scheduled for execution? Such independence makes the load-balancing problem easier to solve.

Can the size and number of chunks in the data decomposition be parameterized? Such parameterization makes a design easier to scale for varying numbers of PEs.
Does the algorithm handle the problem's boundary cases? A good design will handle all relevant cases, even unusual ones. For example, a common operation is to transpose a matrix so that a distribution in terms of blocks of matrix columns becomes a distribution in terms of blocks of matrix rows. It is easy to write down the algorithm and code it for square matrices where the matrix order is evenly divided by the number of PEs. But what if the matrix is not square, or what if the number of rows is much greater than the number of columns and neither number is evenly divided by the number of PEs? This requires significant changes to the transpose algorithm. For a rectangular matrix, for example, the buffer that will hold the matrix block will need to be large enough to hold the larger of the two blocks. If either the row or column dimension of the matrix is not evenly divisible by the number of PEs, then the blocks will not be the same size on each PE. Can the algorithm deal with the uneven load that will result from having different block sizes on each PE?
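The bookkeeping for uneven division is itself straightforward; a standard trick is to give the first n mod P PEs one extra row or column each. A minimal sketch (the function names are ours, not from the text):

```c
#include <assert.h>

/* Number of rows (or columns) of an n-element dimension assigned to
   PE 'p' out of 'pes': the first n % pes PEs each get one extra element. */
int block_size(int n, int pes, int p) {
    return n / pes + (p < n % pes ? 1 : 0);
}

/* First index owned by PE 'p' under the same distribution. */
int block_start(int n, int pes, int p) {
    int extra = n % pes;
    int small = n / pes;
    return p * small + (p < extra ? p : extra);
}
```

For example, distributing 10 rows over 4 PEs yields block sizes 3, 3, 2, 2 starting at indices 0, 3, 6, 8: every element is owned exactly once, but the load differs by one row between PEs, which is precisely the unevenness the question above asks the algorithm to tolerate.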
Efficiency. The program should effectively utilize the available computing resources. The rest of this section gives a partial list of important factors to check. Note that typically it is not possible to simultaneously optimize all of these factors; design tradeoffs are inevitable.
Can the computational load be evenly balanced among the PEs? This is easier if the tasks are independent, or if they are roughly the same size.

Is the overhead minimized? Overhead can come from several sources, including creation and scheduling of the UEs, communication, and synchronization. Creation and scheduling of UEs involves overhead, so each UE needs to have enough work to do to justify this overhead. On the other hand, more UEs allow for better load balance.
Communication can also be a source of significant overhead, particularly on distributed-memory platforms that depend on message passing. As we discussed in Section 2.6, the time to transfer a message has two components: a latency cost arising from operating system overhead and message startup costs on the network, and a cost that scales with the length of the message. To minimize the latency costs, the number of messages to be sent should be kept to a minimum. In other words, a small number of large messages is better than a large number of small ones. The second term is related to the bandwidth of the network. These costs can sometimes be hidden by overlapping communication with computation.
On shared-memory machines, synchronization is a major source of overhead. When data is shared between UEs, dependencies arise requiring one task to wait for another to avoid race conditions. The synchronization mechanisms used to control this waiting are expensive compared to many operations carried out by a UE. Furthermore, some synchronization constructs generate significant memory traffic as they flush caches, buffers, and other system resources to make sure UEs see a consistent view of memory. This extra memory traffic can interfere with the explicit data movement within a computation. Synchronization overhead can be reduced by keeping data well localized to a task, thereby minimizing the frequency of synchronization operations.
Simplicity. To paraphrase Einstein: Make it as simple as possible, but not simpler.
Keep in mind that practically all programs will eventually need to be debugged, maintained, and often enhanced and ported. A design, even a generally superior design, is not valuable if it is too hard to debug, maintain, and verify the correctness of the final program.
The medical imaging example initially described in Sec. 3.1.3 and then discussed further in the Task Decomposition and Data Decomposition patterns is an excellent case in point in support of the value of simplicity. In this problem, a large database could be decomposed, but this decomposition would force the parallel algorithm to include complex operations for passing trajectories between UEs and to distribute chunks of the database. This complexity makes the resulting program much more difficult to understand and greatly complicates debugging. The other approach, replicating the database, leads to a vastly simpler parallel program in which completely independent tasks can be passed out to multiple workers as they are read. All complex communication thus goes away, and the parallel part of the program is trivial to debug and reason about.
Preparation for next phase
The problem decomposition carried out with the Finding Concurrency patterns defines the key components that will guide the design in the Algorithm Structure design space:

A task decomposition that identifies tasks that can execute concurrently

A data decomposition that identifies data local to each task

A way of grouping tasks and ordering the groups to satisfy temporal constraints

An analysis of dependencies among tasks

Before moving on in the design, consider these components relative to the following questions.
How regular are the tasks and their data dependencies? Regular tasks are similar in size and effort. Irregular tasks would vary widely among themselves. If the tasks are irregular, the scheduling of the tasks and their sharing of data will be more complicated and will need to be emphasized in the design. In a regular decomposition, all the tasks are in some sense the same: roughly the same computation (on different sets of data), roughly the same dependencies on data shared with other tasks, etc. Examples include the various matrix multiplication algorithms described in the Examples sections of the Task Decomposition, Data Decomposition, and other patterns.
In an irregular decomposition, the work done by each task and/or the data dependencies vary among tasks. For example, consider a discrete-event simulation of a large system consisting of a number of distinct components. We might design a parallel algorithm for this simulation by defining a task for each component and having them interact based on the discrete events of the simulation. This would be a very irregular design in that there would be considerable variation among tasks with regard to work done and dependencies on other tasks.
Are interactions between tasks (or task groups) synchronous or asynchronous? In some designs, the interaction between tasks is also very regular with regard to time; that is, it is synchronous. For example, a typical approach to parallelizing a linear algebra problem involving the update of a large matrix is to partition the matrix among tasks and have each task update its part of the matrix, using data from both its and other parts of the matrix. Assuming that all the data needed for the update is present at the start of the computation, these tasks will typically first exchange information and then compute independently. Another type of example is a pipeline computation (see the Pipeline pattern), in which we perform a multistep operation on a sequence of sets of input data by setting up an assembly line of tasks (one for each step of the operation), with data flowing from one task to the next as each task accomplishes its work. This approach works best if all of the tasks stay more or less in step; that is, if their interaction is synchronous.

In other designs, the interaction between tasks is not so chronologically regular. An example is the discrete-event simulation described previously, in which the events that lead to interaction between tasks can be chronologically irregular.
Are the tasks grouped in the best way? The temporal relations are easy: Tasks that can run at the same time are naturally grouped together. But an effective design will also group tasks together based on their logical relationship in the overall problem.

As an example of grouping tasks, consider the molecular dynamics problem discussed in the Examples section of the Group Tasks, Order Tasks, and Data Sharing patterns. The grouping we eventually arrive at (in the Group Tasks pattern) is hierarchical: groups of related tasks based on the high-level operations of the problem, further grouped on the basis of which ones can execute concurrently. Such an approach makes it easier to reason about whether the design meets the necessary constraints (because the constraints can be stated in terms of the task groups defined by the high-level operations) while allowing for scheduling flexibility.
3.8. SUMMARY
Working through the patterns in the Finding Concurrency design space exposes the concurrency in your problem. The key elements following from that analysis are

A task decomposition that identifies tasks that can execute concurrently

A data decomposition that identifies data local to each task

A way of grouping tasks and ordering the groups to satisfy temporal constraints

An analysis of dependencies among tasks

A pattern language is traditionally described as a web of patterns with one pattern logically connected to the next. The output from the Finding Concurrency design space, however, does not fit into that picture. Rather, the goal of this design space is to help the designer create the design elements that together will lead into the rest of the pattern language.
4.5 THE DIVIDE AND CONQUER PATTERN
4.6 THE GEOMETRIC DECOMPOSITION PATTERN
4.7 THE RECURSIVE DATA PATTERN
4.8 THE PIPELINE PATTERN
4.9 THE EVENT-BASED COORDINATION PATTERN
4.1. INTRODUCTION
The first phase of designing a parallel algorithm consists of analyzing the problem to identify exploitable concurrency, usually by using the patterns of the Finding Concurrency design space. The output from the Finding Concurrency design space is a decomposition of the problem into design elements:

A task decomposition that identifies tasks that can execute concurrently

A data decomposition that identifies data local to each task

A way of grouping tasks and ordering the groups to satisfy temporal constraints

An analysis of dependencies among tasks

These elements provide the connection from the Finding Concurrency design space to the Algorithm Structure design space. Our goal in the Algorithm Structure design space is to refine the design and move it closer to a program that can execute tasks concurrently by mapping the concurrency onto multiple UEs running on a parallel computer.

Of the countless ways to define an algorithm structure, most follow one of six basic design patterns. These patterns make up the Algorithm Structure design space. An overview of this design space and its place in the pattern language is shown in Fig. 4.1.
Figure 4.1. Overview of the Algorithm Structure design space and its place in the
pattern language
The key issue at this stage is to decide which pattern or patterns are most appropriate for the problem. First of all, we need to keep in mind that different aspects of the analysis can pull the design in different directions; one aspect might suggest one structure while another suggests a different structure. In nearly every case, however, the following forces should be kept in mind.

Efficiency. It is crucial that a parallel program run quickly and make good use of the computer resources.

Simplicity. A simple algorithm resulting in easy-to-understand code is easier to develop, debug, verify, and modify.

Portability. Ideally, programs should run on the widest range of parallel computers. This will maximize the "market" for a particular program. More importantly, a program is used for many years, while any particular computer system is used for only a few years. Portable programs protect a software investment.

Scalability. Ideally, an algorithm should be effective on a wide range of numbers of processing elements (PEs), from a few up to hundreds or even thousands.
These forces conflict in several ways, however.

Efficiency conflicts with portability: Making a program efficient almost always requires that the code take into account the characteristics of the specific system on which it is intended to run, which limits portability. A design that makes use of the special features of a particular system or programming environment may lead to an efficient program for that particular environment, but be unusable for a different platform, either because it performs poorly or because it is difficult or even impossible to implement for the new platform.

Efficiency also can conflict with simplicity: For example, to write efficient programs that use the Task Parallelism pattern, it is sometimes necessary to use complicated scheduling algorithms. These algorithms in many cases, however, make the program very difficult to understand.
Thus, a good algorithm design must strike a balance between (1) abstraction and portability and (2) suitability for a particular target architecture. The challenge faced by the designer, especially at this early phase of the algorithm design, is to leave the parallel algorithm design abstract enough to support portability while ensuring that it can eventually be implemented effectively for the parallel systems on which it will be executed.
This is not an ideal world, however, and software designed without considering the major features of the target platform is unlikely to run efficiently.
The primary issue is how many units of execution (UEs) the system will effectively support, because an algorithm that works well for ten UEs may not work well for hundreds of UEs. It is not necessary to decide on a specific number (in fact to do so would overly constrain the applicability of the design), but it is important to have in mind at this point an order of magnitude for the number of UEs.
Another issue is how expensive it is to share information among UEs. If there is hardware support for shared memory, information exchange takes place through shared access to common memory, and frequent data sharing makes sense. If the target is a collection of nodes connected by a slow network, however, the communication required to share information is very expensive and must be avoided wherever possible.
When thinking about both of these issues (the number of UEs and the cost of sharing information), avoid the tendency to overconstrain the design. Software typically outlives hardware, so over the course of a program's life it may be used on a tremendous range of target platforms. The goal is to obtain a design that works well on the original target platform, but at the same time is flexible enough to adapt to different classes of hardware.
Finally, in addition to multiple UEs and some way to share information among them, a parallel computer has one or more programming environments that can be used to implement parallel algorithms. Different programming environments provide different ways to create tasks and share information among UEs, and a design that does not map well onto the characteristics of the target programming environment will be difficult to implement.
4.2.2. Major Organizing Principle
When considering the concurrency in the problem, is there a particular way of looking at it that stands out and provides a high-level mechanism for organizing this concurrency?

The analysis carried out using the patterns of the Finding Concurrency design space describes the potential concurrency in terms of tasks and groups of tasks, data (both shared and task-local), and ordering constraints among task groups. The next step is to find an algorithm structure that represents how this concurrency maps onto the UEs. There is usually a major organizing principle implied by the concurrency. This usually falls into one of three camps: organization by tasks, organization by data decomposition, and organization by flow of data. We now consider each of these in more detail.
For some problems, there is really only one group of tasks active at one time, and the way the tasks within this group interact is the major feature of the concurrency. Examples include so-called embarrassingly parallel programs in which the tasks are completely independent, as well as programs in which the tasks in a single group cooperate to compute a result.
For other problems, the way data is decomposed and shared among tasks stands out as the major way to organize the concurrency. For example, many problems focus on the update of a few large data structures, and the most productive way to think about the concurrency is in terms of how this structure is decomposed and distributed among UEs. Programs to solve differential equations or carry out linear algebra computations often fall into this category because they are frequently based on updating large data structures.
Finally, for some problems, the major feature of the concurrency is the presence of well-defined interacting groups of tasks, and the key issue is how the data flows among the tasks. For example, in a signal-processing application, data may flow through a sequence of tasks organized as a pipeline, each performing a transformation on successive data elements. Or a discrete-event simulation might be parallelized by decomposing it into tasks interacting via "events". Here, the major feature of the concurrency is the way in which these distinct task groups interact.
Notice also that the most effective parallel algorithm design might make use of multiple algorithm structures (combined hierarchically, compositionally, or in sequence), and this is the point at which to consider whether such a design makes sense. For example, it often happens that the very top level of the design is a sequential composition of one or more Algorithm Structure patterns. Other designs might be organized hierarchically, with one pattern used to organize the interaction of the major task groups and other patterns used to organize tasks within the groups; for example, an instance of the Pipeline pattern in which individual stages are instances of the Task Parallelism pattern.
4.2.3. The Algorithm Structure Decision Tree
For each subset of tasks, which Algorithm Structure design pattern most effectively defines how to map the tasks onto UEs?

Having considered the questions raised in the preceding sections, we are now ready to select an algorithm structure, guided by an understanding of constraints imposed by the target platform, an appreciation of the role of hierarchy and composition, and a major organizing principle for the problem. The decision is guided by the decision tree shown in Fig. 4.2. Starting at the top of the tree, consider the concurrency and the major organizing principle, and use this information to select one of the three branches of the tree; then follow the upcoming discussion for the appropriate subtree. Notice again that for some problems, the final design might combine more than one algorithm structure: If no single structure seems suitable, it might be necessary to divide the tasks making up the problem into two or more groups, work through this procedure separately for each group, and then determine how to combine the resulting algorithm structures.
Figure 4.2. Decision tree for the Algorithm Structure design space
Organize By Tasks
Select the Organize By Tasks branch when the execution of the tasks themselves is the best organizing principle. Then determine how the tasks are enumerated. If they can be gathered into a set linear in any number of dimensions, choose the Task Parallelism pattern. This pattern includes both situations in which the tasks are independent of each other (so-called embarrassingly parallel algorithms) and situations in which there are some dependencies among the tasks in the form of access to shared data or a need to exchange messages. If the tasks are enumerated by a recursive procedure, choose the Divide and Conquer pattern. In this pattern, the problem is solved by recursively dividing it into subproblems, solving each subproblem independently, and then recombining the subsolutions into a solution to the original problem.
Organize By Data Decomposition
Select the Organize By Data Decomposition branch when the decomposition of the data is the major organizing principle in understanding the concurrency. There are two patterns in this group, differing in how the decomposition is structured: linearly in each dimension or recursively. Choose the Geometric Decomposition pattern when the problem space is decomposed into discrete subspaces and the problem is solved by computing solutions for the subspaces, with the solution for each subspace typically requiring data from a small number of other subspaces. Many instances of this pattern can be found in scientific computing, where it is useful in parallelizing grid-based computations, for example. Choose the Recursive Data pattern when the problem is defined in terms of following links through a recursive data structure (for example, a binary tree).
Organize By Flow of Data
Select the Organize By Flow of Data branch when the major organizing principle is how the flow of data imposes an ordering on the groups of tasks. This pattern group has two members, one that applies when this ordering is regular and static and one that applies when it is irregular and/or dynamic. Choose the Pipeline pattern when the flow of data among task groups is regular, one-way, and does not change during the algorithm (that is, the task groups can be arranged into a pipeline through which the data flows). Choose the Event-Based Coordination pattern when the flow of data is irregular, dynamic, and/or unpredictable (that is, when the task groups can be thought of as interacting via asynchronous events).
4.2.4. Re-evaluation
Is the Algorithm Structure pattern (or patterns) suitable for the target platform? It is important to frequently review decisions made so far to be sure the chosen pattern(s) are a good fit with the target platform.
After choosing one or more Algorithm Structure patterns to be used in the design, skim through their descriptions to be sure they are reasonably suitable for the target platform. (For example, if the target platform consists of a large number of workstations connected by a slow network, and one of the chosen Algorithm Structure patterns requires frequent communication among tasks, it might be difficult to implement the design efficiently.) If the chosen patterns seem wildly unsuitable for the target platform, try identifying a secondary organizing principle and working through the preceding step again.
4.3. EXAMPLES
4.3.1. Medical Imaging
For example, consider the medical imaging problem described in Sec. 3.1.3. This application simulates a large number of gamma rays as they move through a body and out to a camera. One way to describe the concurrency is to define the simulation of each ray as a task. Because they are all logically equivalent, we put them into a single task group. The only data shared among the tasks is a large data structure representing the body, and since access to this data structure is read-only, the tasks do not depend on each other.
Because there are many independent tasks for this problem, it is less necessary than usual to consider the target platform: The large number of tasks should mean that we can make effective use of any (reasonable) number of UEs; the independence of the tasks should mean that the cost of sharing information among UEs will not have much effect on performance.
Thus, we should be able to choose a suitable structure by working through the decision tree shown previously in Fig. 4.2. Given that in this problem the tasks are independent, the only issue we really need to worry about as we select an algorithm structure is how to map these tasks onto UEs. That is, for this problem, the major organizing principle seems to be the way the tasks are organized, so we start by following the Organize By Tasks branch.
We now consider the nature of our set of tasks: whether they are arranged hierarchically or reside in an unstructured or flat set. For this problem, the tasks are in an unstructured set with no obvious hierarchical structure among them, so we choose the Task Parallelism pattern. Note that in this problem, the tasks are independent, a fact that we will be able to use to simplify the solution.
Finally, we review this decision in light of possible target-platform considerations. As we observed earlier, the key features of this problem (the large number of tasks and their independence) make it unlikely that we will need to reconsider because the chosen structure will be difficult to implement on the target platform.
4.3.2. Molecular Dynamics
As a second example, consider the molecular dynamics problem described in Sec. 3.1.3. In the Task Decomposition pattern, we identified the following groups of tasks associated with this problem:

Tasks that find the vibrational forces on an atom

Tasks that find the rotational forces on an atom

Tasks that find the nonbonded forces on an atom

Tasks that update the position and velocity of an atom

A task to update the neighbor list for all the atoms

The tasks within each group are expressed as the iterations of a loop over the atoms within the molecular system.
We can choose a suitable algorithm structure by working through the decision tree shown earlier in Fig. 4.2. One option is to organize the parallel algorithm in terms of the flow of data among the groups of tasks. Note that only the first three task groups (the vibrational, rotational, and nonbonded force calculations) can execute concurrently; that is, they must finish computing the forces before the atomic positions, velocities, and neighbor lists can be updated. This is not very much concurrency to work with, so a different branch in Fig. 4.2 should be used for this problem.
Another option is to derive exploitable concurrency from the set of tasks within each group, in this case the iterations of a loop over atoms. This suggests an organization by tasks with a linear arrangement of tasks, or, based on Fig. 4.2, the Task Parallelism pattern should be used. Total available concurrency is large (on the order of the number of atoms), providing a great deal of flexibility in designing the parallel algorithm.
The target machine can have a major impact on the parallel algorithm for this problem. The dependencies discussed in the Data Decomposition pattern (replicated coordinates on each UE and a combination of partial sums from each UE to compute a global force array) suggest that on the order of 2 to 3N terms (where N is the number of atoms) will need to be passed among the UEs. The computation, however, is of order nN, where n is the number of atoms in the neighborhood of each atom and considerably less than N. Hence, the communication and computation are of the same order, and management of communication overhead will be a key factor in designing the algorithm.
Ray tracing codes such as the medical imaging example described in the Task Decomposition pattern: Here the computation associated with each "ray" becomes a separate and completely independent task.

The molecular dynamics example described in the Task Decomposition pattern: The update of the nonbonded force on each atom is a task. The dependencies among tasks are managed by replicating the force array on each UE to hold the partial sums for each atom. When all the tasks have completed their contributions to the nonbonded force, the individual force arrays are combined (or "reduced") into a single array holding the full summation of nonbonded forces for each atom.

Branch-and-bound computations, in which the problem is solved by repeatedly removing a solution space from a list of such spaces, examining it, and either declaring it a solution, discarding it, or dividing it into smaller solution spaces that are then added to the list of spaces to examine. Such computations can be parallelized using this pattern by making each "examine and process a solution space" step a separate task. The tasks weakly depend on each other through the shared queue of tasks.
The common factor is that the problem can be decomposed into a collection of tasks that can execute concurrently. The tasks can be completely independent (as in the medical imaging example) or there can be dependencies among them (as in the molecular dynamics example). In most cases, the tasks will be associated with iterations of a loop, but it is possible to associate them with larger-scale program structures as well.
In many cases, all of the tasks are known at the beginning of the computation (the first two examples). However, in some cases, tasks arise dynamically as the computation unfolds, as in the branch-and-bound example.
Also, while it is usually the case that all tasks must be completed before the problem is done, for some problems, it may be possible to reach a solution without completing all of the tasks. For example, in the branch-and-bound example, we have a pool of tasks corresponding to solution spaces to be searched, and we might find an acceptable solution before all the tasks in this pool have been completed.
Forces
To exploit the potential concurrency in the problem, we must assign tasks to UEs. Ideally we want to do this in a way that is simple, portable, scalable, and efficient. As noted in Section 4.1, however, these goals may conflict. A key consideration is balancing the load, that is, ensuring that all UEs have roughly the same amount of work to do.

If the tasks depend on each other in some way (via either ordering constraints or data dependencies), these dependencies must be managed correctly, again keeping in mind the sometimes conflicting goals of simplicity, portability, scalability, and efficiency.
Solution
Designs for task-parallel algorithms involve three key elements: the tasks and how they are defined, the dependencies among them, and the schedule (how the tasks are assigned to UEs). We discuss them separately, but in fact they are tightly coupled, and all three must be considered before final decisions are made. After these factors are considered, we look at the overall program structure and then at some important special cases of this pattern.
Tasks
Ideally, the tasks into which the problem is decomposed should meet two criteria: First, there should be at least as many tasks as UEs, and preferably many more, to allow greater flexibility in scheduling. Second, the computation associated with each task must be large enough to offset the overhead associated with managing the tasks and handling any dependencies. If the initial decomposition does not meet these criteria, it is worthwhile to consider whether there is another way of decomposing the problem into tasks that does meet the criteria.
For example, in image processing applications where each pixel update is independent, the task definition can be individual pixels, image lines, or even whole blocks in the image. On a system with a small number of nodes connected by a slow network, tasks should be large to offset high communication latencies, so basing tasks on blocks of the image is appropriate. The same problem on a system containing a large number of nodes connected by a fast (low-latency) network, however, would need smaller tasks to make sure enough work exists to keep all the UEs occupied. Notice that this imposes a requirement for a fast network, because otherwise the smaller amount of work per task will not be enough to compensate for communication overhead.
Dependencies
Dependencies among tasks have a major impact on the emerging algorithm design. There are two categories of dependencies: ordering constraints and dependencies related to shared data.

For this pattern, ordering constraints apply to task groups and can be handled by forcing the groups to execute in the required order. For example, in a task-parallel multidimensional Fast Fourier Transform, there is a group of tasks for each dimension of the transform, and synchronization or other program constructs are used to make sure computation on one dimension completes before the next dimension begins. Alternatively, we could simply think of such a problem as a sequential composition of task-parallel computations, one for each task group.
Shared-data dependencies are potentially more complicated. In the simplest case, there are no dependencies among the tasks. A surprisingly large number of problems can be cast into this form. Such problems are often called embarrassingly parallel. Their solutions are among the simplest of parallel programs; the main considerations are how the tasks are defined (as discussed previously) and scheduled (as discussed later). When data is shared among tasks, the algorithm can be much more complicated, although there are still some common cases that can be dealt with relatively easily. We can categorize dependencies as follows.
Removable dependencies. In this case, the dependency is not a true dependency between tasks, but an apparent dependency that can be removed by simple code transformations. The simplest case is a temporary variable whose use is completely local to each task; that is, each task initializes the variable without reference to other tasks. This case can be handled by simply creating a copy of the variable local to each UE. In more complicated cases, iterative expressions might need to be transformed into closed-form expressions to remove a loop-carried dependency. For example, consider the following simple loop:
int ii = 0, jj = 0;
for (int i = 0; i < N; i++)
{
    d[ii] = big_time_consuming_work(ii);
    ii = ii + 1;
    jj = jj + i;
    a[jj] = other_big_calc(jj);
}
The variables ii and jj create a dependency between tasks and prevent parallelization of the loop. We can remove this dependency by replacing ii and jj with closed-form expressions (noticing that the values of ii and i are the same and that the value of jj is the sum of the values from 0 through i):
for (int i = 0; i < N; i++){
    d[i] = big_time_consuming_work(i);
    a[(i*i+i)/2] = other_big_calc((i*i+i)/2);
}
"Separable" dependencies. When the dependencies involve accumulation into a shared data structure, they can be separated from the tasks ("pulled outside the concurrent computation") by replicating the data structure at the beginning of the computation, executing the tasks, and then combining the copies into a single data structure after the tasks complete. Often the accumulation is a reduction operation, in which a collection of data elements is reduced to a single element by repeatedly applying a binary operation such as addition or multiplication.
In more detail, these dependencies can be managed as follows: A copy of the data structure used in the accumulation is created on each UE. Each copy is initialized (in the case of a reduction, to the identity element for the binary operation: zero for addition, one for multiplication). Each task then carries out the accumulation into its local data structure, eliminating the shared-data dependency. When all tasks are complete, the local data structures on each UE are combined to produce the final global result (in the case of a reduction, by applying the binary operation again). As an example, consider the following loop to sum the elements of array f:
for (int i = 0; i < N; i++){
    sum = sum + f(i);
}
This is technically a dependency between loop iterations, but if we recognize that the loop body is just accumulating into a simple scalar variable, it can be handled as a reduction. Reductions are so common that both MPI and OpenMP provide support for them as part of the API. Sec. 6.4.2 in the Implementation Mechanisms design space discusses reductions in more detail.
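As a concrete illustration, the following C sketch (the function f, the problem size, and the partition into NUM_UES pieces are illustrative, not from the text) replicates the accumulator on each "UE", lets each one accumulate its share with no shared-data dependency, and then combines the local results. A real implementation would run the per-UE loops on separate threads or processes.

```c
#include <assert.h>

#define N 100
#define NUM_UES 4

/* Stand-in for the f(i) of the text's example loop. */
static double f(int i) { return (double)i; }

/* Sum f(0..N-1) by replicating the accumulator: one local sum per UE,
   combined at the end. Each inner loop models one UE's share of the
   iterations; a real program would run the shares concurrently. */
double replicated_sum(void) {
    double local_sum[NUM_UES];
    for (int ue = 0; ue < NUM_UES; ue++) {
        local_sum[ue] = 0.0;               /* identity element for addition */
        for (int i = ue; i < N; i += NUM_UES)
            local_sum[ue] += f(i);         /* no shared-data dependency here */
    }
    double sum = 0.0;                      /* final combination: the reduction */
    for (int ue = 0; ue < NUM_UES; ue++)
        sum += local_sum[ue];
    return sum;
}
```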
Other dependencies. If the shared data cannot be pulled out of the tasks and is both read and written by the tasks, data dependencies must be explicitly managed within the tasks. How to do this in a way that gives correct results and also acceptable performance is the subject of the Shared Data pattern.
Schedule
The remaining key element to consider is the schedule: the way in which tasks are assigned to UEs and scheduled for execution. Load balance (as described in Chapter 2) is a critical consideration in scheduling; a design that balances the computational load among PEs will execute more efficiently than one that does not. Fig. 4.3 illustrates the problem.
Two classes of schedules are used in parallel algorithms: static schedules, in which the distribution of tasks among UEs is determined at the start of the computation and does not change; and dynamic schedules, in which the distribution of tasks among UEs varies as the computation proceeds.
In a static schedule, the tasks are associated into blocks and then assigned to UEs. Block size is adjusted so each UE takes approximately the same amount of time to complete its tasks. In most applications using a static schedule, the computational resources available from the UEs are predictable and stable over the course of the computation, with the most common case being UEs that are identical (that is, the computing system is homogeneous). If the set of times required to complete each task is narrowly distributed about a mean, the sizes of the blocks should be proportional to the relative performance of the UEs (so, in a homogeneous system, they are all the same size). When the effort associated with the tasks varies considerably, a static schedule can still be useful, but now the number of blocks assigned to UEs must be much greater than the number of UEs. By dealing out the blocks in a round-robin manner (much as a deck of cards is dealt among a group of card players), the load is balanced statistically.
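This round-robin dealing can be sketched in a few lines of C (the block and UE counts are illustrative values, not from the text):

```c
#include <assert.h>

#define NUM_BLOCKS 16
#define NUM_UES 4

/* Deal blocks to UEs round-robin, like dealing cards: block b goes to
   UE b % NUM_UES. With many more blocks than UEs, uneven per-block
   costs tend to average out across the UEs. */
int owner_of_block(int block) {
    return block % NUM_UES;
}

/* Count how many blocks each UE receives under this static schedule. */
void blocks_per_ue(int counts[NUM_UES]) {
    for (int ue = 0; ue < NUM_UES; ue++) counts[ue] = 0;
    for (int b = 0; b < NUM_BLOCKS; b++)
        counts[owner_of_block(b)]++;
}
```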
Dynamic schedules are used when (1) the effort associated with each task varies widely and is unpredictable and/or (2) the capabilities of the UEs vary widely and unpredictably. The most common approach used for dynamic load balancing is to define a task queue to be used by all the UEs; when a UE completes its current task and is therefore ready to process more work, it removes a task from the task queue. Faster UEs or those receiving lighter-weight tasks will access the queue more often and thereby be assigned more tasks.
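In its simplest form the shared task queue is just a counter of the next unclaimed task, as this single-threaded C sketch shows (the task count is illustrative; in a real program the fetch-and-increment must be made atomic with a lock or an atomic instruction, since the UEs contend for it concurrently):

```c
#include <assert.h>

#define NUM_TASKS 10

static int next_task = 0;  /* shared queue: index of the next unclaimed task */

/* Claim the next task, or return -1 when the queue is empty. Each UE
   calls this whenever it finishes its current task; a faster UE simply
   calls it more often and so is assigned more tasks. */
int grab_task(void) {
    if (next_task >= NUM_TASKS) return -1;
    return next_task++;
}
```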
Another dynamic scheduling strategy uses work stealing, which works as follows. The tasks are distributed among the UEs at the start of the computation. Each UE has its own work queue. When the queue is empty, the UE will try to steal work from the queue on some other UE (where the other UE is usually randomly selected). In many cases, this produces an optimal dynamic schedule without incurring the overhead of maintaining a single global queue. In programming environments or packages that provide support for the construct, such as Cilk [BJK+96], Hood [BP99], or the FJTask framework [Lea00b, Lea], it is straightforward to use this approach. But with more commonly used programming environments such as OpenMP, MPI, or Java (without support such as the FJTask framework), this approach adds significant complexity and therefore is not often used.
Selecting a schedule for a given problem is not always easy. Static schedules incur the least overhead during the parallel computation and should be used whenever possible.
Before ending the discussion of schedules, we should mention again that while for most problems all of the tasks are known when the computation begins and all must be completed to produce an overall solution, there are problems for which one or both of these is not true. In these cases, a dynamic schedule is probably more appropriate.
Program structure
Many task-parallel problems can be considered to be loop-based. Loop-based problems are, as the name implies, those in which the tasks are based on the iterations of a loop. The best solutions for such problems use the Loop Parallelism pattern. This pattern can be particularly simple to implement in programming environments that provide directives for automatically assigning loop iterations to UEs. For example, in OpenMP a loop can be parallelized by simply adding a "parallel for" directive with an appropriate schedule clause (one that maximizes efficiency). This solution is especially attractive because OpenMP then guarantees that the resulting program is semantically equivalent to the analogous sequential code (within roundoff error associated with different orderings of floating-point operations).
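A sketch of such a directive-based parallelization (the loop body is illustrative; the static schedule clause is one plausible choice, not prescribed here):

```c
#include <assert.h>

#define N 1000

/* Each iteration is an independent task; OpenMP's "parallel for" assigns
   the iterations to UEs according to the schedule clause. Compiled
   without OpenMP, the pragma is ignored and the loop runs sequentially
   with the same result. */
void square_all(double *d) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        d[i] = (double)i * (double)i;
}
```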
For problems in which the target platform is not a good fit with the Loop Parallelism pattern, or for problems in which the model of "all tasks known initially, all tasks must complete" does not apply (either because tasks can be created during the computation or because the computation can terminate without all tasks being complete), this straightforward approach is not the best choice. Instead, the best design makes use of a task queue; tasks are placed on the task queue as they are created and removed by UEs until the computation is complete. The overall program structure can be based on either the Master/Worker pattern or the SPMD pattern. The former is particularly appropriate for problems requiring a dynamic schedule.
In the case in which the computation can terminate before all the tasks are complete, some care must be taken to ensure that the computation ends when it should. If we define the termination condition as the condition that, when true, means the computation is complete (either all tasks are complete, or some other condition holds, for example, an acceptable solution has been found by one task), then we want to be sure that (1) the termination condition is eventually met (which, if tasks can be created dynamically, might mean building into it a limit on the total number of tasks created), and (2) when the termination condition is met, the program ends. How to ensure the latter is discussed in the Master/Worker and SPMD patterns.
Common idioms
Most problems for which this pattern is applicable fall into the following two categories.
Embarrassingly parallel problems are those in which there are no dependencies among the tasks. A wide range of problems fall into this category, ranging from rendering frames in a motion picture to statistical sampling in computational physics. Because there are no dependencies to manage, the focus is on scheduling the tasks to maximize efficiency. In many cases, it is possible to define schedules that automatically and dynamically balance the load among UEs.
Replicated data or reduction problems are those in which dependencies can be managed by "separating them from the tasks" as described earlier: replicating the data at the beginning of computation and combining results when the termination condition is met (usually "all tasks complete"). For these problems, the overall solution consists of three phases: one to replicate the data into local variables, one to solve the now-independent tasks (using the same techniques used for embarrassingly parallel problems), and one to recombine the results into a single result.
Examples
We will consider two examples of this pattern. The first example, an image-construction example, is embarrassingly parallel. The second example builds on the molecular dynamics example used in several of the Finding Concurrency patterns.
Image construction
In many image-construction problems, each pixel in the image is independent of all the other pixels. For example, consider the well-known Mandelbrot set [Dou86]. This famous image is constructed by coloring each pixel according to the behavior of the quadratic recurrence relation

Equation 4.1

    Z(n+1) = Z(n)^2 + C

where C and Z are complex numbers and the recurrence is started with Z(0) = C. The image plots the imaginary part of C on the vertical axis and the real part on the horizontal axis. The color of each pixel is black if the recurrence relation converges to a stable value or is colored depending on how rapidly the relation diverges.
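The per-pixel task can be sketched in C as follows (the iteration cap and the escape radius of 2 are conventional choices for this computation, not given in the text):

```c
#include <assert.h>

#define MAX_ITER 1000   /* conventional cap on the number of iterations */

/* Iterate Z <- Z^2 + C starting from Z = C, returning the number of
   iterations before |Z| exceeds 2 (the standard escape radius), or
   MAX_ITER if the orbit stays bounded. This count drives the pixel color;
   MAX_ITER is treated as "in the set" and colored black. */
int mandelbrot_iterations(double c_re, double c_im) {
    double z_re = c_re, z_im = c_im;
    for (int n = 0; n < MAX_ITER; n++) {
        if (z_re * z_re + z_im * z_im > 4.0)
            return n;
        double t = z_re * z_re - z_im * z_im + c_re;  /* real part of Z^2 + C */
        z_im = 2.0 * z_re * z_im + c_im;              /* imaginary part */
        z_re = t;
    }
    return MAX_ITER;
}
```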
At the lowest level, the task is the update for a single pixel. First consider computing this set on a cluster of PCs connected by an Ethernet. This is a coarse-grained system; that is, the rate of communication is slow relative to the rate of computation. To offset the overhead incurred by the slow network, the task size needs to be large; for this problem, that might mean computing a full row of the image. The work involved in computing each row varies depending on the number of divergent pixels in the row. The variation, however, is modest and distributed closely around a mean value. Therefore, a static schedule with many more tasks than UEs will likely give an effective statistical balance of the load among nodes. The remaining step in applying the pattern is choosing an overall structure for the program. On a shared-memory machine using OpenMP, the Loop Parallelism pattern described in the Supporting Structures design space is a good fit. On a network of workstations running MPI, the SPMD pattern (also in the Supporting Structures design space) is appropriate.
Before moving on to the next example, we consider one more target system, a cluster in which the nodes are not homogeneous; that is, some nodes are much faster than others. Assume also that the speed of each node may not be known when the work is scheduled. Because the time needed to compute the image for a row now depends both on the row and on which node computes it, a dynamic schedule is indicated. This in turn suggests that a general dynamic load-balancing scheme is indicated, which then suggests that the overall program structure should be based on the Master/Worker pattern.
Molecular dynamics
For our second example, we consider the computation of the nonbonded forces in a molecular dynamics computation. This problem is described in Sec. 3.1.3 and in [Mat95, PH95] and is used throughout the patterns in the Finding Concurrency design space. Pseudocode for this computation is shown in Fig. 4.4. The physics in this example is not relevant and is buried in code not shown here (the computation of the neighbors list and the force function). The basic computation structure is a loop over atoms, and then for each atom, a loop over interactions with other atoms. The number of interactions per atom is computed separately when the neighbors list is determined. This routine (not shown here) computes the number of atoms within a radius equal to a preset cutoff distance. The neighbor list is also modified to account for Newton's third law: Because the force of atom i on atom j is the negative of the force of atom j on atom i, only half of the potential interactions need actually be computed. Understanding this detail is not important for understanding this example. The key is that this causes each loop over j to vary greatly from one atom to another, thereby greatly complicating the load-balancing problem. Indeed, for the purposes of this example, all that must really be understood is that calculating the force is an expensive operation and that the number of interactions per atom varies greatly. Hence, the computational effort for each iteration over i is difficult to predict in advance.
Figure 4.4. Pseudocode for the nonbonded computation in a typical molecular
dynamics code
function non_bonded_forces (N, Atoms, neighbors, Forces)
    loop [i] over atoms
        loop [j] over neighbors(i)
            forceX = non_bond_force(atoms(1,i), atoms(1,j))
            forceY = non_bond_force(atoms(2,i), atoms(2,j))
            forceZ = non_bond_force(atoms(3,i), atoms(3,j))
            force(1,i) += forceX;  force(1,j) -= forceX;
            force(2,i) += forceY;  force(2,j) -= forceY;
            force(3,i) += forceZ;  force(3,j) -= forceZ;
        end loop [j]
    end loop [i]
end function non_bonded_forces
Each component of the force term is an independent computation, meaning that each (i, j) pair is fundamentally an independent task. The number of atoms tends to be on the order of thousands, and squaring that gives a number of tasks that is more than enough for all but the largest parallel systems. Therefore, we can take the more convenient approach of defining a task as one iteration of the loop over i. The tasks, however, are not independent: The force array is read and written by each task. Inspection of the code shows that the arrays are only used to accumulate results from the computation, however. Thus, the full array can be replicated on each UE and the local copies combined (reduced) after the tasks complete.
After the replication is defined, the problem is embarrassingly parallel and the same approaches discussed previously apply. We will revisit this example in the Master/Worker, Loop Parallelism, and SPMD patterns. A choice among these patterns is normally made based on the target platforms.
Known uses
There are many application areas in which this pattern is useful, including the following.
Many ray-tracing programs use some form of partitioning with individual tasks corresponding to scan lines in the final image [BKS91].
Applications written with coordination languages such as Linda are another rich source of examples of this pattern [BCM+91]. Linda [CG91] is a simple language consisting of only six operations that read and write an associative (that is, content-addressable) shared memory called a tuple space. The tuple space provides a natural way to implement a wide variety of shared-queue and master/worker algorithms.
Parallel computational chemistry applications also make heavy use of this pattern. In the quantum chemistry program GAMESS, the loops over two-electron integrals are parallelized with the task queue implied by the Nextval construct within TCGMSG. An early version of the distance geometry program DGEOM was parallelized with the master/worker form of this pattern. These examples are discussed in [Mat95].
PTEP (Parallel Telemetry Processor) [NBB01], developed by NASA as the downlink processing system for data from a planetary rover or lander, also makes use of this pattern. The system is implemented in Java but can incorporate components implemented in other languages. For each incoming data packet, the system determines which instrument produced the data, and then performs an appropriate sequential pipeline of processing steps. Because the incoming data packets are independent, the processing of individual packets can be done in parallel.
The potential concurrency in this strategy is not hard to see: Because the subproblems are solved independently, their solutions can be computed concurrently. Fig. 4.5 illustrates the strategy and the potential concurrency. Notice that each "split" doubles the available concurrency. Although the concurrency in a divide-and-conquer algorithm is obvious, the techniques required to exploit it effectively are not always obvious.
Forces
The traditional divide-and-conquer strategy is a widely useful approach to algorithm design. Sequential divide-and-conquer algorithms are almost trivial to parallelize based on the obvious exploitable concurrency.
As Fig. 4.5 suggests, however, the amount of exploitable concurrency varies over the life of the program. At the outermost level of the recursion (initial split and final merge), there is little or no exploitable concurrency, and the subproblems also contain split and merge sections. Amdahl's law (Chapter 2) tells us that the serial parts of a program can significantly constrain the speedup that can be achieved by adding more processors. Thus, if the split and merge computations are nontrivial compared to the amount of computation for the base cases, a program using this pattern might not be able to take advantage of large numbers of processors. Further, if there are many levels of recursion, the number of tasks can grow quite large, perhaps to the point that the overhead of managing the tasks overwhelms any benefit from executing them concurrently.
In distributed-memory systems, subproblems can be generated on one PE and executed by another, requiring data and results to be moved between the PEs. The algorithm will be more efficient if the amount of data associated with a computation (that is, the size of the parameter set and result for each subproblem) is small. Otherwise, large communication costs can dominate the performance.
In divide-and-conquer algorithms, the tasks are created dynamically as the computation proceeds, and in some cases, the resulting "task graph" will have an irregular and data-dependent structure. If this is the case, then the solution should employ dynamic load balancing.
Figure 4.6. Sequential pseudocode for the divide-and-conquer algorithm

Solution solve(Problem P) {
    if (baseCase(P))
        return baseSolve(P);
    else {
        Problem subProblems[N];
        Solution subSolutions[N];
        subProblems = split(P);
        for (int i = 0; i < N; i++)
            subSolutions[i] = solve(subProblems[i]);
        return merge(subSolutions);
    }
}
Solution
A sequential divide-and-conquer algorithm has the structure shown in Fig. 4.6. The cornerstone of this structure is a recursively invoked function (solve()) that drives each stage in the solution. Inside solve, the problem is either split into smaller subproblems (using split()) or it is directly solved (using baseSolve()). In the classical strategy, recursion continues until the subproblems are simple enough to be solved directly, often with just a few lines of code each. However, efficiency can be improved by adopting the view that baseSolve() should be called when (1) the overhead of performing further splits and merges significantly degrades performance, or (2) the size of the problem is optimal for the target system (for example, when the data required for a baseSolve() fits entirely in cache).
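This structure, including a tunable base-case threshold, can be made concrete with a toy problem; in the C sketch below (the problem of summing an array and the threshold value are illustrative only), the two recursive calls are the independent tasks a parallel version would fork:

```c
#include <assert.h>

#define THRESHOLD 4   /* the "granularity knob": below this, solve directly */

/* Divide-and-conquer sum of a[lo..hi): split the range in half, solve the
   halves recursively, and "merge" by adding the two partial results.
   Ranges at or below THRESHOLD are handled by a direct baseSolve loop. */
long dc_sum(const long *a, int lo, int hi) {
    if (hi - lo <= THRESHOLD) {          /* base case */
        long s = 0;
        for (int i = lo; i < hi; i++)
            s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;        /* split */
    long left  = dc_sum(a, lo, mid);     /* the two recursive calls are */
    long right = dc_sum(a, mid, hi);     /* independent tasks */
    return left + right;                 /* merge */
}
```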
The concurrency in a divide-and-conquer problem is obvious when, as is usually the case, the subproblems can be solved independently (and hence, concurrently). The sequential divide-and-conquer algorithm maps directly onto a task-parallel algorithm by defining one task for each invocation of the solve() function, as illustrated in Fig. 4.7. Note the recursive nature of the design, with each task in effect dynamically generating and then absorbing a task for each subproblem.
Figure 4.7. Parallelizing the divide-and-conquer strategy. Each dashed-line box
represents a task.
At some level of recursion, the amount of computation required for a subproblem can become so small that it is not worth the overhead of creating a new task to solve it. In this case, a hybrid program that creates new tasks at the higher levels of recursion, then switches to a sequential solution when the subproblems become smaller than some threshold, will be more effective. As discussed next, there are tradeoffs involved in choosing the threshold, which will depend on the specifics of the problem and the number of PEs available. Thus, it is a good idea to design the program so that this "granularity knob" is easy to change.
Mapping tasks to UEs and PEs
Conceptually, this pattern follows a straightforward fork/join approach (see the Fork/Join pattern). One task splits the problem, then forks new tasks to compute the subproblems, waits until the subproblems are computed, and then joins with the subtasks to merge the results.
The easiest situation is when the split phase generates subproblems that are known to be about the same size in terms of needed computation. Then, a straightforward implementation of the fork/join strategy, mapping each task to a UE and stopping the recursion when the number of active subtasks is the same as the number of PEs, works well.
In many situations, the problem will not be regular, and it is best to create more, finer-grained tasks and use a master/worker structure to map tasks to units of execution. The implementation of this approach is described in detail in the Master/Worker pattern. The basic idea is to conceptually maintain a queue of tasks and a pool of UEs, typically one per PE. When a subproblem is split, the new tasks are placed in the queue. When a UE finishes a task, it obtains another one from the queue. In this way, all of the UEs tend to remain busy, and the solution shows a good load balance. Finer-grained tasks allow a better load balance at the cost of more overhead for task management.
Many parallel programming environments directly support the fork/join construct. For example, in OpenMP, we could easily produce a parallel application by turning the for loop of Fig. 4.6 into an OpenMP parallel for construct. Then the subproblems will be solved concurrently rather than in sequence, with the OpenMP runtime environment handling the thread management. Unfortunately, this technique will only work with implementations of OpenMP that support true nesting of parallel regions. Currently, only a few OpenMP implementations do so. Extending OpenMP to better address recursive parallel algorithms is an active area of research in the OpenMP community [Mat03]. One proposal likely to be adopted in a future OpenMP specification is to add an explicit taskqueue construct designed to support the expression of recursive algorithms [SHPT00].
The FJTask framework for Java [Lea00b, Lea] provides support for fork/join programs with a pool of threads backing the implementation. Several example programs using a divide-and-conquer strategy are provided with the package.
Communication costs
Because tasks are generated dynamically from a single top-level task, a task can be executed on a different PE than the one that generated it. In a distributed-memory system, a higher-level task will typically have the data necessary to solve its entire problem; the relevant data must be moved to the subproblem's PE, and the result moved back to the source. Thus it pays to consider how to efficiently represent the parameters and results, and consider whether it makes sense to replicate some data at the beginning of the computation.
Dealing with dependencies
In most algorithms formulated using the divide-and-conquer strategy, the subproblems can be solved independently from each other. Less commonly, the subproblems require access to a common data structure. These dependencies can be handled using the techniques described in the Shared Data pattern.
Other optimizations
A factor limiting the scalability of this pattern is the serial split and merge sections. Reducing the number of levels of recursion required by splitting each problem into more subproblems can often help, especially if the split and merge phases can be parallelized themselves. This might require restructuring, but can be quite effective, especially in the limiting case of "one-deep divide and conquer", in which the initial split is into P subproblems, where P is the number of available PEs. Examples of this approach are given in [Tho95].
Examples
Mergesort
Mergesort is a well-known sorting algorithm based on the divide-and-conquer strategy, applied as follows to sort an array of N elements.
The base case is an array of size less than some threshold. This is sorted using an appropriate sequential sorting algorithm, often quicksort.
In the split phase, the array is split by simply partitioning it into two contiguous subarrays, each of size N/2.
In the solve-subproblems phase, the two subarrays are sorted (by applying the mergesort procedure recursively).
In the merge phase, the two (sorted) subarrays are recombined into a single sorted array.
This algorithm is readily parallelized by performing the two recursive mergesorts in parallel.
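The four phases can be sketched sequentially in C as follows (insertion sort stands in for the base-case sequential sort, and the cutoff value is arbitrary); a parallel version would run the two recursive calls as independent tasks:

```c
#include <assert.h>
#include <string.h>

#define CUTOFF 8   /* base-case threshold (arbitrary illustrative value) */

/* Base case: sort a[lo..hi) with a simple sequential algorithm. */
static void base_sort(int *a, int lo, int hi) {
    for (int i = lo + 1; i < hi; i++) {
        int key = a[i], j = i - 1;
        while (j >= lo && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}

/* Mergesort a[lo..hi) using tmp as scratch space: split into halves,
   sort each half (the two recursive calls are the independent tasks in
   a parallel version), then merge the sorted halves. */
void merge_sort(int *a, int lo, int hi, int *tmp) {
    if (hi - lo <= CUTOFF) { base_sort(a, lo, hi); return; }
    int mid = lo + (hi - lo) / 2;                 /* split */
    merge_sort(a, lo, mid, tmp);                  /* solve subproblem 1 */
    merge_sort(a, mid, hi, tmp);                  /* solve subproblem 2 */
    int i = lo, j = mid, k = lo;                  /* merge */
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (size_t)(hi - lo) * sizeof(int));
}
```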
This example is revisited with more detail in the Fork/Join pattern in the Supporting Structures design space.
Matrix diagonalization
Dongarra and Sorensen ([DS87]) describe a parallel algorithm for diagonalizing (computing the eigenvectors and eigenvalues of) a symmetric tridiagonal matrix T. The problem is to find a matrix Q such that Q^T T Q is diagonal; the divide-and-conquer strategy goes as follows (omitting the mathematical details).
The base case is a small matrix which is diagonalized sequentially.
The split phase consists of finding matrix T' and vectors u, v, such that T = T' + uv^T, and T' has the block-diagonal form

    T' = [ T1   0 ]
         [ 0   T2 ]

where T1 and T2 are symmetric tridiagonal matrices (which can be diagonalized by recursive calls to the same procedure).
The merge phase recombines the diagonalizations of T1 and T2 into a diagonalization of T.
Details can be found in [DS87] or in [GL96].
Known uses
Any introductory algorithms text will have many examples of algorithms based on the divide-and-conquer strategy, most of which can be parallelized with this pattern.
Some algorithms frequently parallelized with this strategy include the Barnes-Hut [BH86] and Fast Multipole [GG90] algorithms used in N-body simulations; signal-processing algorithms, such as discrete Fourier transforms; algorithms for banded and tridiagonal linear systems, such as those found in the ScaLAPACK package [CD97, Sca]; and algorithms from computational geometry, such as convex hull and nearest neighbor.
A particularly rich source of problems that use the Divide and Conquer pattern is the FLAME project [GGHvdG01]. This is an ambitious project to recast linear algebra problems in recursive algorithms. The motivation is twofold. First, mathematically, these algorithms are naturally recursive; in fact, most pedagogical discussions of these algorithms are recursive. Second, these recursive algorithms have proven to be particularly effective at producing code that is both portable and highly optimized for the cache architectures of modern microprocessors.
Related Patterns
Just because an algorithm is based on a sequential divide-and-conquer strategy does not mean that it must be parallelized with the Divide and Conquer pattern. A hallmark of this pattern is the recursive arrangement of the tasks, leading to a varying amount of concurrency and potentially high overheads on machines for which managing the recursion is expensive. If the recursive decomposition into subproblems can be reused, however, it might be more effective to do the recursive decomposition, and then use some other pattern (such as the Geometric Decomposition pattern or the Task Parallelism pattern) for the actual computation. For example, the first production-level molecular dynamics program to use the fast multipole method, PMD [Win95], used the Geometric Decomposition pattern to parallelize the fast multipole algorithm, even though the original fast multipole algorithm used divide and conquer. This worked because the multipole computation was carried out many times for each configuration of atoms.
tasks, where each task represents the update of one chunk, and the tasks execute concurrently. If the computations are strictly local, that is, all required information is within the chunk, the concurrency is embarrassingly parallel and the simpler Task Parallelism pattern should be used. In many cases, however, the update requires information from points in other chunks (frequently from what we can call neighboring chunks: chunks containing data that was nearby in the original global data structure). In these cases, information must be shared between chunks to complete the update.
Example: mesh-computation program
The problem is to model 1D heat diffusion (that is, diffusion of heat along an infinitely narrow pipe). Initially, the whole pipe is at a stable and fixed temperature. At time 0, we set both ends to different temperatures, which will remain fixed throughout the computation. We then calculate how temperatures change in the rest of the pipe over time. (What we expect is that the temperatures will converge to a smooth gradient from one end of the pipe to the other.) Mathematically, the problem is to solve a 1D differential equation representing heat diffusion:

Equation 4.2

    ∂U/∂t = ∂²U/∂x²

The approach used is to discretize the problem space (representing U by a one-dimensional array and computing values for a sequence of discrete time steps). We will output values for each time step as they are computed, so we need only save values for U for two time steps; we will call these arrays uk (U at the time step k) and ukp1 (U at time step k+1). At each time step, we then need to compute for each point in array ukp1 the following:
ukp1[i] = uk[i] + (dt/(dx*dx)) * (uk[i+1] - 2*uk[i] + uk[i-1]);
Variables dt and dx represent the intervals between discrete time steps and between discrete points, respectively.
Observe that what is being computed is a new value for variable ukp1 at each point, based on data at that point and its left and right neighbors.
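The update rule can be exercised directly; the C sketch below applies one time step to a small array (the mesh size and the value used for dt/(dx*dx) are illustrative only):

```c
#include <assert.h>

#define NX 8           /* illustrative number of mesh points */
#define COEFF 0.25     /* illustrative value of dt/(dx*dx) */

/* Apply one time step of the discretized heat equation to the interior
   points of the mesh; the two endpoints hold the fixed boundary
   temperatures and are copied through unchanged. */
void heat_step(const double *uk, double *ukp1) {
    ukp1[0] = uk[0];            /* fixed ends */
    ukp1[NX - 1] = uk[NX - 1];
    for (int i = 1; i < NX - 1; i++)
        ukp1[i] = uk[i] + COEFF * (uk[i + 1] - 2 * uk[i] + uk[i - 1]);
}
```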
We can begin to design a parallel algorithm for this problem by decomposing the arrays uk and ukp1 into contiguous subarrays (the chunks described earlier). These chunks can be operated on concurrently, giving us exploitable concurrency. Notice that we have a situation in which some elements can be updated using only data from within the chunk, while others require data from neighboring chunks, as illustrated by Fig. 4.8.
Figure 4.8. Data dependencies in the heat-equation problem. Solid boxes indicate the
element being updated; shaded boxes the elements containing needed data.
Consider the multiplication of two square matrices (that is, compute C = AB). As discussed in [FJL+88], the matrices can be decomposed into blocks. The summations in the definition of matrix multiplication are likewise organized into blocks, allowing us to write a blockwise matrix multiplication equation

Equation 4.3

    Cij = Σk Aik Bkj

where at each step in the summation, we compute the matrix product Aik Bkj and add it to the running matrix sum.
This equation immediately implies a solution in terms of the Geometric Decomposition pattern; that is, one in which the algorithm is based on decomposing the data structure into chunks (square blocks here) that can be operated on concurrently.
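A serial C sketch of this blockwise summation for a small case (the matrix dimension and block size are illustrative; in the Geometric Decomposition version each (bi, bj) block of C is a separate concurrent task):

```c
#include <assert.h>

#define N 4    /* illustrative matrix dimension */
#define BS 2   /* illustrative block size (must divide N) */

/* Blockwise matrix multiplication: for each block (bi, bj) of C, sum the
   block products A[bi][bk] * B[bk][bj] over bk, accumulating into the
   running block sum. The (bi, bj) blocks are independent units of work. */
void block_matmul(const double a[N][N], const double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0.0;
    for (int bi = 0; bi < N; bi += BS)
        for (int bj = 0; bj < N; bj += BS)
            for (int bk = 0; bk < N; bk += BS)   /* steps of the block sum */
                for (int i = bi; i < bi + BS; i++)
                    for (int j = bj; j < bj + BS; j++)
                        for (int k = bk; k < bk + BS; k++)
                            c[i][j] += a[i][k] * b[k][j];
}
```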
To help visualize this algorithm more clearly, consider the case where we decompose all three matrices into square blocks with each task "owning" corresponding blocks of A, B, and C. Each task will run through the sum over k to compute its block of C, with tasks receiving blocks from other tasks as needed. In Fig. 4.9, we illustrate two steps in this process showing a block being updated (the solid block) and the matrix blocks required at two different steps (the shaded blocks), where blocks of the A matrix are passed across a row and blocks of the B matrix are passed around a column.
Figure 4.9. Data dependencies in the matrix-multiplication problem. Solid boxes
indicate the "chunk" being updated (C); shaded boxes indicate the chunks of A (row)
and B (column) required to update C at each of the two steps.
Forces
To exploit the potential concurrency in the problem, we must assign chunks of the decomposed data structure to UEs. Ideally, we want to do this in a way that is simple, portable, scalable, and efficient. As noted in Section 4.1, however, these goals may conflict. A key consideration is balancing the load, that is, ensuring that all UEs have roughly the same amount of work to do.
We must also ensure that the data required for the update of each chunk is present when needed. This problem is somewhat analogous to the problem of managing data dependencies in the Task Parallelism pattern, and again the design must keep in mind the sometimes conflicting goals of simplicity, portability, scalability, and efficiency.
Solution
Designs for problems that fit this pattern involve the following key elements: partitioning the global data structure into substructures or "chunks" (the data decomposition), ensuring that each task has access to all the data it needs to perform the update operation for its chunk (the exchange operation), updating the chunks (the update operation), and mapping chunks to UEs in a way that gives good performance (the data distribution and task schedule).
Data decomposition
The granularity of the data decomposition has a significant impact on the efficiency of the program. In a coarse-grained decomposition, there are a smaller number of large chunks. This results in a smaller number of large messages, which can greatly reduce communication overhead. A fine-grained decomposition, on the other hand, results in a larger number of smaller chunks, in many cases leading to many more chunks than PEs. This results in a larger number of smaller messages (and hence increases communication overhead), but it greatly facilitates load balancing.
Although it might be possible in some cases to mathematically derive an optimum granularity for the data decomposition, programmers usually experiment with a range of chunk sizes to empirically determine the best size for a given system. This depends, of course, on the computational performance of the PEs and on the performance characteristics of the communication network. Therefore, the program should be implemented so that the granularity is controlled by parameters that can be easily changed at compile or runtime.
The shape of the chunks can also affect the amount of communication needed between tasks. Often, the data to share between tasks is limited to the boundaries of the chunks. In this case, the amount of shared information scales with the surface area of the chunks. Because the computation scales with the number of points within a chunk, it scales as the volume of the region. This surface-to-volume effect can be exploited to maximize the ratio of computation to communication. Therefore, higher-dimensional decompositions are usually preferred. For example, consider two different decompositions of an N by N matrix into four chunks. In one case, we decompose the problem into four column chunks of size N by N/4. In the second case, we decompose the problem into four square chunks of size N/2 by N/2. For the column-block decomposition, the surface area is 2N + 2(N/4), or 5N/2. For the square-chunk case, the surface area is 4(N/2), or 2N. Hence, the total amount of data that must be exchanged is less for the square-chunk decomposition.
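The comparison can be checked numerically with a small helper (N = 8 below is an illustrative matrix dimension):

```c
#include <assert.h>

/* Perimeter (the 2D "surface area") of a rows-by-cols chunk: the boundary
   data a chunk may need to exchange scales with this, while its
   computation scales with rows * cols (the chunk's "volume"). */
int chunk_perimeter(int rows, int cols) {
    return 2 * rows + 2 * cols;
}
```

For N = 8, a column chunk is 8 by 2 with perimeter 20 (that is, 5N/2), while a square chunk is 4 by 4 with perimeter 16 (that is, 2N), confirming that the square decomposition exchanges less data.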
In some cases, the preferred shape of the decomposition can be dictated by other concerns. It may be the case, for example, that existing sequential code can be more easily reused with a lower-dimensional decomposition, and the potential increase in performance is not worth the effort of reworking the code. Also, an instance of this pattern can be used as a sequential step in a larger computation. If the decomposition used in an adjacent step differs from the optimal one for this pattern in isolation, it may or may not be worthwhile to redistribute the data for this step. This is especially an issue in distributed-memory systems, where redistributing the data can require significant communication that will delay the computation. Therefore, data decomposition decisions must take into account the capability to reuse sequential code and the need to interface with other steps in the computation. Notice that these considerations might lead to a decomposition that would be suboptimal under other circumstances.
Communication can often be more effectively managed by replicating the nonlocal data needed to
update the data in a chunk. For example, if the data structure is an array representing the points on a
mesh and the update operation uses a local neighborhood of points on the mesh, a common
communication-management technique is to surround the data structure for the block with a ghost
boundary to contain duplicates of data at the boundaries of neighboring blocks. So now each chunk
has two parts: a primary copy owned by the UE (that will be updated directly) and zero or more ghost
copies (also referred to as shadow copies). These ghost copies provide two benefits. First, their use
may consolidate communication into potentially fewer, larger messages. On latency-sensitive
networks, this can greatly reduce communication overhead. Second, communication of the ghost
copies can be overlapped (that is, it can be done concurrently) with the update of parts of the array
that don't depend on data within the ghost copy. In essence, this hides the communication cost behind
useful computation, thereby reducing the observed communication overhead.
For example, in the case of the mesh-computation example discussed earlier, each of the chunks
would be extended by one cell on each side. These extra cells would be used as ghost copies of the
cells on the boundaries of the chunks. Fig. 4.10 illustrates this scheme.
Figure 4.10. A data distribution with ghost boundaries. Shaded cells are ghost copies;
arrows point from primary copies to corresponding secondary copies.
A key factor in using this pattern correctly is ensuring that nonlocal data required for the update
operation is obtained before it is needed.
If all the data needed is present before the beginning of the update operation, the simplest approach is
to perform the entire exchange before beginning the update, storing the required nonlocal data in a
local data structure designed for that purpose (for example, the ghost boundary in a mesh
computation). This approach is relatively straightforward to implement using either copying or
message passing.
More sophisticated approaches in which computation and communication overlap are also possible.
Such approaches are necessary if some data needed for the update is not initially available, and may
improve performance in other cases as well. For example, in the example of a mesh computation, the
exchange of ghost cells and the update of cells in the interior region (which do not depend on the
ghost cells) can proceed concurrently. After the exchange is complete, the boundary layer (the values
that do depend on the ghost cells) can be updated. On systems where communication and computation
occur in parallel, the savings from such an approach can be significant. This is such a common feature
of parallel algorithms that standard communication APIs (such as MPI) include whole classes of
message-passing routines to overlap computation and communication. These are discussed in more
detail in the MPI appendix.
The low-level details of how the exchange operation is implemented can have a large impact on
efficiency. Programmers should seek out optimized implementations of communication patterns used
in their programs. In many applications, for example, the collective communication routines in
message-passing libraries such as MPI are useful. These have been carefully optimized using
techniques beyond the ability of many parallel programmers (we discuss some of these in Sec. 6.4.2)
and should be used whenever possible.
The update operation
Updating the data structure is done by executing the corresponding tasks (each responsible for the
update of one chunk of the data structure) concurrently. If all the needed data is present at the
beginning of the update operation, and if none of this data is modified during the course of the
update, parallelization is easier and more likely to be efficient.
If the required exchange of information has been performed before beginning the update operation,
the update itself is usually straightforward to implement; it is essentially identical to the analogous
update in an equivalent sequential program, particularly if good choices have been made about how to
represent nonlocal data. If the exchange and update operations overlap, more care is needed to ensure
that the update is performed correctly. If a system supports lightweight threads that are well integrated
with the communication system, then overlap can be achieved via multithreading within a single task,
with one thread computing while another handles communication. In this case, synchronization
between the threads is required.
In some systems, for example MPI, nonblocking communication is supported by matching
communication primitives: one to start the communication (without blocking), and the other
(blocking) to complete the operation and use the results. For maximal overlap, communication should
be started as soon as possible and completed as late as possible. Sometimes, operations can be
reordered to allow more overlap without changing the algorithm semantics.
Data distribution and task scheduling
The final step in designing a parallel algorithm for a problem that fits this pattern is deciding how to
map the collection of tasks (each corresponding to the update of one chunk) to UEs. Each UE can
then be said to "own" a collection of chunks and the data they contain. Thus, we have a two-tiered
scheme for distributing data among UEs: partitioning the data into chunks and then assigning these
chunks to UEs. This scheme is flexible enough to represent a variety of popular schemes for
distributing data among UEs.
In the simplest case, each task can be statically assigned to a separate UE; then all tasks can execute
concurrently, and the intertask coordination needed to implement the exchange operation is
straightforward. This approach is most appropriate when the computation times of the tasks are
uniform and the exchange operation has been implemented to overlap communication and
computation within each task.
This simple approach can lead to poor load balance in some situations, however. For example, consider
a linear algebra problem in which elements of the matrix are successively eliminated as the
computation proceeds. Early in the computation, all the rows and columns of the matrix have
numerous elements to work with, and decompositions based on assigning full rows or columns to UEs
are effective. Later in the computation, however, rows or columns become sparse, the work per row
becomes uneven, and the computational load becomes poorly balanced between UEs. The solution is
to decompose the problem into many more chunks than there are UEs and to scatter them among the
UEs with a cyclic or block-cyclic distribution. (Cyclic and block-cyclic distributions are discussed in
the Distributed Array pattern.) Then, as chunks become sparse, there are (with high probability) other
nonsparse chunks for any given UE to work on, and the load becomes well balanced. A rule of thumb
is that one needs around ten times as many tasks as UEs for this approach to work well.
It is also possible to use dynamic load-balancing algorithms to periodically redistribute the chunks
among the UEs to improve the load balance. These incur overhead that must be traded off against the
improvement likely to occur from the improved load balance, as well as increased implementation costs. In
addition, the resulting program is more complex than those that use one of the static methods.
Generally, one should consider the (static) cyclic allocation strategy first.
Program structure
The overall program structure for applications of this pattern will normally use either the Loop
Parallelism pattern or the SPMD pattern, with the choice determined largely by the target platform.
These patterns are described in the Supporting Structures design space.
Examples
We include two examples with this pattern: a mesh computation and matrix multiplication. The
challenges in working with the Geometric Decomposition pattern are best appreciated in the low-level
details of the resulting programs. Therefore, even though the techniques used in these programs are
not fully developed until much later in the book, we provide full programs in this section rather than
high-level descriptions of the solutions.
Mesh computation
This problem is described in the Context section of this pattern. Fig. 4.11 presents a simple sequential
version of a program (some details omitted) that solves the 1D heat-diffusion problem. The program
associated with this problem is straightforward, although one detail might need further explanation:
after computing new values in ukp1 at each step, conceptually what we want to do is copy them to
uk for the next iteration. We avoid a time-consuming actual copy by making uk and ukp1 pointers to
their respective arrays and simply swapping them at the end of each step. This causes uk to point to
the newly computed values and ukp1 to point to the area to use for computing new values in the next
iteration.
Figure 4.11. Sequential heat-diffusion program
#include <stdio.h>
#include <stdlib.h>
#define NX 100
This program combines a top-level sequential control structure (the time-step loop) with an array-update
operation, which can be parallelized using the Geometric Decomposition pattern. We show
parallel implementations of this program using OpenMP and MPI.
OpenMP solution
A particularly simple version of the program using OpenMP and the Loop Parallelism pattern is
shown in Fig. 4.12. Because OpenMP is a shared-memory programming model, there is no need to
explicitly partition and distribute the two key arrays (uk and ukp1). The creation of the threads and
distribution of the work among the threads are accomplished with the parallel for directive:
#pragma omp parallel for schedule(static)
The schedule(static) clause decomposes the iterations of the parallel loop into one
contiguous block per thread, with each block being approximately the same size. This schedule is
important for Loop Parallelism programs implementing the Geometric Decomposition pattern. Good
performance for most Geometric Decomposition problems (and mesh programs in particular) requires
that the data in the processor's cache be used many times before it is displaced by data from new cache
lines. Using large blocks of contiguous loop iterations increases the chance that multiple values
fetched in a cache line will be utilized and that subsequent loop iterations are likely to find at least
some of the required data in cache.
The last detail to discuss for the program in Fig. 4.12 is the synchronization required to safely copy the
pointers. It is essential that all of the threads complete their work before the pointers they manipulate
are swapped in preparation for the next iteration. In this program, this synchronization happens
automatically due to the implied barrier (see Sec. 6.3.2) at the end of the parallel loop.
The program in Fig. 4.12 works well with a small number of threads. When large numbers of threads
are involved, however, the overhead incurred by placing the thread creation and destruction inside the
loop over k would be prohibitive. We can reduce thread-management overhead by splitting the
parallel for directive into separate parallel and for directives and moving the thread
creation outside the loop over k. This approach is shown in Fig. 4.13. Because the whole k loop is
now inside a parallel region, we must be more careful about how data is shared between threads. The
private clause causes the loop indices k and i to be local to each thread. The pointers uk and
ukp1 are shared, however, so the swap operation must be protected. The easiest way to do this is to
ensure that only one member of the team of threads does the swap. In OpenMP, this is most easily
done by placing the update inside a single construct. As described in more detail in the OpenMP
appendix, Appendix A, the first thread to encounter the construct will carry out the swap while the
other threads wait at the end of the construct.
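The structure just described can be sketched as follows. This is an illustration of the parallel/for/single arrangement, not the book's Fig. 4.13, and update_all_steps and dtdxsq are our own names; compiled without OpenMP support, the pragmas are ignored and the code runs serially with the same result.

```c
/* One parallel region around the whole time-step loop, a work-shared
   for, and a single construct so that exactly one thread swaps the
   shared pointers. uk, ukp1, and temp are shared; k and i are private. */
void update_all_steps(double *uk, double *ukp1, int nx, int nsteps,
                      double dtdxsq) {
    double *temp;
    int i, k;
    #pragma omp parallel private(k, i)
    for (k = 0; k < nsteps; ++k) {
        #pragma omp for schedule(static)
        for (i = 1; i < nx - 1; ++i)
            ukp1[i] = uk[i] + dtdxsq * (uk[i+1] - 2*uk[i] + uk[i-1]);
        /* implied barrier here: all updates finish before the swap */
        #pragma omp single
        { temp = ukp1; ukp1 = uk; uk = temp; }  /* one thread swaps;
                                                   implied barrier follows */
    }
}
```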
Figure 4.12. Parallel heat-diffusion program using OpenMP
#include <stdio.h>
#include <stdlib.h>
#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000
void initialize(double uk[], double ukp1[]) {
uk[0] = LEFTVAL; uk[NX-1] = RIGHTVAL;
for (int i = 1; i < NX-1; ++i)
uk[i] = 0.0;
for (int i=0; i < NX; ++i)
ukp1[i] = uk[i];
}
void printValues(double uk[], int step) { /* NOT SHOWN */ }
int main(void) {
MPI solution
An MPI-based program for this example is shown in Figs. 4.14 and 4.15. The approach used in this
program uses a data distribution with ghost cells and the SPMD pattern.
Each process is given a single chunk of the data domain of size NX/NP, where NX is the total size of
the global data array and NP is the number of processes. For simplicity, we assume NX is evenly
divided by NP.
The update of the chunk is straightforward and essentially identical to that from the sequential code.
The length and greater complexity of this MPI program arise from two sources. First, the data
initialization is more complex, because it must account for the data values at the edges of the first and
last chunks. Second, message-passing routines are required inside the loop over k to exchange ghost
cells.
Figure 4.13. Parallel heat-diffusion program using OpenMP. This version has less
thread-management overhead.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NX 100
#define LEFTVAL 1.0
The details of the message-passing functions can be found in the MPI appendix, Appendix B. Briefly,
transmitting data consists of one process doing a send operation, specifying the buffer containing the
data, and another process doing a receive operation, specifying the buffer into which the data should
be placed. We need several different pairs of sends and receives because the process that owns the
leftmost chunk of the array does not have a left neighbor it needs to communicate with, and similarly
the process that owns the rightmost chunk does not have a right neighbor to communicate with.
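Within a single address space, the effect of these send/receive pairs can be sketched without MPI at all: the ghost cell at each end of a chunk is simply filled from the neighboring chunk. The function below (our own, for illustration only) captures the data movement that Figs. 4.14 and 4.15 express with messages.

```c
/* Each chunk stores numPoints interior cells at positions 1..numPoints,
   with ghost cells at index 0 and numPoints+1. The MPI send/receive
   pairs are simulated by direct copies between chunks laid out
   contiguously in one array; chunk c's left neighbor is chunk c-1. */
void exchange_ghosts(double *chunks, int numChunks, int numPoints) {
    int stride = numPoints + 2;              /* chunk length incl. ghosts */
    for (int c = 0; c < numChunks; ++c) {
        double *me = chunks + c * stride;
        if (c > 0)                           /* fill my left ghost cell */
            me[0] = (me - stride)[numPoints];
        if (c < numChunks - 1)               /* fill my right ghost cell */
            me[numPoints + 1] = (me + stride)[1];
    }
}
```

As in the MPI version, the first chunk has no left neighbor and the last has no right neighbor, so those ghost cells are left untouched.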
Figure 4.14. Parallel heat-diffusion program using MPI (continued in Fig. 4.15)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>
#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000
void initialize(double uk[], double ukp1[], int numPoints,
We could further modify the code in Figs. 4.14 and 4.15 to use nonblocking communication to overlap
computation and communication, as discussed earlier in this pattern. The first part of the program is
unchanged from our first mesh-computation MPI program (that is, Fig. 4.14). The differences for this
case are contained in the second part of the program, containing the main computation loop. This code
is shown in Fig. 4.16.
Figure 4.15. Parallel heat-diffusion program using MPI (continued from Fig. 4.14)
/* continued from Figure 4.14 */
for (int k = 0; k < NSTEPS; ++k) {
While the basic algorithm is the same, the communication is quite different. The immediate-mode
communication routines, MPI_Isend and MPI_Irecv, are used to set up and then launch the
communication events. These functions (described in more detail in the MPI appendix, Appendix B)
return immediately. The update operations on the interior points can then take place because they
don't depend on the results of the communication. We then call functions to wait until the
communication is complete and update the edges of each UE's chunk using the results of the
communication events. In this case, the messages are small in size, so it is unlikely that this version of
the program would be any faster than our first one. But it is easy to imagine cases where large,
complex communication events would be involved and being able to do useful work while the
messages move across the computer network would result in significantly greater performance.
/* continued */
MPI_Request reqRecvL, reqRecvR, reqSendL, reqSendR; // needed for
// nonblocking I/O
for (int k = 0; k < NSTEPS; ++k) {
/* initiate communication to exchange boundary information */
if (myID != 0) {
MPI_Irecv(&uk[0], 1, MPI_DOUBLE, leftNbr, 0,
MPI_COMM_WORLD, &reqRecvL);
MPI_Isend(&uk[1], 1, MPI_DOUBLE, leftNbr, 0,
MPI_COMM_WORLD, &reqSendL);
}
if (myID != numProcs-1) {
MPI_Irecv(&uk[numPoints+1],1, MPI_DOUBLE, rightNbr, 0,
MPI_COMM_WORLD, &reqRecvR);
MPI_Isend(&uk[numPoints], 1, MPI_DOUBLE, rightNbr, 0,
MPI_COMM_WORLD, &reqSendR);
}
/* compute new values for interior points */
for (int i = 2; i < numPoints; ++i) {
ukp1[i]=uk[i]+ (dt/(dx*dx))*(uk[i+1]-2*uk[i]+uk[i-1]);
}
/* wait for communication to complete */
if (myID != 0) {
MPI_Wait(&reqRecvL, &status); MPI_Wait(&reqSendL, &status);
}
if (myID != numProcs-1) {
MPI_Wait(&reqRecvR, &status); MPI_Wait(&reqSendR, &status);
}
/* compute new values for boundary points */
if (myID != 0) {
int i=1;
ukp1[i]=uk[i]+ (dt/(dx*dx))*(uk[i+1]-2*uk[i]+uk[i-1]);
}
if (myID != numProcs-1) {
int i=numPoints;
ukp1[i]=uk[i]+ (dt/(dx*dx))*(uk[i+1]-2*uk[i]+uk[i-1]);
}
/* "copy" ukp1 to uk by swapping pointers */
temp = ukp1; ukp1 = uk; uk = temp;
printValues(uk, k, numPoints, myID);
}
/* clean up and end */
MPI_Finalize();
return 0;
}
#include <stdio.h>
#include <stdlib.h>
#define N 100
#define NB 4
#define blockstart(M,i,j,rows_per_blk,cols_per_blk,stride) \
(M + ((i)*(rows_per_blk))*(stride) + (j)*(cols_per_blk))
int main(int argc, char *argv[]) {
/* matrix dimensions */
int dimN = N; int dimP = N; int dimM = N;
/* block dimensions */
int dimNb = dimN/NB; int dimPb = dimP/NB; int dimMb = dimM/NB;
/* allocate space for matrices */
double *A = malloc(dimN*dimP*sizeof(double));
double *B = malloc(dimP*dimM*sizeof(double));
double *C = malloc(dimN*dimM*sizeof(double));
/* Initialize matrices */
initialize(A, B, dimN, dimP, dimM);
/* Do the matrix multiplication */
for (int ib=0; ib < NB; ++ib) {
for (int jb=0; jb < NB; ++jb) {
/* find block[ib][jb] of C */
double * blockPtr = blockstart(C, ib, jb, dimNb, dimMb, dimM);
/* clear block[ib][jb] of C (set all elements to zero) */
matclear(blockPtr, dimNb, dimMb, dimM);
for (int kb=0; kb < NB; ++kb) {
/* compute product of block[ib][kb] of A and
block[kb][jb] of B and add to block[ib][jb] of C */
matmul_add(blockstart(A, ib, kb, dimNb, dimPb, dimP),
blockstart(B, kb, jb, dimPb, dimMb, dimM),
blockPtr, dimNb, dimPb, dimMb, dimP, dimM, dimM);
}
}
/* Code to print results not shown */
}
return 0;
}
Matrix multiplication
The matrix multiplication problem is described in the Context section. Fig. 4.17 presents a simple
sequential program to compute the desired result, based on decomposing the N by N matrix into
NB*NB square blocks. The notation block[i][j] in comments indicates the (i,j)-th block as
described earlier. To simplify the coding in C, we represent the matrices as 1D arrays (internally
arranged in row-major order) and define a macro blockstart to find the top left corner of a
submatrix within one of these 1D arrays. We omit code for the functions initialize (initialize
matrices A and B), printMatrix (print a matrix's values), matclear (clear a matrix, setting all
values to zero), and matmul_add (compute the matrix product of the two input matrices and add it
to the output matrix). Parameters to most of these functions include matrix dimensions, plus a stride
that denotes the distance from the start of one row of the matrix to the start of the next and allows us
to apply the functions to submatrices as well as to whole matrices.
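For concreteness, here is one plausible implementation of the omitted matmul_add helper, with the signature inferred from the calls in Fig. 4.17 (so the parameter order is an assumption, not the book's code). Each block is addressed within its larger row-major array via its own stride.

```c
/* Multiply an n-by-p block of A by a p-by-m block of B and accumulate
   into an n-by-m block of C. Each block may be a submatrix of a larger
   row-major array, located via its own stride (distance between the
   starts of consecutive rows in the enclosing array). */
void matmul_add(const double *A, const double *B, double *C,
                int n, int p, int m,
                int strideA, int strideB, int strideC) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j) {
            double sum = 0.0;
            for (int k = 0; k < p; ++k)
                sum += A[i*strideA + k] * B[k*strideB + j];
            C[i*strideC + j] += sum;     /* accumulate, don't overwrite */
        }
}
```

For whole matrices, the stride simply equals the number of columns; for a block carved out by blockstart, it is the column count of the enclosing matrix.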
Figure 4.18. Sequential matrix multiplication, revised. We do not show the parts of
the program that are not changed from the program in Fig. 4.17.
/* Declarations, initializations, etc. not shown -- same as
first version */
/* Do the multiply */
matclear(C, dimN, dimM, dimM); /* sets all elements to zero */
for (int kb=0; kb < NB; ++kb) {
We first observe that we can rearrange the loops without affecting the result of the computation, as
shown in Fig. 4.18.
Observe that with this transformation, we have a program that combines a high-level sequential
structure (the loop over kb) with a loop structure (the nested loops over ib and jb) that can be
parallelized with the Geometric Decomposition pattern.
OpenMP solution
We can produce a parallel version of this program for a shared-memory environment by parallelizing
the inner nested loops (over ib and/or jb) with OpenMP loop directives. As with the mesh example,
it is important to keep thread-management overhead small, so once again the parallel directive
should appear outside of the loop over kb. A for directive would then be placed prior to one of the
inner loops. The issues raised by this algorithm and the resulting source-code modifications are
essentially the same as those arising from the mesh program example, so we do not show program
source code here.
MPI solution
A parallel version of the matrix multiplication program using MPI is shown in Figs. 4.19 and 4.20. The
natural approach with MPI is to use the SPMD pattern with the Geometric Decomposition pattern.
We will use the matrix multiplication algorithm described earlier.
Figure 4.19. Parallel matrix multiplication with message passing (continued in Fig.
4.20)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <mpi.h>
#define N 100
#define blockstart(M,i,j,rows_per_blk,cols_per_blk,stride) \
(M + ((i)*(rows_per_blk))*(stride) + (j)*(cols_per_blk))
int main(int argc, char *argv[]) {
/* matrix dimensions */
int dimN = N; int dimP = N; int dimM = N;
/* block dimensions */
int dimNb, dimPb, dimMb;
/* matrices */
double *A, *B, *C;
/* buffers for receiving sections of A, B from other processes */
double *Abuffer, *Bbuffer;
int numProcs, myID, myID_i, myID_j, NB;
MPI_Status status;
/* MPI initialization */
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
MPI_Comm_rank(MPI_COMM_WORLD, &myID);
/* initialize other variables */
NB = (int) sqrt((double) numProcs);
myID_i = myID / NB;
myID_j = myID % NB;
dimNb = dimN/NB; dimPb = dimP/NB; dimMb = dimM/NB;
A = malloc(dimNb*dimPb*sizeof(double));
B = malloc(dimPb*dimMb*sizeof(double));
C = malloc(dimNb*dimMb*sizeof(double));
Abuffer = malloc(dimNb*dimPb*sizeof(double));
Bbuffer = malloc(dimPb*dimMb*sizeof(double));
/* Initialize matrices */
initialize(A, B, dimNb, dimPb, dimPb, dimPb, NB, myID_i, myID_j);
The three matrices (A, B, and C) are decomposed into blocks. The UEs (processes in the case of MPI)
involved in the computation are organized into a grid such that the indices of the matrix blocks map
onto the coordinates of the processes (that is, matrix block (i,j) is associated with the process with
row index i and column index j). For simplicity, we assume the number of processes numProcs is a
perfect square and its square root evenly divides the order of the matrices (N).
Figure 4.20. Parallel matrix multiplication with message-passing (continued from Fig.
4.19)
/* continued from previous figure */
/* Do the multiply */
matclear(C, dimNb, dimMb, dimMb);
for (int kb=0; kb < NB; ++kb) {
if (myID_j == kb) {
/* send A to other processes in the same "row" */
for (int jb=0; jb < NB; ++jb) {
if (jb != myID_j)
MPI_Send(A, dimNb*dimPb, MPI_DOUBLE,
myID_i*NB + jb, 0, MPI_COMM_WORLD);
}
/* copy A to Abuffer */
memcpy(Abuffer, A, dimNb*dimPb*sizeof(double));
}
else {
MPI_Recv(Abuffer, dimNb*dimPb, MPI_DOUBLE,
myID_i*NB + kb, 0, MPI_COMM_WORLD, &status);
}
if (myID_i == kb) {
/* send B to other processes in the same "column" */
for (int ib=0; ib < NB; ++ib) {
if (ib != myID_i)
MPI_Send(B, dimPb*dimMb, MPI_DOUBLE,
ib*NB + myID_j, 0, MPI_COMM_WORLD);
}
/* copy B to Bbuffer */
memcpy(Bbuffer, B, dimPb*dimMb*sizeof(double));
}
else {
MPI_Recv(Bbuffer, dimPb*dimMb, MPI_DOUBLE,
kb*NB + myID_j, 0, MPI_COMM_WORLD, &status);
}
/* compute product of block[ib][kb] of A and
block[kb][jb] of B and add to block[ib][jb] of C */
matmul_add(Abuffer, Bbuffer, C,
dimNb, dimPb, dimMb, dimPb, dimMb, dimMb);
}
/* Code to print results not shown */
/* Clean up and end */
MPI_Finalize();
return 0;
}
Although the algorithm may seem complex at first, the overall idea is straightforward. The
computation proceeds through a number of phases (the loop over kb). At each phase, the process
whose column index equals the kb index sends its block of A across the row of processes. Likewise, the
process whose row index equals kb sends its block of B along the column of processes.
Following the communication operations, each process then multiplies the A and B blocks it received
and sums the result into its block of C. After NB phases, the block of the C matrix on each process will
hold the final product.
These types of algorithms are very common when working with MPI. The key to understanding these
algorithms is to think in terms of the set of processes, the data owned by each process, and how data
from neighboring processes flows among the processes as the calculation unfolds. We revisit these
issues in the SPMD and Distributed Array patterns as well as in the MPI appendix.
A great deal of research has been carried out on parallel matrix multiplication and related linear
algebra algorithms. A more sophisticated approach, in which the blocks of A and B circulate among
processes, arriving at each process just in time to be used, is given in [FJL+88].
Known uses
Most problems involving the solution of differential equations use the Geometric Decomposition
pattern. A finite-differencing scheme directly maps onto this pattern. Another class of problems that
use this pattern comes from computational linear algebra. The parallel routines in the ScaLAPACK
[Sca, BCC+97] library are for the most part based on this pattern. These two classes of problems cover
a large portion of all parallel applications in scientific computing.
Related Patterns
If the update required for each chunk can be done without data from other chunks, then this pattern
reduces to the embarrassingly parallel algorithm described in the Task Parallelism pattern. As an
example of such a computation, consider computing a 2D FFT (Fast Fourier Transform) by first
applying a 1D FFT to each row of the matrix and then applying a 1D FFT to each column. Although
the decomposition may appear data-based (by rows/by columns), in fact the computation consists of
two instances of the Task Parallelism pattern.
If the data structure to be distributed is recursive in nature, then the Divide and Conquer or Recursive
Data pattern may be applicable.
Figure 4.21. Finding roots in a forest. Solid lines represent the original parent-child
relationships among nodes; dashed lines point from nodes to their successors.
What we have done is transform the original sequential calculation (find roots for nodes one "hop"
from a root, then find roots for nodes two "hops" from a root, etc.) into a calculation that computes a
partial result (successor) for each node and then repeatedly combines these partial results, first with
neighboring results, then with results from nodes two hops away, then with results from nodes four
hops away, and so on. This strategy can be applied to other problems that at first appear unavoidably
sequential; the Examples section presents other examples. This technique is sometimes referred to as
pointer jumping or recursive doubling.
An interesting aspect of this restructuring is that the new algorithm involves substantially more total
work than the original sequential one (O(N log N) versus O(N)), but the restructured algorithm
contains potential concurrency that, if fully exploited, reduces total running time to O(log N) (versus
O(N)). Most strategies and algorithms based on this pattern similarly trade off an increase in total
work for a potential decrease in execution time. Notice also that the exploitable concurrency can be
extremely fine-grained (as in the previous example), which may limit the situations in which this
pattern yields an efficient algorithm. Nevertheless, the pattern can still serve as an inspiration for
lateral thinking about how to parallelize problems that at first glance appear to be inherently
sequential.
Forces
Recasting the problem to transform an inherently sequential traversal of the recursive data
structure into one that allows all elements to be operated upon concurrently does so at the cost
of increasing the total work of the computation. This must be balanced against the improved
performance available from running in parallel.
This recasting may be difficult to achieve (because it requires looking at the original problem
from an unusual perspective) and may lead to a design that is difficult to understand and
maintain.
Whether the concurrency exposed by this pattern can be effectively exploited to improve
performance depends on how computationally expensive the operation is and on the cost of
communication relative to computation on the target parallel computer system.
Solution
The most challenging part of applying this pattern is restructuring the operations over a recursive data
structure into a form that exposes additional concurrency. General guidelines are difficult to construct,
but the key ideas should be clear from the examples provided with this pattern.
After the concurrency has been exposed, it is not always the case that this concurrency can be
effectively exploited to speed up the solution of a problem. This depends on a number of factors,
including how much work is involved as each element of the recursive data structure is updated and on
the characteristics of the target parallel computer.
Data decomposition
In this pattern, the recursive data structure is completely decomposed into individual elements, and
each element is assigned to a separate UE. Ideally each UE would be assigned to a different PE, but it
is also possible to assign multiple UEs to each PE. If the number of UEs per PE is too large, however,
the overall performance will be poor because there will not be enough concurrency to overcome the
increase in the total amount of work.
For example, consider the root-finding problem described earlier. We'll ignore overhead in our
computations. If N = 1024 and t is the time to perform one step for one data element, then the running
time of a sequential algorithm will be about 1024t. If each UE is assigned its own PE, then the
running time of the parallel algorithm will be around (log N)t, or 10t. If only two PEs are available for
the parallel algorithm, however, then all N log N, or 10240, computation steps must be performed on
the two PEs, and the execution time will be at least 5120t, considerably more than the sequential
algorithm.
Structure
Typically the result of applying this pattern is an algorithm whose top-level structure is a sequential
composition in the form of a loop, in which each iteration can be described as "perform this operation
simultaneously on all (or selected) elements of the recursive data structure". Typical operations
include "replace each element's successor with its successor's successor" (as in the example in the
Context section) and "replace a value held at this element with the sum of the current value and the
value of the predecessor element".
Synchronization
Algorithms that fit this pattern are described in terms of simultaneously updating all elements of the
data structure. Some target platforms (for example, SIMD architectures such as the early Connection
Machines) make this trivial to accomplish by assigning each data element to a separate PE (possibly a
logical PE) and executing instructions in a lockstep fashion at each PE. MIMD platforms with the
right supporting programming environments (for example, High Performance Fortran [HPF97])
provide similar semantics.
If the target platform doesn't provide the required synchronization implicitly, it will be necessary to
introduce the synchronization explicitly. For example, if the operation performed during a loop
iteration contains the assignment
next[k] = next[next[k]]
then the parallel algorithm must ensure that next[k] is not updated before other UEs that need its
value for their computation have received it. One common technique is to introduce a new variable,
say next2, at each element. Even-numbered iterations then read next but update next2, while
odd-numbered iterations read next2 and update next. The necessary synchronization is
accomplished by placing a barrier (as described in the Implementation Mechanisms design space)
between each successive pair of iterations. Notice that this can substantially increase the overhead
associated with the parallel algorithm, which can overwhelm any speedup derived from the additional
concurrency. This is most likely to be a factor if the calculation required for each element is trivial
(which, alas, for many of the examples it is).
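The even/odd alternation between next and next2 can be sketched serially. In the illustration below (find_roots is our own name), the inner loop stands in for the "all k in parallel" step, and the comment marks where a barrier would separate rounds in a real parallel version. A node is a root when it is its own successor, so after ceil(log2 N) rounds every node points at its root.

```c
#include <string.h>

/* Double-buffered pointer jumping: rounds alternate between reading
   next/writing next2 and the reverse, so no element's old value is
   overwritten while another element still needs it. */
void find_roots(int *next, int n, int rounds) {
    int next2[64];                        /* scratch buffer; assumes n <= 64 */
    for (int r = 0; r < rounds; ++r) {
        int *src = (r % 2 == 0) ? next : next2;
        int *dst = (r % 2 == 0) ? next2 : next;
        for (int k = 0; k < n; ++k)       /* "for all k in parallel" */
            dst[k] = src[src[k]];         /* successor's successor */
        /* a barrier would separate rounds in a parallel version */
    }
    if (rounds % 2 == 1)                  /* result ended up in next2 */
        memcpy(next, next2, n * sizeof(int));
}
```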
If there are fewer PEs than data elements, the program designer must decide whether to assign each
data element to a UE and assign multiple UEs to each PE (thereby simulating some of the parallelism)
or whether to assign multiple data elements to each UE and process them serially. The latter is less
straightforward (requiring an approach similar to that sketched previously, in which variables involved
in the simultaneous update are duplicated), but can be more efficient.
Examples
Partial sums of a linked list
In this example, adapted from Hillis and Steele [HS86], the problem is to compute the prefix sums of
all the elements in a linked list in which each element contains a value x. In other words, after the
computation is complete, the first element will contain x0, the second will contain x0 + x1, the third x0
+ x1 + x2, etc.
Fig. 4.22 shows pseudocode for the basic algorithm. Fig. 4.23 shows the evolution of the computation,
where xi is the initial value of the (i+1)-th element in the list.
Figure 4.23. Steps in finding partial sums of a list. Straight arrows represent links
between elements; curved arrows indicate additions.
This example can be generalized by replacing addition with any associative operator and is sometimes
known as a prefix scan. It can be used in a variety of situations, including solving various types of
recurrence relations.
Known uses
Algorithms developed with this pattern are a type of data-parallel algorithm. They are widely used on
SIMD platforms and to a lesser extent in languages such as High Performance Fortran [HPF97].
These platforms support the fine-grained concurrency required for the pattern and handle
synchronization automatically because every computation step (logically if not physically) occurs in
lockstep on all the processors. Hillis and Steele [HS86] describe several interesting applications of
this pattern, including finding the end of a linked list, computing all partial sums of a linked list,
region labeling in two-dimensional images, and parsing.
Figure 4.22. Pseudocode for finding partial sums of a list
for all k in parallel
{
temp[k] = next[k];
while temp[k] != null
{
x[temp[k]] = x[k] + x[temp[k]];
temp[k] = temp[temp[k]];
}
}
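A serial, array-based rendering of this algorithm may make the rounds easier to follow. Here the scratch copy old plays the role of the lockstep semantics that keep simultaneous reads and writes from interfering, and the array index replaces the linked-list successor (prefix_sums is our own name, not from [HS86]).

```c
#include <string.h>

/* After round d (d = 1, 2, 4, ...), element i holds the sum of the
   2d nearest trailing inputs; after ceil(log2 n) rounds, x[i] holds
   x0 + x1 + ... + xi (an inclusive prefix sum). */
void prefix_sums(double *x, int n) {
    double old[64];                       /* scratch copy; assumes n <= 64 */
    for (int d = 1; d < n; d *= 2) {
        memcpy(old, x, n * sizeof(double));
        for (int i = d; i < n; ++i)       /* all i "in parallel" */
            x[i] = old[i] + old[i - d];
    }
}
```

Replacing + with any associative operator turns this into the general prefix scan mentioned below.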
In combinatorial optimization, problems involving traversing all nodes in a graph or tree can often be
solved with this pattern by first finding an ordering on the nodes to create a list. Euler tours and ear
decomposition [EG88] are well-known techniques to compute this ordering.
JáJá [J92] also describes several applications of this pattern: finding the roots of trees in a forest of
rooted directed trees, computing partial sums on a set of rooted directed trees (similar to the preceding
example with linked lists), and list ranking (determining for each element of the list its distance from
the start/end of the list).
Related Patterns
With respect to the actual concurrency, this pattern is very much like the Geometric Decomposition pattern, a difference being that in this pattern the data structure containing the elements to be operated on concurrently is recursive (at least conceptually). What makes it different is the emphasis on fundamentally rethinking the problem to expose fine-grained concurrency.
Instruction pipeline in modern CPUs. The stages (fetch instruction, decode, execute, etc.) are done in a pipelined fashion; while one instruction is being decoded, its predecessor is being executed and its successor is being fetched.
Vector processing (loop-level pipelining). Specialized hardware in some supercomputers allows operations on vectors to be performed in a pipelined fashion. Typically, a compiler is expected to recognize that a loop such as
for(i = 0; i < N; i++) { a[i] = b[i] + c[i]; }
can be vectorized in a way that the special hardware can exploit. After a short startup, one a[i] value will be generated each clock cycle.
Algorithm-level pipelining. Many algorithms can be formulated as recurrence relations and implemented using a pipeline or its higher-dimensional generalization, a systolic array. Such implementations often exploit specialized hardware for performance reasons.
Signal processing. Passing a stream of real-time sensor data through a sequence of filters can be modeled as a pipeline, with each filter corresponding to a stage in the pipeline.
Graphics. Processing a sequence of images by applying the same sequence of operations to each image can be modeled as a pipeline, with each operation corresponding to a pipeline stage. Some stages may be implemented by specialized hardware.
Shell programs in UNIX. For example, the shell command
cat sampleFile | grep "word" | wc
creates a three-stage pipeline, with one process for each command (cat, grep, and wc).
These examples and the assembly-line analogy have several aspects in common. All involve applying a sequence of operations (in the assembly-line case it is installing the engine, installing the windshield, etc.) to each element in a sequence of data elements (in the assembly line, the cars). Although there may be ordering constraints on the operations on a single data element (for example, it might be necessary to install the engine before installing the hood), it is possible to perform different operations on different data elements simultaneously (for example, one can install the engine on one car while installing the hood on another).
The possibility of simultaneously performing different operations on different data elements is the potential concurrency this pattern exploits. In terms of the analysis described in the Finding Concurrency patterns, each task consists of repeatedly applying an operation to a data element (analogous to an assembly-line worker installing a component), and the dependencies among tasks are ordering constraints enforcing the order in which operations must be performed on each data element (analogous to installing the engine before the hood).
Forces
A good solution should make it simple to express the ordering constraints. The ordering constraints in this problem are simple and regular and lend themselves to being expressed in terms of data flowing through a pipeline.
The target platform can include special-purpose hardware that can perform some of the desired operations.
In some applications, future additions, modifications, or reordering of the stages in the pipeline are expected.
In some applications, occasional items in the input sequence can contain errors that prevent their processing.
Solution
The key idea of this pattern is captured by the assembly-line analogy, namely that the potential concurrency can be exploited by assigning each operation (stage of the pipeline) to a different worker and having them work simultaneously, with the data elements passing from one worker to the next as operations are completed. In parallel-programming terms, the idea is to assign each task (stage of the pipeline) to a UE and provide a mechanism whereby each stage of the pipeline can send data elements to the next stage. This strategy is probably the most straightforward way to deal with this type of ordering constraint. It allows the application to take advantage of special-purpose hardware by appropriate mapping of pipeline stages to PEs and provides a reasonable mechanism for handling errors, described later. It also is likely to yield a modular design that can later be extended or modified.
Before going further, it may help to illustrate how the pipeline is supposed to operate. Let Ci represent a multistep computation on data element i. Ci(j) is the jth step of the computation. The idea is to map computation steps to pipeline stages so that each stage of the pipeline computes one step. Initially, the first stage of the pipeline performs C1(1). After that completes, the second stage of the pipeline receives the first data item and computes C1(2) while the first stage computes the first step of the second item, C2(1). Next, the third stage computes C1(3), while the second stage computes C2(2) and the first stage C3(1). Fig. 4.24 illustrates how this works for a pipeline consisting of four stages. Notice that concurrency is initially limited and some resources remain idle until all the stages are occupied with useful work. This is referred to as filling the pipeline. At the end of the computation (draining the pipeline), again there is limited concurrency and idle resources as the final item works its way through the pipeline. We want the time spent filling or draining the pipeline to be small compared to the total time of the computation. This will be the case if the number of stages is small compared to the number of items to be processed. Notice also that overall throughput/efficiency is maximized if the time taken to process a data element is roughly the same for each stage.
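The fill/drain argument can be made quantitative with a back-of-the-envelope sketch. Assuming (as an illustration, not a claim from the text) that every stage takes exactly one time unit per item, N items through S stages finish after N + S - 1 steps, so utilization is N*S useful stage-steps out of S*(N + S - 1) available ones, which simplifies to N/(N + S - 1):

```java
class PipelineTiming {
    // Total time steps for 'items' data elements through 'stages' stages,
    // assuming one time unit per stage per item: fill takes stages-1 steps,
    // then one item completes per step.
    static int totalSteps(int stages, int items) {
        return items + stages - 1;
    }

    // Fraction of stage-steps doing useful work: items / (items + stages - 1).
    static double utilization(int stages, int items) {
        return (double) items / totalSteps(stages, items);
    }
}
```

For example, 100 items through a 4-stage pipeline take 103 steps at about 97% utilization, which is why the pattern wants many more items than stages.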
Figure 4.24. Operation of a pipeline. Each pipeline stage i computes the i-th step of
the computation.
This idea can be extended to include situations more general than a completely linear pipeline. For example, Fig. 4.25 illustrates two pipelines, each with four stages. In the second pipeline, the third stage consists of two operations that can be performed concurrently.
Normally each pipeline stage will correspond to one task. Fig. 4.26 shows the basic structure of each stage.
If the number of data elements to be processed is known in advance, then each stage can count the number of elements and terminate when these have been processed. Alternatively, a sentinel indicating termination may be sent through the pipeline.
It is worthwhile to consider at this point some factors that affect performance.
The amount of concurrency in a full pipeline is limited by the number of stages. Thus, a larger number of stages allows more concurrency. However, the data sequence must be transferred between the stages, introducing overhead to the calculation. Thus, we need to organize the computation into stages such that the work done by a stage is large compared to the communication overhead. What is "large enough" is highly dependent on the particular architecture. Specialized hardware (such as vector processors) allows very fine-grained parallelism.
The pattern works better if the operations performed by the various stages of the pipeline are all about equally computationally intensive. If the stages in the pipeline vary widely in computational effort, the slowest stage creates a bottleneck for the aggregate throughput.
Figure 4.26. Basic structure of a pipeline stage
initialize
while (more data)
{
    receive data element from previous stage
    perform operation on data element
    send data element to next stage
}
finalize
The pattern works better if the time required to fill and drain the pipeline is small compared to the overall running time. This time is influenced by the number of stages (more stages means more fill/drain time).
Therefore, it is worthwhile to consider whether the original decomposition into tasks should be revisited at this point, possibly combining lightly loaded adjacent pipeline stages into a single stage, or decomposing a heavily loaded stage into multiple stages.
It may also be worthwhile to parallelize a heavily loaded stage using one of the other Algorithm Structure patterns. For example, if the pipeline is processing a sequence of images, it is often the case that each stage can be parallelized using the Task Parallelism pattern.
Structuring the computation
We also need a way to structure the overall computation. One possibility is to use the SPMD pattern (described in the Supporting Structures design space) and use each UE's ID to select an option in a case or switch statement, with each case corresponding to a stage of the pipeline.
To increase modularity, object-oriented frameworks can be developed that allow stages to be represented by objects or procedures that can easily be "plugged in" to the pipeline. Such frameworks are not difficult to construct using standard OOP techniques, and several are available as commercial or freely available products.
Representing the dataflow among pipeline elements
How dataflow between pipeline elements is represented depends on the target platform.
In a message-passing environment, the most natural approach is to assign one process to each operation (stage of the pipeline) and implement each connection between successive stages of the pipeline as a sequence of messages between the corresponding processes. Because the stages are hardly ever perfectly synchronized, and the amount of work carried out at different stages almost always varies, this flow of data between pipeline stages must usually be both buffered and ordered. Most message-passing environments (e.g., MPI) make this easy to do. If the cost of sending individual messages is high, it may be worthwhile to consider sending multiple data elements in each message; this reduces total communication cost at the expense of increasing the time needed to fill the pipeline.
If a message-passing programming environment is not a good fit with the target platform, the stages of the pipeline can be connected explicitly with buffered channels. Such a buffered channel can be implemented as a queue shared between the sending and receiving tasks, using the Shared Queue pattern.
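A minimal sketch of such a buffered channel, using the same LinkedBlockingQueue class the Java framework later in this section relies on: one thread plays the producing stage, the current thread plays the consuming stage, and take() blocks whenever the queue is empty, so the consumer automatically waits for the producer. The stage computations here (squaring and summing) are purely illustrative.

```java
import java.util.concurrent.*;

class ChannelSketch {
    // Stage 1 puts the squares 0,1,4,9,16 on the channel; stage 2 takes
    // them off and sums them. The queue provides both the buffering and
    // the ordering the pattern requires.
    static int runTwoStages() {
        try {
            BlockingQueue<Integer> channel = new LinkedBlockingQueue<Integer>();
            Thread producer = new Thread(() -> {
                for (int i = 0; i < 5; i++) channel.add(i * i);
            });
            producer.start();
            int sum = 0;
            for (int i = 0; i < 5; i++) sum += channel.take(); // blocks if empty
            producer.join();
            return sum; // 0 + 1 + 4 + 9 + 16
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```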
If the individual stages are themselves implemented as parallel programs, then more sophisticated approaches may be called for, especially if some sort of data redistribution needs to be performed between the stages. This might be the case if, for example, the data needs to be partitioned along a different dimension or partitioned into a different number of subsets in the same dimension. For example, an application might include one stage in which each data element is partitioned into three subsets and another stage in which it is partitioned into four subsets. The simplest ways to handle such situations are to aggregate and disaggregate data elements between stages. One approach would be to have only one task in each stage communicate with tasks in other stages; this task would then be responsible for interacting with the other tasks in its stage to distribute input data elements and collect output data elements. Another approach would be to introduce additional pipeline stages to perform aggregation/disaggregation operations. Either of these approaches, however, involves a fair amount of communication. It may be preferable to have the earlier stage "know" about the needs of its successor and communicate with each task receiving part of its data directly rather than aggregating the data at one stage and then disaggregating at the next. This approach improves performance at the cost of reduced simplicity, modularity, and flexibility.
Less traditionally, networked file systems have been used for communication between stages in a pipeline running in a workstation cluster. The data is written to a file by one stage and read from the file by its successor. Network file systems are usually mature and fairly well optimized, and they provide for the visibility of the file at all PEs as well as mechanisms for concurrency control. Higher-level abstractions such as tuple spaces and blackboards implemented over networked file systems can also be used. File-system-based solutions are appropriate in large-grained applications in which the time needed to process the data at each stage is large compared with the time to access the file system.
Handling errors
For some applications, it might be necessary to gracefully handle error conditions. One solution is to create a separate task to handle errors. Each stage of the regular pipeline sends to this task any data elements it cannot process along with error information and then continues with the next item in the pipeline. The error task deals with the faulty data elements appropriately.
Processor allocation and task scheduling
The simplest approach is to allocate one PE to each stage of the pipeline. This gives good load balance if the PEs are similar and the amount of work needed to process a data element is roughly the same for each stage. If the stages have different requirements (for example, one is meant to be run on special-purpose hardware), this should be taken into consideration in assigning stages to PEs.
If there are fewer PEs than pipeline stages, then multiple stages must be assigned to the same PE, preferably in a way that improves or at least does not much reduce overall performance. Stages that do not share many resources can be allocated to the same PE; for example, a stage that writes to a disk and a stage that involves primarily CPU computation might be good candidates to share a PE. If the amount of work to process a data element varies among stages, stages involving less work may be allocated to the same PE, thereby possibly improving load balance. Assigning adjacent stages to the same PE can reduce communication costs. It might also be worthwhile to consider combining adjacent stages of the pipeline into a single stage.
If there are more PEs than pipeline stages, it is worthwhile to consider parallelizing one or more of the pipeline stages using an appropriate Algorithm Structure pattern, as discussed previously, and allocating more than one PE to the parallelized stage(s). This is particularly effective if the parallelized stage was previously a bottleneck (taking more time than the other stages and thereby dragging down overall performance).
Another way to make use of more PEs than pipeline stages, if there are no temporal constraints among the data items themselves (that is, it doesn't matter if, say, data item 3 is computed before data item 2), is to run multiple independent pipelines in parallel. This can be considered an instance of the Task Parallelism pattern. It will improve the throughput of the overall calculation, but it does not significantly improve the latency, since it still takes the same amount of time for a data element to traverse the pipeline.
There are a few more factors to keep in mind when evaluating whether a given design will produce acceptable performance.
In many situations where the Pipeline pattern is used, the performance measure of interest is the throughput, the number of data items per time unit that can be processed after the pipeline is already full. For example, if the output of the pipeline is a sequence of rendered images to be viewed as an animation, then the pipeline must have sufficient throughput (number of items processed per time unit) to generate the images at the required frame rate.
In another situation, the input might be generated from real-time sampling of sensor data. In this case, there might be constraints on both the throughput (the pipeline should be able to handle all the data as it comes in without backing up the input queue and possibly losing data) and the latency (the amount of time between the generation of an input and the completion of processing of that input). In this case, it might be desirable to minimize latency subject to a constraint that the throughput is sufficient to handle the incoming data.
Examples
Fourier-transform computations
A type of calculation widely used in signal processing involves performing the following computations repeatedly on different sets of data.
1. Perform a discrete Fourier transform (DFT) on a set of data.
2. Manipulate the result of the transform elementwise.
3. Perform an inverse DFT on the result of the manipulation.
Examples of such calculations include convolution, correlation, and filtering operations ([PTV93]).
A calculation of this form can easily be performed by a three-stage pipeline.
The first stage of the pipeline performs the initial Fourier transform; it repeatedly obtains one set of input data, performs the transform, and passes the result to the second stage of the pipeline.
The second stage of the pipeline performs the desired elementwise manipulation; it repeatedly obtains a partial result (of applying the initial Fourier transform to an input set of data) from the first stage of the pipeline, performs its manipulation, and passes the result to the third stage of the pipeline. This stage can often itself be parallelized using one of the other Algorithm Structure patterns.
The third stage of the pipeline performs the final inverse Fourier transform; it repeatedly obtains a partial result (of applying the initial Fourier transform and then the elementwise manipulation to an input set of data) from the second stage of the pipeline, performs the inverse Fourier transform, and outputs the result.
Each stage of the pipeline processes one set of data at a time. However, except during the initial filling of the pipeline, all stages of the pipeline can operate concurrently; while the first stage is processing the Nth set of data, the second stage is processing the (N-1)th set of data, and the third stage is processing the (N-2)th set of data.
Java pipeline framework
The figures for this example show a simple Java framework for pipelines and an example application. The framework consists of a base class for pipeline stages, PipelineStage, shown in Fig. 4.27, and a base class for pipelines, LinearPipeline, shown in Fig. 4.28. Applications provide a subclass of PipelineStage for each desired stage, implementing its three abstract methods to indicate what the stage should do on the initial step, the computation steps, and the final step, and a subclass of LinearPipeline that implements its abstract methods to create an array containing the desired pipeline stages and the desired queues connecting the stages. For the queue connecting the stages, we use LinkedBlockingQueue, an implementation of the BlockingQueue interface. These classes are found in the java.util.concurrent package. These classes use generics to specify the type of objects the queue can hold. For example, new LinkedBlockingQueue<String> creates a BlockingQueue implemented by an underlying linked list that can hold Strings. The operations of interest are put, to add an object to the queue, and take, to remove an object; take blocks if the queue is empty. The class CountDownLatch, also found in the java.util.concurrent package, is a simple barrier that allows the program to print a message when it has terminated. Barriers in general, and CountDownLatch in particular, are discussed in the Implementation Mechanisms design space.
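The two java.util.concurrent classes the framework relies on can be demonstrated in a few lines: a worker thread puts an element on a LinkedBlockingQueue and counts down a latch; the main thread's take() blocks until the element is available, and await() blocks until the count reaches zero. The demo class itself is illustrative, not part of the framework.

```java
import java.util.concurrent.*;

class QueueLatchDemo {
    static String demo() {
        try {
            BlockingQueue<String> q = new LinkedBlockingQueue<String>();
            CountDownLatch done = new CountDownLatch(1);
            Thread worker = new Thread(() -> {
                try {
                    q.put("hello");    // add an object to the queue
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                done.countDown();      // signal that the worker has finished
            });
            worker.start();
            String s = q.take();       // blocks until the worker's put
            done.await();              // blocks until countDown has been called
            return s;
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```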
The remaining figures show code for an example application, a pipeline to sort integers. Fig. 4.29 is the required subclass of LinearPipeline, and Fig. 4.30 is the required subclass of PipelineStage. Additional pipeline stages to generate or read the input and to handle the output are not shown.
Known uses
Many applications in signal and image processing are implemented as pipelines.
The OPUS [SR98] system is a pipeline framework developed by the Space Telescope Science Institute originally to process telemetry data from the Hubble Space Telescope and later employed in other applications. OPUS uses a blackboard architecture built on top of a network file system for interstage communication and includes monitoring tools and support for error handling.
Figure 4.27. Base class for pipeline stages
import java.util.concurrent.*;

abstract class PipelineStage implements Runnable {

    BlockingQueue in;
    BlockingQueue out;
    CountDownLatch s;
    boolean done;

    // override to specify initialization step
    abstract void firstStep() throws Exception;
    // override to specify compute step
    abstract void step() throws Exception;
    // override to specify finalization step
    abstract void lastStep() throws Exception;

    void handleComputeException(Exception e)
    { e.printStackTrace(); }

    public void run()
    {
        try
        {   firstStep();
            while (!done) { step(); }
            lastStep();
        }
        catch (Exception e) { handleComputeException(e); }
        finally { s.countDown(); }
    }

    public void init(BlockingQueue in,
                     BlockingQueue out,
                     CountDownLatch s)
    { this.in = in; this.out = out; this.s = s; }
}
Airborne surveillance radars use space-time adaptive processing (STAP) algorithms, which have been implemented as a parallel pipeline [CLW+00]. Each stage is itself a parallel algorithm, and the pipeline requires data redistribution between some of the stages.
Fx [GOS94], a parallelizing Fortran compiler based on HPF [HPF97], has been used to develop several example applications [DGO+94, SSOG93] that combine data parallelism (similar to the form of parallelism captured in the Geometric Decomposition pattern) and pipelining. For example, one application performs 2D Fourier transforms on a sequence of images via a two-stage pipeline (one stage for the row transforms and one stage for the column transforms), with each stage being itself parallelized using data parallelism. The SIGPLAN paper ([SSOG93]) is especially interesting in that it presents performance figures comparing this approach with a straight data-parallelism approach.
Figure 4.28. Base class for pipelines (code omitted)
[J92] presents some finer-grained applications of pipelining, including inserting a sequence of elements into a 2-3 tree and pipelined mergesort.
Related Patterns
This pattern is very similar to the Pipes and Filters pattern of [BMR+96]; the key difference is that this pattern explicitly discusses concurrency.
For applications in which there are no temporal dependencies between the data inputs, an alternative to this pattern is a design based on multiple sequential pipelines executing in parallel and using the Task Parallelism pattern.
Figure 4.29. Pipelined sort (main class)
At first glance, one might also expect that sequential solutions built using the Chain of Responsibility pattern [GHJV95] could be easily parallelized using the Pipeline pattern. In Chain of Responsibility, or COR, an "event" is passed along a chain of objects until one or more of the objects handle the event. This pattern is directly supported, for example, in the Java Servlet Specification[1] [SER] to enable filtering of HTTP requests. With Servlets, as well as other typical applications of COR, however, the reason for using the pattern is to support modular structuring of a program that will need to handle independent events in different ways depending on the event type. It may be that only one object in the chain will even handle the event. We expect that in most cases, the Task Parallelism pattern would be more appropriate than the Pipeline pattern. Indeed, Servlet container implementations already supporting multithreading to handle independent HTTP requests provide this solution for free.
[1] A Servlet is a Java program invoked by a Web server. The Java Servlets technology is included in the Java 2 Enterprise Edition platform for Web server applications.
Figure 4.30. Pipelined sort (sorting stage)
class SortingStage extends PipelineStage {

    Comparable val = null;
    Comparable input = null;

    void firstStep() throws InterruptedException
    {   input = (Comparable)in.take();
        done = (input.equals("DONE"));
        val = input;
    }

    void step() throws InterruptedException
    {   input = (Comparable)in.take();
        done = (input.equals("DONE"));
        if (!done)
        {   if (val.compareTo(input) < 0)
            { out.put(val); val = input; }
            else { out.put(input); }
        }
        else out.put(val);
    }

    void lastStep() throws InterruptedException
    { out.put("DONE"); } // forward the sentinel so later stages terminate
}
The Pipeline pattern is similar to the Event-Based Coordination pattern in that both patterns apply to problems where it is natural to decompose the computation into a collection of semi-independent tasks. The difference is that the Event-Based Coordination pattern is irregular and asynchronous where the Pipeline pattern is regular and synchronous: In the Pipeline pattern, the semi-independent tasks represent the stages of the pipeline, the structure of the pipeline is static, and the interaction between successive stages is regular and loosely synchronous. In the Event-Based Coordination pattern, however, the tasks can interact in very irregular and asynchronous ways, and there is no requirement for a static structure.
Problem
Suppose the application can be decomposed into groups of semi-independent tasks interacting in an irregular fashion. The interaction is determined by the flow of data between them, which implies ordering constraints between the tasks. How can these tasks and their interaction be implemented so they can execute concurrently?
Context
Some problems are most naturally represented as a collection of semi-independent entities interacting in an irregular way. What this means is perhaps clearest if we compare this pattern with the Pipeline pattern. In the Pipeline pattern, the entities form a linear pipeline, each entity interacts only with the entities to either side, the flow of data is one-way, and interaction occurs at fairly regular and predictable intervals. In the Event-Based Coordination pattern, in contrast, there is no restriction to a linear structure, no restriction that the flow of data be one-way, and the interaction takes place at irregular and sometimes unpredictable intervals.
As a real-world analogy, consider a newsroom, with reporters, editors, fact checkers, and other employees collaborating on stories. As reporters finish stories, they send them to the appropriate editors; an editor can decide to send the story to a fact checker (who would then eventually send it back) or back to the reporter for further revision. Each employee is a semi-independent entity, and their interaction (for example, a reporter sending a story to an editor) is irregular.
Many other examples can be found in the field of discrete-event simulation, that is, simulation of a physical system consisting of a collection of objects whose interaction is represented by a sequence of discrete "events". An example of such a system is the car-wash facility described in [Mis86]: The facility has two car-wash machines and an attendant. Cars arrive at random times at the attendant. Each car is directed by the attendant to a nonbusy car-wash machine if one exists, or queued if both machines are busy. Each car-wash machine processes one car at a time. The goal is to compute, for a given distribution of arrival times, the average time a car spends in the system (time being washed plus any time waiting for a nonbusy machine) and the average length of the queue that builds up at the attendant. The "events" in this system include cars arriving at the attendant, cars being directed to the car-wash machines, and cars leaving the machines. Fig. 4.31 sketches this example. Notice that it includes "source" and "sink" objects to make it easier to model cars arriving and leaving the facility. Notice also that the attendant must be notified when cars leave the car-wash machines so that it knows whether the machines are busy.
Figure 4.31. Discrete-event simulation of a car-wash facility. Arrows indicate the flow
of events.
Also, it is sometimes desirable to compose existing, possibly sequential, program components that interact in possibly irregular ways into a parallel program without changing the internals of the components.
For problems such as this, it might make sense to base a parallel algorithm on defining a task (or a group of tightly coupled tasks) for each component, or in the case of discrete-event simulation, simulation entity. Interaction between these tasks is then based on the ordering constraints determined by the flow of data between them.
Forces
A good solution should make it simple to express the ordering constraints, which can be numerous and irregular and even arise dynamically. It should also make it possible for as many activities as possible to be performed concurrently.
Ordering constraints implied by the data dependencies can be expressed by encoding them into the program (for example, via sequential composition) or using shared variables, but neither approach leads to solutions that are simple, capable of expressing complex constraints, and easy to understand.
Solution
A good solution is based on expressing the dataflow using abstractions called events, with each event having a task that generates it and a task that processes it. Because an event must be generated before it can be processed, events also define ordering constraints between the tasks. Computation within each task consists of processing events.
Defining the tasks
The basic structure of each task consists of receiving an event, processing it, and possibly generating events, as shown in Fig. 4.32.
If the program is being built from existing components, the task will serve as an instance of the Facade pattern [GHJV95] by providing a consistent event-based interface to the component.
The order in which tasks receive events must be consistent with the application's ordering constraints, as discussed later.
Figure 4.32. Basic structure of a task in the Event-Based Coordination pattern
initialize
while (not done)
{
    receive event
    process event
    send events
}
finalize
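The loop in Fig. 4.32 can be sketched in Java with a blocking queue standing in for the event channel. The string events, the uppercasing "processing," and the "DONE" sentinel used as the done flag are all illustrative assumptions, echoing the sentinel idea from the Pipeline examples.

```java
import java.util.concurrent.*;

// One event-processing task: receive an event, process it, send the result.
class EventTask implements Runnable {
    static final String POISON = "DONE";   // sentinel marking termination
    final BlockingQueue<String> in;
    final BlockingQueue<String> out;

    EventTask(BlockingQueue<String> in, BlockingQueue<String> out) {
        this.in = in; this.out = out;
    }

    public void run() {
        try {
            while (true) {
                String event = in.take();          // receive event (blocks)
                if (event.equals(POISON)) break;   // done
                out.put(event.toUpperCase());      // process and send event
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

// Helper that sends one event (plus the sentinel) through a task.
class EventTaskDemo {
    static String roundTrip(String event) {
        try {
            BlockingQueue<String> in = new LinkedBlockingQueue<String>();
            BlockingQueue<String> out = new LinkedBlockingQueue<String>();
            Thread t = new Thread(new EventTask(in, out));
            t.start();
            in.put(event);
            in.put(EventTask.POISON);
            t.join();
            return out.take();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```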
To allow communication and computation to overlap, one generally needs a form of asynchronous communication of events in which a task can create (send) an event and then continue without waiting for the recipient to receive it. In a message-passing environment, an event can be represented by a message sent asynchronously from the task generating the event to the task that is to process it. In a shared-memory environment, a queue can be used to simulate message passing. Because each such queue will be accessed by more than one task, it must be implemented in a way that allows safe concurrent access, as described in the Shared Queue pattern. Other communication abstractions, such as tuple spaces as found in the Linda coordination language or JavaSpaces [FHA99], can also be used effectively with this pattern. Linda [CG91] is a simple language consisting of only six operations that read and write an associative (that is, content-addressable) shared memory called a tuple space. A tuple space is a conceptually shared repository for data containing objects called tuples that tasks use for communication in a distributed system.
Enforcing event ordering
The enforcement of ordering constraints may make it necessary for a task to process events in a different order from the order in which they are sent, or to wait to process an event until some other event from a given task has been received, so it is usually necessary to be able to look ahead in the queue or message buffer and remove elements out of order. For example, consider the situation in Fig. 4.33. Task 1 generates an event and sends it to task 2, which will process it, and also sends it to task 3, which is recording information about all events. Task 2 processes the event from task 1 and generates a new event, a copy of which is also sent to task 3. Suppose that the vagaries of the scheduling and underlying communication layer cause the event from task 2 to arrive before the event from task 1. Depending on what task 3 is doing with the events, this may or may not be problematic. If task 3 is simply tallying the number of events that occur, there is no problem. If task 3 is writing a log entry that should reflect the order in which events are handled, however, simply processing events in the order in which they arrive would in this case produce an incorrect result. If task 3 is controlling a gate, and the event from task 1 results in opening the gate and the event from task 2 in closing the gate, then the out-of-order messages could cause significant problems, and task 3 should not process the first event until after the event from task 1 has arrived and been processed.
Figure 4.33. Event-based communication among three tasks. Task 2 generates its
event in response to the event received from task 1. The two events sent to task 3
can arrive in either order.
In discrete-event simulations, a similar problem can occur because of the semantics of the application domain. An event arrives at a station (task) along with a simulation time when it should be scheduled. An event can arrive at a station before other events with earlier simulation times.
The first step is to determine whether, in a particular situation, out-of-order events can be a problem. There will be no problem if the "event" path is linear so that no out-of-order events will occur, or if, according to the application semantics, out-of-order events do not matter.
If out-of-order events may be a problem, then either an optimistic or pessimistic approach can be chosen. An optimistic approach requires the ability to roll back the effects of events that are mistakenly executed (including the effects of any new events that have been created by the out-of-order execution). In the area of distributed simulation, this approach is called time warp [Jef85]. Optimistic approaches are usually not feasible if an event causes interaction with the outside world. Pessimistic approaches ensure that the events are always executed in order at the expense of increased latency and communication overhead. Pessimistic approaches do not execute events until it can be guaranteed "safe" to do so. In the figure, for example, task 3 cannot process an event from task 2 until it "knows" that no earlier event will arrive from task 1 and vice versa. Providing task 3 with that knowledge may require introducing null events that contain no information useful for anything except the event ordering. Many implementations of pessimistic approaches are based on timestamps that are consistent with the causality in the system [Lam78].
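The pessimistic "safe" rule can be illustrated with a small model (an assumption-level sketch, not from the text): with two input channels each delivering timestamped events in nondecreasing order, an event is safe to process only when it is no later than the other channel's current head, since nothing earlier can still arrive on that channel. Events that are not yet known safe simply stay buffered.

```java
import java.util.*;

class SafeMerge {
    // Each channel is a queue of timestamps, assumed nondecreasing per
    // channel. Returns the events that are provably safe to process now,
    // in timestamp order; the rest remain buffered in their channels.
    static List<Integer> mergeSafely(Deque<Integer> ch1, Deque<Integer> ch2) {
        List<Integer> processed = new ArrayList<Integer>();
        while (!ch1.isEmpty() && !ch2.isEmpty()) {
            // The smaller head is safe: the other channel cannot later
            // deliver anything earlier than its own current head.
            if (ch1.peek() <= ch2.peek()) processed.add(ch1.poll());
            else processed.add(ch2.poll());
        }
        return processed;
    }
}
```

When one channel goes quiet, processing stalls even though events are waiting on the other channel; this is exactly the situation the null events mentioned above are meant to resolve.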
Much research and development effort has gone into frameworks that take care of the details of event ordering in discrete-event simulation for both optimistic [RMC+98] and pessimistic approaches [CLL+99]. Similarly, middleware is available that handles event-ordering problems in process groups caused by the communication system. An example is the Ensemble system developed at Cornell [vRBH+98].
Avoiding deadlocks
It is possible for systems using this pattern to deadlock at the application level: for some reason, the system arrives in a state where no task can proceed without first receiving an event from another task that will never arrive. This can happen because of a programming error; in the case of a simulation, it can also be caused by problems in the model that is being simulated. In the latter case, the developer must rethink the solution.
If pessimistic techniques are used to control the order in which events are processed, then deadlocks can occur when an event is available and actually could be processed, but is not processed because the event is not yet known to be safe. The deadlock can be broken by exchanging enough information that the event can be safely processed. This is a very significant problem, as the overhead of dealing with deadlocks can cancel the benefits of parallelism and make the parallel algorithm slower than a sequential simulation. Approaches to dealing with this type of deadlock range from sending frequent enough "null messages" to avoid deadlocks altogether (at the cost of many extra messages) to using deadlock-detection schemes to detect the presence of a deadlock and then resolve it (at the cost of possible significant idle time before the deadlock is detected and resolved). The approach of choice will depend on the frequency of deadlock. A middle-ground solution, often the best approach, is to use timeouts instead of accurate deadlock detection.
Scheduling and processor allocation
The most straightforward approach is to allocate one task per PE and allow all the tasks to execute concurrently. If insufficient PEs are available to do this, then multiple tasks can be allocated to each PE. This should be done in a way that achieves good load balance. Load balancing is a difficult problem in this pattern due to its potentially irregular structure and possible dynamic nature. Some infrastructures that support this pattern allow task migration so that the load can be balanced dynamically at runtime.
Efficient communication of events
If the application is to perform well, the mechanism used to communicate events must be as efficient as is feasible. In a shared-memory environment, this means making sure the mechanism does not have the potential to become a bottleneck. In a message-passing environment, there are several efficiency considerations; for example, whether it makes sense to send many short messages between tasks or try to combine them. [YWC+96] and [WY95] describe some considerations and solutions.
Examples
Known uses
A number of discrete-event simulation applications use this pattern. The DPAT simulation used to analyze air traffic control systems [Wie01] is a successful simulation that uses optimistic techniques. It is implemented using the GTW (Georgia Tech Time Warp) system [DFP+94]. The paper ([Wie01]) describes application-specific tuning and several general techniques that allow the simulation to work well without excessive overhead for the optimistic synchronization. The Synchronous Parallel Environment for Emulation and Discrete Event Simulation (SPEEDES) [Met] is another optimistic simulation engine that has been used for large-scale war-gaming exercises. The Scalable Simulation Framework (SSF) [CLL+99] is a simulation framework with pessimistic synchronization that has been used for large-scale modeling of the Internet.
The CSWEB application described in [YWC+96] simulates the voltage output of combinational digital circuits (that is, circuits without feedback paths). The circuit is partitioned into subcircuits; associated with each are input signal ports and output voltage ports, which are connected to form a representation of the whole circuit. The simulation of each subcircuit proceeds in a time-stepped fashion; at each time step, the subcircuit's behavior depends on its previous state and the values read at its input ports (which correspond to values at the corresponding output ports of other subcircuits at previous time steps). Simulation of these subcircuits can proceed concurrently, with ordering constraints imposed by the relationship between values generated for output ports and values read on input ports. The solution described in [YWC+96] fits the Event-Based Coordination pattern, defining a task for each subcircuit and representing the ordering constraints as events.
Related Patterns
This pattern is similar to the Pipeline pattern in that both patterns apply to problems in which it is natural to decompose the computation into a collection of semi-independent entities interacting in terms of a flow of data. There are two key differences. First, in the Pipeline pattern, the interaction among entities is fairly regular, with all stages of the pipeline proceeding in a loosely synchronous way, whereas in the Event-Based Coordination pattern there is no such requirement, and the entities can interact in very irregular and asynchronous ways. Second, in the Pipeline pattern, the overall structure (number of tasks and their interaction) is usually fixed, whereas in the Event-Based Coordination pattern, the problem structure can be more dynamic.
5.1. INTRODUCTION
The Finding Concurrency and Algorithm Structure design spaces focus on algorithm expression. At some point, however, algorithms must be translated into programs. The patterns in the Supporting Structures design space address that phase of the parallel program design process, representing an intermediate stage between the problem-oriented patterns of the Algorithm Structure design space and the specific programming mechanisms described in the Implementation Mechanisms design space. We call these patterns Supporting Structures because they describe software constructions or "structures" that support the expression of parallel algorithms. An overview of this design space and its place in the pattern language is shown in Fig. 5.1.
Figure 5.1. Overview of the Supporting Structures design space and its place in the
pattern language
The two groups of patterns in this space are those that represent program-structuring approaches and those that represent commonly used shared data structures. These patterns are briefly described in the next section. In some programming environments, some of these patterns are so well supported that there is little work for the programmer. We nevertheless document them as patterns for two reasons: First, understanding the low-level details behind these structures is important for effectively using them. Second, describing these structures as patterns provides guidance for programmers who might need to implement them from scratch. The final section of this chapter describes structures that were not deemed important enough, for various reasons, to warrant a dedicated pattern, but which deserve mention for completeness.
5.1.1. Program Structuring Patterns
Patterns in this first group describe approaches for structuring source code. These patterns include the following.

SPMD. In an SPMD (Single Program, Multiple Data) program, all UEs execute the same program (Single Program) in parallel, but each has its own set of data (Multiple Data). Different UEs can follow different paths through the program. In the source code, the logic to control this is expressed using a parameter that uniquely labels each UE (for example, a process ID).

Master/Worker. A master process or thread sets up a pool of worker processes or threads and a bag of tasks. The workers execute concurrently, with each worker repeatedly removing a task from the bag of tasks and processing it, until all tasks have been processed or some other termination condition has been reached. In some implementations, no explicit master is present.

Loop Parallelism. This pattern addresses the problem of transforming a serial program whose runtime is dominated by a set of compute-intensive loops into a parallel program where the different iterations of the loop are executed in parallel.

Fork/Join. A main UE forks off some number of other UEs that then continue in parallel to accomplish some portion of the overall work. Often the forking UE waits until the child UEs terminate and join.
While we define each of these program structures as a distinct pattern, this is somewhat artificial. It is possible, for example, to implement the Master/Worker pattern using the Fork/Join pattern or the SPMD pattern. These patterns do not represent exclusive, unique ways to structure a parallel program. Rather, they define the major idioms used by experienced parallel programmers.
These patterns also inevitably express a bias rooted in the subset of parallel programming environments we consider in this pattern language. To an MPI programmer, for example, all program structure patterns are essentially a variation on the SPMD pattern. To an OpenMP programmer, however, there is a huge difference between programs that utilize thread IDs (that is, the SPMD pattern) and programs that express all concurrency in terms of loop-level worksharing constructs (that is, the Loop Parallelism pattern).
Therefore, in using these patterns, don't think of them too rigidly. These patterns express important techniques and are worthy of consideration in isolation, but do not hesitate to combine them in different ways to meet the needs of a particular problem. For example, in the SPMD pattern, we will discuss parallel algorithms based on parallelizing loops but expressed with the SPMD pattern. It might seem that this indicates that the SPMD and Loop Parallelism patterns are not really distinct patterns, but in fact it shows how flexible the SPMD pattern is.
5.1.2. Patterns Representing Data Structures
Patterns in this second group have to do with managing data dependencies. The Shared Data pattern deals with the general case. The others describe specific frequently used data structures.

Shared Data. This pattern addresses the general problem of handling data that is shared by more than one UE, discussing both correctness and performance issues.

Shared Queue. This pattern represents a "thread-safe" implementation of the familiar queue abstract data type (ADT), that is, an implementation of the queue ADT that maintains the correct semantics even when used by concurrently executing UEs.

Distributed Array. This pattern represents a class of data structures often found in parallel scientific computing, namely arrays of one or more dimensions that are decomposed into subarrays and distributed among processes or threads.
5.2. FORCES
All of the program-structuring patterns address the same basic problem: how to structure source code to best support algorithm structures of interest. Unique forces are applicable to each pattern, but in designing a program around these structures, there are some common forces to consider in most cases:

Clarity of abstraction. Is the parallel algorithm clearly apparent from the source code?

In a well-structured program, the algorithm leaps from the page. The reader can see the details of the algorithm with little mental effort. We refer to this quality as clarity of abstraction. Good clarity of abstraction is always important for writing correct code, but is particularly essential for parallel programs: Parallel programmers must deal with multiple simultaneous tasks that interact in subtle ways. Getting this right can be very difficult, especially if it is hard to figure out what the algorithm is doing by looking at the source code.
Scalability. How many processors can the parallel program effectively utilize?

The scalability of a program is restricted by three factors. First, there is the amount of concurrency available in the algorithm. If an algorithm only has ten concurrent tasks, then running with more than ten PEs will provide no benefit. Second, the fraction of the runtime spent doing inherently serial work limits how many processors can be used. This is described quantitatively by Amdahl's law, as discussed in Chapter 2. Finally, the parallel overhead of the algorithm contributes to the serial fraction mentioned in Amdahl's law and limits scalability.
Efficiency. How close does the program come to fully utilizing the resources of the parallel computer? Recall the quantitative definition of efficiency given in Chapter 2:

Equation 5.1

    S(P) = T(1) / T(P)

Equation 5.2

    E(P) = S(P) / P = T(1) / (P * T(P))

P is the number of PEs, T(1) is some sequential reference time, and T(P) is the parallel time with P PEs. S(P) is the speedup.
The most rigorous definition of efficiency sets T(1) to the execution time of the best sequential algorithm corresponding to the parallel algorithm under study. When analyzing parallel programs, "best" sequential algorithms are not always available, and it is common to use the runtime for the parallel program on a single PE as the reference time. This can inflate the efficiency because managing the parallel computation always incurs some overhead, even when executing on a single PE. Efficiency is closely related to scalability because every highly scalable algorithm is also highly efficient. Even when the scalability is limited by the available number of tasks or the parallel hardware, however, algorithms can differ in their efficiency.
Maintainability. Is the program easy to debug, verify, and modify?

Casting an algorithm into source code is almost never a one-time proposition. Programs need to be debugged, new features added, performance tuned, etc. These changes to the source code are referred to as maintenance. Programs are more or less maintainable depending on how hard it is to make these changes and to do them correctly.
Environmental affinity. Is the program well aligned with the programming environment and hardware of choice?

If the hardware, for example, lacks support for shared memory, an algorithm structure based on shared memory would be a poor choice. This issue also comes up when considering programming environments. When creating a programming environment, the creators usually have a particular style of programming in mind. For example, OpenMP is designed specifically for programs consisting of a series of loops, the iterations of which will be split between multiple threads (loop-based parallelism). It is much easier to write software when the program structure employed is well aligned with the programming environment.
Sequential equivalence. Where appropriate, does a program produce equivalent results when run with many UEs as with one? If not equivalent, is the relationship between them clear?

It is highly desirable that the results of an execution of a parallel program be the same regardless of the number of PEs used. This is not always possible, especially if bitwise equivalence is desired, because floating-point operations performed in a different order can produce small (or, for ill-conditioned algorithms, not-so-small) changes in the resulting values. However, if we know that the parallel program gives equivalent results when executed on one processor as on many, then we can reason about correctness and do most of the testing on the single-processor version. This is much easier, and thus, when possible, sequential equivalence is a highly desirable goal.
Geometric Decomposition pattern. The SPMD pattern is even more flexible and covers the most important algorithm structures used in scientific computing (which tends to emphasize the Geometric Decomposition, Task Parallelism, and Divide and Conquer patterns). This flexibility can make it difficult to choose a program structure pattern solely on the basis of the choice of Algorithm Structure pattern(s).
Table 5.1. Relationship between Supporting Structures patterns and Algorithm
Structure patterns. The number of stars (ranging from zero to four) is an indication of
the likelihood that the given Supporting Structures pattern is useful in the
implementation of the Algorithm Structure pattern.
(Rows: SPMD, Loop Parallelism, Master/Worker, Fork/Join. Columns: Task Parallelism, Divide and Conquer, Geometric Decomposition, Recursive Data, Pipeline, Event-Based Coordination.)
The choice of programming environment, however, helps narrow the choice considerably. In Table 5.2, we show the relationship between programming environments and the Supporting Structures patterns. MPI, the programming environment of choice on any distributed-memory computer, strongly favors the SPMD pattern. OpenMP, the standard programming model used on virtually every shared-memory computer on the market, is closely aligned with the Loop Parallelism pattern. The combination of programming environment and Algorithm Structure patterns typically selects which Supporting Structures patterns to use.
Table 5.2. Relationship between Supporting Structures patterns and programming
environments. The number of stars (ranging from zero to four) is an indication of the
likelihood that the given Supporting Structures pattern is useful in the programming
environment.
(Rows: SPMD, Loop Parallelism, Master/Worker, Fork/Join. Columns: OpenMP, MPI, Java.)
It is not that the available programming environments pushed SPMD; the force was the other way around. The programming environments for MIMD machines pushed SPMD because that is the way programmers wanted to write their programs. They wrote them this way because they found it to be the best way to get the logic correct and efficient for what the tasks do and how they interact. For example, the programming environment PVM, sometimes considered a predecessor to MPI, in addition to the SPMD program structure also supported running different programs on different UEs (sometimes called the MPMD program structure). The MPI designers, with the benefit of the PVM experience, chose to support only SPMD.
In addition to the advantages to the programmer, SPMD makes management of the solution much easier. It is much easier to keep a software infrastructure up to date and consistent if there is only one program to manage. This factor becomes especially important on systems with large numbers of PEs. These can grow to huge numbers. For example, the two fastest computers in the world according to the November 2003 Top 500 list [Top], the Earth Simulator at the Earth Simulator Center in Japan and the ASCI Q at Los Alamos National Labs, have 5,120 and 8,192 processors, respectively. If each PE runs a distinct program, managing the application software could quickly become prohibitively difficult.
This pattern is by far the most commonly used pattern for structuring parallel programs. It is particularly relevant for MPI programmers and problems using the Task Parallelism and Geometric Decomposition patterns. It has also proved effective for problems using the Divide and Conquer and Recursive Data patterns.
Forces
Using similar code for each UE is easier for the programmer, but most complex applications require that different operations run on different UEs and with different data.

Software typically outlives any given parallel computer. Hence, programs should be portable. This compels the programmer to assume the lowest common denominator in programming environments, and to assume that only basic mechanisms for coordinating tasks will be available.

Achieving high scalability and good efficiency in a parallel program requires that the program be well aligned with the architecture of the parallel computer. Therefore, the details of the parallel system must be exposed and, where appropriate, under the programmer's control.
Solution
The SPMD pattern solves this problem by creating a single source-code image that runs on each of the UEs. The solution consists of the following basic elements.

Initialize. The program is loaded onto each UE and opens with bookkeeping operations to establish a common context. The details of this procedure are tied to the parallel programming environment and typically involve establishing communication channels with other UEs.

Obtain a unique identifier. Near the top of the program, an identifier is set that is unique to each UE. This is usually the UE's rank within the MPI group (that is, a number in the interval from 0 to N-1, where N is the number of UEs) or the thread ID in OpenMP. This unique identifier allows different UEs to make different decisions during program execution.

Run the same program on each UE, using the unique ID to differentiate behavior on different UEs. The same program runs on each UE. Differences in the instructions executed by different UEs are usually driven by the identifier. (They could also depend on the UE's data.) There are many ways to specify that different UEs take different paths through the source code. The most common are (1) branching statements to give specific blocks of code to different UEs and (2) using the UE identifier in loop index calculations to split loop iterations among the UEs.

Distribute data. The data operated on by each UE is specialized to that UE by one of two techniques: (1) decomposing global data into chunks and storing them in the local memory of each UE, and later, if required, recombining them into the globally relevant results, or (2) sharing or replicating the program's major data structures and using the UE identifier to associate subsets of the data with particular UEs.

Finalize. The program closes by cleaning up the shared context and shutting down the computation. If globally relevant data was distributed among UEs, it will need to be recombined.
Discussion
An important issue to keep in mind when developing SPMD programs is the clarity of abstraction, that is, how easy it is to understand the algorithm from reading the program's source code. Depending on how the data is handled, this can range from awful to good. If complex index algebra on the UE identifier is needed to determine the data relevant to a UE or the instruction branch, the algorithm can be almost impossible to follow from the source code. (The Distributed Array pattern discusses useful techniques for arrays.)
In some cases, a replicated-data algorithm combined with simple loop splitting is the best option because it leads to a clear abstraction of the parallel algorithm within the source code and an algorithm with a high degree of sequential equivalence. Unfortunately, this simple approach might not scale well, and more complex solutions might be needed. Indeed, SPMD algorithms can be highly scalable, and algorithms requiring complex coordination between UEs and scaling out to several thousand UEs [PH95] have been written using this pattern. These highly scalable algorithms are usually extremely complicated, as they distribute the data across the nodes (that is, no simplifying replicated-data techniques), and they generally include complex load-balancing logic. These algorithms, unfortunately, bear little resemblance to their serial counterparts, reflecting a common criticism of the SPMD pattern.
An important advantage of the SPMD pattern is that overheads associated with startup and termination are segregated at the beginning and end of the program, not inside time-critical loops. This contributes to efficient programs and results in the efficiency issues being driven by the communication overhead, the capability to balance the computational load among the UEs, and the amount of concurrency available in the algorithm itself.
SPMD programs are closely aligned with programming environments based on message passing. For example, most MPI or PVM programs use the SPMD pattern. Note, however, that it is possible to use the SPMD pattern with OpenMP [CPP01]. With regard to the hardware, the SPMD pattern does not assume anything concerning the address space within which the tasks execute. As long as each UE can run its own instruction stream operating on its own data (that is, the computer can be classified as MIMD), the SPMD structure is satisfied. This generality of SPMD programs is one of the strengths of this pattern.
Examples
The issues raised by application of the SPMD pattern are best discussed using three specific examples:

Numerical integration to estimate the value of a definite integral using the trapezoid rule

Molecular dynamics, force computations

Mandelbrot set computation
Numerical integration
We can use a very simple program, frequently used in teaching parallel programming, to explore many of the issues raised by the SPMD pattern. Consider the problem of estimating the value of pi using Eq. 5.3.

Equation 5.3

    pi = integral from 0 to 1 of 4/(1 + x^2) dx

We use trapezoidal integration to numerically solve the integral. The idea is to fill the area under a curve with a series of rectangles. As the width of the rectangles approaches zero, the sum of the areas of the rectangles approaches the value of the integral.
Figure 5.2. Sequential program to carry out a trapezoid rule integration to compute pi
#include <stdio.h>
#include <math.h>

int main () {
   int i;
   int num_steps = 1000000;
   double x, pi, step, sum = 0.0;

   step = 1.0/(double) num_steps;

   for (i=0; i< num_steps; i++) {
      x = (i+0.5)*step;
      sum = sum + 4.0/(1.0+x*x);
   }
   pi = step * sum;
   printf("pi %lf\n",pi);
   return 0;
}
A program to carry this calculation out on a single processor is shown in Fig. 5.2. To keep the program as simple as possible, we fix the number of steps to use in the integration at 1,000,000. The variable sum is initialized to 0 and the step size is computed as the range in x (equal to 1.0 in this case) divided by the number of steps. The area of each rectangle is the width (the step size) times the height (the value of the integrand at the center of the interval). Because the width is a constant, we pull it out of the summation and multiply the sum of the rectangle heights by the step size, step, to get our estimate of the definite integral.
We will look at several versions of the parallel algorithm. We can see all the elements of a classic SPMD program in the simple MPI version of this program, as shown in Fig. 5.3. The same program is run on each UE. Near the beginning of the program, the MPI environment is initialized and the ID for each UE (my_id) is given by the process rank for each UE in the process group associated with the communicator MPI_COMM_WORLD (for information about communicators and other MPI details, see the MPI appendix, Appendix B). We use the number of UEs and the ID to assign loop ranges (i_start and i_end) to each UE. Because the number of steps may not be evenly divided by the number of UEs, we have to make sure the last UE runs up to the last step in the calculation. After the partial sums have been computed on each UE, we multiply by the step size, step, and then use the MPI_Reduce() routine to combine the partial sums into a global sum. (Reduction operations are described in more detail in the Implementation Mechanisms design space.) This global value will only be available in the process with my_id == 0, so we direct that process to print the answer.
In essence, what we have done in the example in Fig. 5.3 is to replicate the key data (in this case, the partial summation value, sum), use the UE's ID to explicitly split up the work into blocks with one block per UE, and then recombine the local results into the final global result. The challenge in applying this pattern is to (1) split up the data correctly, (2) correctly recombine the results, and (3) achieve an even distribution of the work. The first two steps were trivial in this example. The load balance, however, is a bit more difficult. Unfortunately, the simple procedure we used in Fig. 5.3 could result in significantly more work for the last UE if the number of UEs does not evenly divide the number of steps. For a more even distribution of the work, we need to spread out the extra iterations among multiple UEs. We show one way to do this in the program fragment in Fig. 5.4. We compute the number of iterations left over after dividing the number of steps by the number of processors (rem). We will increase the number of iterations computed by the first rem UEs to cover that amount of work. The code in Fig. 5.4 accomplishes that task. These sorts of index adjustments are the bane of programmers using the SPMD pattern. Such code is error-prone and the source of hours of frustration as program readers try to understand the reasoning behind this logic.
Figure 5.3. MPI program to carry out a trapezoid rule integration in parallel by
assigning one block of loop iterations to each UE and performing a reduction
#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main (int argc, char *argv[]) {
   int i, i_start, i_end;
   int num_steps = 1000000;
   double x, pi, step, sum = 0.0;
   int my_id, numprocs;

   step = 1.0/(double) num_steps;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

   i_start = my_id * (num_steps/numprocs);
   i_end = i_start + (num_steps/numprocs);
   if (my_id == (numprocs-1)) i_end = num_steps;

   for (i=i_start; i< i_end; i++) {
      x = (i+0.5)*step;
      sum = sum + 4.0/(1.0+x*x);
   }
   sum *= step;

   MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
              MPI_COMM_WORLD);
   if (my_id == 0) printf("pi %lf\n",pi);
   MPI_Finalize();
   return 0;
}
Finally, we use a loop-splitting strategy for the numerical integration program. The resulting program is shown in Fig. 5.5. This approach uses a common trick to achieve a cyclic distribution of the loop iterations: Each UE starts with the iteration equal to its rank, and then marches through the iterations of the loop with a stride equal to the number of UEs. The iterations are interleaved among the UEs, in the same manner as a deck of cards would be dealt. This version of the program evenly distributes the load without resorting to complex index algebra.
Figure 5.4. Index calculation that more evenly distributes the work when the number
of steps is not evenly divided by the number of UEs. The idea is to split up the
remaining tasks (rem) among the first rem UEs.
int rem = num_steps % numprocs;
i_start = my_id * (num_steps/numprocs);
i_end = i_start + (num_steps/numprocs);
if (rem != 0){
if(my_id < rem){
i_start += my_id;
i_end += (my_id + 1);
}
else {
i_start += rem;
i_end += rem;
}
}
Figure 5.5. MPI program to carry out a trapezoid rule integration in parallel using a
simple loop-splitting algorithm with cyclic distribution of iterations and a reduction
#include <stdio.h>
#include <math.h>
#include <mpi.h>
int main (int argc, char *argv[]) {
int i;
Figure 5.6. OpenMP program to carry out a trapezoid rule integration in parallel using
the same SPMD algorithm used in Fig. 5.5
#include <stdio.h>
#include <math.h>
#include <omp.h>
int main () {
int num_steps = 1000000;
double pi, step, sum = 0.0;
step = 1.0/(double) num_steps;
#pragma omp parallel reduction(+:sum)
{
int i, id = omp_get_thread_num();
int numthreads = omp_get_num_threads();
double x;
SPMD programs can also be written using OpenMP and Java. In Fig. 5.6, we show an OpenMP version of our trapezoidal integration program. This program is very similar to the analogous MPI program. The program has a single parallel region. We start by finding the thread ID and the number of threads in the team. We then use the same trick to interleave iterations among the team of threads. As with the MPI program, we use a reduction to combine partial sums into a single global sum.
Molecular dynamics
Throughout this pattern language, we have used molecular dynamics as a recurring example. Molecular dynamics simulates the motions of a large molecular system. It uses an explicit time-stepping methodology where, at each time step, the force on each atom is computed and standard techniques from classical mechanics are used to compute how the forces change atomic motions. This problem is ideal for presenting key concepts in parallel algorithms because there are so many ways to approach the problem based on the target computer system and the intended use of the program. In this discussion, we will follow the approach taken in [Mat95] and assume that (1) a sequential version of the program exists, (2) having a single program for sequential and parallel execution is important, and (3) the target system is a small cluster connected by a standard Ethernet LAN. More scalable algorithms for execution on massively parallel systems are discussed in [PH95].
Figure 5.7. Pseudocode for molecular dynamics example. This code is very similar to
the version discussed earlier, but a few extra details have been included. To support
more detailed pseudocode examples, the call to the function that initializes the force
arrays has been made explicit. Also, the fact that the neighbor list is only occasionally
updated is made explicit.
Int const N // number of atoms

Array of Real :: atoms (3,N)      //3D coordinates
Array of Real :: velocities (3,N) //velocity vector
Array of Real :: forces (3,N)     //force in each dimension
Array of List :: neighbors (N)    //atoms in cutoff volume
The core algorithm, including pseudocode, was presented in Sec. 3.1.3. While we won't repeat the discussion here, we do provide a copy of the pseudocode in Fig. 5.7.

The parallel algorithm is discussed in several of the patterns in the Finding Concurrency and Algorithm Structure design spaces. Following are the key points from those discussions that we will need here, along with the location of the original discussion.
1. Computing the non_bonded_forces takes the overwhelming majority of the runtime (Sec. 3.1.3).

2. In computing the non_bonded_force, each atom potentially interacts with all the other atoms. Hence, each UE needs read access to the full atomic position array. Also, due to Newton's third law, each UE will be scattering contributions to the force across the full force array (the Examples section of the Data Sharing pattern).

3. One way to decompose the MD problem into tasks is to focus on the computations needed for a particular atom; that is, we can parallelize this problem by assigning atoms to UEs (the Examples section of the Task Decomposition pattern).
Given that our target is a small cluster, and from point (1) in the preceding list, we will only parallelize the force computations. Because the network is slow for parallel computing, and given the data dependency in point (2), we will:

Keep a copy of the full force and coordinate arrays on each node.

Have each UE redundantly update positions and velocities for the atoms (that is, we assume it is cheaper to redundantly compute these terms than to do them in parallel and communicate the results).

Have each UE compute its contributions to the force array and then combine (or reduce) the UEs' contributions into a single global force array copied onto each UE.
The algorithm is a simple transformation from the sequential algorithm. Pseudocode for this SPMD program is shown in Fig. 5.8. As with any MPI program, the MPI include file is referenced at the top of the program. The MPI environment is initialized and the ID is associated with the rank of the MPI process.
Figure 5.8. Pseudocode for an SPMD molecular dynamics program using MPI
#include <mpi.h>
Int const N // number of atoms
Int const LN // maximum number of atoms assigned to a UE
Int ID // an ID for each UE
Int num_UEs // the number of UEs in the parallel computation
Array of Real :: atoms (3,N)        //3D coordinates
Array of Real :: velocities (3,N)   //velocity vector
Array of Real :: forces (3,N)       //forces computed on this UE
Array of Real :: final_forces (3,N) //globally combined forces
Array of Int  :: local_atoms (LN)   //atoms assigned to this UE
Array of List :: neighbors (LN)     //atoms in cutoff volume
Figure 5.9. Pseudocode for the nonbonded computation in a typical parallel molecular
dynamics code. This code is almost identical to the sequential version of the function
shown in Fig. 4.4. The only major change is a new array of integers holding the indices
for the atoms assigned to this UE, local_atoms. We've also assumed that the neighbor
list has been generated to hold only those atoms assigned to this UE. For the sake of
allocating space for these arrays, we have added a parameter LN which is the largest
number of atoms that can be assigned to a single UE.
function non_bonded_forces (N, LN, atoms, local_atoms,
                            neighbors, forces)

Int N  // number of atoms
Int LN // maximum number of atoms assigned to a UE

Array of Real :: atoms (3,N)     //3D coordinates
Array of Real :: forces (3,N)    //force in each dimension
Array of List :: neighbors(LN)   //atoms in cutoff volume
Array of Int  :: local_atoms(LN) //atoms assigned to this UE

real :: forceX, forceY, forceZ

loop [i] over local_atoms
   loop [j] over neighbors(i)
      forceX = non_bond_force(atoms(1,i), atoms(1,j))
      forceY = non_bond_force(atoms(2,i), atoms(2,j))
      forceZ = non_bond_force(atoms(3,i), atoms(3,j))
      forces(1,i) += forceX; forces(1,j) -= forceX;
      forces(2,i) += forceY; forces(2,j) -= forceY;
      forces(3,i) += forceZ; forces(3,j) -= forceZ;
   end loop [j]
end loop [i]
end function non_bonded_forces
Only a few changes are made to the sequential functions. First, a second force array called final_forces is defined to hold the globally consistent force array appropriate for the update of the atomic positions and velocities. Second, a list of atoms assigned to the UE is created and passed to any function that will be parallelized. Third, the neighbor_list is modified to hold the list for only those atoms assigned to the UE. Finally, within each of the functions to be parallelized (the forces calculations), the loop over atoms is replaced by a loop over the list of local atoms.
We show an example of these simple changes in Fig. 5.9. This is almost identical to the sequential version of this function discussed in the Task Parallelism pattern. As discussed earlier, the following are the key changes.

A new array has been added to hold indices for the atoms assigned to this UE. This array is of length LN, where LN is the maximum number of atoms that can be assigned to a single UE.

The loop over all atoms (loop over i) has been replaced by a loop over the elements of the local_atoms list.

We assume that the neighbor list has been modified to correspond to the atoms listed in the local_atoms list.
The resulting code can be used for a sequential version of the program by setting LN to N and by putting the full set of atom indices into local_atoms. This feature satisfies one of our design goals: that a single source code would work for both sequential and parallel versions of the program.

The key to this algorithm is in the function to compute the neighbor list. The neighbor list function contains a loop over the atoms. For each atom i, there is a loop over all other atoms and a test to determine which atoms are in the neighborhood of atom i. The indices for these neighboring atoms are saved in neighbors, a list of lists. Pseudocode for this code is shown in Fig. 5.10.
Figure 5.10. Pseudocode for the neighbor list computation. For each atom i, the
indices for atoms within a sphere of radius cutoff are added to the neighbor list for
atom i. Notice that the second loop (over j) only considers atoms with indices greater
than i. This accounts for the symmetry in the force computation due to Newton's third
law of motion, that is, that the force between atom i and atom j is just the negative
of the force between atom j and atom i.
function neighbor (N, LN, ID, cutoff, atoms, local_atoms,
neighbors)
Int N // number of atoms
Int LN // max number of atoms assigned to a UE
Real cutoff // radius of sphere defining neighborhood
Array of Real :: atoms (3,N) //3D coordinates
Array of List :: neighbors(LN) //atoms in cutoff volume
Array of Int :: local_atoms(LN) //atoms assigned to this UE
Real :: dist_squ
initialize_lists (local_atoms, neighbors)
The logic defining how the parallelism is distributed among the UEs is captured in the single loop in Fig. 5.10:
loop [i] over atoms on UE //split loop iterations among UEs
add_to_list (i, local_atoms)
The details of how this loop is split among UEs depend on the programming environment. An approach that works well with MPI is the cyclic distribution we used in Fig. 5.5:
for (i=id; i<number_of_atoms; i+=number_of_UEs){
add_to_list (i, local_atoms)
}
More complex or even dynamic distributions can be handled by creating an owner-computes filter [Mat95]. An owner-computes filter provides a flexible and reusable schedule for mapping loop iterations onto UEs. The filter is a boolean function of the ID and the loop iteration. The value of the function depends on whether a UE "owns" a particular iteration of a loop. For example, in a molecular dynamics program, the call to the owner-computes function would be added at the top of the parallelized loops over atoms:
for (i=0; i<number_of_atoms; i++){
   if (!is_owner(i)) continue
add_to_list (i, local_atoms)
}
No other changes to the loop are needed to support expression of concurrency. If the logic managing the loop is convoluted, this approach partitions the iterations among the UEs without altering that logic, and the index partitioning logic is located clearly in one place in the source code. Another advantage occurs when several loops that should be scheduled the same way are spread out throughout a program. For example, on a NUMA machine or a cluster it is very important that data brought close to a PE be used as many times as possible. Often, this means reusing the same schedule in many loops.
This approach is described further for molecular dynamics applications in [Mat95]. It could be important in this application since the workload captured in the neighbor list generation may not accurately reflect the workload in the various force computations. One could easily collect information about the time required for each atom and then readjust the is_owner function to produce more optimal workloads.
Figure 5.11. Pseudocode for a parallel molecular dynamics program using OpenMP
#include <omp.h>
Int const N // number of atoms
Int const LN // maximum number of atoms assigned to a UE
Int ID // an ID for each UE
Int num_UEs // number of UEs in the parallel computation
Array of Real :: atoms(3,N) //3D coordinates
Array of Real :: velocities(3,N) //velocity vector
Array of Real :: forces(3,N) //force in each dim
Array of List :: neighbors(LN) //atoms in cutoff volume
Array of Int :: local_atoms(LN) //atoms for this UE
ID = 0
num_UEs = 1
#pragma omp parallel private (ID, num_UEs, local_atoms, forces) {
ID = omp_get_thread_num()
num_UEs = omp_get_num_threads()
loop over time steps
initialize_forces (N, forces, final_forces)
if(time to update neighbor list)
neighbor_list (N, LN, atoms, neighbors)
end if
vibrational_forces (N, LN, local_atoms, atoms, forces)
rotational_forces (N, LN, local_atoms, atoms, forces)
non_bonded_forces (N, LN, atoms, local_atoms,
neighbors, forces)
#pragma omp critical
final_forces += forces
#pragma omp barrier
#pragma omp single
{
update_atom_positions_and_velocities(
N, atoms, velocities, forces)
physical_properties ( ... Lots of stuff ... )
} // remember, the end of a single implies a barrier
end loop
} // end of OpenMP parallel region
These SPMD algorithms work for OpenMP programs as well. All of the basic functions remain the same. The top-level program is changed to reflect the needs of OpenMP. This is shown in Fig. 5.11. The loop over time is placed inside a single parallel region. The parallel region is created with the parallel pragma:
#pragma omp parallel private (ID, num_UEs, local_atoms, forces)
This pragma causes a team of threads to be created, with each member of the team executing the loop over time. The private clause causes copies of the listed variables to be created for each UE. The reduction is carried out in a critical section:
#pragma omp critical
final_forces += forces
A reduction clause on the parallel region cannot be used in this case because the result would not be available until the parallel region completes. The critical section produces the correct result, but the algorithm used has a runtime that is linear in the number of UEs and is hence suboptimal relative to other reduction algorithms as discussed in the Implementation Mechanisms design space. On systems with a modest number of processors, however, the reduction with a critical section works adequately.
The barrier following the critical section is required to make sure the reduction completes before the atomic positions and velocities are updated. We then use an OpenMP single construct to cause only one UE to do the update. An additional barrier is not needed following the single since the close of a single construct implies a barrier. The functions used to compute the forces are unchanged between the OpenMP and MPI versions of the program.
Mandelbrot set computation
Consider the well-known Mandelbrot set [Dou86]. We discussed this problem and its parallelization as a task-parallel problem in the Task Parallelism pattern. Each pixel is colored based on the behavior of the quadratic recurrence relation in Eq. 5.4.

Equation 5.4

   Z_{n+1} = Z_n^2 + C

C and Z are complex numbers, and the recurrence is started with Z_0 = C. The image plots the imaginary part of C on the vertical axis (-1.5 to 1.5) and the real part on the horizontal axis (-1 to 2). The color of each pixel is black if the recurrence relation converges to a stable value, or is colored depending on how rapidly the relation diverges.
In the Task Parallelism pattern, we described a parallel algorithm where each task corresponds to the computation of a row in the image. A static schedule with more tasks than UEs makes it possible to achieve an effective statistical balance of the load among nodes. We will show how to solve this problem using the SPMD pattern with MPI.
Pseudocode for the sequential version of this code is shown in Fig. 5.12. The interesting part of the problem is hidden inside the routine compute_Row(). Because the details of this routine are not important for understanding the parallel algorithm, we will not show them here. At a high level, the following happens for each point in the row.
Figure 5.12. Pseudocode for a sequential version of the Mandelbrot set generation
program
Int const Nrows // number of rows in the image
Int const RowSize // number of pixels in a row
Int const M // number of colors in color map
Real :: conv // divergence rate for a pixel
Array of Int :: color_map (M) // pixel color based on conv rate
Array of Int :: row (RowSize) // Pixels to draw
Array of Real :: ranges(2) // ranges in X and Y dimensions
- Each pixel corresponds to a value of C in the quadratic recurrence. We compute this value based on the input range and the pixel indices.
- We then compute the terms in the recurrence and set the value of the pixel based on whether it converges to a fixed value or diverges. If it diverges, we set the pixel value based on the rate of divergence.

Once computed, the rows are plotted to make the well-known Mandelbrot set images. The colors used for the pixels are determined by mapping divergence rates onto a color map.
An SPMD program based on this algorithm is straightforward; code is shown in Fig. 5.13. We will assume the computation is being carried out on some sort of distributed-memory machine (a cluster or even an MPP) and that there is one machine that serves as the interactive graphics node, while the others are restricted to computation. We will assume that the graphics node is the one with rank 0.
The program starts with the usual MPI setup, as described in the MPI appendix, Appendix B. The UE with rank 0 takes input from the user and then broadcasts this to the other UEs. It then loops over the number of rows in the image, receiving rows as they finish and plotting them. UEs with rank other than 0 use a cyclic distribution of loop iterations and send the rows to the graphics UE as they finish.
Known uses
The overwhelming majority of MPI programs use this pattern. Pedagogically oriented discussions of SPMD programs and examples can be found in MPI textbooks such as [GLS99] and [Pac96]. Representative applications using this pattern include quantum chemistry [WSG95], finite element methods [ABKP03, KLK+03], and 3D gas dynamics [MHC+99].
Figure 5.13. Pseudocode for a parallel MPI version of the Mandelbrot set generation
program
#include <mpi.h>
Int const Nrows // number of rows in the image
Int const RowSize // number of pixels in a row
Int const M // number of colors in color map
Real :: conv // divergence rate for a pixel
Array of Int :: color_map (M) // pixel color based on conv rate
Array of Int :: row (RowSize) // Pixels to draw
Array of Real :: ranges(2) // ranges in X and Y dimensions
Int :: inRowSize // size of received row
Int :: ID // ID of each UE (process)
Int :: num_UEs // number of UEs (processes)
Int :: nworkers // number of UEs computing rows
MPI_Status :: stat // MPI status parameter
MPI_Init()
MPI_Comm_size(MPI_COMM_WORLD, &num_UEs)
MPI_Comm_rank(MPI_COMM_WORLD, &ID)
// Algorithm requires at least two UEs since we are
// going to dedicate one to graphics
if (num_UEs < 2) MPI_Abort(MPI_COMM_WORLD, 1)
if (ID == 0 ){
manage_user_input(ranges, color_map) // input ranges, color map
initialize_graphics(RowSize, Nrows, M, ranges, color_map)
}
// Broadcast data from rank 0 process to all other processes
MPI_Bcast (ranges, 2, MPI_REAL, 0, MPI_COMM_WORLD);
if (ID == 0) { // UE with rank 0 does graphics
for (int i = 0; i<Nrows; i++){
MPI_Recv(row, RowSize, MPI_REAL, MPI_ANY_SOURCE,
MPI_ANY_TAG, MPI_COMM_WORLD, &stat)
MPI_Get_count(&stat, MPI_REAL, &inRowSize) // size of received row
row_index = stat.MPI_TAG
graph(row_index, RowSize, M, color_map, ranges, row)
} // end loop over i
}
else { // The other UEs compute the rows
nworkers = num_UEs - 1
for (int i = ID-1; i<Nrows; i+=nworkers){
compute_Row (RowSize, ranges, row)
MPI_Send (row, RowSize, MPI_REAL, 0, i, MPI_COMM_WORLD);
} // end loop over i
}
MPI_Finalize()
Examples of the SPMD pattern in combination with the Distributed Array pattern include the GAMESS quantum chemistry program [OSG03] and the ScaLAPACK library [BCC+97, Sca].
Related Patterns
The SPMD pattern is very general and can be used to implement other patterns. Many of the examples in the text of this pattern are closely related to the Loop Parallelism pattern. Most applications of the Geometric Decomposition pattern with MPI use the SPMD pattern as well. The Distributed Array pattern is essentially a special case of distributing data for programs using the SPMD pattern.
- The workloads associated with the tasks are highly variable and unpredictable. If workloads are predictable, they can be sorted into equal-cost bins, statically assigned to UEs, and parallelized using the SPMD or Loop Parallelism patterns. But if they are unpredictable, static distributions tend to produce suboptimal load balance.
- The program structure for the computationally intensive portions of the problem doesn't map onto simple loops. If the algorithm is loop-based, one can usually achieve a statistically near-optimal workload by a cyclic distribution of iterations or by using a dynamic schedule on the loop (for example, in OpenMP, by using the schedule(dynamic) clause). But if the control structure in the program is more complex than a simple loop, more general approaches are required.
- The capabilities of the PEs available for the parallel computation vary across the parallel system, change over the course of the computation, or are unpredictable.
In some cases, tasks are tightly coupled (that is, they communicate or share read and write data) and must be active at the same time. In this case, the Master/Worker pattern is not applicable: The programmer has no choice but to explicitly size or group tasks onto UEs dynamically (that is, during the computation) to achieve an effective load balance. The logic to accomplish this can be difficult to implement and, if one is not careful, can add prohibitively large parallel overhead.
If the tasks are independent of each other, however, or if the dependencies can somehow be pulled out from the concurrent computation, the programmer has much greater flexibility in how to balance the load. This allows the load balancing to be done automatically and is the situation we address in this pattern.
This pattern is particularly relevant for problems using the Task Parallelism pattern when there are no dependencies among the tasks (embarrassingly parallel problems). It can also be used with the Fork/Join pattern for the cases where the mapping of tasks onto UEs is indirect.
Forces
- The work for each task, and in some cases even the capabilities of the PEs, varies unpredictably in these problems. Hence, explicit predictions of the runtime for any given task are not possible, and the design must balance the load without them.
- Operations to balance the load impose communication overhead and can be very expensive. This suggests that scheduling should revolve around a smaller number of large tasks. However, large tasks reduce the number of ways tasks can be partitioned among the PEs, thereby making it more difficult to achieve good load balance.
- Logic to produce an optimal load can be convoluted and require error-prone changes to a program. Programmers need to make tradeoffs between the desire for an optimal distribution of the load and code that is easy to maintain.
Solution
The well-known Master/Worker pattern is a good solution to this problem. This pattern is summarized in Fig. 5.14. The solution consists of two logical elements: a master and one or more instances of a worker. The master initiates the computation and sets up the problem. It then creates a bag of tasks. In the classic algorithm, the master then waits until the job is done, consumes the results, and then shuts down the computation.
Figure 5.14. The two elements of the Master/Worker pattern are the master and the
worker. There is only one master, but there can be one or more workers. Logically, the
master sets up the calculation and then manages a bag of tasks. Each worker grabs a
task from the bag, carries out the work, and then goes back to the bag, repeating
until the termination condition is met.
A straightforward approach to implementing the bag of tasks is with a single shared queue as described in the Shared Queue pattern. Many other mechanisms for creating a globally accessible structure where tasks can be inserted and removed are possible, however. Examples include a tuple space [CG91, FHA99], a distributed queue, or a monotonic counter (when the tasks can be specified with a set of contiguous integers).
Meanwhile, each worker enters a loop. At the top of the loop, the worker takes a task from the bag of tasks, does the indicated work, tests for completion, and then goes to fetch the next task. This continues until the termination condition is met, at which time the master wakes up, collects the results, and finishes the computation.
Master/worker algorithms automatically balance the load. By this, we mean the programmer does not explicitly decide which task is assigned to which UE. This decision is made dynamically by the master as a worker completes one task and accesses the bag of tasks for more work.
Discussion
Master/worker algorithms have good scalability as long as the number of tasks greatly exceeds the number of workers and the costs of the individual tasks are not so variable that some workers take drastically longer than the others.
Management of the bag of tasks can require global communication, and the overhead for this can limit efficiency. This effect is not a problem when the work associated with the tasks is on average much greater than the time required for management. In some cases, the designer might need to increase the size of each task to decrease the number of times the global task bag needs to be accessed.
The Master/Worker pattern is not tied to any particular hardware environment. Programs using this pattern work well on everything from clusters to SMP machines. It is, of course, beneficial if the programming environment provides support for managing the bag of tasks.
Detecting completion
One of the challenges in working with master/worker programs is to correctly determine when the entire problem is complete. This needs to be done in a way that is efficient but also guarantees that all of the work is complete before workers shut down.
- In the simplest case, all tasks are placed in the bag before the workers begin. Each worker then continues until the bag is empty, at which point the workers terminate.
- Another approach is to use a queue to implement the task bag and arrange for the master or a worker to check for the desired termination condition. When it is detected, a poison pill, a special task that tells the workers to terminate, is created. The poison pill must be placed in the bag in such a way that it will be picked up on the next round of work. Depending on how the set of shared tasks is managed, it may be necessary to create one poison pill for each remaining worker to ensure that all workers receive the termination condition.
- Problems for which the set of tasks is not known initially pose unique challenges. This occurs, for example, when workers can add tasks as well as consume them (such as in applications of the Divide and Conquer pattern). In this case, it is not necessarily true that when a worker finishes a task and finds the task bag empty there is no more work to do; another still-active worker could generate a new task. One must therefore ensure that the task bag is empty and all workers are finished. Further, in systems based on asynchronous message passing, it must be determined that there are no messages in transit that could, on their arrival, result in the creation of a new task. There are many known algorithms that solve this problem. For example, suppose the tasks are conceptually organized into a tree, where the root is the master task, and the children of a task are the tasks it generates. When all of the children of a task have terminated, the parent task can terminate. When all the children of the master task have terminated, the computation has terminated. Algorithms for termination detection are described in [BT89, Mat87, DS80].
Variations
There are several variations on this pattern. Because of the simple way it implements dynamic load balancing, this pattern is very popular, especially in embarrassingly parallel problems (as described in the Task Parallelism pattern). Here are a few of the more common variations.
- The master may turn into a worker after it has created the tasks. This is an effective technique when the termination condition can be detected without explicit action by the master (that is, the tasks can detect the termination condition on their own from the state of the bag of tasks).
- When the concurrent tasks map onto a simple loop, the master can be implicit and the pattern can be implemented as a loop with dynamic iteration assignment, as described in the Loop Parallelism pattern.
- A centralized task queue can become a bottleneck, especially in a distributed-memory environment. An optimal solution [FLR98] is based on random work stealing. In this approach, each PE maintains a separate double-ended task queue. New tasks are placed in the front of the task queue of the local PE. When a task is completed, a subproblem is removed from the front of the local task queue. If the local task queue is empty, then another PE is chosen randomly, and a subproblem from the back of its task queue is "stolen". If that queue is also empty, then the PE tries again with another randomly chosen PE. This is particularly effective when used in problems based on the Divide and Conquer pattern. In this case, the tasks at the back of the queue were inserted earlier and hence represent larger subproblems. Thus, this approach tends to move large subproblems while handling the finer-grained subproblems at the PE where they were created. This helps the load balance and reduces overhead for the small tasks created in the deeper levels of recursion.
- The Master/Worker pattern can be modified to provide a modest level of fault tolerance [BDK95]. The master maintains two queues: one for tasks that still need to be assigned to workers and another for tasks that have already been assigned, but not completed. After the first queue is empty, the master can redundantly assign tasks from the "not completed" queue. Hence, if a worker dies and therefore can't complete its tasks, another worker will cover the unfinished tasks.
Examples
We will start with a generic description of a simple master/worker problem and then provide a detailed example of using the Master/Worker pattern in the parallel implementation of a program to generate the Mandelbrot set. Also see the Examples section of the Shared Queue pattern, which illustrates the use of shared queues by developing a master/worker implementation of a simple Java framework for programs using the Fork/Join pattern.
Generic solutions
The key to the master/worker program is the structure that holds the bag of tasks. The code in this section uses a task queue. We implement the task queue as an instance of the Shared Queue pattern. The master process, shown in Fig. 5.15, initializes the task queue, representing each task by an integer. It then uses the Fork/Join pattern to create the worker processes or threads and waits for them to complete. When they have completed, it consumes the results.
Figure 5.15. Master process for a master/worker program. This assumes a shared
address space so the task and results queues are visible to all UEs. In this simple
version, the master initializes the queue, launches the workers, and then waits for the
workers to finish (that is, the ForkJoin command launches the workers and then waits
for them to finish before returning). At that point, results are consumed and the
computation completes.
Int const Ntasks // Number of tasks
Int const Nworkers // Number of workers
SharedQueue :: task_queue; // task queue
SharedQueue :: global_results; // queue to hold results
void master()
{
   void worker();   // forward declaration of the worker function

   // Create and initialize shared data structures
   task_queue = new SharedQueue()
   global_results = new SharedQueue()
   for (int i = 0; i < Ntasks; i++)
      enqueue(task_queue, i)

   // Create Nworkers threads executing function worker()
   ForkJoin (Nworkers, worker)

   consume_the_results (Ntasks)
}
The worker, shown in Fig. 5.16, loops until the task queue is empty. Every time through the loop, it takes the next task and does the indicated work, storing the results in a global results queue. When the task queue is empty, the worker terminates.
Note that we ensure safe access to the key shared variables (task_queue and global_results) by using instances of the Shared Queue pattern.
For programs written in Java, a thread-safe queue can be used to hold Runnable objects that are executed by a set of threads whose run methods behave like the worker threads described previously: removing a Runnable object from the queue and executing its run method. The Executor interface in the java.util.concurrent package in Java 2 1.5 provides direct support for the Master/Worker pattern. Classes implementing the interface provide an execute method that takes a Runnable object and arranges for its execution. Different implementations of the Executor interface provide different ways of managing the Thread objects that actually do the work. The ThreadPoolExecutor implements the Master/Worker pattern by using a fixed pool of threads to execute the commands. To use an Executor, the program instantiates an instance of a class implementing the interface, usually using a factory method in the Executors class. For example, the code in Fig. 5.17 sets up a ThreadPoolExecutor that creates num_threads threads. These threads execute tasks specified by Runnable objects that are placed in an unbounded queue.
After the Executor has been created, a Runnable object whose run method specifies the behavior of the task can be passed to the execute method, which arranges for its execution. For example, assume the Runnable object is referred to by a variable task. Then for the executor defined previously, exec.execute(task); will place the task in the queue, where it will eventually be serviced by one of the executor's worker threads.
Figure 5.17. Instantiating and initializing a pooled executor
/*create a ThreadPoolExecutor with an unbounded queue*/
Executor exec = Executors.newFixedThreadPool(num_threads);
The Master/Worker pattern can also be used with SPMD programs and MPI. Maintaining the global queues is more challenging, but the overall algorithm is the same. A more detailed description of using MPI for shared queues appears in the Implementation Mechanisms design space.
Mandelbrot set generation
Generating the Mandelbrot set is described in detail in the Examples section of the SPMD pattern. The basic idea is to explore a quadratic recurrence relation at each point in a complex plane and color the point based on the rate at which the recursion converges or diverges. Each point in the complex plane can be computed independently, and hence the problem is embarrassingly parallel (see the Task Parallelism pattern).
In Fig. 5.18, we reproduce the pseudocode given in the SPMD pattern for a sequential version of this problem. The program loops over the rows of the image, displaying one row at a time as they are computed.
On homogeneous clusters or lightly loaded shared-memory multiprocessor computers, approaches based on the SPMD or Loop Parallelism patterns are most effective. On a heterogeneous cluster or a multiprocessor system shared among many users (and hence with an unpredictable load on any given PE at any given time), a master/worker approach will be more effective.
We will create a master/worker version of a parallel Mandelbrot program based on the high-level structure described earlier. The master will be responsible for graphing the results. In some problems, the results generated by the workers interact, and it can be important for the master to wait until all the workers have completed before consuming results. In this case, however, the results do not interact, so we split the fork and join operations and have the master plot results as they become available.
Following the Fork, the master must wait for results to appear on the global_results queue. Because we know there will be one result per row, the master knows in advance how many results to fetch, and the termination condition is expressed simply in terms of the number of iterations of the loop. After all the results have been plotted, the master waits at the Join function until all the workers have completed, at which point the master completes. Code is shown in Fig. 5.19. Notice that this code is similar to the generic case discussed earlier, except that we have overlapped the processing of the results with their computation by splitting the Fork and Join. As the names imply, Fork launches UEs running the indicated function and Join causes the master to wait for the workers to cleanly terminate. See the Shared Queue pattern for more details about the queue.
Figure 5.18. Pseudocode for a sequential version of the Mandelbrot set generation
program
Int const Nrows // number of rows in the image
Int const RowSize // number of pixels in a row
Int const M // number of colors in color map
Real :: conv // divergence rate for a pixel
Array of Int :: color_map (M) // pixel color based on conv rate
Array of Int :: row (RowSize) // Pixels to draw
Array of Real :: ranges(2) // ranges in X and Y dimensions
Figure 5.19. Master process for a master/worker parallel version of the Mandelbrot set
generation program
Int const Ntasks // number of tasks
Int const Nworkers // number of workers
Int const Nrows // number of rows in the image
Int const RowSize // number of pixels in a row
Int const M // number of colors in color map
typedef Row :: struct of {
int :: index
array of int :: pixels (RowSize)
} temp_row;
Array of Int :: color_map (M) // pixel color based on conv rate
Array of Real :: ranges(2) // ranges in X and Y dimensions
SharedQueue of Int :: task_queue; // task queue
SharedQueue of Row :: global_results; // queue to hold results
void master()
{
void worker();
manage_user_input(ranges, color_map) // input ranges, color map
initialize_graphics(RowSize, Nrows, M, ranges, color_map)
// Create and initialize shared data structures
task_queue = new SharedQueue();
global_results = new SharedQueue();
for (int i = 0; i < Nrows; i++)
enqueue(task_queue, i);
// Create Nworkers threads executing function worker()
Fork (Nworkers, worker);
// Wait for results and graph them as they appear
for (int i = 0; i< Nrows; i++) {
while (empty(global_results)) { // wait for results
wait
}
temp_row = dequeue(global_results)
graph(temp_row.index, RowSize, M, color_map, ranges, temp_row.pixels)
}
// Terminate the worker UEs
Join (Nworkers);
}
Figure 5.20. Worker process for a master/worker parallel version of the Mandelbrot
set generation program. We assume a shared address space thereby making
task_queue, global_results, and ranges available to the master and the workers.
void worker()
{
Int i, irow;
Row temp_row;
while (!empty(task_queue)) {
irow = dequeue(task_queue);
compute_Row (RowSize, ranges, irow, temp_row.pixels)
temp_row.index = irow
enqueue(global_results, temp_row);
}
}
The code for the worker is much simpler and is shown in Fig. 5.20. First, note that we assume the shared variables such as the queues and computation parameters are globally visible to the master and the workers. Because the queue is filled by the master before forking the workers, the termination condition is simply given by an empty queue. Each worker grabs a row index, does the computation, packs the row index and the computed row into the result queue, and continues until the queue is empty.
Known uses
This pattern is extensively used with the Linda programming environment. The tuple space in Linda is ideally suited to programs that use the Master/Worker pattern, as described in depth in [CG91] and in the survey paper [CGMS94].
The Master/Worker pattern is used in many distributed computing environments because these systems must deal with extreme levels of unpredictability in the availability of resources. The SETI@home project [SET] uses the Master/Worker pattern to utilize volunteers' Internet-connected computers to download and analyze radio telescope data as part of the Search for Extraterrestrial Intelligence (SETI). Programs constructed with the Calypso system [BDK95], a distributed computing framework which provides system support for dynamic changes in the set of PEs, also use the Master/Worker pattern. A parallel algorithm for detecting repeats in genomic data [RHB03] uses the Master/Worker pattern with MPI on a cluster of dual-processor PCs.
Related Patterns
This pattern is closely related to the Loop Parallelism pattern when the loops utilize some form of dynamic scheduling (such as when the schedule(dynamic) clause is used in OpenMP). Implementations of the Fork/Join pattern sometimes use the Master/Worker pattern behind the scenes.
This pattern is also closely related to algorithms that make use of the nextval function from TCGMSG [Har91, WSG95, LDSH95]. The nextval function implements a monotonic counter. If the bag of tasks can be mapped onto a fixed range of monotonic indices, the counter provides the bag of tasks, and the function of the master is implied by the counter.
Finally, the owner-computes filter discussed in the molecular dynamics example in the SPMD pattern is essentially a variation on the master/worker theme. In such an algorithm, all the master would do is set up the bag of tasks (loop iterations) and assign them to UEs, with the assignment of tasks to UEs defined by the filter. Because the UEs can essentially perform this assignment themselves (by examining each task with the filter), no explicit master is needed.
Context
The overwhelming majority of programs used in scientific and engineering applications are expressed in terms of iterative constructs; that is, they are loop-based. Optimizing these programs by focusing strictly on the loops is a tradition dating back to the older vector supercomputers. Extending this approach to modern parallel computers suggests a parallel algorithm strategy in which concurrent tasks are identified as iterations of parallelized loops.
The advantage of structuring a parallel algorithm around parallelized loops is particularly important in problems for which well-accepted programs already exist. In many cases, it isn't practical to massively restructure an existing program to gain parallel performance. This is particularly important when the program (as is frequently the case) contains convoluted code and poorly understood algorithms.
This pattern addresses ways to structure loop-based programs for parallel computation. When existing code is available, the goal is to "evolve" a sequential program into a parallel program by a series of transformations on the loops. Ideally, all changes are localized to the loops, with transformations that remove loop-carried dependencies and leave the overall program semantics unchanged. (Such transformations are called semantically neutral transformations.)
Not all problems can be approached in this loop-driven manner. Clearly, it will only work when the algorithm structure has most, if not all, of the computationally intensive work buried in a manageable number of distinct loops. Furthermore, the body of the loop must result in loop iterations that work well as parallel tasks (that is, they are computationally intensive, express sufficient concurrency, and are mostly independent).
Not all target computer systems align well with this style of parallel programming. If the code cannot be restructured to create effective distributed data structures, some level of support for a shared address space is essential in all but the most trivial cases. Finally, Amdahl's law and its requirement to minimize a program's serial fraction often mean that loop-based approaches are only effective for systems with smaller numbers of PEs.
Even with these restrictions, this class of parallel algorithms is growing rapidly. Because loop-based algorithms are the traditional approach in high-performance computing and are still dominant in new programs, there is a large backlog of loop-based programs that need to be ported to modern parallel computers. The OpenMP API was created primarily to support parallelization of these loop-driven problems. Limitations on the scalability of these algorithms are serious, but acceptable, given that there are orders of magnitude more machines with two or four processors than machines with dozens or hundreds of processors.
This pattern is particularly relevant for OpenMP programs running on shared-memory computers and for problems using the Task Parallelism and Geometric Decomposition patterns.
Forces
Sequentialequivalence.Aprogramthatyieldsidenticalresults(exceptforroundofferrors)
whenexecutedwithonethreadormanythreadsissaidtobesequentiallyequivalent(also
knownasseriallyequivalent).Sequentiallyequivalentcodeiseasiertowrite,easierto
maintain,andletsasingleprogramsourcecodeworkforserialandparallelmachines.
Incrementalparallelism(orrefactoring).Whenparallelizinganexistingprogram,itismuch
easiertoendupwithacorrectparallelprogramif(1)theparallelizationisintroducedasa
sequenceofincrementaltransformations,oneloopatatime,and(2)thetransformationsdon't
"break"theprogram,allowingtestingtobecarriedoutaftereachtransformation.
Memory utilization. Good performance requires that the data access patterns implied by the loops mesh well with the memory hierarchy of the system. This can be at odds with the previous two forces, causing a programmer to massively restructure loops.
Solution
This pattern is closely aligned with the style of parallel programming implied by OpenMP. The basic approach consists of the following steps.
Find the bottlenecks. Locate the most computationally intensive loops either by inspection of the code, by understanding the performance needs of each subproblem, or through the use of program performance analysis tools. The amount of total runtime on representative data sets contained by these loops will ultimately limit the scalability of the parallel program (see Amdahl's law).
Eliminate loop-carried dependencies. The loop iterations must be nearly independent. Find dependencies between iterations or read/write accesses and transform the code to remove or mitigate them. Finding and removing the dependencies is discussed in the Task Parallelism pattern, while protecting dependencies with synchronization constructs is discussed in the Shared Data pattern.
Parallelize the loops. Split up the iterations among the UEs. To maintain sequential equivalence, use semantically neutral directives such as those provided with OpenMP (as described in the OpenMP appendix, Appendix A). Ideally, this should be done one loop at a time, with testing and careful inspection carried out at each point to make sure race conditions or other errors have not been introduced.
Optimize the loop schedule. The iterations must be scheduled for execution by the UEs so the load is evenly balanced. Although the right schedule can often be chosen based on a clear understanding of the problem, frequently it is necessary to experiment to find the optimal schedule.
This approach is only effective when the compute times for the loop iterations are large enough to compensate for parallel loop overhead. The number of iterations per loop is also important, because having many iterations per UE provides greater scheduling flexibility. In some cases, it might be necessary to transform the code to address these issues.
Two transformations commonly used are the following:
Merge loops. If a problem consists of a sequence of loops that have consistent loop limits, the loops can often be merged into a single loop with more complex loop iterations, as shown in Fig. 5.21.
Coalesce nested loops. Nested loops can often be combined into a single loop with a larger combined iteration count, as shown in Fig. 5.22. The larger number of iterations can help overcome parallel loop overhead by (1) creating more concurrency to better utilize larger numbers of UEs, and (2) providing additional options for how the iterations are scheduled onto UEs.
Parallelizing the loops is easily done with OpenMP by using the omp parallel for directive. This directive tells the compiler to create a team of threads (the UEs in a shared-memory environment) and to split up loop iterations among the team. The last loop in Fig. 5.22 is an example of a loop parallelized with OpenMP. We describe this directive at a high level in the Implementation Mechanisms design space. Syntactic details are included in the OpenMP appendix, Appendix A.
Notice that in Fig. 5.22 we had to direct the system to create copies of the indices i and j local to each thread. The single most common error in using this pattern is to neglect to "privatize" key variables. If i and j are shared, then updates of i and j by different UEs can collide and lead to unpredictable results (that is, the program will contain a race condition). Compilers usually will not detect these errors, so programmers must take great care to make sure they avoid these situations.
Figure 5.21. Program fragment showing merging loops to increase the amount of work
per iteration
#define N 20
#define Npoints 512

void setH  (int, double*);   // initialize the filter H
void FFT   (int, double*, double*);
void filter(int, double*, double*);
void invFFT(int, double*, double*);

int main() {
   int i, j;
   double A[Npoints], B[Npoints], C[Npoints], H[Npoints];
   setH(Npoints, H);

   // do a bunch of work resulting in values for A and C

   // method one: distinct loops to compute A and C
   for(i=0; i<N; i++){
      FFT   (Npoints, A, B);  // B = transformed A
      filter(Npoints, B, H);  // B = B filtered with H
      invFFT(Npoints, B, A);  // A = inv transformed B
   }
   for(i=0; i<N; i++){
      FFT   (Npoints, C, B);  // B = transformed C
      filter(Npoints, B, H);  // B = B filtered with H
      invFFT(Npoints, B, C);  // C = inv transformed B
   }

   // method two: the two loops merged into a single loop
   // with more work per iteration
   for(i=0; i<N; i++){
      FFT   (Npoints, A, B);  // B = transformed A
      filter(Npoints, B, H);  // B = B filtered with H
      invFFT(Npoints, B, A);  // A = inv transformed B
      FFT   (Npoints, C, B);  // B = transformed C
      filter(Npoints, B, H);  // B = B filtered with H
      invFFT(Npoints, B, C);  // C = inv transformed B
   }
   return 0;
}
The key to the application of this pattern is to use semantically neutral modifications to produce sequentially equivalent code. A semantically neutral modification doesn't change the meaning of the single-threaded program. Techniques for loop merging and coalescing of nested loops described previously, when used appropriately, are examples of semantically neutral modifications. In addition, most of the directives in OpenMP are semantically neutral. This means that adding the directive and running it with a single thread will give the same result as running the original program without the OpenMP directive.
Two programs that are semantically equivalent (when run with a single thread) need not both be sequentially equivalent. Recall that sequentially equivalent means that the program will give the same result (subject to roundoff errors due to changing the order of floating-point operations) whether run with one thread or many. Indeed, the (semantically neutral) transformations that eliminate loop-carried dependencies are motivated by the desire to change a program that is not sequentially equivalent into one that is. When transformations are made to improve performance, even though the transformations are semantically neutral, one must be careful that sequential equivalence has not been lost.
Figure 5.22. Program fragment showing coalescing nested loops to produce a single
loop with a larger number of iterations
#define N 20
#define M 10

extern double work(int, int); // a time-consuming function

int main() {
   int i, j, ij;
   double A[N][M];

   // method one: nested loops
   for(j=0; j<N; j++){
      for(i=0; i<M; i++){
         A[j][i] = work(i,j);
      }
   }

   // method two: the above pair of nested loops combined into
   // a single loop.
   for(ij=0; ij<N*M; ij++){
      j = ij/M;
      i = ij%M;
      A[j][i] = work(i,j);
   }

   // method three: the above loop parallelized with OpenMP.
   // The omp pragma creates a team of threads and maps
   // loop iterations onto them. The private clause
   // tells each thread to maintain local copies of ij, j, and i.
   #pragma omp parallel for private(ij, j, i)
   for(ij=0; ij<N*M; ij++){
      j = ij/M;
      i = ij%M;
      A[j][i] = work(i,j);
   }
   return 0;
}
It is much more difficult to define sequentially equivalent programs when the code mentions either a thread ID or the number of threads. Algorithms that reference thread IDs and the number of threads tend to favor particular threads or even particular numbers of threads, a situation that is dangerous when the goal is a sequentially equivalent program.
When an algorithm depends on the thread ID, the programmer is using the SPMD pattern. This may be confusing. SPMD programs can be loop-based. In fact, many of the examples in the SPMD pattern are indeed loop-based algorithms. But they are not instances of the Loop Parallelism pattern, because they display the hallmark trait of an SPMD program; namely, they use the UE ID to guide the algorithm.
Finally, we've assumed that a directive-based system such as OpenMP is available when using this pattern. It is possible, but clearly more difficult, to apply this pattern without such a directive-based programming environment. For example, in object-oriented designs, one can use the Loop Parallelism pattern by making clever use of anonymous classes with parallel iterators. Because the parallelism is buried in the iterators, the conditions of sequential equivalence can be met.
Performance considerations
In almost every application of this pattern, especially when used with OpenMP, the assumption is made that the program will execute on a computer that has multiple PEs sharing a single address space. This address space is assumed to provide equal-time access to every element of memory. Unfortunately, this is usually not the case. Memories on modern computers are hierarchical. There are caches on the PEs, memory modules packaged with subsets of PEs, and other complications. While great effort is made in designing shared-memory multiprocessor computers to make them act like symmetric multiprocessor (SMP) computers, the fact is that all shared-memory computers display some degree of nonuniformity in memory access times across the system. In many cases, these effects are of secondary concern, and we can ignore how a program's memory access patterns match up with the target system's memory hierarchy. In other cases, particularly on larger shared-memory machines, programs must be explicitly organized according to the needs of the memory hierarchy. The most common trick is to make sure the data access patterns during initialization of key data structures match those during later computation using these data structures. This is discussed in more detail in [Mat03, NA01] and later in this pattern as part of the mesh computation example.
Another performance problem is false sharing. This occurs when variables are not shared between UEs, but happen to reside on the same cache line. Hence, even though the program semantics implies independence, each access by each UE requires movement of a cache line between UEs. This can create huge overheads as cache lines are repeatedly invalidated and moved between UEs as these supposedly independent variables are updated. An example of a program fragment that would incur high levels of false sharing is shown in Fig. 5.23. In this code, we have a pair of nested loops. The outermost loop has a small iteration count that will map onto the number of UEs (which we assume is four in this case). The innermost loop runs over a large number of time-consuming iterations. Assuming the iterations of the innermost loop are roughly equal, this loop should parallelize effectively. But the updates to the elements of the A array inside the innermost loop mean each update requires the UE in question to own the indicated cache line. Although the elements of A are truly independent between UEs, they likely sit in the same cache line. Hence, every iteration in the innermost loop incurs an expensive cache-line invalidate and movement operation. It is not uncommon for this to not only destroy all parallel speedup, but to even cause the parallel program to become slower as more PEs are added. The solution is to create a temporary variable on each thread to accumulate values in the innermost loop. False sharing is still a factor, but only for the much smaller outermost loop, where the performance impact is negligible.
Figure 5.23. Program fragment showing an example of false sharing. The small array A is held in one or two cache lines. As the UEs access A inside the innermost loop, they will need to take ownership of the cache line back from the other UEs. This back-and-forth movement of the cache lines destroys performance. The solution is to use a temporary variable inside the innermost loop.
#include <omp.h>
#define N 4 // Assume this equals the number of UEs
#define M 1000
extern double work(int, int); // a time-consuming function
int main() {
int i, j;
double A[N] = {0.0}; // Initialize the array to zero
// method one: a loop with false sharing from A since the elements
// of A are likely to reside in the same cache line.
#pragma omp parallel for private(j,i)
for(j=0; j<N; j++){
for(i=0; i<M; i++){
A[j] += work(i,j);
   }
}

// method two: remove the false sharing by using a temporary
// private variable in the innermost loop
double temp;
#pragma omp parallel for private(j,i, temp)
for(j=0; j<N; j++){
temp = 0.0;
for(i=0; i<M;i++){
temp += work(i,j);
}
A[j] += temp;
}
return 0;
}
Examples
As examples of this pattern in action, we will briefly consider the following:

Numerical integration to estimate the value of a definite integral using the trapezoid rule

Molecular dynamics, nonbonded energy computation

Mandelbrot set computation

Mesh computation

Each of these examples has been described elsewhere in detail. We will restrict our discussion in this pattern to the key loops and how they can be parallelized.
Numerical integration
Consider the problem of estimating the value of π using Eq. 5.5.

Equation 5.5

π = ∫₀¹ 4/(1 + x²) dx

We use trapezoidal integration to numerically solve the integral. The idea is to fill the area under a curve with a series of rectangles. As the width of the rectangles approaches 0, the sum of the areas of the rectangles approaches the value of the integral.
A program to carry out this calculation on a single processor is shown in Fig. 5.24. To keep the program as simple as possible, we fix the number of steps to use in the integration at 1,000,000. The variable sum is initialized to 0 and the step size is computed as the range in x (equal to 1.0 in this case) divided by the number of steps. The area of each rectangle is the width (the step size) times the height (the value of the integrand at the center of the interval). The width is a constant, so we pull it out of the summation and multiply the sum of the rectangle heights by the step size, step, to get our estimate of the definite integral.
Figure 5.24. Sequential program to carry out a trapezoid rule integration to compute π
#include <stdio.h>
#include <math.h>
int main () {
   int i;
   int num_steps = 1000000;
   double x, pi, step, sum = 0.0;
   step = 1.0/(double) num_steps;

   for (i=0; i<num_steps; i++){
      x = (i+0.5)*step;        // center of interval i
      sum += 4.0/(1.0+x*x);    // height of the rectangle
   }
   pi = step * sum;            // width times summed heights
   printf("pi = %f\n", pi);
   return 0;
}
Creating a parallel version of this program using the Loop Parallelism pattern is simple. There is only one loop, so the inspection phase is trivial. To make the loop iterations independent, we recognize that (1) the values of the variable x are local to each iteration, so this variable can be handled as a thread-local or private variable, and (2) the updates to sum define a reduction. Reductions are supported by the OpenMP API. Other than adding #include <omp.h>[2], only one additional line of code is needed to create a parallel version of the program. The following is placed above the for loop:
[2] The OpenMP include file defines function prototypes and opaque data types used by OpenMP.
#pragma omp parallel for private(x) reduction(+:sum)
The pragma tells an OpenMP compiler to (1) create a team of threads, (2) create a private copy of x and sum for each thread, (3) initialize sum to 0 (the identity operand for addition), (4) map loop iterations onto threads in the team, (5) combine local values of sum into a single global value, and (6) join the parallel threads with the single master thread. Each of these steps is described in detail in the Implementation Mechanisms design space and the OpenMP appendix, Appendix A. For a non-OpenMP compiler, this pragma is ignored and therefore has no effect on the program's behavior.
Molecular dynamics.
Throughout this book, we have used molecular dynamics as a recurring example. Molecular dynamics simulates the motions of a large molecular system. It uses an explicit time-stepping methodology where at each time step the force on each atom is computed, and standard techniques from classical mechanics are used to compute how the forces change atomic motions.
The core algorithm, including pseudocode, was presented in Sec. 3.1.3 and in the SPMD pattern. The problem comes down to a collection of computationally expensive loops over the atoms within the molecular system. These are embedded in a top-level loop over time.
The loop over time cannot be parallelized because the coordinates and velocities from timestep t-1 are the starting point for timestep t. The individual loops over atoms, however, can be parallelized.
The most important case to address is the nonbonded energy calculation. The code for this computation is shown in Fig. 5.25. Unlike the approach used in the examples from the SPMD pattern, we assume that the program and its data structures are unchanged from the serial case.
Figure 5.25. Pseudocode for the nonbonded computation in a typical parallel molecular dynamics code. This code is almost identical to the sequential version of the function shown previously in Fig. 4.4.
function non_bonded_forces (N, Atoms, neighbors, Forces)
   Int N                           // number of atoms
   Array of Real :: atoms (3,N)    // 3D coordinates
   Array of Real :: forces (3,N)   // force in each dimension
   Array of List :: neighbors(N)   // atoms in cutoff volume
   Real :: forceX, forceY, forceZ

   loop [i] over atoms
      loop [j] over neighbors(i)
         forceX = non_bond_force(atoms(1,i), atoms(1,j))
         forceY = non_bond_force(atoms(2,i), atoms(2,j))
         forceZ = non_bond_force(atoms(3,i), atoms(3,j))
         force(1,i) += forceX;  force(1,j) -= forceX;
         force(2,i) += forceY;  force(2,j) -= forceY;
         force(3,i) += forceZ;  force(3,j) -= forceZ;
      end loop [j]
   end loop [i]
end function non_bonded_forces
The work associated with each atom varies unpredictably depending on how many atoms are in "its neighborhood". Although the compiler might be able to guess an effective schedule, in cases such as this one it is usually best to try different schedules to find the one that works best. The work per atom is unpredictable, so one of the dynamic schedules available with OpenMP (and described in the OpenMP appendix, Appendix A) should be used. This requires the addition of a single schedule clause. Doing so gives us our final pragma for parallelizing this program:
#pragma omp parallel for private(j, forceX, forceY, forceZ) \
        reduction(+:force) schedule(dynamic,10)
This schedule tells the compiler to group the loop iterations into blocks of size 10 and assign them dynamically to the UEs. The size of the blocks is arbitrary and chosen to balance dynamic scheduling overhead against how effectively the load can be balanced.
OpenMP 2.0 for C/C++ does not support reductions over arrays, so the reduction would need to be done explicitly. This is straightforward and is shown in Fig. 5.11. A future release of OpenMP will correct this deficiency and support reductions over arrays for all languages that support OpenMP.
The same method used to parallelize the nonbonded force computation could be used throughout the molecular dynamics program. The performance and scalability will lag the analogous SPMD version of the program. The problem is that each time a parallel directive is encountered, a new team of threads is in principle created. Most OpenMP implementations use a thread pool, rather than actually creating a new team of threads for each parallel region, which minimizes thread creation and destruction overhead. However, this method of parallelizing the computation still adds significant overhead. Also, the reuse of data from caches tends to be poor for these approaches. In principle, each loop can access a different pattern of atoms on each UE. This eliminates the capability for UEs to make effective use of values already in cache.
Even with these shortcomings, however, these approaches are commonly used when the goal is extra parallelism on a small shared-memory system [BBE+99]. For example, one might use an SPMD version of the molecular dynamics program across a cluster and then use OpenMP to gain extra performance from dual processors or from microprocessors utilizing simultaneous multithreading [MPS02].
Mandelbrot set computation

Consider the well-known Mandelbrot set [Dou86]. We discuss this problem and its parallelization as a task-parallel problem in the Task Parallelism and SPMD patterns. Each pixel is colored based on the behavior of the quadratic recurrence relation in Eq. 5.6.
Equation 5.6

Z(k+1) = Z(k)² + C

C and Z are complex numbers, and the recurrence is started with Z0 = C. The image plots the imaginary part of C on the vertical axis (-1.5 to 1.5) and the real part on the horizontal axis (-1 to 2). The color of each pixel is black if the recurrence relation converges to a stable value, or else is colored depending on how rapidly the relation diverges.
Pseudocode for the sequential version of this code is shown in Fig. 5.26. The interesting part of the problem is hidden inside the routine compute_Row(). The details of this routine are not important for understanding the parallel algorithm, however, so we will not show them here. At a high level, the following happens for each point in the row.

Each pixel corresponds to a value of C in the quadratic recurrence. We compute this value based on the input range and the pixel indices.
Figure 5.26. Pseudocode for a sequential version of
the Mandelbrot set generation program
Int const Nrows     // number of rows in the image
Int const RowSize   // number of pixels in a row
Int const M         // number of colors in color map
Real :: conv                     // divergence rate for a pixel
Array of Int :: color_map (M)    // pixel color based on conv rate
Array of Int :: row (RowSize)    // pixels to draw
Array of Real :: ranges(2)       // ranges in X and Y dimensions

manage_user_input(ranges, color_map)   // input ranges, color map
initialize_graphics(RowSize, Nrows, M, ranges, color_map)

for (int i = 0; i<Nrows; i++){
   compute_Row(RowSize, ranges, row)
   graph(i, RowSize, M, color_map, ranges, row)
} // end loop [i] over rows
We then compute the terms in the recurrence and set the value of the pixel based on whether it converges to a fixed value or diverges. If it diverges, we set the pixel value based on the rate of divergence.

Once computed, the rows are plotted to make the well-known Mandelbrot set images. The colors used for the pixels are determined by mapping divergence rates onto a color map.
Creating a parallel version of this program using the Loop Parallelism pattern is trivial. The iterations of the loop over rows are independent. All we need to do is make sure each thread has its own row to work on. We do this with the single pragma:

#pragma omp parallel for private(row)
The scheduling can be a bit tricky because the work associated with each row will vary considerably depending on how many points diverge. The programmer should try several different schedules, but a cyclic distribution is likely to provide an effective load balance. In this schedule, the loop iterations are dealt out like a deck of cards. By interleaving the iterations among a set of threads, we are likely to get a balanced load. Because the scheduling decisions are static, the overhead incurred by this approach is small.
#pragma omp parallel for private(row) schedule(static, 1)
For more information about the schedule clause and the different options available to the parallel programmer, see the OpenMP appendix, Appendix A.
Notice that we have assumed that the graphics package is thread-safe. This means that multiple threads can simultaneously call the library without causing any problems. The OpenMP specifications require this for the standard I/O library, but not for any other libraries. Therefore, it may be necessary to protect the call to the graph function by placing it inside a critical section:
#pragma omp critical
graph(i, RowSize, M, color_map, ranges, row)
We describe this construct in detail in the Implementation Mechanisms design space and in the OpenMP appendix, Appendix A. This approach would work well, but it could have serious performance implications if the rows took the same time to compute and the threads all tried to graph their rows at the same time.
Mesh computation
Consider a simple mesh computation that solves the 1D heat-diffusion equation. The details of this problem and its solution using OpenMP are presented in the Geometric Decomposition pattern. We reproduce this solution in Fig. 5.27.
This program would work well on most shared-memory computers. A careful analysis of the program performance, however, would expose two performance problems. First, the single directive required to protect the swapping of the shared pointers adds an extra barrier, thereby greatly increasing the synchronization overhead. Second, on NUMA computers, memory access overhead is likely to be high because we've made no effort to keep the arrays near the PEs that will be manipulating them.
We address both of these problems in Fig. 5.28. To eliminate the need for the single directive, we modify the program so each thread has its own copy of the pointers uk and ukp1. This can be done with a private clause, but to be useful, we need the new, private copies of uk and ukp1 to point to the shared arrays comprising the mesh of values. We do this with the firstprivate clause applied to the parallel directive that creates the team of threads.
The other performance issue we address, minimizing memory access overhead, is more subtle. As discussed earlier, to reduce memory traffic in the system, it is important to keep the data close to the PEs that will work with the data. On NUMA computers, this corresponds to making sure the pages of memory are allocated and "owned" by the PEs that will be working with the data contained in the page. The most common NUMA page-placement algorithm is the "first touch" algorithm, in which the PE first referencing a region of memory will have the page holding that memory assigned to it. So a very common technique in OpenMP programs is to initialize data in parallel using the same loop schedule as will be used later in the computations.
Figure 5.27. Parallel heat-diffusion program using OpenMP. This program is described
in the Examples section of the Geometric Decomposition pattern.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000
void initialize(double uk[], double ukp1[]) {
uk[0] = LEFTVAL; uk[NX-1] = RIGHTVAL;
for (int i=1; i < NX-1; ++i)
uk[i] = 0.0;
for (int i = 0; i < NX; ++i)
ukp1 [i] = uk[i];
}
void printValues(double uk[], int step) { /* NOT SHOWN */ }
int main(void) {
/* pointers to arrays for two iterations of algorithm */
double *uk = malloc(sizeof(double) * NX);
double *ukp1 = malloc(sizeof(double) * NX);
double *temp;
int i,k;
double dx = 1.0/NX; double dt = 0.5*dx*dx;
#pragma omp parallel private (k, i)
{
initialize(uk, ukp1);
for (k = 0; k < NSTEPS; ++k) {
#pragma omp for schedule(static)
for (i = 1; i < NX-1; ++i) {
ukp1[i]=uk[i]+ (dt/(dx*dx))*(uk[i+1]-2*uk[i]+uk[i-1]);
}
/* "copy" ukp1 to uk by swapping pointers */
#pragma omp single
{temp = ukp1; ukp1 = uk; uk = temp;}
}
}
   return 0;
}
We do this by first changing the initialization slightly so the initialization loop is identical to the computation loop. We then use the same loop parallelization directive on the initialization loop as on the computational loop. This doesn't guarantee an optimal mapping of memory pages onto the PEs, but it is a portable way to improve this mapping, and in many cases it comes quite close to an optimal solution.
Figure 5.28. Parallel heat-diffusion program using OpenMP, with reduced thread
management overhead and memory management more appropriate for NUMA
computers
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NX 100
#define LEFTVAL 1.0
#define RIGHTVAL 10.0
#define NSTEPS 10000
void initialize(double uk[], double ukp1[]) {
int i;
uk[0] = LEFTVAL; uk[NX-1] = RIGHTVAL;
ukp1[0] = uk[0]; ukp1[NX-1] = uk[NX-1];
#pragma omp for schedule(static)
for (i = 1; i < NX-1; ++i){
uk[i] = 0.0;
ukp1[i] = 0.0;
}
}
void printValues(double uk[], int step) { /* NOT SHOWN */ }
int main(void) {
/* pointers to arrays for two iterations of algorithm */
double *uk = malloc(sizeof(double) * NX);
double *ukp1 = malloc(sizeof(double) * NX); double *temp;
int i,k;
double dx = 1.0/NX; double dt = 0.5*dx*dx;
#pragma omp parallel private (k, i, temp) firstprivate(uk, ukp1)
{
   initialize(uk, ukp1);
   for (k = 0; k < NSTEPS; ++k) {
      #pragma omp for schedule(static)
      for (i = 1; i < NX-1; ++i) {
         ukp1[i]=uk[i]+ (dt/(dx*dx))*(uk[i+1]-2*uk[i]+uk[i-1]);
      }
      /* "copy" ukp1 to uk by swapping pointers */
      temp = ukp1; ukp1 = uk; uk = temp;
   }
}
return 0;
}
Known uses
The Loop Parallelism pattern is heavily used by OpenMP programmers. Annual workshops are held in North America (Wompat: Workshop on OpenMP Applications and Tools), Europe (EWOMP: European Workshop on OpenMP), and Japan (WOMPEI: Workshop on OpenMP Experiences and Implementations) to discuss OpenMP and its use. Proceedings from many of these workshops are widely available [VJKT00, Sci03, EV01] and are full of examples of the Loop Parallelism pattern.
Most of the work on OpenMP has been restricted to shared-memory multiprocessor machines for problems that work well with a nearly flat memory hierarchy. Work has been done to extend OpenMP applications to more complicated memory hierarchies, including NUMA machines [NA01, SSGF00] and even clusters [HLCZ99, SHTS01].
Related Patterns
The concept of driving parallelism from a collection of loops is general and used with many patterns. In particular, many problems using the SPMD pattern are loop-based. They use the UE ID, however, to drive the parallelization of the loop, and hence don't perfectly map onto this pattern. Furthermore, problems using the SPMD pattern usually include some degree of parallel logic in between the loops. This allows them to decrease their serial fraction and is one of the reasons why SPMD programs tend to scale better than programs using the Loop Parallelism pattern.
Algorithms targeted for shared-memory computers that use the Task Parallelism or Geometric Decomposition patterns frequently use the Loop Parallelism pattern.
program continues to execute. In most cases, the relationships between tasks are simple, and dynamic task creation can be handled with parallel loops (as described in the Loop Parallelism pattern) or through task queues (as described in the Master/Worker pattern). In other cases, relationships between the tasks within the algorithm must be captured in the way the tasks are managed. Examples include recursively generated task structures, highly irregular sets of connected tasks, and problems where different functions are mapped onto different concurrent tasks. In each of these examples, tasks are forked and later joined with the parent task (that is, the task that executed the fork) and the other tasks created by the same fork. These problems are addressed in the Fork/Join pattern.
As an example, consider an algorithm designed using the Divide and Conquer pattern. As the program execution proceeds, the problem is split into subproblems and new tasks are recursively created (or forked) to concurrently execute subproblems; each of these tasks may in turn be further split. When all the tasks created to handle a particular split have terminated and joined with the parent task, the parent task continues the computation.
This pattern is particularly relevant for Java programs running on shared-memory computers and for problems using the Divide and Conquer and Recursive Data patterns. OpenMP can be used effectively with this pattern when the OpenMP environment supports nested parallel regions.
Forces
Algorithms imply relationships between tasks. In some problems, there are complex or recursive relations between tasks, and these relations need to be created and terminated dynamically. Although these can be mapped onto familiar control structures, the design in many cases is much easier to understand if the structure of the tasks is mimicked by the structure of the UEs.
A one-to-one mapping of tasks onto UEs is natural in these algorithms, but that must be balanced against the number of UEs a system can handle.
UE creation and destruction are costly operations. The algorithm might need to be recast to decrease these operations so they don't adversely affect the program's overall performance.
Solution
In problems that use the Fork/Join pattern, tasks map onto UEs in different ways. We will discuss two different approaches to the solution: (1) a simple direct mapping where there is one task per UE, and (2) an indirect mapping where a pool of UEs works on sets of tasks.
Direct task/UE mapping
The simplest case is one where we map each subtask to a UE. As new subtasks are forked, new UEs are created to handle them. This will build up corresponding sets of tasks and UEs. In many cases, there is a synchronization point where the main task waits for its subtasks to finish. This is called a join. After a subtask terminates, the UE handling it will be destroyed. We will provide an example of this approach later using Java.
The direct task/UE mapping solution to the Fork/Join pattern is the standard programming model in OpenMP. A program begins as a single thread (the master thread). A parallel construct forks a team of threads, the threads execute within a shared address space, and at the end of the parallel construct, the threads join back together. The original master thread then continues execution until the end of the program or until the next parallel construct.[3] This structure underlies the implementation of the OpenMP parallel loop constructs described in the Loop Parallelism pattern.
[3] In principle, nested parallel regions in OpenMP programs also map onto this direct-mapping solution. This approach has been successfully used in [AML+99]. The OpenMP specification, however, lets conforming OpenMP implementations "serialize" nested parallel regions (that is, execute them with a team of size one). Therefore, an OpenMP program cannot depend on nested parallel regions actually forking additional threads, and programmers must be cautious when using OpenMP for all but the simplest fork/join programs.
Thread and process creation and destruction are among the more expensive operations that occur in parallel programs. Thus, if a program contains repeated fork and join sections, the simple solution, which would require repeated destruction and creation of UEs, might not be efficient enough. Also, if at some point there are many more UEs than PEs, the program might incur unacceptable overhead due to the costs of context switches.
In this case, it is desirable to avoid the dynamic UE creation by implementing the fork/join paradigm using a thread pool. The idea is to create a (relatively) static set of UEs before the first fork operation. The number of UEs is usually the same as the number of PEs. The mapping of tasks to UEs then occurs dynamically using a task queue. The UEs themselves are not repeatedly created and destroyed, but simply mapped to dynamically created tasks as the need arises. This approach, although complicated to implement, usually results in efficient programs with good load balance. We will discuss a Java program that uses this approach in the Examples section of this pattern.
In OpenMP, there is some controversy over the best approach to use with this indirect mapping approach [Mat03]. An approach gaining credibility is one based on a new, proposed OpenMP workshare construct called a taskqueue [SHPT00]. The proposal actually defines two new constructs: a taskqueue and a task. As the name implies, the programmer uses a taskqueue construct to create the task queue. Inside the taskqueue construct, a task construct defines a block of code that will be packaged into a task and placed on the task queue. The team of threads (as usually created with a parallel construct), playing the role of a thread pool, pulls tasks off the queue and executes them until the queue is empty.
Unlike OpenMP parallel regions, taskqueues can be dependably nested to produce a hierarchy of taskqueues. The threads work across taskqueues using work stealing to keep all threads fully occupied until all of the queues are empty. This approach has been shown to work well [SHPT00] and is likely to be adopted in a future OpenMP specification.
Examples
As examples of this pattern, we will consider direct-mapping and indirect-mapping implementations of a parallel mergesort algorithm. The indirect-mapping solution makes use of a Java package, FJTasks [Lea00b]. The Examples section of the Shared Queue pattern develops a similar, but simpler, framework.
Mergesort using direct mapping
As an example, consider the straightforward implementation of a method to perform a mergesort in Java shown in Fig. 5.29. The method takes a reference to the array to be sorted and sorts the elements with indices ranging from lo (inclusive) to hi (exclusive). Sorting the entire array A is done by invoking sort(A, 0, A.length).
Figure 5.29. Parallel mergesort where each task corresponds to a thread
static void sort(final int[] A,final int lo, final int hi)
{ int n = hi - lo;
//if not large enough to do in parallel, sort sequentially
if (n <= THRESHOLD){ Arrays.sort(A,lo,hi); return; }
else
{ //split array
final int pivot = (hi+lo)/2;
//create and start new thread to sort lower half
Thread t = new Thread()
{ public void run()
{ sort(A, lo, pivot); }
};
t.start();
//sort upper half in current thread
sort(A,pivot,hi);
//wait for other thread
try{t.join();}
catch (InterruptedException e){Thread.dumpStack();}
}
}
The first step of the method is to compute the size of the segment of the array to be sorted. If the size
of the problem is too small to make the overhead of sorting it in parallel worthwhile, then a sequential
sorting algorithm is used (in this case, the tuned quicksort implementation provided by the Arrays
class in the java.util package). If the sequential algorithm is not used, then a pivot point is
computed to divide the segment to be sorted. A new thread is forked to sort the lower half of the array,
while the parent thread sorts the upper half. The new task is specified by the run method of an
anonymous inner subclass of the Thread class. When the new thread has finished sorting, it
terminates. When the parent thread finishes sorting, it performs a join to wait for the child thread to
terminate and then merges the two sorted segments together.
This simple approach may be adequate in fairly regular problems where appropriate threshold values
can easily be determined. We stress that it is crucial that the threshold value be chosen appropriately:
If too small, the overhead from too many UEs can make the program run even slower than a sequential
version. If too large, potential concurrency remains unexploited.
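The figures omit the merge step itself. A straightforward sequential merge of the two sorted halves A[lo..pivot) and A[pivot..hi), using a temporary buffer, might look like the following (our own sketch; the book's merge code is not shown):

```java
import java.util.Arrays;

public class MergeStep {
    // Merge the sorted ranges A[lo..pivot) and A[pivot..hi) back into A.
    static void merge(int[] A, int lo, int pivot, int hi) {
        int[] buf = new int[hi - lo];
        int i = lo, j = pivot, k = 0;
        // Repeatedly copy the smaller front element of the two ranges.
        while (i < pivot && j < hi)
            buf[k++] = (A[i] <= A[j]) ? A[i++] : A[j++];
        while (i < pivot) buf[k++] = A[i++];
        while (j < hi)    buf[k++] = A[j++];
        System.arraycopy(buf, 0, A, lo, buf.length);
    }
}
```

Each element is copied into the buffer once and back once, so the merge is linear in the length of the segment.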
Figure 5.30. Instantiating FJTaskRunnerGroup and invoking the master task
int groupSize = 4; //number of threads
FJTaskRunnerGroup group = new FJTaskRunnerGroup(groupSize);
group.invoke(new FJTask()
{ public void run()
{ synchronized(this)
{ sort(A,0, A.length); }
}
});
This example uses the FJTask framework included as part of the public-domain package
EDU.oswego.cs.dl.util.concurrent [Lea00b].[4] Instead of creating a new thread to
execute each task, an instance of a (subclass of) FJTask is created. The package then dynamically
maps the FJTask objects to a static set of threads for execution. Although less general than a
Thread, an FJTask is a much lighter-weight object than a thread and is thus much cheaper to create
and destroy. In Fig. 5.30 and Fig. 5.31, we show how to modify the mergesort example to use
FJTasks instead of Java threads. The needed classes are imported from package
EDU.oswego.cs.dl.util.concurrent. Before starting any FJTasks, a
FJTaskRunnerGroup must be instantiated, as shown in Fig. 5.30. This creates the threads that will
constitute the thread pool and takes the number of threads (group size) as a parameter. Once
instantiated, the master task is invoked using the invoke method on the FJTaskRunnerGroup.
[4]
This package was the basis for the new facilities to support concurrency introduced via
JSR 166 in Java 2 1.5. Its author, Doug Lea, was a lead in the JSR effort. The FJTask
framework is not part of Java 2 1.5, but remains available in [Lea00b].
The sort routine itself is similar to the previous version except that the dynamically created tasks are
implemented by the run method of an FJTask subclass instead of a Thread subclass. The fork
and join methods of FJTask are used to fork and join the task in place of the Thread start
and join methods. Although the underlying implementation is different, from the programmer's
viewpoint, this indirect method is very similar to the direct implementation shown previously.
A more sophisticated parallel implementation of mergesort is provided with the FJTask examples in
the util.concurrent distribution. The package also includes functionality not illustrated by this
example.
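The ideas behind FJTask were later standardized: since Java 7, java.util.concurrent provides ForkJoinPool and RecursiveAction. The mergesort above can be sketched against that API as follows (our adaptation, not the FJTask original; the threshold value is illustrative):

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class FJSort {
    static final int THRESHOLD = 8; // illustrative cutoff for the sequential sort

    static class SortTask extends RecursiveAction {
        final int[] A; final int lo, hi;
        SortTask(int[] A, int lo, int hi) { this.A = A; this.lo = lo; this.hi = hi; }
        protected void compute() {
            if (hi - lo <= THRESHOLD) { Arrays.sort(A, lo, hi); return; }
            int pivot = (hi + lo) / 2;
            SortTask lower = new SortTask(A, lo, pivot);
            lower.fork();                          // sort lower half in another task
            new SortTask(A, pivot, hi).compute();  // sort upper half in this task
            lower.join();                          // wait for the forked task
            merge(A, lo, pivot, hi);
        }
    }

    // Sequential merge of sorted ranges A[lo..pivot) and A[pivot..hi)
    static void merge(int[] A, int lo, int pivot, int hi) {
        int[] buf = new int[hi - lo];
        int i = lo, j = pivot, k = 0;
        while (i < pivot && j < hi) buf[k++] = (A[i] <= A[j]) ? A[i++] : A[j++];
        while (i < pivot) buf[k++] = A[i++];
        while (j < hi) buf[k++] = A[j++];
        System.arraycopy(buf, 0, A, lo, buf.length);
    }

    public static void sort(int[] A) {
        new ForkJoinPool().invoke(new SortTask(A, 0, A.length));
    }
}
```

As with FJTask, the pool maps the dynamically created SortTask objects onto a fixed set of worker threads using work stealing.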
Known uses
The documentation with the FJTask package includes several applications that use the Fork/Join
pattern. The most interesting of these include Jacobi iteration, a parallel divide-and-conquer matrix
multiplication, a standard parallel-processing benchmark program that simulates heat diffusion across
a mesh, LU matrix decomposition, integral computation using recursive Gaussian quadrature, and an
adaptation of the Microscope game.[5]
[5]
According to the documentation for this application, this is the game that is played
while looking through the microscope in the laboratory in The 7th Guest (T7G; a CD-
ROM game for PCs). It is a board game in which two players compete to fill spaces on the
board with their tiles, something like Reversi or Othello.
Figure 5.31. Mergesort using the FJTask framework
static void sort(final int[] A,final int lo, final int hi) {
int n = hi - lo;
if (n <= THRESHOLD){ Arrays.sort(A,lo,hi); return; }
else {
//split array
final int pivot = (hi+lo)/2;
//override run method in FJTask to execute run method
FJTask t = new FJTask()
{ public void run()
{ sort(A, lo, pivot); }
};
//fork new task to sort lower half of array
t.fork();
//perform sort on upper half in current task
sort(A,pivot,hi);
//join with forked task
t.join();
//merge sorted arrays as before, code omitted
}
}
Because OpenMP is based on a fork/join programming model, one might expect heavy use of the
Fork/Join pattern by OpenMP programmers. The reality is, however, that most OpenMP programmers
use either the Loop Parallelism or SPMD patterns because the current OpenMP standard provides
poor support for true nesting of parallel regions. One of the few published accounts of using the
Fork/Join pattern with standard OpenMP is a paper where nested parallelism was used to provide fine-
grained parallelism in an implementation of LAPACK [ARv03].
Extending OpenMP so it can use the Fork/Join pattern in substantial applications is an active area of
research. We've mentioned one of these lines of investigation for the case of the indirect-mapping
solution of the Fork/Join pattern (the taskqueue [SHPT00]). Another possibility is to support nested
parallel regions with explicit groups of threads for the direct-mapping solution of the Fork/Join
pattern (the Nanos OpenMP compiler [GAM+00]).
Related Patterns
Algorithms that use the Divide and Conquer pattern use the Fork/Join pattern.
The Loop Parallelism pattern, in which threads are forked just to handle a single parallel loop, is an
instance of the Fork/Join pattern.
The Master/Worker pattern, which in turn uses the Shared Queue pattern, can be used to implement
the indirect-mapping solution.
96]. A
phylogeny is a tree showing relationships between organisms. The problem consists of generating
large numbers of subtrees as potential solutions and then rejecting those that fail to meet the various
consistency criteria. Different sets of subtrees can be examined concurrently, so a natural task
definition in a parallel phylogeny algorithm would be the processing required for each set of subtrees.
However, not all sets must be examined: if a set S is rejected, all supersets of S can also be rejected.
Thus, it makes sense to keep track of the sets still to be examined and the sets that have been rejected.
Given that the problem naturally decomposes into nearly independent tasks (one per set), the solution
to this problem would use the Task Parallelism pattern. Using the pattern is complicated, however, by
the fact that all tasks need both read and write access to the data structure of rejected sets. Also,
because this data structure changes during the computation, we cannot use the replication technique
described in the Task Parallelism pattern. Partitioning the data structure and basing a solution on this
data decomposition, as described in the Geometric Decomposition pattern, might seem like a good
alternative, but the way in which the elements are rejected is unpredictable, so any data decomposition
is likely to lead to a poor load balance.
Similar difficulties can arise any time shared data must be explicitly managed inside a set of
concurrent tasks. The common elements for problems that need the Shared Data pattern are (1) at
least one data structure is accessed by multiple tasks in the course of the program's execution, (2) at
least one task modifies the shared data structure, and (3) the tasks potentially need to use the modified
value during the concurrent computation.
Forces
The results of the computation must be correct for any ordering of the tasks that could occur
during the computation.
Explicitly managing shared data can incur parallel overhead, which must be kept small if the
program is to run efficiently.
Techniques for managing shared data can limit the number of tasks that can run concurrently,
thereby reducing the potential scalability of an algorithm.
If the constructs used to manage shared data are not easy to understand, the program will be
more difficult to maintain.
Solution
Explicitly managing shared data can be one of the more error-prone aspects of designing a parallel
algorithm. Therefore, a good approach is to start with a solution that emphasizes simplicity and clarity
of abstraction and then try more complex solutions if necessary to obtain acceptable performance.
The solution reflects this approach.
Be sure this pattern is needed
The first step is to confirm that this pattern is truly needed; it might be worthwhile to revisit decisions
made earlier in the design process (the decomposition into tasks, for example) to see whether different
decisions might lead to a solution that fits one of the Algorithm Structure patterns without the need to
explicitly manage shared data. For example, if the Task Parallelism pattern is a good fit, it is
worthwhile to review the design and see if dependencies can be managed by replication and reduction.
Define an abstract data type
Assuming this pattern must indeed be used, start by viewing the shared data as an abstract data type
(ADT) with a fixed set of (possibly complex) operations on the data. For example, if the shared data
structure is a queue (see the Shared Queue pattern), these operations would consist of put (enqueue),
take (dequeue), and possibly other operations, such as a test for an empty queue or a test to see if a
specified element is present. Each task will typically perform a sequence of these operations. These
operations should have the property that if they are executed serially (that is, one at a time, without
interference from other tasks), each operation will leave the data in a consistent state.
The implementation of the individual operations will most likely involve a sequence of lower-level
actions, the results of which should not be visible to other UEs. For example, if we implemented the
previously mentioned queue using a linked list, a "take" operation actually involves a sequence of
lower-level operations (which may themselves consist of a sequence of even lower-level operations):
1. Use variable first to obtain a reference to the first object in the list.
2. From the first object, get a reference to the second object in the list.
3. Replace the value of first with the reference to the second object.
4. Update the size of the list.
5. Return the first element.
If two tasks are executing "take" operations concurrently, and these lower-level operations are
interleaved (that is, the "take" operations are not being executed atomically), the result could easily be
an inconsistent list.
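The five lower-level steps above can be sketched in Java; making the method synchronized is what keeps one task's step sequence from interleaving with another's (a minimal illustration of ours, simpler than the queue classes developed later in this pattern):

```java
public class ListTake {
    static class Node { Object item; Node next; Node(Object item) { this.item = item; } }
    private Node first; // head of the list
    private int size;

    // Build a list holding the given items, in order.
    public ListTake(Object... items) {
        Node prev = null;
        for (Object it : items) {
            Node n = new Node(it);
            if (prev == null) first = n; else prev.next = n;
            prev = n;
        }
        size = items.length;
    }

    // The five lower-level steps, executed atomically.
    public synchronized Object take() {
        Node f = first;        // 1. reference to the first object
        Node second = f.next;  // 2. reference to the second object
        first = second;        // 3. replace first with that reference
        size--;                // 4. update the size of the list
        return f.item;         // 5. return the first element
    }

    public synchronized int size() { return size; }
}
```

Without the synchronized keyword, two concurrent calls could both read the same first node in step 1 and return the same element — exactly the inconsistency described above.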
Implement an appropriate concurrency-control protocol
After the ADT and its operations have been identified, the objective is to implement a concurrency-
control protocol to ensure that these operations give the same results as if they were executed serially.
There are several ways to do this; start with the first technique, which is the simplest, and then try the
other more complex techniques if it does not yield acceptable performance. These more complex
techniques can be combined if more than one is applicable.
One-at-a-time execution. The easiest solution is to ensure that the operations are indeed executed
serially.
In a shared-memory environment, the most straightforward way to do this is to treat each operation as
part of a single critical section and use a mutual-exclusion protocol to ensure that only one UE at a
time is executing its critical section. This means that all of the operations on the data are mutually
exclusive. Exactly how this is implemented will depend on the facilities of the target programming
environment. Typical choices include mutex locks, synchronized blocks, critical sections, and
semaphores. These mechanisms are described in the Implementation Mechanisms design space. If the
programming language naturally supports the implementation of abstract data types, it is usually
appropriate to implement each operation as a procedure or method, with the mutual-exclusion
protocol implemented in the method itself.
In a message-passing environment, the most straightforward way to ensure serial execution is to assign
the shared data structure to a particular UE. Each operation should correspond to a message type;
other processes request operations by sending messages to the UE managing the data structure, which
processes them serially.
In either environment, this approach is usually not difficult to implement, but it can be overly
conservative (that is, it might disallow concurrent execution of operations that would be safe to
execute simultaneously), and it can produce a bottleneck that negatively affects the performance of the
program. If this is the case, the remaining approaches described in this section should be reviewed to
see whether one of them can reduce or eliminate this bottleneck and give better performance.
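The message-passing variant can be mimicked in a shared-address-space language such as Java with a single owner thread reading operation requests from a mailbox queue; each request carries a slot for the reply. This is a sketch under our own naming (a real message-passing code would use MPI or similar), with a running total standing in for the shared data structure:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

public class OwnerThreadDemo {
    // A "message" asking the owner to add a value and report the running total.
    static class Request {
        final int value;
        final CompletableFuture<Integer> reply = new CompletableFuture<>();
        Request(int value) { this.value = value; }
    }

    private final BlockingQueue<Request> mailbox = new LinkedBlockingQueue<>();

    public OwnerThreadDemo() {
        Thread owner = new Thread(() -> {
            int total = 0; // the shared data, private to the owner thread
            try {
                while (true) {
                    Request r = mailbox.take(); // operations are processed serially
                    total += r.value;
                    r.reply.complete(total);
                }
            } catch (InterruptedException e) { /* shut down */ }
        });
        owner.setDaemon(true);
        owner.start();
    }

    // Clients "send a message" and wait for the owner's reply.
    public int add(int v) throws Exception {
        Request r = new Request(v);
        mailbox.put(r);
        return r.reply.get();
    }
}
```

Because only the owner thread ever touches the data, no locking of the data itself is needed; serialization comes from the mailbox.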
Noninterfering sets of operations. One approach to improving performance begins by analyzing the
interference between the operations. We say that operation A interferes with operation B if A writes a
variable that B reads. Notice that an operation may interfere with itself, which would be a concern if
more than one task executes the same operation (for example, more than one task executes "take"
operations on a shared queue). It may be the case, for example, that the operations fall into two
disjoint sets, where the operations in different sets do not interfere with each other. In this case, the
amount of concurrency can be increased by treating each of the sets as a different critical section. That
is, within each set, operations execute one at a time, but operations in different sets can proceed
concurrently.
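For instance, if one set of operations touches only a variable x and the other only a variable y, each set can be guarded by its own lock object (a minimal sketch with hypothetical fields x and y):

```java
public class DisjointSetsDemo {
    private final Object xLock = new Object(); // guards the "x set" of operations
    private final Object yLock = new Object(); // guards the "y set" of operations
    private int x, y;

    // Operations within the x set exclude each other...
    public int incrementX() { synchronized (xLock) { return ++x; } }

    // ...but can proceed concurrently with operations in the y set,
    // since the two sets read and write disjoint variables.
    public int incrementY() { synchronized (yLock) { return ++y; } }
}
```

The correctness of this refinement rests entirely on the interference analysis: if any operation in one set ever touched a variable in the other, a single lock (or the nested-locks technique below) would be required.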
Readers/writers. If there is no obvious way to partition the operations into disjoint sets, consider the
type of interference. It may be the case that some of the operations modify the data, but others only
read it. For example, if operation A is a writer (both reading and writing the data) and operation B is a
reader (reading, but not writing, the data), A interferes with itself and with B, but B does not interfere
with itself. Thus, if one task is performing operation A, no other task should be able to execute either
A or B, but any number of tasks should be able to execute B concurrently. In such cases, it may be
worthwhile to implement a readers/writers protocol that will allow this potential concurrency to be
exploited. The overhead of managing the readers/writers protocol is greater than that of simple mutex
locks, so the length of the readers' computation should be long enough to make this overhead
worthwhile. In addition, there should generally be a larger number of concurrent readers than writers.
The java.util.concurrent.locks package provides read/write locks to support the readers/writers
protocol. The code in Fig. 5.32 illustrates how these locks are typically used: First instantiate a
ReadWriteLock, and then obtain its read and write locks. ReentrantReadWriteLock is a
class that implements the ReadWriteLock interface. To perform a read operation, the read lock
must be locked. To perform a write operation, the write lock must be locked. The semantics of the
locks are that any number of UEs can simultaneously hold the read lock, but the write lock is
exclusive; that is, only one UE can hold the write lock, and if the write lock is held, no UEs can hold
the read lock either.
Figure 5.32. Typical use of read/write locks. These locks are defined in the
java.util.concurrent.locks package. Putting the unlock in the finally block ensures
that the lock will be unlocked regardless of how the try block is exited (normally or
with an exception) and is a standard idiom in Java programs that use locks rather than
synchronized blocks.
import java.util.concurrent.locks.*;
class X {
ReadWriteLock rwl = new ReentrantReadWriteLock();
void read()
{ rwl.readLock().lock();
  try { ....perform read.... }
  finally { rwl.readLock().unlock(); }
}
void write()
{ rwl.writeLock().lock();
  try { ....perform write.... }
  finally { rwl.writeLock().unlock(); }
}
}
Readers/writers protocols are discussed in [And00] and most operating systems texts.
Reducing the size of the critical section. Another approach to improving performance begins with
analyzing the implementations of the operations in more detail. It may be the case that only part of the
operation involves actions that interfere with other operations. If so, the size of the critical section can
be reduced to that smaller part. Notice that this sort of optimization is very easy to get wrong, so it
should be attempted only if it will give significant performance improvements over simpler
approaches, and the programmer completely understands the interferences in question.
Nested locks. This technique is a sort of hybrid between two of the previous approaches,
noninterfering operations and reducing the size of the critical section. Suppose we have an ADT with
two operations. Operation A does a lot of work both reading and updating variable x and then reads
and updates variable y in a single statement. Operation B reads and writes y. Some analysis shows
that UEs executing A need to exclude each other, UEs executing B need to exclude each other, and,
because both operations read and update y, technically, A and B need to mutually exclude each other
as well. However, closer inspection shows that the two operations are almost noninterfering. If it
weren't for that single statement where A reads and updates y, the two operations could be
implemented in separate critical sections that would allow one A and one B to execute concurrently. A
solution is to use two locks, as shown in Fig. 5.33. A acquires and holds lockA for the entire
operation. B acquires and holds lockB for the entire operation. A acquires lockB and holds it only
for the statement updating y.
Whenever nested locking is used, the programmer should be aware of the potential for deadlocks and
double-check the code. (The classic example of deadlock, stated in terms of the previous example, is
as follows: A acquires lockA and B acquires lockB. A then tries to acquire lockB and B tries to
acquire lockA. Neither operation can now proceed.) Deadlocks can be avoided by assigning a partial
order to the locks and ensuring that locks are always acquired in an order that respects the partial
order. In the previous example, we would define the order to be lockA < lockB and ensure that
lockA is never acquired by a UE already holding lockB.
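When the locks live in an array, as they will in the GAFORT example later in this pattern, the partial order can be made mechanical: always acquire the lower-indexed lock first. A sketch of ours:

```java
import java.util.concurrent.locks.ReentrantLock;

public class OrderedLocks {
    private final ReentrantLock[] locks;

    public OrderedLocks(int n) {
        locks = new ReentrantLock[n];
        for (int i = 0; i < n; i++) locks[i] = new ReentrantLock();
    }

    // Run an action while holding locks i and j, always acquiring the
    // lower-indexed lock first so that two concurrent calls with swapped
    // arguments cannot deadlock. (ReentrantLock tolerates i == j.)
    public void withLocks(int i, int j, Runnable action) {
        int lo = Math.min(i, j), hi = Math.max(i, j);
        locks[lo].lock();
        try {
            locks[hi].lock();
            try { action.run(); }
            finally { locks[hi].unlock(); }
        } finally { locks[lo].unlock(); }
    }
}
```

Releasing in the reverse order of acquisition, as the finally blocks do here, is conventional but not required for deadlock freedom; only the acquisition order matters.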
Application-specific semantic relaxation. Yet another approach is to consider partially replicating
shared data (the software caching described in [YWC+96]) and perhaps even allowing the copies to be
inconsistent if this can be done without affecting the results of the computation. For example, a
distributed-memory solution to the phylogeny problem described earlier might give each UE its own
copy of the set of sets already rejected and allow these copies to be out of synch; tasks may do extra
work (in rejecting a set that has already been rejected by a task assigned to a different UE), but this
extra work will not affect the result of the computation, and it may be more efficient overall than the
communication cost of keeping all copies in synch.
Figure 5.33. Example of nested locking using synchronized blocks with dummy objects
lockA and lockB
class Y {
Object lockA = new Object();
Object lockB = new Object();
void A()
{ synchronized(lockA)
{
....compute....
synchronized(lockB)
{ ....read and update y....
}
}
}
void B()
{ synchronized(lockB)
{ ....read and write y....
}
}
}
Memory synchronization. In Java, memory is implicitly synchronized when entering and leaving a
synchronized block, and, in Java 2 1.5, when locking and unlocking locks. Also, variables marked
volatile are implicitly synchronized with respect to memory. This is discussed in more detail in the
Implementation Mechanisms design space.
Task scheduling. Consider whether the explicitly managed data dependencies addressed by this
pattern affect task scheduling. A key goal in deciding how to schedule tasks is good load balance; in
addition to the considerations described in the Algorithm Structure pattern being used, one should
also take into account that tasks might be suspended waiting for access to shared data. It makes sense
to try to assign tasks in a way that minimizes such waiting, or to assign multiple tasks to each UE in
the hope that there will always be one task per UE that is not waiting for access to shared data.
Examples
Shared queues
The shared queue is a commonly used ADT and an excellent example of the Shared Data pattern. The
Shared Queue pattern discusses concurrency-control protocols and the techniques used to achieve
highly efficient shared-queue programs.
Genetic algorithm for nonlinear optimization
Consider the GAFORT program from the SPEC OMP2001 benchmark suite [ADE+01]. GAFORT is a
small Fortran program (around 1,500 lines) that implements a genetic algorithm for nonlinear
optimization. The calculations are predominantly integer arithmetic, and the program's performance is
dominated by the cost of moving large arrays of data through the memory subsystem.
The details of the genetic algorithm are not important for this discussion. We are going to focus on a
single loop within GAFORT. Pseudocode for the sequential version of this loop, based on the
discussion of GAFORT in [EM], is shown in Fig. 5.34. This loop shuffles the population of
chromosomes and consumes on the order of 36 percent of the runtime in a typical GAFORT job
[AE03].
Figure 5.34. Pseudocode for the population shuffle loop from the genetic algorithm
program GAFORT
Int const NPOP // number of chromosomes (~40000)
Int const NCHROME // length of each chromosome
Real :: tempScalar
Array of Real :: temp(NCHROME)
Array of Int :: iparent(NCHROME, NPOP)
Array of Real :: fitness(NPOP)
Int :: j, iother
loop [j] over NPOP
iother = rand(j) // returns random value greater
// than or equal to zero but not
// equal to j and less than NPOP
// Swap Chromosomes
temp(1:NCHROME) = iparent(1:NCHROME, iother)
iparent(1:NCHROME, iother) = iparent(1:NCHROME, j)
iparent(1:NCHROME, j) = temp(1:NCHROME)
// Swap fitness metrics
tempScalar = fitness(iother)
fitness(iother) = fitness(j)
fitness(j) = tempScalar
end loop [j]
A parallel version of this program will be created by parallelizing the loop, using the Loop Parallelism
pattern. In this example, the shared data consists of the iparent and fitness arrays. Within the
body of the loop, calculations involving these arrays consist of swapping two elements of iparent
and then swapping the corresponding elements of fitness. Examination of these operations shows
that two swap operations interfere when at least one of the locations being swapped is the same in
both operations.
Thinking about the shared data as an ADT helps us to identify and analyze the actions taken on the
shared data. This does not mean, however, that the implementation itself always needs to reflect this
structure. In some cases, especially when the data structure is simple and the programming language
does not support ADTs well, it can be more effective to forgo the encapsulation implied in an ADT
and work with the data directly. This example illustrates this.
As mentioned earlier, the chromosomes being swapped might interfere with each other; thus the loop
over j cannot safely execute in parallel. The most straightforward approach is to enforce a "one at a
time" protocol using a critical section, as shown in Fig. 5.35. It is also necessary to modify the
random number generator so it produces a consistent set of pseudorandom numbers when called in
parallel by many threads. The algorithms to accomplish this are well understood [Mas97], but will not
be discussed here.
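One simple way to make the generator safe to call from many threads — not necessarily the technique of [Mas97] — is to give each thread its own generator with a distinct seed, so calls never contend and each thread's stream is reproducible. A sketch in Java (parRand is our stand-in for the par_rand of the pseudocode below):

```java
public class PerThreadRand {
    // Each thread lazily gets its own java.util.Random, seeded from its
    // thread id, so there is no shared generator state to protect.
    private static final ThreadLocal<java.util.Random> RNG =
        ThreadLocal.withInitial(() -> new java.util.Random(Thread.currentThread().getId()));

    // Random value in [0, npop) that is never equal to j, as the
    // population-shuffle loop requires. Assumes npop >= 2.
    public static int parRand(int j, int npop) {
        int r = RNG.get().nextInt(npop - 1); // 0 .. npop-2
        return (r < j) ? r : r + 1;          // skip over j
    }
}
```

The skip-over-j trick draws from a range one smaller than the population and shifts values at or above j up by one, which keeps the distribution uniform over the allowed indices.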
Figure 5.35. Pseudocode for an ineffective approach to parallelizing the population
shuffle in the genetic algorithm program GAFORT
#include <omp.h>
Int const NPOP // number of chromosomes (~40000)
Int const NCHROME // length of each chromosome
Real :: tempScalar
Array of Real :: temp(NCHROME)
Array of Int :: iparent(NCHROME, NPOP)
Array of Real :: fitness(NPOP)
Int :: j, iother
#pragma omp parallel for
loop [j] over NPOP
iother = par_rand(j) // returns random value greater
// than or equal to zero but not
// equal to j and less than NPOP
#pragma omp critical
{
// Swap Chromosomes
The program in Fig. 5.35 can safely execute with multiple threads, but it will not run any faster as
more threads are added. In fact, this program will slow down as more threads are added because the
threads will waste system resources as they wait for their turn to execute the critical section. In
essence, the concurrency-control protocol eliminates all of the available concurrency.
The solution to this problem is to take advantage of the fact that the swap operations on the shared
data only interfere when at least one of the locations being swapped is the same in both operations.
Hence, the right concurrency-control protocol uses pairwise synchronization with nested locks,
thereby adding only modest overhead when loop iterations do not interfere. The approach used in
[ADE+01] is to create an OpenMP lock for each chromosome. Pseudocode for this solution is shown
in Fig. 5.36. In the resulting program, most of the loop iterations do not actually interfere with each
other. The total number of chromosomes, NPOP (40,000 in the SPEC OMP2001 benchmark), is much
larger than the number of UEs, so there is only a slight chance that loop iterations will happen to
interfere with another loop iteration.
OpenMP locks are described in the OpenMP appendix, Appendix A. The locks themselves use an
opaque type, omp_lock_t, defined in the omp.h header file. The lock array is defined and later
initialized in a separate parallel loop. Once inside the chromosome-swapping loop, the locks are set
for the pair of swapping chromosomes, the swap is carried out, and the locks are unset. Nested locks
are being used, so the possibility of deadlock must be considered. The solution here is to order the
locks using the value of the indices of the array element associated with the lock. Always acquiring
locks in this order will prevent deadlock when a pair of loop iterations happens to be swapping the
same two elements at the same time. After the more efficient concurrency-control protocol is
implemented, the program runs well in parallel.
Known uses
A solution to the phylogeny problem described in the Context section is presented in [YWC+96]. The
overall approach fits the Task Parallelism pattern; the rejected-sets data structure is explicitly managed
using replication and periodic updates to reestablish consistency among copies.
Another problem presented in [YWC+96] is the Gröbner basis program. Omitting most of the details,
in this application the computation consists of using pairs of polynomials to generate new
polynomials, comparing them against a master set of polynomials, and adding those that are not linear
combinations of elements of the master set to the master set (where they are used to generate new
pairs). Different pairs can be processed concurrently, so one can define a task for each pair and
partition them among UEs. The solution described in [YWC+96] fits the Task Parallelism pattern
(with a task queue consisting of pairs of polynomials), plus explicit management of the master set
using an application-specific protocol called software caching.
Figure 5.36. Pseudocode for a parallelized loop to carry out the population shuffle in
the genetic algorithm program GAFORT. This version of the loop uses a separate lock
for each chromosome and runs effectively in parallel.
#include <omp.h>
Int const NPOP // number of chromosomes (~40000)
Int const NCHROME // length of each chromosome
Array of omp_lock_t :: lck(NPOP)
Real :: tempScalar
Array of Real :: temp(NCHROME)
Array of Int :: iparent(NCHROME, NPOP)
Array of Real :: fitness(NPOP)
Int :: j, iother
// Initialize the locks
#pragma omp parallel for
for (j=0; j<NPOP; j++){ omp_init_lock (&lck(j)) }
#pragma omp parallel for
for (j=0; j<NPOP; j++){
iother = par_rand(j) // returns random value >= 0, != j,
// < NPOP
if (j < iother) {
omp_set_lock (lck(j)); omp_set_lock (lck(iother))
}
else {
omp_set_lock (lck(iother)); omp_set_lock (lck(j))
}
// Swap Chromosomes
temp(1:NCHROME) = iparent(1:NCHROME, iother);
iparent(1:NCHROME, iother) = iparent(1:NCHROME, j);
iparent(1:NCHROME, j) = temp(1:NCHROME);
// Swap fitness metrics
tempScalar = fitness(iother)
fitness(iother) = fitness(j)
fitness(j) = tempScalar
if (j < iother) {
omp_unset_lock (lck(iother)); omp_unset_lock (lck(j))
}
else {
omp_unset_lock (lck(j)); omp_unset_lock (lck(iother))
}
} // end loop [j]
Related Patterns
The Shared Queue and Distributed Array patterns discuss specific types of shared data structures.
Many problems that use the Shared Data pattern use the Task Parallelism pattern for the algorithm
structure.
Simple concurrency-control protocols provide greater clarity of abstraction and make it easier
for the programmer to verify that the shared queue has been correctly implemented.
Concurrency-control protocols that encompass too much of the shared queue in a single
synchronization construct increase the chances UEs will remain blocked waiting to access the
queue and will limit available concurrency.
A concurrency-control protocol finely tuned to the queue and how it will be used increases the
available concurrency, but at the cost of much more complicated, and more error-prone,
synchronization constructs.
Maintaining a single queue for systems with complicated memory hierarchies (as found on
NUMA machines and clusters) can cause excess communication and increase parallel
overhead. Solutions may in some cases need to break with the single-queue abstraction and use
multiple or distributed queues.
Solution
Ideally the shared queue would be implemented as part of the target programming environment, either
explicitly as an ADT to be used by the programmer, or implicitly as support for the higher-level
patterns (such as Master/Worker) that use it. In Java 2 1.5, such queues are available in the
java.util.concurrent package. Here we develop implementations from scratch to illustrate
the concepts.
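For reference, the two behaviors developed in this pattern map directly onto Java 2 1.5 library classes: ConcurrentLinkedQueue gives a nonblocking take (poll returns null when the queue is empty) and LinkedBlockingQueue gives a block-on-empty take. A brief sketch:

```java
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class LibraryQueues {
    // Nonblocking behavior: poll() on an empty queue returns null at once.
    public static Object nonblockingDemo() {
        Queue<Object> q = new ConcurrentLinkedQueue<>();
        q.offer("task1");
        q.poll();          // removes "task1"
        return q.poll();   // queue now empty: returns null immediately
    }

    // Block-on-empty behavior: take() waits until a task is available.
    public static Object blockingDemo() throws InterruptedException {
        BlockingQueue<Object> q = new LinkedBlockingQueue<>();
        q.put("task1");
        return q.take();   // would block here if the queue were empty
    }
}
```

In production code these classes are usually the right choice; the hand-built versions that follow exist to show the synchronization they encapsulate.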
Implementing shared queues can be tricky. Appropriate synchronization must be utilized to avoid race
conditions, and performance considerations, especially for problems where large numbers of UEs
access the queue, can require sophisticated synchronization. In some cases, a noncentralized queue
might be needed to eliminate performance bottlenecks.
However, if it is necessary to implement a shared queue, it can be done as an instance of the Shared
Data pattern: First, we design an ADT for the queue by defining the values the queue can hold and the
set of operations on the queue. Next, we consider the concurrency-control protocols, starting with the
simplest "one at a time execution" solution and then applying a series of refinements. To make this
discussion more concrete, we will consider the queue in terms of a specific problem: a queue to hold
tasks in a master/worker algorithm. The solutions presented here, however, are general and can be
easily extended to cover other applications of a shared queue.
The abstract data type (ADT)
An ADT is a set of values and the operations defined on that set of values. In the case of a queue, the
values are ordered lists of zero or more objects of some type (for example, integers or task IDs). The
operations on the queue are put (or enqueue) and take (or dequeue). In some situations, there might
be other operations, but for the sake of this discussion, these two are sufficient.
We must also decide what happens when a take is attempted on an empty queue. What should be
done depends on how termination will be handled by the master/worker algorithm. Suppose, for
example, that all the tasks will be created at startup time by the master. In this case, an empty task
queue will indicate that the UE should terminate, and we will want the take operation on an empty
queue to return immediately with an indication that the queue is empty; that is, we want a
nonblocking queue. Another possible situation is that tasks can be created dynamically and that UEs
will terminate when they receive a special poison-pill task. In this case, appropriate behavior might be
for the take operation on an empty queue to wait until the queue is nonempty; that is, we want a
block-on-empty queue.
Queue with "one at a time" execution
Nonblocking queue. Because the queue will be accessed concurrently, we must define a concurrency-
control protocol to ensure that interference by multiple UEs will not occur. As recommended in the
Shared Data pattern, the simplest solution is to make all operations on the ADT exclude each other.
Because none of the operations on the queue can block, a straightforward implementation of mutual
exclusion as described in the Implementation Mechanisms design space suffices. The Java
implementation shown in Fig. 5.37 uses a linked list to hold the tasks in the queue. (We develop our
own list class rather than using an unsynchronized library class such as java.util.LinkedList
or a class from the java.util.concurrent package to illustrate how to add appropriate
synchronization.) head refers to an always-present dummy node.[6]
[6]
The code for take makes the old head node into a dummy node rather than simply
manipulating next pointers to allow us to later optimize the code so that put and get
can execute concurrently.
The first task in the queue (if any) is held in the node referred to by head.next. The isEmpty
method is private, and only invoked inside a synchronized method. Thus, it need not be synchronized.
(If it were public, it would need to be synchronized as well.) Of course, numerous ways of
implementing the structure that holds the tasks are possible.
Block-on-empty queue. The second version of the shared queue is shown in Fig. 5.38. In this version
of the queue, the take operation is changed so that a thread trying to take from an empty queue will
wait for a task rather than returning immediately. The waiting thread needs to release its lock and
reacquire it before trying again. This is done in Java using the wait and notify methods. These are
described in the Java appendix, Appendix C. The Java appendix also shows the queue implemented
using locks from the java.util.concurrent.locks package introduced in Java 2 1.5 instead
of wait and notify. Similar primitives are available with POSIX threads (Pthreads) [But97, IEE],
and techniques for implementing this functionality with semaphores and other basic primitives can be
found in [And00].
Figure 5.37. Queue that ensures that at most one thread can access the data
structure at one time. If the queue is empty, null is immediately returned.
public class SharedQueue1
{
class Node //inner class defines list nodes
{ Object task;
Node next;
Node(Object task)
{this.task = task; next = null;}
}
private Node head = new Node(null); //dummy node
private Node last = head;
public synchronized void put(Object task)
{ assert task != null: "Cannot insert null task";
Node p = new Node(task);
last.next = p;
last = p;
}
public synchronized Object take()
{ //returns first task in queue or null if queue is empty
Object task = null;
if (!isEmpty())
{ Node first = head.next;
task = first.task;
first.task = null;
head = first;
}
return task;
}
private boolean isEmpty(){return head.next == null;}
}
In general, to change a method that returns immediately if a condition is false to one that waits until
the condition is true, two changes need to be made: First, we replace a statement of the form
if (condition){do_something;}
Figure 5.38. Queue that ensures at most one thread can access the data structure at
one time. Unlike the first shared queue example, if the queue is empty, the thread
waits. When used in a master/worker algorithm, a poison pill would be required to
signal termination to a thread.
public class SharedQueue2
{
class Node
{ Object task;
Node next;
Node(Object task)
{this.task = task; next = null;}
}
//remaining members are as in SharedQueue1 (Fig. 5.37), except that
//put ends with a call to notifyAll() and the body of take begins with
//  while (isEmpty())
//  {try{wait();}catch(InterruptedException ignore){}}
}
with a loop[7]
[7]
The fact that wait can throw an InterruptedException must be dealt with; it
is ignored here for clarity, but handled properly in the code examples.
while( !condition) {wait();} do_something;
Second, we examine the other operations on the shared queue and add a notifyAll to any
operations that might establish condition. The result is an instance of the basic idiom for using wait,
described in more detail in the Java appendix, Appendix C.
Thus, two major changes are made in moving to the code in Fig. 5.38. First, we replace the code

if (!isEmpty( )){....}

with

while(isEmpty())
{try{wait();}catch(InterruptedException ignore){}}
{....}

Second, we note that the put method will make the queue not empty, so we add to it a call to notifyAll.
This implementation has a performance problem in that it will generate extraneous calls to notifyAll. This does not affect the correctness, but it might degrade the performance. This implementation could be optimized by minimizing the number of invocations of notifyAll in put: keep track of the number of waiting threads and only perform a notifyAll when there are threads waiting. We would have, for int w indicating the number of waiting threads:

while( !condition){w++; wait(); w--;} do_something;

and

if (w > 0) notifyAll();

In this particular example, because only one waiting thread will be able to consume a task, notifyAll could be replaced by notify, which notifies only one waiting thread. We show code for this refinement in a later example (Fig. 5.40).
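The pieces just described can be combined into a compact sketch. This is a hypothetical class, not one of the book's figures: it uses java.util.LinkedList for storage instead of hand-built nodes, applies the wait loop and waiting-count w described above, and uses notify() per the refinement (one task can satisfy only one taker). As in the text, InterruptedException is ignored for clarity.

```java
import java.util.LinkedList;

public class BlockingTaskQueue {
    private final LinkedList<Object> tasks = new LinkedList<Object>();
    private int w = 0; // number of threads blocked in take()

    public synchronized void put(Object task) {
        assert task != null : "Cannot insert null task";
        tasks.addLast(task);
        if (w > 0) notify(); // wake one waiter only when someone is waiting
    }

    public synchronized Object take() {
        while (tasks.isEmpty()) { // recheck the condition after every wakeup
            w++;
            try { wait(); }
            catch (InterruptedException ignore) {} // ignored for clarity
            finally { w--; }
        }
        return tasks.removeFirst();
    }
}
```

The while loop (rather than an if) guards against spurious wakeups and against another taker consuming the task first.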
Concurrency-control protocols for noninterfering operations
If the performance of the shared queue is inadequate, we must look for more efficient concurrency-control protocols. As discussed in the Shared Data pattern, we need to look for noninterfering sets of operations in our ADT. Careful examination of the operations in our nonblocking shared queue (see Fig. 5.37 and Fig. 5.38) shows that put and take are noninterfering because they do not access the same variables. The put method modifies the reference last and the next member of the object referred to by last. The take method modifies the value of the task member in the object referred to by head.next and the reference head. Thus, put modifies last and the next member of some Node object; take modifies head and the task member of some other object. These are noninterfering operations, so we can use one lock for put and a different lock for take. This solution is shown in Fig. 5.39.
Figure 5.39. Shared queue that takes advantage of the fact that put and take are
noninterfering and uses separate locks so they can proceed concurrently
public class SharedQueue3
{
class Node
{ Object task;
Node next;
Node(Object task)
{this.task = task; next = null;}
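Figure 5.39's listing is truncated here after the Node class. A minimal sketch of the two-lock idea it describes, under the assumption that put synchronizes on one lock object and take on another (the names putLock and takeLock are ours), might look like:

```java
public class TwoLockQueue {
    static class Node {
        Object task;
        volatile Node next; // volatile so a taker sees a putter's append
        Node(Object task) { this.task = task; }
    }
    private Node head = new Node(null); // dummy node
    private Node last = head;
    private final Object putLock = new Object();
    private final Object takeLock = new Object();

    public void put(Object task) {
        assert task != null : "Cannot insert null task";
        Node p = new Node(task);
        synchronized (putLock) { last.next = p; last = p; }
    }

    public Object take() {
        synchronized (takeLock) {
            Node first = head.next;
            if (first == null) return null; // empty queue: return immediately
            Object task = first.task;
            first.task = null;
            head = first; // first becomes the new dummy node
            return task;
        }
    }
}
```

Because put touches only last and last.next, and take touches only head and the task member, a putter and a taker never contend for the same lock.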
The approach shown in Fig. 5.39 isn't as easy to apply to a block-on-empty queue, however. First of all, the wait, notify, and notifyAll methods on an object can only be invoked within a block synchronized on that object. Also, if we have optimized the invocations of notify as described previously, then w, the count of waiting threads, is accessed in both put and take. Therefore, we use putLock both to protect w and to serve as the lock on which a taking thread blocks when the queue is empty. Code is shown in Fig. 5.40. Notice that putLock.wait() in take will release only the lock on putLock, so a blocked thread will continue to block other takers from the outer block synchronized on takeLock. This is okay for this particular problem. This scheme continues to allow putters and takers to execute concurrently; the only exception is when the queue is empty.
Figure 5.40. Blocking queue with multiple locks to allow concurrent put and take on a
nonempty queue
public class SharedQueue4
{
class Node
{ Object task;
Node next;
Node(Object task)
{this.task = task; next = null;}
/* [the remainder of this listing: the lock declarations and the
   bodies of put and take, lost in extraction] */
private boolean isEmpty() {return head.next == null;}
}
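Most of Fig. 5.40's body is lost above. Guided by the surrounding prose (takeLock serializes takers; putLock guards the waiting count w and is the monitor a taker waits on when the queue is empty), a hypothetical reconstruction, not the book's exact code, might be:

```java
public class BlockingTwoLockQueue {
    static class Node {
        Object task;
        volatile Node next;
        Node(Object task) { this.task = task; }
    }
    private Node head = new Node(null); // dummy node
    private Node last = head;
    private final Object putLock = new Object();
    private final Object takeLock = new Object();
    private int w = 0; // waiting takers, guarded by putLock

    public void put(Object task) {
        assert task != null : "Cannot insert null task";
        Node p = new Node(task);
        synchronized (putLock) {
            last.next = p; last = p;
            if (w > 0) putLock.notify(); // wake one waiting taker
        }
    }

    public Object take() {
        synchronized (takeLock) { // outer lock: serialize takers
            for (;;) {
                Node first = head.next;
                if (first != null) {
                    Object task = first.task;
                    first.task = null;
                    head = first;
                    return task;
                }
                synchronized (putLock) { // inner lock: wait for a put
                    if (head.next == null) { // recheck to avoid a lost wakeup
                        w++;
                        try { putLock.wait(); }
                        catch (InterruptedException ignore) {} // as in the text
                        finally { w--; }
                    }
                }
            }
        }
    }
}
```

Note that put acquires only putLock and take acquires takeLock before putLock, so the partial order discussed next is respected.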
Another issue to note is that this solution has nested synchronized blocks in take. Nested synchronized blocks should always be examined for potential deadlocks. In this case, there will be no deadlock because put only acquires one lock, putLock. More generally, we would define a partial order over all the locks and ensure that the locks are always acquired in an order consistent with our partial order. For example, here, we could define takeLock < putLock and make sure that the synchronized blocks are entered in a way that respects that partial order.
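The partial-order rule can be illustrated with a small hypothetical class (not from the book's listings): every path that needs both locks acquires takeLock before putLock, so a deadlock cycle cannot form.

```java
public class OrderedLocks {
    private final Object takeLock = new Object(); // first in the lock order
    private final Object putLock  = new Object(); // second in the lock order
    private int state = 0; // shared state guarded by putLock

    public void takerPath() {
        synchronized (takeLock) {      // outer lock: takeLock
            synchronized (putLock) {   // inner lock: putLock
                state++;
            }
        }
    }

    public void putterPath() {
        synchronized (putLock) {       // acquires only putLock, which is
            state++;                   // trivially consistent with the order
        }
    }

    public int getState() {
        synchronized (putLock) { return state; }
    }
}
```

A deadlock would require one thread holding putLock while waiting for takeLock; since no method acquires the locks in that reversed order, the wait-for graph stays acyclic.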
As mentioned earlier, several Java-based implementations of queues are included in Java 2 1.5 in the java.util.concurrent package, some based on the simple strategies discussed here and some based on more complex strategies that provide additional flexibility and performance.
Distributed shared queues
A centralized shared queue may cause a hot spot, indicating that performance might be improved by a more distributed implementation. As an example, we will develop a simple package to support fork/join programs using a pool of threads and a distributed task queue in the underlying implementation. The package is a much-simplified version of the FJTask package [Lea00b], which in turn uses ideas from [BJK+96]. The idea is to create a fixed pool of threads to execute the tasks that are dynamically created as the program executes. Instead of a single central task queue, we associate a nonblocking queue with each thread. When a thread generates a new task, it is placed in its own queue. When a thread is able to execute a new task, it first tries to obtain a task from its own queue. If its own queue is empty, it randomly chooses another thread and attempts to steal a task from that thread's queue and continues checking the other queues until a task is found. (In [BJK+96], this is called random work stealing.)

A thread terminates when it receives a poison-pill task. For the fork/join programs we have in mind, this approach has been shown to work well when threads remove tasks from their own queue in LIFO (last in, first out) order and from other queues in FIFO (first in, first out) order. Therefore, we will add to the ADT an operation that removes the last element, to be used by threads to remove tasks from their own queues. The implementation can then be similar to Fig. 5.40, but with an additional method takeLast for the added operation. The result is shown in Fig. 5.41.
The remainder of the package comprises three classes.

Task is an abstract class. Applications extend it and override its run method to indicate the functionality of a task in the computation. Methods offered by the class include fork and join.

TaskRunner extends Thread and provides the functionality of the threads in the thread pool. Each instance contains a shared task queue. The task-stealing code is in this class.
Figure 5.41. Nonblocking shared queue with takeLast
operation
public class SharedQueue5
{
class Node
{ Object task;
Node next;
Node prev;
Node(Object task, Node prev)
{this.task = task; next = null; this.prev = prev;}
}
private Node head = new Node(null, null);
private Node last = head;
public synchronized void put(Object task)
{ assert task != null: "Cannot insert null task";
Node p = new Node(task, last);
last.next = p;
last = p;
}
public synchronized Object take()
{ //returns first task in queue or null if queue is empty
Object task = null;
if (!isEmpty())
{ Node first = head.next;
task = first.task;
first.task = null;
head = first;
}
return task;
}
public synchronized Object takeLast()
{ //returns last task in queue or null if queue is empty
Object task = null;
if (!isEmpty())
{ task = last.task; last = last.prev; last.next = null;}
return task;
}
}
TaskRunnerGroup manages the TaskRunners. It contains methods to initialize and shut down the thread pool. It also has a method executeAndWait that starts a task running and waits for its completion. This method is used to get the computation started. (It is needed because the fork method in class Task can only be invoked from within a Task. We describe the reason for this restriction later.)
Figure 5.42. Abstract base class for tasks
public abstract class Task implements Runnable
{
//done indicates whether the task is finished
private volatile boolean done;
public final void setDone(){done = true;}
public boolean isDone(){return done;}
//returns the currently executing TaskRunner thread
public static TaskRunner getTaskRunner()
{ return (TaskRunner)Thread.currentThread() ; }
//push this task on the local queue of current thread
public void fork()
{ getTaskRunner().put(this);
}
//wait until this task is done
public void join()
{ getTaskRunner().taskJoin(this);
}
//execute the run method of this task
public void invoke()
{ if (!isDone()){run(); setDone(); }
}
}
We will now discuss these classes in more detail. Task is shown in Fig. 5.42. The only state associated with the abstract class is done, which is marked volatile to ensure that any thread that tries to access it will obtain a fresh value.

The TaskRunner class is shown in Fig. 5.43, Fig. 5.44, and Fig. 5.45. The thread, as specified in the run method, loops until the poison task is encountered. First it tries to obtain a task from the back of its local queue. If the local queue is empty, it attempts to steal a task from the front of a queue belonging to another thread.
The code for the TaskRunnerGroup class is shown in Fig. 5.46. The constructor for TaskRunnerGroup initializes the thread pool, given the number of threads as a parameter. Typically, this value would be chosen to match the number of processors in the system. The executeAndWait method starts a task by placing it in the task queue of thread 0.

One use for this method is to get a computation started. Something like this is needed because we can't just fork a new Task from a main or other non-TaskRunner thread; this is what was meant by the earlier remark that the fork and join methods of Task can only be invoked from within another Task. This is because these methods require interaction with the TaskRunner thread executing the task (for example, fork involves adding the task to the thread's task queue); we find the appropriate TaskRunner using Thread.currentThread, so fork and join must be invoked only in code being executed by a thread that is a TaskRunner.
Figure 5.43. Class defining behavior of threads in the thread pool (continued in Fig.
5.44 and Fig. 5.45)
import java.util.*;
class TaskRunner extends Thread
{
private final TaskRunnerGroup g; //managing group
private final Random chooseToStealFrom; //random number generator
private final Task poison; //poison task
protected volatile boolean active; //state of thread
final int id; //index of task in the TaskRunnerGroup
private final SharedQueue5 q; //Nonblocking shared queue
//operations relayed to queue
public void put(Task t){q.put(t);}
public Task take(){return (Task)q.take() ;}
public Task takeLast(){return (Task)q.takeLast() ;}
//constructor
TaskRunner(TaskRunnerGroup g, int id, Task poison)
{ this.g = g;
this.id = id;
this.poison = poison;
chooseToStealFrom = new Random(System.identityHashCode(this));
setDaemon(true);
q = new SharedQueue5();
}
protected final TaskRunnerGroup getTaskRunnerGroup(){return g;}
protected final int getID(){return id;}
/* continued in next figure */
We normally also want the program that creates the initial task to wait until it completes before going on. To accomplish this and also meet the restriction on when fork can be invoked on a task, we create a "wrapper" task whose function is to start the initial task, wait for it to complete, and then notify the main thread (the one that called executeAndWait). We then add this wrapper task to thread 0's task queue, making it eligible to be executed, and wait for it to notify us (with notifyAll) that it has completed.

All of this may be clearer from the usage of fork, join, and executeAndWait in the Fibonacci example in the Examples section.
Figure 5.44. Class defining behavior of threads in the thread pool (continued from Fig.
5.43 and continued in Fig. 5.45)
/* continued from previous figure */
//Attempts to steal a task from another thread. First chooses a
//random victim, then continues with other threads until either
//a task has been found or all have been checked. If a task
//is found, it is invoked. The parameter waitingFor is a task
//on which this thread is waiting for a join. If steal is not
//called as part of a join, use waitingFor = null.
void steal(final Task waitingFor)
{ Task task = null;
TaskRunner[] runners = g.getRunners();
int victim = chooseToStealFrom.nextInt(runners.length);
for (int i = 0; i != runners.length; ++i)
{ TaskRunner tr = runners[victim];
if (waitingFor != null && waitingFor.isDone()){break;}
else
{ if (tr != null && tr != this)
task = (Task)tr.q.take();
if(task != null) {break;}
yield();
victim = (victim + 1)%runners.length;
}
} //have either found a task or have checked all other queues
//if have a task, invoke it
if(task != null && ! task.isDone())
{ task.invoke(); }
}
/* continued in next figure */
Examples
Computing Fibonacci numbers
We show in Fig. 5.47 and Fig. 5.48 code that uses our distributed queue package.[8] Recall that

[8] This code is essentially the same as the class to compute Fibonacci numbers that is provided as a demo with the FJTask package, except for the slight modification necessary to use the classes described previously.

fib(0) = 0    (Equation 5.7)

fib(1) = 1    (Equation 5.8)

fib(n) = fib(n - 1) + fib(n - 2), for n > 1    (Equation 5.9)
This is a classic divide-and-conquer algorithm. To use our task package, we define a class Fib that extends Task. Each Fib task contains a member number that initially contains the number for which the Fibonacci number should be computed and later is replaced by the result. The getAnswer method returns the result after it has been computed. Because this variable will be accessed by multiple threads, it is declared volatile.
Figure 5.45. Class defining behavior of threads in the thread pool (continued from Fig.
5.43 and Fig. 5.44)
/* continued from previous figure */
//Main loop of thread. First attempts to find a task on local
//queue and execute it. If not found, then tries to steal a task
//from another thread. Performance may be improved by modifying
//this method to back off using sleep or lowered priorities if the
//thread repeatedly iterates without finding a task. The run
//method, and thus the thread, terminates when it retrieves the
//poison task from the task queue.
public void run()
{ Task task = null;
try
{ while (!poison.equals(task))
{ task = (Task)q.takeLast();
if (task != null) { if (!task.isDone()){task.invoke();}}
else { steal(null); }
}
} finally { active = false; }
}
//Looks for another task to run and continues when Task w is done.
protected final void taskJoin(final Task w)
{ while(!w.isDone())
  { Task task = (Task)q.takeLast();
    if (task != null) { if (!task.isDone()){ task.invoke();}}
    else { steal(w); }
  }
}
The run method defines the behavior of each task. Recursive parallel decomposition is done by creating a new Fib object for each subtask, invoking the fork method on each subtask to start its computation, calling the join method for each subtask to wait for the subtasks to complete, and then computing the sum of their results.

The main method drives the computation. It first reads procs (the number of threads to create), num (the value for which the Fibonacci number should be computed), and optionally the sequentialThreshold. The value of this last, optional parameter (the default is 0) is used to decide when the problem is too small to bother with a parallel decomposition and should therefore use a sequential algorithm. After these parameters have been obtained, the main method creates a TaskRunnerGroup with the indicated number of threads, and then creates a Fib object, initialized with num. The computation is initiated by passing the Fib object to the TaskRunnerGroup's executeAndWait method. When this returns, the computation is finished. The thread pool is shut down with the TaskRunnerGroup's cancel method. Finally, the result is retrieved from the Fib object and displayed.
Figure 5.46. The TaskRunnerGroup class. This class initializes and manages the threads
in the thread pool.
class TaskRunnerGroup
{ protected final TaskRunner[] threads;
protected final int groupSize;
protected final Task poison;
public TaskRunnerGroup(int groupSize)
{ this.groupSize = groupSize;
threads = new TaskRunner[groupSize];
poison = new Task(){public void run(){assert false;}};
poison.setDone();
for (int i = 0; i!= groupSize; i++)
{threads[i] = new TaskRunner(this,i,poison);}
for(int i=0; i!= groupSize; i++){ threads[i].start(); }
}
//start executing task t and wait for its completion.
//The wrapper task is used in order to start t from within
//a Task (thus allowing fork and join to be used)
public void executeAndWait(final Task t)
{ final TaskRunnerGroup thisGroup = this;
Task wrapper = new Task()
{ public void run()
{ t.fork();
t.join();
setDone();
synchronized(thisGroup)
{ thisGroup.notifyAll();} //notify waiting thread
}
};
//add wrapped task to queue of thread[0]
threads[0].put(wrapper);
//wait for notification that t has finished.
synchronized(thisGroup)
{ try{thisGroup.wait();}
catch(InterruptedException e){return;}
}
  }
  /* cancel method, which shuts down the thread pool, continued in
     remainder of listing (not shown) */
}
Related Patterns
TheSharedQueuepatternisaninstanceoftheSharedDatapattern.Itisoftenusedtorepresentthe
taskqueuesinalgorithmsthatusetheMaster/Workerpattern.Itcanalsobeusedtosupportthread
poolbasedimplementationsoftheFork/Joinpattern.
Figure 5.47. Program to compute Fibonacci numbers (continued in Fig. 5.48)
public class Fib extends Task
{
volatile int number; // number holds value to compute initially,
//after computation is replaced by answer
Fib(int n) { number = n; } //task constructor, initializes number
//behavior of task
public void run() {
int n = number;
// Handle base cases:
if (n <= 1) { /* do nothing: fib(0) = 0; fib(1) = 1 */ }
// Use sequential code for small problems:
else if (n <= sequentialThreshold) {
number = seqFib(n);
}
// Otherwise use recursive parallel decomposition:
else {
// Construct subtasks:
Fib f1 = new Fib(n-1);
Fib f2 = new Fib(n-2);
// Fork subtasks, then wait for both to complete:
f1.fork(); f2.fork();
f2.join(); f1.join();
// Combine results:
number = f1.number + f2.number;
// (We know numbers are ready, so directly access them.)
}
}
// Sequential version for arguments less than threshold
static int seqFib(int n) {
if (n <= 1) return n;
else return seqFib(n-1) + seqFib(n-2);
}
//method to retrieve answer after checking to make sure
//computation has finished, note that done and isDone are
//inherited from the Task class. done is set by the executing
//(TaskRunner) thread when the run method is finished.
int getAnswer() {
if (!isDone()) throw new Error("Not yet computed");
return number;
}
/* continued in next figure */
Note that when the tasks in a task queue map onto a consecutive sequence of integers, a monotonic shared counter, which would be much more efficient, can be used in place of a queue.
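The shared-counter alternative can be sketched with java.util.concurrent.atomic (the class and method names here are ours, not from the book): a fetch-and-increment hands out task indices until they run out.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CounterTaskSource {
    private final AtomicInteger next = new AtomicInteger(0);
    private final int numTasks;

    public CounterTaskSource(int numTasks) { this.numTasks = numTasks; }

    // Returns the next task index, or -1 when all tasks are handed out.
    // getAndIncrement is an atomic fetch-and-add, so no locking is needed.
    public int take() {
        int i = next.getAndIncrement();
        return (i < numTasks) ? i : -1;
    }
}
```

Compared with a queue, this stores no task objects at all, which is why it is much more efficient when tasks are identified by consecutive integers.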
Figure 5.48. Program to compute Fibonacci numbers (continued from Fig. 5.47)
/* continued from previous figure */
//Performance-tuning constant, sequential algorithm is used to
//find Fibonacci numbers for values <= this threshold
static int sequentialThreshold = 0;
public static void main(String[] args) {
int procs; //number of threads
int num; //Fibonacci number to compute
try {
//read parameters from command line
procs = Integer.parseInt(args[0]);
num = Integer.parseInt(args[1]);
if (args.length > 2)
sequentialThreshold = Integer.parseInt(args[2]);
}
catch (Exception e) {
System.out.println("Usage: java Fib <threads> <number> "+
"[<sequentialThreshold>]");
return;
}
//initialize thread pool
TaskRunnerGroup g = new TaskRunnerGroup(procs);
//create first task
Fib f = new Fib(num);
//execute it
g.executeAndWait(f);
//computation has finished, shutdown thread pool
g.cancel();
//show result
long result = f.getAnswer();
System.out.println("Fib: Size: " + num + " Answer: " + result);
}
}
usually a system in which access times vary substantially depending on which PE is accessing which array element.

The challenge is to organize the arrays so that the elements needed by each UE are nearby at the right time in the computation. In other words, the arrays must be distributed about the computer so that the array distribution matches the flow of the computation.
This pattern is important for any parallel algorithm involving large arrays. It is particularly important when the algorithm uses the Geometric Decomposition pattern for its algorithm structure and the SPMD pattern for its program structure. Although this pattern is in some respects specific to distributed-memory environments in which global data structures must be somehow distributed among the ensemble of PEs, some of the ideas of this pattern apply if the single address space is implemented on a NUMA platform, in which all PEs have access to all memory locations, but access time varies. For such platforms, it is not necessary to explicitly decompose and distribute arrays, but it is still important to manage the memory hierarchy so that array elements stay close[9] to the PEs that need them. Because of this, on NUMA machines, MPI programs can sometimes outperform similar algorithms implemented using a native multithreaded API. Further, the ideas of this pattern can be used with a multithreaded API to keep memory pages close to the processors that will work with them. For example, if the target system uses a first-touch page-management scheme, efficiency is improved if every array element is initialized by the PE that will be working with it. This strategy, however, breaks down if arrays need to be remapped in the course of the computation.
[9] NUMA computers are usually built from hardware modules that bundle together processors and a subset of the total system memory. Within one of these hardware modules, the processors and memory are "close" together, and processors can access this "close" memory in much less time than remote memory.
Forces
Load balance. Because a parallel computation is not finished until all UEs complete their work, the computational load among the UEs must be distributed so each UE takes nearly the same time to compute.

Effective memory management. Modern microprocessors are much faster than the computer's memory. To address this problem, high-performance computer systems include complex memory hierarchies. Good performance depends on making good use of this memory hierarchy, and this is done by ensuring that the memory references implied by a series of calculations are close to the processor making the calculation (that is, data reuse from the caches is high and needed pages stay accessible to the processor).

Clarity of abstraction. Programs involving distributed arrays are easier to write, debug, and maintain if it is clear how the arrays are divided among UEs and mapped to local arrays.
Solution
Overview
The solution is simple to state at a high level; it is the details that make it complicated. The basic approach is to partition the global array into blocks and then map those blocks onto the UEs. This mapping onto UEs should be done so that, as the computation unfolds, each UE has an equal amount of work to carry out (that is, the load must be well balanced). Unless all UEs share a single address space, each UE's blocks will be stored in an array that is local to a single UE. Thus, the code will access elements of the distributed array using indices into a local array. The mathematical description of the problem and solution, however, is based on indices into the global array. Thus, it must be clear how to move back and forth between two views of the array, one in which each element is referenced by global indices and one in which it is referenced by a combination of local indices and UE identifier. Making these translations clear within the text of the program is the challenge of using this pattern effectively.
Array distributions
Over the years, a small number of array distributions have become standard.

One-dimensional (1D) block. The array is decomposed in one dimension only and distributed one block per UE. For a 2D matrix, for example, this corresponds to assigning a single block of contiguous rows or columns to each UE. This distribution is sometimes called a column block or row block distribution, depending on which single dimension is distributed among the UEs. The UEs are conceptually organized as a 1D array.

Two-dimensional (2D) block. As in the 1D block case, one block is assigned to each UE, but now the block is a rectangular subblock of the original global array. This mapping views the collection of UEs as a 2D array.

Block-cyclic. The array is decomposed into blocks (using a 1D or 2D partition) such that there are more blocks than UEs. These blocks are then assigned round-robin to UEs, analogous to the way a deck of cards is dealt out. The UEs may be viewed as either a 1D or 2D array.

Next, we explore these distributions in more detail. For illustration, we use a square matrix A of order 8, as shown in Fig. 5.49.[10]
[10] In this and the other figures in this pattern, we will use the following notational conventions: A matrix element will be represented as a lowercase letter with subscripts representing indices; for example, a1,2 is the element in row 1 and column 2 of matrix A. A submatrix will be represented as an uppercase letter with subscripts representing indices; for example, A0,0 is a submatrix containing the top left corner of A. When we talk about assigning parts of A to UEs, we will reference different UEs using UE and an index or indices in parentheses; for example, if we are regarding UEs as forming a 1D array, UE(0) is the conceptually leftmost UE, while if we are regarding UEs as forming a 2D array, UE(0,0) is the conceptually top left UE. Indices are all assumed to be zero-based (that is, the smallest index is 0).
1D block. Fig. 5.50 shows a column block distribution of A onto a linear array of four UEs. The matrix is decomposed along the column index only; the number of columns in each block, MB (2 here), is the matrix order divided by the number of UEs. Matrix element (i,j) is assigned to UE(j\MB).[11]

[11] We will use the notation "\" for integer division and "/" for normal division. Thus a\b = ⌊a/b⌋. Also, ⌊x⌋ (floor) is the largest integer at most x, and ⌈x⌉ (ceiling) is the smallest integer at least x. For example, ⌊4/3⌋ = 1, and ⌈4/3⌉ = 2.
Mapping to UEs. More generally, we could have an N x M matrix where the number of UEs, P, need not divide the number of columns evenly. In this case, MB is the maximum number of columns mapped to a UE, and all UEs except UE(P-1) contain MB columns. Then MB = ⌈M/P⌉, and elements of column j are mapped to UE(⌊j/MB⌋).[12] (This reduces to the formula given earlier for the example because, in the special case where P evenly divides M, ⌈M/P⌉ = M/P and ⌊j/MB⌋ = j\MB.) Analogous formulas apply for row distributions.
[12] Notice that this is not the only possible way to distribute columns among UEs when the number of UEs does not evenly divide the number of columns. Another approach, more complex to define but producing a more balanced distribution in some cases, is to first define the minimum number of columns per UE as ⌊M/P⌋, and then increase this number by one for the first (M mod P) UEs. For example, for M = 10 and P = 4, UE(0) and UE(1) would have three columns each and UE(2) and UE(3) would have two columns each.
Mapping to local indices. In addition to mapping the columns to UEs, we also need to map the global indices to local indices. In this case, matrix element (i,j) maps to local element (i, j mod MB). Given local indices (x,y) on UE(w), we can recover the global indices (x, wMB + y). Again, analogous formulas apply for row distributions.
2D block. Fig. 5.51 shows a 2D block distribution of A onto a two-by-two array of UEs. Here, A is being decomposed along two dimensions, so for each subblock, the number of columns is the matrix order divided by the number of columns of UEs, and the number of rows is the matrix order divided by the number of rows of UEs. Matrix element (i,j) is assigned to UE(i\4, j\4).

Mapping to UEs. More generally, we map an N x M matrix to a PR x PC array of UEs. The maximum size of a subblock is NB x MB, where NB = ⌈N/PR⌉ and MB = ⌈M/PC⌉. Then, element (i,j) in the global matrix is stored in UE(⌊i/NB⌋, ⌊j/MB⌋).
Mapping to local indices. Global indices (i,j) map to local indices (i mod NB, j mod MB). Given local indices (x,y) on UE(z,w), the corresponding global indices are (zNB + x, wMB + y).
Block-cyclic. The main idea behind the block-cyclic distribution is to create more blocks than UEs and allocate them in a cyclic manner, similar to dealing out a deck of cards. Fig. 5.52 shows a 1D block-cyclic distribution of A onto a linear array of four UEs, illustrating how columns are assigned to UEs in a round-robin fashion. Here, matrix element (i,j) is assigned to UE(j mod 4) (where 4 is the number of UEs).

Fig. 5.53 and Fig. 5.54 show a 2D block-cyclic distribution of A onto a two-by-two array of UEs: Fig. 5.53 illustrates how A is decomposed into two-by-two submatrices. (We could have chosen a different decomposition, for example one-by-one submatrices, but two-by-two illustrates how this distribution can have both block and cyclic characteristics.) Fig. 5.54 then shows how these submatrices are assigned to UEs. Matrix element (i,j) is assigned to UE(⌊i/2⌋ mod 2, ⌊j/2⌋ mod 2).

Mapping to UEs. In the general case, we have an N x M matrix to be mapped onto a PR x PC array of UEs. We choose block size NB x MB. Element (i,j) in the global matrix will be mapped to UE(z,w), where z = ⌊i/NB⌋ mod PR and w = ⌊j/MB⌋ mod PC.
Mapping to local indices. Because multiple blocks are mapped to the same UE, we can view the local indexing blockwise or elementwise.

In the blockwise view, each element on a UE is indexed locally by block indices (l,m) and indices (x,y) into the block. To restate this: In this scheme, the global matrix element (i,j) will be found on the UE within the local (l,m) block at the position (x,y), where (l,m) = (⌊i/(PR NB)⌋, ⌊j/(PC MB)⌋) and (x,y) = (i mod NB, j mod MB). Fig. 5.55 illustrates this for UE(0,0).
Figure 5.55. 2D block-cyclic distribution of A onto four UEs: Local view of elements of
A assigned to UE(0,0). LAl,m is the block with block indices (l, m). Each element is
labeled both with its original global indices (ai,j) and its indices within block LAl,m
(lx,y).
For example, consider global matrix element a5,1. Because PR = PC = NB = MB = 2, this element will map to UE(0,0). There are four two-by-two blocks on this UE. From the figure, we see that this element appears in the block on the bottom left, or block LA1,0, and indeed, from the formulas, we obtain (l,m) = (⌊5/(2x2)⌋, ⌊1/(2x2)⌋) = (1,0). Finally, we need the local indices within the block. In this case, the indices within the block are (x,y) = (5 mod 2, 1 mod 2) = (1,1).
In the elementwise view (which requires that all the blocks for each UE form a contiguous matrix), global indices (i,j) are mapped elementwise to local indices (lNB + x, mMB + y), where l and m are defined as before. Fig. 5.56 illustrates this for UE(0,0).
Figure 5.56. 2D block-cyclic distribution of A onto four UEs: Local view of elements of A assigned to UE(0,0). Each element is labeled both with its original global indices ai,j and its local indices (x', y'). Local indices are with respect to the contiguous matrix used to store all blocks assigned to this UE.
Again, looking at global matrix element a5,1, we see that, viewing the data as a single matrix, the element is found at local indices (1x2+1, 0x2+1) = (3,1). Local indices (x,y) in block (l,m) on UE(z,w) correspond to global indices ((lPR + z)NB + x, (mPC + w)MB + y).
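The block-cyclic formulas can likewise be captured in a small helper class. This sketch is ours (names are hypothetical); it implements exactly the mappings stated above, and the a5,1 example can be used to check it.

```java
public class BlockCyclicMap {
    final int PR, PC; // dimensions of the UE array
    final int NB, MB; // block size

    BlockCyclicMap(int PR, int PC, int NB, int MB) {
        this.PR = PR; this.PC = PC;
        this.NB = NB; this.MB = MB;
    }

    // UE(z,w) holding global element (i,j):
    // z = floor(i/NB) mod PR, w = floor(j/MB) mod PC
    int[] ueOf(int i, int j) {
        return new int[]{ (i / NB) % PR, (j / MB) % PC };
    }

    // local block indices (l,m) = (floor(i/(PR*NB)), floor(j/(PC*MB)))
    int[] blockOf(int i, int j) {
        return new int[]{ i / (PR * NB), j / (PC * MB) };
    }

    // indices (x,y) within the block: (i mod NB, j mod MB)
    int[] inBlock(int i, int j) {
        return new int[]{ i % NB, j % MB };
    }
}
```

With PR = PC = NB = MB = 2, element (5,1) maps to UE(0,0), block (1,0), in-block position (1,1), matching the worked example.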
Choosing a distribution
To select which distribution to use for a problem, consider how the computational load on the UEs changes as the computation proceeds. For example, in many single-channel signal-processing problems, the same set of operations is performed on each column of an array. The work does not vary as the computation proceeds, so a column block decomposition will produce both clarity of code and good load balance. If instead the amount of work varies by column, with higher-numbered columns requiring more work, a column block decomposition would lead to poor load balance, with the UEs processing lower-numbered columns finishing ahead of the UEs processing higher-numbered columns. In this case, a cyclic distribution would produce better load balance, because each UE is assigned a mix of low-numbered and high-numbered columns.

This same approach applies to higher dimensions as well. ScaLAPACK [Sca, BCC+97], the leading package of dense linear algebra software for distributed-memory computers, requires a 2D block-cyclic distribution. To see why this choice was made, consider Gaussian elimination, one of the more commonly used of the ScaLAPACK routines. In this algorithm, also known as LU decomposition, a dense square matrix is transformed into a pair of triangular matrices, an upper matrix U and a lower matrix L. At a high level, the algorithm proceeds from the upper left corner and works its way down the diagonal of the global matrix, eliminating elements below the diagonal and transforming the remaining blocks to the right as needed. A block distribution would result in idle UEs as the processing marches down the diagonal. But with a 2D block-cyclic distribution such as the one shown in Fig. 5.54, each UE contains elements used both early and late in the algorithm, resulting in excellent load balance.
Mapping indices
The examples in the preceding section illustrate how each element of the original (global) array is mapped to a UE and how each element in the global array, after distribution, is identified by both a set of global indices and a combination of UE identifier and local information. The original problem is typically stated in terms of global indices, but computation within each UE must be in terms of local indices. Applying this pattern effectively requires that the relationship between global indices and the combination of UE and local indices be as transparent as possible. In a quest for program efficiency, it is altogether too easy to bury these index mappings in the code in a way that makes the program painfully difficult to debug. A better approach is to use macros and inline functions to capture the index mappings; a human reader of the program then only needs to master the macro or function once. Such macros or functions also contribute to clarity of abstraction. The Examples section illustrates this strategy.
Aligning computation with locality
One of the cardinal rules of performance-oriented computing is to maximize reuse of data close to a UE. That is, the loops that update local data should be organized in a way that gets as much use as possible out of each memory reference. This objective can also influence the choice of array distribution.

For example, in linear algebra computations, it is possible to organize computations on a matrix into smaller computations over submatrices. If these submatrices fit into cache, dramatic performance gains can result. Similar effects apply to other levels of the memory hierarchy: minimizing misses in the translation lookaside buffer (TLB), page faults, and so on. A detailed discussion of this topic goes well beyond the scope of this book. An introduction can be found in [PH98].
Examples
Transposing a matrix stored as column blocks
As an example of organizing matrix computations into smaller computations over submatrices, consider transposing a square matrix distributed with a column block distribution. For simplicity, we will assume that the number of UEs evenly divides the number of columns, so that all blocks are the same size. Our strategy for transposing the matrix will be based on logically decomposing the matrix into square submatrices, as shown in Fig. 5.57. Each of the labeled blocks in the figure represents a square submatrix; labels show how the blocks of the transpose relate to the blocks of the original matrix. (For example, the block labeled (A0,1)T in the transpose is the transpose of the block labeled A0,1 in the original matrix.) The algorithm proceeds in phases; the number of phases is the number of submatrices per UE (which is also the number of UEs). In the first phase, we transpose the submatrices on the diagonal of A, with each UE transposing one submatrix and no communication required. In successive phases, we transpose the submatrices one below the diagonal, then two below the diagonal, and so forth, wrapping around to the top of the matrix as necessary. In each of these phases, each UE must transpose one of its submatrices, send it to another UE, and receive a submatrix. For example, in the second phase, the UE labeled UE(1) must compute (A2,1)T, send it to UE(2), and receive (A0,1)T from UE(0). Figs. 5.58 and 5.59 show code to transpose such a matrix.

This code represents a function that will transpose a square column-blocked array. We assume the blocks are distributed contiguously with one column block per UE. This function is intended as part of a larger program, so we assume the array has already been distributed prior to calling this function.
Figure 5.57. Matrix A and its transpose, in terms of submatrices, distributed among
four UEs
The program represents each local column block (one for A and one for the transposed result) as a 1D array. These arrays in turn consist of num_procs submatrices each, each of size block_size = Block_order * Block_order, where Block_order is the number of columns per UE. We can therefore find the block indexed ID using the BLOCK macro:
Figure 5.58. Code to transpose a matrix (continued in Fig. 5.59)
/*******************************************************************
NAME: trans_isnd_ircv
PURPOSE: This function uses MPI Isend and Irecv to transpose
a column-block distributed matrix.
*******************************************************************/
#include "mpi.h"
#include <stdio.h>
/*******************************************************************
** This function transposes a local block of a matrix. We don't
** display the text of this function, as it is not relevant to the
** point of this example.
*******************************************************************/
/* continued from previous figure */
void trans_isnd_ircv(double *buff, double *trans, int Block_order,
double *work, int my_ID, int num_procs)
{
int iphase;
int block_size;
int send_to, recv_from;
double *bblock; /* pointer to current location in buff */
double *tblock; /* pointer to current location in trans */
MPI_Status status;
MPI_Request send_req, recv_req;
block_size = Block_order * Block_order;
/*******************************************************************
** Do the transpose in num_procs phases.
**
** In the first phase, do the diagonal block. Then move out
** from the diagonal copying the local matrix into a communication
** buffer (while doing the local transpose) and send to process
** (diag+phase)%num_procs.
*******************************************************************/
bblock = BLOCK(buff, my_ID);
tblock = BLOCK(trans, my_ID);
transpose(bblock, Block_order, tblock, Block_order,
Block_order, Block_order);
for (iphase=1; iphase<num_procs; iphase++){
   recv_from = FROM(my_ID, iphase, num_procs);
   tblock = BLOCK(trans, recv_from);
   MPI_Irecv(tblock, block_size, MPI_DOUBLE, recv_from,
             iphase, MPI_COMM_WORLD, &recv_req);
   send_to = TO(my_ID, iphase, num_procs);
   bblock = BLOCK(buff, send_to);
   transpose(bblock, Block_order, work, Block_order,
             Block_order, Block_order);
   MPI_Isend(work, block_size, MPI_DOUBLE, send_to,
             iphase, MPI_COMM_WORLD, &send_req);
   MPI_Wait(&recv_req, &status);
   MPI_Wait(&send_req, &status);
}
}
BUFF is the start of the 1D array (buff for the original array, trans for the transpose) and ID is the second index of the block. So, for example, we find the diagonal block of both arrays as follows:
bblock = BLOCK(buff, my_ID);
tblock = BLOCK(trans, my_ID);
In succeeding phases of the algorithm, we must determine two things: (1) the index of the block we should transpose and send and (2) the index of the block we should receive. We do this with the TO and FROM macros:
#define TO(ID, PHASE, NPROC) ((ID + PHASE) % NPROC)
#define FROM(ID, PHASE, NPROC) ((ID + NPROC - PHASE) % NPROC)
The TO index shows the progression through the off-diagonal blocks, working down from the diagonal and wrapping back to the top at the bottom of the matrix. At each phase of the algorithm, we compute which UE is to receive the block and then update the local pointer (bblock) to the block that will be sent:
send_to = TO(my_ID, iphase, num_procs);
bblock = BLOCK(buff, send_to);
Likewise, we compute where the next block is coming from and which local index corresponds to that block:
recv_from = FROM(my_ID, iphase, num_procs);
tblock = BLOCK(trans, recv_from);
This continues until all of the blocks have been transposed.
We use immediate (nonblocking) sends and receives in this example. (These primitives are described in more detail in the MPI appendix, Appendix B.) During each phase, each UE first posts a receive and then performs a transpose on the block it will send. After that transpose is complete, the UE sends the now-transposed block to the UE that should receive it. At the bottom of the loop, and before moving to the next phase, functions are called to force the UE to wait until both the sends and receives complete. This approach lets us overlap communication and computation. More importantly (because in this case there isn't much computation to overlap with communication), it prevents deadlock: A more straightforward approach using regular sends and receives would be to first transpose the block to be sent, then send it, and then (wait to) receive a block from another UE. However, if the blocks to be sent are large, a regular send might block because there is insufficient buffer space for the message; in this case, such blocking could produce deadlock. By instead using nonblocking sends and receives and posting the receives first, we avoid this situation.
Known uses
This pattern is used throughout the scientific computing literature. The well-known ScaLAPACK package [Sca, BCC+97] makes heavy use of the 2D block-cyclic distribution, and the documentation gives a thorough explanation of mapping and indexing issues for this distribution.
Several different array distributions were embedded into the HPF language [HPF97] definition.
Some of the most creative uses of this pattern can be found in quantum chemistry, particularly in the area of post-Hartree-Fock computations. The Global Arrays or GA package [NHL94, NHL96, NHK+02, Gloa] was created specifically to address distributed array problems in post-Hartree-Fock algorithms. A more recent approach is described in [NHL96, LDSH95].
The PLAPACK package [ABE+97, PLA, vdG97] takes a different approach to array distribution. Rather than focusing on how to distribute the arrays, PLAPACK considers how the vectors operated upon by the arrays are organized. From these distributed vectors, the corresponding array distributions are derived. In many problems, these vectors correspond to the physical quantities in the problem domain, so the PLAPACK team refers to this as the physically based distribution.
Related Patterns
The Distributed Array pattern is often used together with the Geometric Decomposition and SPMD patterns.
5.11. OTHER SUPPORTING STRUCTURES
Several other patterns and supporting structures have at various times been important in parallel programming. They are only rarely used at this time, but it is still important to be aware of them. They can provide insights into different opportunities for finding and exploiting concurrency. And it is possible that as parallel architectures continue to evolve, the parallel programming techniques suggested by these patterns may become important.
In this section, we will briefly describe some of these additional patterns and their supporting structures: SIMD, MPMD, Client-Server, and Declarative Programming. We close with a brief discussion of problem-solving environments. These are not patterns, but they help programmers work within a targeted set of problems.
5.11.1. SIMD
A SIMD computer has a single stream of instructions operating on multiple streams of data. These machines were inspired by the belief that programmers would find it too difficult to manage multiple streams of instructions. Many important problems are data parallel; that is, the concurrency can be expressed in terms of concurrent updates across the problem's data domain. Carried to its logical extreme, the SIMD approach assumes that it is possible to express all parallelism in terms of the data. Programs would then have single-thread semantics, making understanding and hence debugging them much easier. The basic idea behind the SIMD pattern can be summarized as follows.
Define a network of virtual PEs to be mapped onto the actual PEs. These virtual PEs are connected according to a well-defined topology. Ideally the topology is (1) well aligned with the way the PEs in the physical machine are connected and (2) effective for the communication patterns implied by the problem being solved.
Express the problem in terms of arrays or other regular data structures that can be updated concurrently with a single stream of instructions.
Associate these arrays with the local memories of the virtual PEs.
Create a single stream of instructions that operates on slices of the regular data structures. These instructions may have an associated mask so they can be selectively skipped for subsets of array elements. This is critical for handling boundary conditions or other constraints.
When a problem is truly data parallel, this is an effective pattern. The resulting programs are relatively easy to write and debug [DKK90].
Unfortunately, most problems that are largely data parallel contain subproblems that are not. Setting up the core data structures, dealing with boundary conditions, and postprocessing after a core data-parallel algorithm can all introduce logic that might not be strictly data parallel. Furthermore, this style of programming is tightly coupled to compilers that support data-parallel programming. These compilers have proven difficult to write and result in code that is difficult to optimize because it can be far removed from how a program runs on a particular machine. Thus, this style of parallel programming and the machines built around the SIMD concept have largely disappeared, except for a few special-purpose machines used for signal-processing applications.
The programming environment most closely associated with the SIMD pattern is High Performance Fortran (HPF) [HPF97]. HPF is an extension of the array-based constructs in Fortran 90. It was created to support portable parallel programming across SIMD machines, but also to allow the SIMD programming model to be used on MIMD computers. This required explicit control over data placement onto the PEs and the capability to remap the data during a calculation. Its dependence on a strictly data-parallel, SIMD model, however, doomed HPF by making it difficult to use with complex applications. The last large community of HPF users is in Japan [ZJS+02], where they have extended the language to relax the data-parallel constraints [HPF99].
5.11.2. MPMD
The Multiple Program, Multiple Data (MPMD) pattern, as the name implies, is used in a parallel algorithm when different programs run on different UEs. The basic approach is the following.
Decompose the problem into a set of subproblems, where each subproblem maps onto a subset of UEs. Often each subset of UEs corresponds to the nodes of a different parallel computer.
Create independent programs solving the appropriate subproblems and tuned to the relevant target UEs.
Coordinate the programs running on distinct UEs as needed, typically through a message-passing framework.
In many ways, the MPMD approach is not too different from an SPMD program using MPI. In fact, the runtime environments associated with the two most common implementations of MPI, MPICH [MPI] and LAM/MPI [LAM], support simple MPMD programming.
Applications of the MPMD pattern typically arise in one of two ways. First, the architecture of the UEs may be so different that a single program cannot be used across the full system. This is the case when using parallel computing across some type of computational grid [Glob, FK03] using multiple classes of high-performance computing architectures. The second (and from a parallel algorithm point of view more interesting) case occurs when completely different simulation programs are combined into a coupled simulation.
For example, climate emerges from a complex interplay between atmospheric and ocean phenomena. Well-understood programs for modeling the ocean and the atmosphere independently have been developed and highly refined over the years. Although an SPMD program could be created that implements a coupled ocean/atmospheric model directly, a more effective approach is to take the separate, validated ocean and atmospheric programs and couple them through some intermediate layer, thereby producing a new coupled model from well-understood component models.
Although both MPICH and LAM/MPI provide some support for MPMD programming, they do not allow different implementations of MPI to interact, so only MPMD programs using a common MPI implementation are supported. To address a wider range of MPMD problems spanning different architectures and different MPI implementations, a new standard called interoperable MPI (IMPI) was created. The general idea of coordinating UEs through the exchange of messages is common to MPI and IMPI, but the detailed semantics are extended in IMPI to address the unique challenges arising from programs running on widely differing architectures. These multi-architecture issues can add significant communication overhead, so the part of an algorithm dependent on the performance of IMPI must be relatively coarse-grained.
MPMD programs are rare. As increasingly complicated coupled simulations grow in importance, however, use of the MPMD pattern will increase. Use of this pattern will also grow as grid technology becomes more robust and more widely deployed.
5.11.3. Client-Server Computing
Client-server architectures are related to MPMD. Traditionally, these systems have comprised two or three tiers, where the front end is a graphical user interface executed on a client's computer and a mainframe back end (often with multiple processors) provides access to a database. The middle tier, if present, dispatches requests from the clients to (possibly multiple) back ends. Web servers are a familiar example of a client-server system. More generally, a server might offer a variety of services to clients, an essential aspect of the system being that services have well-defined interfaces. Parallelism can appear at the server (which can service many clients concurrently or can use parallel processing to obtain results more quickly for single requests) and at the client (which can initiate requests at more than one server simultaneously).
Techniques used in client-server systems are especially important in heterogeneous systems. Middleware such as CORBA [COR] provides a standard for service interface specifications, enabling new programs to be put together by composing existing services, even if those services are offered on vastly different hardware platforms and implemented in different programming languages. CORBA also provides facilities to allow services to be located. Java's J2EE (Java 2 Platform, Enterprise Edition) [Javb] also provides significant support for client-server applications. In both of these cases, interoperability was a major design force.
Client-server architectures have traditionally been used in enterprise rather than scientific applications. Grid technology, which is heavily used in scientific computing, borrows from client-server technology, extending it by blurring the distinction between clients and servers. All resources in a grid, whether they are computers, instruments, file systems, or anything else connected to the network, are peers and can serve as clients and servers. The middleware provides standards-based interfaces to tie the resources together into a single system that spans multiple administrative domains.
5.11.4. Concurrent Programming with Declarative Languages
The overwhelming majority of programming is done with imperative languages such as C++, Java, or Fortran. This is particularly the case for traditional applications in science and engineering. The artificial intelligence community and a small subset of academic computer scientists, however, have developed and shown great success with a different class of languages, the declarative languages. In these languages, the programmer describes a problem, a problem domain, and the conditions solutions must satisfy. The runtime system associated with the language then uses these to find valid solutions. Declarative semantics impose a different style of programming that overlaps with the approaches discussed in this pattern language, but has some significant differences. There are two important classes of declarative languages: functional languages and logic programming languages.
Logic programming languages are based on formal rules of logical inference. The most common logic programming language by far is Prolog [SS94], a programming language based on first-order predicate calculus. When Prolog is extended to support expression of concurrency, the result is a concurrent logic programming language. Concurrency is exploited in one of three ways with these Prolog extensions: AND-parallelism (execute multiple predicates), OR-parallelism (execute multiple guards), or through explicit mapping of predicates linked together through single-assignment variables [CG86].
Concurrent logic programming languages were a hot area of research in the late 1980s and early 1990s. They ultimately failed because most programmers were deeply committed to more traditional imperative languages. Even with the advantages of declarative semantics and the value of logic programming for symbolic reasoning, the learning curve associated with these languages proved prohibitive.
The older and more established class of declarative programming languages is based on functional programming models [Hud89]. LISP is the oldest and best known of the functional languages. In pure functional languages, there are no side effects from a function. Therefore, functions can execute as soon as their input data is available. The resulting algorithms express concurrency in terms of the flow of data through the program, thereby resulting in "dataflow" algorithms [Jag96].
The best-known concurrent functional languages are Sisal [FCO90], Concurrent ML [Rep99, Con] (an extension to ML), and Haskell [HPF]. Because mathematical expressions are naturally written down in a functional notation, Sisal was particularly straightforward to work with in science and engineering applications and proved to be highly efficient for parallel programming. However, just as with the logic programming languages, programmers were unwilling to part with their familiar imperative languages, and Sisal essentially died. Concurrent ML and Haskell have not made major inroads into high-performance computing, although both remain popular in the functional programming community.
5.11.5. Problem-Solving Environments
A discussion of supporting structures for parallel algorithms would not be complete without mentioning problem-solving environments (PSEs). A PSE is a programming environment specialized to the needs of a particular class of problems. When applied to parallel computing, PSEs often imply a particular algorithm structure as well.
The motivation behind PSEs is to spare the application programmer the low-level details of the parallel system. For example, PETSc (Portable, Extensible Toolkit for Scientific Computation) [BGMS98] supports a variety of distributed data structures and functions required to use them for solving partial differential equations (typically for problems fitting the Geometric Decomposition pattern). The programmer needs to understand the data structures within PETSc, but is spared the need to master the details of how to implement them efficiently and portably. Other important PSEs are PLAPACK [ABE+97] (for dense linear algebra problems) and POOMA [RHC+96] (an object-oriented framework for scientific computing).
PSEs have not been very well accepted. PETSc is probably the only PSE that is heavily used for serious application programming. The problem is that by tying themselves to a narrow class of problems, PSEs restrict their potential audience and have a difficult time reaching a critical mass of users. We believe that over time, and as the core patterns behind parallel algorithms become better understood, PSEs will be able to broaden their impact and play a more dominant role in parallel programming.
UE management
Synchronization
Communication
Within each of these categories, the most commonly used mechanisms are covered in this chapter. An overview of this design space and its place in the pattern language is shown in Fig. 6.1.
Figure 6.1. Overview of the Implementation Mechanisms design space and its place in
the pattern language
In this chapter we also drop the formalism of patterns. Most of the implementation mechanisms are included within the major parallel programming environments. Hence, rather than use patterns, we provide a high-level description of each implementation mechanism and then investigate how the mechanism maps onto our three target programming environments: OpenMP, MPI, and Java. This mapping will in some cases be trivial and require little more than presenting an existing construct in an API or language. The discussion will become interesting when we look at operations native to one programming model, but foreign to another. For example, it is possible to do message passing in OpenMP. It is not pretty, but it works and can be useful at times.
We assume that the reader is familiar with OpenMP, MPI, and Java and how they are used for writing parallel programs. Although we cover specific features of these APIs in this chapter, the details of using them are left to the appendixes.
6.1. OVERVIEW
Parallel programs exploit concurrency by mapping instructions onto multiple UEs. At a very basic level, every parallel program needs to (1) create the set of UEs, (2) manage interactions between them and their access to shared resources, (3) exchange information between UEs, and (4) shut them down in an orderly manner. This suggests the following categories of implementation mechanisms.
UE management. The creation, destruction, and management of the processes and threads used in parallel computation.
Synchronization. Enforcing constraints on the ordering of events occurring in different UEs. This is primarily used to ensure that shared resources are accessed by a collection of UEs in such a way that the program is correct regardless of how the UEs are scheduled.
Communication. The exchange of information between UEs.
6.2. UE MANAGEMENT
Let us revisit the definition of unit of execution, or UE. A UE is an abstraction for the entity that carries out computations and is managed for the programmer by the operating system. In modern parallel programming environments, there are two types of UEs: processes and threads.
A process is a heavyweight object that carries with it the state or context required to define its place in the system. This includes memory, program counters, registers, buffers, open files, and anything else required to define its context within the operating system. In many systems, different processes can belong to different users, and thus processes are well protected from each other. Creating a new process and swapping between processes is expensive because all that state must be saved and restored. Communication between processes, even on the same machine, is also expensive because the protection boundaries must be crossed.
A thread, on the other hand, is a lightweight UE. A collection of threads is contained in a process. Most of the resources, including the memory, belong to the process and are shared among the threads. The result is that creating a new thread and switching context between threads is less expensive, requiring only the saving of a program counter and some registers. Communication between threads belonging to the same process is also inexpensive because it can be done by accessing the shared memory.
The mechanisms for managing these two types of UEs are completely different. We handle them separately in the next two sections.
6.2.1. Thread Creation/Destruction
Threads require a relatively modest number of machine cycles to create. Programmers can reasonably create and destroy threads as needed inside a program, and as long as they do not do so inside tight loops or time-critical kernels, the program's overall runtime will be affected only modestly. Hence, most parallel programming environments make it straightforward to create threads inside a program, and the API supports thread creation and destruction.
OpenMP: thread creation/destruction
In OpenMP, threads are created with the parallel pragma:
#pragma omp parallel
{ structured block }
Each thread will independently execute the code within the structured block. A structured block is just a block of statements with a single point of entry at the top and a single point of exit at the bottom. The number of threads created in OpenMP can be either left to the operating system or controlled by the programmer (see the OpenMP appendix, Appendix A, for more details).
Destruction of threads occurs at the end of the structured block. The threads wait at the end of the structured block. After all threads have arrived, the threads are destroyed and the original or master thread continues.
Java: thread creation/destruction
In Java, threads are instances of the java.lang.Thread class or a subclass of the Thread class. A Thread object is instantiated in the usual way using the new keyword, and then the start method is invoked to launch the thread. The thread thus created can access any variable visible according to Java's scope rules.
There are two ways to specify the behavior of a thread. The first is to create a subclass of Thread and override the run method. The following shows how to do this to create a thread that, when launched, will execute thread_body.
class MyThread extends Thread
{ public void run(){ thread_body }}
To create and launch the thread, one would write
Thread t = new MyThread(); //create thread object
t.start(); //launch the thread
The second approach is to define a class that implements the Runnable interface, which declares a single run method. To create and execute the thread, we create a Runnable object and pass it to the Thread constructor. The thread is launched using the start method as before:
Thread t = new Thread(new MyRunnable()); //create Runnable
//and Thread objects
t.start(); //start the thread
In most cases, the second approach is preferred.[1]
[1] In Java, a class can implement any number of interfaces, but is only allowed to extend a single superclass. Thus, extending Thread in the first approach means that the class defining the run method cannot extend an application-specific class.
A thread terminates when the run method returns. The Thread object itself will be collected after termination by the garbage collector in the same way as any other object in a Java program.
The java.util.concurrent package defines the Executor and ExecutorService interfaces and provides several classes implementing them. These classes directly support higher-level structures such as the Master/Worker pattern by arranging for the execution of Runnables while hiding the details of thread creation and scheduling. More details are given in the Java appendix, Appendix C.
MPI: thread creation/destruction
MPI is fundamentally based on processes. It is thread-aware in that the MPI 2.0 API [Mesa] defines different levels of thread safety and provides functions to query a system at runtime as to the level of thread safety that is supported. The API, however, has no concept of creating and destroying threads. To mix threads into an MPI program, the programmer must use a thread-based programming model in addition to MPI.
6.2.2. Process Creation/Destruction
A process carries with it all the information required to define its place in the operating system. In addition to program counters and registers, a process includes a large block of memory (its address space), system buffers, and everything else required to define its state to the operating system. Consequently, creating and destroying processes is expensive and not done very often.
MPI: process creation/destruction
In older message-passing APIs such as PVM [Sun90], the capability to create new processes was embedded in the API. A programmer could issue a command called PVM_spawn to create a new process:
PVM_spawn(node, program-executable)
This capability of PVM allowed programmers to control which executables ran on which nodes from inside a program. In MPI 1.1, however, this capability was not provided to the programmer and was left to the runtime environment. This decision might seem like a step backwards, but it was done for two reasons. One was the observation that the vast majority of PVM programs were based on the SPMD pattern, so it made sense to build this pattern into MPI. Second, it allowed the standard to be implemented on a wider range of parallel architectures. Many of the MPP computers available at the time the MPI Forum defined MPI could not easily handle a spawn statement.
For an example of process creation in MPI, consider an MPI program with an executable named foo. The programmer launches the job on multiple processors (four in this example) with the command:
mpirun -np 4 foo
In response to this command, the system goes to a standard file listing the names of the nodes to use, selects four of them, and launches the same executable on each one.
The processes are destroyed when the programs running on the nodes of the parallel computer exit. To make the termination clean, an MPI program has as its final executable statement:
MPI_Finalize()
The system attempts to clean up any processes left running after the program exits. If the exit is abnormal, such as can happen when an external interrupt occurs, it is possible for orphan child processes to be left behind. This is a major concern in large production environments where many large MPI programs come and go. Lack of proper cleanup can lead to an overly crowded system.
Java: process creation/destruction
Usually, one instance of a Java runtime (implementing the Java Virtual Machine, or JVM, specification) corresponds to a process, which will then contain all the threads created by the Java programs it supports. Limited facilities for creating new processes, and communicating and synchronizing with them, are found in the java.lang.Process and java.lang.Runtime classes in the standard API. These processes, however, are typically used to invoke non-Java programs from within a Java program, not for parallelism. Indeed, the specification does not even require that a new child process execute concurrently with the parent.
Java can also be used on distributed-memory machines; in this case, one starts a JVM instance on each machine using facilities of the operating system.
OpenMP: process creation/destruction
OpenMP is an API created to support multithreaded programming. These threads share a single process. There is no capability within OpenMP to create or destroy processes.
The extension of OpenMP to distributed-memory computers, and hence to a multiprocess model, is an active area of research [SLGZ99, BB99, Omn]. These systems usually adopt the technique used by MPI and leave process creation and destruction to the runtime environment.
6.3. SYNCHRONIZATION
Synchronization is used to enforce a constraint on the order of events occurring in different UEs. There is a vast body of literature on synchronization [And00], and it can be complicated. Most programmers, however, use only a few synchronization methods on a regular basis.
6.3.1. Memory Synchronization and Fences
In a simple, classical model of a shared-memory multiprocessor, each UE executes a sequence of instructions that can read or write atomically from the shared memory. We can think of the computation as a sequence of atomic events, with the events from different UEs interleaved. Thus, if UE A writes a memory location and then UE B reads it, UE B will see the value written by UE A. Suppose, for example, that UE A does some work and then sets a variable done to true. Meanwhile, UE B executes a loop:
while (!done) {/*do something but don't change done*/}
In the simple model, UE A will eventually set done, and then in the next loop iteration, UE B will read the new value and terminate the loop.
In reality, several things could go wrong. First of all, the value of the variable may not actually be written by UE A or read by UE B. The new value could be held in a cache instead of the main memory, and even in systems with cache coherency, the value could be (as a result of compiler optimizations, say) held in a register and not be made visible to UE B. Similarly, UE B may try to read the variable and obtain a stale value, or, due to compiler optimizations, not even read the value more than once because it isn't changed in the loop. In general, many factors (properties of the memory system, the compiler, instruction reordering, and so on) can conspire to leave the contents of the memories (as seen by each UE) poorly defined.
A memory fence is a synchronization event that guarantees that the UEs will see a consistent view of memory. Writes performed before the fence will be visible to reads performed after the fence, as would be expected in the classical model, and all reads performed after the fence will obtain a value written no earlier than the latest write before the fence.
Clearly, memory synchronization is only an issue when there is shared context between the UEs. Hence, this is not generally an issue when the UEs are processes running in a distributed-memory environment. For threads, however, putting memory fences in the right locations can make the difference between a working program and a program riddled with race conditions.
Explicit management of memory fences is cumbersome and error-prone. Fortunately, most programmers, although needing to be aware of the issue, only rarely need to deal with fences explicitly because, as we will see in the next few sections, the memory fence is usually implied by higher-level synchronization constructs.
OpenMP: fences
In OpenMP, a memory fence is defined with the flush statement:
#pragma omp flush
This statement affects every variable visible to the calling UE, causing them to be updated within the computer's memory. This is an expensive operation because guaranteeing consistency requires some of the cache lines and all system buffers and registers to be written to memory. A lower-cost version of flush is provided where the programmer lists the variables to be flushed:
#pragma omp flush (flag)
OpenMP programmers only rarely use the flush construct because OpenMP's high-level synchronization constructs imply a flush where needed. When custom synchronization constructs are created, however, flush can be critical. A good example is pairwise synchronization, where the synchronization occurs between specific pairs of threads rather than among the full team. Because pairwise synchronization is not directly supported by the OpenMP API[2], when faced with an algorithm that demands it, programmers must create the pairwise synchronization construct on their own. The code in Fig. 6.2 shows how to safely implement pairwise synchronization in OpenMP using the flush construct.
[2] If a program uses synchronization among the full team, the synchronization will work independently of the size of the team, even if the team size is one. On the other hand, a program with pairwise synchronization will deadlock if run with a single thread. An OpenMP design goal was to encourage code that is equivalent whether run with one thread or many, a property called sequential equivalence. Thus, high-level constructs that are not sequentially equivalent, such as pairwise synchronization, were left out of the API.
In this program, each thread has two blocks of work to carry out concurrently with the other threads in the team. The work is represented by two functions: do_a_whole_bunch() and do_more_stuff(). The contents of these functions are irrelevant (and hence are not shown) for this example. All that matters is that for this example we assume that a thread cannot safely begin work on the second function do_more_stuff() until its neighbor has finished with the first function do_a_whole_bunch().
TheprogramusestheSPMDpattern.Thethreadscommunicatetheirstatusforthesakeofthe
pairwisesynchronizationbysettingtheirvalue(indexedbythethreadID)oftheflagarray.Because
thisarraymustbevisibletoallofthethreads,itneedstobeasharedarray.Wecreatethiswithin
OpenMPbydeclaringthearrayinthesequentialregion(thatis,priortocreatingtheteamofthreads).
Wecreatetheteamofthreadswithaparallelpragma:
#pragma omp parallel shared(flag)
When the work is done, the thread sets its flag to 1 to notify any interested threads that the work is
done. This must be flushed to memory to ensure that other threads can see the updated value:
#pragma omp flush (flag)
#include <omp.h>
#include <stdio.h>
#define MAX 10 // max number of threads
// Functions used in this program: the details of these
// functions are not provided
The thread then waits until its neighbor is finished with do_a_whole_bunch() before moving on
to finish its work with a call to do_more_stuff().
while (!flag[neighbor(ID)]){ // wait to see if neighbor is done.
#pragma omp flush(flag) // required to see any changes to flag
}
The flush operation in OpenMP only affects the thread-visible variables for the calling thread. If
another thread writes a shared variable and forces it to be available to the other threads by using a
flush, the thread reading the variable still needs to execute a flush to make sure it picks up the new
value. Hence, the body of the while loop must include a flush (flag) construct to make sure
the thread sees any new values in the flag array.
As we mentioned earlier, knowing when a flush is needed and when it is not can be challenging. In
most cases, the flush is built into the synchronization construct. But when the standard constructs
aren't adequate and custom synchronization is required, placing memory fences in the right locations
is essential.
Java: fences
Java does not provide an explicit flush construct as in OpenMP. In fact, the Java memory model[3] is
not defined in terms of flush operations, but in terms of constraints on visibility and ordering with
respect to locking operations. The details are complicated, but the general idea is not: Suppose thread
1 performs some operations while holding lock L and then releases the lock, and then thread 2
acquires lock L. The rule is that all writes that occurred in thread 1 before thread 1 released the lock
are visible in thread 2 after it acquires the lock. Further, when a thread is started by invoking its
start method, the started thread sees all writes visible to the caller at the call point. Similarly, when
a thread calls join, the caller will see all writes performed by the terminating thread.
[3]
Java is one of the first languages where an attempt was made to specify its memory
model precisely. The original specification has been criticized for imprecision as well as
for not supporting certain synchronization idioms while at the same time disallowing
some reasonable compiler optimizations. A new specification for Java 2 1.5 is described
in [JSRa]. In this book, we assume the new specification.
Java allows variables to be declared as volatile. When a variable is marked volatile, all writes to
the variable are guaranteed to be immediately visible, and all reads are guaranteed to obtain the last
value written. Thus, the compiler takes care of memory synchronization issues for volatiles.[4] In a
Java version of a program containing the fragment (where done is expected to be set by another
thread)
[4]
From the point of view of the rule stated previously, reading a volatile variable is
defined to have the same effect with regard to memory synchronization as acquiring a
lock associated with the variable, whereas writing has the same effect as releasing a lock.
while (!done) {.../*do something but don't change done*/}
we would mark done to be volatile when the variable is declared, as follows, and then ignore
memory synchronization issues related to this variable in the rest of the program.
volatile boolean done = false;
Because the volatile keyword can be applied only to a reference to an array and not to the
individual elements, the java.util.concurrent.atomic package introduced in Java 2 1.5
adds a notion of atomic arrays where the individual elements are accessed with volatile semantics. For
example, in a Java version of the OpenMP example shown in Fig. 6.2, we would declare the flag
array to be of type AtomicIntegerArray and update and read it with that class's set and get
methods.
Another technique for ensuring proper memory synchronization is synchronized blocks. A
synchronized block appears as follows:
synchronized(some_object){/*do something with shared variables*/}
We will describe synchronized blocks in more detail in Sec. 6.3.3 and the Java appendix, Appendix C.
For the time being, it is sufficient to know that some_object is implicitly associated with a lock
and the compiler will generate code to acquire this lock before executing the body of the synchronized
block and to release the lock on exit from the block. This means that one can guarantee proper
memory synchronization of access to a variable by ensuring that all accesses to the variable occur in
synchronized blocks associated with the same object.
MPI: fences
A fence only arises in environments that include shared memory. In MPI specifications prior to MPI
2.0, the API did not expose shared memory to the programmer, and hence there was no need for a
user-callable fence. MPI 2.0, however, includes one-sided communication constructs. These constructs
create "windows" of memory visible to other processes in an MPI program. Data can be pushed to or
pulled from these windows by a single process without the explicit cooperation of the process owning
the memory region in question. These memory windows require some type of fence, but are not
discussed here because implementations of MPI 2.0 were not widely available at the time this was
written.
6.3.2. Barriers
A barrier is a synchronization point at which every member of a collection of UEs must arrive before
any members can proceed. If a UE arrives early, it will wait until all of the other UEs have arrived.
A barrier is one of the most common high-level synchronization constructs. It has relevance both in
process-oriented environments such as MPI and thread-based systems such as OpenMP and Java.
MPI: barriers
In MPI, a barrier is invoked by calling the function
MPI_Barrier(MPI_COMM)
where MPI_COMM is a communicator defining the process group and the communication context. All
processes in the group associated with the communicator participate in the barrier. Although it might
not be apparent to the programmer, the barrier itself is almost always implemented with a cascade of
pairwise messages using the same techniques as used in a reduction (see the discussion of reduction in
Sec. 6.4.2).
Figure 6.3. MPI program containing a barrier. This program is used to time the
execution of function runit().
#include <mpi.h> // MPI include file
#include <stdio.h>
extern void runit();
int main(int argc, char **argv) {
   double t0, t1;
   MPI_Init(&argc, &argv);
   MPI_Barrier(MPI_COMM_WORLD); // all processes enter the timed section together
   t0 = MPI_Wtime();
   runit();                     // the function being timed
   t1 = MPI_Wtime();
   printf("runit() took %f seconds\n", t1 - t0);
   MPI_Finalize();
   return 0;
}
As an example of a barrier, consider the program in Fig. 6.3. This simple program sets up the MPI
environment and then records the execution time of a function called runit() (the code of which
is not shown). The time itself is found using the MPI timing routine:
MPI_Wtime()
This function returns a double-precision value holding the elapsed time in seconds since some point in
the past. The difference between the value returned after the function call and the value returned
before the function call gives the elapsed time for the function's execution. This is wall-clock time,
that is, the time that would elapse on a clock external to the computer.
There can be considerable variation in process startup or the initialization of MPI. Thus, for the time
to be consistent across all the processes, it is important that all processes enter the timed section of
code together. To address this issue, we place a barrier before the timed section of code.
OpenMP: barriers
In OpenMP, a simple pragma sets the barrier:
#pragma omp barrier
All threads in the team participate in the barrier. As an example of a barrier, consider the code in Fig.
6.4. This program is essentially the same as the MPI program. A barrier is used to ensure that all
threads complete any startup activities before the timing measurements are taken. The timing routine,
omp_get_wtime(), was modeled after the analogous MPI routine, MPI_Wtime(), and is defined in
the same way.
Figure 6.4. OpenMP program containing a barrier. This program is used to time the
execution of function runit().
#include <omp.h>
#include <stdio.h>
extern void runit();
int main(int argc, char **argv) {
double time_init, time_final, time_elapsed;
#pragma omp parallel private(time_init, time_final, time_elapsed)
{
int ID;
ID = omp_get_thread_num();
//
// ensure that all threads are set up and ready to go before timing runit()
//
#pragma omp barrier
time_init = omp_get_wtime();
runit(); // a function that we wish to time on each thread
time_final = omp_get_wtime();
time_elapsed = time_final - time_init;
printf(" I am %d and my computation took %f seconds\n",
ID, time_elapsed);
}
return 0;
}
In addition to an explicit barrier as shown in Fig. 6.4, OpenMP automatically inserts barriers at the
end of the worksharing constructs (for, single, sections, etc.). This implicit barrier can be
disabled, if desired, by using the nowait clause.
The barrier implies a call to flush, so the OpenMP barrier creates a memory fence as well. These
memory flushes, combined with any cycles wasted while UEs wait at the barrier, make it a potentially
expensive construct. Barriers, which are expensive in any programming environment, must be used
where required to ensure the correct program semantics, but for performance reasons should be used
no more than absolutely required.
Java: barriers
Java did not originally include a barrier primitive, although it is not difficult to create one using the
facilities in the language, as was done in the public-domain util.concurrent package [Lea]. In
Java 2 1.5, similar classes are provided in the java.util.concurrent package.
A CyclicBarrier is similar to the barrier described previously. The CyclicBarrier class
contains two constructors: one that requires the number of threads that will synchronize on the barrier,
and another that takes the number of threads along with a Runnable object whose run method will
be executed by the last thread to arrive at the barrier. When a thread arrives at the barrier, it invokes
the barrier's await method. If a thread "breaks" the barrier by terminating prematurely with an
exception, the other threads will throw a BrokenBarrierException. A CyclicBarrier
automatically resets itself when passed and can be used multiple times (in a loop, for example).
In Fig. 6.5, we provide a Java version of the barrier examples given previously using a
CyclicBarrier.
The java.util.concurrent package also provides a related synchronization primitive,
CountDownLatch. A CountDownLatch is initialized to a particular value N. Each invocation of
its countDown method decreases the count. A thread executing the await method blocks until the
value of the latch reaches 0. The separation of countDown (analogous to "arriving at the barrier")
and await (waiting for the other threads) allows more general situations than "all threads wait for all
other threads to reach a barrier." For example, a single thread could wait for N events to happen, or N
threads could wait for a single event to happen. A CountDownLatch cannot be reset and can only
be used once. An example using a CountDownLatch is given in the Java appendix, Appendix C.
6.3.3. Mutual Exclusion
When memory or other system resources (for example, a file system) are shared, the program must
ensure that multiple UEs do not interfere with each other. For example, if two threads try to update a
shared data structure at the same time, a race condition results that can leave the structure in an
inconsistent state. A critical section is a sequence of statements that conflict with a sequence of
statements that may be executed by other UEs. Two sequences of statements conflict if both access the
same data and at least one of them modifies the data. To protect the resources accessed inside the
critical section, the programmer must use some mechanism that ensures that only one thread at a time
will execute the code within the critical section. This is called mutual exclusion.
Figure 6.5. Java program containing a CyclicBarrier. This program is used to time the
execution of function runit().
import java.util.concurrent.*;
public class TimerExample implements Runnable {
static int N;
When using mutual exclusion, it is easy to fall into a situation where one thread is making progress
while one or more threads are blocked waiting for their turn to enter the critical section. This can be a
serious source of inefficiency in a parallel program, so great care must be taken when using mutual
exclusion constructs. It is important to minimize the amount of code that is protected by mutual
exclusion and, if possible, stagger the arrival at the mutual exclusion construct by the members of a
team of threads so a minimum number of threads are waiting to execute the protected code. The
Shared Data pattern discusses the issue of what should be protected by mutual exclusion. Here we
focus on the implementation mechanisms for protecting a critical section.
Figure 6.6. Example of an OpenMP program that includes a critical section
#include <omp.h>
#include <stdio.h>
#define N 1000
extern double big_computation(int, int);
extern void consume_results(int, double, double *);
int main() {
double global_result[N];
#pragma omp parallel shared (global_result)
{
double local_result;
int i;
int ID = omp_get_thread_num(); // set a thread ID
#pragma omp for
for(i=0;i<N;i++){
local_result = big_computation(ID, i); // carry out the UE's work
#pragma omp critical
{
consume_results(ID, local_result, global_result);
}
}
}
return 0;
}
In OpenMP, mutual exclusion is most easily accomplished using a construct called a critical section. An
example of using the critical section construct for OpenMP is shown in Fig. 6.6.
In this program, a team of threads is created to cooperatively carry out a series of N calls to
big_computation(). The pragma
#pragma omp for
is the OpenMP construct that tells the compiler to distribute the iterations of the loop among a team of
threads. After big_computation() is complete, the results need to be combined into the global
data structure that will hold the result.
While we don't show the code, assume the update within consume_results() can be done in any
order, but the update by one thread must complete before another thread can execute an update. The
critical pragma accomplishes this for us. The first thread to finish its big_computation()
enters the enclosed block of code and calls consume_results(). If a thread arrives at the top of
the critical section block while another thread is processing the block, it waits until the prior thread is
finished.
The critical section is an expensive synchronization operation. Upon entry to a critical section, a
thread flushes all visible variables to ensure that a consistent view of the memory is seen inside the
critical section. At the end of the critical section, we need any memory updates occurring within the
critical section to be visible to the other threads in the team, so a second flush of all thread-visible
variables is required.
The critical section construct is not only expensive, it is not very general. It cannot be used among
subsets of threads within a team or to provide mutual exclusion between different blocks of code.
Thus, the OpenMP API provides a lower-level and more flexible construct for mutual exclusion called
a lock.
Locks in different shared-memory APIs tend to be similar. The programmer declares the lock and
initializes it. Only one thread at a time is allowed to hold the lock. Other threads trying to acquire the
lock will block. Blocking while waiting for a lock is inefficient, so many lock APIs allow threads to
test a lock's availability without blocking on it. Thus, a thread can opt to do useful work and come
back to attempt to acquire the lock later.
Consider the use of locks in OpenMP. The example in Fig. 6.7 shows use of a simple lock to make
sure only one thread at a time attempts to write to standard output.
The program first declares the lock to be of type omp_lock_t. This is an opaque object, meaning
that as long as the programmer only manipulates lock objects through the OpenMP runtime library,
the programmer can safely work with the locks without ever considering the details of the lock type.
The lock is then initialized with a call to omp_init_lock.
Figure 6.7. Example of using locks in OpenMP
#include <omp.h>
#include <stdio.h>
int main() {
omp_lock_t lock; // declare the lock using the lock
// type defined in omp.h
omp_set_num_threads(5);
omp_init_lock (&lock); // initialize the lock
#pragma omp parallel shared (lock)
{
int id = omp_get_thread_num();
omp_set_lock (&lock);
printf("\n only thread %d can do this print\n",id);
omp_unset_lock (&lock);
}
}
To be of any use for managing concurrency, a lock must be shared between the individual members of
the team of threads. Thus, the lock is defined prior to the parallel pragma and declared as a
shared variable (which is the default, but we use the shared clause in this example just to emphasize
the point). Inside the parallel region, a thread can set the lock, which causes any other threads
attempting to set the same lock to block and wait until the lock has been unset.
Unlike the OpenMP critical section, an OpenMP lock does not define a memory fence. If the
operations carried out by the thread holding the lock depend on any values from other threads, the
program might fail because the memory might not be consistent. Managing memory consistency can
be difficult for many programmers, so most OpenMP programmers opt to avoid locks and use the
much safer critical sections.
Java: mutual exclusion
The Java language provides support for mutual exclusion with the synchronized block construct and
also, in Java 2 1.5, with new lock classes contained in the package
java.util.concurrent.locks. Every object in a Java program implicitly contains its own
lock. Each synchronized block has an associated object, and a thread must acquire the lock on that
object before executing the body of the block. When the thread exits the body of the synchronized
block, whether normally or abnormally by throwing an exception, the lock is released.
In Fig. 6.8, we provide a Java version of the example given previously. The work done by the threads
is specified by the run method in the nested Worker class. Note that because N is declared to be
final (and is thus immutable), it can safely be accessed by any thread without requiring any
synchronization.
For the synchronized blocks to exclude each other, they must be associated with the same object. In
the example, the synchronized block in the run method uses this.getClass() as an argument.
This expression returns a reference to the runtime object representing the Worker class. This is a
convenient way to ensure that all callers use the same object. We could also have introduced a global
instance of java.lang.Object (or an instance of any other class) and used that as the argument
to the synchronized block. What is important is that all the threads synchronize on the same object. A
potential mistake would be to use, say, this, which would not enforce the desired mutual exclusion
because each worker thread would be synchronizing on itself and thus locking a different lock.
Because this is a very common misunderstanding and source of errors in multithreaded Java
programs, we emphasize again that a synchronized block only protects a critical section from access
by other threads whose conflicting statements are also enclosed in a synchronized block with the same
object as an argument. Synchronized blocks associated with different objects do not exclude each
other. (They also do not guarantee memory synchronization.) Also, the presence of a synchronized
block in a method does not constrain code that is not in a synchronized block. Thus, forgetting a
needed synchronized block or making a mistake with the argument to the synchronized block can
have serious consequences.
The code in Fig. 6.8 is structured similarly to the OpenMP example. A more common approach used
in Java programs is to encapsulate a shared data structure in a class and provide access only through
synchronized methods. A synchronized method is just a special case of a synchronized block that
includes an entire method.
Figure 6.8. Java version of the OpenMP program in Fig. 6.6
public class Example {
static final int N = 10;
It is implicitly synchronized on this for normal methods and the class object for static methods. This
approach moves the responsibility for synchronization from the threads accessing the shared data
structure to the data structure itself. Often this is a better-engineered approach. In Fig. 6.9, the
previous example is rewritten to use this approach. Now the global_result variable is
encapsulated in the Example2 class and marked private to help enforce this. (To make the point, the
Worker class is no longer a nested class.) The only way for the workers to access the
global_result array is through the consume_results method, which is now a synchronized
method in the Example2 class. Thus, the responsibility for synchronization has been moved from
the class defining the worker threads to the class owning the global_result array.
Figure 6.9. Java program showing how to implement mutual exclusion with a
synchronized method
public class Example2 {
static final int N = 10;
private static double[] global_result = new double[N];
public static void main(String[] args) throws InterruptedException
{ //create and start N threads
Thread[] t = new Thread[N];
for (int i = 0; i != N; i++)
{t[i] = new Thread(new Worker(i)); t[i].start(); }
//wait for all threads to terminate
for (int i = 0; i != N; i++){t[i].join();}
//print results
for (int i = 0; i!=N; i++)
{System.out.print(global_result[i] + " ");}
System.out.println("done");
}
//define computation
double big_computation(int ID, int i){ ... }
The synchronized block construct in Java has some deficiencies. Probably the most important for
parallel programmers is the lack of a way to find out whether a lock is available before attempting to
acquire it. There is also no way to interrupt a thread waiting on a synchronized block, and the
synchronized block construct forces the locks to be acquired and released in a nested fashion. This
disallows certain kinds of programming idioms in which a lock is acquired in one block and released
in another.
As a result of these deficiencies, many programmers have created their own lock classes instead of
using the built-in synchronized blocks. For examples, see [Lea]. In response to this situation, in Java 2
1.5, package java.util.concurrent.locks provides several lock classes that can be used as
an alternative to synchronized blocks. These are discussed in the Java appendix, Appendix C.
MPI: mutual exclusion
As is the case with most of the synchronization constructs, mutual exclusion is only needed when the
statements execute within a shared context. Hence, a shared-nothing API such as MPI does not
provide support for critical sections directly within the standard. Consider the OpenMP program in
Fig. 6.6. If we want to implement a similar method in MPI with a complex data structure that has to
be updated by one UE at a time, the typical approach is to dedicate a process to this update. The other
processes would then send their contributions to the dedicated process. We show this situation in Fig.
6.10.
This program uses the SPMD pattern. As with the OpenMP program in Fig. 6.6, we have a loop to
carry out N calls to big_computation(), the results of which are consumed and placed in a
single global data structure. Updates to this data structure must be protected so that results from only
one UE at a time are applied.
We arbitrarily choose the UE with the highest rank to manage the critical section. This process then
executes a loop and posts N receives. By using the MPI_ANY_SOURCE and MPI_ANY_TAG values
in the MPI_Recv() statement, the messages holding results from the calls to
big_computation() are taken in any order. If the tag or ID are required, they can be recovered
from the status variable returned from MPI_Recv().
The other UEs carry out the N calls to big_computation(). Because one UE has been dedicated
to managing the critical section, the effective number of processes in the computation is decreased by
one. We use a cyclic distribution of the loop iterations as was described in the Examples section of the
SPMD pattern. This assigns the loop iterations in a round-robin fashion. After a UE completes its
computation, the result is sent to the process managing the critical section.[5]
[5]
We used a synchronous send (MPI_Ssend()), which does not return until a matching
MPI receive has been posted, to duplicate the behavior of shared-memory mutual
exclusion as closely as possible. It is worth noting, however, that in a distributed-memory
environment, making the sending process wait until the message has been received is only
rarely needed and adds additional parallel overhead. Usually, MPI programmers go to
great lengths to avoid parallel overheads and would only use synchronous message
passing as a last resort. In this example, the standard-mode message-passing functions,
MPI_Send() and MPI_Recv(), would be a better choice unless either (1) a condition
external to the communication requires the two processes to satisfy an ordering constraint,
hence forcing them to synchronize with each other, or (2) communication buffers or
another system resource limit the capacity of the computer receiving the messages,
thereby forcing the processes on the sending side to wait until the receiving side is ready.
Figure 6.10. Example of an MPI program with an update that requires mutual
exclusion. A single process is dedicated to the update of this data structure.
#include <mpi.h> // MPI include file
#include <stdio.h>
#define N 1000
extern void consume_results(int, double, double * );
extern double big_computation(int, int);
int main(int argc, char **argv) {
int Tag1 = 1; int Tag2 = 2; // message tags
int num_procs; // number of processes in group
int ID; // unique identifier from 0 to (num_procs-1)
double local_result, global_result[N];
int i, ID_CRIT;
MPI_Status stat; // MPI status parameter
// Initialize MPI and set up the SPMD program
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &ID);
MPI_Comm_size (MPI_COMM_WORLD, &num_procs);
// Need at least two processes for this method to work
if(num_procs < 2) MPI_Abort(MPI_COMM_WORLD,-1);
// Dedicate the last process to managing update of final result
ID_CRIT = num_procs-1;
if (ID == ID_CRIT) {
int ID_sender; // variable to hold ID of sender
for(i=0;i<N;i++){
MPI_Recv(&local_result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
ID_sender = stat.MPI_SOURCE;
consume_results(ID_sender, local_result, global_result);
}
}
else {
num_procs--;
for(i=ID;i<N;i+=num_procs){ // cyclic distribution of loop iterations
local_result = big_computation(ID, i); // carry out UE's work
// Send local result using a synchronous Send - a send that doesn't
// return until a matching receive has been posted.
MPI_Ssend (&local_result, 1, MPI_DOUBLE, ID_CRIT, ID,
MPI_COMM_WORLD);
}
}
MPI_Finalize();
return 0;
}
6.4. COMMUNICATION
In most parallel algorithms, UEs need to exchange information as the computation proceeds. Shared-memory
environments provide this capability by default, and the challenge in these systems is to
synchronize access to shared memory so that the results are correct regardless of how the UEs are
scheduled. In distributed-memory systems, however, it is the other way around: Because there are few,
if any, shared resources, the need for explicit synchronization to protect these resources is rare.
Communication, however, becomes a major focus of the programmer's effort.
6.4.1. Message Passing
A message is the most basic communication element. A message consists of a header containing data
about the message (for example, source, destination, and a tag) and a series of bits to be
communicated. Message passing is typically two-sided, meaning that a message is explicitly sent
between a pair of UEs, from a specific source to a specific destination. In addition to direct message
passing, there are communication events that involve multiple UEs (usually all of them) in a single
communication event. We call this collective communication. These collective communication events
often include computation as well, the classic example being a global summation where a set of
values distributed about the system is summed into a single value on each node.
In the following sections, we will explore communication mechanisms in more detail. We start with
basic message passing in MPI, OpenMP, and Java. Then, we consider collective communication,
looking in particular at the reduction operation in MPI and OpenMP. We close with a brief look at
other approaches to managing communication in parallel programs.
MPI: message passing
Message passing between a pair of UEs is the most basic of the communication operations and
provides a natural starting point for our discussion. A message is sent by one UE and received by
another. In its most basic form, the send and the receive operations are paired.
As an example, consider the MPI program in Figs. 6.11 and 6.12. In this program, a ring of processors
work together to iteratively compute the elements within a field (field in the program). To keep the
problem simple, we assume the dependencies in the update operation are such that each UE only
needs information from its neighbor to the left to update the field.
We show only the parts of the program relevant to the communication. The details of the actual
update operation, initialization of the field, or even the structure of the field and boundary data are
omitted.
The program in Figs. 6.11 and 6.12 declares its variables and then initializes the MPI environment. This
is an instance of the SPMD pattern where the logic within the parallel algorithm will be driven by the
process rank ID and the number of processes in the team.
After initializing field, we set up the communication pattern by computing two variables, left
and right, that identify which processes are located to the left and the right. In this example, the
computation is straightforward and implements a ring communication pattern. In more complex
programs, however, these index computations can be complex, obscure, and the source of many errors.
The core loop of the program executes a number of steps. At each step, the boundary data is collected
and communicated to its neighbors by shifting around a ring, and then the local block of the field
controlled by the process is updated.
Figure 6.11. MPI program that uses a ring of processors and a communication pattern
where information is shifted to the right. The functions to do the computation do not
affect the communication itself so they are not shown. (Continued in Fig. 6.12.)
#include <stdio.h>
#include "mpi.h" // MPI include file
#define IS_ODD(x) ((x)%2) // test for an odd int
// prototypes for functions to initialize the problem, extract
// the boundary region to share, and perform the field update.
// The contents of these functions are not provided.
extern void init (int, int, double *, double *, double *);
extern void extract_boundary (int, int, int, int, double *);
extern void update (int, int, int, int, double *, double *);
extern void output_results (int, double *);
int main(int argc, char **argv) {
// variable declarations (details of the problem setup are omitted)
int Nsize, Bsize, Nsteps; // problem size, boundary size, number of steps
int ID, nprocs, left, right, i;
int Tag1 = 1; // message tag
double *field, *boundary, *incoming;
MPI_Status stat; // MPI status parameter
//
// Initialize MPI and set up the SPMD program
//
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &ID);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
init (Nsize, Bsize, field, boundary, incoming);
/* continued in next figure */
Notice that we had to explicitly switch the communication between odd and even processes. This
ensures that the matching communication events are ordered consistently on the two processes
involved. This is important because on systems with limited buffer space, a send may not be able to
return until the relevant receive has been posted.
OpenMP: message passing
Message passing can be an effective strategy in OpenMP. One use of message passing in OpenMP is
to emulate an MPI algorithm, for example when porting a working MPI program to a system where
only OpenMP is available. Another case is when using NUMA machines where data locality is
important. By using the SPMD pattern and message passing, the programmer can more precisely
control how the program data is aligned with the system's memory hierarchy, thereby resulting in
better performance.
Figure 6.12. MPI program that uses a ring of processors and a communication pattern
where information is shifted to the right (continued from Fig. 6.11)
/* continued from previous figure */
// assume a ring of processors and a communication pattern
// where boundaries are shifted to the right.
left = (ID+1); if(left>(nprocs-1)) left = 0;
right = (ID-1); if(right<0)right = nprocs-1;
for(i = 0; i < Nsteps; i++){
extract_boundary(Nsize, Bsize, ID, nprocs, boundary);
if(IS_ODD(ID)){
MPI_Send (boundary, Bsize, MPI_DOUBLE, right, Tag1,
MPI_COMM_WORLD);
MPI_Recv (incoming, Bsize, MPI_DOUBLE, left, Tag1,
MPI_COMM_WORLD, &stat);
}
else {
MPI_Recv (incoming, Bsize, MPI_DOUBLE, left, Tag1,
MPI_COMM_WORLD, &stat);
MPI_Send (boundary, Bsize, MPI_DOUBLE, right, Tag1,
MPI_COMM_WORLD);
}
update(Nsize, Bsize, ID, nprocs, field, incoming);
}
output_results(Nsize, field);
MPI_Finalize();
return 0;
}
A particularly easy way to use message passing within OpenMP is shown in Figs. 6.13 and 6.14. This
program is basically the same as the MPI program in Figs. 6.11 and 6.12. As before, we hide details of
the problem not relevant to a discussion of the parallel algorithm.
Message passing is accomplished in OpenMP by the receiving process reading the message from a
shared data structure. In MPI, the synchronization required by the problem is implied by the message-passing
functions. In OpenMP, the synchronization is explicit. We start with a barrier to make sure
all threads are at the same point (that is, ready to fill the boundary they will share). A second
barrier ensures that all the threads have finished filling their boundary before the receiving side
uses the data.
This strategy is extremely effective for problems that involve the same amount of work on each thread.
When some threads are faster than other threads, because either the hardware is less loaded or perhaps
the data is not evenly distributed, the use of barriers leads to excessive parallel overhead as threads
wait at the barrier. The solution is to more finely tune the synchronization to the needs of the problem,
which, in this case, would indicate that pairwise synchronization should be used.
The "message-passing" OpenMP program in Fig. 6.15 shows a simple way to introduce pairwise
synchronization in OpenMP. This code is much more complicated than the OpenMP program in Figs.
6.13 and 6.14. Basically, we have added an array called done that a thread uses to indicate that its
buffer (that is, the boundary data) is ready to use. A thread fills its buffer, sets the flag, and then
flushes it to make sure other threads can see the updated value. The thread then checks the flag for its
neighbor and waits (using a so-called spin lock) until the buffer from its neighbor is ready.
Figure 6.13. OpenMP program that uses a ring of threads and a communication
pattern where information is shifted to the right (continued in Fig. 6.14)
#include <stdio.h>
#include <omp.h> // OpenMP include file
#define MAX 10 // maximum number of threads
//
// prototypes for functions to initialize the problem,
// extract the boundary region to share, and perform the
// field update. Note: the initialize routine is different
// here in that it sets up a large shared array (that is, for
// the full problem), not just a local block.
//
extern void init (int, int, double *, double *, double *);
extern void extract_boundary (int, int, int, int, double *);
extern void update (int, int, int, int, double *, double *);
extern void output_results (int, double *);
int main(int argc, char **argv) {
int Nsize; // Problem size (order of field matrix)
int Bsize; // Number of doubles in the boundary
double *field;
double *boundary[MAX]; // array of pointers to a buffer to hold
// boundary data
//
// Create Team of Threads
//
init (Nsize, Bsize, field, boundary, incoming);
/* continued in next figure */
This code is significantly more complex, but it can be more efficient on two counts. First, barriers
cause all threads to wait for the full team. If any one thread is delayed for whatever reason, it slows
down the entire team. This can be disastrous for performance, especially if the variability between
threads' workloads is high. Second, we've replaced two barriers with one barrier and a series of
flushes. A flush is expensive, but notice that each of these flushes only flushes a single small array
(done). It is likely that multiple calls to a flush with a single array will be much faster than the single
flush of all thread-visible data implied by a barrier.
Java: message passing
The Java language definition does not specify message passing (as it does facilities for concurrent
programming with threads). Similar techniques to those discussed for message passing in OpenMP
could be used. However, the standard class libraries provided with the Java distribution provide
extensive support for various types of communication in distributed environments. Rather than
provide examples of those techniques, we will provide an overview of the available facilities.
Figure 6.14. OpenMP program that uses a ring of threads and a communication
pattern where information is shifted to the right (continued from Fig. 6.13)
/* continued from previous figure */
#pragma omp parallel shared(boundary, field, Bsize, Nsize)
//
// Set up the SPMD program. Note: by declaring ID and Num_threads
// inside the parallel region, we make them private to each thread.
//
int ID, nprocs, i, left;
ID = omp_get_thread_num();
nprocs = omp_get_num_threads();
if (nprocs > MAX) {
exit (-1);
}
//
// assume a ring of processors and a communication pattern
// where boundaries are shifted to the right.
//
left = (ID-1); if(left<0) left = nprocs-1;
for(i = 0; i < Nsteps; i++){
#pragma omp barrier
extract_boundary(Nsize, Bsize, ID, nprocs, boundary[ID]);
#pragma omp barrier
update(Nsize, Bsize, ID, nprocs, field, boundary[left]);
}
output_results(Nsize, field);
return 0;
Java provides significant support for client-server distributed computing in heterogeneous
environments. It is very easy to set up a TCP socket connection between two PEs and send and receive
data over the connection. The relevant classes are found in packages java.net and java.io. The
serialization facilities (see the java.io.Serializable interface) support converting a possibly
complex data structure into a sequence of bytes that can be transmitted over a network (or written to a
file) and reconstructed at the destination. The RMI (remote method invocation) packages
java.rmi.* provide a mechanism enabling an object on one JVM to invoke methods on an object
in another JVM, which may be running on a different computer. Serialization is used by the RMI
packages to marshal the arguments to the method and return the result to the caller.
Figure 6.15. The message-passing block from Fig. 6.13 and Fig. 6.14, but with more
careful synchronization management (pairwise synchronization)
int done[MAX]; // an array of flags dimensioned for the
// boundary data
//
// Create Team of Threads
//
init (Nsize, Bsize, field, boundary, incoming);
#pragma omp parallel shared(boundary, field, Bsize, Nsize)
//
// Set up the SPMD program. Note: by declaring ID and Num_threads
// inside the parallel region, we make them private to each thread.
//
int ID, nprocs, i, left;
ID = omp_get_thread_num();
nprocs = omp_get_num_threads();
if (nprocs > MAX) { exit (-1); }
//
// assume a ring of processors and a communication pattern
// where boundaries are shifted to the right.
//
left = (ID-1); if(left<0) left = nprocs-1;
done[ID] = 0; // set flag stating "buffer ready to fill"
for(i = 0; i < Nsteps; i++){
#pragma omp barrier // all visible variables flushed, so we
// don't need to flush "done".
extract_boundary(Nsize, Bsize, ID, nprocs, boundary[ID]);
outside of the usual memory space managed by the JVM and subject to garbage collection. As a
result, direct buffers are not moved by the garbage collector, and references to them can be passed to
the system-level network software, eliminating a copying step by the JVM. Unfortunately, this comes
at the price of more complicated programming, because put and get operations are not especially
convenient to use. Typically, however, one would not use a Buffer for the main computation in the
program, but would use a more convenient data structure such as an array that would be bulk copied
to or from the Buffer.[6]
Many parallel programmers desire higher-level, or at least more familiar, communication abstractions
than those supported by the standard packages in Java. Several groups have implemented MPI-like
bindings for Java using various approaches. One approach uses the JNI (Java Native Interface) to bind
to existing MPI libraries. JavaMPI [Min97] and mpiJava [BCKL98] are examples. Other approaches
use new communication systems written in Java. These are not widely used and are in various states
of availability. Some [JCS98] attempt to provide an MPI-like experience for the programmer
following proposed standards described in [BC00], while others experiment with alternative models
[Man]. An overview is given in [AJMJS02]. These systems typically use the old java.io because
the java.nio package has only recently become available. An exception is the work reported by
Pugh and Sacco [PS04]. Although they have not, at this writing, provided downloadable software,
they describe a package providing a subset of MPI with performance measurements that suggest an
optimistic outlook for the potential of Java with java.nio for high-performance computing in
clusters. One can expect that in the near future more packages built on java.nio that simplify
parallel programming will become available.
6.4.2. Collective Communication
When multiple (more than two) UEs participate in a single communication event, the event is called a
collective communication operation. MPI, as a message-passing library, includes most of the major
collective communication operations.
Broadcast. A mechanism to send a single message to all UEs.
Barrier. A point within a program at which all UEs must arrive before any UEs can continue.
This is described earlier in this chapter with the other synchronization mechanisms, but it is
mentioned again here because in MPI it is implemented as a collective communication.
Reduction. A mechanism to take a collection of objects, one on each UE, and combine them
into a single object on one UE (MPI_Reduce) or combine them such that the combined
value is left on each of the UEs (MPI_Allreduce).
Of these, reduction is the most commonly used. It plays such an important role in parallel algorithms
that it is included within most parallel programming APIs (including shared-memory APIs such as
OpenMP). We will focus on reduction in this discussion because the communication patterns and
techniques used with reduction are the same as those used with any of the global communication
events.
Reduction
A reduction operation reduces a collection of data items to a single data item by repeatedly combining
the data items pairwise with a binary operator, usually one that is associative and commutative.
Examples of reductions include finding the sum, product, or maximum of all elements in an array. In
general, we can represent such an operation as the calculation of
Equation 6.1
v_0 o v_1 o ... o v_{m-1}
where o is a binary operator. The most general way to implement a reduction is to perform the
calculation serially from left to right, an approach that offers no potential for concurrent execution. If
o is associative, however, the calculation in Eq. 6.1 contains exploitable concurrency, in that partial
results over subsets of v_0, ..., v_{m-1} can be calculated concurrently and then combined. If o is also
commutative, the calculation can be not only regrouped, but also reordered, opening up additional
possibilities for concurrent execution, as described later.
Not all reduction operators have these useful properties, however, and one question to be considered is
whether the operator can be treated as if it were associative and/or commutative without significantly
changing the result of the calculation. For example, floating-point addition is not strictly associative
(because the finite precision with which numbers are represented can cause roundoff errors,
especially if the difference in magnitude between operands is large), but if all the data items to be
added have roughly the same magnitude, it is usually close enough to associative to permit the
parallelization strategies discussed next. If the data items vary considerably in magnitude, this may
not be the case.
Most parallel programming environments provide constructs that implement reduction.
MPI provides general-purpose functions MPI_Reduce and MPI_Allreduce as well as
support for several common reduction operations (MPI_MIN, MPI_MAX, and MPI_SUM).
OpenMP includes a reduction clause that can be applied on a parallel section or a
workshare construct. It provides for addition, subtraction, multiplication, and a number of
bitwise and logical operators.
As an example of a reduction, consider the MPI program in Fig. 6.16. This is a continuation of our
earlier example where we used a barrier to enforce consistent timing (Fig. 6.3). The timing is
measured for each UE independently. As an indication of the load imbalance for the program, it is
useful to report the time as a minimum time, a maximum time, and an average time. We do this in Fig.
6.16 with three calls to MPI_Reduce: one with the MPI_MIN operation, one with the MPI_MAX
operation, and one with the MPI_SUM operation.
An example of reduction in OpenMP can be found in Fig. 6.17. This is an OpenMP version of the
program in Fig. 6.16. The variables for the number of threads (num_threads) and the average time
(ave_time) are declared prior to the parallel region so they can be visible both inside and following
the parallel region. We use a reduction clause on the parallel pragma to indicate that we will
be doing a reduction with the + operator on the variable ave_time. The compiler will create a local
copy of the variable and initialize it to the value of the identity of the operator in question (zero for
addition). Each thread sums its values into the local copy. Following the close of the parallel region,
all the local copies are combined with the global copy to produce the final value.
Because we need to compute the average time, we need to know how many threads were used in the
parallel region. The only way to safely determine the number of threads is by a call to the
omp_get_num_threads() function inside the parallel region. Multiple writes to the same variable
can interfere with each other, so we place the call to omp_get_num_threads() inside a single
construct, thereby ensuring that only one thread will assign a value to the shared variable
num_threads. A single in OpenMP implies a barrier, so we do not need to explicitly include one
to make sure all the threads enter the timed code at the same time. Finally, note that we compute only
the average time in Fig. 6.17 because min and max operators are not included in the C/C++ OpenMP
version 2.0.
Figure 6.16. MPI program to time the execution of a function called runit(). We use
MPI_Reduce to find minimum, maximum, and average runtimes.
#include "mpi.h" // MPI include file
#include <stdio.h>
int main(int argc, char**argv) {
int num_procs; // the number of processes in the group
int ID; // a unique identifier ranging from 0 to (num_procs-1)
double local_result;
double time_init, time_final, time_elapsed;
double min_time, max_time, ave_time;
//
// Initialize MPI and set up the SPMD program
//
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &ID);
MPI_Comm_size (MPI_COMM_WORLD, &num_procs);
//
// Ensure that all processes are set up and ready to go before
// timing runit()
//
MPI_Barrier(MPI_COMM_WORLD);
time_init = MPI_Wtime();
runit(); // a function that we wish to time on each process
time_final = MPI_Wtime();
time_elapsed = time_final-time_init;
MPI_Reduce (&time_elapsed, &min_time, 1, MPI_DOUBLE, MPI_MIN, 0,
MPI_COMM_WORLD);
MPI_Reduce (&time_elapsed, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0,
MPI_COMM_WORLD);
MPI_Reduce (&time_elapsed, &ave_time, 1, MPI_DOUBLE, MPI_SUM, 0,
MPI_COMM_WORLD);
if (ID == 0){
ave_time = ave_time /(double)num_procs;
printf(" min, ave and max times (secs): %f, %f, %f\n",
min_time, ave_time,max_time);
}
MPI_Finalize();
return 0;
}
Most programmers will never implement their own reductions because the reduction mechanisms
available with most parallel programming environments meet their needs. Java, however, does not
provide a reduction operator, so Java programmers might need to implement their own parallel
reductions. In any case, the algorithms used for reductions are instructive and well worth the effort to
understand.
Figure 6.17. OpenMP program to time the execution of a function called runit(). We
use a reduction clause to find the sum of the runtimes.
#include <stdio.h>
#include <omp.h> // OpenMP include file
extern void runit();
int main(int argc, char**argv) {
int num_threads; // the number of processes in the group
double ave_time=0.0;
#pragma omp parallel reduction(+ : ave_time) num_threads(10)
{
double local_result;
double time_init, time_final, time_elapsed;
// The single construct causes one thread to set the value of
// the num_threads variable. The other threads wait at the
// barrier implied by the single construct.
#pragma omp single
num_threads = omp_get_num_threads();
time_init = omp_get_wtime();
runit(); // a function that we wish to time on each process
time_final = omp_get_wtime();
time_elapsed = time_final - time_init;
ave_time += time_elapsed; // each thread adds into its private copy
}
// the reduction has combined the private copies into ave_time
ave_time = ave_time/(double)num_threads;
return 0;
}
We consider tree-based reductions in the following discussion. These are the simplest of the scalable
parallel reduction algorithms and although not optimal, they expose most of the issues addressed in
the more optimal reduction algorithms [CKP+93].
Serial computation
If the reduction operator is not associative, or cannot be treated as associative without significantly
affecting the result, it will likely be necessary to perform the entire reduction serially in a single UE,
as sketched in Fig. 6.18. If only one UE needs the result of the reduction, it is probably simplest to
have that UE perform the operation; if all UEs need the result, the reduction operation can be followed
by a broadcast operation to communicate the result to other UEs. For simplicity, the figure shows a
situation in which there are as many UEs as data items. The solution can be extended to situations in
which there are more data items than UEs, but the requirement that all the actual computation be done
in a single UE still holds because of the lack of associativity. (Clearly this solution completely lacks
concurrency; we mention it for the sake of completeness and because it might still be useful and
appropriate if the reduction operation represents only a relatively small part of a computation whose
other parts do have exploitable concurrency.)
Figure 6.18. Serial reduction to compute the sum of a(0) through a(3). sum(a(i:j))
denotes the sum of elements i through j of array a.
In this approach, the individual combine-two-elements operations must be performed in sequence, so
it is simplest to have them all performed by a single task. Because of the data dependencies (indicated
by the arrows in the figure), however, some caution is required if the reduction operation is performed
as part of a larger calculation involving multiple concurrent UEs; the UEs not performing the
reduction can continue with other work only if they can do so without affecting the computation of the
reduction operation (for example, if multiple UEs share access to an array and one of them is
computing the sum of the elements of the array, the others should not be simultaneously modifying
elements of the array). In a message-passing environment, this can usually be accomplished using
message passing to enforce the data dependency constraints (that is, the UEs not performing the
reduction operation send their data to the single UE actually doing the computation). In other
environments, the UEs not performing the reduction operation can be forced to wait by means of a
barrier.
Tree-based reduction
If the reduction operator is associative or can be treated as such, then the tree-based reduction shown
in Fig. 6.19 is appropriate. The algorithm proceeds through a series of stages; at each stage half the
UEs pass their data to another UE. At the beginning, all of the UEs are actively involved in the
reduction, but at each stage, half the UEs drop out, and finally we are left with a single UE holding the
reduced value.
Figure 6.19. Tree-based reduction to compute the sum of a(0) through a(3) on a
system with 4 UEs. sum(a(i:j)) denotes the sum of elements i through j of array a.
For simplicity, the figure shows the case where there are as many UEs as data items. The solution can
be extended to situations in which there are more data items than UEs by first having each UE
perform a serial reduction on a subset of the data items and then combining the results as shown. (The
serial reductions, one per UE, are independent and can be done concurrently.)
In a tree-based reduction algorithm some, but not all, of the combine-two-elements operations can be
performed concurrently (for example, in the figure, we can compute sum(a(0:1)) and
sum(a(2:3)) concurrently, but the computation of sum(a(0:3)) must occur later). A more general
sketch for performing a tree-based reduction using 2^n UEs similarly breaks down into n steps, with
each step involving half as many concurrent operations as the previous step. As with the serial
strategy, caution is required to make sure that the data dependencies shown in the figure are honored.
In a message-passing environment, this can usually be accomplished by appropriate message passing;
in other environments, it could be implemented using barrier synchronization after each of the n steps.
Using the tree-based reduction algorithm is particularly attractive if only one UE needs the result of
the reduction. If other UEs also need the result, the reduction operation can be followed by a broadcast
operation to communicate the result to other UEs. Notice that the broadcast is just the inverse of the
reduction shown in Fig. 6.19; that is, at each stage, a UE passes the value to two UEs, thereby
doubling the number of UEs with the broadcast value.
Recursive doubling
If all of the UEs must know the result of the reduction operation, then the recursive-doubling scheme
of Fig. 6.20 is better than the tree-based approach followed by a broadcast.
Figure 6.20. Recursive-doubling reduction to compute the sum of a(0) through a(3).
sum (a(i:j)) denotes the sum of elements i through j of array a.
As with the tree-based code, if the number of UEs is equal to 2^n, then the algorithm proceeds in n
steps. At the beginning of the algorithm, every UE has some number of values to contribute to the
reduction. These are combined locally to a single value to contribute to the reduction. In the first step,
the even-numbered UEs exchange their partial sums with their odd-numbered neighbors. In the
second stage, instead of immediate neighbors exchanging values, UEs two steps away interact. At the
next stage, UEs four steps away interact, and so forth, doubling the reach of the interaction at each
step until the reduction is complete.
At the end of n steps, each UE has a copy of the reduced value. Comparing this to the previous
strategy of using a tree-based algorithm followed by a broadcast, we see the following: The reduction
and the broadcast take n steps each, and the broadcast cannot begin until the reduction is complete, so
the elapsed time is O(2n), and during these 2n steps many of the UEs are idle. The recursive-doubling
algorithm, however, involves all the UEs at each step and produces the single reduced value at every
UE after only n steps.
Another option is to replace explicit communication with a virtual shared memory, where the term
"virtual" is used because the physical memory could be distributed. An approach that was popular in
the early 1990s was Linda [CG91]. Linda is based on an associative virtual shared memory called a
tuple space. The operations in Linda "put", "take", or "read" a set of values bundled together into an
object called a tuple. Tuples are accessed by matching against a template, making the memory
content-addressable or associative. Linda is generally implemented as a coordination language, that is,
a small set of instructions that extend a normal programming language (the so-called computation
language). Linda is no longer used to any significant extent, but the idea of an associative virtual
shared memory inspired by Linda lives on in JavaSpaces [FHA99].
More recent attempts to hide message passing behind a virtual shared memory are the collection of
languages based on the Partitioned Global Address Space model: UPC [UPC], Titanium [Tita], and
Co-Array Fortran [Co]. These are explicitly parallel dialects of C, Java, and Fortran (respectively)
based on a virtual shared memory. Unlike other shared-memory models, such as Linda or OpenMP,
the shared memory in the Partitioned Global Address Space model is partitioned and includes the
concept of affinity of shared memory to particular processors. UEs can read and write each other's
memory and perform bulk transfers, but the programming model takes into account nonuniform
memory access, allowing the model to be mapped onto a wide range of machines, from SMP to
NUMA to clusters.
Endnotes
6. Pugh and Sacco [PS04] have reported that it is actually more efficient to do bulk copies
between buffers and arrays before a send than to eliminate the array and perform the
calculation updates directly on the buffers using put and get.
The results may differ slightly due to the nonassociativity of floating-point operations.
Figure A.1. Fortran and C programs that print a simple string to standard output
Incremental parallelism refers to a style of parallel programming in which a program evolves from a
sequential program into a parallel program. A programmer starts with a working sequential program
and block by block finds pieces of code that are worthwhile to execute in parallel. Thus, parallelism is
added incrementally. At each phase of the process, there is a working program that can be verified,
greatly increasing the chances that the project will be successful.
It is not always possible to use incremental parallelism or to create sequentially equivalent OpenMP
programs. Sometimes a parallel algorithm requires complete restructuring of the analogous sequential
program. In other cases, the program is constructed from the beginning to be parallel and there is no
sequential program to incrementally parallelize. Also, there are parallel algorithms that do not work
with one thread and hence cannot be sequentially equivalent. Still, incremental parallelism and
sequential equivalence guided the design of the OpenMP API and are recommended practices.
These words are attributed to Galileo Galilei, said to have uttered them in 1633 after
being forced by the Inquisition to recant his claim that the earth orbited the Sun. The
phrase roughly translates as "And nonetheless, it moves."
OpenMP is an explicitly parallel programming language. The compiler doesn't guess how to exploit
concurrency. Any parallelism expressed in a program is there because the programmer directed the
compiler to "put it there". To create threads in OpenMP, the programmer designates blocks of code
that are to run in parallel. This is done in C and C++ with the pragma
#pragma omp parallel
orinFortranwiththedirective
C$OMP PARALLEL
Modern languages such as C and C++ are block structured. Fortran, however, is not. Hence, the
Fortran OpenMP specification defines a directive to close the parallel block:
C$OMP END PARALLEL
This pattern is used with other Fortran constructs in OpenMP; that is, one form opens the structured
block, and a matching form with the word END inserted after the OMP closes the block.
In Fig. A.2, we show parallel programs in which each thread prints a string to the standard output.
When this program executes, the OpenMP runtime system creates a number of threads, each of which
will execute the instructions inside the parallel construct. If the programmer doesn't specify the
number of threads to create, a default number is used. We will later show how to control the default
number of threads, but for the sake of this example, assume it was set to three.
OpenMP requires that I/O be thread safe. Therefore, each output record printed by one thread is
printed completely without interference from other threads. The output from the program in Fig. A.2
would then look like:
E pur si muove
E pur si muove
E pur si muove
In this case, each output record was identical. It is important to note, however, that although each
record prints as a unit, the records can be interleaved in any way, so a programmer cannot depend on
the threads writing their records in a particular order.
Figure A.2. Fortran and C programs that print a simple string to standard output
Figure A.3. Fortran and C programs that print a simple string to standard output
OpenMP is a shared-memory programming model. The details of the memory model will be
discussed later, but a good rule that holds in most cases is this: A variable allocated prior to the
parallel region is shared between the threads. So the program in Fig. A.3 would print:
E pur si muove 5
E pur si muove 5
E pur si muove 5
If a variable is declared inside a parallel region, it is said to be local or private to a thread. In C, a
variable declaration can occur in any block. In Fortran, however, declarations can only occur at the
beginning of a subprogram.
Fig. A.4 shows a C program with a local variable. We have also included a call to a function called
omp_get_thread_num(). This integer function is part of the OpenMP standard runtime library
(described later in this appendix). It returns an integer unique to each thread that ranges from zero to
the number of threads minus one. If we assume the default number of threads is three, then the
following lines (interleaved in any order) will be printed:
c = 0, i = 5
c = 2, i = 5
c = 1, i = 5
Figure A.4. Simple program to show the difference between shared and local (or
private) data
#include <stdio.h>
#include <omp.h>
int main()
{
int i=5; // a shared variable
The first value printed is the private variable, c, where each thread has its own private copy holding a
unique value. The second variable printed, i, is shared, and thus all the threads display the same value
for i.
In each of these examples, the runtime system is allowed to select the number of threads. This is the
most common approach. It is possible to change the operating system's default number of threads to
use with OpenMP applications by setting the OMP_NUM_THREADS environment variable. For
example, on a Linux system with csh as the shell, to use three threads in our program, prior to
running the program one would issue the following command:
setenv OMP_NUM_THREADS 3
The number of threads to be used can also be set inside the program with the num_threads clause.
For example, to create a parallel region with three threads, the programmer would use the pragma
#pragma omp parallel num_threads(3)
In each of these cases, the number of threads should be thought of as a request for a certain number of
threads. The system in certain circumstances may provide fewer threads than the number requested.
Hence, if an algorithm must know the actual number of threads being used, the program must query
the system for this number inside the parallel region. We discuss how to do this later when we
describe OpenMP's runtime library.
where directive-name identifies the construct and the optional clauses[3] modify the construct.
Some examples of OpenMP pragmas in C or C++ follow:
[3]
Throughout this appendix, we will use square brackets to indicate optional syntactic
elements.
#pragma omp parallel private(ii, jj, kk)
#pragma omp barrier
#pragma omp for reduction(+:result)
For Fortran, the situation is more complicated. We will consider only the simplest case, fixed-form[4]
Fortran code. In this case, an OpenMP directive has the following form:
[4]
Fixed form refers to the fixed column conventions for statements in older versions of
Fortran (Fortran 77 and earlier).
sentinel directive-name [clause[[,]clause] ... ]
where sentinel can be one of:
C$OMP
!$OMP
*$OMP
The rules concerning fixed-form source lines apply. Spaces within the constructs are optional, and
continuation is indicated by a character in column six. For example, the following OpenMP
directives are equivalent:
C$OMP PARALLEL DO PRIVATE(I,J)
*$OMP PARALLEL
*$OMP1 DO PRIVATE(I,J)
A.3. WORKSHARING
When using the parallel construct alone, every thread executes the same block of statements. There
are times, however, when we need different code to map onto different threads. This is called
worksharing.
The most commonly used worksharing construct in OpenMP is the construct to split loop iterations
between different threads. Designing a parallel algorithm around parallel loops is an old tradition in
parallel programming [X393]. This style is sometimes called loop splitting and is discussed at length
in the Loop Parallelism pattern. In this approach, the programmer identifies the most time-consuming
loops in the program. Each loop is restructured, if necessary, so the loop iterations are largely
independent. The program is then parallelized by mapping different groups of loop iterations onto
different threads.
For example, consider the program in Fig. A.5. In this program, a computationally intensive function
big_comp() is called repeatedly to compute results that are then combined into a single global
answer. For the sake of this example, we assume the following.
Figure A.5. Fortran and C examples of a typical loop-oriented program
Figure A.6. Fortran and C examples of a typical loop-oriented program. In this version
of the program, the computationally intensive loop has been isolated and modified so
the iterations are independent.
The combine() routine does not take much time to run.
The combine() function must be called in the sequential order.
The first step is to make the loop iterations independent. One way to accomplish this is shown in Fig.
A.6. Because the combine() function must be called in the same order as in the sequential
program, there is an extra ordering constraint introduced into the parallel algorithm. This creates a
dependency between the iterations of the loop. If we want to run the loop iterations in parallel, we
need to remove this dependency.
In this example, we've assumed the combine() function is simple and doesn't take much time.
Hence, it should be acceptable to run the calls to combine() outside the parallel region. We do
this by placing each intermediate result computed by big_comp() into an element of an array. Then
the array elements can be passed to the combine() function in the sequential order in a separate
loop. This code transformation preserves the meaning of the original program (that is, the results are
identical between the parallel code and the original version of the program).
With this transformation, the iterations of the first loop are independent and they can be safely
computed in parallel. To divide the loop iterations among multiple threads, an OpenMP worksharing
construct is used. This construct assigns loop iterations from the immediately following loop onto a
team of threads. Later we will discuss how to control the way the loop iterations are scheduled, but for
now, we leave it to the system to figure out how the loop iterations are to be mapped onto the threads.
The parallel versions are shown in Fig. A.7.
Figure A.7. Fortran and C examples of a typical loop-oriented program parallelized
with OpenMP
The OpenMP parallel construct is used to create the team of threads. This is followed by the
worksharing construct to split up loop iterations among the threads: a DO construct in the case of
Fortran and a for construct for C/C++. The program runs correctly in parallel and preserves
sequential equivalence because no two threads update the same variable and any operations (such as
calls to the combine() function) that do not commute or are not associative are carried out in the
sequential order. Notice that, according to the rules we gave earlier concerning the sharing of
variables, the loop control variable i would be shared between threads. The OpenMP specification,
however, recognizes that it never makes sense to share the loop control index on a parallel loop, so it
automatically creates a private copy of the loop control index for each thread.
By default, there is an implicit barrier at the end of any OpenMP workshare construct; that is, all the
threads wait at the end of the construct and only proceed after all of the threads have arrived. This
barrier can be removed by adding a nowait clause to the worksharing construct:
#pragma omp for nowait
One should be very careful when using a nowait because, in most cases, these barriers are needed to
prevent race conditions.
There are two other commonly used types of worksharing constructs in OpenMP: the single
construct and the sections construct. The single construct defines a block of code that will be
executed by the first thread that encounters the construct. The other threads skip the construct and
wait at the implicit barrier at the end of the single (unless the nowait clause is used). We used
this construct in the OpenMP reduction example in the Implementation Mechanisms design space. We
needed to copy the number of threads used in a parallel region into a shared variable. To make sure
only one thread wrote into this shared variable, we placed it in a single construct:
#pragma omp single
num_threads = omp_get_num_threads();
The sections construct is used to set up a region of the program where distinct blocks of code are
to be assigned to different threads. Each block is defined with a section construct. We do not use
sections in any of our examples, so we will not discuss them in detail here.
Finally, it is very common in an OpenMP program to have a parallel construct immediately followed
by a worksharing construct. For example, in Fig. A.7, we had
#pragma omp parallel
{
#pragma omp for
. . .
}
As a shortcut, OpenMP defines a combined construct:
#pragma omp parallel for
. . .
This is identical to the case where the parallel and for constructs are placed within separate
pragmas.
#include <stdio.h>
#include <math.h>
#include <omp.h>
int main () {
int i;
int num_steps = 1000000;
double x, pi, step, sum = 0.0;
step = 1.0/(double) num_steps;
#pragma omp parallel for private(x) reduction(+:sum)
for (i=0;i< num_steps; i++)
{
x = (i+0.5)*step;
sum += 4.0/(1.0+x*x);
}
pi = step * sum;
printf("pi %lf\n",pi);
return 0;
}
Several other clauses change how variables are shared between threads. The most commonly used
ones follow.
firstprivate(list).Justaswithprivate,foreachnameappearinginthelist,anew
privatevariableforthatnameiscreatedforeachthread.Unlikeprivate,however,thenewly
createdvariablesareinitializedwiththevalueofthevariablethatwasboundtothenamein
theregionofcodeprecedingtheconstructcontainingthefirstprivateclause.This
clausecanbeusedonbothparallelandtheworksharingconstructs.
lastprivate(list). Once again, private variables for each thread are created for each name in the list. In this case, however, the value of the private variable from the sequentially last loop iteration is copied out into the variable bound to the name in the region following the OpenMP construct containing the lastprivate clause. This clause can only be used with the loop-oriented workshare constructs.
Variables generally can appear in the list of only a single data clause. The exception is with lastprivate and firstprivate, because it is quite possible that a private variable will need both a well-defined initial value and a value exported to the region following the OpenMP construct in question.
An example using these clauses is provided in Fig. A.9. We declare four variables private to each thread and assign values to three of them: h = 1, j = 2, and k = 0. Following the parallel loop, we print the values of h, j, and k. The variable k is well defined. As each thread executes a loop iteration, the variable k is incremented. It is therefore a measure of how many iterations each thread handled.
Figure A.9. C program showing use of the private, firstprivate, and lastprivate clauses. This program is incorrect in that the variables h and j do not have well-defined values when the printf is called. Notice the use of a backslash to continue the OpenMP pragma onto a second line.
#include <stdio.h>
#include <omp.h>
#define N 1000
int main()
{
int h, i, j, k;
h = 1; j = 2; k = 0;
#pragma omp parallel for private(h) firstprivate(j,k) \
lastprivate(j,k)
for(i=0;i<N;i++) {
k++;
j = h + i; //ERROR: h, and therefore j, is undefined
}
printf("h = %d, j = %d, k = %d\n",h, j, k); //ERROR j and h
//are undefined
}
The value passed outside the loop because of the lastprivate clause is the value of k for whichever thread executed the sequentially last iteration of the loop (that is, the iteration for which i = 999). The values of both h and j are undefined, but for different reasons. The variable j is undefined because it was assigned the value from a sum with an uninitialized variable (h) inside the parallel loop. The problem with the variable h is more subtle. It was declared as a private variable, but its value was unchanged inside the parallel loop. OpenMP stipulates, however, that after a name appears in any of the private clauses, the variable associated with that name in the region of code following the OpenMP construct is undefined. Hence, the print statement following the parallel for does not have a well-defined value of h to print.
A few other clauses affect the way variables are shared, but we do not use them in this book and hence will not discuss them here.
The final clause we will discuss that affects how data is shared is the reduction clause. Reductions were discussed at length in the Implementation Mechanisms design space. A reduction is an operation that, using a binary, associative operator, combines a set of values into a single value. Reductions are very common and are included in most parallel programming environments. In OpenMP, the reduction clause defines a list of variable names and a binary operator. For each name in the list, a private variable is created and initialized with the value of the identity element for the binary operator (for example, zero for addition). Each thread carries out the reduction into its copy of the local variable associated with each name in the list. At the end of the construct containing the reduction clause, the local values are combined with the associated value prior to the OpenMP construct in question to define a single value. This value is assigned to the variable with the same name in the region following the OpenMP construct containing the reduction.
An example of a reduction was given in Fig. A.8. In the example, the reduction clause was applied with the + operator to compute a summation and leave the result in the variable sum.
Although the most common reduction involves summation, OpenMP also supports reductions in Fortran for the operators +, *, -, .AND., .OR., .EQV., .NEQV., MAX, MIN, IAND, IOR, and IEOR. In C and C++, OpenMP supports reduction with the standard C/C++ operators +, *, -, &, |, ^, &&, and ||. C and C++ do not include a number of useful intrinsic functions such as "min" or "max" within the language definition. Hence, OpenMP cannot provide reductions in such cases; if they are required, the programmer must code them explicitly by hand. More details about reduction in OpenMP are given in the OpenMP specification [OMP].
omp_set_num_threads() takes an integer argument and requests that the operating system provide that number of threads in subsequent parallel regions.
Figure A.10. C program showing use of the most common runtime library functions
#include <stdio.h>
#include <omp.h>
int main() {
int id, numb;
omp_set_num_threads(3);
#pragma omp parallel private (id, numb)
{
id = omp_get_thread_num();
numb = omp_get_num_threads();
printf(" I am thread %d out of %d \n",id,numb);
}
}
omp_get_num_threads() (integer function) returns the actual number of threads in the current team of threads.
omp_get_thread_num() (integer function) returns the ID of a thread, where the ID ranges from 0 to the number of threads minus 1. The thread with ID of 0 is the master thread.
The lock functions create, use, and destroy locks. These are described later with the other synchronization constructs.
We provide a simple example of how these functions are used in Fig. A.10. The program prints the thread ID and the number of threads to the standard output.
The output from the program in Fig. A.10 would look something like this:
I am thread 2 out of 3
I am thread 0 out of 3
I am thread 1 out of 3
The output records would be complete and nonoverlapping because OpenMP requires that the I/O libraries be thread-safe. Which thread prints its record when, however, is not specified, and any valid interleaving of the output records can occur.
A.6. SYNCHRONIZATION
Many OpenMP programs can be written using only the parallel and parallel for (parallel do in Fortran) constructs. There are algorithms, however, where one needs more careful control over how variables are shared. When multiple threads read and write shared data, the programmer must ensure that the threads do not interfere with each other, so that the program returns the same results regardless of how the threads are scheduled. This is of critical importance since as a multithreaded program runs, any semantically allowed interleaving of the instructions could actually occur. Hence, the programmer must manage reads and writes to shared variables to ensure that threads read the correct value and that multiple threads do not try to write to a variable at the same time.
Synchronization is the process of managing shared resources so that reads and writes occur in the correct order regardless of how the threads are scheduled. The concepts behind synchronization are discussed in detail in the section on synchronization (Section 6.3) in the Implementation Mechanisms design space. Our focus here will be on the syntax and use of synchronization in OpenMP.
Consider the loop-based program in Fig. A.5 earlier in this chapter. We used this example to introduce worksharing in OpenMP, at which time we assumed the combination of the computed results (res) did not take much time and had to occur in the sequential order. Hence, it was no problem to store intermediate results in a shared array and later (within a serial region) combine the results into the final answer.
In the more common case, however, results of big_comp() can be accumulated in any order as long as the accumulations do not interfere. To make things more interesting, we will assume that the combine() and big_comp() routines are both time-consuming and take unpredictable and widely varying amounts of time to execute. Hence, we need to bring the combine() function into the parallel region and use synchronization constructs to ensure that parallel calls to the combine() function do not interfere.
The major synchronization constructs in OpenMP are the following.
flush defines a synchronization point at which memory consistency is enforced. This can be subtle. Basically, a modern computer can hold values in registers or buffers that are not guaranteed to be consistent with the computer's memory at any given point. Cache coherency protocols guarantee that all processors ultimately see a single address space, but they do not guarantee that memory references will be up to date and consistent at every point in time. The syntax of flush is
#pragma omp flush [(list)]
where list is a comma-separated list of variables that need to be flushed. If the list is omitted, all variables visible to the calling thread will be flushed. Programmers only rarely need to call flush because it is automatically inserted at most points where it is needed. Typically it is needed only by programmers building their own low-level synchronization primitives.
critical implements a critical section for mutual exclusion. In other words, only one thread at a time will execute the structured block within a critical section. The other threads will wait their turn at the top of the construct.
The syntax of a critical section is
#pragma omp critical [(name)]
{ a structured block }
where name is an identifier that can be used to support disjoint sets of critical sections. A critical section implies a call to flush on entry to and on exit from the critical section.
barrier provides a synchronization point at which the threads wait until every member of the team has arrived before any threads continue. The syntax of a barrier is
#pragma omp barrier
A barrier can be added explicitly, but it is also implied where it makes sense (such as at the end of parallel or worksharing constructs). A barrier implies a flush.
Critical sections, barriers, and flushes are discussed further in the Implementation Mechanisms design space.
Returning to our example in Fig. A.5, we can safely include the call to the combine() routine inside the parallel loop if we enforce mutual exclusion. We will do this with the critical construct as shown in Fig. A.11. Notice that we had to create a private copy of the variable res to prevent conflicts between iterations of the loop.
Figure A.11. Parallel version of the program in Fig. A.5. In this case, however, we
assume that the calls to combine() can occur in any order as long as only one thread at
a time executes the function. This is enforced with the critical construct.
#include <stdio.h>
#include <omp.h>
#define N 1000
extern void combine(double,double);
extern double big_comp(int);
int main() {
int i;
double answer, res;
answer = 0.0;
#pragma omp parallel for private (res)
for (i=0;i<N;i++){
res = big_comp(i);
#pragma omp critical
combine(answer,res);
}
printf("%f\n", answer);
}
For most programmers, the high-level synchronization constructs in OpenMP are sufficient. There are cases, however, where the high-level synchronization constructs cannot be used. Two common cases where this situation comes up are the following.
The synchronization protocols required by a problem cannot be expressed with OpenMP's high-level synchronization constructs.
The parallel overhead incurred by OpenMP's high-level synchronization constructs is too large.
To address these problems, the OpenMP runtime library includes low-level synchronization functions that provide a lock capability. The prototypes for lock functions are included in the omp.h include file. The locks use an opaque data type omp_lock_t defined in the omp.h include file. The key functions follow.
omp_init_lock() initializes a lock.
omp_destroy_lock() destroys a lock, freeing any memory associated with it.
omp_set_lock() sets a lock, blocking until the lock is available.
omp_unset_lock() releases a lock.
omp_test_lock() attempts to set a lock but returns immediately, without blocking, if the lock is already held.
The lock functions guarantee that the lock variable itself is consistently updated between threads, but do not imply a flush of other variables. Therefore, programmers using locks must call flush explicitly as needed. An example of a program using OpenMP locks is shown in Fig. A.12. The program declares and then initializes the lock variables at the beginning of the program. Because this occurs prior to the parallel region, the lock variables are shared between the threads. Inside the parallel region, the first lock is used to make sure only one thread at a time tries to print a message to the standard output. The lock is needed to ensure that the two printf statements are executed together and not interleaved with those of other threads. The second lock is used to ensure that only one thread at a time executes the go_for_it() function, but this time, the omp_test_lock() function is used so a thread can do useful work while waiting for the lock. After the parallel region completes, the memory associated with the locks is freed by a call to omp_destroy_lock().
Figure A.12. Example showing how the lock functions in OpenMP are used
#include <stdio.h>
#include <omp.h>
extern void do_something_else(int);
extern void go_for_it(int);
int main() {
omp_lock_t lck1, lck2;
int id;
omp_init_lock(&lck1);
omp_init_lock(&lck2);
#pragma omp parallel shared(lck1, lck2) private(id)
{
id = omp_get_thread_num();
omp_set_lock(&lck1);
printf("thread %d has the lock \n", id);
printf("thread %d ready to release the lock \n", id);
omp_unset_lock(&lck1);
while (! omp_test_lock(&lck2)) {
do_something_else(id); // do something useful while waiting
// for the lock
}
go_for_it(id); // Thread has the lock
omp_unset_lock(&lck2);
}
omp_destroy_lock(&lck1);
omp_destroy_lock(&lck2);
}
schedule(runtime). The runtime schedule stipulates that the actual schedule and chunk size for the loop is to be taken from the value of the environment variable OMP_SCHEDULE. This lets a programmer try different schedules without having to recompile for each trial.
Consider the loop-based program in Fig. A.13. Because the schedule clause is the same for both Fortran and C, we will only consider the case for C. If the runtime associated with different loop iterations changes unpredictably as the program runs, a static schedule is probably not going to be effective. We will therefore use the dynamic schedule. Scheduling overhead is a serious problem, however, so to minimize the number of scheduling decisions, we will start with a block size of 10 iterations per scheduling decision. There are no firm rules, however, and OpenMP programmers usually experiment with a range of schedules and chunk sizes until the optimum values are found. For example, parallel loops in programs such as the one in Fig. A.13 can also be effectively scheduled with a static schedule as long as the chunk size is small enough that work is equally distributed among threads.
Figure A.13. Parallel version of the program in Fig. A.11, modified to show the use of
the schedule clause
#include <stdio.h>
#include <omp.h>
#define N 1000
distributed among a collection of processes, scatter data across a collection of processes, and much more.
[1] The UEs are processes in MPI.
MPI was created in the early 1990s to provide a common message-passing environment that could run on clusters, MPPs, and even shared-memory machines. MPI is distributed in the form of a library, and the official specification defines bindings for C and Fortran, although bindings for other languages have been defined as well. The overwhelming majority of MPI programmers today use MPI version 1.1 (released in 1995). The specification of an enhanced version, MPI 2.0, with parallel I/O, dynamic process management, one-sided communication, and other advanced features was released in 1997. Unfortunately, it was such a complex addition to the standard that at the time this is being written (more than six years after the standard was defined), only a handful of MPI implementations support MPI 2.0. For this reason, we will focus on MPI 1.1 in this discussion.
There are several implementations of MPI in common use. The two most common are LAM/MPI [LAM] and MPICH [MPI]. Both may be downloaded free of charge and are straightforward to install using the instructions that come with them. They support a wide range of parallel computers, including Linux clusters, NUMA computers, and SMPs.
B.1. CONCEPTS
The basic idea of passing a message is deceptively simple: One process sends a message and another one receives it. Digging deeper, however, the details behind message passing become much more complicated: How are messages buffered within the system? Can a process do useful work while it is sending or receiving messages? How can messages be identified so that sends are always paired with their intended receives?
The long-term success of MPI is due to its elegant solution to these (and other) problems. The approach is based on two core elements of MPI: process groups and a communication context. A process group is a set of processes involved in a computation. In MPI, all the processes involved in the computation are launched together when the program starts and belong to a single group. As the computation proceeds, however, the programmer can divide the processes into subgroups and precisely control how the groups interact.
A communication context provides a mechanism for grouping together sets of related communications. In any message-passing system, messages must be labeled so they can be delivered to the intended destination or destinations. The message labels in MPI consist of the ID of the sending process, the ID of the intended receiver, and an integer tag. A receive statement includes parameters indicating a source and tag, either or both of which may be wildcards. The result, then, of executing a receive statement at process i is the delivery of a message with destination i whose source and tag match those in the receive statement.
While straightforward, identifying messages with source, destination, and tag may not be adequate in complex applications, particularly those that include libraries or other functions reused from other programs. Often, the application programmer doesn't know any of the details about this borrowed code, and if the library includes calls to MPI, the possibility exists that messages in the application and the library might accidentally share tags, destination IDs, and source IDs. This could lead to errors when a library message is delivered to application code, or vice versa. One way to deal with this problem is for library writers to specify reserved tags that users must avoid in their code. This approach has proved cumbersome, however, and is prone to error because it requires programmers to carefully read and follow the instructions in the documentation.
MPI's solution to this problem is based on the notion of communication contexts.[2] Each send (and its resulting message) and receive belong to a communication context, and only those communication events that share a communication context will match. Hence, even if messages share a source, a destination, and a tag, they will not be confused with each other as long as they have different contexts. Communication contexts are dynamically created and guaranteed to be unique.
[2] The Zipcode message-passing library [SSD+94] was the only message-passing library in use prior to MPI that included distinct communication contexts.
In MPI, the process group and communication context are combined into a single object called a communicator. With only a few exceptions, the functions in MPI include a reference to a communicator. At program startup, the runtime system creates a common communicator called MPI_COMM_WORLD.
In most cases, MPI programmers only need a single communicator and just use MPI_COMM_WORLD. While creating and manipulating communicators is straightforward, only programmers writing reusable software components need to do so. Hence, manipulating communicators is beyond the scope of this discussion.
Next, the MPI environment must be initialized. As part of the initialization, the communicator shared by all the processes is created. As mentioned earlier, this is called MPI_COMM_WORLD:
MPI_Init(&argc, &argv);
The command-line arguments are passed into MPI_Init so the MPI environment can influence the behavior of the program by adding its own command-line arguments (transparently to the programmer). This function returns an integer status flag used to indicate success or failure of the function call. With very few exceptions, all MPI functions return this flag. Possible values of this flag are described in the MPI include file, mpi.h.
Figure B.1. Program to print a simple string to standard output
#include <stdio.h>
int main(int argc, char **argv) {
printf("\n Never miss a good chance to shut up \n");
}
Figure B.2. Parallel program in which each process prints a simple string to the
standard output
#include <stdio.h>
#include "mpi.h"
int main(int argc, char **argv) {
int num_procs; // the number of processes in the group
int ID; // a unique identifier ranging from 0 to (num_procs-1)
if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
// print error message and exit
}
MPI_Comm_size (MPI_COMM_WORLD, &num_procs);
MPI_Comm_rank (MPI_COMM_WORLD, &ID);
printf("\n Never miss a good chance to shut up %d \n",ID);
MPI_Finalize();
}
Although not required, almost every MPI program uses the number of processes in the group and the rank[3] of each process in the group to guide the computation, as described in the SPMD pattern. This information is found through calls to the functions MPI_Comm_size and MPI_Comm_rank:
[3] The rank is an integer ranging from zero to the number of processes in the group minus one. It indicates the position of each process within the process group.
int num_procs; // the number of processes in the group
int ID; // a unique identifier ranging from 0 to (num_procs-1)
MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
MPI_Comm_rank(MPI_COMM_WORLD, &ID);
When the MPI program finishes running, the environment needs to be cleanly shut down. This is done with this function:
MPI_Finalize();
Using these elements in our simple example program, we arrive at the parallel program in Fig. B.2.
If at any point within a program a fatal condition is detected, it might be necessary to shut down the program. If this is not done carefully, processes can be left on some of the nodes. These processes, called orphan processes, can in principle stay around "forever" waiting for interaction with other processes within the group. The following function, which takes a communicator and an error code, tells the MPI runtime environment to make a best-effort attempt to shut down all the processes in the MPI program:
MPI_Abort(MPI_COMM_WORLD, error_code);
The MPI_Status data type is defined in mpi.h. The status variable is used to characterize a received message. If MPI_ANY_TAG was used by MPI_Recv(), for example, the actual tag for the message can be extracted from the status variable as status.MPI_TAG.
The program in Fig. B.4 provides an example of how to use the basic message-passing functions. In this program, a message is bounced between two processes. This program (sometimes called a "ping-pong" program) is frequently used as a performance metric for communication systems in parallel computers.
Figure B.4. MPI program to "bounce" a message between two processes using the
standard blocking point-to-point communication routines in the C binding to MPI 1.1
#include <stdio.h> // standard I/O include file
#include "memory.h" // standard include file with function
// prototypes for memory management
#include "mpi.h" // MPI include file
int main(int argc, char **argv) {
int Tag1 = 1; int Tag2 = 2; // message tags
int num_procs; // the number of processes in the group
int ID; // a unique identifier ranging from 0 to (num_procs-1)
int buffer_count = 100; // number of items in message to bounce
long *buffer; // buffer to bounce between processes
int i;
MPI_Finalize();
The program opens with the regular include files and declarations. We initialize MPI as the first executable statement in the program. This could be done later, the only firm requirement being that initialization must occur before any MPI routines are called. Next, we determine the rank of each process and the size of the process group. To keep the example short and simple, the program has been written assuming there will only be two processes in the group. We verify this fact (aborting the program if necessary) and then allocate memory to hold the buffer.
The communication itself is split into two parts. In the process with rank equal to 0, we send and then receive. In the other process, we receive and then send. Matching up sends and receives in this way (sending and then receiving on one side of a pairwise exchange of messages while receiving and then sending on the other) can be important with blocking sends and large messages. To understand the issues, consider a problem where two processes (ID 0 and ID 1) both need to send a message (from buffer outgoing) and receive a message (into buffer incoming). Because this is an SPMD program, it is tempting to write this as
int neigh; // the rank of the neighbor process
if (ID == 0) neigh = 1; else neigh = 0;
MPI_Send (outgoing, buffer_count, MPI_LONG, neigh, Tag1,
MPI_COMM_WORLD);
This code is simple and compact. Unfortunately, in some cases (large message sizes), the system can't free up the system buffers used to send the message until they have been copied into the incoming buffer on the receiving end of the communication. Because the sends block until the system buffers can be reused, the receive functions are never called and the program deadlocks. Hence, the safest way to write the previous code is to split up the communications as we did in Fig. B.4.
MPI_Barrier. A barrier defines a synchronization point at which all processes must arrive before any of them are allowed to proceed. For MPI, this means that every process using the indicated communicator must call the barrier function before any of them proceed. This function is described in detail in the Implementation Mechanisms design space.
MPI_Bcast. A broadcast sends a message from one process to all the processes in a group.
MPI_Reduce. A reduction operation takes a set of values (in the buffer pointed to by inbuff) spread out around a process group and combines them using the indicated binary operation. To be meaningful, the operation in question must be associative. The most common examples for the binary function are summation and finding the maximum or minimum of a set of values. Notice that the final reduced value (in the buffer pointed to by outbuff) is only available in the indicated destination process. If the value is needed by all processes, there is a variant of this routine called MPI_Allreduce(). Reductions are described in detail in the Implementation Mechanisms design space.
The syntax of these functions is defined in Fig. B.5.
In Figs. B.6 and B.7 we present a program that uses these three functions. In this program, we wish to time a function that passes a message around a ring of processes. The code for the ring communication function is not relevant for this discussion, but for completeness, we provide it in Fig. B.8. The runtime for any parallel program is the time taken by the slowest process. So we need to find the time consumed in each process and then find the maximum across all the processes. The time is measured using the standard MPI timing function:
double MPI_Wtime();
Figure B.6. Program to time the ring function as it passes messages around a ring of
processes (continued in Fig. B.7). The program returns the time from the process that
takes the longest elapsed time to complete the communication. The code to the ring
function is not relevant for this example, but it is included in Fig. B.8.
//
// Ring communication test.
// Command-line arguments define the size of the message
// and the number of times it is shifted around the ring:
//
// a.out msg_size num_shifts
//
#include "mpi.h"
#include <stdio.h>
#include <memory.h>
int main(int argc, char **argv) {
int num_shifts = 0; // number of times to shift the message
int buff_count = 0; // number of doubles in the message
int num_procs = 0; // number of processes in the ring
int ID; // process (or node) id
int buff_size_bytes; // message size in bytes
int i;
double *x; // outgoing message buffer
double *incoming; // incoming message buffer
double t0; // start time
double ring_time; // time this process spent in the ring test
double max_time; // maximum ring time over all processes
MPI_Status stat;
// Initialize the MPI environment
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &ID);
MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
// Process, test, and broadcast input parameters
if (ID == 0){
if (argc != 3){
printf("Usage: %s <size of message> <Num of shifts> \n",*argv);
fflush(stdout);
MPI_Abort(MPI_COMM_WORLD, 1);
}
buff_count = atoi(*++argv); num_shifts = atoi(*++argv);
printf(": shift %d doubles %d times \n",buff_count, num_shifts);
fflush(stdout);
}
// Continued in the next figure
Figure B.7. Program to time the ring function as it passes messages around a ring of
processes (continued from Fig. B.6)
// Continued from the previous figure
// Broadcast data from rank 0 process to all other processes
MPI_Bcast (&buff_count, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast (&num_shifts, 1, MPI_INT, 0, MPI_COMM_WORLD);
// Allocate space and fill the outgoing ("x") and "incoming" vectors.
buff_size_bytes = buff_count * sizeof(double);
x = (double*)malloc(buff_size_bytes);
incoming = (double*)malloc(buff_size_bytes);
for(i=0;i<buff_count;i++){
x[i] = (double) i;
incoming[i] = -1.0;
}
// Do the ring communication tests.
MPI_Barrier(MPI_COMM_WORLD);
t0 = MPI_Wtime();
/* code to pass messages around a ring */
ring (x,incoming,buff_count,num_procs,num_shifts,ID);
ring_time = MPI_Wtime() - t0;
// Analyze results
MPI_Barrier(MPI_COMM_WORLD);
MPI_Reduce(&ring_time, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0,
MPI_COMM_WORLD);
if(ID == 0){
printf("\n Ring test took %f seconds", max_time);
}
MPI_Finalize();
}
Figure B.5. The major collective communication routines in the C binding to MPI 1.1
(MPI_Barrier, MPI_Bcast, and MPI_Reduce)
MPI_Wtime() returns the time in seconds since some arbitrary point in the past. Usually the time interval of interest is computed by calling this function twice.
This program begins as most MPI programs do, with declarations, MPI initialization, and finding the rank and number of processes. We then process the command-line arguments to determine the message size and number of times to shift the message around the ring of processes. Every process will need these values, so the MPI_Bcast() function is called to broadcast these values.
We then allocate space for the outgoing and incoming vectors to be used in the ring test. To produce consistent results, every process must complete the initialization before any processes enter the timed section of the program. This is guaranteed by calling the MPI_Barrier() function just before the timed section of code. The time function is then called to get an initial time, the ring test itself is called, and then the time function is called a second time. The difference between these two calls to the time function is the elapsed time this process spent passing messages around the ring.
Figure B.8. Function to pass a message around a ring of processes. It is deadlock-free
because the sends and receives are split between the even and odd processes.
/*******************************************************************
NAME: ring
PURPOSE: This function does the ring communication, with the
}
else{
for(i=0;i<num_shifts; i++){
MPI_Recv (incoming, buff_count, MPI_DOUBLE, prev, 3,
MPI_COMM_WORLD, &stat);
MPI_Send (x, buff_count, MPI_DOUBLE, next, 3,
MPI_COMM_WORLD);
}
}
}
The total runtime for an MPI program is given by the time required by the slowest process. So to report a single number for the time, we need to determine the maximum of the times taken by each process. We do this with a single call to MPI_Reduce() with MPI_MAX.
Figure B.9. The nonblocking or asynchronous communication functions
#include <stdio.h>
#include <mpi.h>
// Update a distributed field with a local N by N block on each process
// (held in the array U). The point of this example is to show
// communication overlapped with computation, so code for other
// functions is not included.
#define N 100 // size of an edge in the square field.
void extern initialize(int, double*);
void extern extract_boundary(int, double*, double*);
void extern update_interior(int, double*);
void extern update_edge(int,double*, double*);
int main(int argc, char **argv) {
double *U, *B, *inB;
int i, num_procs, ID, left, right, Nsteps = 100;
MPI_Status status;
MPI_Request req_recv, req_send;
// Initialize the MPI environment
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD, &ID);
MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
// allocate space for the field (U), and the buffers
// to send and receive the edges (B, inB)
U = (double*)malloc(N*N * sizeof(double));
B = (double*)malloc(N * sizeof(double));
inB = (double*)malloc(N * sizeof(double));
// Initialize the field and set up a ring communication pattern
initialize (N, U);
right = ID + 1; if (right > (num_procs-1)) right = 0;
left = ID - 1; if (left < 0 ) left = num_procs-1;
// Iteratively update the field U
for(i=0; i<Nsteps; i++){
MPI_Irecv(inB, N, MPI_DOUBLE, left, i, MPI_COMM_WORLD, &req_recv);
extract_boundary(N, U, B); //Copy the edge of U into B
MPI_Isend(B, N, MPI_DOUBLE, right, i, MPI_COMM_WORLD, &req_send);
update_interior(N, U);
MPI_Wait(&req_recv, &status); MPI_Wait(&req_send, &status);
update_edge(ID, inB, U);
}
MPI_Finalize();
}
The function returns as soon as the system sets up resources to hold the message incoming from the left. The handle req_recv provides a mechanism to inquire about the status of the communication. The edge of the field is then extracted and sent to the neighbor on the right:
MPI_Isend(B, N, MPI_DOUBLE, right, i, MPI_COMM_WORLD, &req_send);
While the communication is taking place, the program updates the interior of the field (the interior refers to that part of the update that does not require edge information from the neighboring processes). After that work is complete, each process must wait until the communication is complete
MPI_Wait(&req_send, &status);
MPI_Wait(&req_recv, &status);
at which point, the field edges are updated and the program continues to the next iteration.
Another technique for reducing parallel overhead in an MPI program is persistent communication. This approach is used when a problem is dominated by repeated use of a communication pattern. The idea is to set up the communication once and then use it multiple times to pass the actual messages. The functions used in persistent communication are
MPI_Send_init(outbuff, count, MPI_type, dest, tag, MPI_COMM, &request);
MPI_Recv_init(inbuff, count, MPI_type, src, tag, MPI_COMM, &request);
MPI_Start(&request);
MPI_Wait(&request, &status);
MPI_Send_init and MPI_Recv_init are called to set up the communication. A request handle is returned to manipulate the actual communication events. The communication is initiated with a call to MPI_Start, at which point, the process is free to continue with any computation. When no further work can be done until the communication is complete, the processes can wait on a call to MPI_Wait. A function using persistent communication for the ring communication pattern used in Fig. B.8 is shown in Fig. B.11.
In addition to nonblocking and persistent communication, MPI defines several communication modes corresponding to different ways the sends work with the communication buffers:
Standard mode (MPI_Send). The standard MPI send; the send will not complete until the send buffer is empty and ready to reuse.
Synchronous mode (MPI_Ssend). The send does not complete until after a matching receive has been posted. This makes it possible to use the communication as a pairwise synchronization event.
Buffered mode (MPI_Bsend). User-supplied buffer space is used to buffer the messages. The send will complete as soon as the send buffer is copied to the system buffer.
Figure B.11. Function to pass a message around a ring of processes using persistent
communication
/*******************************************************************
NAME:     ring_persistent
PURPOSE:  This function uses the persistent communication request
          mechanism to implement the ring communication in MPI.
*******************************************************************/
#include "mpi.h"
#include <stdio.h>
ring_persistent(
double *x, /* message to shift around the ring */
double *incoming, /* buffer to hold incoming message */
int buff_count, /* size of message */
int num_procs, /* total number of processes */
int num_shifts, /* numb of times to shift message */
int my_ID) /* process id number */
{
int next; /* process id of the next process */
int prev; /* process id of the prev process */
int i;
MPI_Request snd_req; /* handle to the persistent send */
MPI_Request rcv_req; /* handle to the persistent receive */
MPI_Status stat;
/*******************************************************************
** In this ring method, first post all the sends and then pick up
** the messages with the receives.
*******************************************************************/
next = (my_ID +1 )%num_procs;
prev = ((my_ID==0)?(num_procs-1):(my_ID-1));
MPI_Send_init(x, buff_count, MPI_DOUBLE, next, 3,
MPI_COMM_WORLD, &snd_req);
MPI_Recv_init(incoming, buff_count, MPI_DOUBLE, prev, 3,
MPI_COMM_WORLD, &rcv_req);
for(i=0;i<num_shifts; i++){
   MPI_Start(&snd_req);
   MPI_Start(&rcv_req);
   MPI_Wait(&snd_req, &stat);
   MPI_Wait(&rcv_req, &stat);
}
}
Ready mode (MPI_Rsend). The send will transmit the message immediately under the assumption that a matching receive has already been posted (an erroneous program otherwise). On some systems, ready mode communication is more efficient.
Most of the MPI examples in this book use standard mode. We used the synchronous mode communication to implement mutual exclusion in the Implementation Mechanisms design space. Information about the other modes can be found in the MPI specification [Mesb].

The include file in Fortran containing constants, error codes, etc., is called mpif.h.

MPI routines essentially have the same names in the two languages. Whereas the MPI functions in C are case sensitive, the MPI subprograms in Fortran are case insensitive.

In every case except for the timing routines, Fortran uses subroutines while C uses functions.

The arguments to the C functions and Fortran subroutines are the same, with the obvious mappings onto Fortran's standard data types. There is one additional argument added to most Fortran subroutines. This is an integer parameter, ierr, that holds the MPI error return code.

For example, we show C and Fortran versions of MPI's reduction routine in Fig. B.12.
Figure B.12. Comparison of the C and Fortran language bindings for the reduction
routine in MPI 1.1
Opaque objects (such as MPI_COMM or MPI_Request) are of type INTEGER in Fortran (with the exception of Boolean-valued variables, which are of type LOGICAL).

A simple Fortran program using MPI is shown in Fig. B.13. This program shows how the basic setup and finalization subroutines work in MPI Fortran. The direct analogy with the MPI C language binding should be clear from this figure. Basically, if a programmer understands MPI with one language, then he or she knows MPI for the other language. The basic constructs are the same between language bindings; however, programmers must be careful when mixing Fortran MPI and C MPI because the MPI specifications do not guarantee interoperability between languages.
Figure B.13. Simple Fortran MPI program where each process prints its ID and the
number of processes in the computation
program firstprog
include "mpif.h"
integer ID, Nprocs, ierr
call MPI_INIT( ierr )
call MPI_COMM_RANK ( MPI_COMM_WORLD, ID, ierr )
call MPI_COMM_SIZE ( MPI_COMM_WORLD, Nprocs, ierr )
print *, 'Process ', ID, ' of ', Nprocs
call MPI_FINALIZE( ierr )
end
B.7. CONCLUSION
MPI is by far the most commonly used API for parallel programming. It is often called the "assembly code" of parallel programming. MPI's low-level constructs are closely aligned to the MIMD model of parallel computers. This allows MPI programmers to precisely control how the parallel computation unfolds and write highly efficient programs. Perhaps even more important, this lets programmers write portable parallel programs that run well on shared-memory machines, massively parallel supercomputers, clusters, and even over a grid.

Learning MPI can be intimidating. It is huge, with more than 125 different functions in MPI 1.1. The large size of MPI does make it complex, but most programmers avoid this complexity and use only a small subset of MPI. Many parallel programs can be written with just six functions: MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv, and MPI_Finalize. Good sources of more information about MPI include [Pac96], [GLS99], and [GS98]. Versions of MPI are available for most computer systems, usually in the form of open source software readily available online. The most commonly used versions of MPI are LAM/MPI [LAM] and MPICH [MPI].
C.3 SYNCHRONIZED BLOCKS
C.4 WAIT AND NOTIFY
C.5 LOCKS
C.6 OTHER SYNCHRONIZATION MECHANISMS AND SHARED DATA STRUCTURES
C.7 INTERRUPTS
Java is an object-oriented programming language that provides language support for expressing concurrency in shared-memory programs. Java's support for polymorphism can be exploited to write frameworks directly supporting some of the patterns described in this book. The framework provides the infrastructure for the pattern; the application programmer adds subclasses containing the application-specific code. An example is found in the Examples section of the Pipeline pattern.

Because Java programs are typically compiled to an intermediate language called Java bytecode, which is then compiled and/or interpreted on the target machine, Java programs enjoy a high degree of portability.

Java can also be used for distributed computing, and the standard libraries provide support for several mechanisms for interprocess communication in distributed systems. A brief overview of these is given in the Implementation Mechanisms design space. In addition, there is an ever-growing collection of packages that allow concurrency and interprocess communication to be expressed at higher levels than with the facilities provided by the language and core packages. Currently, the most common types of applications that exploit Java's facilities for concurrency are multithreaded server-side applications, graphical user interfaces, and programs that are naturally distributed due to their use of data and/or resources in different locations.

The performance available from current implementations of Java is not as good as that from programming languages more typically used in high-performance scientific computing. However, the ubiquity and portability of Java, along with the language support for concurrency, make it an important platform. For many people, a Java program may be their first concurrent program. In addition, continually improving compiler technology and libraries should reduce the performance gap in the future.
Figure C.1. A class holding pairs of objects of an arbitrary type. Without generic
types, this would have been done by declaring x and y to be of type Object, requiring
casting the returned values of getX and getY. In addition to less-verbose programs,
this allows type errors to be found by the compiler rather than throwing a
ClassCastException at runtime.
/* class holding two objects of arbitrary type, providing swap and
   accessor methods */
class Pair<Gentype>
//Gentype is a type variable
{ private Gentype x,y;
   Pair(Gentype x, Gentype y){this.x = x; this.y = y;}
   void swap(){Gentype temp = x; x=y; y=temp;}
   public Gentype getX(){return x;}
   public Gentype getY(){return y;}
}
/* The following method, defined in some class, instantiates two Pair objects */
public static void test()
{ /*create a Pair holding Strings.*/
   Pair<String> p1 = new Pair<String>("hello", "goodbye");
   /*create a Pair holding Integers. Autoboxing automatically creates
     Integer objects from the int literals, equivalent to
     new Pair<Integer>(new Integer(1), new Integer(2)) */
   Pair<Integer> p2 = new Pair<Integer>(1,2);
}
In this appendix, we describe Java 2 1.5[1] [Java], which, compared with earlier versions, includes several additions to the programming language and standard libraries. Additions to the programming language include support for generic types, autoboxing (which provides automatic conversion between primitive and object types such as int and Integer), and enhanced for loops. The example in Fig. C.1 illustrates generic types, the only language addition used in this book.

[1] The version available at press time was J2SE 1.5.0 Beta 1.

Java 2 1.5 also provides significantly enhanced support for concurrent programming via several changes in the java.lang and java.util packages and the introduction of new packages java.util.concurrent, java.util.concurrent.atomic, and java.util.concurrent.locks. A new, more precise specification of the memory model has also been incorporated. The facilities for concurrent programming are described in [JSRb, JSRc] and the new memory model is described in [JSRa].

It isn't feasible to give a complete introduction to Java in this appendix. Hence, we assume that the reader has some familiarity with the language and focus on those aspects that are most relevant to shared-memory parallel programming, including new features introduced in Java 2 1.5. Introductions to Java and its (pre-Java 2 1.5) libraries can be found in [AGH00, HC02, HC01]. An excellent discussion of concurrent programming in Java is [Lea00a].
A thread can access all variables visible to its run method according to the usual Java scope rules. Thus, by making variables visible to multiple threads, memory can be shared between them. By encapsulating variables, memory can be protected from access by other threads. In addition, a variable can be marked final, which means that, once initialized, its value will not change. Provided the initialization is done properly, final variables can be accessed by multiple threads without requiring synchronization. (However, marking a reference final does not guarantee immutability of the referenced object; care must be taken to ensure that it is thread-safe as well.)
There are two different ways to specify the run method that a thread will execute. One is to extend the Thread class and override its run method. Then, one simply instantiates an object of the subclass and calls its inherited start method. The more common approach is to instantiate a Thread object using the constructor that takes a Runnable object as a parameter. As before, invoking the Thread object's start method begins the thread's execution. The Runnable interface contains a public void run method; the newly created thread executes the Runnable object's run method. Because the run method is parameterless, information is typically passed to the thread via data members of the Thread subclass or the Runnable. These are typically set in the constructor.
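The first approach, subclassing Thread, can be sketched as follows. This is our own minimal illustration, not an example from the text; the class name CounterThread and its output message are invented for the sketch.

```java
// Sketch of the first approach: subclass Thread and override run.
// Thread-specific data (the id field) is set in the constructor.
class CounterThread extends Thread {
    private final int id;

    CounterThread(int id) { this.id = id; }

    @Override
    public void run() {
        System.out.println(id + ": running in a subclass of Thread");
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 4; i++) {
            Thread t = new CounterThread(i);
            t.start();   // the JVM invokes the overridden run method
            t.join();    // wait for completion, only to keep output ordered
        }
    }
}
```

Calling run directly would execute the method in the current thread; only start creates a new thread of control.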
In the following example, the class ThinkParallel implements the Runnable interface (that is, it declares that it implements the interface and provides a public void run() method). The main method creates and starts four Thread objects. Each of these is passed a ThinkParallel object, whose id field has been set to the value of loop counter i. This causes four threads to be created, each executing the run method of the ThinkParallel class.
Figure C.2. Program to create four threads, passing a Runnable in the Thread
constructor. Thread-specific data is held in a field of the Runnable object.
class ThinkParallel implements Runnable
{
int id; //thread-specific variable containing thread ID
/*The run method defines the thread's behavior*/
public void run()
{ System.out.println(id + ": Are we there yet?");
}
/*Constructor sets id*/
ThinkParallel (int id){this.id = id;}
/*main method instantiates and starts the threads*/
public static void main(String[] args)
{ /*create and start 4 Thread objects, passing each a ThinkParallel object
  */
   for(int i = 0; i != 4; i++)
   { new Thread(new ThinkParallel(i)).start();
   }
}
}
The ExecutorService interface provides additional methods to manage thread termination and support Futures. The implementations of Executor in java.util.concurrent also implement the ExecutorService interface.
Figure C.3. Program similar to the one in Fig. C.2, but using an anonymous class to
define the Runnable object
class ThinkParallelAnon {
   /*main method instantiates and starts the threads*/
   public static void main(String[] args)
   { /*create and start 4 Thread objects,
        passing each a Runnable object defined by an anonymous class
     */
      for(int i = 0; i != 4; i++)
      { final int j = i;
         new Thread( new Runnable() //define Runnable objects
                                    // anonymously
            { int id = j; //references the final local variable j
               public void run()
               { System.out.println(id + ": Are we there yet?");}
            }
         ).start();
      }
   }
}
As an example, we could rewrite the previous example as shown in Fig. C.4. Note that to change to a different thread management policy, we would only need to invoke a different factory method for the instantiation of the Executor. The rest of the code would remain the same.

The run method in the Runnable interface does not return a value and cannot be declared to throw exceptions. To correct this deficiency, the Callable interface, which contains a method call that throws an Exception and returns a result, was introduced. The interface definition exploits the new support for generic types. Callables can arrange execution by using the submit method of an object that implements the ExecutorService interface. The submit method returns a Future object, which represents the result of an asynchronous computation. The Future object provides methods to check whether the computation is complete, to wait for completion, and to get the result. The type enclosed within angle brackets specifies that the class is to be specialized to that type.
Figure C.4. Program using a ThreadPoolExecutor instead of creating threads directly
import java.util.concurrent.*;
class ThinkParalleln implements Runnable {
   int id; //thread-specific variable containing thread ID
   /*The run method defines the task's behavior*/
   public void run()
   { System.out.println(id + ": Are we there yet?");
   }
   /*Constructor sets id*/
   ThinkParalleln(int id){this.id = id;}
   /*main method creates an Executor to manage the tasks*/
   public static void main(String[] args)
   { /*create an Executor using a factory method in Executors*/
      ExecutorService executor = Executors.newCachedThreadPool();
      // send each task to the executor
      for(int i = 0; i != 4; i++)
      { executor.execute(new ThinkParalleln(i)); }
   }
}
As an example, we show in Fig. C.5 a code fragment in which the main thread submits an anonymous Callable to an Executor. The Executor arranges for the call method to be executed in another thread. Meanwhile, the original thread performs some other work. When finished with the other work, it uses the get method to obtain the result of the Callable's execution. If necessary, the main thread will block on the get method until the result is available.
Figure C.5. Code fragment illustrating use of Callable and Future
/*execute a Callable that returns a Double.
The notation Future<Double> indicates a (generic) Future that
has been specialized to be a <Double>.
*/
Future<Double> future = executor.submit(
new Callable<Double>() {
public Double call() { return result_of_long_computation; }
});
do_something_else( ); /* do other things while long
computation proceeds in another thread */
try {
    Double d = (future.get()); /* get results of long
                                  computation, waiting if necessary*/
} catch (InterruptedException ex) { cleanup(); return; }
catch (ExecutionException ex) { cleanup(); return; }
may be provided. The package also provides array classes where the individual array elements have volatile semantics.[3] The classes in the java.util.concurrent.atomic package are only useful when critical updates for an object are limited to a single variable. As a result, they are most commonly used as building blocks for higher-level constructs.

[3] In the Java language, declaring an array to be volatile only makes the reference to the array volatile, not the individual elements.

The following code uses the getAndIncrement method of the AtomicLong class to implement a thread-safe sequencer. The getAndIncrement method atomically obtains the value of the variable, increments it, and returns the original value. Such a sequencer could be used to implement a task queue in some master/worker designs.
class Sequencer
{
private AtomicLong sequenceNumber = new AtomicLong(0);
public long next() { return sequenceNumber.getAndIncrement(); }
}
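As a hypothetical demonstration (our own, not from the text), the sequencer can be exercised from several threads at once; because getAndIncrement is atomic, no sequence number is ever handed out twice.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Several threads draw numbers from one shared Sequencer and record
// them in a set; atomicity guarantees every drawn number is distinct.
class SequencerDemo {
    static class Sequencer {
        private final AtomicLong sequenceNumber = new AtomicLong(0);
        public long next() { return sequenceNumber.getAndIncrement(); }
    }

    public static void main(String[] args) throws InterruptedException {
        final Sequencer seq = new Sequencer();
        final Set<Long> seen =
            Collections.synchronizedSet(new HashSet<Long>());
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < 1000; j++) seen.add(seq.next());
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();
        // 4 threads x 1000 draws = 4000 distinct sequence numbers
        System.out.println("unique numbers issued: " + seen.size());
    }
}
```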
The curly braces delimit the block. The code for acquiring and releasing the lock is generated by the compiler.

Suppose we add a variable static int count to the ThinkParallel class to be incremented by each thread after it prints its message. This is a static variable, so there is one per class (not one per object), and it is visible to and thus shared by all the threads. To avoid race conditions, count could be accessed in a synchronized block.[4] To provide protection, all threads must use the same lock, so we use the object associated with the class itself. For any class X, X.class is a reference to the unique object representing class X, so we could write the following:
[4] Of course, for this particular situation, one could instead use an atomic variable as defined in the java.util.concurrent.atomic package.
public void run()
{ System.out.println(id + ": Are we there yet?");
synchronized(ThinkParallel.class){count++;}
}
It is important to emphasize that only synchronized blocks associated with the same object exclude each other. Two synchronized blocks associated with different objects could execute concurrently. For example, a common programming error would be to write the previous code as
public void run()
{ System.out.println(id + ": Are we there yet?");
synchronized(this){count++;} //WRONG!
}
In the buggy version, each thread would be synchronizing on the lock associated with the "self" or this object. This would mean that each thread locks a different lock (the one associated with the thread object itself) as it enters the synchronized block, so none of them would exclude each other.

Also, a synchronized block does not constrain the behavior of a thread that references a shared variable in code that is not in a synchronized block. It is up to the programmer to carefully ensure that all mentions of shared variables are appropriately protected.

Special syntax is provided for the common situation in which the entire method body should be inside a synchronized block associated with the this object. In this case, the synchronized keyword is used to modify the method declaration. That is,
public synchronized void updateSharedVariables(...)
{....body.... }
is shorthand for
public void updateSharedVariables(...)
{ synchronized(this){....body....} }
condition (which no longer holds) and corrupt the program state. Thus, checking the condition and performing the action need to be placed inside a synchronized block. On the other hand, the thread cannot hold the lock associated with the synchronized block while waiting because this would block other threads from access, preventing the condition from ever being established. What is needed is a way for a thread waiting on a condition to release the lock and then, after the condition has been satisfied, reacquire the lock before rechecking the condition and performing its action. Traditional monitors [Hoa74] were proposed to handle this situation. Java provides similar facilities.

The Object class, along with the previously mentioned lock, also contains an implicit wait set that serves as a condition variable. The Object class provides several versions of wait methods that cause the calling thread to implicitly release the lock and add itself to the wait set. Threads in the wait set are suspended and not eligible to be scheduled to run.
Figure C.6. Basic idiom for using wait. Because wait throws an InterruptedException,
it should somehow be enclosed in a try-catch block, omitted here.
synchronized(lockObject)
{ while( ! condition ){ lockObject.wait();}
action;
}
The basic idiom for using wait is shown in Fig. C.6. The scenario is as follows: The thread acquires the lock associated with lockObject. It checks condition. If the condition does not hold, then the body of the while loop, the wait method, is executed. This causes the lock to be released, suspends the thread, and places it in the wait set belonging to lockObject. If the condition does hold, the thread performs action and leaves the synchronized block. On leaving the synchronized block, the lock is released.
Threads leave the wait set in one of three ways. First, the Object class methods notify and notifyAll awaken one or all threads, respectively, in the wait set of that object. These methods are intended to be invoked by a thread that establishes the condition being waited upon. An awakened thread leaves the wait set and joins the threads waiting to reacquire the lock. The awakened thread will reacquire the lock before it continues execution. The wait method may be called without parameters or with timeout values. A thread that uses one of the timed wait methods (that is, one that is given a timeout value) may be awakened by notification as just described, or by the system at some point after the timeout has expired. Upon being reawakened, it will reacquire the lock and continue normal execution. Unfortunately, there is no indication of whether a thread was awakened by a notification or a timeout. The third way that a thread can leave the wait set is if it is interrupted. This causes an InterruptedException to be thrown, whereupon the control flow in the thread follows the normal rules for handling exceptions.
We now continue describing the scenario started previously for Fig. C.6, at the point at which the thread has waited and been awakened. When awakened by some other thread executing notify or notifyAll on lockObject (or by a timeout), the thread will be removed from the wait set. At some point, it will be scheduled for execution and will attempt to reacquire the lock associated with lockObject. After the lock has been reacquired, the thread will recheck the condition and either release the lock and wait again or, if the condition holds, execute action without releasing the lock.

It is the job of the programmer to ensure that waiting threads are properly notified after the condition has been established. Failure to do so can cause the program to stall. The following code illustrates using the notifyAll method after statements in the program that establish the condition:
synchronized(lockObject) {
establish_the_condition;
lockObject.notifyAll();
}
In the standard idiom, the call to wait is the body of a while loop. This ensures that the condition will always be rechecked before performing the action and adds a considerable degree of robustness to the program. One should never be tempted to save a few CPU cycles by changing the while loop to an if statement. Among other things, the while loop ensures that an extra notify method can never cause an error. Thus, as a first step, one can use notifyAll at any point that might possibly establish the condition. Performance of the program might be improved by careful analysis that would eliminate spurious notifyAlls, and in some programs, it may be possible to replace notifyAll with notify. However, these optimizations should be done carefully. An example illustrating these points is found in the Shared Queue pattern.
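The idiom can be packaged into a minimal self-contained class. The following one-shot "gate" is our own sketch under the idiom just described, not an example from the text: threads calling await block until some thread calls open.

```java
// Threads calling await() block until some thread calls open().
class Gate {
    private boolean open = false;

    public synchronized void await() throws InterruptedException {
        while (!open) {   // always recheck the condition in a loop
            wait();       // releases the lock while suspended
        }
        // the action would go here, still holding the lock
    }

    public synchronized void open() {
        open = true;      // establish the condition ...
        notifyAll();      // ... then wake every waiting thread
    }
}
```

Note that open sets the flag and notifies inside the same synchronized method, matching the notifyAll snippet above.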
C.5. LOCKS
The semantics of synchronized blocks together with wait and notify have certain deficiencies when used in the straightforward way described in the previous section. Probably the worst problem is that there is no access to information about the state of the associated implicit lock. This means that a thread cannot determine whether or not a lock is available before attempting to acquire it. Further, a thread blocked waiting for the lock associated with a synchronized block cannot be interrupted.[5] Another problem is that only a single (implicit) condition variable is associated with each lock. Thus, threads waiting for different conditions to be established share a wait set, with notify possibly waking the wrong thread (and forcing the use of notifyAll).
[5] This is discussed further in the next section.

For this reason, in the past many Java programmers implemented their own locking primitives or used third-party packages such as util.concurrent [Lea]. Now, the java.util.concurrent.locks package provides ReentrantLock,[6] which is similar to synchronized blocks, but with extended capabilities. The lock must be explicitly instantiated

[6] A lock is reentrant if it can be acquired multiple times by the same thread without causing deadlock.
//instantiate lock
private final ReentrantLock lock = new ReentrantLock();
Figure C.7. A version of SharedQueue2 (see the Shared Queue pattern) using a Lock and
Condition instead of synchronized blocks with wait and notify
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.locks.*;
class SharedQueue2 {
class Node
{ Object task;
Node next;
Node(Object task)
{this.task = task; next = null;}
and should always be used in a try-finally block, such as
//critical section
lock.lock(); // block until lock acquired
try { critical_section }
finally { lock.unlock(); }
Other methods allow information about the state of the lock to be acquired. These locks trade syntactic convenience and a certain amount of support by the compiler (it is impossible for the programmer to forget to release the lock associated with a synchronized block) for greater flexibility.

In addition, the package provides implementations of the new Condition interface that implements a condition variable. This allows multiple condition variables to be associated with a single lock. A Condition associated with a lock is obtained by calling the lock's newCondition method. The analogues of wait, notify, and notifyAll are await, signal, and signalAll. An example of using these new classes to implement a shared queue (as described in the Shared Queue pattern) is shown in Fig. C.7.
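As a minimal sketch of multiple Conditions on a single Lock (our own example, distinct from Fig. C.7), consider a one-slot buffer with separate wait sets for producers and consumers, so a put never wakes another producer and a take never wakes another consumer.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// One-slot buffer: "notFull" and "notEmpty" are two Conditions on the
// same ReentrantLock, something the implicit wait set cannot express.
class OneSlotBuffer {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notFull  = lock.newCondition();
    private final Condition notEmpty = lock.newCondition();
    private Object slot = null;

    public void put(Object x) throws InterruptedException {
        lock.lock();
        try {
            while (slot != null) notFull.await(); // analogue of wait()
            slot = x;
            notEmpty.signal();                    // analogue of notify()
        } finally { lock.unlock(); }
    }

    public Object take() throws InterruptedException {
        lock.lock();
        try {
            while (slot == null) notEmpty.await();
            Object x = slot;
            slot = null;
            notFull.signal();
            return x;
        } finally { lock.unlock(); }
    }
}
```

Each method follows the try-finally discipline shown earlier, so the lock is released even if await throws.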
Figure C.9. Program showing a parallel version of the sequential program in Fig. C.8
where each iteration of the big_comp loop is a separate task. A thread pool containing
ten threads is used to execute the tasks. A CountDownLatch is used to ensure that all of
the tasks have completed before executing the (still sequential) loop that combines
the results.
import java.util.concurrent.*;
class ParallelLoop {
   static ExecutorService exec;
   static CountDownLatch done;
   static int num_iters = 1000;
   static double [] res = new double [num_iters];
   static double answer = 0.0;
}
C.7. INTERRUPTS
Part of the state of a thread is its interrupt status. A thread can be interrupted using the interrupt method. This sets the interrupt status of the thread to interrupted.

If the thread is suspended (that is, it has executed a wait, sleep, join, or other command that suspends the thread), the suspension will be interrupted and an InterruptedException thrown. Because of this, the methods that can cause blocking, such as wait, throw this exception, and thus must either be called from a method that declares itself to throw the exception, or the call must be enclosed within a try-catch block. Because the signature of the run method in class Thread and interface Runnable does not include throwing this exception, a try-catch block must enclose, either directly or indirectly, any call to a blocking method invoked by a thread. This does not apply to the main thread, because the main method can be declared to throw an InterruptedException.
The interrupt status of a thread can be used to indicate that the thread should terminate. To enable this, the thread's run method should be coded to periodically check the interrupt status (using the isInterrupted or the interrupted method); if the thread has been interrupted, the thread should return from its run method in an orderly way. InterruptedExceptions can be caught and the handler used to ensure graceful termination if the thread is interrupted when waiting. In many parallel programs, provisions to externally stop a thread are not needed, and the catch blocks for InterruptedExceptions can either provide debugging information or simply be empty.
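A minimal sketch of this shutdown idiom (our own example, not from the text): the run method polls isInterrupted between units of work, and treats an InterruptedException thrown while sleeping as the same termination request.

```java
// Worker that terminates in an orderly way when interrupted.
class PoliteWorker implements Runnable {
    public volatile int iterations = 0;

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            iterations++;              // stand-in for a unit of real work
            try {
                Thread.sleep(10);      // a blocking call
            } catch (InterruptedException e) {
                // interrupted while sleeping: the status was cleared, so
                // simply return from run() to terminate gracefully
                return;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PoliteWorker w = new PoliteWorker();
        Thread t = new Thread(w);
        t.start();
        Thread.sleep(50);
        t.interrupt();                 // request termination
        t.join();
        System.out.println("stopped after " + w.iterations + " iterations");
    }
}
```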
The Callable interface was introduced as an alternative to Runnable that allows an exception to be thrown (and also, as discussed previously, allows a result to be returned). This interface exploits the support for generic types.
Glossary
Abstract data type (ADT).
A data type given by its set of allowed values and the available operations on those values. The values and operations are defined independently of a particular representation of the values or implementation of the operations. In a programming language that directly supports ADTs, the interface of the type reveals the operations on it, but the implementation is hidden and can (in principle) be changed without affecting clients that use the type. The classic example of an ADT is a stack, which is defined by its operations, typically including push and pop. Many different internal representations are possible.
Abstraction.
Abstraction can have several meanings depending on the context. In software, it often means combining a set of small operations or data items and giving them a name. For example, control abstraction takes a group of operations, combines them into a procedure, and gives the procedure a name. As another example, a class in object-oriented programming is an abstraction of both data and control. More generally, an abstraction is a representation that captures the essential character of an entity, but hides the specific details. Often we will talk about a named abstraction without concern for the actual details, which may not be determined.
Address space.
The range of memory locations that a processor can access. Depending on context, this could refer to either physical or virtual memory.
ADT.
See [abstract data type]
Amdahl's law.
A law stating that (under certain assumptions, as described in Sec. 2.5) the maximum speedup S(P) that can be obtained by running an algorithm on a system of P processors is

S(P) = 1 / (γ + (1 - γ)/P)

where γ is the serial fraction of the program.
See also [speedup]
See also [serial fraction]
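A worked numeric check of the formula (the numbers here are our own illustration): with a serial fraction γ = 0.1, even arbitrarily many processors can yield at most a 10x speedup.

```java
// Evaluate Amdahl's law S(P) = 1 / (gamma + (1 - gamma)/P).
class Amdahl {
    static double speedup(double gamma, int p) {
        return 1.0 / (gamma + (1.0 - gamma) / p);
    }

    public static void main(String[] args) {
        System.out.println(speedup(0.1, 10));    // about 5.26
        System.out.println(speedup(0.1, 1000));  // about 9.9, approaching 1/gamma = 10
    }
}
```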
AND parallelism.
This is one of the main techniques for introducing parallelism into a logic language. Consider the goal A :- B, C, D (read "A follows from B and C and D"), which means that goal A succeeds if and only if all three subgoals B and C and D succeed. In AND parallelism, subgoals B, C, and D are evaluated in parallel.
API.
See [application programming interface]
Application Programming Interface (API).
An API defines the calling conventions and other information needed for one software module (typically an application program) to utilize the services provided by another software module. MPI is an API for parallel programming. The term is sometimes used more loosely to define the notation used by programmers to express a particular functionality in a program. For example, the OpenMP specification is referred to as an API. An important aspect of an API is that any program coded to it can be recompiled to run on any system that supports that API.
Atomic.
Atomic has slightly different meanings in different contexts. An atomic operation at the hardware level is uninterruptible, for example load and store, or atomic test-and-set instructions. In the database world, an atomic operation (or transaction) is one that appears to execute completely or not at all. In parallel programming, an atomic operation is one for which sufficient synchronization has been provided that it cannot be interfered with by other UEs. Atomic operations also must be guaranteed to terminate (for example, no infinite loops).
Autoboxing.
A language feature, available in Java 2 1.5, that provides automatic conversion of data of a primitive type to the corresponding wrapper type, for example, from int to Integer.
Bandwidth.
The capacity of a system, usually expressed as items per second. In parallel computing, the most common usage of the term "bandwidth" is in reference to the number of bytes per second that can be moved across a network link. A parallel program that generates relatively small numbers of huge messages may be limited by the bandwidth of the network, in which case it is called a bandwidth-limited program.
See also [bisection bandwidth]
Barrier.
A synchronization mechanism applied to groups of UEs, with the property that no UE in the group can pass the barrier until all UEs in the group have reached the barrier. In other words, UEs arriving at the barrier suspend or block until all UEs have arrived; they can then all proceed.
Beowulf cluster.
A cluster built from PCs running the Linux operating system. Clusters were already well established when Beowulf clusters were first built in the early 1990s. Prior to Beowulf, however, clusters were built from workstations running UNIX. By dropping the cost of cluster hardware, Beowulf clusters dramatically increased access to cluster computing.
Bisection bandwidth.
The bidirectional capacity of a network between two equal-sized partitions of nodes. The cut across the network is taken at the narrowest point in each bisection of the network.
Broadcast.
Sending a message to all members of a group of recipients, usually all UEs participating in a computation.
Cache.
A relatively small region of memory that is local to a processor and is considerably faster than the computer's main memory. Cache hierarchies consisting of one or more levels of cache are essential in modern computer systems. Because processors are so much faster than the computer's main memory, a processor can run at a significant fraction of full speed only if the data can be loaded into cache before it is needed and that data can be reused during a calculation. Data is moved between the cache and the computer's main memory in small blocks of bytes called cache lines. An entire cache line is moved when any byte within the memory mapped to the cache line is accessed. Cache lines are removed from the cache according to some protocol when the cache becomes full and space is needed for other data, or when they are accessed by some other processor. Usually each processor has its own cache (though sometimes multiple processors share a level of cache), so keeping the caches coherent (that is, ensuring that all processors have the same view of memory through their distinct caches) is an issue that must be dealt with by computer architects and compiler writers. Programmers must be aware of caching issues when optimizing the performance of software.
ccNUMA.
Cache-coherent NUMA. A NUMA model where data is coherent at the level of the cache.
See also [NUMA]
Cluster.
Any collection of distinct computers that are connected and used as a parallel computer, or to form a redundant system for higher availability. The computers in a cluster are not specialized to cluster computing and could, in principle, be used in isolation as standalone computers. In other words, the components making up the cluster, both the computers and the networks connecting them, are not custom-built for use in the cluster. Examples include Ethernet-connected workstation networks and rack-mounted workstations dedicated to parallel computing.
See also [workstation farm]
Collective communication.
A high-level operation involving a group of UEs and having at its core the cooperative exchange of information between the UEs. The high-level operation might be a pure communication event (for example, a broadcast) or it might include some computation (for example, a reduction).
See also [broadcast]
See also [reduction]
Concurrent execution.
A condition in which two or more UEs are active and making progress simultaneously. This can be either because they are being executed at the same time on different PEs, or because the actions of the UEs are interleaved on the same PE.
Concurrent program.
A program with multiple loci of control (threads, processes, etc.).
Condition variable.
Condition variables are part of the monitor synchronization mechanism. A condition variable is used by a process or thread to delay until the monitor's state satisfies some condition; it is also used to awaken a delayed process when the condition becomes true. Associated with each condition variable is a wait set of suspended (delayed) processes or threads. Operations on a condition variable include wait (add this process or thread to the wait set for this variable) and signal or notify (awaken a process or thread on the wait set for this variable).
See also [monitor]
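The wait/notify protocol can be sketched with Python's threading.Condition (an illustrative example, not one of the book's; the BoundedBuffer class is hypothetical):

```python
import threading

# A buffer guarded by a monitor-style lock plus a condition variable.
# A consumer waits until the buffer is nonempty; a producer notifies
# waiters after adding an item.
class BoundedBuffer:
    def __init__(self):
        self._items = []
        self._lock = threading.Lock()
        self._nonempty = threading.Condition(self._lock)

    def put(self, item):
        with self._lock:
            self._items.append(item)
            self._nonempty.notify()      # awaken one waiting consumer

    def get(self):
        with self._lock:
            while not self._items:       # recheck the condition on wakeup
                self._nonempty.wait()    # release lock, join the wait set
            return self._items.pop(0)

buf = BoundedBuffer()
results = []
consumer = threading.Thread(target=lambda: results.append(buf.get()))
consumer.start()        # blocks in wait() until a put() occurs
buf.put(42)
consumer.join()
print(results)  # [42]
```

Note the `while` loop around `wait()`: a woken thread must recheck the condition, since another thread may have consumed the item first.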
Copy on write.
A technique that ensures, using minimal synchronization, that concurrent threads will never see a data structure in an inconsistent state. To update the structure, a copy is made, modifications are made on the copy, and then the reference to the old structure is atomically replaced with a reference to the new one. This means that a thread holding a reference to the old structure may continue to read an old (consistent) version, but no thread will ever see the structure in an inconsistent state. Synchronization is only needed to acquire and update the reference to the structure, and to serialize the updates.
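A minimal copy-on-write sketch in Python (the CowList class is a hypothetical illustration; it relies on immutable tuples as snapshots and a lock that only serializes writers):

```python
import threading

# Readers grab the current reference with no lock; writers copy,
# modify the copy, and swap in the new reference under a lock that
# serializes only the updates.
class CowList:
    def __init__(self):
        self._data = ()                  # immutable snapshot
        self._write_lock = threading.Lock()

    def snapshot(self):
        return self._data                # readers see a consistent tuple

    def append(self, item):
        with self._write_lock:           # serialize updates only
            self._data = self._data + (item,)  # build copy, swap reference

cow = CowList()
cow.append(1)
view = cow.snapshot()   # this old snapshot stays valid...
cow.append(2)           # ...even after a writer swaps in a new one
print(view, cow.snapshot())  # (1,) (1, 2)
```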
Counting semaphore.
Counting semaphores are semaphores whose state can represent any integer. Some implementations allow the P and V operations to take an integer parameter and increment or decrement the state (atomically) by that value.
See also [semaphore]
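Python's threading.Semaphore is a counting semaphore and can illustrate the P and V operations (acquire and release in Python's naming; the three-slot scenario is made up for the example):

```python
import threading

# The semaphore's value models three available "slots": each P
# (acquire) takes one, each V (release) returns one.
slots = threading.Semaphore(3)

held = 0
for _ in range(3):
    slots.acquire()      # P: decrement; would suspend at zero
    held += 1

# A fourth P would suspend, so poll instead of waiting:
got_fourth = slots.acquire(blocking=False)
slots.release()          # V: increment the value by one
print(held, got_fourth)  # 3 False
```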
Cyclic distribution.
A distribution of data (for example, components of arrays) or tasks (for example, loop iterations) produced by dividing the set into a number of blocks greater than the number of UEs and then allocating those blocks to UEs in a cyclic manner analogous to dealing a deck of cards.
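The card-dealing analogy is easy to make concrete; this hypothetical helper deals 10 blocks to 3 UEs round-robin:

```python
# Block b goes to UE b mod num_ues, exactly like dealing cards.
def cyclic_distribution(num_blocks, num_ues):
    assignment = {ue: [] for ue in range(num_ues)}
    for block in range(num_blocks):
        assignment[block % num_ues].append(block)
    return assignment

print(cyclic_distribution(10, 3))
# {0: [0, 3, 6, 9], 1: [1, 4, 7], 2: [2, 5, 8]}
```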
Data parallel.
A type of parallel computing in which the concurrency is expressed by applying a single stream of instructions simultaneously to the elements of a data structure.
Deadlock.
An error condition common in parallel programming in which the computation has stalled because a group of UEs are blocked and waiting for each other in a cyclic configuration.
Design pattern.
A design pattern is a "solution to a problem in context"; that is, it represents a high-quality solution to a recurring problem in design.
Distributed computing.
A type of computing in which a computational task is divided into subtasks that execute on a collection of networked computers. The networks are general-purpose networks (LANs, WANs, or the Internet) as opposed to dedicated cluster interconnects.
Distributed shared memory (DSM).
An address space shared among multiple UEs that is constructed from memory subsystems that are distinct and distributed about the system. There may be operating system and hardware support for the distributed shared memory system, or the shared memory may be implemented entirely in software as a separate middleware layer.
See also [virtual shared memory]
DSM.
See [distributed shared memory]
Eager evaluation.
A scheduling strategy where the evaluation of an expression, or execution of a procedure, can occur as soon as (but not before) all of its arguments have been evaluated. Eager evaluation is typical for most programming environments and contrasts with lazy evaluation. Eager evaluation can sometimes lead to extra work (or even nontermination) when an argument that will not actually be needed in a computation must be computed anyway.
Efficiency.
The efficiency E of a computation is the speedup normalized by the number of PEs (P). It is given by

    E = S(P) / P

and indicates how effectively the resources in a parallel computer are used.
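A worked example using the definitions above, with made-up timings (the speedup S(P) = T(1)/T(P) is defined under Speedup below):

```python
# Speedup S(P) = T(1)/T(P); efficiency E = S(P)/P.
def speedup(t1, tp):
    return t1 / tp

def efficiency(t1, tp, p):
    return speedup(t1, tp) / p

t1, t16 = 64.0, 5.0   # hypothetical serial and 16-PE runtimes
print(speedup(t1, t16), efficiency(t1, t16, 16))  # 12.8 0.8
```

An efficiency of 0.8 means the 16 PEs are, on average, doing useful work 80% of the time.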
Embarrassingly parallel.
A task-parallel algorithm in which the tasks are completely independent. See the Task Parallelism pattern.
Explicitly parallel language.
A parallel programming language in which the programmer fully defines the concurrency and how it will be exploited in a parallel computation. OpenMP, Java, and MPI are explicitly parallel languages.
Factory.
A class with methods to create objects, usually instances of any one of several subclasses of an abstract base class. Design patterns for factory classes (Abstract Factory and Factory Method) were given in [GHJV95].
False sharing.
False sharing occurs when two semantically independent variables reside in the same cache line and UEs running on multiple processors modify these variables. Because the variables are semantically independent, memory conflicts are avoided, but the cache line holding the variables must be shuffled between the processors, and performance suffers.
Fork.
See [fork/join]
Fork/join.
A programming model used in multithreaded APIs such as OpenMP. A thread executes a fork and creates additional threads. The threads (called a team in OpenMP) execute concurrently. When the members of the team complete their concurrent tasks, they execute joins and suspend until every member of the team has arrived at the join. At that point, the members of the team are destroyed and the original thread continues.
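The fork/join structure can be sketched with plain Python threads (an illustrative sketch, not OpenMP; the work function and team size are made up):

```python
import threading

# The initial thread forks a team of threads, each computes one
# partial result, then the parent joins them all before continuing.
partial = [0] * 4

def work(i):
    partial[i] = i * i

team = [threading.Thread(target=work, args=(i,)) for i in range(4)]
for t in team:
    t.start()       # fork
for t in team:
    t.join()        # join: wait for every team member

print(partial)  # [0, 1, 4, 9]
```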
Framework.
A reusable, partially complete program that embodies a design for applications in a particular domain. Programmers complete the program by providing application-specific components.
Future variable.
A mechanism used in some parallel programming environments for coordinating the execution of UEs. The future variable is a special variable that will eventually hold the result from an asynchronous computation. For example, Java (in package java.util.concurrent) contains a class Future to hold future variables.
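Python's concurrent.futures offers an analogue of Java's Future and can illustrate the idea (the slow_square function is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# submit() starts an asynchronous computation and immediately returns
# a Future; result() blocks until the value has been filled in.
def slow_square(x):
    return x * x

with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(slow_square, 7)   # runs concurrently with the caller
    value = fut.result()                # rendezvous with the result

print(value)  # 49
```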
Generics.
Programming language features that allow programs to contain placeholders for certain entities, typically types. The generic component's definition is completed before it is used in a program. Generics are included in Ada, C++ (via templates), and Java.
Grid.
A grid is an architecture for distributed computing and resource sharing. A grid system is composed of a heterogeneous collection of resources connected by local-area and/or wide-area networks (often the Internet). These individual resources are general and include compute servers, storage, application servers, information services, or even scientific instruments. Grids are often implemented in terms of Web services and integrated middleware components that provide a consistent interface to the grid. A grid is different from a cluster in that the resources in a grid are not controlled through a single point of administration; the grid middleware manages the system so control of resources on the grid and the policies governing use of the resources remain with the resource owners.
Heterogeneous.
A heterogeneous system is constructed from components of more than one kind. An example is a distributed system with a variety of processor types.
Homogeneous.
The components of a homogeneous system are all of the same kind.
Hypercube.
A multicomputer in which the nodes are placed at the vertices of a d-dimensional cube. The most frequently used configuration is a binary hypercube, in which each of the 2^n nodes is connected to n others.
Implicitly parallel language.
A parallel programming language in which the details of what can execute concurrently and how that concurrency is implemented are left to the compiler. Most parallel functional and dataflow languages are implicitly parallel.
Incremental parallelism.
Incremental parallelism is a technique for parallelizing an existing program, in which the parallelization is introduced as a sequence of incremental changes, parallelizing one loop at a time. Following each transformation, the program is tested to ensure that its behavior is the same as the original program, greatly decreasing the chances of introducing undetected bugs.
See also [refactoring]
Java Virtual Machine (JVM).
An abstract stack-based computing machine whose instruction set is called Java bytecode. Typically, Java programs are compiled into class files containing bytecode, a symbol table, and other information. The purpose of the JVM is to provide a consistent execution environment for class files regardless of the underlying platform.
Join.
See [fork/join]
JVM.
See [Java Virtual Machine]
Latency.
The fixed cost of servicing a request, such as sending a message or accessing information from a disk. In parallel computing, the term most often is used to refer to the time it takes to send an empty message over the communication medium, from the time the send routine is called to the time the empty message is received by the recipient. Programs that generate large numbers of small messages are sensitive to the latency and are called latency-bound programs.
Lazy evaluation.
A scheduling policy that does not evaluate an expression (or invoke a procedure) until the results of the evaluation are needed. Lazy evaluation may avoid some unnecessary work and in some situations may allow a computation to terminate that otherwise would not. Lazy evaluation is often used in functional and logic programming.
Linda.
A coordination language for parallel programming.
See also [tuple space]
Load balance.
In a parallel computation, tasks are assigned to UEs, which are then mapped onto PEs for execution. The work carried out by the collection of PEs is the "load" associated with the computation. Load balance refers to how that load is distributed among the PEs. In an efficient parallel program, the load is balanced so each PE spends about the same amount of time on the computation. In other words, in a program with good load balance, each PE finishes with its share of the load at about the same time.
Load balancing.
The process of distributing work to UEs such that each UE involved in a parallel computation takes approximately the same amount of time. There are two major forms of load balancing. In static load balancing, the distribution of work is determined before the computation starts. In dynamic load balancing, the load is modified as the computation proceeds (that is, during runtime).
Locality.
The extent to which the computations carried out by a PE use data that is associated with (that is, is close to) that PE. For example, in many dense linear algebra problems, the key to high performance is to decompose matrices into blocks and then structure the calculations in terms of these blocks so data brought into a processor's cache is used many times. This is an example of an algorithm transformation that increases locality in a computation.
Massively parallel processor (MPP).
A distributed-memory parallel computer designed to scale to hundreds if not thousands of processors. To better support high scalability, the computer elements or nodes in the MPP machine are custom-designed for use in a scalable computer. This typically includes tight integration between the computing elements and the scalable network.
Message Passing Interface (MPI).
A standard message-passing interface adopted by most MPP vendors as well as by the cluster-computing community. The existence of a widely supported standard enhances program portability; an MPI-based program developed for one platform should also run on any other platform for which an implementation of MPI exists.
MIMD (Multiple Instruction, Multiple Data).
One of the categories of architectures in Flynn's taxonomy of computer architectures. In a MIMD system, each PE has its own stream of instructions operating on its own data. The vast majority of modern parallel systems use the MIMD architecture.
Monitor.
Monitors are a synchronization mechanism originally proposed by Hoare [Hoa74]. A monitor is an ADT implementation that guarantees mutually exclusive access to its internal data. Conditional synchronization is provided by condition variables.
See also [condition variable]
MPI.
See [Message Passing Interface]
MPP.
See [massively parallel processor]
Multicomputer.
A parallel computer based on a distributed-memory, MIMD parallel architecture. The system appears to the user as a single computer.
Multiprocessor.
A parallel computer with multiple processors that share an address space.
Mutex.
A mutual exclusion lock. A mutex serializes the execution of multiple threads.
Node.
Common term for the computational elements that make up a distributed-memory parallel machine. Each node has its own memory and at least one processor; that is, a node can be a uniprocessor or some type of multiprocessor.
NUMA.
This term is used to describe a shared-memory computer system where not all memory is equidistant from all processors. Thus, the time required to access memory locations is not uniform, and for good performance the programmer usually needs to be concerned with the placement of data in the memory.
Opaque type.
A type that can be used without knowledge of the internal representation. Instances of the opaque type can be created and manipulated via a well-defined interface. The data types used for MPI communicators and OpenMP locks are examples.
OpenMP.
A specification defining compiler directives, library routines, and environment variables that can be used to express shared-memory parallelism in Fortran and C/C++ programs. OpenMP implementations exist for a large variety of platforms.
OR parallelism.
An execution technique in parallel logic languages in which multiple clauses can be evaluated in parallel. For example, consider a problem with two clauses: A :- B, C and A :- E, F. The clauses can execute in parallel until one of them succeeds.
Parallel file system.
A file system that is visible to any processor in the system and can be read and written by multiple UEs simultaneously. Although a parallel file system appears to the computer system as a single file system, it is physically distributed among a number of disks. To be effective, the aggregate throughput for read and write must be scalable.
Parallel overhead.
The time spent in a parallel computation managing the computation rather than computing results. Contributors to parallel overhead include thread creation and scheduling, communication, and synchronization.
PE.
See [processing element]
Peer-to-peer computing.
A distributed computing model in which each node has equal standing among the collection of nodes. In the most typical usage of this term, the same capabilities are offered by each node, and any node can initiate a communication session with another node. This contrasts with, for example, client-server computing. The capabilities that are shared in peer-to-peer computing include file sharing as well as computation.
POSIX.
The Portable Operating System Interface as defined by the Portable Applications Standards Committee (PASC) of the IEEE Computer Society. Whereas other operating systems follow some of the POSIX standards, the primary use of this term refers to the family of standards that define the interfaces in UNIX and UNIX-like (for example, Linux) operating systems.
Precedence graph.
A way of representing the order constraints among a collection of statements. The nodes of the graph represent the statements, and there is a directed edge from node A to node B if statement A must be executed before statement B. A precedence graph with a cycle represents a collection of statements that cannot be executed without deadlock.
Process.
A collection of resources that enable the execution of program instructions. These resources can include virtual memory, I/O descriptors, a runtime stack, signal handlers, user and group IDs, and access control tokens. A more high-level view is that a process is a "heavyweight" UE with its own address space.
See also [unit of execution]
See also [thread]
Process migration.
Changing the processor responsible for running a process during execution. Process migration is commonly used to dynamically balance the load on multiprocessor systems. It is also used to support fault-tolerant computing by moving processes away from failing processors.
Processing element (PE).
A generic term used to reference a hardware element that executes a stream of instructions. The context defines what unit of hardware is considered a PE. Consider a cluster of SMP workstations. In some programming environments, each workstation is viewed as executing a single instruction stream; in this case, a PE is a workstation. A different programming environment running on the same hardware, however, might view each processor of the individual workstations as executing an individual instruction stream; in this case, the PE is the processor rather than the workstation.
Programming environment.
Programming environments provide the basic tools and APIs needed to construct programs. A programming environment implies a particular abstraction of the computer system called a programming model.
Programming model.
Abstraction of a computer system, for example the von Neumann model used in traditional sequential computers. For parallel computing, there are many possible models typically reflecting different ways processors can be interconnected. The most common are based on shared memory, distributed memory with message passing, or a hybrid of the two.
Pthreads.
Another name for POSIX threads, that is, the definition of threads in the various POSIX standards.
See also [POSIX]
PVM (Parallel Virtual Machine).
A message-passing library for parallel computing. PVM played an important role in the history of parallel computing as it was the first portable message-passing programming environment to gain widespread use in the parallel computing community. It has largely been superseded by MPI.
Race condition.
An error condition peculiar to parallel programs in which the outcome of a program changes as the relative scheduling of UEs varies.
Reader/writer locks.
This pair of locks is similar to mutexes except that multiple UEs can hold a read lock, whereas a write lock excludes both other writers and all readers. Reader/writer locks are often effective when resources protected by the lock are read far more often than they are written.
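Python's standard library has no reader/writer lock, so a minimal reader-preference version can be sketched by hand (a hypothetical illustration, not production code):

```python
import threading

# Many readers may hold the lock at once; a writer needs exclusive
# access. The first reader in locks writers out; the last reader out
# lets them back in.
class ReadWriteLock:
    def __init__(self):
        self._readers = 0
        self._mutex = threading.Lock()   # guards the reader count
        self._writer = threading.Lock()  # held while any access is active

    def acquire_read(self):
        with self._mutex:
            self._readers += 1
            if self._readers == 1:
                self._writer.acquire()   # first reader locks out writers

    def release_read(self):
        with self._mutex:
            self._readers -= 1
            if self._readers == 0:
                self._writer.release()   # last reader lets writers in

    def acquire_write(self):
        self._writer.acquire()

    def release_write(self):
        self._writer.release()

rw = ReadWriteLock()
rw.acquire_read()
rw.acquire_read()   # a second concurrent reader is fine
# Peek at the internal lock just for this demo: a writer is excluded.
writer_waits = not rw._writer.acquire(blocking=False)
rw.release_read()
rw.release_read()
print(writer_waits)  # True
```

Note this version can starve writers under a steady stream of readers; fairer variants track waiting writers as well.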
Reduction.
An operation that takes a collection of objects (usually one on each UE) and combines them into a single object on one UE or combines them such that each UE has a copy of the combined object. Reductions typically involve combining a set of values pairwise using an associative, commutative operator, such as addition or max.
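A sum reduction can be sketched with a thread pool (illustrative only; the chunking scheme and worker count are made up):

```python
from concurrent.futures import ThreadPoolExecutor

# Each UE produces a partial sum; the partials are then combined
# with the associative, commutative operator +.
data = list(range(100))
chunks = [data[i::4] for i in range(4)]     # one chunk per UE

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum, chunks))  # local reductions

total = 0
for p in partials:      # combine partials into the single result
    total += p

print(total)  # 4950, the same answer as sum(data)
```

Because + is associative and commutative, the partials can be combined in any order, which is what lets the local reductions run concurrently.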
Refactoring.
Refactoring is a software engineering technique in which a program is restructured carefully so as to alter its internal structure without changing its external behavior. The restructuring occurs through a series of small transformations (called refactorings) that can be verified as preserving behavior following each transformation. The system is fully working and verifiable following each transformation, greatly decreasing the chances of introducing serious, undetected bugs. Incremental parallelism can be viewed as an application of refactoring to parallel programming.
See also [incremental parallelism]
Remote procedure call (RPC).
A procedure invoked in a different address space than the caller, often on a different machine. Remote procedure calls are a popular approach for interprocess communication and launching remote processes in distributed client-server computing environments.
RPC.
See [remote procedure call]
Semaphore.
An ADT used to implement certain kinds of synchronization. A semaphore has a value that is constrained to be a nonnegative integer and two atomic operations. The allowable operations are V (sometimes called up) and P (sometimes called down). A V operation increases the value of the semaphore by one. A P operation decreases the value of the semaphore by one, provided that can be done without violating the constraint that the value be nonnegative. A P operation that is initiated when the value of the semaphore is 0 suspends. It may continue when the value is positive.
Serial fraction.
Most computations consist of parts that contain exploitable concurrency and parts that must be executed serially. The serial fraction is that fraction of the program's execution time taken up by the parts that must execute serially. For example, if a program decomposes into setup, compute, and finalization, we could write

    T_total = T_setup + T_compute + T_finalization

If the setup and finalization phases must execute serially, then the serial fraction would be

    γ = (T_setup + T_finalization) / T_total
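A quick numerical instance of the serial fraction, with hypothetical phase timings:

```python
# Made-up timings for the setup, compute, and finalization phases.
t_setup, t_compute, t_final = 2.0, 36.0, 2.0
t_total = t_setup + t_compute + t_final

serial_fraction = (t_setup + t_final) / t_total
print(serial_fraction)  # 0.1
```

A serial fraction of 0.1 caps the achievable speedup at 10, no matter how many PEs are applied to the compute phase (Amdahl's law).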
Shared address space.
An addressable block of memory that is shared between a collection of UEs.
Shared memory.
A term applied to both hardware and software indicating the presence of a memory region that is shared between system components. For programming environments, the term means that memory is shared between processes or threads. Applied to hardware, it means that the architectural feature tying processors together is shared memory.
See also [shared address space]
Shared nothing.
A distributed-memory MIMD architecture where nothing other than the local area network is shared between the nodes.
Simultaneous multithreading (SMT).
An architectural feature of some processors that allows multiple threads to issue instructions on each cycle. In other words, SMT allows the functional units that make up the processor to work on behalf of more than one thread at the same time. Examples of systems utilizing SMT are microprocessors from Intel Corporation that use Hyper-Threading Technology.
SIMD (Single Instruction, Multiple Data).
One of the categories in Flynn's taxonomy of computer architectures. In a SIMD system, a single instruction stream runs synchronously on multiple processors, each with its own data stream.
Single-assignment variable.
A special kind of variable to which a value can be assigned only once. The variable initially is in an unassigned state. After a value has been assigned, it cannot be changed. These variables are commonly used with programming environments that employ a dataflow control strategy, with tasks waiting to fire until all input variables have been assigned.
SMP.
See [symmetric multiprocessor]
SMT.
See [simultaneous multithreading]
Speedup.
Speedup, S, is a multiplier indicating how many times faster the parallel program is than its sequential counterpart. It is given by

    S = T(1) / T(P)

where T(n) is the execution time on a system with n PEs. When the speedup equals the number of PEs in the parallel computer, the speedup is said to be perfectly linear.
Single Program, Multiple Data (SPMD).
This is the most common way to organize a parallel program, especially on MIMD computers. The idea is that a single program is written and loaded onto each node of a parallel computer. Each copy of the single program runs independently (aside from coordination events), so the instruction streams executed on each node can be completely different. The specific path through the code is in part selected by the node ID.
SPMD.
See [Single Program, Multiple Data]
Stride.
The increment used when stepping through a structure in memory. The precise meaning of stride is context-dependent. For example, in an M x N array stored in column-major order in a contiguous block of memory, traversing the elements of a column of the matrix involves a stride of one. In the same example, traversing across a row requires a stride of M.
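The column-major example can be checked with a little index arithmetic (a sketch; the 4 x 3 shape and `at` helper are made up):

```python
# An M x N array stored column by column in one contiguous block.
# Walking down a column is stride 1; walking across a row is stride M.
M, N = 4, 3
flat = list(range(M * N))          # column-major storage

def at(i, j):                      # element (i, j), 0-based
    return flat[i + j * M]

column0 = [at(i, 0) for i in range(M)]   # addresses 0,1,2,3 -> stride 1
row0 = [at(0, j) for j in range(N)]      # addresses 0,4,8   -> stride M

print(column0, row0)  # [0, 1, 2, 3] [0, 4, 8]
```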
Symmetric multiprocessor (SMP).
A shared-memory computer in which every processor is functionally identical and has equal-time access to every memory address. In other words, both memory addresses and operating system services are equally available to every processor.
Synchronization.
Enforcing constraints on the ordering of events occurring in different UEs. This is primarily used to ensure that shared resources are accessed by a collection of UEs in such a way that the program is correct regardless of how the UEs are scheduled.
Systolic array.
A parallel architecture consisting of an array of processors with each processor connected to a small number of its nearest neighbors. Data flows through the array. As data arrives at a processor, it carries out its assigned operations and then passes the output to one or more of its nearest neighbors. Although each processor in a systolic array can run a distinct stream of instructions, they progress in lockstep, alternating between computation and communication phases. Hence, systolic arrays have a great deal in common with the SIMD architecture.
Systolic algorithm.
A parallel algorithm where tasks operate synchronously with a regular nearest-neighbor communication pattern. Many computational problems can be formulated as systolic algorithms by reformulating them as a certain type of recurrence relation.
Task.
A task is a sequence of instructions that operate together as a group. This group corresponds to some logical part of an algorithm or program.
Task queue.
A queue that holds tasks for execution by one or more UEs. Task queues are commonly used to implement dynamic scheduling algorithms in programs using the Task Parallelism pattern, particularly when used with the Master/Worker pattern.
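A master/worker skeleton built on a shared task queue can be sketched as follows (illustrative; the squaring tasks and sentinel protocol are made up for the example):

```python
import queue
import threading

# The master fills a shared queue; workers repeatedly pull the next
# task until a sentinel (None) tells them to stop.
tasks = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        t = tasks.get()
        if t is None:          # poison pill: no more work
            break
        results.put(t * t)

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
for t in range(8):             # master enqueues the tasks
    tasks.put(t)
for _ in workers:
    tasks.put(None)            # one sentinel per worker
for w in workers:
    w.join()

collected = sorted(results.get() for _ in range(8))
print(collected)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Because workers pull tasks as they become free, the load balances dynamically: a slow task on one worker does not hold up the others.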
Thread.
A fundamental unit of execution on certain computers. In a UNIX context, threads are associated with a process and share the process's environment. This makes the threads lightweight (that is, a context switch between threads is cheap). A more high-level view is that a thread is a "lightweight" unit of execution that shares an address space with other threads.
See also [unit of execution]
See also [process]
Transputer.
The transputer is a microprocessor developed by Inmos Ltd. with on-chip support for parallel processing. Each processor contains four high-speed communication links that are easily connected to the links of other transputers and a very efficient built-in scheduler for multiprocessing.
Tuple space.
A shared-memory system where the elements held in the memory are compound objects known as tuples. A tuple is a small set of fields holding values or variables, as in the following examples:
(3, "the larch", 4)
(X, 47, [2, 4, 89, 3])
("done")
As seen in these examples, the fields making up a tuple can hold integers, strings, variables, arrays, or any other value defined in the base programming language. Whereas traditional memory systems access objects through an address, tuples are accessed by association. The programmer working with a tuple space defines a template and asks the system to deliver tuples matching the template. Tuple spaces were created as part of the Linda coordination language [CG91]. The Linda language is small, with only a handful of primitives to insert tuples, remove tuples, and fetch a copy of a tuple. It is combined with a base language, such as C, C++, or Fortran, to create a combined parallel programming language. In addition to its original implementations on machines with a shared address space, Linda was also implemented with a virtual shared memory and used to communicate between UEs running on the nodes of distributed-memory computers. The idea of an associative virtual shared memory as inspired by Linda has been incorporated into JavaSpaces [FHA99].
UE.
See [unit of execution]
Unit of execution (UE).
Generic term for one of a collection of concurrently executing entities, usually either processes or threads.
See also [process]
See also [thread]
Vector supercomputer.
A supercomputer with a vector hardware unit as an integral part of its central processing unit boards. The vector hardware processes arrays in a pipeline fashion.
Virtual shared memory.
A system that provides the abstraction of shared memory, allowing programmers to write to a shared memory even when the underlying hardware is based on a distributed-memory architecture. Virtual shared memory systems can be implemented within the operating system or as part of the programming environment.
Workstation farm.
A cluster constructed from workstations typically running some version of UNIX. In some cases, the term "farm" is used to imply that the system will be used to run large numbers of independent sequential jobs as opposed to parallel computing.