Lesson 4 Notes
Lesson 3 Notes
Introduction
In this lesson we're going to take a look at the actual code we ran in the last lesson. Then you'll write your own MapReduce code.
If you recall, the code found the total sales per store. Typically, the data would have come from database tables and we'd use Sqoop to import it into HDFS.
Input Data
So first let's take a closer look at the input data format. Remember that each Mapper processes a portion of the input data, and each one will be given a line at a time. The lines look like this:
The Mapper needs to take that line and extract the information it needs. Often when we're dealing with text it's pretty free-form, so we'd use something like a regular expression. But in this case, it's nice and regular: it's tab-delimited. So we can just split the line based on tabs and extract the values for all fields.
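To make the tab splitting concrete, here is a minimal Python sketch. The notes only establish that each line is tab-delimited and has six fields, including a time, a store name, an item description and a cost. The exact field names, their order, the `parse_line` helper and the sample line below are assumptions made for illustration.

```python
# Assumed field names and order (hypothetical; only the count of six
# fields and the tab delimiter come from the notes).
FIELD_NAMES = ["date", "time", "store", "item", "cost", "payment"]


def parse_line(line):
    """Split one tab-delimited line into a dict of named fields."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELD_NAMES, values))


# Hypothetical sample line, not taken from the real data set.
sample = "2012-01-01\t09:00\tMiami\tBooks\t12.50\tVisa\n"
record = parse_line(sample)
print(record["store"], record["cost"])
```

Every value is still a string after the split, so the cost would need converting before any arithmetic.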
Quiz: How To Find Total Sales?
Question: You've been asked to calculate the total sales for each store. What should you use as the intermediate key and value?
    Key          Value
[ ] time         store name
[ ] cost         store name
[ ] store name   cost
[ ] store name   item description
Copyright 2014 Udacity, Inc. All Rights Reserved.
Answer:
The correct answer is the store name as the key and the cost (the sale amount) as the value.
Defensive Mapper Code
Here's our Mapper code. Let's look at it line by line. We're going to loop around standard input, which will give us a line at a time. Of course, the line will have a newline character at the end of it, so let's strip out that, plus any other whitespace around the line, and since our line is tab-delimited, we can split it at the same time. That gives us an array, which we'll call data.
Programming Quiz: Finish The Mapper
As you use Hadoop more and more, you'll discover that the more data you have, the more likely you are to encounter weirdness in that data. Lines will be malformed, and there will be strange log messages in the data. You're going to come across every strange edge case you can imagine, and plenty that you can't. So here, you should make sure that no matter what kind of malformed line the file has, the mapper can continue working. You wouldn't want your 2 TB processing job to die partway through.
So, we'd like you to add some defensive programming to make sure things don't break if you get a strange line in the middle of your data.
Answer:
In this case, we're just checking that the line actually has six fields. If it doesn't, we'll just ignore that line. But another good thing to do would be to check that the cost is actually a valid number.
Assuming that the line is OK, we'll simply write our intermediate data out in the form of the key, then a tab, then the value. And then we loop back and read the next line from our input file.
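The defensive checks described above might look roughly like the following sketch. It is not the course's original mapper. The function name `run_mapper`, the field order and the sample lines are assumptions, and an in-memory stream stands in for standard input so the example runs on its own.

```python
import io


def run_mapper(lines, out):
    """Write 'store<TAB>cost' for every well-formed line; skip the rest."""
    for line in lines:
        # Strip the newline and surrounding whitespace, then split on tabs.
        data = line.strip().split("\t")
        if len(data) != 6:
            continue  # malformed line: ignore it and keep going
        # Assumed field order (hypothetical).
        date, time, store, item, cost, payment = data
        try:
            float(cost)  # extra check: the cost must parse as a number
        except ValueError:
            continue
        out.write("{0}\t{1}\n".format(store, cost))


# Hypothetical test input containing two bad lines.
sample_input = io.StringIO(
    "2012-01-01\t09:00\tMiami\tBooks\t12.50\tVisa\n"
    "this line is malformed\n"
    "2012-01-01\t09:05\tMiami\tToys\tnot-a-number\tCash\n"
    "2012-01-01\t09:10\tNew York\tMusic\t7.25\tCash\n"
)
mapper_output = io.StringIO()
run_mapper(sample_input, mapper_output)
print(mapper_output.getvalue(), end="")
```

Only the Miami 12.50 and New York 7.25 records survive. In a real Hadoop Streaming script the same function would be called with `sys.stdin` and `sys.stdout` in place of the in-memory streams.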
Quiz: What Happens Between Mapper And Reducer?
Once the Mapper is done, the Hadoop framework passes the intermediate data to the Reducers. What's the process called that happens between Mappers and Reducers?
[ ] Bubble sort
[ ] Shuffle and sort
[ ] Find and Shuffle
[ ] Quick sort
Answer:
The process is called the Shuffle and Sort. It ensures that the values for any particular key are collected together, and sends the keys and their lists of values to the Reducer.
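As a rough single-process illustration of what the Shuffle and Sort achieves (not how Hadoop implements it across a cluster), the sketch below sorts intermediate pairs by key and collects the values for each key. The function name `shuffle_and_sort` and the sample pairs are assumptions.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(pairs):
    """Sort (key, value) pairs by key and gather each key's values."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


# Hypothetical intermediate data as it might leave several Mappers.
mapper_pairs = [
    ("New York", 7.25),
    ("Miami", 12.50),
    ("Miami", 3.00),
    ("New York", 1.75),
]
grouped = shuffle_and_sort(mapper_pairs)
for store, costs in grouped:
    print(store, costs)
```

After this step every store appears once, with all of its costs gathered into one list.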
Reducer
In our case, we only have a single Reducer, because that's the Hadoop default, so it will get all the keys. If we had specified more than one Reducer, each would receive some of the keys, along with all the values from all the Mappers for those keys.
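One way to picture how keys could be divided among several Reducers is a hash of the key taken modulo the number of Reducers. The sketch below illustrates that idea only. `zlib.crc32` is a stand-in, not the hash function Hadoop uses, and the names `assign_reducer` and `partition` are hypothetical.

```python
import zlib


def assign_reducer(key, num_reducers):
    """Pick a Reducer index for a key (crc32 is only a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition(pairs, num_reducers):
    """Place each (key, value) pair in the bucket for its Reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[assign_reducer(key, num_reducers)].append((key, value))
    return buckets


# Hypothetical intermediate pairs.
sample_pairs = [("Miami", 12.5), ("New York", 7.25), ("Miami", 3.0), ("Boston", 4.0)]
single = partition(sample_pairs, 1)   # one Reducer receives every key
several = partition(sample_pairs, 3)  # each key lands in exactly one bucket
```

Because the Reducer index depends only on the key, all values for one store always end up at the same Reducer, whichever Mapper produced them.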
We're using Hadoop Streaming here, because we're writing our code in Python. Hadoop Streaming allows you to write your Mappers and Reducers in pretty much any language, rather than forcing you to use Java.
But the way the Reducer gets the data is a little tricky to deal with. It's going to get the data coming in something like this.
Quiz: What Variables Do We Need To Keep Track Of?
As you can see, the data comes in as a stream of lines, each containing a store name and cost. The store names are sorted, which we're guaranteed because of the shuffle and sort, so we know that all the lines for, say, Miami will appear one after the other. So, what variables do we need to keep track of to calculate the sales per store, based on how the data is appearing?
[ ] previous sale
[ ] current sale
[ ] total sales per store
[ ] previous store
[ ] current store name
[ ] all store names
Answer:
So when we code the Reducer, we're going to need to keep track of the keys. When the key changes, we know we've received all the data from the previous key, so we can then write out the final result for that previous key. In our case, that result is the total we'll have been building by adding up all the values for that key.
Reducer Code
So here's the Reducer code. Let's step through it.
We'll start by setting a couple of variables up. salesTotal is what we'll use to keep the running total. We initialize that to zero. And since we haven't read any data in yet, we haven't had any keys, so oldKey is initialized to None.
Then we start reading from standard input. Each line will contain a key, a tab, and a value. In our case, a store name, a tab, and one of the sales from that store. So, we strip off the newline character at the end of the line and split the line based on the tab. That should give us exactly two items, which we'll store in the data array.
If we don't have two items, something strange has happened, so we'll skip that line of input, although that should in fact never be the case, since we know our Mappers are writing the data out in this format.
Now we'll pull the two elements of the array out into named variables for clarity. thisKey will hold the store name, thisValue the sale amount.
Now here's the tricky part. We want to know if the key has changed since the last one we read. So we check to see if oldKey is even set (because if it's not, then this will be the first line we've read) and, if it is, we see if it's different to the key we just read in.
If that's true, then the key has just changed. In the example we just looked at, we'd have read all the Miami lines and now we've just received a New York key. So we need to write out the data for the previous key. We do that by writing that key, a tab, and the running total we've been keeping. That data will be written to a file in HDFS by the Hadoop framework.
Once we've done that, we set the salesTotal back to zero since we're now dealing with a new store.
OK, now we can actually process the data we've just read. We set oldKey up with the contents of the key, and then add the value to our running total.
And then we loop back and do the whole thing again.
Eventually, we'll run out of data to process, which takes us out of the loop.
Question: Are We Done?
Do you think the Reducer is finished at this point?
[ ] Yes, it's finished
[ ] No, another process needs to be run on the output
[ ] No, the last key has not yet been output
Answer:
Be careful! When we exit the loop, we haven't yet output the data for the last key we've been tracking. That's what the last two lines are for: we write out the key and value for the last store we've processed. If we didn't have those lines, we wouldn't write out data for that last store.
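Putting the walkthrough together, a Reducer along these lines could be sketched as follows. The variable names salesTotal, oldKey, thisKey and thisValue come from the notes. The wrapper function `run_reducer`, the in-memory streams and the sample input are assumptions added so the example is self-contained, so treat it as a reconstruction rather than the original listing.

```python
import io


def run_reducer(lines, out):
    """Sum the values for each run of identical keys in sorted input."""
    salesTotal = 0.0
    oldKey = None
    for line in lines:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue  # unexpected format: skip the line
        thisKey, thisValue = data
        # Key changed: write the finished total, then reset for the new store.
        if oldKey is not None and oldKey != thisKey:
            out.write("{0}\t{1}\n".format(oldKey, salesTotal))
            salesTotal = 0.0
        oldKey = thisKey
        salesTotal += float(thisValue)
    # After the loop, the last key's total has not been written yet.
    if oldKey is not None:
        out.write("{0}\t{1}\n".format(oldKey, salesTotal))


# Hypothetical sorted input, as it would arrive after the shuffle and sort.
sorted_input = io.StringIO("Miami\t12.50\nMiami\t3.00\nNew York\t7.25\n")
reducer_output = io.StringIO()
run_reducer(sorted_input, reducer_output)
print(reducer_output.getvalue(), end="")
```

With this input the sketch writes a total of 15.5 for Miami and 7.25 for New York. Removing the final `if` block would silently drop the New York line, which is exactly the bug the quiz is warning about.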
Now let's have Ian talk a little about testing the code, and then actually run it on our cluster.
Putting It All Together
So that's the code. One of the nice things about using Hadoop Streaming is that it's easy to test our code outside of Hadoop. Let's see how to do that. Our Mapper takes data in from standard input, and writes its results to standard output, so we can just run it from the command line and type data in to test it. Or, even better, we can build just a small sample data file and pipe that to the Mapper. Let's do that. Here we have a very small file, just 10 or so lines. So to test the Mapper, we can just do this:
    cat testfile | ./mapper.py
And there's our result: store names and sales. Excellent! If we had problems, we could go back and edit the Mapper until it worked, and it's really nice and quick to do this without needing to run it via Hadoop every time.
We can do a similar thing with the Reducer. It's expecting a set of lines which look like store name, tab, value, so, again, we can create a sample file which looks like that and pass it in. But even nicer, we can test the entire pipeline. Remember that the Mapper's output is sorted by the Hadoop framework and then passed to the Reducer. So we can simulate the entire thing on the command line like this, using the Unix sort utility in between the Mapper and Reducer:
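The same Mapper, sort, Reducer chain can also be approximated inside a single Python process, which is handy for quick unit tests. The sketch below is an assumption-laden stand-in, not the course's scripts: `map_lines`, `reduce_lines`, the field positions and the test lines are all hypothetical, and Python's `sorted` plays the role of the Unix sort utility.

```python
def map_lines(lines):
    """Yield 'store<TAB>cost' for each six-field line (assumed layout)."""
    for line in lines:
        data = line.strip().split("\t")
        if len(data) == 6:
            yield "{0}\t{1}".format(data[2], data[4])


def reduce_lines(sorted_lines):
    """Yield 'store<TAB>total' for each run of identical store names."""
    old_key, total = None, 0.0
    for line in sorted_lines:
        key, value = line.split("\t")
        if old_key is not None and key != old_key:
            yield "{0}\t{1}".format(old_key, total)
            total = 0.0
        old_key = key
        total += float(value)
    if old_key is not None:
        yield "{0}\t{1}".format(old_key, total)


# Hypothetical test data standing in for a small sample file.
test_lines = [
    "2012-01-01\t09:00\tNew York\tMusic\t7.25\tCash",
    "2012-01-01\t09:01\tMiami\tBooks\t12.50\tVisa",
    "2012-01-01\t09:02\tMiami\tToys\t3.00\tCash",
    "2012-01-01\t09:03\tNew York\tBooks\t1.75\tVisa",
]
# sorted() sits between the two stages, like sort on the command line.
results = list(reduce_lines(sorted(map_lines(test_lines))))
for row in results:
    print(row)
```

Leaving out the `sorted` call would interleave the Miami and New York lines and produce several partial totals per store, which shows why the sort step between the two stages matters.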
...and there's our output, exactly as we'd expect. So now that we've tested on the command line, we can now test it on the cluster. Best practice when you're developing MapReduce jobs is to first test with a small dataset before you run your code on your entire, huge set of data, but we're pretty confident here, so let's just run the thing on our whole purchases.txt file.
We'll use our alias hs to cut down on our typing. Here's our Mapper... and our Reducer, and the files arguments we also have to specify. Our input directory is myinput, and we'll tell Hadoop to write the results to output2. Remember, the output directory must not already exist or the job will fail.
Off it goes. On this pseudo-distributed cluster we can only run two Mappers simultaneously, and it turns out that we need four to process the entire dataset because of its size. So two will run, then when they've finished the next two will start. Once they're done the Reducers will then begin. It turns out you can watch this happening via a Web-based user interface that Hadoop gives us. You point your Web browser at the JobTracker, which on our machine is just localhost, on port 50030. Here you can see that there's one running job, and when we click on it we can see the Mappers and Reducers running.
It gives us a ton of other interesting information. One key thing is that if a Mapper or Reducer fails, you can actually drill down and view the logs from that particular piece of code.
OK, it's done. So let's take a look at the output, and there it is. Our sales, totalled by store, just as we'd expected.