Hadoop

Hadoop 0.17 Documentation
Last Published: 05/21/2008 20:01:10

Hadoop Map-Reduce Tutorial
Purpose
Pre-requisites
Overview
Inputs and Outputs
Example: WordCount v1.0
o Source Code
o Usage
o Walk-through
Map-Reduce - User Interfaces
o Payload
    Mapper
    Reducer
    Partitioner
    Reporter
    OutputCollector
o Job Configuration
o Task Execution & Environment
o Job Submission and Monitoring
    Job Control
o Job Input
    InputSplit
    RecordReader
o Job Output
    Task Side-Effect Files
    RecordWriter
o Other Useful Features
    Counters
    DistributedCache
    Tool
    IsolationRunner
    Debugging
    JobControl
    Data Compression
Example: WordCount v2.0
o Source Code
o Sample Runs
o Highlights
Purpose
This document comprehensively describes all user-facing facets of the Hadoop Map-Reduce
framework and serves as a tutorial.
Pre-requisites
Ensure that Hadoop is installed, configured and running. More details:
Hadoop Quickstart for first-time users.
Hadoop Cluster Setup for large, distributed clusters.
Overview
Hadoop Map-Reduce is a software framework for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of
nodes) of commodity hardware in a reliable, fault-tolerant manner.
A Map-Reduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts the
outputs of the maps, which are then input to the reduce tasks. Typically both the input and
the output of the job are stored in a file-system. The framework takes care of scheduling
tasks, monitoring them and re-executing the failed tasks.
Typically the compute nodes and the storage nodes are the same, that is, the Map-Reduce
framework and the Distributed FileSystem are running on the same set of nodes. This
configuration allows the framework to effectively schedule tasks on the nodes where data is
already present, resulting in very high aggregate bandwidth across the cluster.
The Map-Reduce framework consists of a single master JobTracker and one
slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs'
component tasks on the slaves, monitoring them and re-executing the failed tasks. The
slaves execute the tasks as directed by the master.
Minimally, applications specify the input/output locations and
supply map and reduce functions via implementations of appropriate interfaces and/or
abstract-classes. These, and other job parameters, comprise the job configuration. The
Hadoop job client then submits the job (jar/executable etc.) and configuration to
the JobTracker, which then assumes the responsibility of distributing the
software/configuration to the slaves, scheduling tasks and monitoring them, providing
status and diagnostic information to the job-client.
Although the Hadoop framework is implemented in Java(TM), Map-Reduce applications need
not be written in Java.
Hadoop Streaming is a utility which allows users to create and run jobs with any
executables (e.g. shell utilities) as the mapper and/or the reducer.
Hadoop Pipes is a SWIG-compatible C++ API to implement Map-Reduce applications
(non JNI(TM) based).
Inputs and Outputs
The Map-Reduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a set
of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement
the WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a Map-Reduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
Example: WordCount v1.0
Before we jump into the details, let's walk through an example Map-Reduce application to
get a flavour for how they work.
WordCount is a simple application that counts the number of occurrences of each word in a
given input set.
This works with a local-standalone, pseudo-distributed or fully-distributed Hadoop
installation.
Source Code
WordCount.java
1. package org.myorg;
2.
3. import java.io.IOException;
4. import java.util.*;
5.
6. import org.apache.hadoop.fs.Path;
7. import org.apache.hadoop.conf.*;
8. import org.apache.hadoop.io.*;
9. import org.apache.hadoop.mapred.*;
10. import org.apache.hadoop.util.*;
11.
12. public class WordCount {
13.
14.   public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
15.     private final static IntWritable one = new IntWritable(1);
16.     private Text word = new Text();
17.
18.     public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
19.       String line = value.toString();
20.       StringTokenizer tokenizer = new StringTokenizer(line);
21.       while (tokenizer.hasMoreTokens()) {
22.         word.set(tokenizer.nextToken());
23.         output.collect(word, one);
24.       }
25.     }
26.   }
27.
28.   public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
29.     public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
30.       int sum = 0;
31.       while (values.hasNext()) {
32.         sum += values.next().get();
33.       }
34.       output.collect(key, new IntWritable(sum));
35.     }
36.   }
37.
38.   public static void main(String[] args) throws Exception {
39.     JobConf conf = new JobConf(WordCount.class);
40.     conf.setJobName("wordcount");
41.
42.     conf.setOutputKeyClass(Text.class);
43.     conf.setOutputValueClass(IntWritable.class);
44.
45.     conf.setMapperClass(Map.class);
46.     conf.setCombinerClass(Reduce.class);
47.     conf.setReducerClass(Reduce.class);
48.
49.     conf.setInputFormat(TextInputFormat.class);
50.     conf.setOutputFormat(TextOutputFormat.class);
51.
52.     FileInputFormat.setInputPaths(conf, new Path(args[0]));
53.     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
54.
55.     JobClient.runJob(conf);
57.   }
58. }
59.
Usage
Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop
version installed, compile WordCount.java and create a jar:
$ mkdir wordcount_classes
$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf /usr/joe/wordcount.jar -C wordcount_classes/ .
Assuming that:
/usr/joe/wordcount/input - input directory in HDFS
/usr/joe/wordcount/output - output directory in HDFS
Sample text-files as input:
$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
Run the application:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
Output:
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
Walk-through
The WordCount application is quite straight-forward.
The Mapper implementation (lines 14-26), via the map method (lines 18-25), processes one
line at a time, as provided by the specified TextInputFormat (line 49). It then splits the
line into tokens separated by whitespaces, via the StringTokenizer, and emits a key-
value pair of < <word>, 1>.
For the given sample input the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
We'll learn more about the number of maps spawned for a given job, and how to control
them in a fine-grained manner, a bit later in the tutorial.
WordCount also specifies a combiner (line 46). Hence, the output of each map is passed
through the local combiner (which is the same as the Reducer as per the job configuration) for
local aggregation, after being sorted on the keys.
The output of the first map:
< Bye, 1>
< Hello, 1>
< World, 2>
The output of the second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
The Reducer implementation (lines 28-36), via the reduce method (lines 29-35), just sums
up the values, which are the occurrence counts for each key (i.e. words in this example).
Thus the output of the job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
The main method (lines 38-57) specifies various facets of the job, such as the input/output
paths (passed via the command line), key/value types, input/output formats etc., in
the JobConf. It then calls JobClient.runJob (line 55) to submit the job and monitor its progress.
We'll learn more about JobConf, JobClient, Tool and other interfaces and classes a bit
later in the tutorial.
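The data flow traced in this walk-through can be reproduced without a cluster. The following is a minimal plain-Java sketch that applies the map, combine and reduce steps to the two sample inputs in memory. It does not use the Hadoop API; the class name LocalWordCount and its helper methods are hypothetical and exist only for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustrative only: an in-memory imitation of the WordCount data flow.
// Nothing here is part of the Hadoop API.
class LocalWordCount {

    // "map" step: emit one <word, 1> pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // "combine" and "reduce" step: sum the counts per key.
    // A TreeMap keeps the keys sorted, as the framework does before reducing.
    static TreeMap<String, Integer> sum(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Run map + combine once per input, then reduce over all combined outputs.
    static TreeMap<String, Integer> run(String... inputs) {
        List<Map.Entry<String, Integer>> combined = new ArrayList<>();
        for (String input : inputs) {
            combined.addAll(sum(map(input)).entrySet());
        }
        return sum(combined);
    }

    public static void main(String[] args) {
        // Prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
        System.out.println(run("Hello World Bye World", "Hello Hadoop Goodbye Hadoop"));
    }
}
```

Reusing one summing function for both the combine and the reduce step mirrors the job configuration above, where the same Reduce class is set as combiner and reducer.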
Map-Reduce - User Interfaces
This section provides a reasonable amount of detail on every user-facing aspect of the Map-
Reduce framework. This should help users implement, configure and tune their jobs in a fine-
grained manner. However, please note that the javadoc for each class/interface remains the
most comprehensive documentation available; this is only meant to be a tutorial.
Let us first take the Mapper and Reducer interfaces. Applications typically implement them
to provide the map and reduce methods.
We will then discuss other core interfaces
including JobConf, JobClient, Partitioner, OutputCollector, Reporter, InputFormat,
OutputFormat and others.
Finally, we will wrap up by discussing some useful features of the framework such as
the DistributedCache, IsolationRunner etc.
Payload
Applications typically implement the Mapper and Reducer interfaces to provide
the map and reduce methods. These form the core of the job.
Mapper
Mapper maps input key/value pairs to a set of intermediate key/value pairs.
Maps are the individual tasks that transform input records into intermediate records. The
transformed intermediate records do not need to be of the same type as the input records.
A given input pair may map to zero or many output pairs.
The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated
by the InputFormat for the job.
Overall, Mapper implementations are passed the JobConf for the job via
the JobConfigurable.configure(JobConf) method and override it to initialize themselves. The
framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for
each key/value pair in the InputSplit for that task. Applications can then override
the Closeable.close() method to perform any required cleanup.
Output pairs do not need to be of the same types as input pairs. A given input pair may
map to zero or many output pairs. Output pairs are collected with calls
to OutputCollector.collect(WritableComparable,Writable).
Applications can use the Reporter to report progress, set application-level status messages
and update Counters, or just indicate that they are alive.
All intermediate values associated with a given output key are subsequently grouped by the
framework, and passed to the Reducer(s) to determine the final output. Users can control
the grouping by specifying a Comparator via JobConf.setOutputKeyComparatorClass(Class).
The Mapper outputs are sorted and then partitioned per Reducer. The total number of
partitions is the same as the number of reduce tasks for the job. Users can control which
keys (and hence records) go to which Reducer by implementing a custom Partitioner.
Users can optionally specify a combiner, via JobConf.setCombinerClass(Class), to perform
local aggregation of the intermediate outputs, which helps to cut down the amount of data
transferred from the Mapper to the Reducer.
The intermediate, sorted outputs are always stored in files of SequenceFile format.
Applications can control if, and how, the intermediate outputs are to be compressed and
the CompressionCodec to be used via the JobConf.
How Many Maps?
The number of maps is usually driven by the total size of the inputs, that is, the total
number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per-node, although
it has been set up to 300 maps for very cpu-light map tasks. Task setup takes awhile, so it
is best if the maps take at least a minute to execute.
Thus, if you expect 10TB of input data and have a blocksize of 128MB, you'll end up with
82,000 maps, unless setNumMapTasks(int) (which only provides a hint to the framework) is
used to set it even higher.
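The figure of roughly 82,000 maps follows from dividing the input size by the block size. A small sketch of that arithmetic, assuming binary units (1TB = 2^40 bytes, 1MB = 2^20 bytes) and one map per block; the class and method names are hypothetical and not part of Hadoop:

```java
// Illustrative arithmetic only; not part of the Hadoop API.
class MapCountEstimate {

    // One map per block: round up so a trailing partial block still gets a map.
    static long estimateMaps(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long tenTerabytes = 10L * 1024 * 1024 * 1024 * 1024;
        long blockSize = 128L * 1024 * 1024;
        // Prints 81920, i.e. roughly 82,000 maps.
        System.out.println(estimateMaps(tenTerabytes, blockSize));
    }
}
```

With many input files the real count can be higher, since each file's last block is usually only partly filled.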
Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of values.
The number of reduces for the job is set by the user via JobConf.setNumReduceTasks(int).
Overall, Reducer implementations are passed the JobConf for the job via
the JobConfigurable.configure(JobConf) method and can override it to initialize themselves.
The framework then calls the reduce(WritableComparable, Iterator, OutputCollector,
Reporter) method for each <key, (list of values)> pair in the grouped inputs.
Applications can then override the Closeable.close() method to perform any required
cleanup.
Reducer has 3 primary phases: shuffle, sort and reduce.
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework
fetches the relevant partition of the output of all the mappers, via HTTP.
Sort
The framework groups Reducer inputs by keys (since different mappers may have output
the same key) in this stage.
The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they
are merged.
Secondary Sort
If equivalence rules for grouping the intermediate keys are required to be different from
those for grouping keys before reduction, then one may specify
a Comparator via JobConf.setOutputValueGroupingComparator(Class).
Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how
intermediate keys are grouped, these can be used in conjunction to simulate secondary sort
on values.
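The idea of combining two comparators can be pictured in plain Java: one comparator orders composite keys by name and then by value, while a second, coarser comparator decides which adjacent keys fall into the same group. The sketch below is an in-memory imitation under those assumptions. CompositeKey, SORT and GROUP are hypothetical names and none of this is the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: imitates "sort by one comparator, group by another" in memory.
class SecondarySortSketch {

    // A composite intermediate key: the natural key plus the value to order by.
    static final class CompositeKey {
        final String name;
        final int value;

        CompositeKey(String name, int value) {
            this.name = name;
            this.value = value;
        }
    }

    // Plays the role of the sort comparator: order by name, then by value.
    static final Comparator<CompositeKey> SORT =
        Comparator.comparing((CompositeKey k) -> k.name).thenComparingInt(k -> k.value);

    // Plays the role of the grouping comparator: only the name decides the group.
    static final Comparator<CompositeKey> GROUP =
        Comparator.comparing((CompositeKey k) -> k.name);

    // Sort all keys, then start a new group whenever GROUP sees a difference.
    static Map<String, List<Integer>> sortAndGroup(List<CompositeKey> keys) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        CompositeKey previous = null;
        List<Integer> current = null;
        for (CompositeKey key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                current = new ArrayList<>();
                groups.put(key.name, current);
            }
            current.add(key.value);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = new ArrayList<>();
        keys.add(new CompositeKey("b", 2));
        keys.add(new CompositeKey("a", 3));
        keys.add(new CompositeKey("a", 1));
        keys.add(new CompositeKey("b", 1));
        // Prints {a=[1, 3], b=[1, 2]}: values arrive sorted within each group.
        System.out.println(sortAndGroup(keys));
    }
}
```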
Reduce
In this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is
called for each <key, (list of values)> pair in the grouped inputs.
The output of the reduce task is typically written to
the FileSystem via OutputCollector.collect(WritableComparable, Writable).
Applications can use the Reporter to report progress, set application-level status messages
and update Counters, or just indicate that they are alive.
The output of the Reducer is not sorted.
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes>
* mapred.tasktracker.reduce.tasks.maximum).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as
the maps finish. With 1.75 the faster nodes will finish their first round of reduces and
launch a second wave of reduces doing a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but increases load
balancing and lowers the cost of failures.
The scaling factors above are slightly less than whole numbers to reserve a few reduce slots
in the framework for speculative-tasks and failed tasks.
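As a worked example of this rule of thumb, the sketch below computes both suggested reduce counts for a hypothetical cluster of 10 nodes with mapred.tasktracker.reduce.tasks.maximum set to 2. The class name, the integer-percent representation of the factors and the choice to round down are assumptions made for illustration.

```java
// Illustrative arithmetic only; not part of the Hadoop API.
class ReduceCountEstimate {

    // factorPercent is 95 or 175, i.e. the 0.95 and 1.75 factors in integer form.
    // Integer division rounds down, which keeps the result within the slot budget.
    static int estimateReduces(int nodes, int reduceSlotsPerNode, int factorPercent) {
        return nodes * reduceSlotsPerNode * factorPercent / 100;
    }

    public static void main(String[] args) {
        System.out.println(estimateReduces(10, 2, 95));   // prints 19: a single wave
        System.out.println(estimateReduces(10, 2, 175));  // prints 35: two waves
    }
}
```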
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output
path set by setOutputPath(Path). The framework does not sort the map-outputs before
writing them out to the FileSystem.
Partitioner
Partitioner partitions the key space.
Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key
(or a subset of the key) is used to derive the partition, typically by a hash function. The
total number of partitions is the same as the number of reduce tasks for the job. Hence this
controls which of the m reduce tasks the intermediate key (and hence the record) is sent to
for reduction.
HashPartitioner is the default Partitioner.
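Deriving a partition from a hash can be sketched as follows. This is a common way to do it and is shown only as an illustration; the text above does not spell out the exact computation used by HashPartitioner, and the class and method names here are hypothetical.

```java
// Illustrative only: maps a key to one of numReduceTasks partitions by hashing.
class HashPartitionSketch {

    static int partitionFor(Object key, int numReduceTasks) {
        // Clear the sign bit so that negative hash codes still give a valid index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] words = {"Bye", "Goodbye", "Hadoop", "Hello", "World"};
        for (String word : words) {
            System.out.println(word + " -> partition " + partitionFor(word, 4));
        }
    }
}
```

Because the result depends only on the key, every record with the same key reaches the same reduce task, whichever map emitted it.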
Reporter
Reporter is a facility for Map-Reduce applications to report progress, set application-level
status messages and update Counters.
Mapper and Reducer implementations can use the Reporter to report progress or just
indicate that they are alive. In scenarios where the application takes a significant amount of
time to process individual key/value pairs, this is crucial since the framework might assume
that the task has timed-out and kill that task. Another way to avoid this is to set the
configuration parameter mapred.task.timeout to a high-enough value (or even set it
to zero for no time-outs).
Applications can also update Counters using the Reporter.
OutputCollector
OutputCollector is a generalization of the facility provided by the Map-Reduce framework to
collect data output by the Mapper or the Reducer (either the intermediate outputs or the
output of the job).
Hadoop Map-Reduce comes bundled with a library of generally useful mappers, reducers,
and partitioners.
Job Configuration
JobConf represents a Map-Reduce job configuration.
JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop
framework for execution. The framework tries to faithfully execute the job as described
by JobConf, however:
o Some configuration parameters may have been marked as final by administrators
and hence cannot be altered.
o While some job parameters are straight-forward to set
(e.g. setNumReduceTasks(int)), other parameters interact subtly with the rest of the
framework and/or job configuration and are more complex to set
(e.g. setNumMapTasks(int)).
JobConf is typically used to specify the Mapper, combiner (if
any), Partitioner, Reducer, InputFormat and OutputFormat implementations. JobConf
also indicates the set of input files (setInputPaths(JobConf, Path...) / addInputPath(JobConf,
Path)) and (setInputPaths(JobConf, String) / addInputPaths(JobConf, String)) and where the
output files should be written (setOutputPath(Path)).
Optionally, JobConf is used to specify other advanced facets of the job such as
the Comparator to be used, files to be put in the DistributedCache, whether intermediate
and/or job outputs are to be compressed (and how), debugging via user-provided scripts
(setMapDebugScript(String)/setReduceDebugScript(String)), whether job tasks can be
executed in a speculative manner
(setMapSpeculativeExecution(boolean)/setReduceSpeculativeExecution(boolean)), maximum
number of attempts per task (setMaxMapAttempts(int)/setMaxReduceAttempts(int)),
percentage of task failures which can be tolerated by the job
(setMaxMapTaskFailuresPercent(int)/setMaxReduceTaskFailuresPercent(int)) etc.
Of course, users can use set(String, String)/get(String, String) to set/get arbitrary
parameters needed by applications. However, use the DistributedCache for large amounts
of (read-only) data.
Task Execution & Environment
The TaskTracker executes the Mapper/Reducer task as a child process in a separate jvm.
The child-task inherits the environment of the parent TaskTracker. The user can specify
additional options to the child-jvm via the mapred.child.java.opts configuration
parameter in the JobConf such as non-standard paths for the run-time linker to search
shared libraries via -Djava.library.path=<> etc. If
the mapred.child.java.opts contains the symbol @taskid@ it is interpolated with the value
of taskid of the map/reduce task.
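The @taskid@ substitution can be pictured as a plain string replacement. The sketch below only illustrates that idea and makes no claim about how the framework performs it internally; the class and method names are hypothetical, and the task id is just an example value in the format used elsewhere in this tutorial.

```java
// Illustrative only: shows the effect of interpolating @taskid@ in an options string.
class TaskIdInterpolation {

    static String interpolate(String childJavaOpts, String taskId) {
        // String.replace substitutes every literal occurrence of the symbol.
        return childJavaOpts.replace("@taskid@", taskId);
    }

    public static void main(String[] args) {
        String opts = "-Xmx512M -verbose:gc -Xloggc:/tmp/@taskid@.gc";
        // Prints -Xmx512M -verbose:gc -Xloggc:/tmp/task_200709221812_0001_m_000000_0.gc
        System.out.println(interpolate(opts, "task_200709221812_0001_m_000000_0"));
    }
}
```

The practical effect is that each task's JVM writes to its own file, such as a per-task GC log.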
Here is an example with multiple arguments and substitutions, showing jvm GC logging, and
start of a passwordless JVM JMX agent so that it can connect with jconsole and the likes to
watch child memory, threads and get thread dumps. It also sets the maximum heap-size of
the child jvm to 512MB and adds an additional path to the java.library.path of the child-
jvm.
<property>
<name>mapred.child.java.opts</name>
<value>
-Xmx512M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@taskid@.gc
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
</value>
</property>
Users/admins can also specify the maximum virtual memory of the launched child-task
using mapred.child.ulimit.
When the job starts, the localized job directory
${mapred.local.dir}/taskTracker/jobcache/$jobid/ has the following directories:
o A job-specific shared directory, created at location
${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ . This directory is exposed
to the users through job.local.dir . The tasks can use this space as scratch space and
share files among them. The directory can be accessed through the
api JobConf.getJobLocalDir(). It is also available as a System property, so users can
call System.getProperty("job.local.dir").
o A jars directory, which has the job jar file and expanded jar
o A job.xml file, the generic job configuration
o Each task has a directory task-id, which again has the following structure:
    o A job.xml file, task localized job configuration
    o A directory for intermediate output files
    o The working directory of the task. The work directory has a temporary
    directory to create temporary files
The DistributedCache can also be used as a rudimentary software distribution mechanism
for use in the map and/or reduce tasks. It can be used to distribute both jars and native
libraries. The DistributedCache.addArchiveToClassPath(Path,
Configuration) or DistributedCache.addFileToClassPath(Path, Configuration) api can be used
to cache files/jars and also add them to the classpath of child-jvm. Similarly the facility
provided by the DistributedCache where-in it symlinks the cached files into the working
directory of the task can be used to distribute native libraries and load them. The underlying
detail is that child-jvm always has its current working directory added to
the java.library.path and hence the cached libraries can be loaded
via System.loadLibrary or System.load.
Job Submission and Monitoring
JobClient is the primary interface by which user-job interacts with the JobTracker.
JobClient provides facilities to submit jobs, track their progress, access component-tasks'
reports/logs, get the Map-Reduce cluster's status information and so on.
The job submission process involves:
1. Checking the input and output specifications of the job.
2. Computing the InputSplit values for the job.
3. Setting up the requisite accounting information for the DistributedCache of the
job, if necessary.
4. Copying the job's jar and configuration to the map-reduce system directory on
the FileSystem.
5. Submitting the job to the JobTracker and optionally monitoring its status.
Job history files are also logged to the user-specified
directory hadoop.job.history.user.location, which defaults to the job output directory. The
files are stored in "_logs/history/" in the specified directory. Hence, by default they will be in
mapred.output.dir/_logs/history. Users can stop logging by giving the
value none for hadoop.job.history.user.location.
Users can view the history logs summary in the specified directory using the following command:
$ bin/hadoop job -history output-dir
This command will print job details, failed and killed tip details.
More details about the job, such as successful tasks and task attempts made for each task,
can be viewed using the following command:
$ bin/hadoop job -history all output-dir
Users can use OutputLogFilter to filter log files from the output directory listing.
Normally the user creates the application, describes various facets of the job via JobConf,
and then uses the JobClient to submit the job and monitor its progress.
Job Control
Users may need to chain map-reduce jobs to accomplish complex tasks which cannot be
done via a single map-reduce job. This is fairly easy since the output of the job typically
goes to the distributed file-system, and the output, in turn, can be used as the input for the
next job.
However, this also means that the onus of ensuring jobs are complete (success/failure) lies
squarely on the clients. In such cases, the various job-control options are:
o runJob(JobConf) : Submits the job and returns only after the job has completed.
o submitJob(JobConf) : Only submits the job; then poll the returned handle to
the RunningJob to query status and make scheduling decisions.
o JobConf.setJobEndNotificationURI(String) : Sets up a notification upon job-
completion, thus avoiding polling.
Job Input
InputFormat describes the input-specification for a Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the job to:
1. Validate the input-specification of the job.
2. Split-up the input file(s) into logical InputSplit instances, each of which is then
assigned to an individual Mapper.
3. Provide the RecordReader implementation used to glean input records from the
logical InputSplit for processing by the Mapper.
The default behavior of file-based InputFormat implementations, typically sub-classes
of FileInputFormat, is to split the input into logical InputSplit instances based on the total
size, in bytes, of the input files. However, the FileSystem blocksize of the input files is
treated as an upper bound for input splits. A lower bound on the split size can be set
via mapred.min.split.size.
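One plausible way to combine the two bounds is to cap a target split size at the block size and then raise it to the configured minimum. The sketch below assumes exactly that; it is not taken from the FileInputFormat source, and the names splitSize and goalSize (the target size, for example total input bytes divided by the requested number of maps) are hypothetical.

```java
// Illustrative only: combines an upper bound (block size) and a lower bound
// (mapred.min.split.size) into a single split size.
class SplitSizeSketch {

    static long splitSize(long goalSize, long minSize, long blockSize) {
        // Cap at the block size first, then enforce the minimum.
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        System.out.println(splitSize(1024 * mb, 1, 128 * mb) / mb);        // prints 128
        System.out.println(splitSize(64 * mb, 1, 128 * mb) / mb);          // prints 64
        System.out.println(splitSize(1024 * mb, 256 * mb, 128 * mb) / mb); // prints 256
    }
}
```

In this sketch a minimum larger than the block size wins, so raising mapred.min.split.size would be the way to get splits that span several blocks.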
Clearly, logical splits based on input-size are insufficient for many applications since record
boundaries must be respected. In such cases, the application should implement
a RecordReader, which is responsible for respecting record-boundaries and presents a record-
oriented view of the logical InputSplit to the individual task.
TextInputFormat is the default InputFormat.
If TextInputFormat is the InputFormat for a given job, the framework detects input-files
with the .gz and .lzo extensions and automatically decompresses them using the
appropriate CompressionCodec. However, it must be noted that compressed files with the
above extensions cannot be split and each compressed file is processed in its entirety by a
single mapper.
InputSplit
InputSplit represents the data to be processed by an individual Mapper.
Typically InputSplit presents a byte-oriented view of the input, and it is the responsibility
of RecordReader to process and present a record-oriented view.
FileSplit is the default InputSplit. It sets map.input.file to the path of the input file for
the logical split.
RecordReader
RecordReader reads <key, value> pairs from an InputSplit.
Typically the RecordReader converts the byte-oriented view of the input, provided by
the InputSplit, and presents a record-oriented view to the Mapper implementations for
processing. RecordReader thus assumes the responsibility of processing record boundaries
and presents the tasks with keys and values.
Job Output
OutputFormat describes the output-specification for a Map-Reduce job.
The Map-Reduce framework relies on the OutputFormat of the job to:
1. Validate the output-specification of the job; for example, check that the output
directory doesn't already exist.
2. Provide the RecordWriter implementation used to write the output files of the job.
Output files are stored in a FileSystem.
TextOutputFormat is the default OutputFormat.
Task Side-Effect Files
In some applications, component tasks need to create and/or write to side-files, which differ
from the actual job-output files.
In such cases there could be issues with two instances of the
same Mapper or Reducer running simultaneously (for example, speculative tasks) trying to
open and/or write to the same file (path) on the FileSystem. Hence the application-writer
will have to pick unique names per task-attempt (using the taskid,
say task_200709221812_0001_m_000000_0), not just per task.
To avoid these issues the Map-Reduce framework maintains a special
${mapred.output.dir}/_temporary/_${taskid} sub-directory accessible via
${mapred.work.output.dir} for each task-attempt on the FileSystem where the output
of the task-attempt is stored. On successful completion of the task-attempt, the files in
the ${mapred.output.dir}/_temporary/_${taskid} (only) are promoted to
${mapred.output.dir}. Of course, the framework discards the sub-directory of
unsuccessful task-attempts. This process is completely transparent to the application.
The application-writer can take advantage of this feature by creating any side-files required
in ${mapred.work.output.dir} during execution of a task
via FileOutputFormat.getWorkOutputPath(), and the framework will promote them similarly
for successful task-attempts, thus eliminating the need to pick unique paths per task-
attempt.
Note: The value of ${mapred.work.output.dir} during execution of a particular task-
attempt is actually ${mapred.output.dir}/_temporary/_${taskid}, and this value is
set by the map-reduce framework. So, just create any side-files in the path returned
by FileOutputFormat.getWorkOutputPath() from the map/reduce task to take advantage of this
feature.
The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since
output of the map, in that case, goes directly to HDFS.
RecordWriter
RecordWriter writes the output <key, value> pairs to an output file.
RecordWriter implementations write the job outputs to the FileSystem.
Other Useful Features
Counters
Counters represent global counters, defined either by the Map-Reduce framework or
applications. Each Counter can be of any Enum type. Counters of a particular Enum are
bunched into groups of type Counters.Group.
Applications can define arbitrary Counters (of type Enum) and update them
via Reporter.incrCounter(Enum, long) in the map and/or reduce methods. These counters
are then globally aggregated by the framework.
DistributedCache
DistributedCache distributes application-specific, large, read-only files efficiently.
DistributedCache is a facility provided by the Map-Reduce framework to cache files (text,
archives, jars and so on) needed by applications.
Applications specify the files to be cached via urls (hdfs:// or http://) in the JobConf.
The DistributedCache assumes that the files specified via hdfs:// urls are already present
on the FileSystem.
The framework will copy the necessary files to the slave node before any tasks for the job
are executed on that node. Its efficiency stems from the fact that the files are only copied
once per job and the ability to cache archives which are un-archived on the slaves.
DistributedCache tracks the modification timestamps of the cached files. Clearly the
cache files should not be modified by the application or externally while the job is executing.
DistributedCache can be used to distribute simple, read-only data/text files and more
complex types such as archives and jars. Archives (zip files) are un-archived at the slave
nodes. Optionally users can also direct the DistributedCache to symlink the cached file(s)
into the current working directory of the task via
the DistributedCache.createSymlink(Configuration) api. Files have execution permissions set.
Tool
The Tool interface supports the handling of generic Hadoop command-line options.
Tool is the standard for any Map-Reduce tool or application. The application should
delegate the handling of standard command-line options
to GenericOptionsParser via ToolRunner.run(Tool, String[]) and only handle its custom
arguments.
The generic Hadoop command-line options are:
-conf <configuration file>
-D <property=value>
-fs <local|namenode:port>
-jt <local|jobtracker:port>
IsolationRunner
IsolationRunner is a utility to help debug Map-Reduce programs.
To use the IsolationRunner, first set keep.failed.tasks.files to true (also
see keep.tasks.files.pattern).
Next, go to the node on which the failed task ran and go to the TaskTracker's local
directory and run the IsolationRunner:
$ cd <local path>/taskTracker/${taskid}/work
$ bin/hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
IsolationRunner will run the failed task in a single jvm, which can be in the debugger,
over precisely the same input.
%ebu&&in&
&ap<(educe 3ramework provides a 3aciit4 to run user'provided scripts 3or de,u--in-. When
map<reduce task 3ais8 user can run script 3or doin- post'processin- on task o-s i.e task;s
stdout8 stderr8 s4so- and jo,con3. )he stdout and stderr o3 the user'provided de,u- script
are printed on the dia-nostics. )hese outputs are aso dispa4ed on jo, $. on demand.
.n the 3oowin- sections we discuss how to su,mit de,u- script aon- with the jo,. #or
su,mittin- de,u- script8 3irst it has to distri,uted. )hen the script has to suppied in
Con3i-uration.
How to distribute script file:
To distribute the debug script file, first copy the file to the dfs. The file can be distributed by
setting the property "mapred.cache.files" with value "path"#"script-name". If more than one
file has to be distributed, the files can be added as comma-separated paths. This property
can also be set by the
APIs DistributedCache.addCacheFile(URI,conf) and DistributedCache.setCacheFiles(URIs,conf),
where URI is of the form "hdfs://host:port/'absolutepath'#'script-name'". For Streaming,
the file can be added through the command-line option -cacheFile.
The files have to be symlinked in the current working directory of the task. To create a
symlink for the file, the property "mapred.create.symlink" is set to "yes". This can also be
set by the DistributedCache.createSymlink(Configuration) api.
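The URI shape described above can be illustrated with java.net.URI alone. The sketch below builds a URI whose fragment (the part after #) carries the script name, and joins several URIs into the comma-separated form used for "mapred.cache.files". The class CacheFileUriSketch and its methods are hypothetical helpers, and the host, port and paths in the usage note are made-up placeholders.

```java
import java.net.URI;

// Illustration only: builds values in the shape described above using java.net.URI.
class CacheFileUriSketch {
    // Returns a URI of the form hdfs://host:port/absolutepath#script-name.
    static URI cacheUri(String host, int port, String absolutePath, String scriptName) {
        return URI.create("hdfs://" + host + ":" + port + absolutePath + "#" + scriptName);
    }

    // Joins several URIs into a comma-separated value for "mapred.cache.files".
    static String cacheFilesValue(URI... uris) {
        StringBuilder value = new StringBuilder();
        for (URI uri : uris) {
            if (value.length() > 0) {
                value.append(',');
            }
            value.append(uri.toString());
        }
        return value.toString();
    }
}
```

For example, cacheUri("namenode", 9000, "/debug/fail.sh", "fail.sh") yields a URI whose getFragment() is fail.sh. A URI object of this kind is what DistributedCache.addCacheFile(URI,conf) expects as its first argument.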
How to submit script:
A quick way to submit a debug script is to set values for the properties
"mapred.map.task.debug.script" and "mapred.reduce.task.debug.script" for debugging map
tasks and reduce tasks respectively. These properties can also be set by using the
APIs JobConf.setMapDebugScript(String) and JobConf.setReduceDebugScript(String). For
streaming, the debug script can be submitted with the command-line options -mapdebug and
-reducedebug for debugging the mapper and reducer respectively.
The arguments of the script are the task's stdout, stderr, syslog and jobconf files. The debug
command, run on the node where the map/reduce task failed, is:
$script $stdout $stderr $syslog $jobconf
Pipes programs have the c++ program name as a fifth argument for the command. Thus for
pipes programs the command is
$script $stdout $stderr $syslog $jobconf $program
Default Behavior:
For pipes, a default script is run that processes core dumps under gdb, prints the stack trace and
gives info about running threads.
JobControl
JobControl is a utility which encapsulates a set of Map-Reduce jobs and their dependencies.
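The idea of a set of jobs with dependencies can be sketched without Hadoop. The stand-in below is not the JobControl API: the class DependencySketch and its methods are hypothetical, and jobs are represented only by their names. It computes an order in which the jobs could run so that each job starts only after the jobs it depends on.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of dependency-ordered jobs; not Hadoop's JobControl API.
class DependencySketch {
    // Maps each job name to the names of the jobs it depends on.
    private final Map<String, Set<String>> dependsOn = new LinkedHashMap<String, Set<String>>();

    void addJob(String job, String... dependencies) {
        Set<String> deps = new LinkedHashSet<String>();
        for (String dependency : dependencies) {
            deps.add(dependency);
        }
        dependsOn.put(job, deps);
    }

    // Returns the jobs in an order where every job follows all of its dependencies.
    List<String> runOrder() {
        List<String> done = new ArrayList<String>();
        while (done.size() < dependsOn.size()) {
            boolean progressed = false;
            for (Map.Entry<String, Set<String>> entry : dependsOn.entrySet()) {
                boolean ready = done.containsAll(entry.getValue());
                if (!done.contains(entry.getKey()) && ready) {
                    done.add(entry.getKey());
                    progressed = true;
                }
            }
            if (!progressed) {
                // No job became runnable: a cycle or an unknown dependency.
                throw new IllegalStateException("Cyclic or missing dependency");
            }
        }
        return done;
    }
}
```

For instance, a job that joins the outputs of two counting jobs would be added with both counting jobs as dependencies, and runOrder() would place it after them.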
Data Compression
Hadoop Map-Reduce provides facilities for the application-writer to specify compression for
both intermediate map-outputs and the job-outputs, i.e. the output of the reduces. It also comes
bundled with CompressionCodec implementations for the zlib and lzo compression
algorithms. The gzip file format is also supported.
Hadoop also provides native implementations of the above compression codecs for reasons
of both performance (zlib) and non-availability of Java libraries (lzo). More details on their
usage and availability are available in the Native Hadoop Libraries guide.
Intermediate Outputs
Applications can control compression of intermediate map-outputs via
the JobConf.setCompressMapOutput(boolean) api and the CompressionCodec to be used
via the JobConf.setMapOutputCompressorClass(Class) api. Since the intermediate
map-outputs are always stored in the SequenceFile format,
the SequenceFile.CompressionType (i.e. RECORD / BLOCK - defaults to RECORD) can be
specified via
the JobConf.setMapOutputCompressionType(SequenceFile.CompressionType) api.
Job Outputs
Applications can control compression of job-outputs via
the OutputFormatBase.setCompressOutput(JobConf, boolean) api and
the CompressionCodec to be used can be specified via
the OutputFormatBase.setOutputCompressorClass(JobConf, Class) api.
If the job outputs are to be stored in the SequenceFileOutputFormat, the
required SequenceFile.CompressionType (i.e. RECORD / BLOCK - defaults to RECORD) can
be specified via the SequenceFileOutputFormat.setOutputCompressionType(JobConf,
SequenceFile.CompressionType) api.
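The practical difference between the two compression types can be illustrated outside Hadoop. In a SequenceFile, RECORD compresses each record's value individually, while BLOCK compresses a batch of records together. The sketch below only imitates that distinction with java.util.zip.Deflater; it does not read or write the SequenceFile format, and the class CompressionTypeSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustration only: compares per-record and per-block zlib compression sizes.
class CompressionTypeSketch {
    // Compresses the given bytes with zlib and returns the compressed size in bytes.
    static int compressedSize(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        byte[] buffer = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    // RECORD-like: every record is compressed separately and the sizes are summed.
    static int perRecordSize(String[] records) {
        int total = 0;
        for (String record : records) {
            total += compressedSize(record.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    // BLOCK-like: all records are concatenated and compressed once.
    static int perBlockSize(String[] records) {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        for (String record : records) {
            byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
            block.write(bytes, 0, bytes.length);
        }
        return compressedSize(block.toByteArray());
    }
}
```

For many short, similar records the block form is usually much smaller, because redundancy across records becomes visible to the compressor and the per-stream overhead is paid only once.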
Example: WordCount v2.0
Here is a more complete WordCount which uses many of the features provided by the
Map-Reduce framework we discussed so far.
This needs the HDFS to be up and running, especially for the DistributedCache-related
features. Hence it only works with a pseudo-distributed or fully-distributed Hadoop
installation.
Source Code
WordCount.java
1. package org.myorg;
2.
3. import java.io.*;
4. import java.util.*;
5.
6. import org.apache.hadoop.fs.Path;
7. import org.apache.hadoop.filecache.DistributedCache;
8. import org.apache.hadoop.conf.*;
9. import org.apache.hadoop.io.*;
10. import org.apache.hadoop.mapred.*;
11. import org.apache.hadoop.util.*;
12.
13. public class WordCount extends Configured implements Tool {
14.
15.   public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
16.
17.     static enum Counters { INPUT_WORDS }
18.
19.     private final static IntWritable one = new IntWritable(1);
20.     private Text word = new Text();
21.
22.     private boolean caseSensitive = true;
23.     private Set<String> patternsToSkip = new HashSet<String>();
24.
25.     private long numRecords = 0;
26.     private String inputFile;
27.
28.     public void configure(JobConf job) {
29.       caseSensitive = job.getBoolean("wordcount.case.sensitive", true);
30.       inputFile = job.get("map.input.file");
31.
32.       if (job.getBoolean("wordcount.skip.patterns", false)) {
33.         Path[] patternsFiles = new Path[0];
34.         try {
35.           patternsFiles = DistributedCache.getLocalCacheFiles(job);
36.         } catch (IOException ioe) {
37.           System.err.println("Caught exception while getting cached files: " + StringUtils.stringifyException(ioe));
38.         }
39.         for (Path patternsFile : patternsFiles) {
40.           parseSkipFile(patternsFile);
41.         }
42.       }
43.     }
44.
45.     private void parseSkipFile(Path patternsFile) {
46.       try {
47.         BufferedReader fis = new BufferedReader(new FileReader(patternsFile.toString()));
48.         String pattern = null;
49.         while ((pattern = fis.readLine()) != null) {
50.           patternsToSkip.add(pattern);
51.         }
52.       } catch (IOException ioe) {
53.         System.err.println("Caught exception while parsing the cached file '" + patternsFile + "' : " + StringUtils.stringifyException(ioe));
54.       }
55.     }
56.
57.     public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
58.       String line = (caseSensitive) ? value.toString() : value.toString().toLowerCase();
59.
60.       for (String pattern : patternsToSkip) {
61.         line = line.replaceAll(pattern, "");
62.       }
63.
64.       StringTokenizer tokenizer = new StringTokenizer(line);
65.       while (tokenizer.hasMoreTokens()) {
66.         word.set(tokenizer.nextToken());
67.         output.collect(word, one);
68.         reporter.incrCounter(Counters.INPUT_WORDS, 1);
69.       }
70.
71.       if ((++numRecords % 100) == 0) {
72.         reporter.setStatus("Finished processing " + numRecords + " records " + "from the input file: " + inputFile);
73.       }
74.     }
75.   }
76.
77.   public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
78.     public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
79.       int sum = 0;
80.       while (values.hasNext()) {
81.         sum += values.next().get();
82.       }
83.       output.collect(key, new IntWritable(sum));
84.     }
85.   }
86.
87.   public int run(String[] args) throws Exception {
88.     JobConf conf = new JobConf(getConf(), WordCount.class);
89.     conf.setJobName("wordcount");
90.
91.     conf.setOutputKeyClass(Text.class);
92.     conf.setOutputValueClass(IntWritable.class);
93.
94.     conf.setMapperClass(Map.class);
95.     conf.setCombinerClass(Reduce.class);
96.     conf.setReducerClass(Reduce.class);
97.
98.     conf.setInputFormat(TextInputFormat.class);
99.     conf.setOutputFormat(TextOutputFormat.class);
100.
101.     List<String> other_args = new ArrayList<String>();
102.     for (int i=0; i < args.length; ++i) {
103.       if ("-skip".equals(args[i])) {
104.         DistributedCache.addCacheFile(new Path(args[++i]).toUri(), conf);
105.         conf.setBoolean("wordcount.skip.patterns", true);
106.       } else {
107.         other_args.add(args[i]);
108.       }
109.     }
110.
111.     FileInputFormat.setInputPaths(conf, new Path(other_args.get(0)));
112.     FileOutputFormat.setOutputPath(conf, new Path(other_args.get(1)));
113.
114.     JobClient.runJob(conf);
115.     return 0;
116.   }
117.
118.   public static void main(String[] args) throws Exception {
119.     int res = ToolRunner.run(new Configuration(), new WordCount(), args);
120.     System.exit(res);
121.   }
122. }
Sample Runs
Sample text-files as input:
$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World, Bye World!
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop, Goodbye to hadoop.
Run the application:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
Output:
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop, 1
Hello 2
World! 1
World, 1
hadoop. 1
to 1
Notice that the inputs differ from the first version we looked at, and how they affect the
outputs.
Now, let's plug in a pattern-file which lists the word-patterns to be ignored, via
the DistributedCache.
$ hadoop dfs -cat /user/joe/wordcount/patterns.txt
\.
\,
\!
to
Run it again, this time with more options:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount -Dwordcount.case.sensitive=true /usr/joe/wordcount/input /usr/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt
As expected, the output:
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop 1
Hello 2
World 2
hadoop 1
Run it once more, this time switch-off case-sensitivity:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount -Dwordcount.case.sensitive=false /usr/joe/wordcount/input /usr/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt
Sure enough, the output:
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
bye 1
goodbye 1
hadoop 2
hello 2
world 2
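The three runs above can be reproduced without a cluster by applying the same per-line steps as the map method (lower-casing when case-insensitive, removing skip patterns with replaceAll, then tokenizing) and summing per word as the reduce does. The sketch below is a plain-Java stand-in under those assumptions, not the Map-Reduce job itself; the class LocalWordCount and its count method are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java imitation of the WordCount v2.0 map and reduce logic; no Hadoop classes involved.
class LocalWordCount {
    static Map<String, Integer> count(List<String> lines, List<String> patternsToSkip,
                                      boolean caseSensitive) {
        // TreeMap keeps the words sorted, similar to the sorted job output.
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String raw : lines) {
            String line = caseSensitive ? raw : raw.toLowerCase();
            for (String pattern : patternsToSkip) {
                // Each pattern is treated as a regular expression, as in the map method.
                line = line.replaceAll(pattern, "");
            }
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                String word = tokenizer.nextToken();
                Integer old = counts.get(word);
                counts.put(word, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }
}
```

Calling count with the two sample lines, the four patterns from patterns.txt (written as "\\.", "\\,", "\\!" and "to" in Java source) and caseSensitive set to false gives the same counts as the last run. Note that replaceAll matches substrings, so a pattern such as to would also be removed from inside longer words.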
Highlights
The second version of WordCount improves upon the previous one by using some features
offered by the Map-Reduce framework:
Demonstrates how applications can access configuration parameters in
the configure method of the Mapper (and Reducer) implementations (lines 28-43).
Demonstrates how the DistributedCache can be used to distribute read-only data
needed by the jobs. Here it allows the user to specify word-patterns to skip while counting
(line 104).
Demonstrates the utility of the Tool interface and the GenericOptionsParser to
handle generic Hadoop command-line options (lines 87-116, 119).
Demonstrates how applications can use Counters (line 68) and how they can set
application-specific status information via the Reporter instance passed to
the map (and reduce) method (line 72).
Java and JNI are trademarks or registered trademarks of Sun Microsystems, Inc. in the
United States and other countries.
Copyright © 2007 The Apache Software Foundation.