GPFS performance session Sven Oehme oehmes@s!i"m!com 2 2014 IBM Corporation #1 cache reference 1 ns L2 cache reference 5 ns Acquire/release mutex 100 ns Main Memory reference 100 ns Send 2k byte over verbs 10000 ns Send 2k bytes over 1 !b"s net#ork $0000 ns %ead 1 M& sequentially from Memory 250000 ns %ead 1 M& sequentially from net#ork 5000000 ns 'isk seek 10000000 ns %ead 1 M& sequentially from disk 20000000 ns Send net#ork "acket from S())A*+ ,%A'- *+ S())A 150000000 ns Latency numbers you NEED to know * Latency numbers you NEED to know * $%hese nm"ers are ron&e& an& &on't c(aim to "e 100) accrate * 2014 IBM Corporation Gi+a"it 100 M&/sec 10 !bit 1000 M&/sec .0 !bit /'% 0& .000 M&/sec 5$ !bit ,'% 0& 5$00 M&/sec !-1*2 2)0 Slot 3000 M&/sec !-1*3 2)0 Slot $000 M&/sec 1L SAS drive 1004 Seq %ead/5rite 100 M&/sec 1L SAS drive 1004 2M& %andom %ead/5rite 65 M&/sec SS' 1004 Seq %ead 300 M&/sec SS' 1004 Seq 5rite 200 M&/sec Bandwidth numbers you NEED to know * Bandwidth numbers you NEED to know * $%hese nm"ers are ron&e& an& &on't c(aim to "e 100) accrate 4 2014 IBM Corporation ,# S-S &rive 100) 4. /an&om iops 100 10k SAS drive 1004 .k %andom io"s 200 SS' 1004 .k random reads 20000 SS' 1004 .k random #rites .000 IOPS numbers you NEED to know * IOPS numbers you NEED to know * $%hese nm"ers are ron&e& an& &on't c(aim to "e 100) accrate 0 2014 IBM Corporation How to collect a GPS trace !or "er!ormance analysis How to collect a GPS trace !or "er!ormance analysis GPFS 3.5 TL3 provides a new low overhead Tracing facility in memory tracing We had a in emory tracing !efore" !#t it had still large overhead and was now replaced !y a new techni$#e To set %l#ster wide in emory tracing" r#n & mmtracectl ''set ''trace(def ''tracedev'write'mode(overwrite ''tracedev'overwrite'!#ffer'si)e(*g +o# can now t#rn on tracing & mmtracectl ''start Stop tracing" while leave the settings in place & mmtracectl ''stop ,r t#rn tracing off and reset to non in'memory tracing & mmtracectl ''off 1 2014 IBM Corporation #naly$e traces % create re"orts #naly$e traces % create re"orts -s an e.ampled my traces are stored in /var/log/gpfstraces/ -nd my GPFS clone directory is /.cat/oehmes/gpfs'clone 0ow r#n the following commands & 1 /.cat/oehmes/gpfs'clone/tools/trcio 'W 'T 'F 'L 'd /var/log/gpfstraces/ 'o trcio.o#tp#t 2eading directory /var/log/gpfstraces Fo#nd * files matching 34trcrpt.546 Processing * of *& trcrpt.*378*9.*:.5:.37.sonas79n*.g) !ad trace entry& ;sing invariant time cycle co#nter at <9==.==:777h) to calc#late timestamps !ad trace entry& >etected %P; r#nning at <935.777777 ?) at *383@33=85.987757 A*73@*:7=@:9=73<<<B. !ad trace entry& >etected %P; r#nning at <935.777777 ?) at *383@33=85.98775= A*73@*:7=@:9=<5*99B. !ad trace entry& >etected %P; r#nning at <9==.777777 ?) at *383@35=@<.3:78<8 A*73@::<87@358<3:3B. !ad trace entry& 0ote that this is a change in %P; speed. Aif prior T,> times are within a second of each other this is a sp#rio#s errorB !ad trace entry& >etected %P; r#nning at <9=@.777777 ?) at *383@35=@<.3:<399 A*73@::<87@8:*<9:7B. !ad trace entry& >etected %P; r#nning at <9==.777777 ?) at *383@9*@7:.3**:87 A*797**@:=38:93858B. !ad trace entry& >etected %P; r#nning at <3=:.777777 ?) at *383@9*@7:.3**:88 A*797**@:=38::7535B. Writing trcio.o#tp#t 2 2014 IBM Corporation #naly$e traces % look at the re"orts #naly$e traces % look at the re"orts Processed * trace files& 347.7777774" 4*3@=3.*59<8=4" 4trcrpt.*378*9.*:.5:.37.sonas79n*.g)46 Total elapsedTime& *3@=3.*59 sec Total Time %o#nt ,ps/sec ''Time'per'operation'Amilli'secB'' '%#m#l.'ed. AsecondsB min avg =7C ma. AmsB pcnt '''''''''' '''''' '''''''' ''''''' ''''''' ''''''' ''''''' ''''''''''''4 <9*5.:59 *9:7<3 *7.5 7.738 *:.593 <3.88* *57.953 *@.39: 3:C read 454 :7:.7<= 33@85 <.9 7.7** *8.@=7 *=.**9 *978.=57 <<.:<< 9C send 454 37.8:8 =@3 7.* 7.73= 3*.<== ==.39< <<8.@*: ==.397 *7C write 454 *:.85@ *73< 7.* 7.783 *:.<3@ 38.<*7 <55.89: 35.77= *5C nsd 454 7.<7: 85@ 7.* 7.77* 7.<8< 7.*3< *3.=5< =.77: *C dm#t. 454 7.*5: 55@5 7.9 7.773 7.7<@ 7.733 7.59* 7.7<8 37C recv 454 7.7*8 3:89: <.: 7.777 7.777 7.77* 7.7*3 7.77* <<C tsc 454 7.775 98: 7.7 7.773 7.7** 7.7<3 7.73@ 7.7*3 <3C vnop 454 7.77* 58@ 7.7 7.777 7.77* 7.77< 7.77: 7.77< 35C locD 454 Total Time %o#nt ,ps/sec ''Time'per'operation'Amilli'secB'' '%#m#l.'ed. AsecondsB min avg =7C ma. AmsB pcnt '''''''''' '''''' '''''''' ''''''' ''''''' ''''''' ''''''' '''''''''''' *:95.@5= @@*88 :.3 7.8*< *@.::5 <9.<95 **5.:7@ *=.395 97C read 4vdisDE#f4 :@:.@*: 59@<9 3.= 7.738 *<.5<@ *=.*3= *57.953 *5.<=9 3<C read 4#nDnown4 393.8:9 <8@=@ <.7 7.7<< *<.3<< *=.9<@ =3.==: *5.9<= 3*C send 4nspdsg2eadWrite4 <:<.795 37@ 7.7 :@9.<=* @57.8=5 *3@9.@9* *978.=57 875.=5: 3=C send 4nspdsg>iscover4 @<.=:< <=8: 7.< *=.*@< <8.@88 33.*8< *77.9=8 <8.<@@ 93C read 4data4 <3.35@ <77 7.7 5=.@59 **:.8=7 *=9.5=: <<8.@*: **@.*7= 35C write 4data4 *:.85@ *73< 7.* 7.783 *:.<3@ 38.<*7 <55.89: 35.77= *5C nsd 4process2e$#est4 5.95* 37@ 7.7 **.78< *8.:=@ <7.3:< @3.@9< *8.*7@ 9*C write 4vdisDFGLog<4 7.89: 3:= 7.7 7.797 <.7<3 7.**3 58.:59 <:.7*: <C write 4vdisDFWLog4 7.87: 35 7.7 *8.*<* <7.*8: <7.5*@ <5.*=5 <7.<@* 9@C write 4vdisD2G>esc4 7.3@8 *< 7.7 <:.==8 3<.<:7 38.737 38.73: 3:.<39 57C write 4log>ata4 7.*:< <9 7.7 7.779 :.85< =.=55 *3.=5< =.588 <=C dm#t. 4GE#fFree#te. HGdisDScr#!WorDerThreadI4 ..... 3 2014 IBM Corporation sdstat Plu&in !or GPS sdstat Plu&in !or GPS With GPFS 3.5 PTF*< we will add a new sample code pl#gin for sdatst to the GPFS rpms on Lin#. Jn the meanwhile who has access to the git repository can #se it " files are & ./ts/#til/dstatKgpfsops.py.dstat.7.: ./ts/#til/dstatKgpfsops.py.dstat.7.8 The e.tension of the file A7.: and 7.8B are for the < incompati!le pl#gin versions of dstat. 7.: will worD on all older Lin#. version prior to 2?FL :.* and version 7.8 will worD on all newer versions 7.: and 7.8 are the versions of dstat reported !y dstat version The version of the pl#gin needs to !e copied into /#sr/share/dstat/ Aon 2?FL :.LB and renamed to dstatKgpfsops.py liDe cp ./ts/#til/dstatKgpfsops.py.dstat.7.8 /#sr/share/dstat/dstatKgpfsops.py -fter that yo# can add the pl#gin to the dstat o#tp#t !y r#nning & dstat 'c 'n 'd ' gpfsops ''nocolor This will show cp# " networD" disD and GPFS defa#lt stats on a single line at * second gran#larity Jn order to ena!le vfs statistics yo# need to r#n & mmfsadm vfsstats ena!le ,n each node in the cl#ster Aor add to mmfs#p file in /vsr/mmfs/etc/B
4 2014 IBM Corporation sdstat Plu&in !or GPS sdstat Plu&in !or GPS >stat class to display selected gpfs performance co#nters ret#rned !y the mmpmon MvfsKsM" MiocKsM" MvioKsM" Mvfl#shKsM" and MlrocKsM commands. The set of co#nters displayed can !e c#stomi)ed via environment varia!les& >ST-TKGPFSKW?-T Selects which of the five mmpmon commands to display. Jt is a comma separated list of any of the following& MvfsM& show mmpmon MvfsKsM co#nters MiocM& show mmpmon MiocKsM co#nters related to 0S> client J/, MnsdM& show mmpmon MiocKsM co#nters related to 0S> server J/, MvioM& show mmpmon MvioKsM co#nters Mvfl#shM& show mmpmon Mvfl#shKsM co#nters MlrocM& show mmpmon MlrocKsM co#nters MallM& e$#ivalent to specifying all of the a!ove F.ample& >ST-TKGPFSKW?-T(vfs"lroc dstat ' gpfsops will display co#nters for mmpmon MvfsKsM and MlrocM commands. The defa#lt setting is Mvfs"iocM" i.e." !y defa#lt only MvfsKsM and 0S> client related MiocKsM co#nters are displayed. For more details on f#rther c#stomi)ation see the dstatsKgpfsops.py file 10 2014 IBM Corporation sdstat Plu&in !or GPS sdstat Plu&in !or GPS N#st show GFS level %o#nters & 1 >ST-TKGPFSKW?-T(vfs dstat 'c 'n 'd ' gpfsops ''nocolor W-20J0G& ,ption ' is deprecated" please #se ''gpfsops instead /#sr/!in/dstat&*:8<& >eprecationWarning& os.popen3 is deprecated. ;se the s#!process mod#le. pipes3cmd6 ( os.popen3Acmd" 4t4" 7B ''''total'cp#'#sage'''' 'net/total' 'dsD/total' '''''''''''''''''''''''''''gpfs'vfs'ops'''''''''''''''''''''''''' #sr sys idl wai hi$ si$O recv sendO read writO cr del op/cl rd wr tr#nc fsync looD# gattr sattr other 7 7 =@ * 7 7O 7 7 O *3 95O 7 7 7 7 7 7 7 7 7 7 7 7 7 *77 7 7 7O ==7E 5939EO 7 7 O 7 7 7 7 7 7 7 7 7 7 7 7 7 *77 7 7 7O 897E 9==:EO 7 7 O 7 7 7 7 7 7 7 7 7 7 7 7 7 *77 7 7 7O :@9E 95=7EO 7 *:DO 7 7 7 7 7 7 7 7 7 7 7 7 7 *77 7 7 7O :<@E 57*<EO 7 7 O 7 7 7 7 7 7 7 7 7 7 7 7 7 *77 7 7 7O =*:E 539:EO 7 :7DO 7 7 7 7 7 7 7 7 7 7 7 N#st show the v>JSP co#nters & 1 >ST-TKGPFSKW?-T(vio dstat 'c 'n 'd ' gpfsops ''nocolor W-20J0G& ,ption ' is deprecated" please #se ''gpfsops instead /#sr/!in/dstat&*:8<& >eprecationWarning& os.popen3 is deprecated. ;se the s#!process mod#le. pipes3cmd6 ( os.popen3Acmd" 4t4" 7B ''''total'cp#'#sage'''' 'net/total' 'dsD/total' ''''''''''''''''''''''''''gpfs'vio''''''''''''''''''''''''' #sr sys idl wai hi$ si$O recv sendO read writO%l2ea %lShW %ldW %lPFT %lFTW Fl;pW FlPFT igrt Scr#! LgWr 7 7 =@ * 7 7O 7 7 O *3 95O 7 7 7 7 7 7 7 7 7 7 7 7 *77 7 7 7O*959E 5=@:EO 7 7 O 7 7 7 7 7 7 7 7 7 7 7 7 *77 7 7 7O 5:9E 5*87EO 7 7 O 7 7 7 7 7 7 7 7 7 7 7 7 *77 7 7 7O :@9E 9=77EO 7 7 O 7 7 7 7 7 7 7 7 7 7 7 7 *77 7 7 7O @:9E 9:59EO 7 7 O 7 7 7 7 7 7 7 7 7 7 7 7 *77 7 7 7O 897E 9=@7EO 7 :9DO 7 7 7 7 7 7 7 7 7 7 7 7 *77 7 7 7O :@7E 9589EO 7 7 O 7 7 7 7 7 7 7 7 7 7 11 2014 IBM Corporation Per! to" % 'ool to !ind the hi&h (P) contender Per! to" % 'ool to !ind the hi&h (P) contender If you start perf top withou parameters, it gives you a top CPU consuming processes of the system in real time and show a relative % compared to others. <3.7:C mmfsd 3.6 rs>*7T<K@K=KlowKvectorKcDs#mAvoid55" %P:9State5" intB
7.:3C 3Dernel6 3D6 fp#t ... 12 2014 IBM Corporation Per! to" % 'ool to !ind the hi&h (P) contender Per! to" % 'ool to !ind the hi&h (P) contender +o# can f#rther )oom into a process Ain this e.ample mmfsdB and see a !reaDdown of cp# chewing f#nctions & Samples& <:=P of event 4cycles4" Fvent co#nt Aappro..B& *<:8373=*8*5" Thread& mmfsdA3999<<B" >S,& mmfsd
3=.5*C 3.6 rs>*7T<K@K=KlowKvectorKcDs#mAvoid55" %P:9State5" intB < GN chec!sum code <@.38C 3.6 rs>*7T<K@K=KhighKvectorKcDs#mAvoid55" %P:9State5" intB *.=<C 3.6 GTracD>esc&&cleanE#ffers-ndEitmapsAGJ,2e$#est5B *.39C 3.6 G>ataE#f&&vPre!#ildE#fferTrailerAintB *.<@C 3.6 0otGlo!al#te.%lass&&ac$#ireAB *.<5C 3.6 %hecDs#m*:&&calc*:Avoid const5" intB *.77C 3.6 GTracD>esc&&!#ildE#fferTrailersAE#fEitmap constQ" G>ataE#f55B 7.=*C 3.6 J,E#ndle&&$#e#eJ,E#fferAG>ataE#f5" int" int" intB 7.@:C 3.6 GJ,2e$#est&&performPromotedFTWriteAB 7.@5C 3.6 ver!s&&ver!sServerKiAint" 2pc%onte.t5" 0ode-ddr" int" #nsigned int" int" int" nsd2dma2mrKs5" int" iovec5" long long" long longB 7.@9C 3.6 G>ataE#f&&vGet>ata-ddr-t,ffsetAint" intB const 7.@<C 3.6 %h#nDTa!&&find%h#nDAchar5" #nsigned longB 7.@*C 3.6 G>ataE#f&&vE#ildE#fferTrailerAintB 7.88C 3.6 ver!s&&ver!s>toThreadKiAintB 7.89C 3.6 G>ataE#f&&v?oldAB 7.:=C 3.6 Jncremental%hecDs#mState&&icD-cc#m#lateAvoid const5" intB 7.::C 3.6 GTracD>esc&&prepareToE#ildTrailersAGJ,2e$#est5B 7.::C 3.6 GTracD>esc&&vt;pdateTrailerGersionsAGJ,2e$#est5B 7.::C 3.6 Th%ond&&waitAint" char const5B 7.:3C 3.6 GJ,2e$#est&&vio2eshapeFreeE#ffersAintB 7.:3C 3.6 Gem?andle&&mGetG-ddrAB const 7.:7C 3.6 GTracD>esc&&vtProcessWriteJ,E#ndleStat#sAJ,E#ndle5" GJ,2e$#est5" E#fEitmap5B ... 1* 2014 IBM Corporation Per!ormance data Per!ormance data " word of caution # The achieved n#m!ers depends on the right %lient config#ration and good Jnterconnect and can vary !etween environments. They sho#ld not !e #sed in 2FJ4s as committed n#m!ers" rather to demonstrate the technical capa!ilities of the Prod#ct in good conditions Non of the following Performance numbers should be reused for sales or contract purposes. Some of the numbers produced are a result of very advanced tuning and while achievable, not very easy to recreate at customer systems without the same level of effort 14 2014 IBM Corporation 'est Setu" 'est Setu" 10 5*0006M* Server each 7ith 11 GB of Memor8 91 +" Pa+epoo(: 1 F;/ Port 1 5 1 core CP< 1 GSS24=21 &epen&in+ on the test! 2 F;/ Ports connecte& per Server GPFS *!0!0!2 G- co&e (eve( Me((ano5 *2 Port F;/ s7itch 45 105 10 2014 IBM Corporation #""ly these numbers to "ractice #""ly these numbers to "ractice %reating a single *7 G!yte File from one %lient #sing a GF0'< F>2 JE card 1 /#sr/local/!in/gpfsperf create se$ 'n *7G 'r @m /i!m/fs<'@m/test'*7g'write /#sr/local/!in/gpfsperf create se$ /i!m/fs<'@m/test'*7g'write recSi)e @ nEytes *7G fileSi)e *7G nProcesses * nThreadsPerProcess * file cache fl#shed !efore test not #sing data shipping not #sing direct J/, offsets accessed will cycle thro#gh the same file segment not #sing shared memory !#ffer not releasing !yte'range toDen after open no fsync at end of test >ata rate was $%&'()).*+ ,-ytes.sec" iops was 3=@.=5" thread #tili)ation 7.=@9 2ecord si)e& @3@@:7@ !ytes" *78389*@<97 !ytes to transfer" *78389*@<97 !ytes transferred %P; #tili)ation& #ser 3.:@C" sys 3.@8C" idle =<.95C" wait 7.77C Why didn4t it r#n at 5.: GE/sec R GF0 < 11 2014 IBM Corporation #""ly these numbers to "ractice #""ly these numbers to "ractice 2eading a single !locD random from this *7 G!yte File while it is not cached anymore 1 /#sr/local/!in/gpfsperf read rand 'n @m 'r @m /i!m/fs<'@m/test'*7g'write /#sr/local/!in/gpfsperf read rand /i!m/fs<'@m/test'*7g'write recSi)e @ nEytes @ fileSi)e *7G nProcesses * nThreadsPerProcess * file cache fl#shed !efore test not #sing data shipping not #sing direct J/, offsets accessed will cycle thro#gh the same file segment not #sing shared memory !#ffer not releasing !yte'range toDen after open >ata rate was %%/$%%.*% ,-ytes.sec" iops was <:.@=" thread #tili)ation 7.=@= 2ecord si)e& @3@@:7@ !ytes" @3@@:7@ !ytes to transfer" @3@@:7@ !ytes transferred %P; #tili)ation& #ser 7.77C" sys 7.77C" idle *77.77C" wait 7.77C 3rootSclients.sonascl*: mpi61 mmdiag ''iohist ((( mmdiag& iohist ((( J/, history& J/, start time 2W E#f type disD§or0#m nSec time ms Type >evice/0S> J> 0S> server ''''''''''''''' '' ''''''''''' ''''''''''''''''' ''''' ''''''' '''' '''''''''''''''''' ''''''''''''''' *7&79&95.=8=798 2 data *7&975<8=<**5< *:3@9 33.:9: cli %7-8797<&5*F*E*<% *=<.*:8.9.< 12 2014 IBM Corporation #""ly these numbers to "ractice #""ly these numbers to "ractice 2eading @ !locDs se$#entially from this *7 G!yte File while it is not cached anymore 1 /#sr/local/!in/gpfsperf read se$ 'n :9m 'r @m /i!m/fs<'@m/test'*7g'write /#sr/local/!in/gpfsperf read se$ /i!m/fs<'@m/test'*7g'write recSi)e @ nEytes :9 fileSi)e *7G nProcesses * nThreadsPerProcess * file cache fl#shed !efore test not #sing data shipping not #sing direct J/, offsets accessed will cycle thro#gh the same file segment not #sing shared memory !#ffer not releasing !yte'range toDen after open >ata rate was *$$)$$.%+ ,-ytes.sec" iops was :5.*@" thread #tili)ation 7.=7* 2ecord si)e& @3@@:7@ !ytes" :8*7@@:9 !ytes to transfer" :8*7@@:9 !ytes transferred %P; #tili)ation& #ser 7.:@C" sys 7.:@C" idle =@.:9C" wait 7.77C 3rootSclients.sonascl*: mpi61 mmdiag ''iohist ((( mmdiag& iohist ((( J/, history& J/, start time 2W E#f type disD§or0#m nSec time ms Type >evice/0S> J> 0S> server ''''''''''''''' '' ''''''''''' ''''''''''''''''' ''''' ''''''' '''' '''''''''''''''''' ''''''''''''''' *7&7:&<9.::9@8@ 2 data **&:=758:5@@@7 *:3@9 <8.@78 cli %7-8797<&5*F*E*<> *=<.*:8.9.< *7&7:&<9.::9@8@ 2 data *7&@*9*:355@97 *:3@9 39.38< cli %7-8797<&5*F*E*<% *=<.*:8.9.< *7&7:&<9.879*== 2 data 8&*53:@33*<:97 *:3@9 <:.@@9 cli %7-8797*&5*F*E**% *=<.*:8.9.* *7&7:&<9.87*=:8 2 data *<&*55<333@:9=: *:3@9 3<.588 cli %7-8797<&5*F*E*<F *=<.*:8.9.< *7&7:&<9.879<*7 2 data @&@:*573:5*@9 *:3@9 3<.*== cli %7-8797*&5*F*E**> *=<.*:8.9.* *7&7:&<9.838<37 2 data =&*873*5*@:*8: *:3@9 37.5<8 cli %7-8797*&5*F*E**F *=<.*:8.9.* *7&7:&<9.89*5@7 2 data @&8*3*==<@@3< *:3@9 37.59< cli %7-8797*&5*F*E**> *=<.*:8.9.* *7&7:&<9.838<38 2 data *7&*99*3*57:*8: *:3@9 38.35@ cli %7-8797<&5*F*E*<% *=<.*:8.9.< *7&7:&<9.83=<33 2 data *<&=888@783:7 *:3@9 3:.385 cli %7-8797<&5*F*E*<F *=<.*:8.9.< *7&7:&<9.83=<73 2 data **&=*9<@==578< *:3@9 38.:78 cli %7-8797<&5*F*E*<> *=<.*:8.9.< *7&7:&<9.89*5@@ 2 data 8&*8*****8779@ *:3@9 97.997 cli %7-8797*&5*F*E**% *=<.*:8.9.* 13 2014 IBM Corporation #""ly these numbers to "ractice #""ly these numbers to "ractice 2eading @ !locDs se$#entially from this *7 G!yte File while it is 01I22 cached 1 /#sr/local/!in/gpfsperf read se$ 'n :9m 'r @m /i!m/fs<'@m/test'*7g'write /#sr/local/!in/gpfsperf read se$ /i!m/fs<'@m/test'*7g'write recSi)e @ nEytes :9 fileSi)e *7G nProcesses * nThreadsPerProcess * file cache fl#shed !efore test not #sing data shipping not #sing direct J/, offsets accessed will cycle thro#gh the same file segment not #sing shared memory !#ffer not releasing !yte'range toDen after open >ata rate was $%'&(&*.3& ,-ytes.sec" iops was 97*.*9" thread #tili)ation 7.=@7 2ecord si)e& @3@@:7@ !ytes" :8*7@@:9 !ytes to transfer" :8*7@@:9 !ytes transferred %P; #tili)ation& #ser 7.77C" sys 9.7@C" idle =5.=<C" wait 7.77C 3rootSclients.sonascl*: mpi61 mmdiag ''iohist ((( mmdiag& iohist ((( J/, history& J/, start time 2W E#f type disD§or0#m nSec time ms Type >evice/0S> J> 0S> server ''''''''''''''' '' ''''''''''' ''''''''''''''''' ''''' ''''''' '''' '''''''''''''''''' ''''''''''''''' 1 14 2014 IBM Corporation #""ly these numbers to "ractice #""ly these numbers to "ractice 2eading the whole file se$#entially 1 /#sr/local/!in/gpfsperf read se$ 'n *7g 'r @m /i!m/fs<'@m/test'*7g'write /#sr/local/!in/gpfsperf read se$ /i!m/fs<'@m/test'*7g'write recSi)e @ nEytes *7G fileSi)e *7G nProcesses * nThreadsPerProcess * file cache fl#shed !efore test not #sing data shipping not #sing direct J/, offsets accessed will cycle thro#gh the same file segment not #sing shared memory !#ffer not releasing !yte'range toDen after open >ata rate was %$$/$3&.*& ,-ytes.sec" iops was <@9.98" thread #tili)ation *.777 2ecord si)e& @3@@:7@ !ytes" *78389*@<97 !ytes to transfer" *78389*@<97 !ytes transferred %P; #tili)ation& #ser 3.7*C" sys 3.<9C" idle =3.85C" wait 7.77C ((( mmdiag& iohist ((( J/, history& J/, start time 2W E#f type disD§or0#m nSec time ms Type >evice/0S> J> 0S> server ''''''''''''''' '' ''''''''''' ''''''''''''''''' ''''' ''''''' '''' '''''''''''''''''' ''''''''''''''' *7&39&7<.395=<5 2 data @&*:8*3*<57:@@ *:3@9 39.::@ cli %7-8797*&5*F*E**> *=<.*:8.9.* *7&39&7<.395=<5 2 data =&*78:@3@<3:*: *:3@9 38.939 cli %7-8797*&5*F*E**F *=<.*:8.9.* *7&39&7<.3@8**< 2 data **&:=758:5@@@7 *:3@9 37.35: cli %7-8797<&5*F*E*<> *=<.*:8.9.< *7&39&7<.3@9*7@ 2 data *7&@*9*:355@97 *:3@9 35.<:5 cli %7-8797<&5*F*E*<% *=<.*:8.9.< *7&39&7<.3@873* 2 data *<&*55<333@:9=: *:3@9 39.38: cli %7-8797<&5*F*E*<F *=<.*:8.9.< *7&39&7<.9<<<@3 2 data @&@:*573:5*@9 *:3@9 3*.*<3 cli %7-8797*&5*F*E**> *=<.*:8.9.* *7&39&7<.9<983* 2 data *7&*99*3*57:*8: *:3@9 <=.9=5 cli %7-8797<&5*F*E*<% *=<.*:8.9.< *7&39&7<.9<<<5* 2 data 8&*53:@33*<:97 *:3@9 3<.935 cli %7-8797*&5*F*E**% *=<.*:8.9.* *7&39&7<.9<983* 2 data =&*873*5*@:*8: *:3@9 39.79= cli %7-8797*&5*F*E**F *=<.*:8.9.* ..... 20 2014 IBM Corporation #""ly these numbers to "ractice #""ly these numbers to "ractice See that the data was prefetched which is why the response time per re$#est is lower & mmfsadm d#mp iohist J/, history& J/, start time 2W E#f type disD§or0#m nSec time ms tag* tag< >isD ;J> typ 0S> server conte.t thread ''''''''''''''' '' ''''''''''' ''''''''''''''''' ''''' ''''''' ''''''''' ''''''''' '''''''''''''''''' ''' ''''''''''''''' ''''''''' '''''''''' *7&39&98.*9@5@< 2 data =&*78:@3@<3:*: *:3@9 37.:** <<=5@7@ * %7-8797*&5*F*E*37 cli *=<.*:8.9.* Prefetch Prefetch4or!er1hread *7&39&98.*9@5=7 2 data @&*:8*3*<57:@@ *:3@9 5*.*@7 <<=5@7@ 7 %7-8797*&5*F*E*<- cli *=<.*:8.9.* E?andler FileElocD2eadFetch?andlerThread *7&39&98.<79@@7 2 data **&:=758:5@@@7 *:3@9 <8.@@8 <<=5@7@ 3 %7-8797*&5*F*E*<> cli *=<.*:8.9.< Prefetch PrefetchWorDerThread *7&39&98.<7<59= 2 data *7&@*9*:355@97 *:3@9 3:.39@ <<=5@7@ < %7-8797*&5*F*E*<F cli *=<.*:8.9.< Prefetch PrefetchWorDerThread *7&39&98.<79@@@ 2 data *<&*55<333@:9=: *:3@9 39.7*8 <<=5@7@ 9 %7-8797*&5*F*E*<% cli *=<.*:8.9.< Prefetch PrefetchWorDerThread *7&39&98.<99735 2 data *7&*99*3*57:*8: *:3@9 3<.@:: <<=5@7@ @ %7-8797*&5*F*E*<F cli *=<.*:8.9.< Prefetch PrefetchWorDerThread *7&39&98.<9*@3= 2 data 8&*53:@33*<:97 *:3@9 35.83< <<=5@7@ 5 %7-8797*&5*F*E*<@ cli *=<.*:8.9.* Prefetch PrefetchWorDerThread *7&39&98.<9*@3= 2 data @&@:*573:5*@9 *:3@9 38.598 <<=5@7@ : %7-8797*&5*F*E*<- cli *=<.*:8.9.* Prefetch PrefetchWorDerThread *7&39&98.<9:5:8 2 data *<&=888@783:7 *:3@9 33.:39 <<=5@7@ *7 %7-8797*&5*F*E*<% cli *=<.*:8.9.< Prefetch PrefetchWorDerThread *7&39&98.<9:5:8 2 data **&=*9<@==578< *:3@9 3:.@8* <<=5@7@ = %7-8797*&5*F*E*<> cli *=<.*:8.9.< Prefetch PrefetchWorDerThread *7&39&98.<93=9: 2 data =&*873*5*@:*8: *:3@9 93.5=* <<=5@7@ 8 %7-8797*&5*F*E*37 cli *=<.*:8.9.* Prefetch PrefetchWorDerThread 21 2014 IBM Corporation Benchmark Benchmark e*ecution e*ecution and and results results Operation 1m .m 1$m !SS2$*#rite 7M&/sec8 3956:30 11302:19 1.960:.0 !SS2$*read 7M&/sec8 $9;6:5; 13915:3$ 15193:61 !SS2.*#rite 7M&/sec8 3023:23 6699:2$ 111.;:36 !SS2.*read 7M&/sec8 .9;6:02 9515:$$ 13;65:60 ior -i 2 -p -d 10 -w -r -e -t 16m -b 32G -o /ibm/fs2-16m/shared/ior//iorfile -i N repetitions -- number of repetitions of test -d N interTestDelay -- delay between reps in seconds -w writeFile -- write file -r readFile -- read existing file -e fsync -- perform fsync upon POSIX write close -t N transferSize -- size of transfer in bytes (e.g.: 8, 4k, 2m, 1g) -b N blockSize -- contiguous bytes to write per task (e.g.: 8, 4k, 2m, 1g) -o S testFile -- full name for test " word of caution # The achieved n#m!ers depends on the right %lient config#ration and good Jnterconnect and can vary !etween environments. They sho#ld not !e #sed in 2FJ4s as committed n#m!ers" rather to demonstrate the technical capa!ilities of the Prod#ct in good conditions 22 2014 IBM Corporation +ead +ead Benchmark Benchmark " word of caution # The achieved n#m!ers depends on the right %lient config#ration and good Jnterconnect and can vary !etween environments. They sho#ld not !e #sed in 2FJ4s as committed n#m!ers" rather to demonstrate the technical capa!ilities of the Prod#ct in good conditions 2* 2014 IBM Corporation ,rite Benchmark ,rite Benchmark " word of caution # The achieved n#m!ers depends on the right %lient config#ration and good Jnterconnect and can vary !etween environments. They sho#ld not !e #sed in 2FJ4s as committed n#m!ers" rather to demonstrate the technical capa!ilities of the Prod#ct in good conditions 24 2014 IBM Corporation GPS Parameters e*"lained GPS Parameters e*"lained General Parameter odern Servers have m#ltiple emory regions that are attached to a given socDet. !y defa#lt Lin#. allocates data for a given process from only * 0;- 2egion This Parameter tells GPFS to ro#nd ro!in across all regions to not r#n into a o#t of memory condition when yo# reached the limit of one of the regions while the remaining still have plenty of memory left. mmchconfig numa5emoryInterleave6yes page pool defines the amo#nt of physical memory that sho#ld !e pinned !y GPFS at start#p. it is #sed in vario#s places of the code" !#t from a Performance perspective its re$#ired to cache data and metadata o!Tects Aindirect !locDs" directory !locDsB. mmchconfig pagepool6$'g >efines the ma.im#m n#m!er of E#fferdescriptors. for data !locD Af#ll !locD or fragmentBor directory !locD yo# want to hold in the cache yo# need to have e.actly * mmchconfig ma78uffer9escs6%m Percentage of page pool #sed for file prefetching needs to !e less than the defa#lt of <7C since most of the page pool was given to G02. mmchconfig prefetchPct6* -llow largest possi!le GPFS !locD si)e and G02 vdisD tracD si)e mmchconfig ma7-loc!si:e6(&m 0#m!er of recent J,s whose target address and response times are recorded. >efa#lt 5*<. mmchconfig io;istory0i:e6&+! defines no of #lticlass / 0on'critical worDer threads to !e started mmchconfig ma7General1hreads6(%'/ ma.EpS affects the depth of prefetching for se$#ential file access. Jt sho#ld !e set at least as large as the ma.im#m e.pected hardware !andwidth. mmchconfig ma758p06(&/// 20 2014 IBM Corporation GPS Parameters e*"lained GPS Parameters e*"lained ;ouse!eeping . cache related settings syncJntervalStrict defines if we sho#ld only follow the syncJnterval Adefa#lt 37B val#e rather than the main interval of the ,S triggered sync " which happens on lin#. every 5 seconds. this has a very !ig positive impact on worDloads with !#ffered writes. mmchconfig syncInterval0trict6yes These are all a!o#t cleaning MfilesM so ,penFile o!Tects can !e stolen and re'#sed. To steal an ,penFile o!Tect the whole file Adata Q metadataB m#st !e fl#shed. fl#shed>ataTarget& no of ,penFile o!Tects where data have !een fl#shed already fl#shedJnodeTarget& no of ,penFile o!Tects where data Q metadata have !een fl#shed ma.File%leaners& no threads fl#shing data and/or metada mmchconfig flushed9ata1arget6(/%+ mmchconfig flushedInode1arget6(/%+ mmchconfig ma7<ileCleaners6(/%+ These are cleaning data !#ffers" so sync doesn4t have to fl#sh data !locDs mmchconfig ma78ufferCleaners6(/%+ 0#m!er of GPFS log !#ffers. ?aving lots of these allows the log to a!sor! !#rsts of log appends. For systems with large page pools A* G or moreB" log !#ffers are the si)e of the metadata !locD si)e" and there is a separate set of s#ch !#ffers for each file system. >efa#lt 3. mmchconfig log8ufferCount6%/ GPFS log fl#sh controls. When the log !ecomes logWrapThresholdPct" the log fl#sh code is activated to fl#sh dirty o!Tects so the log records that descri!e their #pdates can !e discarded. This percentage defa#lts to 57C" and altho#gh there is some code to allow changing it" modifying this val#e is not s#pported !y mmchconfig. Log wrap will start logWrapThreads fl#sh threads Adefa#lt @B" which will fl#sh eno#gh dirty o!Tects so the recovery start position can !e moved forward !y logWrap-mo#ntPct percent Adefa#lt *7CB. mmchconfig log4rap"mountPct6% mmchconfig log4rap1hreads6(%' 21 2014 IBM Corporation GPS Parameters e*"lained GPS Parameters e*"lained 0#m!er of active allocation regions for disD allocation. Larger n#m!ers can improve allocation performance" !#t high n#m!ers sho#ld not !e #sed for large cl#sters. >efa#lt is 9. mmchconfig ma7"llocegionsPerNode6$% Si)e of the pool of threads that completes file deletions in the !acDgro#nd. >efa#lt is 9. mmchconfig ma78ac!ground9eletion1hreads6(& a.im#m n#m!er of threads that prefetch inode toDens of deleted files to speed #p file creates. >efa#lt is @. mmchconfig ma7Inode9eallocPrefetch6(%' a.im#m n#m!er of sim#ltaneo#s local GPFS re$#ests. >efa#lt 9@. mmchconfig wor!er(1hreads6(/%+ ma.FilesTo%ache sho#ld !e set fairly large to assist with local worDload. Jt can !e set very large in small client cl#sters" !#t sho#ld remain small on clients in large cl#sters to avoid e.cessive memory #se on the toDen servers. The stat cache is not effective on Lin#." so it sho#ld always !e small. mmchconfig ma7<iles1oCache6(%'! mmchconfig ma70tatCache6*(% a.im#m n#m!er of threads that prefetch inode toDens of deleted files to speed #p file creates. >efa#lt is @. mmchconfig ma7Inode9eallocPrefetch6(%' Pre'steal some page pool space to red#ce the latency of ac$#iring a free !#ffer.preSteal%o#nt is the option to specify a hard n#m!er vs Pct. the way it worDs is if set to *7777 " 5777 go to 3<D" <577 to *:D" *<57 to @D " .... mmchconfig pre0tealCount6(/// mmchconfig pre0tealPct6( 22 2014 IBM Corporation GPS Parameters e*"lained GPS Parameters e*"lained syncEacDgro#ndThreads define how many threads in parallel are allowed to r#n to fl#sh data d#ring reg#lar sync intervals. >efa#lt *:. syncWorDerThreads no of threads in parallele to fl#sh data d#ring e.plicit sync Async command" or crsnapshot" or #nmo#nt" ...B mmchconfig sync8ac!ground1hreads6&+ mmchconfig sync4or!er1hreads6%*& These Settings infl#ence the inode Prefetch !ehavio#r for Mls 'lM JnodePrefectFirst>ir!locD set to MyesM to have inode prefetch read the first !locD of each s#!dir as well. >efa#lts to no. JnodePrefetchThreshold defines how many stat4s we wait for !efore start prefetching inodes" defa#lt is 5" maDe it smaller to start inode prefetch sooner. JnodePrefetchWindow define how close together in time the stat4s have to !e to trigger inode prefetch" defa#lt is 7.5 seconds which means the 5 stat4s all have to within half a second of each other" otherwise we4ll ignore them. yo# need to maDe it larger to trigger inode prefetch even if stat4s are coming in more slowly #nits are in milli seconds. e.g." setting it to <577 will maDe the window !e <.5 seconds mmchconfig InodePrefect<irst9ir-loc!6yes mmchconfig InodePrefetch1hreshold6* mmchconfig InodePrefetch4indow6*// General n#m!er of inode prefetch threads to #se. >efa#lt @. mmchconfig wor!er$1hreads6$% pitWorDerThreadsPer0ode specify how m#ch threads do restripe" data movement" etc ... >efa#lt is threadsPer0ode ( J0A*:" An#m!er,f>isDs 5 9B/n#m!er,f0odes U *BB so *:" or less if there are fewer than a!o#t fo#r L;0s mmchconfig pit4or!er1hreadsPerNode6(& 23 2014 IBM Corporation GPS Parameters e*"lained GPS Parameters e*"lained Prefetch-ggressiveness defines how aggressive to prefetch data 7 means never prefetch * means prefetch on <nd access if se$#ential < means prefetch on *st access at offset 7 or <nd se$#ential access anywhere else 3 means prefetch on *st access anywhere Jn 3.3" the defa#lt was 3 Aprefetch,nFirst-ccessB" which means it wo#ld always prefetch immediately" even if the first access is in the middle of the file. Jn GPFS 3.9" the defa#lt is < Aprefetch0ormalB" which means if yo# start reading at the !eginning of the file" it will start prefetching immediately" !#t if yo# start reading somewhere in the middle of the file" it waits #ntil the second read to confirm that the access is se$#ential !efore it starts prefetching. With the setting of * Aprefetch,nSecond-ccessB" it will wait for a second read" even if the first read was at the !eginning of the file. since 3.5 yo# can specify read and write aggressiveness independent. mmchconfig prefetch"ggressiveness6% mmchconfig prefetch"ggressivenessead6=( mmchconfig prefetch"ggressiveness4rite6=( ignorePrefetchL;0%o#nt tells the 0S> client to not limit the n#m!ers of re$#ests !ased on the n#m!er of visi!le L;04s Aas they can have a large n#m!er of physical disDs !ehind themB and rather limit !y the ma. to n#m!er of !#ffers and prefetch threads.>efa#lts to no mmchconfig ignorePrefetch2UNCount6yes 24 2014 IBM Corporation GPS Parameters e*"lained GPS Parameters e*"lained Communication elated Parameter tscWorDerPool defines no of threads per class of receive worDers mmchconfig tsc4or!erPool6&+ nsdJnlineWritea. defines the ma.im#m allowed single io si)e to #se Jnline writes.>efa#lts to *D mmchconfig nsdInline4rite5a76$%! This needs to !e set larger than the defa#lt for server nodes that may have connections to many clients" since it indirectly controls the n#m!er of T%P connections managed !y each receiver thread. mmchconfig ma7eceiver1hreads6$% 2>- Port config#ration mmchconfig ver-sPorts6>ml7+?/.( ml7+?/.% ml7+?(.( ml7+?(.%> ena!le 2>- in general" if this is set to disa!le all 2>- comm#nication is sh#t off mmchconfig ver-sdma6ena-le defines minim#m si)e of a PacDet to #se 2>- " also see nsdJnlineWritea. mmchconfig ver-sdma5in8ytes6(&! T#rns ver!sSend on" a low level JE inline transfer method mmchconfig ver-sdma0end6yes a. n#m!er of o#tstanding transfers at a time per connection mmchconfig ver-sdmasPerConnection6%*& a. n#m!er of o#tstanding transfers at a time for the entire node mmchconfig ver-sdmasPerNode6(/%+ ?ow m#ch dedicated P-gepool for ver!s comm#nication mmchconfig ver-s0end8uffer5emory586(/%+