EMC World May 2010

Navisphere Analyzer
Purpose of the script: EMC Navisphere Analyzer allows you to view storage-system performance statistics in various
types of charts. These charts can help you find and anticipate bottlenecks in the disk storage component of a computer
system. Today's session will take a look at the Navisphere Analyzer user interface. In particular, it will cover the
different views available and provide some basic starting points for checking whether your existing configuration is being
stressed or is working in a well-utilized manner. This script is designed for use with Navisphere Manager 6.x software.

Please do not alter any of the workstation or CLARiiON Storage System configuration details unless instructed
to do so by these instructions or by a member of the EMC presentation team.
In this session there are primarily two exercises: one covering archive retrieval and viewing, and one covering on-array real-time analysis.
In addition to the instructions for these exercises, you will find more exercises and reference material in this handout.
If you have time to do so, please explore those additional sections during this session.

Before you begin:


Fill in the following information:
a. Assigned Array from the Desktop ICON array.txt on your laptop: Array _________ SP ___
b. Storage System IP address to use ___.___.___.___
c. Proceed to exercise A

Exercise A, NAR file viewing offline


This exercise directs you through checking the status of data logging on your array, retrieving an
archive file containing performance data, and then looking at that data using Navisphere off-array
software. This is primarily a walk-through exercise to gain familiarity with the steps involved in
performance analysis. The second exercise will cover more details about the metrics you are looking
at.
The NAR file is from a test environment where a total of 12 tests were run. You will clearly be able to
segment the statistics into 12 areas. The odd-numbered tests used a single thread to each LUN
and the even-numbered tests used 4 threads per LUN. The essence of this exercise is to gain
familiarity with the interface, as well as to identify that increasing load on the array has various effects.

What you do

Notes & observations

Start Internet Explorer <Enter the IP address of your assigned managed node (SP IP address) into the
address window of the browser>

The process of getting started with Navisphere Manager is simple. You begin by pointing your browser
at the storage system's IP address.



Login <Enter the user as emcw and
password emcw>

You will be presented with the standard Navisphere view of the


Domain you logged into.
<Select Tools -> Analyzer -> Data
Logging>
Check the logger is running and that
periodic archiving is set as you want it. If
checked, it saves the archives every 5
hours with the default 120-second
sampling, or every 2.5 hours if using
60-second sampling.

If you do not have Analyzer installed


on the array, you can still invoke
logging but this will be limited to 7
days of periodic archiving and will
create encrypted archives for service
use.

<Select Cancel>
If running pre-version 24 array
code, the logging feature operated
differently, so you will have to
manually enable statistics logging at
the SP level.

With Rel-24 and above, the logger will automatically enable or


disable statistics logging as required.
Statistics logging is the process whereby the array collects
statistics for each object within the storage system. The logger is
required to facilitate collection of those statistics; however, if the
logger isn't running, you can view a subset of statistics using the SP
Properties view within Navisphere, or collect some raw statistics
using Secure CLI commands.



<Select Tools -> Analyzer -> Archive
-> Retrieve>

The dialogue will present the current


repository contents for the selected
SP.
<Scroll down to find the file called
emcw_2010_xxx-xx.nar
Select that file and click on Retrieve>
Note the location where the file will be
stored.
You will see the status of the operation
in the lower pane as the file is
uploaded to your workstation.
<Close this dialogue with Done, then
close the current browser instance>

The newest file listed could be up to 5.5 hours old, so you may need to
Create New to force the logger to create a new archive containing
recent statistical data from its buffer.
You have the option of retrieving archives from SP-A or SP-B, and
although they should contain almost identical data, it is worth
retrieving from both SPs in case there's any problem with viewing
one of the files, or in case one SP was rebooted during an archive
and is missing samples from that time.


We are now going to use off-array


Navisphere to view the archive we
retrieved in the previous operation.

We could use the array to view the archive; however, usual practice is
to view archives independently of having an array resource available.

Recommended software components
for off-array operations enable
viewing of Analyzer data.
Ensure the Navisphere Management
Server service is running (Start,
Settings, Control Panel,
Administrative Tools, Services).
The service is called NaviGovenor.

Note: this is only required for off-array management and offline
Analyzer archive file viewing.

<Start Off-Array UI - double click


desktop ICON labeled
OffArrayUI>
If not available, you can explicitly run the off-array management UI by
selecting START, RUN and pointing to the following
link: "C:\Program Files\EMC\ManagementUI\6.29.x.x\WebContent
\start.html"
<Enter Management Server IP
address 127.0.0.1 and use default
port of 80/443>

127.0.0.1 is the localhost address.


Alternatively you can use the IP
address assigned by the DHCP
server.

Login <Enter the user as emcw and


password emcw>

You will be presented with the standard Navisphere view of the


Domain you logged into.


<Select Tools -> Analyzer ->
Customize>

Customize is only
required once for the
off-array
environment.

In the General TAB view, check that


the Advanced box is ticked.

Customize is also
available for the
array environment so
when you set an
array option, it will
remain set for
anyone logging into
the array for viewing
real-time data
covered in the next
exercise.

In the Archives TAB view, check the


default path for archives, and the
Performance Survey for the initial
view, and check the Initially Check
All Tree Objects box is ticked.
When you have many objects you may
wish to be more granular in your
selections and choose not to initially
check all objects.

In the Survey Charts TAB view,


check the Utilization, Response Time
and Average Queue Length are
selected and the values shown for
each are present.

Normally we might
suggest a threshold
of 10 samples, but for
this exercise we'll
use 4 for the off-array archive file
we're looking at.

<Click OK> to use these

settings.

Note: When viewing the analyzer


standard performance detail view,
there are 4 windows in the view.
These are the object (top left), value
(bottom left), plot (top right) and plot
item list (bottom right). To display
the values available for a given
object, you must select that object.
Selecting a plot item will highlight the
plot associated with that selection.
This will be useful when selecting
many items to view.

We'll see this view later in the exercise.



<Select Tools -> Analyzer -> Archive
-> Open>

You can merge NAR files to
cover a longer period of time;
however, when opening a large
NAR file, it can speed up the
interface to focus on a
shorter time period.
The merge option is referenced
later in this paper.

Open file emcw2010_1_xx-xx.nar .


Use the default time points. Select the
file in location C:\Documents and
Settings\emcperf.
Leave the default start and end time
for this exercise however when doing
this on a NAR file from your own
storage system, you may want to
narrow the time display to make it
easier to view specific activity.
<Click OK>

It is not recommended to
merge SP-A and SP-B NAR files
together, as they should contain very
similar data anyway.

You should now have the
Performance Survey view if you set up
the default open view as shown in
prior steps.
<Scroll in this view to see if anything
is highlighted by red boxes, indicating
a threshold set in the configuration
has been exceeded>.
A RED or YELLOW Utilization box
may be an indication of concern,
especially if the Response Time
and/or Queue Length are also RED.
Make a note of any suspected LUNs in
the space below:

Make some notes on what you see in the Survey Chart.

____________________________


<With the pointer in the Utilization


display for LUN 50, either double
click, or right click and select
Performance Detail. You will now see
a graph showing LUN 50 Utilization>
From this display, we want to check
a few things:
SP Utilization - the SPs should be about
the same if the load is well balanced.

<You need to expand the LUN
using the + to see the SP check
box - check this to add SP detail>

Uncheck the LUN to increase the scaling of the SP Utilization if


necessary. Is the load between the two SPs balanced?

<In the Performance Detail View,


right click on the SP and select
Properties. Check that under the
Cache Tab, Read and Write cache are
both enabled, and check the cache
page size>

Check the total memory allocated to
read and write cache and that both
are enabled. A reference to
allocation of cache memory can be
found in the CLARiiON Best
Practices Guide, although it is typically
recommended to reserve up to 20%
of total available cache for read
cache, and the rest for write cache.
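As a rough illustration of that 20%/80% guideline, the snippet below splits a hypothetical cache allocation (the 3600 MB total is made up for illustration, not a value taken from this lab or any particular array model):

# Hypothetical example of the ~20% read / ~80% write cache split guideline.
total_cache_mb = 3600                         # assumed total cache available for allocation
read_cache_mb = int(total_cache_mb * 0.20)    # reserve up to ~20% for read cache
write_cache_mb = total_cache_mb - read_cache_mb
print(read_cache_mb, write_cache_mb)          # 720 2880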

Now we want to check the cache dirty
pages - these are pages of data held
in memory during writing. If these are
very high, it could be an indication of
a problem we need to work on.

If the Dirty Pages (%) were
consistently high, this may be
an issue we'd need to look at
more closely. Maybe the watermark
settings would need
changing? Maybe the cache
allocation would need
changing? You need to
consider the write load on the
array and its duration, combined
with distribution at the disk
level on the back end; after all,
the disks govern the speed at
which we can de-stage data
from dirty pages. Adding
more spindles to a particular
application can help de-stage
data quicker for write-
intensive applications.

<make sure you have the SP selected


with the pointer i.e. SP A is shown
highlighted. Now scroll in the lower
left window down to the Dirty Pages
(%) property check the box>
Write cache works on a policy of
watermarks. Here we see the dirty pages
around 60% to 80%, which indicates that
watermark processing is working well.
You can see how the watermarks are set
by selecting the SP Tab, then right-
clicking the array and selecting the
Performance Overview view.

<Now uncheck both SPs and the


Dirty Pages box>


Now let's look at the LUN details.

<Click once on LUN 50, then in the lower
left window, un-check Utilization,
then check both the Read Bandwidth
and Write Bandwidth boxes.>
Simultaneous reading and writing on
a single LUN can be challenging.
Let's look at more detail.
<Now select the property Forced
Flushes/s>
If none are seen, that is a good reason to re-check
that cache is on, although none also
indicates that write cache is not being
worked too hard.

Although we don't have any forced flushes here, note that write throughput is
the reported write cache hits combined with any forced flushes, i.e. a
write causing page(s) of write cache to be flushed to make room for
the write does not count in the write cache hit total (unless the write size
satisfies the write-aside value and bypasses cache - more performance
architecture understanding is required if that wasn't clear).

Let's just check the LUN properties.

<Right click on LUN 50 and select
Properties>
You want to know the RAID type, number
of disks and user capacity.
Also check the stripe Element Size is
as expected - 128 is normal for
striped RAID as per the CLARiiON
Best Practices Guide.

Under the Cache TAB you may see the read and write cache enabled
boxes empty - this can indicate cache wasn't enabled for the LUN, but
in this instance it was. Always check the current code release notes for
known issues with the interface.

You should also check what LUNs


share the same disks to see if multiple
hot LUNs are due to disk contention
on the backend.
<In the Performance Detail view,
click on the Storage Pool TAB at the
top then you can expand each RAID
Group to see what LUNs share the
same disks>

As you will see, some of the suspected RED LUNs shown in the
Performance Survey view are sharing disks in the second half of the
test.


<Select the LUN TAB at the top of


the window>
<Now deselect Forced Flushes/s,
Read Bandwidth and Write
Bandwidth>
<Select Read Throughput I/O/Sec
and Read Cache Hits/s>
None would indicate either random
access or reads too big for pre-fetch.
Check the IO size to the LUN.
<Right click on LUN 50 and select IO
Size Distribution Summary>
<In the view, you can quickly select
all values by right clicking in the left
pane and click on Select All
Values>
You can see here all IOs are small (4KB), both read and write - this
indicates a totally random profile, as no read cache hits were seen.
<Close the IO Distribution Summary
window> and <de-select Read
Throughput I/O/Sec and Read Cache
Hits/s>

As we saw from the
IO Distribution
Summary, the writes
were small and no Full
Stripe Writes were taking
place; this suggests the
writes are also random,
as no full-stripe
coalescing in cache is
taking place.

<With LUN 50 selected, select the


Full Stripe Writes/s check box>
<None of the LUNs are doing FSWs
in this archive>
Another useful view would be the IO
Size distribution detail.

We could get some
write IO coalescing
taking place, resulting
in larger than 4KB
writes at the disk layer -
we would need to
check write IO size at
the disk to validate
that.

<Right click LUN 50 and select IO


Size Distribution Detail>
<Select Read and Write IO size of
4KB only as we confirmed that as the
only IO size used by checking the IO
Size Distribution Summary view
previously>

This view is useful to see the read/write ratio over time.


<Close the IO Distribution Detail


window. Also, with LUN 50 selected,
uncheck the Full Stripe Writes/s box.
Also, uncheck the LUN 50 box as
well>

High disk utilization means
we are working the disks
well. Low utilization would
indicate additional load
could be placed on the
drives with consistent
service and response times.

Now expand LUN 50; we can see 5
disk drives.
<Select the last disk check box, then
in the parameter window, select
Utilization and Average Seek
Distance>

What do we see here? Disk seeks are a few GB, indicating a moderate
level of randomness. Uncheck Utilization to get a better view of the
seek distance (or zoom in).
<Uncheck Utilization and Average
Seek Distance>

Some 32KB writes
may be seen; these
could be
protection bits
being set rather
than coalesced
user data.

<Select LUN 50 and then check


Write Size>
As you can see here, over time the disk
IO size tracks the LUN IO size - again,
indicating a random workload with little
or no coalescing of data. If disk IOs
were bigger, that would be a good indication of
coalescing taking place - always check
you have write cache enabled.

Don't forget to check you have write cache enabled (LUN and Array).

<Uncheck Write Size and also


uncheck LUN 50>
<Check both of the first 2 disks in
LUN 50 and then check the Total
Throughput for these disks>

The disks at varying times are working very hard. We can see them
reaching over 350 IOPs per disk. Now, for small random IO we have a
rule of thumb (ROT) stating a 15K rpm disk can be used for 180 IOPs
of mixed random load with good response time. When running them
at higher loads we can expect an impact on the response time observed.
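To make that rule of thumb concrete, here is a rough sanity check using the 5 drives under LUN 50 and the 350 IOPs per disk observed above (a sketch only; real capability depends on IO size, read/write mix, RAID type and cache behaviour):

# Rough sanity check based on the 180 IOPs-per-15K-disk rule of thumb (ROT).
disks_in_lun = 5              # LUN 50 sits on 5 drives in this archive
rot_iops_per_disk = 180       # small random IO at good response time (15K rpm)
observed_iops_per_disk = 350  # peak seen in this archive

comfortable_group_iops = disks_in_lun * rot_iops_per_disk   # ~900 IOPs for the group
print(comfortable_group_iops)
if observed_iops_per_disk > rot_iops_per_disk:
    print("Disks are being driven past the ROT - expect higher response times")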


<Uncheck Total Throughput and also


uncheck the second disk drive>
<With the first disk drive highlighted,
check the Queue, Average Busy
Queue Length and Service Time
boxes>
The Average Busy Queue compared to the
regular Queue can give an indication
of burstiness; however, at the disk
level, activity includes de-staging
writes from cache, which can arrive in
bursts.

Always check the release notes for


known issues relating to accuracy of
statistics.

The disk service time is how long the disk is taking to service each
request. If IO gets queued at the disk, the disk response time is then
roughly the service time multiplied by the queue depth. Therefore, the
higher the queue, the longer the response time.
A point to remember though is that writes will typically be serviced by
cache and have a very fast response time. Reads will have a more
directly impacted response time as disk queues increase. Of course,
this all depends on IO size and also on how writes are being de-staged
from cache and how efficiently cache is working to optimize that
process.
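As a simple worked example of that relationship (the numbers below are made up for illustration, not taken from this archive):

# Illustrative only: disk response time grows with queue depth.
service_time_ms = 5.0        # assumed time for the disk to service one IO
for queue_depth in (1, 2, 4, 8):
    response_time_ms = service_time_ms * queue_depth
    print(queue_depth, response_time_ms)   # 1 -> 5ms, 2 -> 10ms, 4 -> 20ms, 8 -> 40ms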

<Uncheck Service Time>


<Uncheck Queue>
<Check Response Time>

You can see the Response Time follows the average busy queue depth
as the service time was pretty stable.
<Uncheck all objects>
Now select the SP TAB, click on SP-A
and select Total Throughput.
You can see that as the load increases,
so does the overall throughput of the
array, i.e. more threads per test, higher
IOPs; more LUNs tested, more IOPs.
The second set of tests, where you see
the IOPs starting around 270 and
increasing steadily to 1286, are where
both SPs are being used, so the
aggregate IOPs would be higher still.

Those peaks seen at the start of tests are normally writes being
absorbed into the protected write cache until watermark processing
starts to write data to disks on the back end.


<Uncheck SP A and Total


throughput>
Select the LUN Tab.

Now check LUNs 60 and 61. These


are both using the same disks and you
can see the impact when both LUNs
are under load as the LUN response
time increases.
Select the disks for LUN 60.
This results in fewer aggregate IOs
across both SPs due to an increased
queue as well as a small increase in
seek distance at the drive level.

The other small peaks seen here are associated with disk
statistics and the SNiiFER process where there is no host load
accessing the disks. This is a process that validates data
availability in the background (performing 512KB read operations
at 1 IO/s). Take a look at the disk read size and you will see.


Summary of Exercise A
We got you to look at LUN 50 as it
was used in all 12 tests. In the first test
area you see we get moderate
throughput at the disks, but when we
increase the threads accessing the
same disks, we get much better work
from them. The detrimental effect of
driving a higher load is an increase in
response time to the application, due
to the increased queuing at the disks
(go back and take a look if you have
time).
Now, as we add more load to more
disks, the same effect can be seen
between the single-thread tests and
the 4-thread tests, i.e. per-disk IOPs is
higher if we have more processes
accessing the LUNs. In all tests where
we have a single thread per LUN, the
per-disk IOPs is low compared to the
4-thread tests. This highlights some
key performance notes:
Concurrency, when using small IO
sizes, is essential for good
performance i.e. multiple
threads/processes.
Also, as we observed in the SP
statistics, overall array performance
scales with how many LUNs are being
accessed concurrently, so it was clear
to see as more LUNs were busy, the
overall throughput increased.

Here we see the 6 distinct areas of the first 6 tests performed on one
Storage Processor. Each lower level shows the single-thread
performance for 1, 2 & 3 LUNs. The higher peaks represent
throughput when 4 concurrent threads per LUN are generating IO.

Do not expect to get maximum
performance from an array unless you
have the necessary disk count to
service the load. Please reference
CLARiiON Performance and
Availability: Applied Best Practices
for scaling capability guidance for
each array type.
When you have finished Exercise A, close down
the off-array browser and proceed to
Exercise B.

Exercise B, Analyzer Statistics Viewing Real-Time
This exercise is to direct you around some of the views in Analyzer while the array is under a
simulated load from a Windows server. You'll be directed to look at some of the key statistics
that indicate whether a system is functioning within acceptable parameters. This exercise is to extend
your experience and expand upon some descriptions of those statistics you are looking at. Select
and deselect components and statistics to overlay graphs, but consider the scale of selections -
with high IO/s on the same graph as disk queue length it will not be easy to distinguish queue
variation. If you have time you'll be directed to look at some specific statistics in order to
determine where there is a problem with the current load on the array.

What you do
Start Internet Explorer <Enter the IP
address of your assigned managed
node ( SP IP address ) into the
address window of the browser>

Script
Repeat as in the first steps in exercise-A

Login <Enter the user as emcw and


password emcw>

You will be presented with the standard Navisphere view of the


Domain you logged into.

Also refer back to Exercise-A for


customize options required to be set
on the array

You have already looked at the logging mechanism so now we want to


start viewing real-time statistics.

If only the Local Domain is shown,


expand the view by clicking on the +
by the Domain icon.

Performance statistics can be viewed for individual components or you


can select the storage system and then view a selection of components.
To get this window, you can select the array, SP, raid group, Thin
Pool, storage group, LUN, Thin LUN or disk to choose which Analyzer
view to look at. Here we'll select the array to present all objects
available.

<Right Click on your array and move


the pointer over the Analyzer
selection to see the expanded list of
options>
<Select Performance Survey>

The Performance survey view will start to plot current statistics based
on a 60 second sample period please wait until you have at least two
plots to continue i.e. wait at least 2 minutes for the plots to show.


The objective here is to watch the real


time view develop and start to look at
some of the performance statistics
that are being logged.

If you set up the survey
chart thresholds as
instructed earlier, you will
start to see green, yellow or
red boxes appear. These give
you an indication of where
to start looking for possible
performance issues.

You have to wait for 2 samples to get
data plotted. Each sample is an
average of statistics between each
sample, except for write cache dirty
pages, which are an absolute value at
the sample point.
We'll have a look at how to view some
of the key statistics used in analyzing
array performance.

Exercise A gave you familiarity with the interface, and through this
exercise we'll expand upon what some statistics mean.

Utilization - LUNs, SPs and Disks


Expand the LUN
component to
reveal the disks
and the storage
processor that
currently owns
this LUN

In the Performance Survey view you


can double click on a graph to open
the Performance Detail view for that
statistic. Then you can select more
components to view, that will be
placed on the same graph.
Try it - pick one of the utilization
graphs in the Performance Survey
view and double click on it.
SP Properties

The SP Properties view is limited
within a NAR file compared to
the same view when
connected to an array in real-time, as displayed here.

Right click on the SP and select
Properties. Ensure cache is allocated
and enabled.
Total Size indicates the possible
maximum - look at the Read Cache
Size and Write Cache Size for
allocated cache.


Dirty Pages
Dirty pages are protected write cache
data that hasn't been committed to
disk yet.
To see the appropriate value selections
available in the lower left part of the
detail view, you must select a
component item in the upper left part
of the view. Dirty Pages will only be
an available option when you have
clicked on a Storage Processor (SP).
Dirty pages that peak at 99% indicate
cache saturation, resulting in forced
flushing that can hurt performance.
We'll look at LUN forced flushes later
in this exercise.

To help when
viewing a graph
plot, you can click
on the legend item
in the lower right
window pane and it
will highlight that
statistic in the
graph. Also, you
can customize the
graph views by
right-clicking on
the graph and
selecting Chart
Configuration

LUN Bandwidth
Remember to
uncheck
previously
viewed
selections to
change the
graph scale,
unless you
need to see
how one
statistic plots
against
another one.

Selecting both read and write


bandwidth tells us about the load on
the LUN however you will need to
check IO sizes and data locality to
determine if the values seen are
expected based on the load.
We can check locality by looking at
seek distances at the drive level later
in this exercise.

LUN Forced Flushes


It's very
important to
see if any
forced flushes are
taking place.

Forced flushes are an indication of
write cache saturation - if you have
many forced flushes taking place, this
will impact the system, seen as
increased SP utilization as well as
increased response times.
Although dirty pages may not have
shown 100%, that statistic is
an absolute value at the sample time.
If you are seeing forced flushes taking
place, then that indicates the cache
pages were 100% dirty at some point.



LUN Read I/O/sec
Read hits are when a
host read comes in
and the data is already
in cache.
Remember also that
pre-fetch activity may
span sample periods
i.e. pre-fetched data in
one period may not be
read until the next
period.

Looking at the IO/sec and Read


Cache Hits/Misses, you can tell if the
read pre-fetching is working.
A high ratio of read cache hits per
second to LUN Read IO/sec is a good
indicator of pre-fetching working. You
can directly see this ratio by looking
at the Used Prefetches %.

LUN IO Size distribution summary


This will enable you to determine
where your host IO sizes fit.

In the lower left


pane, you can
choose to show the
I/O rate at each
size. The default is I/O
count, which means
the total I/Os for
this sample period.

Right click LUN-2 in the Detail View,


select IO Size Distribution Summary,
then in that view, you can select all
values by right clicking in the value
pane of the window on the left.
Right-click / Select All / Values

This is a histogram where each column represents IO in the range
from that size up to the next size minus 1 block, e.g. in the view here, we see a
value for reads that are 8KB and above, but lower than 16KB in size.
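A small sketch of how such a histogram groups IO sizes, assuming power-of-two bucket boundaries like those shown in the view (the exact boundaries in your interface may differ):

# Sketch of the bucketing described above: each column covers [size, next_size).
boundaries_kb = [0.5 * 2**i for i in range(12)]   # assumed buckets: 0.5KB .. 1024KB

def bucket_for(io_kb):
    # Return the largest boundary that the IO size meets or exceeds.
    label = "<0.5KB"
    for b in boundaries_kb:
        if io_kb >= b:
            label = f">={b:g}KB"
    return label

for io in (4, 8, 12, 15.5, 16):
    print(io, bucket_for(io))   # 8, 12 and 15.5KB all land in the >=8KB column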
LUN Write Size and Full Stripe Writes
Another method to
detect sequential
write access is
comparing disk
write IO size with
LUN write IO size,
i.e. cache
coalesces smaller
IO into fewer,
larger IO when de-staging data.

You can view these back in the


Performance Detail view to see over
time, if coalesced writes are resulting
in full stripe writes to the LUN. This
indicates that write cache is working
well and some writes are sequential.
If no Full Stripe Writes are seen,
writes to this LUN are more random
and small, with little or no locality.

Looking at the average LUN IO size for read or write in the detail
view can be misleading as it will be an average and a low write IO rate
will not be accurately shown. You really need to use the IO
Distribution Summary for the LUN to see the IO distribution.


Disks - Average Seek Distance &
Utilization
This will give an indication of data
locality and whether the disk is working
hard. Be aware that a disk that shows
100% utilized isn't necessarily bad, as
the sample rate just indicates the disk
was never idle, and it is reported from
the highest SP (the other SP may have
some more usage, up to 100% also).
You could look at the disk Average
Busy Queue Length and compare it
with the Queue Length. If the
average is bigger than the reported
Queue Length, this may be an
indication of bursty activity.
LUN & Disks write size

Cached writes may also result in bursty activity at the disk
level as write cache flushes data. The trick is not to let that
activity lead you to think your host activity is bursty when it
isn't.

This can give an indication that
coalescing is taking place in write
cache, such that disk writes are bigger
than the LUN writes.
If we see the LUN and disk write sizes
are the same, typically this implies the
writes are very random and not
coalescing in cache to become larger
IOs - or write cache is not being
used.
LUN 50 will show this but LUN 2
doesn't - can you explain why?
Tip: check the LUN 2 IO Distribution
and write IO rate.

CLARiiON cache is great at optimizing back-end disk access,


particularly of benefit to Raid 5 and Raid 6 options that have write
penalties associated with small block random write activity.


Performance Overview View


Select the SP TAB in the performance
detail view then right click on the
array;
<Select Performance Overview>

The best overview of the
cache configuration - and the only
place you can see the
watermark settings - is the
Overview screen.

In the Overview view you can see more
detailed properties of cache, together
with 3 key statistics for the overall
array - Throughput, Bandwidth and
Dirty Pages. One particularly useful
detail is the watermark settings, as
they are not visible anywhere else
when looking at an Analyzer NAR file
offline from the array it came from.

Don't be fooled by settings
that may have been changed
during the logging period,
though, i.e. it may show
cache as enabled or disabled
here, but you should verify
that with read cache statistics
and dirty pages in the other
views.

The cache states shown here will be


set when the logger started the
current nar/naz file logging so be sure
to determine actual settings from
measured metrics.
Dirty pages on each SP indicates
write cache is enabled at the array
level.
Read cache hits, pre-fetch bandwidth
are some indicators that read cache is
enabled.

Watermark settings are used to intelligently flush write cache pages


out to disks on the backend and keep a level of write cache available
for bursts of activity.
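To illustrate the watermark idea, here is a purely conceptual sketch of how high/low watermarks drive flushing decisions. This is not the actual FLARE flushing algorithm, and the 60/80 thresholds are illustrative values chosen to echo the dirty-page range seen earlier in Exercise A:

# Conceptual illustration of watermark-based flushing (not the real algorithm).
HIGH_WATERMARK = 80   # illustrative dirty-page % that triggers heavier flushing
LOW_WATERMARK = 60    # illustrative % at which flushing backs off again

def flush_mode(dirty_pages_pct):
    if dirty_pages_pct >= 100:
        return "forced flush - host writes must wait for pages to free up"
    if dirty_pages_pct >= HIGH_WATERMARK:
        return "high-watermark flushing - de-stage aggressively"
    if dirty_pages_pct > LOW_WATERMARK:
        return "normal flushing"
    return "idle flushing - keep headroom for bursts of writes"

for pct in (40, 65, 85, 100):
    print(pct, flush_mode(pct))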

Raid Group / Thin Pool


Select the Storage Pool TAB in the
Performance Detail view. Careful not
to get confused here as the Raid
Group and Thin Pool statistics are
derived from disk statistics, not LUN
statistics. Thus, the values will depend
on all activity for all LUNs within that
raid group or all Thin LUNs in a Thin
Pool.
Thin Pools aren't covered in any
specific detail here, although disk
statistics are logged and can be
analyzed in the same way as for a regular
Raid Group.

This reference is more for information, as we're not going to be
looking at raid group statistics specifically in this exercise. The
Storage Pool TAB is the only method to analyze disk activity within a
Thin Pool. Thin Pools do have regular LUNs, but these are considered
private and hidden from view, including from Analyzer.


The 2 exercises have explored the options and views available to you, with some explanation and
guidance on what the statistics mean and how they help in characterizing your IO.
The following section, should you have time to look at it, will guide you through looking at the loads on a
specific set of LUNs sharing the same set of disks.

You have explored the views, now the task is to analyze a specific area where we have an issue.
Now, please explore the interface and
look at the following attributes for this
load on the array with a focus on Raid
Group 0, LUNs 50 & 51. Look at the
following for each of these LUNs and
see if you can draw any conclusions
(make notes on the worksheet table at the back of
this handout);

LUN read and write throughput


LUN read and write IO size
LUN Response Time
LUN Queue Length
LUN IO sizes
Disks read and write throughput
Disk seek distance
Disk IO sizes
Disk Queue Length
Disk response time
Define the IO profile associated with
each LUN and think about what they
can be. There is an area where we do
have an issue that we want to fix.

Look at the profiles for these two LUNs. How are they different and
what would be a suggestion on improving performance?

Hint: one of them is doing large
sequential reads - it could be a video
or data warehousing application.

Using this hint, think about what helps sequential operations
and what could also hurt it.
If it is sequential, are we seeing pre-fetching and a high prefetch
used rate? We do need to know what the application is
trying to do, then correlate that with the characterized IO on the
storage system to see if it is doing what we think it should be
doing.
So, what about the other LUNs that
showed up with red and/or yellow boxes?
As explained during the overview, the
red/yellow boxes give an indication of
where to investigate and are not
indicative of an absolute problem.

Raid Group 4 has multiple LUNs with different IO characteristics.
Check some of the metrics and see if you can conclude anything about
this raid group. Don't spend too much time on this task, as the prior
elements focusing on LUNs 50 and 51 are the main points of this
exercise. If you have time, you might want to take some notes on
LUN 2 and 3 statistics for reference (are they busy? Is that bad?)


Additional reference notes

Although these exercises cover the
performance statistics from hosts
accessing LUNs in the array, there
may be additional load generated
internally within the array. This load
could include the operations listed
below:

SnapView Snapshot sessions


SnapView clones
MirrorView/S activity
MirrorView/A activity
SAN Copy activity
Raid Group rebuild activity
Hot spare equalize activity
LUN migration operations

Typically, layered application IO will


not be logged at the LUN level but
will be visible at the disk level. If you
understand what is taking place, once
accustomed to the user interface and
operational characteristics of layered
applications, you can determine what
disk activity relates to host access or
internally generated IO.

Background zeroing for bind operations


Background verifying

In the metric selection


window, since Release-26 of
code, you will see options for
Optimal and Nonoptimal
metrics for LUNs. These are
used when you have LUNs
using the ALUA failover mode
(mode=4) of operation.

Sometimes, you will observe a blip in


the statistics i.e. a value for a statistic
outside normal range. To overcome
this being a nuisance, you can either
restart the plotting or adjust the
scaling of the graph plots by zooming
in or setting the Chart Configuration
Axes options in the graph view.

Typically, you would see
Optimal values when accessed
from the current owning SP,
and Nonoptimal values when
accessed from the non-owning
SP - a slightly longer path.
When not running in ALUA
mode, selecting either the
regular metric or the Optimal
metric will display the same
values.

Real-time viewing of Analyzer
statistics isn't the preferred method,
due to the requirement to be there at
the time as well as the additional
impact on the array of presenting the
information. Typically you would
look at a captured Analyzer NAR file
as covered in Exercise A.


Supplemental, Command Line NAR file retrieval and export capabilities

This is to direct you to the capabilities to script NAR file retrieval for lights-out performance
statistics gathering. As the Navisphere archive file collects data covering the previous 5 hours of
statistics, the capability to script retrieval of the NAR file is useful when you want statistics for a
period of activity and you're unable to retrieve the file in the normal way using the Navisphere
GUI interface, e.g. statistics logged on Saturday would need to be retrieved sometime Sunday or
they would be overwritten by Monday. With the release of revision 24 and later revisions of
code, the Analyzer Archive facility allows the automatic archiving of Analyzer files on the array
itself for later retrieval via the GUI or CLI process, retained for a much longer period than
the previous 5 hours (or 25 hours for older code archives). Remember though that if Periodic
Archiving is not enabled, you will only grab the prior 5 hours of data by default when you
retrieve the archive.
<Ensure the NaviCLI utility is installed.
This is easily done if the Navisphere
CLI directory is present. Here you can
double click the shortcut on the
desktop called NaviCLI>
This will start a command window
that will go to the default installation
directory c:\Program
Files\EMC\Navisphere CLI
(Username and password will be
emcw for the following commands)
<Retrieve the Navisphere archive files
using the following command;
naviseccli -user <username> -password <password> -scope 0
-address <SP IP> analyzer -archive
-all>
Be careful with this command as you
may have many archives to download
when selecting all, and it could take a
long time to complete. By omitting
the -all you are presented with a
selection list where you can select
one or more archives to retrieve.

Scope will be 1 if the account details used are local and not global.
Do not do this here, but you can reset the statistical data by using the
following command if you are looking to collect data for a specific
test period only and you are not interested in previously collected data;
naviseccli -user <username> -password <pwd> -scope <0|1>
-address <SP IP> analyzer -logging -reset
The username and password can be omitted if you have set up the
security file for naviseccli.
The username used here does not have privileges to reset data logging
on the arrays being used.
Note: the desktop shortcut used here is not created for you during
installation. You have to do that yourself if you want that shortcut
available on your own systems.
Prior to release 24 you would need to use the java archiveretrieve
command to get the archive from the array;
java -jar archiveretrieve.jar -User <username> -Password
<password> -Scope 0 -Address <array IP> -File archive_emc.nar
-Location "C:\program files\emc\Navisphere cli" -Overwrite 1 -Retry 2 -v
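One way to script the "lights-out" retrieval described above is a small wrapper around naviseccli. This is a sketch only; it assumes naviseccli is on the PATH and that the security file mentioned above has been configured so the -user/-password options can be omitted. The SP address and interval are placeholders:

# Sketch: periodically pull Analyzer archives with naviseccli (assumes the
# naviseccli security file is configured so credentials can be omitted).
import subprocess
import time

SP_IP = "10.0.0.1"            # replace with your SP address
INTERVAL_SECONDS = 4 * 3600   # retrieve a little more often than the 5-hour wrap

while True:
    # -all grabs every archive in the repository; expect long run times if
    # many archives have accumulated (see the caution above).
    subprocess.run(
        ["naviseccli", "-address", SP_IP, "analyzer", "-archive", "-all"],
        check=False,
    )
    time.sleep(INTERVAL_SECONDS)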


Now you can follow these steps and
open the retrieved NAR file using the
on-array or off-array capability, or
you can convert the NAR file data to
CSV format for import into Excel.
You can use the following command
to do this;
naviseccli -user <username> -password <password> -scope 0
-address <SP IP> analyzer -archivedump -data test.nar -out
test.csv -object s,l,d
You can also filter the output to only
get specific statistics, like read
throughput, by adding the qualifier
-format rio
This example outputs stats for SPs,
LUNs and disks (-object s,l,d). To get
stats for metaLUNs, etc., please consult
the document Navisphere Analyzer
Administrator's Guide.pdf.

If Excel format is required you can use the archivedump command to
convert the NAR file data to a format readable by Excel, typically CSV.

Some more qualifiers for the -format option are as follows
(separate with a comma if used):
u  - Utilization (%)
rt - Response Time (ms)
dp - Dirty Pages (%)

For other qualifiers please consult the Admin Guide.
If you leave off the -object qualifier, it will output all statistics for all
objects.
The Navisphere UI has an Analyzer dump wizard that guides you
through device and attribute selection prior to dumping to a CSV file.

Start Excel, select open file,
browse to the c:\Program
Files\EMC\Navisphere CLI directory
and select the file type as CSV, then open
the CSV file you created in the
last step to view the statistics as
presented in Excel.
If using Excel 2007, use the INSERT TAB to
display graphing options.
If you're not too familiar with Excel
but would like to plot a graph showing
one of the statistics over time, you can
easily do this by selecting a column by
clicking on the header letter, then,
once the column is highlighted, click
on the chart wizard icon in the tool
bar, select line as the chart type, then
click next to see what the chart would
look like. You can then customize it as
required.

Please note that each device selected in the dump command, like
SP and LUNs will be listed down the left column, so selecting
an entire column to plot would actually plot all SP stats followed
by all LUN stats, and so on. You would need to be more
selective and manipulate the data when plotting graphs in a
logical manner.
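If you prefer to avoid the manual filtering in Excel, the same selection can be scripted. The sketch below reads the dumped CSV, keeps only the rows for a single device and plots one statistic. The column names used here ("Object Name", "Utilization (%)") are assumptions for illustration - check the header row of your own CSV and adjust accordingly:

# Sketch: plot one statistic for one object from an archivedump CSV.
import csv
import matplotlib.pyplot as plt

values = []
with open("test.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row.get("Object Name") == "SP A":             # keep one device only
            values.append(float(row.get("Utilization (%)") or 0))

plt.plot(values)
plt.title("SP A Utilization (%) per sample")
plt.xlabel("sample")
plt.ylabel("utilization %")
plt.show()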


You can try the archivedump
command and be more specific with
some of the qualifiers shown previously.

naviseccli -user <username> -password <password> -scope 0
-address <SP IP> analyzer -archivedump -data test1.nar -out
test1.csv -object s -format u,dp

You could also have a go at the dump


wizard from the Analyzer drop down
in the Navisphere Manager GUI
using off-array Navisphere.

This will output test1.csv containing SP statistics for utilization and
write cache dirty pages.
Another option is archivemerge, used to merge multiple NAR files
together. We don't use that in this session, but remember that it is
useful if you want to view data access trends that span more than the
typical NAR file period of 5 hours.
It is not necessary to merge NAR files from both SPs as each SP has the
same data.

The array-based archivedump wizard
provides an easy way to dump specific
statistics associated with individual
devices, rather than using the CLI
method shown above.
With either on-array or off-array,
select Tools, Analyzer, Archive, Dump.
Then select where the source file is
located and follow the wizard to select
objects to dump and what statistics
you require.


Supplemental, Thin LUN Analysis

This is to highlight some differences in the metrics available for Thin LUNs in a CLARiiON
environment and the way in which we view them.
There is a read/write load running to LUN 201 on the array. This is a Thin LUN provisioned
from a pool of 3 disks.
Thin Pools in a CLARiiON have a private structure that isn't visible in the Navisphere
interface. This structure has private LUNs that Thin LUNs utilize in 1GB increments. With the
experience gained from the primary exercises you can take a look at the active Thin LUN and
how to observe IO to both it and the Thin Pool disks.
Check Thin LUN properties.
<Right click on LUN 201 and select
Properties>
<Here you can see the Pool
properties that this LUN is serviced
from and the Thin LUN virtual size
and actual consumed capacity from
the pool>

<When selecting the Thin LUN, you


do not see cache operations
associated with that LUN as these are
associated with the private LUNs
servicing the IO to the Pool and those
are hidden from view >

These are metrics you will not see when
selecting Thin LUNs to analyze. This
may change in a future release, but for
now you have to look at the Thin Pool
disk characteristics to determine what's
happening in the Pool as a whole.
Regular IO metrics like throughput,
bandwidth, and response time are
available for each Thin LUN.

<Unlike regular LUNs in the LUN


TAB view, you only see the SP a Thin
LUN is assigned to. To see the disks
servicing the Thin LUN and its Pool,
you have to select the Storage Pool
TAB >

There are no specific
instructions on what to
investigate here, although if
you have time, compare the disks
within the Pool and how
those align with the Thin
LUN characteristics.

<Select the Storage Pool TAB>


<In the view, you can expand the Pool
to see the disks servicing the total
Pool load. You cannot see the private
LUNs that are hidden in the Pool>
End of exercises


Supplemental notes
The following operations are executed at the disk level to provide data integrity features associated with
redundant RAID types as well as consistency of data stripes that could be at risk due to media issues.
Background zero: Before user data can be written to the physical disks within a LUN, the area has to
undergo a zero operation. New disks are initially supplied in a zeroed state where data can be written to the
disks immediately after binding LUNs; however, if the disks have been used before, i.e. bound and
unbound, they have to be re-zeroed.
You can zero the disks using a naviseccli command in readiness for grouping and binding LUNs later on,
or the array will zero the disks when you create new LUNs on them. This zero operation results in
512KB SCSI write-same commands to the disks in a sequential manner, unless the array has to zero on
demand an area the user is writing to that is in the queue but hasn't been zeroed yet. There is some
other small activity on the disks during zeroing as checkpoint operations keep track of progress.
Typically, with no access to the LUNs, any zeroing will complete in a matter of a few hours, although a
busy array and activity to the disks being zeroed will delay completion. Also, the 512KB write-same
command will not consume back-end bandwidth but will affect disk load and utilization.
Background verify: This operation validates the consistency of data protection at the disk level and is
automatically performed on newly created LUNs. The IO profile at the disk level is 64KB reads and, like
zeroing, it can take hours to complete and is also governed by array and disk activity.
Background zero, zero-on-demand, and background verify operations exhibit relatively large IO sizes
that can affect one's analysis of the array. Also, if considering user testing, it's worth noting these
operations may affect the performance the array can present, due to the parallel action of user data
access and these preliminary operations.
Also be aware these operations run in a sequential manner for any given raid group (RG), e.g. if you bind
5 LUNs, 0 through 4, on an RG, LUN 0 will start to zero and when complete will perform a background
verify. This is followed by the second LUN in that RG. Each LUN will zero then verify until all newly
created LUNs complete that process. Thereafter the only regular IO you will see at the disk level due to
internal operations will be SNiiFFER, where you will see approximately 1 IO per second at 512KB in size
to each disk in an RG. SNiiFFER is a data checking operation that cycles through every block in every
LUN in the array to ensure data availability, even for data you might not have touched for months or years.
Any data inconsistency detected through SNiiFFER will automatically invoke recovery and remapping of
affected blocks. RGs will run through zero, verify and SNiiFFER operations independently of each other.
Zeroing will have the most effect on performance, so consider this when testing. Verify may have a small
effect and SNiiFFER will have a negligible effect on performance.
Always check disk stats to see what IO sizes are taking place at that level. With an RG idle, disk activity
showing 512KB writes indicates zeroing, 64KB reads indicate verifying and 512KB reads indicate sniffing.
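Those IO-size signatures lend themselves to a simple rule of thumb you could encode when eyeballing disk statistics on an otherwise idle raid group. The mapping below comes directly from the sizes described above and is purely illustrative:

# Rule-of-thumb mapping of disk IO signatures on an idle RG to background ops.
def likely_background_op(io_type, io_size_kb):
    if io_type == "write" and io_size_kb == 512:
        return "background zeroing"
    if io_type == "read" and io_size_kb == 64:
        return "background verify"
    if io_type == "read" and io_size_kb == 512:
        return "SNiiFFER (sniff verify)"
    return "unknown / host-generated"

print(likely_background_op("write", 512))   # background zeroing
print(likely_background_op("read", 64))     # background verify
print(likely_background_op("read", 512))    # SNiiFFER (sniff verify)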

{end}



Worksheet - use as needed during exercises.
LUN ID                      50          51
Owner SP
SP Utilization
LUN Read IOPs
LUN Read size
LUN Write IOPs
LUN Write size
LUN Read MB/s
LUN Write MB/s
LUN response time
LUN Queue
Disk Read IOPs
Disk Read size
Disk Write IOPs
Disk Write size
Disk Queue
Average disk seek
Disk response time

LUN ID                      ____        ____
Owner SP
SP Utilization
LUN Read IOPs
LUN Read size
LUN Write IOPs
LUN Write size
LUN Read MB/s
LUN Write MB/s
LUN response time
LUN Queue
Disk Read IOPs
Disk Read size
Disk Write IOPs
Disk Write size
Disk Queue
Average disk seek
Disk response time
