Navisphere Analyzer Hands On Script EMC World 2010
Navisphere Analyzer
Purpose of the script: EMC Navisphere Analyzer allows you to view storage system performance statistics in various
types of charts. These charts can help you find and anticipate bottlenecks in the disk storage component of a computer
system. Today's session takes a look at the Navisphere Analyzer user interface. In particular, it covers the
different views available and provides some basic starting points for checking whether your existing configuration is being
stressed or working in a well-utilized manner. This script is designed for use with Navisphere Manager 6.x software.
Please do not alter any of the workstation or CLARiiON Storage System configuration details unless instructed
to do so by these instructions or by a member of the EMC presentation team.
In this session there are primarily two exercises, covering archive retrieval and viewing, and on-array real-time analysis.
In addition to the instructions for these exercises, you will find more exercises and reference material in this handout.
If you have time to do so, please explore those additional sections during this session.
What you do
<Select Cancel>
If running pre-version 24 array code, the logging feature operated differently, so you will have to manually enable statistics logging at the SP level.
The newest file listed could be up to 5.5 hours old, so you may need to use Create New to force the logger to create a new archive containing recent statistical data from its buffer.
You have the option of retrieving archives from SP-A or SP-B. Although they should contain almost identical data, it is worth retrieving from both SPs in case there's any problem with viewing one of the files, or in case one SP was rebooted during an archive and is missing samples for that time.
What you do
Script
We could use the array to view the archive; however, usual practice is to view archives independently of having an array resource available.
Note: this is only required for off-array management and offline Analyzer archive file viewing.
What you do
Script
Customize is only required once for the off-array environment. Customize is also available for the array environment, so when you set an array option, it will remain set for anyone logging into the array to view real-time data (covered in the next exercise).
Normally we might suggest a threshold of 10 samples, but for this exercise we'll use 4 for the off-array archive file we're looking at.
What you do
Script
It is not recommended to merge SP-A and SP-B .nar files together, as they should contain very similar data anyway.
What you do
Script
Under the Cache tab you may see the read and write cache enabled boxes empty; this can indicate cache wasn't enabled for the LUN, but in this instance it was. Always check the current code release notes for known issues with the interface.
As you will see, some of the suspected RED LUNs shown in the Performance Summary view are sharing disks in the second half of the test.
What you do
Script
Don't forget to check you have write cache enabled (LUN and array).
The disks at varying times are working very hard; we can see them reaching over 350 IOPs per disk. For small random IO we have a rule of thumb (ROT) stating a 15K rpm disk can sustain 180 IOPs of mixed random load with good response time. When running disks at higher loads we can expect an impact on the observed response time.
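To make the rule of thumb concrete, here is a minimal sketch (not an EMC tool) that compares an observed per-disk IOPs figure against the 180 IOPs ROT quoted above; the 350 IOPs value is the peak mentioned in this exercise.

```python
# Compare observed per-disk IOPs against the 180 IOPs rule of thumb
# (ROT) for a 15K rpm disk under mixed random load with good response
# time. Illustrative arithmetic only.

ROT_IOPS_15K = 180  # ROT ceiling quoted in the handout

def disk_load_ratio(observed_iops, rot_iops=ROT_IOPS_15K):
    """Return observed load as a multiple of the rule-of-thumb ceiling."""
    return observed_iops / rot_iops

# Disks in this exercise peak at over 350 IOPs each:
ratio = disk_load_ratio(350)
print(f"Load is {ratio:.2f}x the ROT ceiling")  # about 1.94x: expect elevated response times
```

Anything much above 1.0x suggests the disks are being driven past the point where good response times can be expected.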
What you do
Script
The disk service time is how long the disk takes to service each request. If IO gets queued at the disk, the disk response time becomes roughly the service time multiplied by the queue depth; therefore, the higher the queue, the longer the response time.
A point to remember, though, is that writes will typically be serviced by cache and have a very fast response time. Read response times will be more directly impacted as disk queues increase. Of course, this all depends on IO size, on how writes are being de-staged from cache, and on how efficiently cache is optimizing that process.
You can see the Response Time follows the average busy queue depth, as the service time was fairly stable.
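The relationship described above can be sketched in a couple of lines; the timings below are made-up examples, not values from this lab.

```python
# Illustrative only: the handout states disk response time is roughly
# the service time multiplied by the queue depth, because each queued
# request waits for those ahead of it before being serviced.

def estimated_response_time_ms(service_time_ms, queue_depth):
    # Response time grows linearly with the queue while the per-IO
    # service time stays stable.
    return service_time_ms * queue_depth

print(estimated_response_time_ms(5.0, 1))  # 5.0 ms: no queuing
print(estimated_response_time_ms(5.0, 4))  # 20.0 ms: same disk, deeper queue
```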
<Uncheck all objects>
Now select the SP tab, click on SP-A and select Total Throughput.
You can see that as the load increases, so does the overall throughput of the array, i.e. more threads per test, higher IOPs; more LUNs tested, more IOPs.
The second set of tests, where you see the IOPs starting around 270 and increasing steadily to 1286, is where both SPs are being used, so the aggregate IOPs would be higher still.
Those peaks seen at the start of tests are normally writes being
absorbed into the protected write cache until watermark processing
starts to write data to disks on the back end.
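The watermark behaviour mentioned above can be sketched as a simple hysteresis rule: flushing starts when dirty pages cross a high watermark and stops once they fall back below a low watermark. The watermark percentages here are hypothetical placeholders, not the array's actual settings.

```python
# Hedged sketch of watermark processing: writes are absorbed into
# protected write cache until dirty pages cross the high watermark,
# then the SP de-stages data to the back-end disks until the low
# watermark is reached. Values are examples only.

HIGH_WATERMARK = 80  # % dirty pages, hypothetical setting
LOW_WATERMARK = 60   # % dirty pages, hypothetical setting

def should_flush(dirty_pct, currently_flushing):
    """Decide whether the cache should be de-staging to disk."""
    if dirty_pct >= HIGH_WATERMARK:
        return True            # crossed high watermark: start flushing
    if dirty_pct <= LOW_WATERMARK:
        return False           # back below low watermark: stop
    return currently_flushing  # between watermarks: keep current state

print(should_flush(85, False))  # True
print(should_flush(70, True))   # True (hysteresis keeps flushing)
print(should_flush(55, True))   # False
```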
What you do
Script
The other small peaks seen here are associated with disk statistics and the SNiiFFER process when there is no host load accessing the disks. This is a process that validates data availability in the background (performing 512KB read operations at 1 IO/s). Take a look at disk read size and you will see.
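From the rates above (512KB reads at roughly 1 IO/s per disk) you can estimate how long one SNiiFFER pass over a disk takes; the 300 GB disk size below is an assumption for illustration.

```python
# Back-of-envelope estimate of SNiiFFER scan time per disk, using the
# rates given in the text: 512KB reads at about 1 IO/s per disk.

def sniffer_days_per_disk(disk_gb, io_kb=512, ios_per_sec=1):
    bytes_per_day = io_kb * 1024 * ios_per_sec * 86400
    return disk_gb * 1024**3 / bytes_per_day

# A hypothetical 300 GB disk:
print(f"{sniffer_days_per_disk(300):.1f} days")  # roughly a week per pass
```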
What you do
Script
Summary of Exercise A
We got you to look at LUN 50 as it was used in all 12 tests. In the first test area we get moderate throughput at the disks, but when we increase the threads accessing the same disks, we get much more work from them. The detrimental effect of driving a higher load is an increase in response time to the application due to the increased queuing at the disks (go back and take a look if you have time).
Now, as we add more load to more disks, the same effect can be seen between the single-thread tests and the 4-thread tests, i.e. per-disk IOPs is higher if we have more processes accessing the LUNs. In all tests where we have a single thread per LUN, the per-disk IOPs is low compared to the 4-thread tests. This highlights some key performance notes:
Concurrency, i.e. multiple threads/processes, is essential for good performance when using small IO sizes.
Also, as we observed in the SP statistics, overall array performance scales with how many LUNs are being accessed concurrently, so it was clear to see that as more LUNs were busy, the overall throughput increased.
Here we see the 6 distinct areas of the first 6 tests performed on one Storage Processor. Each lower level shows the single-thread performance for 1, 2 & 3 LUNs. The higher peaks represent throughput when 4 concurrent threads per LUN are generating IO.
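The pattern in this summary, where more concurrency yields more IOPs at the cost of response time, is Little's law in action: throughput equals outstanding concurrency divided by response time. The numbers below are illustrative, not measurements from this lab.

```python
# Little's law ties the observations together: IOPs = concurrency /
# response time. More outstanding IOs can raise throughput even as
# per-IO response time grows.

def iops(concurrency, response_time_s):
    return concurrency / response_time_s

print(round(iops(1, 0.005)))   # ~200 IOPs from one thread at 5 ms
print(round(iops(4, 0.010)))   # ~400 IOPs from four threads, even at 10 ms
```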
What you do
Script
Exercise B: Analyzer Statistics, Real-Time Viewing
This exercise directs you around some of the views in Analyzer while the array is under a
simulated load from a Windows server. You'll be directed to look at some of the key statistics
that indicate whether a system is functioning within acceptable parameters. The exercise is intended
to extend your experience and expand upon the descriptions of the statistics you are looking at.
Select and deselect components and statistics to overlay graphs, but consider the scale of your
selections: with high IO/s on the same graph as disk queue length, queue variation will not be easy
to distinguish. If you have time, you'll be directed to look at some specific statistics in order to
determine where there is a problem with the current load on the array.
What you do
Start Internet Explorer. <Enter the IP address of your assigned managed node (SP IP address) into the address window of the browser>
Script
Repeat the first steps from Exercise A.
The Performance Survey view will start to plot current statistics based on a 60-second sample period. Please wait until you have at least two plots to continue, i.e. wait at least 2 minutes for the plots to show.
What you do
Script
Exercise A gave you familiarity with the interface; through this exercise we'll expand upon what some of the statistics mean.
Dirty Pages
Dirty pages are protected write cache data that hasn't been committed to disk yet.
To see the value selections available in the lower left part of the detail view, you must select a component item in the upper left part of the view. Dirty Pages will only be an available option when you have clicked on a Storage Processor (SP).
Dirty pages that peak at 99% indicate cache saturation, resulting in forced flushing that can hurt performance. We'll look at LUN force flushes later in this exercise.
To help when viewing a graph plot, you can click on a legend item in the lower right window pane and it will highlight that statistic in the graph. You can also customize the graph views by right-clicking on the graph and selecting Chart Configuration.
LUN Bandwidth
Remember to uncheck previously viewed selections to change the graph scale, unless you need to see how one statistic plots against another.
Looking at the average LUN IO size for read or write in the detail view can be misleading: it is an average, and a low write IO rate will not be accurately shown. You really need to use the IO Distribution Summary for the LUN to see the IO distribution.
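A small made-up example shows why the average alone misleads: a handful of large IOs can pull the mean to a value that matches neither population, while a distribution view makes the mix obvious.

```python
# Why average IO size misleads: a few large reads dominate the mean
# even when most IOs are small. Sample values are invented.
from collections import Counter

io_sizes_kb = [4] * 95 + [512] * 5   # 95 small IOs, 5 large ones

average = sum(io_sizes_kb) / len(io_sizes_kb)
print(f"average IO size: {average:.1f} KB")  # 29.4 KB, matching neither population

# An IO distribution (histogram) shows the real picture:
print(Counter(io_sizes_kb))  # Counter({4: 95, 512: 5})
```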
These two exercises have explored the options and views available to you, with some explanation and guidance on what the statistics mean and how they help in characterizing your IO.
The following section, should you have time to look at it, will guide you to look at the loads on a specific set of LUNs sharing the same set of disks.
You have explored the views; now the task is to analyze a specific area where we have an issue.
Now, please explore the interface and look at the following attributes for this load on the array, with a focus on RAID Group 0, LUNs 50 & 51. Look at the following for each of these LUNs and see if you can draw any conclusions (make notes in the worksheet table at the back of this handout):
Look at the profiles for these two LUNs. How are they different and what would be a suggestion for improving performance?
If the IO is sequential, are we seeing pre-fetching and a high prefetch-used rate? We do need to know what the application is trying to do, then correlate that with the characterized IO on the storage system to see if it is doing what we think it should be doing.
Hint: one of them is doing large sequential reads; it could be a video or data warehousing application.
So, what about the other LUNs that showed up as red and/or yellow boxes? As explained during the overview, the red/yellow boxes give an indication of where to investigate and are not indicative of an absolute problem.
Scope will be 1 if the account details used are local and not global.
Do not do this here, but you can reset the statistical data by using the following command if you are looking to collect data for a specific test period only and are not interested in previously collected data:
naviseccli -user <username> -password <pwd> -scope <0|1> -address <SP IP> analyzer -logging -reset
The username and password can be omitted if you have set up the security file for naviseccli.
The username used here does not have privileges to reset data logging
on the arrays being used.
Note: the desktop shortcut used here is not created for you during installation. You have to create it yourself if you want that shortcut available on your own systems.
Prior to release 24 you would need to use the Java archiveretrieve command to get the archive from the array:
java -jar archiveretrieve.jar -User <username> -Password <password> -Scope 0 -Address <array IP> -File archive_emc.nar -Location "C:\program files\emc\Navisphere cli" -Overwrite 1 -Retry 2
Please note that each device selected in the dump command, such as SPs and LUNs, will be listed down the left column, so selecting an entire column to plot would actually plot all SP stats followed by all LUN stats, and so on. You would need to be more selective and manipulate the data to plot graphs in a logical manner.
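One way to be selective is to filter the dumped rows by object name before plotting. The CSV layout below is an invented example to illustrate the idea, not the exact dump format produced by the tool.

```python
# Sketch of post-processing a dumped archive CSV: because every
# selected object (SPs, LUNs, ...) is listed down the left column,
# filter rows by object name before plotting. Column names here are
# assumptions for illustration.
import csv
import io

sample = """Object Name,Utilization (%),Total Throughput (IO/s)
SP A,45,1200
SP B,40,1100
LUN 50,30,600
LUN 51,25,500
"""

rows = list(csv.DictReader(io.StringIO(sample)))
lun_rows = [r for r in rows if r["Object Name"].startswith("LUN")]
print([r["Object Name"] for r in lun_rows])  # ['LUN 50', 'LUN 51']
```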
Supplemental notes
The following operations are executed at the disk level to provide data integrity features associated with
redundant RAID types as well as consistency of data stripes that could be at risk due to media issues.
Background zero: Before user data can be written to the physical disks within a LUN, the area has to undergo a zero operation. New disks are initially supplied in a zeroed state where data can be written to the disks immediately after binding LUNs; however, if the disks have been used before, i.e. bound and unbound, they have to be re-zeroed.
You can zero the disks using a naviseccli command in readiness for grouping and binding LUNs later on, or the array will zero the disks when you create new LUNs on them. This zero operation results in 512KB SCSI write-same commands to the disks in a sequential manner, unless the array has to zero-on-demand an area the user is writing to that is in the queue but hasn't been zeroed yet. There is some other small activity on the disks during zeroing, as checkpoint operations keep track of progress. Typically, with no access to the LUNs, any zeroing will complete in a matter of a few hours, although a busy array and activity to the disks being zeroed will delay completion. Also, the 512KB write-same command will not consume back-end bandwidth but will affect disk load and utilization.
Background verify: This operation validates the consistency of data protection at the disk level and is automatically performed on newly created LUNs. The IO profile at the disk level is 64KB reads and, like zeroing, it can take hours to complete and is also governed by array and disk activity.
Background zero, zero-on-demand, and background verify operations exhibit relatively large IO sizes that can affect one's analysis of the array. Also, if considering user testing, it's worth noting these operations may affect the performance the array can deliver, due to the parallel action of user data access and these preliminary operations.
Also be aware these operations run in a sequential manner for any given RAID group (RG); e.g. if you bind 5 LUNs (0 through 4) on a RG, LUN 0 will start to zero and, when complete, will perform a background verify. This is followed by the second LUN in that RG. Each LUN will zero then verify until all newly created LUNs complete that process. Thereafter, the only regular IO you will see at the disk level due to internal operations will be SNiiFFER, where you will see approximately 1 IO per second, 512KB in size, to each disk in a RG. SNiiFFER is a data-checking operation that cycles through every block in every LUN in the array to ensure data availability, even for data you might not have touched for months or years. Any data inconsistency detected through SNiiFFER will automatically invoke recovery and remap of affected blocks. RGs will run through zero, verify and SNiiFFER operations independently of each other. Zeroing will have the most effect on performance, so consider this when testing. Verify may have a small effect and SNiiFFER will have a negligible effect on performance.
Always check disk stats to see what IO sizes are taking place at that level. With a RG idle, disk activity showing 512KB writes indicates zeroing, 64KB reads indicate verifying, and 512KB reads indicate sniffing.
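The closing paragraph amounts to a small signature table for idle RAID groups; it can be sketched as a lookup (sizes in KB, direction as 'r'/'w'). The function name and encoding are illustrative, not an EMC utility.

```python
# Classify idle-RG disk activity from its IO signature, per the
# supplemental notes: 512KB writes = zeroing, 64KB reads = verify,
# 512KB reads = SNiiFFER.

def classify_idle_disk_io(io_size_kb, direction):
    signatures = {
        (512, "w"): "background zeroing",
        (64, "r"): "background verify",
        (512, "r"): "SNiiFFER",
    }
    return signatures.get((io_size_kb, direction), "unknown / host IO")

print(classify_idle_disk_io(512, "w"))  # background zeroing
print(classify_idle_disk_io(64, "r"))   # background verify
print(classify_idle_disk_io(512, "r"))  # SNiiFFER
```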
{end}
Worksheet

LUN ID              | 50 | 51
--------------------|----|----
Owner SP            |    |
SP Utilization      |    |
LUN Read IOPs       |    |
LUN Read size       |    |
LUN Write IOPs      |    |
LUN Write size      |    |
LUN Read MB/s       |    |
LUN Write MB/s      |    |
LUN response time   |    |
LUN Queue           |    |
Disk Read IOPs      |    |
Disk Read size      |    |
Disk Write IOPs     |    |
Disk Write size     |    |
Disk Queue          |    |
Average disk seek   |    |
Disk response time  |    |