Mainframe zOS CPO - Hints and Tips-3
Natabar Sahoo
natabarss@yahoo.com
V01
April 2023
(This book is dedicated to my friend and younger brother, ‘Late Harish Malu’.
He was the smartest and most brilliant technician I have worked with in my
life, and he lost his life to Covid-19.)
Disclaimer
All the details described in this book are the author’s personal opinions and
experiences. IBM manuals, Redbooks, and many other presentations and articles
on the internet have been my sources of information. I have included some
extracts from IBM manuals and other sources for easy reference, and I have tried
to cite the source from which the information was collected. Information such as
the Rules of Thumb (RoT) was compiled from many sources and may have
changed over time. You may use this book as a reference, but you need to
develop your own process, techniques, and approach for performing your
Capacity, Performance and Optimization (CPO) function and, most importantly,
use the latest information for your mainframe configuration.
Acknowledgements
I would sincerely like to thank the following people for taking the time to review
this book and provide their valuable feedback.
Sushanta Dash, who currently leads the CPO team at Kyndryl, India. He has very
rich experience in performance tuning of databases.
Satish Nagarajan, who has rich experience as a zOS systems programmer and
is currently leading the CPO team in a large mainframe installation.
My daughter Parama Sahoo, a doctor with an MBA, who has helped me in editing
this book. She has experience in performing capacity planning and post-merger
systems integration in hospitals.
Contents
Disclaimer ............................................................................................................ 1
Comments and Feedback ................................................................................... 1
Acknowledgements ............................................................................................. 2
Introduction .......................................................................................................... 8
Capacity, Performance and Optimization (CPO) function – strategic view ........ 13
CPO as a culture in the organization ............................................................. 13
Existence of mainframe zOS CPO team ........................................................ 14
Direction from Senior Management Team (SMT) .......................................... 14
Data analytics - tools for data analysis........................................................... 15
Data analysis automation ............................................................................... 15
Real time performance monitoring ................................................................. 15
Service Level Agreement (SLA) ..................................................................... 16
Few basics and terminologies ........................................................................... 17
zOS CPO ....................................................................................................... 17
Understanding mainframe MIPS & CPU sec – Layman view ......................... 18
MSU-hour ....................................................................................................... 18
Rolling 4 Hour Average (R4HA) MSU – A short description ........................... 19
Introduction ................................................................................................. 19
MIPS ........................................................................................................... 20
CPU Second ............................................................................................... 20
Service Unit ................................................................................................ 20
MSU (Million Service Units per hour).......................................................... 21
Understanding MSU (Hardware and Software) .......................................... 21
Computation of Rolling 4 Hour Average (R4HA) MSU ............................... 22
An example of R4HA computation for one LPAR ....................................... 23
Mainframe Processors ................................................................................... 24
Mainframe Applications .................................................................................. 24
The Program and MIPS .............................................................................. 24
Creating Optimized and Efficient Programs ............................................... 25
Application optimization – An example of DD DUMMY ............................... 27
HiperDispatch ................................................................................................ 28
Relative Nest Intensity (RNI) .......................................................................... 34
LSPR workload categories ......................................................................... 34
Hints and Tips ................................................................................................... 40
Cost - The biggest challenge and my simple solution .................................... 40
Manage hardware ....................................................................................... 41
Optimize Software suite .............................................................................. 42
Get maximum CPU cycles out of the hardware .......................................... 42
Reduce CPU wastage ................................................................................ 43
Optimize, Optimize, and Optimize .............................................................. 44
Nothing is right or wrong ............................................................................. 46
CPO Team ..................................................................................................... 47
CPO Objective ............................................................................................ 47
CPO Benefits .............................................................................................. 48
Capacity Management ................................................................................... 49
Short-term capacity forecast ....................................................................... 50
Long-term forecast ...................................................................................... 50
Ad hoc forecast ........................................................................................... 50
Capacity provisioning ................................................................................. 50
Resources covered under capacity plan..................................................... 51
Capacity Reporting ..................................................................................... 52
Performance Management ............................................................................ 52
Capacity Optimization .................................................................................... 53
Objective ..................................................................................................... 54
Optimization Tasks ..................................................................................... 54
Workload..................................................................................................... 54
Input ............................................................................................................ 55
Savings ....................................................................................................... 55
Optimization outcome ................................................................................. 55
Savings calculation example ...................................................................... 56
Computation of cost .................................................................................... 56
CPO data analysis – A systematic approach .................................................. 58
Workload utilization bucketing .................................................................... 58
Workload classification for easy Chargeback ............................................. 59
Metrics for CPO analysis ............................................................................ 59
CEC CPU Analysis ..................................................................................... 61
LPAR CPU analysis .................................................................................... 61
Workload CPU Analysis.............................................................................. 62
Address space CPU Analysis ..................................................................... 62
Analysis approach ...................................................................................... 63
Tricks of performing data analysis .............................................................. 63
Major analysis outcome .............................................................................. 68
Tools and data source ....................................................................................... 69
SMF................................................................................................................ 69
What is SMF ............................................................................................... 69
Major Uses .................................................................................................. 69
Who Needs the Data .................................................................................. 70
Reporting .................................................................................................... 71
Sources of CPU Information in SMF ........................................................... 71
Data Collection ........................................................................................... 72
Presentation................................................................................................ 72
Tools and products processing SMF data .................................................. 73
RMF Data and SMF Records ..................................................................... 75
RMF ............................................................................................................... 76
WLM ............................................................................................................... 77
Important Source of Information for CPO....................................................... 82
Post processor ‘Summary’ Report .............................................................. 82
RMFMON III ‘CPC’ Panel ........................................................................... 84
RMFMON III ‘SYSINFO’ Panel ................................................................... 85
RMFMON III ‘PROCU’ Panel ...................................................................... 86
Picture source: RMF Report Analysis SC34-2665-50 page 148 ....................... 86
SDSF ‘JM’ line command ........................................................................... 87
Postprocessor ‘WLM Activity’ report ........................................................... 88
RMFMON III Enclave panel ........................................................................ 89
Others ......................................................................................................... 89
Create my own database for critical data ................................................... 90
Capacity monitoring ........................................................................................... 93
Real Time Performance (RTP)....................................................................... 93
Data in Real Time ....................................................................................... 93
SMF Record 99 and 113 ........................................................................... 103
Systems Configuration .................................................................................... 110
Manage resource configuration ................................................................... 110
CPO Health Check ....................................................................................... 112
Rule of Thumbs (RoT).................................................................................. 112
CPO Process Document – a sample ................................................................ 115
Name, contents, and change history............................................................ 115
Introduction .................................................................................................. 115
Definitions and terminologies ....................................................................... 116
Objective ...................................................................................................... 116
Mainframe environments ............................................................................. 116
CPO activities .............................................................................................. 116
Capacity and performance data collection ................................................... 117
Capacity and performance data processing ................................................ 117
Capacity and performance monitoring tools ................................................ 117
Capacity forecast and provisioning .............................................................. 117
Capacity provisioning ................................................................................... 117
Resource capacity considered under capacity forecast ............................... 118
Performance management .......................................................................... 118
Capacity and performance analysis ............................................................. 119
Track and manage CPU capacity growth..................................................... 119
Track and manage CPU resources (MSU-hour) .......................................... 120
Optimization ................................................................................................. 121
Capacity tools............................................................................................... 121
Capacity and performance reporting ............................................................ 121
Incidents and problems ................................................................................ 122
Projects and initiatives ................................................................................. 123
Operational activities – on demand............................................................... 123
Conclusion ....................................................................................................... 124
Introduction
In this book, I have documented some ‘Hints and Tips’ on ‘Mainframe zOS
Capacity, Performance and Optimization (CPO)’ based on my personal
experience gained through working in this area for a long time. I will not describe
many technical details, as you will find plenty of reading material from different
sources such as IBM manuals, IBM Redbooks, non-IBM product manuals, SHARE
presentations, Cheryl Watson’s Tuning Letter, and many articles published on
the internet. But real-life experience is rarely shared and vanishes after an expert
stops working and loses regular contact and interaction with the community,
friends, and colleagues. Also, knowledge that is not used regularly does not get
updated with the most current information and very often fades away over time
amid life’s other priorities. Common reasons why experience is not shared include:
• Understaffing
• No inspiration or motivation to do so
• Lack of ownership
• Fear of sharing knowledge because someone may take away your job
I do not blame anybody for this. This is the general culture in the industry now.
However, when in an organization, I always consider two major aspects of my
work: the first and most imperative is that I work for myself, and the second is that
I work for an organization. And so, I have responsibilities towards both these
aspects. When I work for myself, I keep my career growth in my mind and
someday I may leave this organization and join another. However, I have a
specific role in this organization and the organization should not suffer when I
leave. So, it should be a part of my personal responsibilities to document what I
do and most importantly consciously build my successor from the first day that I
am in the organization. Even if you do not leave the organization, one day or
another you will stop working (so called ‘retirement’ from your job) and at that
time, the organization should be able to run seamlessly. Of course, a person
contributes to the organization in multiple ways and it is virtually impossible to
clone all of these contributions (especially the human factors), but at least the
normal activities should not have a gap.
Documenting one’s experiences is not an easy task, as the experience gained over
many years of work is vast and would fill many books. The challenge lies in
knowing what to document and what not to. So, I call this book ‘Hints and Tips’,
as it is very generic and can be taken as a guideline and reference to build on
while working. To me, different professionals work differently, using different
techniques on the same issue. Also, the same professional can use different
techniques on the same issue in different contexts. Therefore, you have to build
your own style of working, keeping speed, efficiency, and accuracy in view, while
developing the skill of multi-tasking.
This book may not describe information that is completely new to you. However,
this is a compilation of my experiences that I believe will be essential to anyone
in these roles. I have practically applied these experiences, and no matter how
large or small the benefits may be — to me, all of them are important and
integrated together to meet the larger goal of the CPO function.
I have worked in the IT field for 40 years, including 33 years as a zOS systems
programmer. In my mainframe career, I have worked on CPO for nearly 14 years.
When I started working on the mainframe in 1990, it was a 3090-120S machine
with a simple configuration: one processor of ~10 MIPS, 32MB of central storage,
and 25GB of DASD storage. At that time, I spent quite a bit of my time tuning the
LOGON proc and working with application developers on tuning DASD storage
space allocation. Thus, CPO activity started at the very beginning of my
mainframe career and has remained a focus throughout. I very genuinely
love CPO and always enjoy exploring and finding opportunities to optimize
anything and everything that comes to my mind.
In mainframe zOS, you will mostly hear about ‘Capacity and Performance
Management (CPM)’. In most organizations, CPM is performed either by a
dedicated CPM team or as an integrated function of the zOS systems
programmers. However, I have added ‘Optimization’ to this function because it is
an integral part of CPM, and with this I change the function name from ‘CPM’ to
‘CPO’. In general, optimization receives the least focus due to lack of skill,
expertise, time, and priority. But I suggest that optimization should be considered
an integral function of the mainframe infrastructure, and I am sure that the
mainframe organization will realize a real benefit from it. In my experience, we
dedicated two people to performing optimization, and the cost of these two
resources was self-financed by the $$ savings realized by the organization
through the optimization they performed. Most importantly, we delayed or
avoided planned upgrades needed to support the normal year-to-year growth.
One thing has always surprised me, and I have never found a realistic answer to
it: in 1990, a single-processor machine with 10 MIPS was able to run
MVS/ESA V3/V4 with 100+ developers concurrently developing CICS/DB2
applications. Why, then, do I need a minimum of 50 to 70 MIPS today just to keep
the operating system active without any workload? What also surprises me is that,
if an organization had ‘M’ MIPS, how has the demand grown to ‘1.nM’
in just 5 years when the growth in business has been relatively minimal?
In my experience, the increase in CPU demand is everywhere, owing to:
• Data warehousing
• Development of complex DB2 SQLs
• And so on.
Whenever we talk about the mainframe, the one obvious thing that comes to the
forefront is ‘Cost’. We can debate the “cost” of the mainframe for hours without
reaching a conclusion. However, I undoubtedly want to stress the cost we pay
to IBM and other vendors (Green $$) and the cost charged back to business
entities inside the organization (Blue $$).
A large portion of mainframe cost goes towards the IBM Monthly License Charge
(MLC). IBM has different techniques for charging customers. The newest
methodology is ‘Tailored Fit Pricing (TFP)’, which is based on CPU
consumption measured in MSU-hours burnt over a month. Therefore, most of my
‘Hints and Tips’ will be directed towards saving CPU consumption in order to help
the organization run and manage mainframe zOS cost-effectively. My focus will
be on ‘zOS’ only, though the approach can be extended to other areas.
actions to control it, because we at least have one to two hours of time to
act on it.
Capacity, Performance and Optimization
(CPO) function – strategic view
As I have observed, CPO was an area of least focus for many organizations until
the early 2000s. Since then, most organizations have implemented this function
by creating a dedicated team or delegating it to respective SMEs. Current CPO
teams mostly focus on resolving issues related to capacity and performance, and
on performing capacity planning. However, they very rarely deep-dive to find the
root cause of capacity and performance issues and work towards a sustainable
solution that helps the organization run its mainframe installations effectively
and efficiently. Moreover, initiative to focus on ‘Optimization’ is almost zero.
The CPO is not only the responsibility of the zOS technical infrastructure teams
such as zOS systems programmers or the CPO team. It is a shared responsibility
of all teams under the mainframe organization including the senior management
team (SMT), mainframe development community, zOS infrastructure subject
matter experts (SMEs), batch planners and the zOS operations team (Console
and Batch). Most importantly, the business community must collaborate by
providing their mainframe infrastructure requirements well in advance (wherever
possible) and challenge the mainframe infrastructure and development teams to
deliver the best optimized solution to meet their demands.
There should be a dedicated CPO team directly under the zOS infrastructure. The
main responsibility of this team will be to analyse the zOS resource utilization
(primarily CPU) data and provide recommendations for ‘Capacity’, ‘Performance’
and ‘Optimization’. Execution and deployment of the recommendations takes an
enormous amount of time, so this should be a responsibility of the infrastructure
SMEs (Subject Matter Experts), planners, operations team and the development
community. Assuming this de-centralized approach is accepted, I recommend a
maximum of two to three experienced SMEs in the CPO team with strong
analytical and presentation skills. However, the roles and responsibilities of the
CPO functions need to be clearly documented, agreed, and approved under the
mainframe organization.
There are three critical steps that the SMT should employ to integrate CPO as a
culture in the organization:
2. Define a clear goal to develop and deploy Zero defect optimized code
These steps will allow the SMT to show a strong commitment and provide
direction to the CPO function within the organization.
Data analytics - tools for data analysis
To make CPO resource utilization analysis effective, we need to make the most
of the information in the SMF data. But getting the required information out of
SMF records is not easy; there should be tools that process the SMF data to
make it handy for analysis. These tools can either be developed in-house, or you
can choose one available in the market. The right data should be available at the
right time to perform the right analysis and take the right decision.
The interfaces available for real time performance monitoring and analysis are
very limited. Using IBM products like SDSF, RMFMON II, RMFMON III,
OMEGAMON or any other non-IBM products, the navigation and correlation of
information is not so user-friendly. To find simple information to address an issue,
you may have to navigate through many screens under multiple products.
Furthermore, without historical data, it is difficult to infer whether a behavior is
normal or abnormal, so we mostly depend on the post-processor data for
in-depth analysis. However, if we can find a way to collect the required data for
real-time monitoring, compile it centrally, and present it simply so that anomalies
are easy to detect, then we will be able to catch a lot of issues proactively. I will
write about RTP in more detail in a later chapter.
In many organizations, due to the assumption that support will only be needed by
internal customers, there are often no SLAs, or very few SLAs that are often
ignored during workload processing. However, even if your only customers are
internal, it is important to establish detailed SLAs at various levels. These
indicators are critical to measure all functions under the mainframe
infrastructure and will help you provide support at the agreed level. For
example:
SLA management provides a great number of inputs to the CPO team for capacity
planning, performance management, and creating optimization initiatives.
Few basics and terminologies
zOS CPO
Understanding mainframe MIPS & CPU sec – Layman view
Then, what is this CPU second? The CPU second is tied to a processor. For
example, if sysplex1 has 20 GP processors, this gives sysplex1 a total of 20 CPU
seconds in every second of operating time. This is where we do multiprocessing:
if there is demand, sysplex1 can process 20 tasks concurrently at any point in
time. Aside from this, we have other specialty engines like zIIP, IFL, etc., which
also process a similar, or larger, number of instructions per processor every
second by executing the applicable workload.
MSU-hour
IBM Sub-Capacity Reporting Tool (SCRT) uses the following base calculation to
compute MSU consumption per hour:
• SMF70EDT is the effective dispatch time for the LPAR. If dedicated or ‘wait
complete’ processors are in use, this value may be adjusted as appropriate
Ref: https://www.ibm.com/downloads/cas/YMW2JWP4
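The base formula itself is not reproduced above, but the underlying idea can be sketched in a few lines. The following is a simplified, hypothetical illustration (the function name is mine, it uses the raw hardware SU/sec rating, and it ignores the dedicated/‘wait complete’ adjustments mentioned above): since one MSU is one million service units per hour, the service units dispatched in an hour, divided by one million, give that hour’s MSU figure.

```python
# Simplified sketch of per-hour MSU consumption from dispatch time.
# Hypothetical helper, not the actual SCRT code.

def hourly_msu(dispatch_seconds, su_per_sec):
    """Average MSU consumed over one hour.

    dispatch_seconds -- effective dispatch time (cf. SMF70EDT) summed
                        across the LPAR's GP engines for the hour
    su_per_sec       -- service units per CPU second for the model
    """
    # 1 MSU = 1,000,000 service units per hour, so service units
    # consumed in the hour / 1,000,000 = the hour's MSU figure.
    return dispatch_seconds * su_per_sec / 1_000_000

# One engine fully busy for the whole hour on a model rated at
# 103225.8065 SU/sec yields roughly its 372-MSU hardware rating.
print(round(hourly_msu(3600, 103225.8065), 1))  # 371.6
```

Note how an engine that is 100% busy for an hour simply reproduces the model’s hardware MSU rating; partially busy engines scale the figure down proportionally.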
Rolling 4 Hour Average (R4HA) MSU – A short description
Many organizations still use the R4HA methodology for IBM billing. So, I have
provided a short description to bring some clarity to the R4HA.
Introduction
IBM Monthly License Charge (MLC) software is priced based on peak MSUs
(millions of service units) used per month, not on machine capacity:
• The average MSUs consumed is calculated for all rolling 4-hour
periods within a month,
• And the peak R4HA MSU is used to determine the monthly license
charges.
See below for a brief overview of key CPU measurements and billing terms.
These sections briefly describe some of the CPU measurement metrics used in
the computation of R4HA MSU. However, for your convenience, you may also go
directly to the section ‘Computation of Rolling 4 Hour Average (R4HA) MSU’ on
the next page.
MIPS
MIPS (millions of instructions per second) is probably the most common unit
used when talking about mainframe capacity. When mainframes were still young,
manufacturers could measure MIPS capacity by repeatedly running a small
standard routine. However, MIPS has not been a meaningful measurement for
decades. IBM mainframes have a huge number of instructions; some are simple
and quick, others complicated and slow. For example, an application executing
five million simple instructions will use a lot less CPU than one executing five
million complicated ones. In addition, the number of instructions available
increases with each new mainframe processor type. In summary, MIPS is an
indication of the speed of a processor and is very much workload dependent. IBM
publishes the MIPS rating of its various processor models, but there is no single
MIPS number for any CEC and no tool that reports MIPS numbers.
CPU Second
Service Unit
MSU (Million Service Units per hour)
MSU was derived from service units. MSU measures the rate of CPU usage, but
can also refer to the capacity of a processor model; e.g., a processor with an
MSU rating of 100 can process up to 100 million service units per hour.
The original idea of the MSU was to use it as an indicator of CPC capacity. MSU
is calculated as:
MSU = SU/sec for the model * number of engines * 3600 / 1,000,000
Example: the z15 701 processor model is rated at 103225.8065 SU/sec:
103225.8065 * 1 * 3600 / 1,000,000 = 371.6, or 372 MSUs
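For reference, the rating formula can be checked with a one-line helper (the function name is mine; the SU/sec figure is the z15 701 value quoted above):

```python
def msu_rating(su_per_sec, engines):
    # MSU rating = SU/sec * number of engines * 3600 seconds / 1,000,000
    return su_per_sec * engines * 3600 / 1_000_000

print(round(msu_rating(103225.8065, 1)))  # 372, matching the example
```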
MSU is primarily intended for software licensing. Most vendors scale their
software licensing fees to an MSU rating. IBM’s sub-capacity licensing also uses
MSU when calculating the final bill.
To help customers reduce software licensing charges, IBM began tweaking the
MSU capacity of processors after the introduction of the z990 processor type (in
order to lure users to newer machines). IBM publishes the MSU ratings of the
various processor models based on its internal measurements and adjustments,
and these ratings are used only for software license charging.
When IBM started altering the MSU number as a way of discounting software
cost, it gave rise to two definitions for MSU:
“Hardware” MSU – calculated using the original formula above; this is the basis
for measuring and reporting some mainframe usage.
“Software” MSU – also based on the calculation above, but adjusted depending
on the CPU type and model; it is a fixed value for a specific model. SW MSU is
used as the basis for software charging and LPAR capping and appears in some
CPU measurement reports.
Example: the z15 701 is rated at 253 “Software” MSUs against 372 “Hardware”
MSUs, as calculated above.
At the beginning of each month, the SCRT output report for the previous month
is sent to IBM for billing.
An example of R4HA computation for one LPAR
The following table describes the computation of R4HA MSU for one LPAR.
However, for actual billing, the data for all the CECs in all data centers over the
whole month are used for the computation of overall peak R4HA MSU.
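The rolling computation can also be sketched in a few lines of code. This is a deliberately simplified illustration (names are mine): real SCRT processing works on 5-minute SMF intervals across every LPAR and CEC, whereas here we assume one LPAR with one MSU sample per hour, so each R4HA point averages the current hour and up to three preceding hours.

```python
def r4ha_series(hourly_msu):
    """Rolling 4-hour average MSU for each hourly sample."""
    averages = []
    for i in range(len(hourly_msu)):
        window = hourly_msu[max(0, i - 3): i + 1]  # this hour + up to 3 prior
        averages.append(sum(window) / len(window))
    return averages

samples = [100, 200, 400, 300, 100, 100]   # hourly MSU for one LPAR
r4ha = r4ha_series(samples)
peak = max(r4ha)  # the monthly peak R4HA drives the MLC bill
print(peak)  # 250.0
```

Note how the one-hour spike to 400 MSU is smoothed to a 250-MSU peak by the 4-hour window; this smoothing is exactly why short bursts are cheaper than sustained load under R4HA-based pricing.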
Mainframe Processors
Integrated Facility for Linux (IFL) – This processor executes Linux workload.
Mainframe Applications
3. Fetch the operand
6. Go to next instruction
With this, the processor is in an infinite loop just performing ‘Fetch and Execute’.
Who tells the processor what to execute? The simple answer is ‘our programs’.
We write programs in COBOL, PL/I, Assembler, Java, etc., then compile and
link-edit them to create load modules. When a load module is executed (e.g., as
a batch job or CICS transaction), it calls system services, which are again
programs supplied by the operating system and software products. During
processing, it executes a certain number of instructions, which is accounted for
as the MIPS the program or transaction has consumed. If our program can
somehow execute fewer instructions, this will amount to fewer MIPS and directly
result in saving $$$. So, the big question is: do we have control over our
programs to execute fewer instructions? The answer is certainly YES. We are the
creators of our programs, and we have options available at each step (coding,
compiling, link-editing, etc.) to optimize them. This is the art of creating
optimized and efficient programs, which execute fewer instructions in total to
process the same amount of data and generate the same output.
• Handling database tables efficiently
And I am sure that you will have many experts around you from whom you can
get a lot of hints and tips in optimizing your programs.
• Is your application code optimized? – i.e., can you perform the same work
with a smaller number of instructions?
Application testing Considerations
I am sure you all know DD DUMMY coded in your JCL. The DUMMY parameter is
mostly used when testing a program: it specifies that no device is to be
allocated to the DD statement and that no disposition processing or I/O is to
be done. When testing is finished and you want input or output operations
performed on the data set, replace the DD DUMMY statement with a DD statement
that fully defines the data set. Many times, we use the
DUMMY parameter if we know that we will not need a file in a job step. But this is
not a good practice especially for output files. Using DUMMY for an input file e.g.,
“//SYSIN DD DUMMY” is perfectly fine. But, when we use DUMMY for the output
file, we might end up doing the full processing in the program, and just eliminating
I/O to the output file by coding DUMMY in the JCL.
A classic example: There is a business need not to generate the output file
anymore. But, instead of updating the program, we simply code DD DUMMY
in the JCL. We undoubtedly met our processing objective and saved some IO
and Disk/Tape space. But we did not reduce the processing time in the
program itself, which continues to burn CPU and therefore does not drive
down the CPU consumption and save $$.
Our suggestion: Please do look for DD DUMMY coded in your JCL, especially
for output files; review the program and make the necessary updates to
eliminate any processing done against such files.
HiperDispatch
To get the maximum out of the hardware, it is very important to understand your
system hardware configuration and the HiperDispatch function.
I will describe some basics here; you can read a lot about HiperDispatch from
different sources. To keep my description simple, I have taken some
descriptions as-is from the IBM manuals.
The above picture (Ref page 29 of Redbook sg248850) shows that a z15 holds
up to five CEC drawers. Your installation might have xxx processors (a
combination of GP and specialty engines) spread over multiple drawers. When
you assign LPs to LPARs in the CEC, the LP allocation for a specific LPAR may
spread across different drawers.
The above picture shows the drawer cache structure in z14 and z15. (Ref page
95 in Redbook sg248851).
When data is fetched, access time depends on where the data resides: in the
L1, L2, L3, or L4 cache, or in memory. The data must be in L1 for the
processor to use it; if it resides in a cache other than L1, it must first be
fetched into L1. So, how deep into the shared cache and memory hierarchy (the
“nest”) must the processor go to retrieve data not present in the L1 cache?
This is important because access time increases significantly for each level
of cache accessed, thereby increasing the processor wait time.
The following picture shows the relative cost of fetching the data from different
levels of cache or memory (taken from a presentation by Robert Vaupel in 2010).
While configuring the LPARs, we assign the number of logical processors (LPs)
and the amount of memory to be allocated to the LPAR. When the LPAR is
activated, PR/SM builds logical processors and allocates memory for the LPAR.
PR/SM is aware of the processor drawer structure on ‘znn’ servers. The processor
unit assignment of characterized PUs is done at POR time, i.e., when the CEC is
initialized. The initial assignment rules keep PUs of the same characterization
type grouped as much as possible in relation to PU chips and CPC drawer
boundaries to optimize shared cache usage. This initial PU assignment, which is
done at POR, can be dynamically rearranged by an LPAR by swapping an active
core to a core in a different PU chip in a different CPC drawer or cluster to improve
system performance.
PR/SM is able to give operating systems two cache advantages. First, PR/SM
can provide information about where logical CPUs are placed in the physical
topology. Second, PR/SM can place logical CPUs to increase cache benefits.
In vertical polarization mode PR/SM maps logical CPUs to real CPUs as closely
to one another as possible and moves these mappings as little as possible.
• Vertical High (VH): Equivalent to a physical processor effectively dedicated
to the LPAR
With the HiperDispatch feature enabled, the following table illustrates the
assignment of VH, VM, and VL processors based on LPAR weight (in this example,
a z14 model 720 CEC with 20 physical GPs).
LPAR    Weight  #LPs  Share %  Guarantee  VH  VM  VL
LPAR1      400     9       40        8.0   8   0   1
LPAR3       60     2        6        1.2   0   2   0
If the decimal part of the processor guarantee is ≥ 0.5, the vertical high
(VH) count is the integer part and there will be one vertical medium (VM).
If the decimal part is < 0.5, the VH count is the integer part minus one and
there will be two vertical mediums (VM). Any remaining logical processors are
vertical lows (VL).
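These rules can be expressed as a short sketch. The following Python function is a simplification, assuming a total CEC weight of 1000 (implied by the 40% share for a weight of 400 in the table above) and treating a whole-number guarantee as all-VH, as the table shows; actual PR/SM behaviour has more special cases.

```python
def polarize(weight, total_weight, physical_cps, logical_cps):
    """Simplified vertical-processor assignment under HiperDispatch,
    following the rules and the example table above."""
    guarantee = weight * physical_cps / total_weight   # guaranteed CPs
    whole = int(guarantee)
    frac = guarantee - whole
    if frac == 0:            # whole-number guarantee: all verticals high
        vh, vm = whole, 0
    elif frac >= 0.5:        # decimal part >= 0.5: one vertical medium
        vh, vm = whole, 1
    else:                    # decimal part < 0.5: two vertical mediums
        vh, vm = whole - 1, 2
    vl = logical_cps - vh - vm                         # remainder are lows
    return vh, vm, vl

# The two rows of the example table (z14 720, 20 physical GPs):
print(polarize(400, 1000, 20, 9))   # LPAR1 -> (8, 0, 1)
print(polarize(60, 1000, 20, 2))    # LPAR3 -> (0, 2, 0)
```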
Best practices:
The LPAR time slice is sensitive to the number of logical CPs. Having more logical
processors may drive your time slice to a smaller interval for the vertical medium
and vertical low logical processors.
As I learnt in the past from IBM performance experts, the sum of all logical shared
processors should not be more than triple the number of physical processors.
Otherwise, the LPAR management time to reassign the PUs to the logical CPUs
can increase to an unacceptable level.
Work will run most efficiently if you run within your defined weight, using vertical
highs and vertical mediums to support the workload and avoid use of vertical lows
except for occasional workload spikes. If the workload in the LPAR relies upon
vertical lows for throughput you may want to change the weight to match actual
usage.
2. Workloads should run mostly on vertical high (VH) and vertical medium
(VM) processors;
3. Vertical low (VL) processors should be used only for occasional workload
spikes
4. The number of vertical low processors should be limited to the ones really
needed, to reduce the risk of the vertical ‘short CP’ effect. That is, if
there is demand in an LPAR with many VL processors and the CEC has free
capacity, the VH weight will be distributed over the VL processors, causing
the performance impact known as the ‘short CP’ effect.
The short CP effect can lead to poor response time, especially for
CPU-intensive workloads that can be stranded on a logical processor that will
not run again for a long time. It can also waste cycles on each running
processor spinning for system locks held by processors that are not running.
To understand RNI better, I have extracted the following from the document
‘LSPR for IBM z, SC28-1187-24’, page 29. I suggest you refer to the latest
LSPR manual because the technology changes very fast.
Introduction
Instruction Path Length
A transaction or job will need to execute a set of instructions to complete its task.
These instructions are composed of various paths through the operating system,
subsystems and application. The total count of instructions executed across these
software components is referred to as the transaction or job path length. Clearly,
the path length will be different for each transaction or job depending on the
complexity of the task(s) that must be performed. For a particular transaction or
job, the application path length tends to stay the same presuming the transaction
or job is asked to perform the same task each time. However, the path length
associated with the operating system or subsystem may vary based on a number
of factors including:
a) Competition with other tasks in the system for shared resources – as the
total number of tasks grows, more instructions are needed to manage the
resources;
Instruction Complexity
The type of instructions and the sequence in which they are executed will interact
with the design of a micro-processor to affect a performance component we can
define as “instruction complexity.” There are many design alternatives that affect
this component such as: cycle time (GHz), instruction architecture, pipeline,
superscalar, out-of-order execution, branch prediction and others. As workloads
are moved between micro-processors with different designs, performance will
likely vary. However, once on a processor this component tends to be quite similar
across all models of that processor.
Memory Hierarchy and Memory Nest
There are many design alternatives that affect this component, such as: cache
size, latencies (sensitive to distance from the microprocessor), number of
levels, MESI (management) protocol, controllers, switches, the number and
bandwidth of data buses, and others. Some of the cache(s)
are “private” to the micro-processor which means only that micro-processor may
access them. Other cache(s) are shared by multiple micro-processors. We will
define the term memory “nest” for a System z processor to refer to the shared
caches and memory along with the data buses that interconnect them.
Workload capacity performance will be quite sensitive to how deep into the
memory hierarchy the processor must go to retrieve the workload’s instructions
and data for execution. Best performance occurs when the instructions and data
are found in the cache(s) nearest the processor so that little time is spent waiting
prior to execution; as instructions and data must be retrieved from farther out in
the hierarchy, the processor spends more time waiting for their arrival.
The most performance sensitive area of the memory hierarchy is the activity to
the memory nest, namely, the distribution of activity to the shared caches and
memory. We introduce a new term, “Relative Nest Intensity (RNI)” to indicate the
level of activity to this part of the memory hierarchy. Using data from CPU MF,
the RNI of the workload running in an LPAR may be calculated. The higher the
RNI, the deeper into the memory hierarchy the processor must go to retrieve the
instructions and data for that workload.
Many factors influence the performance of a workload. However, for the most part
what these factors are influencing is the RNI of the workload. It is the interaction
of all these factors that result in a net RNI for the workload which in turn directly
relates to the performance of the workload.
The traditional factors that have been used to categorize workloads in the
past are listed along with their RNI tendency in the following figure.
It should be emphasized that these are simply tendencies and not absolutes. For
example, a workload may have a low IO rate, intensive CPU use, and a high
locality of reference – all factors that suggest a low RNI. But what if it is competing
with many other applications within the same LPAR and many other LPARs on
the processor which tend to push it toward a higher RNI? It is the net effect of the
interaction of all these factors that determines the RNI of the workload which in
turn greatly influences its performance.
Note that there is little one can do to affect most of these factors. An application
type is whatever is necessary to do the job. Data reference pattern and CPU
usage tend to be inherent in the nature of the application. LPAR configuration and
application mix are mostly a function of what needs to be supported on a system.
IO rate can be influenced somewhat through buffer pool tuning.
However, one factor that can be affected, software configuration tuning, is often
overlooked but can have a direct impact on RNI. Here we refer to the number of
address spaces (such as CICS AORs or batch initiators) that are needed to
support a workload. This factor has always existed but its sensitivity is higher with
today’s high frequency microprocessors. Spreading the same workload over a
larger number of address spaces than necessary can raise a workload’s RNI as
the working set of instructions and data from each address space increases the
competition for the processor caches. Tuning to reduce the number of
simultaneously active address spaces to the proper number needed to support a
workload can reduce RNI and improve performance. In the LSPR, we tune the
number of address spaces for each processor type and Nway configuration to be
consistent with what is needed to support the workload. Thus, the LSPR workload
capacity ratios reflect a presumed level of software configuration tuning. This
suggests that re-tuning the software configuration of a production workload as it
moves to a bigger or faster processor may be needed to achieve the published
LSPR ratios.
The RNI of a workload may be calculated using CPU MF data. For z10, three
factors are used:
These percentages are multiplied by weighting factors and the result divided by
100. The formula for z10 is:
z10 RNI=(1.0xL2LP+2.4xL2RP+7.5xMEMP)/100.
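As a quick sketch, the z10 formula can be coded directly. Per the LSPR manual, the three factors are the percentages of L1 misses sourced from the local book's L2 cache (L2LP), a remote book's L2 cache (L2RP), and memory (MEMP); the counter values below are purely hypothetical.

```python
def rni_z10(l2lp, l2rp, memp):
    """z10 Relative Nest Intensity, per the LSPR formula above.
    Arguments are percentages (0-100) of L1 misses sourced from the
    local book L2 (l2lp), a remote book L2 (l2rp), and memory (memp)."""
    return (1.0 * l2lp + 2.4 * l2rp + 7.5 * memp) / 100

# Hypothetical CPU MF values: 60% local L2, 10% remote L2, 5% memory.
print(rni_z10(60, 10, 5))   # -> 1.215
```

Note how heavily the memory term is weighted: a small percentage of L1 misses going all the way to memory dominates the RNI, reflecting the relative cost picture shown earlier.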
Tools available from IBM (zPCR) and several vendors can extract these factors
from CPU MF data. For z196 and zEC12 the CPU MF factors needed are:
The RNI of a workload may be calculated using CPU MF data. For IBM z16, four
factors are used:
• L4RP: percentage of L1 misses sourced from a remote drawer VL4 cache
• MEMP: percentage of L1 misses sourced from memory
Hints and Tips
I describe my experiences here. They are not described in any particular order.
You can refer to any topic based on your interest.
Running the mainframe platform effectively and efficiently, upgrading and
patching in a timely manner with minimal or no downtime, identifying and
fixing issues without downtime, implementing automation, and creating a robust
Disaster Recovery infrastructure and policy are the main objectives and goals
of a mainframe zOS infrastructure team.
But the biggest challenge that still remains in any mainframe organization is the
‘cost’; i.e. the Green $$ paid to IBM and the vendors.
• HW maintenance
• DR infrastructure
• Office space
• People
But the biggest variable cost comes from the monthly SW license charges, which
are based on MSU-hour consumption under TFP (Tailored Fit Pricing).
Manage hardware
Make a long-term plan for physical HW provisioning and upgrades, for at least
the next three to five years, considering the following:
• Current utilization
• Encryption requirements
You must ensure adequate Disaster Recovery (DR) capacity based on the
organizational policy such as to support 100% or 90% or 80% of normal
production workload, whether test systems are required during DR or not, and
whether some of the applications can remain down or not.
You should clearly document the hardware requirement over a period of time, the
minimum, maximum and good to have, and review it with the SMT and get their
approval. And most importantly, highlight any deviations proactively.
This is not easy. However, we must have a list of all software running under
our zOS. The zOS suite of software is integrated, and you can only control
separately licensed features such as SDSF, RMF, and others, which are enabled
through the Parmlib member IFAPRDxx. Focus on the other products, from IBM and
non-IBM vendors, and justify why each product is needed in our system; do a
cost-benefit analysis and always look for alternatives, if available. You must
have a process to manage EoS (End of Service) and EoL (End of Life) software.
Also review software products that duplicate functionality found in other
products with minimal additional benefit. For example, if we are using the
entire Omegamon product suite, it could duplicate other specific products
providing similar functions for Network and DB2.
You must find innovative ways of getting the maximum out of the available
hardware. I have experienced the following:
• Configure the weights and LPs to allocate only Vertical High (VH) engines
to production systems, which provides almost 10 to 20% more MIPS per
engine
Reduce CPU wastage
• Shutdown very rarely used STCs, especially in Test systems and start only
when it is needed
• Do not restart failed jobs without fixing the issue; do not allow them to
fail again
• Track and avoid S322 abends (a job abends after consuming its allocated
CPU time). Implement a job restart technique to continue the remaining
processing after a job fails
• Distribute workload in the various LPARs. For example, avoid starting HSM
processing or DB2 archive at the same time in all LPARs running in the
same CEC
• Use WLM managed initiators unless you need JES managed initiators
because of installation need, and if needed, start the required number only
• Eliminate CPU intensive customizations, and use only when required, e.g.
trace data collection
• Reduce volume of test data in test systems – Rather create your own test
data instead of copying a chunk of production data
This list can be very long based on our observation, data utilization analysis and
mix of workload running in our systems.
My technique is very simple: identify the abnormalities that result in wastage
and take action to fix them.
But, how to identify these abnormalities? For this you need to establish what is
normal. Only then will you be able to identify the abnormal. You need to start
somewhere.
My suggestion:
• To analyse the CPU usage of all tasks running in the system, the system
tasks, online CICS transactions, and the batch jobs.
• You must establish a historical trend over at least last 13 months and
determine a CPU usage pattern such as:
o The average CPU utilization per day of GRS and other system tasks
o Usage pattern of DB2 and MQ based on the variation of workload
processing
Then, use this historical data as a baseline and do a step-growth analysis to
identify any abnormalities. If you find opportunities for optimization, create
tasks and assign them to the respective SMEs for implementation. This will be
a repetitive analysis; you can refresh your baseline from time to time, if a
new level proves to be the new normal.
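As an illustration of the step-growth idea, the sketch below flags a task whose latest usage deviates from its historical baseline by more than a chosen number of standard deviations. The task name, the daily figures, and the two-sigma threshold are all hypothetical; tune the threshold and window against your own data.

```python
from statistics import mean, stdev

def flag_abnormal(history, latest, threshold=2.0):
    """Flag a task whose latest CPU usage deviates from its historical
    baseline by more than `threshold` standard deviations.
    `history` is e.g. a series of daily average CPU seconds."""
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return latest != baseline
    return abs(latest - baseline) > threshold * spread

# Hypothetical daily CPU seconds for a system task such as GRS:
history = [120, 118, 122, 121, 119, 120, 123, 117, 121, 120]
print(flag_abnormal(history, 121))   # within the normal band -> False
print(flag_abnormal(history, 180))   # a step growth worth investigating -> True
```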
And the most common CPU wastage is in writing inefficient SQL. I have noticed
opportunities to tune most SQLs.
Nothing is right or wrong
Please note that: nothing is right or wrong. If you want to understand it from a
different perspective, everything is right and everything is wrong.
Generally, the mainframe platform has existed in the organization for a long
time, and lots of so-called legacy has been built into the systems: special
systems configurations through Exits and Usermods, in-house developed programs
with hooks into system components, in-house developed programs performing
specific functions, and so on.
In 1994, I visited an organization that had developed CICS programs in
assembler and had a highly customized VTAM, which had effectively become a
different component altogether and was supported by a specific staff member at
IBM. When that person retired from IBM, the organization was in very big
trouble supporting the customized VTAM. They had no choice other than to
explore moving the applications to standard CICS and COBOL so that they could
use standard VTAM functions and get rid of the in-house customization
dependency. This was a huge technical debt and cost to manage.
When we build (install, patch, and customize) the software, we most often carry
forward all customizations that were there previously. This is perfectly fine in the
short-term, because we do not want to introduce any issues to the systems and
applications after the build. But, with the advancement of time, and zOS and
language features - products have provided many rich features which, I strongly
feel, need to be analysed for applicability. It is also key to prepare a long-term
plan to take advantage of these new features. Also, there should be a focus on
reviewing the old EXITs, USERMODs, and in-house developed tools, developed and
used for a long time, to reduce the technical debt inside the mainframe
organization.
At the same time, from the CPO point of view, everything is wrong unless proven
right. So, create a very systematic approach to review and analyse each and
every component (e.g., zOS and product configuration parameters, all in-house
developed tools, application codes for CICS transactions and batch jobs and all
housekeeping tasks like SMF data collection and processing, systems and
database backup, HSM space management, zOS DR configuration, etc.) and
look for all possible optimizations till you are convinced that this is normal. Then
take it as a baseline and move forward. When you complete the cycle of reviewing
all tasks in the systems, please go back in the loop again. To conserve time and
make the review and analysis process effective, make use of automation tools.
And most importantly, keep your eyes and ears open and note all the changes to
the system, such as: upgrade of a single product, enhancement to a specific
business application, implementation of new business applications. Also, keep an
eye on all systems incidents and problems. This helps generate a large number
of CPO opportunities.
CPO Team
I strongly suggest at least a two or three member dedicated CPO team under the
zOS infrastructure to perform analysis and identify various opportunities for
optimization. Then follow the CPO process in your organization to execute the
optimizations by the CPO team itself, or offload or assign the same to the different
SME areas (systems programmers, application development teams, planners,
operation team etc.) for execution and deployment. The CPO team must provide
a periodic update to the SMT on the number of initiatives, cumulative savings so
far against the target, and the required support from the SMT to push forward the
optimizations.
CPO Objective
• Manage CBU (Capacity Back Up) to meet capacity demand for business
continuity
CPO Benefits
CPO benefits are manyfold, and it is very difficult to list them all. But, in
my view, the major benefits are as follows:
• Effort and action in saving every CPU cycle, thereby reducing CPU
utilization and improving processing times
• Determine inefficiencies in the system, middleware, I/O, and application
performance.
Capacity Management
All three layers are important for the organization and I have included most of
them in this book. However, I will focus more on the ‘Component Capacity
management’ which we all know as ‘Resource Capacity Management’. The major
task under Resource Capacity Management is to provision an adequate resource
capacity to meet all business processing demands i.e., to provision the right
amount of resources at the right time and at the right price to keep operations
running smoothly.
In my opinion, the capacity plan should be a dynamic document, prepared yearly
with thorough review and final approval from the senior management team. And
the actual resource utilization should be tracked, and the capacity plan may
undergo ad hoc changes depending on the variation in business requirements
and the variation in usage pattern based on multiple operational factors.
Short-term forecast
Capacity forecast for the production environment to meet the demand for
month-end, quarter-end, year-end, seasonal, and special processing.
Long-term forecast
Long-term forecast is performed using the utilization trend for at least the previous
12 months and considering the agreed BAU growth, known projects and
optimization savings. The long-term plan must forecast the capacity for the next
three to five years.
Ad hoc forecast
Capacity forecasts for projects and other one-time requirements are done on
demand, and the results are taken into the short-term and long-term forecasts,
as applicable. In general, an ad hoc capacity forecast is performed on demand
to address the following:
Capacity provisioning
Based on the capacity forecast, the CPO team should make sure that the
required capacity is available to meet the demand. CPU capacity provisioning
considers the following aspects:
• Provisioning of enough physical CPU capacity to support the absolute
online peak demand to meet the critical workload SLA.
• Manage the overall hourly average CPU utilization below nn% (suggest
90%)
The following resources are considered for the forecasting. In general, the
forecast is performed for GP CPU as this is the most variable resource
component. However, the other components should be reviewed from time to
time and mostly on demand - for example during a major hardware/software
upgrade, major project implementation and to address performance issues.
▪ GP CPU
▪ zIIP CPU
▪ Crypto processor
▪ Central storage
▪ DASD storage
Capacity Reporting
Performance Management
• Performance reporting
Capacity Optimization
Objective
• Define a specific target for saving xxxx MSU-hour per financial year
Optimization Tasks
These tasks are at a very high level, but the CPO team can compile a list of
tasks at a more detailed level to work on.
Workload
• Batch jobs – focus first on all jobs consuming more than 60 sec CPU time
and then go down the list.
• CICS transactions (make a smart strategy for optimization as the number
and volume is too high)
• TSO users – Focus only on the high consumers and implement time-out for
idling users
Input
Savings
Optimization outcome
• May not have direct reduction in CPU consumption but results in system
performance improvement – workload balancing
• So, keep it generic: average all savings over the whole month –
ignore nothing and be conservative in your calculations
• The cost is computed over the whole year (cost per MSU-hour = xxxx $$
per year)
I calculate CPU savings by calculating the average usage at least two weeks
before implementation, and for at least two weeks of stable execution after
implementation. As the implementation could be in steps:
o Usage before =
o Usage after =
o Difference =
o Yearly savings =
Computation of cost
I suggest calculating the cost based on the MSU-hour unit rate used for IBM billing
and internal chargeback. You need to follow the standards used in your
organization.
#GP engines = 4
Per engine MIPS = 1486
Note: Based on your machine configuration, you can compute the values.
Based on the optimized task and frequency of execution you can compute the
amount of MSU-hours saved during a day/month/year and multiply it by the MSU-
hour unit rate to compute the cost savings.
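As a sketch, one common convention converts CPU seconds saved into MSU-hours by treating the CEC's software MSU rating as what all GP engines deliver when 100% busy. All figures below (a 500-MSU, 4-engine CEC, 600 CPU seconds saved per day, a 2 $ MSU-hour rate) are hypothetical; follow your own organization's billing standards.

```python
def yearly_cost_saving(cpu_sec_saved_per_day, cec_msu, gp_engines,
                       rate_per_msu_hour, days_per_year=365):
    """Convert a daily CPU-time saving into a yearly cost saving.
    Assumes the CEC's full software MSU rating corresponds to all GP
    engines being 100% busy, so one CPU second is worth
    cec_msu / (gp_engines * 3600) MSU-hours. This is only one possible
    convention; use the standards of your own installation."""
    msu_hours_per_cpu_sec = cec_msu / (gp_engines * 3600)
    msu_hours_saved = (cpu_sec_saved_per_day * msu_hours_per_cpu_sec
                       * days_per_year)
    return msu_hours_saved * rate_per_msu_hour

# Hypothetical figures: a tuned job saves 600 CPU seconds per day on a
# 4-engine CEC rated at 500 software MSU, at 2.00 $ per MSU-hour.
print(round(yearly_cost_saving(600, 500, 4, 2.00), 2))   # -> 15208.33
```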
Please be careful with the computation if you have different models of CECs,
such as a combination of 5xx, 6xx, and 7xx. But this is easy to manage if you
are using a vendor tool or a tool developed in-house.
Optimization based on CEC models is also possible. For example, a z14-704 CEC
is rated at 808 MSU, while a z14-511 is rated at 835 MSU with 11 engines. If
single-processor speed is not very important for you, then exchanging a
z14-704 for a z14-511 will give you much more L1 cache and parallelism in the
system.
CPO data analysis – A systematic approach
So, I strongly suggest collecting utilization at a more granular level using
report classes. Then, establish the logic for proper bucketization. For
example, if all your DB2 started tasks are classified under workload DB2
except the IRLM tasks, then strip off the IRLM utilization through a report
class and add it to the utilization under the DB2 workload to get the complete
utilization by DB2 started tasks.
Please do take the help of report classes to avoid defining too many service
classes.
Please make sure to classify all tasks running in the system. If anything is left out,
then they go under service class SYSOTHER and run with discretionary goal,
which is not good. So, it is important to identify the tasks not classified and classify
them appropriately.
Workload classification for easy Chargeback
• CEC
• LPAR
• Uncaptured
• Workload
• Service class
• Address space
I recommend a drill-down approach for easy analysis. For example, the CEC
utilization should be a stack of the utilization of all the LPARs in the CEC.
When we click on any LPAR, it should drill down to a stacked chart of all
workload utilization in the LPAR. You can establish your own strategy for
performing drill-down analysis.
• Batch workload
• #Jobs
• CPU
• EXCP
• Online workload
• MQ
• WAS
• CF CPU utilization
• GP CPU granularity – TCB, SRB, RCT, IIT, HST – you will need to perform
CPU analysis at this level of granularity only in some specific cases
▪ LPAR
o Full preemption
o Getmain/Freemain activity (recommend cell pools)
o Interrupts
o SLIP processing
• CPU Consumption at the WLM Service Class and Service Class Period
Level
▪ Batch window
Analysis approach
• Trend analysis
• Utilization pattern
• Comparison of same day from week to week and same month over year to
year
2. For effective data analysis, focus on the context for which you are doing
the data analysis. And so, it is important to use the right data required for
the analysis.
b. If you are doing data analysis for the online period, you may split the
window into two: prime online and normal online.
c. If you are doing data analysis for the batch window, you may split the
window into two: critical batch and non-critical batch.
d. You may split the workload to online and batch throughout the day
so that you can have a better picture of your online and batch
workload distribution.
f. If your charge is based on the R4HA MSU, then focus on the peak
utilization window. Move all non-critical workloads running during the
peak period to a low-utilization window, if possible. Move some of the
non-critical workloads or weekly tasks to the weekend, and run heavy
CPU-intensive jobs or database clean-up jobs on the weekend.
workload and finally to the time window. Verify, if there were any
problems recorded by the zOS systems programmers in the
respective systems and time windows. If not, then you have to work
with the respective SME or owner of the task for further analysis. It
may be necessary to follow up with the vendor of the product for
further assistance.
i. From the SMF records, you get data averaged over the SMF interval,
5 minutes, 10 minutes, 15 minutes etc. If you need to get more
granular data to learn the variation in usage during this interval, you
may take help from other sources like RMFMON III, OMEGAMON or
any other tool.
j. If you notice that RMFMON III data is vital for analysing some
specific issues, please keep a backup of your RMFMON VSAM datasets,
as the data may get wrapped around soon, depending on the number of
datasets defined to store the historical data.
k. During your analysis, if you use different tools like SDSF, RMFMON
II, RMFMON III etc. please keep a snapshot of any interesting
observations, which will help you a lot during the later analysis.
i. SMF interval
ii. Hour(s)
vii. Weekends
viii. Month
ix. Quarter
x. Year
o. You may use cumulative usage data, such as CPU seconds or MSU-hours,
depending on the scenario.
q. You should not have any doubts in your mind after doing the analysis.
You must have taken all related considerations into account and have
proper justifications based on the data. As I mentioned, performance
and optimization analysis is an art based on common sense. So, when you
go to people with your recommendations, you will get many questions and
challenges, and you should be ready and equipped with your data and
observations.
s. Never hesitate to ask for help and assistance from other SMEs. zOS is
an ocean, and we are not supposed to know everything, but many inputs
are needed to aid our analysis. The people who work around us have many
years of varied experience; they can assist us in many ways and provide
invaluable guidance in performing our tasks better.
3. You must analyse the CF CPU utilization, subchannel utilization, Sync and
Async service times, and Sync requests converted to Async (CHNGD), and
take appropriate action for any abnormality against your defined thresholds.
You must allocate sufficient central storage to the CF LPARs and monitor the
amount of storage allocated to the structures. Maintain enough free storage
in each CF LPAR to support structures moved from other CF LPARs during
issues, maintenance, and DR.
Note: Keep a special focus on structures moved from this CF LPAR to
another CF LPAR during issues, DR, or CF maintenance, and make sure to
move them back to the home LPAR, especially when they were moved to a
remote CF.
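As an illustration, a threshold check of this kind can be sketched as follows. The metric names and threshold values are placeholders (only the 10% CHNGD rule appears later in this book); substitute the values extracted from your RMF CF activity data and your own installation limits.

```python
# Sketch: checking CF interval metrics against installation thresholds.
# Metric names and limits are illustrative, not an IBM-defined interface.

THRESHOLDS = {
    "cf_cpu_util_pct": 70.0,        # assumed installation limit
    "subchannel_busy_pct": 30.0,    # assumed installation limit
    "chngd_pct_of_requests": 10.0,  # Sync requests converted to Async
}

def check_cf(sample: dict) -> list:
    """Return a list of threshold breaches for one CF interval sample."""
    alerts = []
    for metric, limit in THRESHOLDS.items():
        value = sample.get(metric)
        if value is not None and value > limit:
            alerts.append("%s: %.1f exceeds %.1f" % (metric, value, limit))
    return alerts

sample = {"cf_cpu_util_pct": 82.5, "subchannel_busy_pct": 12.0,
          "chngd_pct_of_requests": 14.2}
for alert in check_cf(sample):
    print(alert)
```

The same pattern extends naturally to the lock-contention limits listed in the health check section.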
4. I have not touched much upon central storage. Nowadays, central storage
is cheap, and we can be generous in allocating more central storage to
LPARs to maintain a healthy 30 to 40% available frame queue. However, if
you have a central storage constraint, you must monitor it closely and
analyse central storage usage very frequently. Keep an eye on local page
dataset usage and include it as one of the items in the daily health check.
5. I also have not discussed IO issues under DASD and tape in much detail,
because channel speeds are very high today and we provision adequate
DASD and tape storage to cater to our needs. But, as part of the CPO
activity, we must track the DASD and tape utilization data to ensure enough
buffer is available for at least the next year. We should never reach a
situation requiring an emergency purchase of DASD or tape storage. Do
keep an eye on the average DASD and tape response times and IO volumes
from the RMF Postprocessor Summary report.
Major analysis outcome
• Optimization opportunities
• Hardware/software recommendations
• Management reports
Tools and data source
Here I describe some common tools used by the CPO function.
SMF
For CPO, we need the right data to perform the right analysis, which is chiefly
CPU usage. z/OS captures CPU utilization at various levels in the SMF records.
The structure of each SMF record type is complex by nature, so we need a tool
to format the records into usable data for analysis. Many organizations use
SAS/MXG to process the SMF data, or use a tool on a non-IBM platform to
which the SMF data is sent for formatting.
What is SMF
Major Uses
• Performance tuning (e.g., devices, jobs, network, data sets, WLM)
• Capacity planning
• Optimization activities
• Configuration analysis
• Problem identification
• Jobs scheduling
• Management reporting
• CPO team (daily, weekly and monthly trends, WLC reporting, optimization)
• Others….
Reporting
• CPU is the most critical and expensive resource, and all teams, including
senior management, are keen to have this resource usage available to
them at all times
https://share.confex.com/share/119/webprogram/Handout/Session11309/11309.pdf
• RMF CPU Records (Type 70 - CEC CPU usage, LPAR usage, zIIP usage,
zAAP usage, IFL usage, CF usage)
• RMF Monitor II (Type 79 - CPU usage by address spaces and enclaves)
Data Collection
• Select only the needed fields (SMF records may include hundreds of fields)
• Perform only the needed aggregations (at the smallest possible interval)
• Offload from the mainframe and use less expensive resources and tools
with equal or better performance
Presentation
Present the CPU utilization in MIPS, MSU, percentage of the CEC, and number
of standard processors (CPs). The information should support drill-down for
further analysis. Exceptions may be highlighted in red when:
• The uncaptured CPU utilization is greater than the threshold
• The difference between MVS BUSY utilization and CPU BUSY utilization
is greater than the threshold
• Allow users to create personal report lists customized for their activities
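A minimal sketch of the exception tests described above; the threshold values are placeholders for your installation's limits, and the names are my own.

```python
# Sketch: flagging the two CPU reporting exceptions listed above.
# Threshold values are illustrative placeholders.

UNCAPTURED_LIMIT_PCT = 10.0   # assumed threshold
MVS_VS_CPU_GAP_PCT = 20.0     # assumed threshold

def exceptions(uncaptured_pct, mvs_busy_pct, cpu_busy_pct):
    """Return the list of exception conditions to highlight in red."""
    flags = []
    if uncaptured_pct > UNCAPTURED_LIMIT_PCT:
        flags.append("uncaptured CPU above threshold")
    if (mvs_busy_pct - cpu_busy_pct) > MVS_VS_CPU_GAP_PCT:
        flags.append("MVS BUSY vs CPU BUSY gap above threshold")
    return flags

print(exceptions(12.0, 95.0, 60.0))
```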
I have prepared a list of tools and products available in the market that help you
process SMF data and present the information you need for your analysis. The
list is not exhaustive, and there could be many more tools available to perform
your task.
• SAS/MXG
• IBM SMF Explorer
• EasySMF
• SMT Data
• PerfTechPro zAnalytics
• IBM REXX
• IBM DFSORT
• In-house Solutions
• BMC Visualizer
• IBM RMF
• All of these tools create good DBs which can be used as inputs to other
applications as well as providing immediately usable information
• For some only a total replacement will work
• SMF records are generated by RMF Monitor I, Monitor II, Monitor III
• Records can now be written to either the MANxx data sets or, beginning
with z/OS V1R9, the MVS System Logger
• RMF Postprocessor reads the SMF type 7x records to generate the RMF
CPU Activity Report
RMF
RMF is the main tool under z/OS and is considered 'the best friend' of CPO. It
is difficult to provide a summary of RMF; there is plenty to read, and you can
enhance your skills by using the tool continuously. For CPO, we need to
remember the rich capabilities of RMF.
One thing to note is the SMF interval set in the installation. For effective
analysis and reporting, I suggest configuring the same SMF interval in all
LPARs in the installation.
As you know, RMF Postprocessor reports are invaluable for CPO activities. But
they depend on the availability of the right SMF records up to the right time.
One of the most challenging tasks for me has been to generate Postprocessor
reports using current data. I have used the following two methods to achieve
this.
1. Generate Postprocessor reports using data in the SDS. This data is very
limited and wraps around very fast, so you must be really quick.
2. Switch the MAN dataset and use the current SMF data once it has been
dumped. But you must wait until the end of the current SMF interval
before switching.
You have to find similar ways to get data from SMF logstreams.
WLM
zOS executes multiple tasks concurrently. These tasks require system resources.
Workload Manager (WLM), a component of the z/OS system, monitors and
distributes the system resources to the competing workload depending on the
goals defined for it.
You will find lots of technical details and descriptions in IBM manuals,
Redbooks, SHARE presentations, and training sites. So, I will not go into the
details of what WLM is, how it works, or how to define the service definition.
WLM service definition at a glance - this is a very old picture but is still very
relevant. (Picture taken from a presentation by Robert Vaupel in 2007)
Here is a summary of workload importance and dispatching priority for
reference. Please refer to the following link, from which I have taken the
picture, for more details.
https://share.confex.com/share/118/webprogram/Handout/Session10888/WLM%20Basics.pdf
Here are a few operational tips from me:
a. Follow a good naming convention while defining the service definition. E.g.,
name all service classes with the first two characters 'SC', such as
'SCxxxxxx', report classes with 'RC', such as 'RCxxxxxx', and so on.
<HLQ>.<PLEXNAME>.<Date>.<Time>.WLM
<HLQ>.<PLEXNAME>.<Date>.<Time>.WLM.CURRENT
d. You may keep a history of your updates to the service definition. Use the
'Notes' drill-down in the menu or any convenient tool (Notes, a Word file,
an Excel file, an MVS file, etc.).
e. While using Resource Group Type 1, be careful, as the scope is the
Sysplex, and the WLM SU/sec rating for the different LPARs in the Sysplex
is based on the number of LPs allocated to the respective LPARs.
For example, under a z14 7xx model CEC, if you allocate 2 LPs to an LPAR,
WLM will consider it to be a model 702 and the SU/sec is 88397.7901. If
you later change the number of LPs to 3, WLM will consider the model to
be a 703 and the SU/sec will change to 86021.5054. You can always find
the SU/sec rating of an LPAR for a specific interval in the WLM activity
report, printed at the end of the interval.
SYSTEMS
---ID--- OPT SU/SEC CAP% --TIME-- INTERVAL ---ID--- OPT SU/SEC CAP% --TIME-- INTERVAL
P01 00 88397.7 100 15.01.00 00.59.03 P02 00 86021.5 100 15.01.00 00.59.03
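The SU/sec rating is what converts WLM CPU service units back into CPU seconds. A minimal sketch, using the z14 702/703 ratings quoted above (the function and variable names are my own):

```python
# Sketch: converting WLM CPU service units to CPU seconds using the
# SU/sec rating shown in the WLM activity report. The ratings below are
# the z14 702/703 values quoted in the text.

SU_PER_SEC = {"702": 88397.7901, "703": 86021.5054}

def cpu_seconds(service_units: float, model: str) -> float:
    """CPU seconds represented by the given amount of CPU service."""
    return service_units / SU_PER_SEC[model]

# The same amount of service translates to slightly more CPU seconds
# on a 703, whose per-engine SU/sec rating is lower.
print(round(cpu_seconds(883_977.901, "702"), 2))  # 10.0
```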
f. While updating classification rules, place your entry in the right position
among the level-1 entries, as WLM selects the first matching classification
rule in the list. Your resource may already be covered by a group or
generic definition above it.
g. If not sure, I always place very specific classification rules at the top of
the list, and I never sort the classification list.
j. I create a report class for each system task, such as GRS, JES2,
MASTER, etc., which may amount to around 40 to 50 report classes.
m. Try to avoid capping unless you are a service provider who wants to
provision exactly what the customer needs in a shared environment.
n. I issue the 'D WLM' command before and after installing the definition and
activating the policy, for a record in SYSLOG.
o. Do not define two service classes with the same importance and goal. I
have only done this for batch jobs running under JES-managed and
WLM-managed initiators so that both have the same goal.
p. Keep the service definition clean, and document the justifications for goal
changes as applicable.
CPO needs a lot of information for its analysis. It is virtually impossible to list all
the information you need for your day-to-day work. So, I look for the information
I need in specific situations and contexts. You must develop your own way of
finding the right information, as quickly as possible, and its sources.
During my work under CPO, I have referred to a lot of information, and the
sources could be counted in the hundreds. You need to read a lot and refer to
many documents while working on specifics such as optimized LPAR
configuration, SMT, WLM, capping, performance, RMF, SMF, optimization,
capacity forecasting, etc. I refer to IBM manuals, IBM Redbooks, SHARE
presentations, Cheryl Watson's Tuning Letters, and lots of papers,
presentations, and articles found on the internet. I compile and store them on
my laptop for later reference.
However, I will provide a flavor of the typical sources from which I get invaluable
information for my CPO work. I will copy some screenshots from the IBM 'RMF
Report Analysis SC34-2665-01' manual for illustration.
You have to develop an analysis strategy to navigate through and refer to
various sources of information while working on CPO activities and initiatives.
this and have a look at the trend almost every day. It has helped me to identify
many opportunities for performance improvement. For example:
• I found that TSO users were never logged off and remained in the
system, which gave me the input to implement a TSO session time-out
for idling.
• It has helped me to get a trend of the maximum and average JOB, TSO,
and STC tasks in the system, and the variation in this, especially in the
test systems.
RMFMON III ‘CPC’ Panel
You will find the LPAR configuration in the CEC at a glance. I have highlighted
in red some data that I frequently look at.
Partition --- MSU --- Cap Proc Logical Util % - Physical Util % -
Def Act Def Num Effect Total LPAR Effect Total
*CP 40.0 1.5 61.5 63.1
LPAR1 0 5 N N N 1.0 3.2 3.3 0.0 0.1 0.1
LPAR2 0 1 N N N 1.0 0.9 1.0 0.0 0.0 0.0
LPAR3 0 7 N N N 1.0 4.6 4.8 0.0 0.2 0.2
LPAR10 0 8 N N N 1.0 5.6 5.7 0.0 0.2 0.2
LPAR11 0 30 N N N 2.0 10.4 10.6 0.0 0.8 0.8
RMFMON III ‘SYSINFO’ Panel
• MVS Busy - Avg MVS Util % – z/OS has dispatched work on a logical CP
eligible to be executed
• LPAR Busy - Avg CPU Util% – PR/SM has dispatched work on a physical
CP so that it can be executed
• MVS Busy > LPAR Busy when the workload exceeds the LPAR weight
and surplus CPU is unavailable from other LPARs, e.g., because of
(soft) capping.
RMFMON III ‘PROCU’ Panel
SDSF ‘JM’ line command
This panel has been very helpful in dynamically tracking storage usage. Once,
our OPC task was suddenly getting frequent S878 abends. We increased the
region size, kept monitoring the storage usage with the SDSF 'JM' line
command against the OPC task, and recycled the task before it hit the S878
abend.
Postprocessor ‘WLM Activity’ report
This has helped me a lot in performing deep-dive analysis of performance-
related issues. The entire report has helped me resolve multiple performance
issues, but a few important fields to look at are:
RMFMON III Enclave panel
Others
I could keep writing a lot more, but these are just some examples. You can
document an exhaustive list based on your analysis requirements.
Create my own database for critical data
I have collected some data for quick and easy reference and analysis. In one of
my organizations, we developed this in-house, using SAS/MXG to extract the
required data from SMF records and a web-based application on the Windows
platform to present the data. In other organizations, we used a tool to get the
required information. The metrics and the amount of data need to be decided
based on your analysis approach. I prefer historical data over the last 13
months, to understand what the status was in the same month last year and
the trend over the last 12 months.
Memory utilization - allocated; used (LPAR)
Uncaptured CPU (LPAR; current and history)
Disk utilization (Sysplex, LPAR; current and history)
Channel path utilization (Sysplex, LPAR)
DASD - IO rate; IO response (Sysplex, LPAR)
Tape - rate (Sysplex, LPAR)
Top highly utilized volumes (Sysplex, LPAR)
Systems summary - snapshots; summary (as provided in the RMF
Postprocessor Summary report, and more if necessary) (LPAR; current and
history)
CICS transaction (interval record) - region name; region-level CPU; #total
transactions; top 50 transactions and their CPU (CPU sec, MSU, MIPS)
(Sysplex, LPAR; current and history)
Batch job - CPU (CPU sec, MSU, MIPS) for each job; start time; end time;
elapsed time; workload; service class; report class; JES job #; LPAR name
(Sysplex, LPAR; current and history)
Other workloads (database, messaging, WAS, systems, etc.) - CPU; memory;
#transactions, if applicable; WLM goal (Sysplex, LPAR; current and history)
Business workload (if defined/grouped) - CPU; #transactions/jobs; SLA
(Service Level Agreement) (business entity; current and history)
Coupling Facility - CPU; subchannel utilization; service time (Sync/Async);
CHNGD (CF; current and history)
Capacity monitoring
Resource capacity utilization monitoring is an important task under CPO. In
general, it is a distributed activity covering the z/OS systems programmers, the
CPO team, the operations team, and the development community, and it is a
normal activity in almost all organizations. But what we lack is real-time
performance monitoring, and I will explain this in detail here.
The biggest challenge in my career has been to get data in real time. I have
faced multiple CPO issues every day, and I always look for the most current
data and quite often need some history. I can get snapshots of current data
from various tools such as OMEGAMON, RMFMON, SDSF, and other non-IBM
products, but they hold very limited historical data, and I have to navigate
through multiple panels to view and collect some of it. This often does not give
me the clarity I am looking for, and I always want more data as quickly as
possible. One technique I used was to make a list of the most frequently used
data and store its history in an Excel file or database. But this has its own
challenges, as Excel has limitations in processing larger amounts of data. So,
the best way I found was to store the data in a database and use modern web
tools to extract, process, and present it quickly in different formats. Currently,
we have many tools available to make our life easy, provided the organization
is ready to invest in a tool. This undoubtedly solves, to some extent, the
problem of getting historical data; however, getting real-time data in one central
place remains the challenge.
In another organization, we had code developed on the mainframe in
assembler to collect the required data in real time at a very low cost (CPU
cycles) and feed it (some data every minute and some every five minutes) in
XML format to a front-end Linux server, where it was processed using
Elasticsearch and displayed with Kibana. Many dashboards with comparison
facilities were developed for effective monitoring. Using this, around 60 to 70
percent of the monitored real-time capacity and performance issues were
identified within 5 to 10 minutes of occurrence, and appropriate action was
taken to avoid any major issues. The metrics we considered were CEC and
LPAR CPU utilization, CICS transaction counts and CPU utilization, subsystem
CPU utilization, etc. (other metrics can be collected as needed). Alert emails
were automatically triggered when a defined threshold was breached, for
example when the average LPAR utilization was consistently above 90% for a
period of 15 minutes, or a task was using more than one engine's worth of
capacity for a period of 10 minutes. We also collected some data from RMF
Monitor III through APIs.
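The first alerting rule above (average LPAR utilization above 90% for 15 consecutive minutes) can be sketched as follows, assuming one utilization sample per minute. The class and names are my own illustration, not the assembler implementation described.

```python
# Sketch: fire an alert only when every sample in the rolling window
# breaches the limit, i.e. the condition is sustained, not a spike.
from collections import deque

class SustainedBreach:
    def __init__(self, limit_pct=90.0, window_minutes=15):
        self.limit = limit_pct
        self.window = window_minutes
        self.samples = deque(maxlen=window_minutes)

    def add(self, util_pct: float) -> bool:
        """Record one per-minute sample; True when an alert should fire."""
        self.samples.append(util_pct)
        return (len(self.samples) == self.window
                and min(self.samples) > self.limit)

mon = SustainedBreach()
fired = [mon.add(u) for u in [95.0] * 20]
print(fired.index(True))  # first alert after the 15th sample -> index 14
```

The same class, with different limits and window lengths, covers the "more than one engine's worth for 10 minutes" rule as well.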
To me, this is undoubtedly a great innovation, and I wonder whether such a tool
could be developed and made available in the market for use by most
customers.
Here, I am giving a flavor of only a few examples that we generated and
presented in real time. As I see it, we have the opportunity to explore, generate,
and present a lot of information in real time in user-friendly, easy-to-understand
dashboards. Our operations team works 24x7, with many displays available in
the command center. They are our perfect partners from whom we can seek
help for real-time monitoring. With a little effort at our end and a little education
for our operations specialists, we can make use of this information and reap the
benefits now and in the time to come.
Example 1
The following information shows a very high-level snapshot of CEC MIPS; the
data is refreshed every 5 minutes. It gives the current status of all CECs, and
the utilization can be plotted in real time.
CEC2
3906-738 GP MIPS zIIP MIPS IFL MIPS
Activated 40324 18320 9160
Consumed 32500 9000 1200
Utilization % 81% 49% 13%
Unused 7824 9320 7960
Reserved 4007 0 0
Example 2
The following information shows a very high-level snapshot of some LPAR
information in a Sysplex, refreshed every 5 minutes.
Please note that 'IBM GP CPU Speed' is the IBM-rated GP MIPS per engine
based on the model, while 'Curr GP/zIIP MIPS' is the actual MIPS output per
engine at this time.
SMF record type 113 provides the information needed to compute the current
MIPS output. To compute MIPS, divide the CPU speed, i.e., the CPU cycles
per second of your CEC, by the number of cycles per instruction (CPI), and
then divide by 1 million.
For example, if we are using a z15, which has a CPU speed of 5.2 GHz, MIPS
is calculated as follows, based on the information collected from SMF 113 for a
specific interval.
You may consider adding much more information to the above table.
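The computation described above can be sketched as follows. The 5.2 GHz clock is the z15 figure quoted in the text, and the CPI value is the one shown for processor 14 in Example 10; the function name is my own.

```python
# Sketch: current MIPS output from clock speed and CPI, as described in
# the text: cycles per second / CPI / 1 million.

def mips(clock_hz: float, cpi: float) -> float:
    return clock_hz / cpi / 1_000_000

# z15 at 5.2 GHz with a CPI of 1.8, in line with the ~2,888.88 MIPS
# shown for the CPI 1.8 engine in Example 10.
print(round(mips(5.2e9, 1.8), 2))  # 2888.89
```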
Example 3
The following example shows the configuration of a CEC and its current status,
refreshed every 5 minutes.
You may use this dashboard to track any changes to weights and the number
of LPs in real time and trigger alarms or emails for any change to the
configuration.
Example 4
The following example shows the CICS transaction rate and the average
response time of the most critical transactions in the agreed SLA, refreshed
every 5 minutes.
The above helps us learn of an SLA breach, if any, and investigate at our end
before the business entities or problem management knock at our door with
response time issues. As I have seen, nowadays business units' helpdesks
use very smart tools at their end to track business transaction response times.
Example 5
The following example shows the R4HA MSU tracking in real time, refreshed
every 5 minutes.
If your billing is based on the R4HA MSU, there are tools available in the
market, such as BMC Compuware 'ThruPut Manager', to control your R4HA
MSU peak, provided you are ready to invest in the software.
Example 6
This report tells us every morning how we are doing on our MSU-hours and
what we expect our bill to be for this month.
The vital data here is the forecast value. It is easy to generate this using some
techniques of your own, taking into consideration the historical data for the last
couple of years and information about the current year, such as weekends,
month ends, quarter ends, special processing days (e.g., Black Friday and
Cyber Monday sales, fund-raising activities, special concerts, etc.), and
holidays.
Once we have the forecast, we track it every day. That way, we have enough
time in hand to take proactive action to control the usage, if possible. If it
cannot be controlled, we identify the cause of the surge in MSU-hours and
report it to senior management and Finance well in advance, so that the
organization is ready to meet the cost challenge.
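One possible forecasting sketch along these lines: project the month-end MSU-hour total from month-to-date actuals, weighted by each day's share of the same month last year. The weighting scheme and the numbers are my own illustration, not a prescribed method.

```python
# Sketch: month-end MSU-hour forecast from month-to-date actuals,
# using last year's daily profile (which already reflects weekends and
# month-end peaks) as the weighting.

def month_end_forecast(actuals_mtd, last_year_daily):
    """actuals_mtd: MSU-hours per elapsed day this month.
    last_year_daily: MSU-hours per day for the same month last year."""
    days = len(actuals_mtd)
    share_elapsed = sum(last_year_daily[:days]) / sum(last_year_daily)
    return sum(actuals_mtd) / share_elapsed

last_year = [100, 100, 150, 100, 100, 100, 150, 100, 100, 100]
actuals = [110, 105, 160]  # three days into the month
print(round(month_end_forecast(actuals, last_year)))
```

Tracking this projection against the target each morning gives the early-warning behavior described above.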
Example 7
The following example shows the MIPS utilization of the CICS workload and
compares it against the utilization of the previous day(s). This information is
refreshed every 5 minutes.
(Chart: CICS workload MIPS plotted over the day at 45-minute intervals,
yesterday vs. today.)
Example 8
The following example shows the MIPS utilization of a specific LPAR, LP1, and
compares it against the previous day(s). This information is refreshed every 5
minutes.
(Chart: LP1 MIPS plotted over the day at 40-minute intervals.)
Example 9
The following example shows a simple tracking of critical-path batch
processing during the batch window. The actual tracking may contain a lot
more information.
SMF Record 99 and 113
SMF record type 99 has 14 subtypes. Please refer to the IBM manual 'SMF
SA38-0667-50', page 992, for details on SMF record type 99, from which I
have extracted the following information.
Subtype 1 Contains system level data, the trace of SRM actions, and data about
resource groups. The SRM actions are recorded in trace codes. All trace codes
are described in z/OS MVS Programming: Workload Management Services. A
subtype 1 record is written every policy interval.
Subtype 2 Contains data for service classes. A subtype 2 record is written every
policy interval for each service class if any period in the service class had recent
activity.
Subtype 3 Contains service class period plot data. A subtype 3 record is written
every policy interval for each service class if any period in the service class had
recent activity and plot data.
Subtype 4 Contains information about a device cluster. A device cluster is a set
of service classes that compete to use the same non-paging DASD devices. A
subtype 4 record is written every policy interval for each device cluster in the
system.
Record type 113
Please refer to the IBM manual 'SMF SA38-0667-50', page 1108, for details on
SMF record type 113.
The system writes record type 113 to record hardware capacity, reporting, and
statistics for IBM System z10 or later machines. With the Extended Monitor
Facility released with the IBM z196, the SMF record can also be used to
capture software events. We have to configure the collection of this data using
Hardware Instrumentation Services (HIS), for which you can find more details
at the following link.
https://www.vm.ibm.com/perf/tips/cpumf.html
The SMF 113 measurements are designed to provide insight into the
movement of data and instructions among the processor caches and memory.
These measurements are invaluable for quantifying the net effect that
processor cache usage has on the MIPS capacity of a processor. The SMF
113 measurements have become the basis for IBM's LSPRs for processor
sizing.
We can generate reports on cycles per instruction (or actual MIPS per
processor), Level 1 miss percentage, and Relative Nest Intensity, calculated
from the data in the SMF 113 records.
Cycles Per Instruction (CPI) - The number of cycles divided by the number of
instructions executed. An indication of how fast the CPU is running the work. It is
influenced by factors such as instruction stream complexity and cache and
memory access (RNI).
MIPS - Millions of instructions per second. The rate at which instructions were
being executed when the CPU was processing. This is effectively the inverse of
the CPI, so they show the same information. The MIPS value increases when the
work is being processed faster, so the MIPS display may be more intuitive than
CPI.
This MIPS measurement only has a loose relationship with the MIPS rating of the
processor. A wide variety of workloads is used to come up with a MIPS rating.
The typical variation in this MIPS measurement is an illustration of how much it is
influenced by workload and why care needs to be taken when comparing MIPS
ratings.
Level 1 Miss per 100 Instructions (L1MP) - The number of times data or
instructions were not found in level 1 cache. Fetches from lower cache levels are
slower, so in general as the L1MP rises, the work runs more slowly.
Relative Nest Intensity (RNI) - An indication of the relative usage of the memory
hierarchy, using formulas published by IBM for the specific processor type.
LSPR Workload - The LSPR workload hint over time, derived from the L1MP
and RNI using the table published by IBM. It can be Low, Average, or High.
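A minimal sketch of how CPI and L1MP are derived from raw counter values of the kind carried in SMF type 113. The field names and numbers are illustrative; the RNI formula is intentionally omitted, because its coefficients are machine-specific and published by IBM per processor type.

```python
# Sketch: CPI and L1MP from raw CPU MF counter values, following the
# definitions in the text. Inputs are illustrative, not real record fields.

def cpu_mf_metrics(cycles: float, instructions: float, l1_misses: float):
    cpi = cycles / instructions               # cycles per instruction
    l1mp = 100.0 * l1_misses / instructions   # L1 misses per 100 instructions
    return cpi, l1mp

cpi, l1mp = cpu_mf_metrics(cycles=1.8e9, instructions=1.0e9, l1_misses=5.0e7)
print(cpi, l1mp)
```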
• You can gain a deeper understanding of how your workloads are utilizing
the hardware
So, I suggest:
• Examine the data to better understand how your workload is using your
hardware
The following two examples use data taken from SMF record types 99, 113,
and 70. As you can see, this simple data provides a wealth of information for
the CPO team. You can explore generating more data based on your needs.
You may refer to the following links, which will provide more technical details.
https://www.vm.ibm.com/perf/tips/burg-z16.pdf
https://share.confex.com/share/117/webprogram/Handout/Session9592/Peter.Enrico.zOS.SMF113.and.CPU.Counters.pdf
Example 10
The following example shows the hardware topology for one of the LPARs. This
information is refreshed every 5 minutes.
Proc: Act: Instructions CPI MIPS Drawer Cluster Chip Engine Core Type
14 33.02% 285,083,929,266 1.8 2,888.88 1 3 2 ZIIP 07 VH
15 19.27% 203,477,368,678 1.47 3,537.41 1 3 2 ZIIP 07 VH
16 14.71% 194,095,402,091 1.18 4,406.77 1 3 2 ZIIP 08 VH
17 10.32% 124,978,949,566 1.28 4,062.50 1 3 2 ZIIP 08 VH
18 4.75% 64,029,429,706 1.15 4,521.73 1 3 2 ZIIP 09 VM
19 3.40% 41,997,632,806 1.26 4,126.98 1 3 2 ZIIP 09 VM
Example 11
From the data presented in Example 10, this is a deep dive into GP processor
04 in LP1. You can find more granular data at the LP level. I am not telling you
to keep analyzing this data regularly, but it may help you in doing performance
tuning at a more granular level.
Processor 04 at 12:45
Processor report:
Processor topology:
Drawer: 01 Cluster: 03 Chip: 02
Example 12
This example describes an automatic email triggered to report an anomaly
that needs immediate action.
From: RTP-Automation
Hello Team,
Note: This mail is generated by the automation and please do not reply to this
mail.
This utilization could be normal or abnormal. If you find it normal, ignore it. If it
is abnormal, you need to perform an analysis immediately.
This automatic email helps reduce CPU wastage. In its absence, a task with a
problem could keep consuming CPU for hours and go unnoticed. The email
gives the teams an opportunity for a quick review; if the behavior is found to be
abnormal, appropriate action can be taken to fix it ASAP, thereby controlling
the wastage.
Systems Configuration
Manage resource configuration
• CEC Name
• LPAR names
7. Naming convention
• CEC Naming
• LPAR naming – All LPARs (zOS, zVM, zLinux, Internal CF, External CF)
(Diagram: LPAR naming convention across the primary and secondary data
centres, covering the production sysplexes (LPC/LPD/LPR/LPN LPARs), Test
Sysplex 1 (LT... LPARs), the sysprog LPARs (LS...), network LPARs (LN...),
zVM LPARs (V...), and external CFs (ECF1-ECF3), with legends
distinguishing normal, availability, and disaster recovery LPARs.)
CPO Health Check
I strongly recommend generating a system health check report every day to
make sure that there is no accidental change to the system configuration or
performance due to a manual change, a system IPL, or any change generally
done at night or in the early morning. The health check metrics may include
the following, and the list can be customized to your needs.
I have compiled the list as follows.
f. CPU per transaction will vary 3-5% for every 10% change in CPU utilization
g. A job run in a larger LPAR will take less CPU time than when run in a
smaller LPAR
h. Uncaptured CPU time. Uncaptured CPU time is CPU time that the LPAR
consumed but which is not charged back to any address space or enclave.
There are many causes of uncaptured CPU time; common causes are as
follows:
• Full pre-emption
• Interrupts
• SLIP processing
m. Changed CF Async requests (CHNGD) should not be > 10% of all requests
p. Lock structure real lock contention should not be > 2% of the total CF
requests for the structure
q. Lock structure false lock contention should not be > 1% of the total CF
requests for the structure
CPO Process Document – a sample
For effective CPO management, I strongly recommend creating a CPO
process document following the installation standards. I review and update
this document yearly. However, I make ad hoc updates based on feedback
from audits, changes to the organization's process document standards, and
any major changes to my CPO process.
The process document should have a proper name approved by your
organization's process team.
You must add a 'Change History' table containing the date, version number, a
short description of the changes in that version, the owner of the document,
and the approver.
Introduction
performance, and optimization for infrastructure (systems, applications, and
users) under the mainframe z/OS operating system. Unless otherwise stated,
all descriptions in this document refer to mainframe hardware, software, and
capabilities under mainframe z/OS only.)
Describe the definitions, if any, you would like to describe in the document.
(Example:
• Optimization management is ….
Objective
Mainframe environments
CPO activities
(Example: The CPO team performs all activities related to mainframe capacity,
performance, and optimization under z/OS. The following sections describe
various capacity and performance related functions and activities…..)
Capacity and performance data collection
(Example: The required SMF data is extracted, and the utilization statistics are
processed outside the mainframe by the xxxx tool. The output from the xxxx
tool is the golden source for the CPO team to perform most of the CPO
activities. However, on a need basis, some ad hoc data processing is
performed on the mainframe using SAS programs…..)
• Tool 1….
• Tool 2….
Capacity provisioning
(Example: Based on the capacity forecast, the CPO team makes sure that the
required capacity is available to meet the demand.)
Resource capacity considered under capacity forecast
(Example: The following resources are considered for forecasting. In general,
the forecast is performed for GP CPU, as this is the most variable resource
component. However, the other components are reviewed from time to time,
mostly on demand, for example during a major hardware/software upgrade or
major project implementation, and to address performance issues.
• GP CPU
• zIIP CPU
• other resources)
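One simple way to project the GP CPU component is a least-squares trend over recent monthly usage. This is only an illustrative sketch under my own assumptions (monthly MSU figures as input, a straight-line trend); a real forecast would also weigh seasonality, step growth from projects, and business input.

```python
def linear_forecast(history, months_ahead):
    """Project MSU usage months_ahead past the last observation.

    history: monthly GP CPU usage figures (e.g. peak MSU), oldest
    first. Fits y = intercept + slope * month_index by least squares.
    """
    n = len(history)
    mean_x = (n - 1) / 2.0           # mean of indices 0..n-1
    mean_y = sum(history) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    den = sum((x - mean_x) ** 2 for x in range(n))
    slope = num / den
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + months_ahead)
```

For example, a history growing steadily by 10 MSU a month projects 20 MSU above the last observation two months out.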
Performance management
o Tool1
o Tool2
o Omegamon
• Performance analysis
Capacity and performance analysis
A CPU usage trend analysis is also performed periodically to determine the step growth and generate a monthly capacity report. The following drill-down CPU usage analysis is performed to identify the major contributors.
• CICS transactions)
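A drill-down of this kind usually reduces to summing CPU by some dimension (LPAR, job, CICS transaction) and ranking the totals. The sketch below assumes hypothetical (name, cpu_seconds) samples; it is not tied to any particular reporting tool.

```python
from collections import Counter

def top_contributors(samples, n=5):
    """Rank contributors by total CPU from (name, cpu_seconds) pairs."""
    totals = Counter()
    for name, cpu_seconds in samples:
        totals[name] += cpu_seconds
    return totals.most_common(n)  # list of (name, total), largest first
```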
(Example: The objective of this activity is to track and take all possible actions to keep the YTY BAU CPU capacity growth within x%.
The average hourly CPU usage is cumulated over the entire month (including weekends and holidays) to determine the average CPU usage per day in the month. The difference between the daily average usage in the current year and in the last year represents the growth (positive or negative).
It is often a challenge to separate the actual BAU growth from the incremental demand due to new applications, functional changes, and optimization. But all possible observations are put together during the monthly analysis to justify the growth, and this is presented in the monthly report.
Track and manage CPU resources (MSU-hour)
(Example: The objective of this activity is to track the MSU-hour usage against the target for the year and take various actions to keep the yearly average usage below the target.
• Monitoring and alerting on heavy usage and looping, and action to stop it)
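Tracking against a yearly target amounts to comparing the running average with the target and flagging a projected overshoot. The sketch below assumes monthly MSU-hour totals as input; the names and the idea of a simple run-rate projection are mine, not a specific tool's behaviour.

```python
def check_against_target(monthly_msu_hours, yearly_target):
    """Return the average monthly MSU-hour usage so far and whether,
    at this run rate, the yearly average would end above the target."""
    average_so_far = sum(monthly_msu_hours) / len(monthly_msu_hours)
    return average_so_far, average_so_far > yearly_target
```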
Optimization
Capacity tools
(Example: The following capacity tools are used to manage capacity and performance on the mainframe. A few of these tools are owned by the CPO team.
• Tool1
• Tool2
• Omegamon
• Others…..)
(Example: The capacity and performance data is processed, and the following
reports are published.
• Report1 - Process and publish resource utilization reports
• Management reports
(Example: The CPO team handles all capacity- and performance-related issues.
• Proactive
• Reactive
o Emergency calls
o RCA
o Participate as required)
Projects and initiatives
(Example: The CPO team estimates the capacity requirements for projects and special initiatives on demand. The following activities are performed.
• Capacity estimation
• Performance evaluation
• Special initiatives
(Example: The CPO team gets involved in the following operational activities on demand.
• Varying LPs online/offline
• Preparing capacity allocation changes (CPU weight, memory, LPs) and recommending them to the mainframe SMEs for implementation
Conclusion
If you ask me a question on CPO or want me to work on something, the first question that jumps to my mind is: is this request ‘Generic’ or ‘Specific’? If it is specific, then I try to understand the context and develop a plan and approach to perform the task accordingly.
Sometimes the task is simple, and I am able to deliver the result based on my knowledge and past experience. But, when required, I do not hesitate to look for additional information in IBM manuals, Redbooks, and articles published on the internet, and to read and research. Most importantly, I look at the latest information related to the z/OS version, the product and its APARs, and the hardware configuration. Getting hold of the correct information, and the speed at which you access it, determine your capability to perform the task in time and deliver the most accurate result.
Working in CPO has always been fascinating and interesting, and at the same time very challenging. We have used many innovative ways to approach an issue, an analysis, or a query from the SMT, business, and user community. Once you start working on a CPO query or issue, you will start thinking from many points of view, and if you keep working patiently, you will definitely be able to dig down to the root cause. And undoubtedly, you will learn something new from each task.
In our professional career, unless we are in the field of research and development doing new innovations, in more than 99% of cases we do what someone else has already done elsewhere or knows how to do. So, my philosophy is: if someone else is able to do it, then I will be able to do it. And if I am able to do it, you will also be able to do it. You just need to build the confidence to perform all the tasks in front of you with speed and accuracy, and to develop ways to do them with the least possible effort.