Mainframe zOS CPO

(Capacity, Performance and Optimization)


Hints and Tips

Natabar Sahoo
natabarss@yahoo.com

V01
April 2023

(This book is dedicated to my friend and younger brother, the late Harish Malu.
He was one of the smartest and most brilliant technicians I have ever worked
with, and he lost his life to Covid-19.)
Disclaimer
All the details described in this book are the author's personal opinions and
experiences. IBM manuals, Redbooks, and many other presentations and articles
on the internet have been my sources of information. I have included some
extracts from IBM manuals and other sources for easy reference, and I have tried
to identify the source from which each piece of information was collected.
Information such as Rules of Thumb (RoT) was compiled from many sources and
may have changed over time. You may use this book as a reference, but you need
to develop your own process, techniques, and approach for performing your
Capacity, Performance and Optimization (CPO) function and, most importantly,
validate everything against the latest information for your mainframe configuration.

Comments and Feedback


I want this book to be useful to the CPO team and other mainframe technical
Subject Matter Experts (SMEs). So, I very much welcome your comments and
feedback on this book. Please do write to me at natabarss@yahoo.com; I look
forward to hearing from you.

Mainframe zOS Capacity, Performance and Optimization (CPO)

1
Acknowledgements
I would sincerely like to thank the following people for taking the time to review
this book and provide their valuable feedback.

My friend and colleague Frank Tidemand Johansen, who is currently a mainframe
architect and has vast mainframe experience. I consider him an expert on
zOS, zVM, zLinux, CPO, mainframe configuration, costing, application
development (REXX and Assembler), and many more areas.

Sushanta Dash, who currently leads the CPO team at Kyndryl, India. He has
very rich experience in database performance tuning.

Satish Nagarajan, who has rich experience as a zOS systems programmer and
is currently leading the CPO team in a large mainframe installation.

My daughter Parama Sahoo, a doctor with an MBA, who has helped me edit
this book. She has experience in performing capacity planning and post-merger
systems integration in hospitals.

Contents

Disclaimer ............................................................................................................ 1
Comments and Feedback ................................................................................... 1
Acknowledgements ............................................................................................. 2
Introduction .......................................................................................................... 8
Capacity, Performance and Optimization (CPO) function – strategic view ........ 13
CPO as a culture in the organization ............................................................. 13
Existence of mainframe zOS CPO team ........................................................ 14
Direction from Senior Management Team (SMT) .......................................... 14
Data analytics - tools for data analysis........................................................... 15
Data analysis automation ............................................................................... 15
Real time performance monitoring ................................................................. 15
Service Level Agreement (SLA) ..................................................................... 16
Few basics and terminologies ........................................................................... 17
zOS CPO ....................................................................................................... 17
Understanding mainframe MIPS & CPU sec – Layman view ......................... 18
MSU-hour ....................................................................................................... 18
Rolling 4 Hour Average (R4HA) MSU – A short description ........................... 19
Introduction ................................................................................................. 19
MIPS ........................................................................................................... 20
CPU Second ............................................................................................... 20
Service Unit ................................................................................................ 20
MSU (Million Service Units per hour).......................................................... 21
Understanding MSU (Hardware and Software) .......................................... 21
Computation of Rolling 4 Hour Average (R4HA) MSU ............................... 22
An example of R4HA computation for one LPAR ....................................... 23
Mainframe Processors ................................................................................... 24
Mainframe Applications .................................................................................. 24
The Program and MIPS .............................................................................. 24

Creating Optimized and Efficient Programs ............................................... 25
Application optimization – An example of DD DUMMY ............................... 27
HiperDispatch ................................................................................................ 28
Relative Nest Intensity (RNI) .......................................................................... 34
LSPR workload categories ......................................................................... 34
Hints and Tips ................................................................................................... 40
Cost - The biggest challenge and my simple solution .................................... 40
Manage hardware ....................................................................................... 41
Optimize Software suite .............................................................................. 42
Get maximum CPU cycles out of the hardware .......................................... 42
Reduce CPU wastage ................................................................................ 43
Optimize, Optimize, and Optimize .............................................................. 44
Nothing is right or wrong ............................................................................. 46
CPO Team ..................................................................................................... 47
CPO Objective ............................................................................................ 47
CPO Benefits .............................................................................................. 48
Capacity Management ................................................................................... 49
Short-term capacity forecast ....................................................................... 50
Long-term forecast ...................................................................................... 50
Ad hoc forecast ........................................................................................... 50
Capacity provisioning ................................................................................. 50
Resources covered under capacity plan..................................................... 51
Capacity Reporting ..................................................................................... 52
Performance Management ............................................................................ 52
Capacity Optimization .................................................................................... 53
Objective ..................................................................................................... 54
Optimization Tasks ..................................................................................... 54
Workload..................................................................................................... 54
Input ............................................................................................................ 55
Savings ....................................................................................................... 55

Optimization outcome ................................................................................. 55
Savings calculation example ...................................................................... 56
Computation of cost .................................................................................... 56
CPO data analysis – A systematic approach .................................................. 58
Workload utilization bucketing .................................................................... 58
Workload classification for easy Chargeback ............................................. 59
Metrics for CPO analysis ............................................................................ 59
CEC CPU Analysis ..................................................................................... 61
LPAR CPU analysis .................................................................................... 61
Workload CPU Analysis.............................................................................. 62
Address space CPU Analysis ..................................................................... 62
Analysis approach ...................................................................................... 63
Tricks of performing data analysis .............................................................. 63
Major analysis outcome .............................................................................. 68
Tools and data source ....................................................................................... 69
SMF................................................................................................................ 69
What is SMF ............................................................................................... 69
Major Uses .................................................................................................. 69
Who Needs the Data .................................................................................. 70
Reporting .................................................................................................... 71
Sources of CPU Information in SMF ........................................................... 71
Data Collection ........................................................................................... 72
Presentation................................................................................................ 72
Tools and products processing SMF data .................................................. 73
RMF Data and SMF Records ..................................................................... 75
RMF ............................................................................................................... 76
WLM ............................................................................................................... 77
Important Source of Information for CPO....................................................... 82
Post processor ‘Summary’ Report .............................................................. 82
RMFMON III ‘CPC’ Panel ........................................................................... 84

RMFMON III ‘SYSINFO’ Panel ................................................................... 85
RMFMON III ‘PROCU’ Panel ...................................................................... 86
Picture source: RMF Report Analysis SC34-2665-50 page 148 ....................... 86
SDSF ‘JM’ line command ........................................................................... 87
Postprocessor ‘WLM Activity’ report ........................................................... 88
RMFMON III Enclave panel ........................................................................ 89
Others ......................................................................................................... 89
Create my own database for critical data ................................................... 90
Capacity monitoring ........................................................................................... 93
Real Time Performance (RTP)....................................................................... 93
Data in Real Time ....................................................................................... 93
SMF Record 99 and 113 ........................................................................... 103
Systems Configuration .................................................................................... 110
Manage resource configuration ................................................................... 110
CPO Health Check ....................................................................................... 112
Rule of Thumbs (RoT).................................................................................. 112
CPO Process Document – a sample ................................................................ 115
Name, contents, and change history............................................................ 115
Introduction .................................................................................................. 115
Definitions and terminologies ....................................................................... 116
Objective ...................................................................................................... 116
Mainframe environments ............................................................................. 116
CPO activities .............................................................................................. 116
Capacity and performance data collection ................................................... 117
Capacity and performance data processing ................................................ 117
Capacity and performance monitoring tools ................................................ 117
Capacity forecast and provisioning .............................................................. 117
Capacity provisioning ................................................................................... 117
Resource capacity considered under capacity forecast ............................... 118
Performance management .......................................................................... 118

Capacity and performance analysis ............................................................. 119
Track and manage CPU capacity growth..................................................... 119
Track and manage CPU resources (MSU-hour) .......................................... 120
Optimization ................................................................................................. 121
Capacity tools............................................................................................... 121
Capacity and performance reporting ............................................................ 121
Incidents and problems ................................................................................ 122
Projects and initiatives ................................................................................. 123
Operational activities – on demand............................................................... 123
Conclusion ....................................................................................................... 124

Introduction
In this book, I have documented some 'Hints and Tips' on 'Mainframe zOS
Capacity, Performance and Optimization (CPO)' based on personal experience
gained through working in this area for a long time. I will not describe many
technical details, as you will find plenty of reading material from different
sources such as IBM Manuals, IBM Redbooks, non-IBM product manuals, SHARE
presentations, Cheryl Watson's Tuning Letters, and many articles published on
the internet. But real-life experience is rarely shared, and it vanishes after an
expert stops working and loses regular contact and interaction with the community,
friends, and colleagues. Also, knowledge not used regularly does not get
refreshed with the most current information, and it very often fades away over
time as life's priorities change.

Another aspect which inspired me to write this book is 'Knowledge Sharing'. In
my view, 'knowledge gained is an asset' to the individual and to the industry where
we work - 'knowledge not shared' is 'knowledge lost'. I very strongly believe in
sharing my knowledge: when you share your knowledge, you identify the gaps
in your understanding and enrich your knowledge further. So, whenever I had
some interesting experience, I would document it. I also used these
opportunities to call my team members and share the findings with them, so that
they knew what to do if they faced such problems in the future rather than
spending hours solving the same issues. Thus, very consciously, I developed a
habit and personal discipline of mentoring and building the skillset in my team.
These days, skill building, mentoring, and knowledge sharing have taken
a back seat. This could be due to:

• Understaffing

• Large workload and lack of time

• No inspiration or motivation to do so

• Lack of ownership

• Fear of sharing knowledge because someone may take away your job

I do not blame anybody for this. This is the general culture in the industry now.
However, when in an organization, I always consider two major aspects of my

work: the first and most imperative is that I work for myself, and the second is that
I work for an organization. And so, I have responsibilities towards both these
aspects. When I work for myself, I keep my career growth in my mind and
someday I may leave this organization and join another. However, I have a
specific role in this organization and the organization should not suffer when I
leave. So, it should be a part of my personal responsibilities to document what I
do and most importantly consciously build my successor from the first day that I
am in the organization. Even if you do not leave the organization, one day or
another you will stop working (so called ‘retirement’ from your job) and at that
time, the organization should be able to run seamlessly. Of course, a person
contributes to the organization in multiple ways and it is virtually impossible to
clone all of these contributions (especially the human factors), but at least the
normal activities should not have a gap.

Documenting these experiences is not an easy task, as the experience gained over
many years of work is enormous and would fill many books. The challenge lies
in knowing what to document and what not to. So, I call this book 'Hints and
Tips' as it is very generic and can be taken as a guideline and reference to work
with and build on. Different professionals work differently, using different
techniques on the same issue. Also, the same professional can use different
techniques on the same issue in different contexts. Therefore, you have to build
your own style of working, keeping speed, efficiency, and accuracy in view, while
developing the skill of multi-tasking.

This book may not describe information that is completely new to you. However,
this is a compilation of my experiences that I believe will be essential to anyone
in these roles. I have practically applied these experiences, and no matter how
large or small the benefits may be — to me, all of them are important and
integrated together to meet the larger goal of the CPO function.

I have worked in the IT field for 40 years, including 33 years as a zOS systems
programmer. In my mainframe career, I have worked on CPO for nearly 14 years.
When I started working on the mainframe in 1990, it was a 3090-120S machine with
a simple configuration: one processor of ~10 MIPS, 32MB of central storage, and
25GB of DASD storage. At that time, I spent quite a bit of my time tuning the
LOGON proc and working with application developers to tune DASD storage
space allocation. Thus, CPO activity started at the very beginning of
my mainframe career and has remained a focus throughout. I very genuinely

love CPO and always enjoy exploring and finding opportunities to optimize
anything and everything that comes to my mind.

In mainframe zOS, you will mostly hear about 'Capacity and Performance
Management (CPM)'. In most organizations, CPM is performed either by a
dedicated CPM team or as an integrated function of the zOS systems
programmers. However, I have added 'Optimization' to this function because it
is an integral part of CPM. With this, I change the function name from 'CPM' to
'CPO'. In general, optimization receives the least focus due to a lack of skill,
expertise, time, and priority. But I suggest that optimization should be
considered an integral function of the mainframe infrastructure, and I am sure
that the mainframe organization will realize real benefits from it. In my
experience, we dedicated two people to performing optimization, and the cost of
these two resources was self-financed by the $$ savings the organization
realized through the optimization they performed. Most importantly, we delayed
or avoided planned upgrades needed to support normal year-to-year growth.

One thing has always surprised me, and I have never found a realistic answer
to it: in 1990, if a single-processor machine with 10 MIPS was able to run
MVS/ESA V3/V4 with 100+ developers concurrently developing CICS/DB2
applications - then why do I need a minimum of 50 to 70 MIPS today just to keep
the operating system active without any workload? What also surprises me is that,
if an organization had 'M' MIPS, how has the demand grown to '1.nM' in just
5 years when the growth in business has been relatively minimal?
In my experience, the increase in CPU demand is everywhere, owing to:

• New zOS and product versions

• Modernization of application code

• Usage of new languages such as Java

• Implementation of web applications

• More and more encryptions

• More and more automation

• Data Warehousing

• Development of complex DB2 SQLs

• And so on.

Whenever we talk about the mainframe, the one obvious thing that comes to the
forefront is 'Cost'. We can debate the 'cost' of the mainframe for hours without
reaching a conclusion. However, I undoubtedly want to stress the cost we pay
to IBM and other vendors (Green $$) and the cost charged back to business entities
inside the organization (Blue $$).
A large portion of mainframe cost goes towards the IBM Monthly License Charge
(MLC). IBM has different techniques for charging customers. The newest
methodology is 'Tailored Fit Pricing' (TFP), which is based on CPU
consumption measured in MSU-hours burnt over a month. Therefore, most of my
'Hints and Tips' will be directed towards saving CPU consumption in order to help
the organization run and manage mainframe zOS cost-effectively. My focus will
be on 'zOS' only, though the approach can be extended to other areas.
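As a rough illustration of how consumption-based charges accumulate, the following sketch sums interval-average MSU readings into MSU-hours, the unit TFP-style pricing is measured in. All figures (the hourly profile, the 30-day extrapolation, the $ rate) are hypothetical and this is not IBM's actual billing formula:

```python
# Illustrative only: under consumption-based pricing, MSU-hours are the
# area under the MSU-utilization curve over the month.

def msu_hours(intervals, interval_minutes=15):
    """Sum interval-average MSU readings into MSU-hours."""
    return sum(intervals) * (interval_minutes / 60.0)

# One day of hourly average-MSU readings (hypothetical LPAR profile)
hourly_msu = [120, 115, 110, 108, 105, 110, 140, 180,
              220, 240, 250, 245, 240, 235, 230, 225,
              210, 190, 170, 160, 150, 140, 130, 125]

day_total = msu_hours(hourly_msu, interval_minutes=60)
month_estimate = day_total * 30            # crude monthly extrapolation
cost_estimate = month_estimate * 2.0       # hypothetical $ per MSU-hour

print(f"MSU-hours for the day: {day_total:.0f}")
print(f"Monthly estimate:      {month_estimate:.0f} MSU-hours")
```

The point of the arithmetic is simply that every CPU cycle saved in any hour reduces the monthly MSU-hour total directly, which is why the optimization hints that follow translate into Green $$ savings.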

Under CPO, we have become mostly reactive as opposed to proactive - be it
Capacity, Performance, or Optimization. For example:

• We do prepare a yearly capacity plan to meet the organizational
requirement but forget about it till the next year. We forecast, but then
hardly keep track of the actual capacity demand and take proactive actions
to control it. Rather, we work on it only when we have capacity issues.

• We do performance management, but again mostly reactively, and only
when we hit performance issues. Very rarely do we deep dive to find the
root cause of the issues we face and fix them once and for all. Also, we
rarely put any focus on, or create an approach for, performing systematic
performance management proactively.

• We mostly ignore optimization. As a result, we do not contribute much
towards cost reduction by performing resource optimization, although we
have plenty of opportunities to do so. For example, if our charge is based
on R4HA (Rolling 4-Hour Average) MSU, then we learn of an anomaly
only the day after we process the SMF (System Management Facilities)
data, or at the end of the month after the SCRT report is generated. But if
we could somehow track the R4HA MSU in real time and
anticipate/foresee/predict an anomaly, we could take various
actions to control it, because we have at least one to two hours of time to
act on it.
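The rolling average itself is cheap to maintain as each interval's data arrives, which is what makes real-time R4HA tracking feasible. A minimal sketch (figures hypothetical; in practice the system derives R4HA from SMF type 70 interval data) that keeps the mean of the last 48 five-minute intervals:

```python
# Sketch: near-real-time tracking of a Rolling 4-Hour Average (R4HA) MSU.
# 4 hours of 5-minute intervals = 48 readings; the deque drops the oldest
# reading automatically as each new one arrives.

from collections import deque

class R4HATracker:
    def __init__(self, intervals_per_4h=48):
        self.window = deque(maxlen=intervals_per_4h)

    def add(self, avg_msu):
        """Record one interval's average MSU and return the current R4HA."""
        self.window.append(avg_msu)
        return sum(self.window) / len(self.window)

tracker = R4HATracker()
r4ha = 0.0
for reading in [200] * 24 + [400] * 24:   # 2h at 200 MSU, then 2h at 400
    r4ha = tracker.add(reading)

print(f"R4HA after the spike: {r4ha:.0f} MSU")   # prints 300
```

Note how the 4-hour window smooths the spike: two hours at double the MSU only lifts the R4HA halfway, which is exactly the one-to-two-hour reaction window mentioned above.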

To switch from reactive to proactive, we need to prepare a systematic approach.
You can establish multiple techniques towards this. One of the most important
experiences I have gained is 'Real Time Performance (RTP)' monitoring and
management. This has extensively helped us to proactively manage many
performance issues before they result in a real problem. Although RTP may not
seem easy to implement initially, it is also not the most difficult task; with the
investment of some effort, RTP monitoring can be set up. And I assure you that
the result gained far outweighs the effort invested. We will discuss RTP further
in one of the topics below.

Capacity, Performance and Optimization
(CPO) function – strategic view
As I have observed, CPO was an area of least focus for many organizations until
the early 2000s. Since then, most organizations have implemented this function
by creating a dedicated team or delegating it to the respective SMEs. Current
CPO teams mostly focus on resolving issues related to capacity and performance,
and on performing capacity planning. However, they very rarely deep dive to find
the root causes of capacity and performance issues and work towards a
sustainable solution that helps the organization run its mainframe installations
effectively and efficiently. Moreover, the initiative to focus on 'Optimization' is
almost zero.

In my experience, mainframe installations have grown relatively large, and CPO
plays a central role in the mainframe organization. The CPO function significantly
improves mainframe performance and saves mainframe costs by reducing CPU
consumption (MSU-hours). Any action that reduces overall physical capacity,
improves system performance, optimizes application code, or optimizes system
and application configuration has a direct impact on reducing the CPU cycles
needed to do the same amount of work. To achieve this, my simple philosophy is
to optimize anything and everything, reduce CPU wastage, and reuse the
reclaimed CPU cycles to support day-to-day BAU (Business As Usual) growth.

Therefore, I propose a strong CPO function in all mainframe organizations. In my
view, all of the following CPO recommendations must be considered to gain the
maximum benefit from mainframe zOS.

CPO as a culture in the organization

CPO is not only the responsibility of the zOS technical infrastructure teams,
such as the zOS systems programmers or the CPO team. It is a shared responsibility
of all teams under the mainframe organization, including the senior management
team (SMT), the mainframe development community, zOS infrastructure subject
matter experts (SMEs), batch planners, and the zOS operations team (Console
and Batch). Most importantly, the business community must collaborate by
providing their mainframe infrastructure requirements well in advance (wherever

possible) and challenge the mainframe infrastructure and development teams to
deliver the best optimized solution to meet their demands.

Existence of mainframe zOS CPO team

There should be a dedicated CPO team directly under the zOS infrastructure. The
main responsibility of this team will be to analyse zOS resource utilization
(primarily CPU) data and provide recommendations for 'Capacity', 'Performance',
and 'Optimization'. Execution and deployment of the recommendations take an
enormous amount of time, so this should be the responsibility of the infrastructure
SMEs (Subject Matter Experts), planners, the operations team, and the
development community. Assuming this decentralized approach is accepted, I
recommend a maximum of two to three experienced SMEs in the CPO team with
strong analytical and presentation skills. However, the roles and responsibilities
of the CPO function need to be clearly documented, agreed, and approved within
the mainframe organization.

Direction from Senior Management Team (SMT)

The senior management team in the organization should have a strong
commitment and provide direction to integrate CPO as a culture in the
organization and deploy a well-developed process surrounding it.

There are three critical steps that the SMT should employ to integrate CPO as a
culture in the organization:

1. Establish a strong communication process through regular interactions with
the development community, the zOS infrastructure team, and the
business entities

2. Define a clear goal to develop and deploy zero-defect, optimized code

3. Provide direction to the mainframe community (the zOS infrastructure team,
development teams, and business entities) to collaborate strongly and
integrate CPO activities into their regular work

These steps will allow the SMT to show a strong commitment and provide
direction to the CPO function within the organization.

Data analytics - tools for data analysis

To make CPO resource utilization analysis effective, we need to make the
most of the information in the SMF data. But getting the required
information out of SMF records is not easy. There should be tools to process the
SMF data and make it handy for analysis. These tools can either be developed
in-house, or you can choose one available in the market. The right data should be
available at the right time to perform the right analysis and take the right decision.

Data analysis automation

Quick analysis, execution, and deployment of various CPO recommendations is
key to the effective, efficient, and optimized operation of mainframe zOS. I
very strongly recommend innovating and deploying full or partial automation
everywhere possible. Some examples are: (1) implementing step-growth
analysis to identify CPU utilization anomalies, (2) using debugging tools during
code development to identify and fix possible code defects, and (3) making use
of system automation products to deploy CPO recommendations, where
applicable.
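To make example (1) concrete, step-growth analysis boils down to comparing each workload's recent average CPU against its own prior baseline and flagging sustained upward steps. A minimal sketch, where the workload names, figures, and the 15% threshold are all hypothetical:

```python
# Sketch of step-growth analysis: flag workloads whose recent mean CPU
# has stepped up against a prior baseline period by more than a threshold.

def step_growth(baseline, recent, threshold=0.15):
    """Return {workload: fractional growth} where growth exceeds threshold."""
    flagged = {}
    for name, base_samples in baseline.items():
        base = sum(base_samples) / len(base_samples)
        curr = sum(recent[name]) / len(recent[name])
        growth = (curr - base) / base
        if growth > threshold:
            flagged[name] = growth
    return flagged

# Daily CPU seconds per workload for two consecutive weeks (hypothetical)
prior_week = {"BATCHPRD": [500, 510, 495], "CICSPRD": [800, 790, 810]}
this_week  = {"BATCHPRD": [505, 498, 502], "CICSPRD": [980, 1000, 990]}

print(step_growth(prior_week, this_week))   # flags CICSPRD (~24% step up)
```

In practice the baselines would be built from the SMF-derived utilization history, and a flagged workload becomes an input to the deeper CPO analysis rather than an automatic action.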

Real time performance monitoring

‘Real Time Performance (RTP)’ monitoring is very necessary to stay on top of
CPO activities, but most organizations have not exploited it enough. The main
reason is a lack of focus, skill, and expertise to design, develop, and deploy it.
However, with a little effort, this is very achievable.

The interfaces available for real time performance monitoring and analysis are
very limited. Using IBM products like SDSF, RMFMON II, RMFMON III,
OMEGAMON or any other non-IBM products, the navigation and correlation of
information is not so user-friendly. To find simple information to address an issue,
you may have to navigate through many screens under multiple products.
Furthermore, without the availability of historical data, it is difficult to infer whether
the behavior is normal or abnormal. So, we mostly depend on the post processor
data for in-depth analysis. However, if we can innovate some way to collect the
required data for our real time monitoring, compile it centrally, and present
it in a simple way so anomalies are easy to detect, then we will be able to detect
a lot of issues proactively. I will write on RTP in more detail in one of the chapters
below.

Service Level Agreement (SLA)

In many organizations, due to the assumption that support will only be needed by
internal customers - there are often no SLAs, or very few SLAs which are often
ignored during workload processing. However, even if your only customers are
internal - it is important to establish a detailed SLA at various levels. These
indicators are critical to gauge/measure all functions under mainframe
infrastructure and will help you to provide support at the agreed service levels. For
example:

• Response time goal of critical CICS online transactions

• Timeline of critical batch jobs execution

• Acceptable delays in starting a critical batch job

• Batch window (start to end of business batch)

• Print files availability to business units

• Agreement with external entities (e.g. MQ servers managed outside the
organization)

SLA management provides a great number of inputs to the CPO team for capacity
planning, performance management, and to create optimization initiatives.

Few basics and terminologies
zOS CPO

In my view, ‘Capacity’, ‘Performance’ and ‘Optimization’ are distinct
functions with strong interdependencies. Therefore, I consider these three
functions to be logically combined under a single function, called CPO, that needs
to be managed by a dedicated team with a delegation of roles and responsibilities
in the organization.

For clarity, I define the functions as follows:

• Capacity management is the process of planning, estimating, and
provisioning sufficient resource capacity in a cost-effective manner to meet
the current and future service needs for all applications and users hosted
under z/OS.

• Performance management is the process to make the best use of the
current resources to meet the stated objectives and Service Level
Agreements (SLA).

• Optimization is the process of understanding what is going on in the
systems, identifying bottlenecks and processing inefficiencies through
resource utilization analysis and real time monitoring, and defining
objectives to reduce resource usage and save costs.

In my experience, Capacity Management is a science: you follow scientific
techniques to forecast and provision the required capacity. Performance
Management and Optimization, on the other hand, are an art: you develop your
own techniques to analyse data and metrics, generate a list of performance
and optimization opportunities, implement them, and then measure and report
the savings.

Understanding mainframe MIPS & CPU sec – Layman view

What do we really mean by mainframe MIPS? MIPS is an acronym for “Millions of
Instructions Per Second”. It is the number (in millions) of instructions that our
mainframe can process in a second of operating time. Under the mainframe
configuration, when our job or transaction consumes one second of CPU time, it
means that the processing has executed approximately xxxx millions of
instructions (xxxx is the per engine MIPS rating depending on the model of the
CEC) which includes the executing program and the service modules such as
zOS, DB2, MQ, CICS etc. When measured, it is generally averaged over a period
of time e.g., 5 min or one hour or one day etc. and thus the computation of MIPS
depends on the time interval. But in an absolute sense, the mainframe executes
xxxx million instructions every second by every online physical processor, when
busy.

Then, what is this CPU second? The CPU second is tied to a processor. For
example, if sysplex1 has 20 GP processors, this gives sysplex1 a total of 20 CPU
seconds in every second of operating time. This is where we do
multiprocessing; that means, if there is a demand, sysplex1 can process 20 tasks
concurrently at any point in time. Aside from this, we have other specialty engines
like zIIP, IFL etc. which also process a similar, or larger, number of instructions
per processor every second by executing the applicable workload.
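The relationship between CPU seconds and consumed MIPS can be sketched as follows (a Python illustration; the per-engine MIPS rating and the measured CPU seconds below are hypothetical numbers, not published values for any CEC model):

```python
def mips_used(cpu_seconds: float, per_engine_mips: float,
              interval_seconds: float) -> float:
    """Average MIPS consumed over a measurement interval.

    A unit of work that burns `cpu_seconds` of CPU time executes roughly
    cpu_seconds * per_engine_mips million instructions; dividing by the
    interval length expresses that as an average rate for the interval.
    """
    return cpu_seconds * per_engine_mips / interval_seconds

# 90 CPU seconds consumed in a 5-minute (300 s) interval on an engine
# assumed (hypothetically) to be rated at 1000 MIPS:
print(mips_used(90, 1000, 300))   # 300.0 average MIPS for the interval
```

This also shows why the computed MIPS figure depends on the measurement interval: the same 90 CPU seconds averaged over one hour would yield a much smaller number.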

MSU-hour

A million service units (MSU) is a measurement of the amount of processing work
a computer can perform in one hour.

IBM Sub-Capacity Reporting Tool (SCRT) uses the following base calculation to
compute MSU consumption per hour:

MSU consumed = (SMF70EDT / 1000000) * 16 * SMF70CPA_scaling_factor /
SMF70CPA_actual

• SMF70EDT is the effective dispatch time for the LPAR. If dedicated or ‘wait
complete’ processors are in use, this value may be adjusted as appropriate

• SMF70CPA_scaling_factor and SMF70CPA_actual are the CPU
adjustment factors based on the model capacity rating of the CPC

Ref: https://www.ibm.com/downloads/cas/YMW2JWP4
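The base calculation above can be coded literally, as in the Python sketch below. The field values in the example are purely hypothetical, chosen to show the arithmetic; real values come from SMF type 70 records, and SMF70EDT is assumed here to be expressed in microseconds:

```python
def msu_consumed_per_hour(smf70edt: float,
                          scaling_factor: float,
                          actual: float) -> float:
    """Literal coding of the SCRT base formula quoted above.

    smf70edt       - effective dispatch time for the LPAR (assumed microseconds)
    scaling_factor - SMF70CPA_scaling_factor CPU adjustment value
    actual         - SMF70CPA_actual CPU adjustment value
    """
    return (smf70edt / 1_000_000) * 16 * scaling_factor / actual

# Hypothetical inputs, purely to demonstrate the formula:
print(msu_consumed_per_hour(1_000_000, 2.0, 1.0))   # 32.0
```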

Rolling 4 Hour Average (R4HA) MSU – A short description

Many organizations still use the R4HA methodology for IBM billing. So, I have
provided a short description to bring some clarity to the R4HA.

Introduction

IBM software charging is arguably unnecessarily complex, viewed by many as
more of an “art form” than a science. Initially IBM, and most other vendors,
licensed software to a specific mainframe model and its “full capacity”, as if it was
always 100% busy. In 2000, IBM introduced a usage-based software pricing
model; sub-capacity pricing and the concept of the rolling four-hour average
(R4HA).

IBM Monthly License Charge (MLC) software is priced based on peak MSUs
(Millions of Service Units) used per month and not on machine capacity:

• The average MSUs consumed is calculated for all rolling 4-hour
periods within a month, and

• The peak R4HA MSU is used to determine the monthly licensing
charges.

Usage is measured each month by IBM’s Sub-Capacity Reporting Tool (SCRT),
which calculates the monthly peak R4HA MSU consumed (for the period beginning
on the second day of the previous month up until the end of the first day of the
current month) - this is the basis for the software bill for the month.

See below for a brief overview of key CPU measurements and billing terms.
These sections briefly describe some of the CPU measurement metrics used in
the computation of R4HA MSU. However, for your convenience, you may also go
directly to the section ‘Computation of Rolling 4 Hour Average (R4HA) MSU’ on
the next page.

MIPS

MIPS (Millions of Instructions per Second), is probably the most common unit
used when talking about mainframe capacity. When mainframes were still young,
manufacturers could measure MIPS capacity by repeatedly running a small
standard routine. However, MIPS has not been a meaningful measurement for
decades. IBM mainframes have a huge number of instructions; some are simple
and quick, and others are complicated and slow. For example, one application
using five million simple instructions will use a lot less CPU than one using five
million complicated ones. In addition, the number of instructions available is
increasing with each new mainframe processor type. In summary, MIPS is an
indication of the speed of a processor and is very much workload dependent. IBM
publishes the MIPS rating of its various processor models, but there is no single
MIPS number for any CEC and no tool that reports MIPS numbers.

CPU Second

Traditionally, CPU seconds - the number of seconds a CPU is actually in use -
was a measurement of how much work is performed by a processor. z/OS
(Mainframe OS) records the number of CPU seconds that each unit of work has
consumed, providing an excellent way of measuring workload consumption for
billing. However, the amount of work that can be done in one CPU second is not
the same for each processor type. For example, one CPU second used by a z15
processor type is different to a CPU second used by a z16 processor type due to
differing clock speeds. This makes the measurement inconsistent when
upgrading mainframe hardware from one processor type to another.

Service Unit

SU (Service Units) allows cross processor type comparisons. IBM publishes SU
ratings, which differ by processor model, even under the same processor type.
Further complications arise when a CEC has multiple LPARs with a varying
number of engines allocated to each LPAR. The SU_SEC for an LPAR is
determined at system start time and is based on the number of engines online to
that LPAR. For example, in a z14 CEC with 7xx model, one LPAR with 2 engines
delivers 88397 SU per second while another LPAR with 9 engines delivers 76555
SU per second even though they are hosted in the same CEC. Therefore, this is
not a good metric for CPU measurement and billing.

MSU (Million Service Units per hour)

MSU was created from Service Units. MSU measures the rate of CPU usage, but
can also refer to the capacity of the processor model, e.g., a processor with an
MSU rating of 100 can process up to 100 million service units per hour.

The original idea of MSUs was to use it as an indicator of CPC capacity. MSU is
calculated as:

MSU = The SU/Sec for the model * number of engines * 3600 / 1,000,000

Example: The z15 701 processor model has a rating of: 103225.8065
SU/Sec * 1 * 3600 / 1,000,000 = 371.6 or 372 MSUs
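The same calculation can be expressed as a small Python sketch, reproducing the z15 701 figure quoted above:

```python
def msu_rating(su_per_sec: float, engines: int) -> float:
    """Hardware MSU rating: service units per hour, in millions."""
    return su_per_sec * engines * 3600 / 1_000_000

# z15 701: one engine rated at 103225.8065 SU/sec
print(round(msu_rating(103225.8065, 1)))   # 372
```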

MSU is primarily intended for software licensing. Most vendors scale their
software licensing fees to an MSU rating. IBM’s sub-capacity licensing also uses
MSU when calculating the final bill.

Understanding MSU (Hardware and Software)

To help customers reduce software licensing charges, post the introduction of the
z990 processor type, IBM began tweaking the MSU capacity of processors (in
order to lure users to a newer machine). IBM publishes the MSU ratings of the
various processor models based on their internal measurements and
adjustments, which are only used for software license charging.

When IBM started altering the MSU number as a way of discounting software
cost, it gave rise to two definitions for MSU:

“Hardware” MSU – calculated using the original formula above, this is the basis
for measuring and reporting in some of the mainframe usage.

“Software” MSU – also based on the calculation above, but adjusted depending
on the CPU type and model and is a fixed value for a specific model. SW MSU is
used as the basis for software charging and LPAR capping and is used in
reporting some of the CPU measurements.

Example: The z15 701 is rated at 253 “Software” MSUs against 372 “Hardware”
MSU, as calculated above.

In short, the “Hardware” MSU is an indicator of capacity while “Software” MSU is
used for SW license charge and billing.

Computation of Rolling 4 Hour Average (R4HA) MSU

Workload manager (WLM), a component of z/OS (Mainframe OS), is responsible
for sampling MSU utilization for each LPAR every 10 seconds. Every 5 minutes,
WLM documents the highest observed MSU utilization sample from 10 second
buckets within a 5-minute interval. This process keeps track of the past 48
observations of the 5-minute samples for each LPAR. When the 49th observation
is recorded, the 1st observation is discarded, and so on. These 48 observations
of 5-minute intervals represent a total of 5 minutes * 48 = 240 minutes or the past
4 hours. The average of these 48 previous observations is the R4HA MSU
recorded every 5 minutes by WLM and is rolled over every 5 minutes. WLM stores
the average of these 48 values in the WLM control block and periodically
(installation defined) writes it to an SMF record.

As an example, an installation records SMF at an interval of every 5 minutes. In
this way, we have the R4HA MSU captured every 5 minutes, to be processed by
the SCRT tool. The SCRT algorithm calculates R4HA values at an hourly interval
the SCRT tool. The SCRT algorithm calculates R4HA values at an hourly interval
by taking an average of the 12 five-minute R4HA values recorded by SMF in an
hour. We feed the utilization data from the 2nd of previous month to the 1st of
current month into SCRT to generate the peak R4HA MSU in the month. SCRT
generates 24 hourly average data points per day and the peak value over the
month is used for billing.

At the beginning of each month the SCRT output report, for the previous month,
is sent to IBM (for billing).
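The mechanics described above can be sketched in Python. This is an illustration of the arithmetic only, not of SCRT itself, and the sample values used are invented:

```python
from collections import deque

def r4ha_series(five_min_samples):
    """Rolling 4-hour average over a stream of 5-minute MSU observations.

    WLM keeps the last 48 observations (48 * 5 min = 4 hours); each new
    observation displaces the oldest, and the average of the window is
    the R4HA value recorded at that point.
    """
    window = deque(maxlen=48)
    series = []
    for sample in five_min_samples:
        window.append(sample)
        series.append(sum(window) / len(window))
    return series

def hourly_peak(r4ha_values):
    """SCRT step: average each hour's 12 five-minute R4HA values and
    return the peak hourly value over the period fed in."""
    hourly = [sum(r4ha_values[i:i + 12]) / 12
              for i in range(0, len(r4ha_values) - 11, 12)]
    return max(hourly)

# A flat 100-MSU workload: after 4 hours the R4HA settles at 100.
flat = r4ha_series([100] * 48)
print(flat[-1])   # 100.0
```

The rolling window is why a short CPU spike barely moves the R4HA: one high 5-minute observation is averaged against 47 others.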

An example of R4HA computation for one LPAR

The following table describes the computation of R4HA MSU for one LPAR.
However, for actual billing, the data for all the CECs in all data centers over the
whole month are used for the computation of overall peak R4HA MSU.

Mainframe Processors

In general, we term the mainframe box as a ‘Machine’. This includes processors,
which we call ‘Engines’. Early mainframes had a single processor, which was
known as the central processing unit (CPU). Today's mainframes have a central
processor complex (CPC), which may contain several different types of
processors that can be used for different purposes. The following are the types of
processors (engines) available in mainframe and we may even have all of these
in our machines.

Central Processor (CP) - This is a general-purpose processor and is used
for normal operating system and application software.

System Assistance Processor (SAP) – This processor is used in I/O
subsystems for Input/Output processing.

Integrated Facility for Linux (IFL) – This processor executes Linux workload.

zEnterprise Integrated Information Processor (zIIP) – This processor
executes certain eligible workloads (Java, DB2, BI, ERP, CRM etc.).

Integrated Coupling Facility (ICF) – This processor is used with the coupling
facility.

Spare – exclusively reserved to provide failover in the event of a processor
failure.

Mainframe Applications

The Program and MIPS

A mainframe has processors running at certain speeds, which means each
processor delivers a certain number of CPU cycles and executes a certain
number of instructions per second. This, in general, is measured in MIPS (millions
of instructions executed per second). Fundamentally, a processor (CPU) does
this:

1. Fetch the instruction

2. Decode the instruction

3. Fetch the operand

4. Execute the instruction

5. Store the result

6. Go to next instruction

With this, the processor is in an infinite loop just performing ‘Fetch and Execute’.
Who tells the processor what to execute? The simple answer is ‘our programs’.
We write the programs in COBOL, PL/I, Assembler, Java etc., compile and link
edit to create load modules. When executed (e.g., batch job or CICS transaction),
it calls for systems services, which are again programs, supplied by the operating
system and software products. During processing, it executes a certain number
of instructions that are accounted for as: how many MIPS the program or
transactions have consumed. If our program can somehow result in executing a
smaller number of instructions, this will account to less MIPS and directly result
in saving $$$. So, the big question is, do we have a control over our programs to
execute a smaller number of instructions? The answer is certainly YES. We are
the creator of the program and we have options available at each step (coding,
compiling, link editing etc.) to optimize our program. This is called the art of
creating optimized and efficient programs, which will result in executing a smaller
number of instructions in total to process the same amount of data and generate
the same output.

Creating Optimized and Efficient Programs

A computer program is a collection of instructions that are executed by a
computer to perform a specific task. And programming is a technique to create
programs using multiple programming languages. Programming, in general, has
a life cycle involving multiple steps such as analysis, design, coding, debugging
and testing, implementing, and maintaining the program. There are many
techniques and disciplines followed in creating optimized and efficient programs.
I have noted down a few that I have learned from my application programming
friends such as:

• Using a structured programming style

• Choosing efficient datatypes

• Handling database tables efficiently

• Compiler and run-time options that affect run-time performance

• Efficient coding techniques

• Tuning CICS, DB2 and VSAM access

And I am sure that you will have many experts around you from whom you can
get a lot of hints and tips in optimizing your programs.

The following questions are asked from a CPO point of view.

Application design Considerations

• Have you considered the best practices?

• Data handling (buffpool, buffsize, block size etc.)

• Have you considered the SLA?

• Have you included the application recovery strategy?

• Have you thought of load balancing?

• Have you considered continuous availability (application split – sysplex)?

• Any possibility to use the specialty engines?

Application development Considerations

• Is your application code optimized? – i.e. can you perform the same work
with a smaller number of instructions?

• Do you understand how your application code is executed in mainframe? –
i.e. the frequency, the systems where it runs, and how it integrates with
other programs

• Use the debugger, if any, for code optimization

Application testing Considerations

• Resource utilization benchmarking against SLA
• Performance analysis
• Stress test
• Failover test – recovery or restart after program abend or failure
• DR test

Application optimization – An example of DD DUMMY

I am sure you all know DD DUMMY coded in your JCL. The use of the DUMMY
parameter is mostly done in testing a program. When testing is finished and you
want input or output operations performed on the data set, replace the DD
DUMMY statement with a DD statement that fully defines the data set. The
DUMMY parameter specifies that no device is to be allocated to this DD statement
and that no disposition processing or I/O is to be done. Many times, we use the
DUMMY parameter if we know that we will not need a file in a job step. But this is
not a good practice especially for output files. Using DUMMY for an input file e.g.,
“//SYSIN DD DUMMY” is perfectly fine. But, when we use DUMMY for the output
file, we might end up doing the full processing in the program, and just eliminating
I/O to the output file by coding DUMMY in the JCL.

A classic example: There is a business need not to generate the output file
anymore. But, instead of updating the program, we simply code DD DUMMY
in the JCL. We undoubtedly met our processing objective and saved some IO
and Disk/Tape space. But we did not reduce the processing time in the
program itself, which continues to burn CPU and therefore does not drive
down the CPU consumption and save $$.

Our suggestion: Please do look for DD DUMMY coded in your JCL especially
for output files, review the program, make necessary updates to the program
to eliminate any processing done against this file.

HiperDispatch

To get the maximum out of the hardware, it is very important to understand your
system hardware configuration and the HiperDispatch function.

I will describe some basics here. You can read a lot about HiperDispatch from
different sources. To make my description simpler, I have taken some description
ASIS from the IBM Manuals.

The above picture (Ref Page 29 of Redbook sg248850) shows that a z15 holds
up to five CEC drawers. Your installation configuration might have xxx number of
processors (a combination of GP and specialty engines) spread over multiple
drawers. When you assign the LPs to LPARs in the CEC, the LPs allocation to a
specific LPAR may spread to different drawers.

The above picture shows the drawer cache structure in z14 and z15. (Ref page
95 in Redbook sg248851).

As I have described above, the processor execution is as follows:

1. Fetch the instruction

2. Decode the instruction

3. Fetch the operand

4. Execute the instruction

5. Store the result

6. Go to next instruction

When data is fetched, everything depends on where the data resides:
L1/L2/L3/L4 cache, or memory? The data must be in L1 for the processor to use
it, so if the data resides in a cache other than L1, it must first be fetched into L1.
How deep into the shared cache and memory hierarchy (“nest”) must the
processor go to retrieve data not present in the L1 cache? This is important
because access time increases significantly for each level of cache accessed,
thereby increasing the processor wait time.

The following picture shows the relative cost of fetching the data from different
levels of cache or memory (taken from a presentation by Robert Vaupel in 2010).

While configuring the LPARs, we assign the number of logical processors (LPs)
and the amount of memory to be allocated to the LPAR. When the LPAR is
activated, PR/SM builds logical processors and allocates memory for the LPAR.

To understand HiperDispatch, you need to understand PR/SM entitlement. To
understand how PR/SM uses machine topology information, first you should know
how PR/SM determines entitlement for a partition. Entitlement is the amount of
real CPU time each partition is guaranteed. Entitlement for a partition with shared
CPUs is a function of the LPAR's weight, the sum of the weights for all other
shared partitions, and the number of shared physical CPUs in the Central
Processing Complex (CPC). Entitlement is calculated separately for each CPU
type as the number of shared CPUs multiplied by the ratio of an LPAR's weight
to the sum of the weights of all shared partitions.

PR/SM is aware of the processor drawer structure on ‘znn’ servers. The processor
unit assignment of characterized PUs is done at POR time, i.e., when the CEC is
initialized. The initial assignment rules keep PUs of the same characterization
type grouped as much as possible in relation to PU chips and CPC drawer
boundaries to optimize shared cache usage. This initial PU assignment, which is
done at POR, can be dynamically rearranged by an LPAR by swapping an active
core to a core in a different PU chip in a different CPC drawer or cluster to improve
system performance.

HiperDispatch is a function that combines the zOS operating system dispatcher
actions and the hypervisor (PR/SM) knowledge about the topology of the znn
system. The hypervisor, PR/SM, dispatches logical processors of the partitions
to physical processors. PR/SM always attempts to dispatch a logical processor
back to the same physical processor, or if this is not possible, at least to a
physical processor of the same drawer. Even if PR/SM achieves this, and the
logical processors are re-dispatched on the same physical processor or drawer, it
is still possible that the unit of work finds itself on a different physical processor or
even a different drawer just because of the z/OS dispatching process. So, there is
a high likelihood that it must regain its cache context either from memory or
remote level 2 caches, which has an impact on the execution time of the work and
thus an impact on the throughput of the system.

In summary, performance can be optimized by re-dispatching units of work to the
same processor group, which keeps processes running near their cached
instructions and data. This minimizes transfers of data ownership among
processors and processor drawers, preserving cache contents, minimizing
cache misses and increasing processing efficiency.

Without the HiperDispatch feature enabled, PR/SM generally distributes the
weight equally over the number of logical processors allocated to the LPAR. For
example, LPAR1 in a CEC having 10 GPs, with a partition weight allocation of 200
(out of 1000) and an assignment of 5 LPs, will use 40% of the 5 physical
processors. But, in reality, 2 physical processors with a 100% share can achieve
more efficient processing, which is achieved through the HiperDispatch function’s
‘Vertical CPU Management’, also generally known as ‘Vertical Polarization’.

PR/SM is able to give operating systems two cache advantages. First, PR/SM
can provide information about where logical CPUs are placed in the physical
topology. Second, PR/SM can place logical CPUs to increase cache benefits.
In vertical polarization mode PR/SM maps logical CPUs to real CPUs as closely
to one another as possible and moves these mappings as little as possible.

PR/SM also tries to consolidate a partition's entitlement onto a subset of the
logical CPUs that it places topologically near to one another. To do this, PR/SM
divides logical CPUs into three types:

• Vertical High (VH): Equivalent to a physical processor effectively dedicated
to the LPAR

• Vertical Medium (VM): Physical processor that is shared between LPARs

• Vertical Low (VL): Processors with no guaranteed entitlement that
are parked until needed

With the HiperDispatch feature enabled, the following table illustrates the
VH, VM and VL processor assignment based on the weight (in this
example, we have a z14 720 model CEC with 20 physical GPs).

LPAR    Weight  LCPs  LPAR Share (%)  Phys Proc Guarantee  VH Proc  VM Proc  VL Proc

LPAR1    400      9         40                8                8        0        1
LPAR2    140      5         14              2.8                2        1        0
LPAR3     60      2          6              1.2                0        2        0
LPAR4    220      5         22              4.4                3        2        0
LPAR5    180      3         18              3.6                3        0        0

Total   1000     24        100               20               16        5        1

Phys Proc Guarantee = Total number of physical processors * Partition's weight
/ Total weight

E.g., the Phys Proc guarantee for LPAR1 is ‘20 * 400 / 1000 = 8’

If the decimal fraction of the processor guarantee is ≥ 0.5, the vertical high (VH)
processors will be the integer number and there will be 1 vertical medium (VM).

If the decimal fraction is < 0.5, the vertical high (VH) processors will be the
integer number minus one and there will be 2 vertical mediums (VM).

And the VL processors will remain parked till needed.
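The rules above can be sketched in Python. This is a simplified illustration of the table's arithmetic only - the assumptions here (a whole-number guarantee yields only VH processors, and VH/VM counts are capped by the LPAR's logical processor count) are mine; actual PR/SM assignment involves more factors:

```python
def vertical_split(total_physical, weight, total_weight, lcps):
    """VH/VM/VL split per the rules stated above (simplified sketch)."""
    guarantee = total_physical * weight / total_weight
    whole = int(guarantee)
    frac = guarantee - whole
    if frac == 0:            # exact engine multiple: all vertical high
        vh, vm = whole, 0
    elif frac >= 0.5:
        vh, vm = whole, 1
    else:
        vh, vm = whole - 1, 2
    vh = min(vh, lcps)       # cannot exceed the logical CPs defined
    vm = min(vm, lcps - vh)
    vl = lcps - vh - vm      # remainder stays parked
    return vh, vm, vl

# Reproducing rows of the table (20 physical GPs, total weight 1000):
print(vertical_split(20, 400, 1000, 9))   # LPAR1 -> (8, 0, 1)
print(vertical_split(20, 220, 1000, 5))   # LPAR4 -> (3, 2, 0)
print(vertical_split(20, 180, 1000, 3))   # LPAR5 -> (3, 0, 0)
```

Note how LPAR5 ends up with 3 VH and no VM: its guarantee of 3.6 would call for a VM, but it only defines 3 logical processors.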

Mainframe zOS Capacity, Performance and Optimization (CPO)

32
Best practices:

According to the IBM Best Practices recommendation, the number of logical
processors should be the quantity required to satisfy the LPAR share/processor
guarantee, i.e., the Vertical High and Vertical Medium processors, plus 1 to 2
additional processors (Vertical Low).

The LPAR time slice is sensitive to the number of logical CPs. Having more logical
processors may drive your time slice to a smaller interval for the vertical medium
and vertical low logical processors.

As I learnt in the past from IBM performance experts, the sum of all logical shared
processors should not be more than triple the number of physical processors.
Otherwise, the LPAR management time to reassign the PUs to the logical CPUs
can increase to an unacceptable level.

Work will run most efficiently if you run within your defined weight, using vertical
highs and vertical mediums to support the workload and avoid use of vertical lows
except for occasional workload spikes. If the workload in the LPAR relies upon
vertical lows for throughput you may want to change the weight to match actual
usage.

In addition, the following “best practices” may be used:

1. Important LPARs should get as many vertical high (VH) processors as
possible;

2. Workloads should run mostly on vertical high (VH) and vertical medium
(VM) processors;

3. Vertical low (VL) processors should be used only for occasional workload
spikes

4. The number of vertical low processors should be limited to the ones really
needed, to reduce the risk of the vertical ‘short CP’ effect. That means, if
there is demand in an LPAR with a larger number of VL processors and the
CEC has free capacity, then the VH weight will be distributed over the VL
processors, thereby causing the performance impact known as the ‘short
CP’ effect.

The short CP effect can lead to poor response time especially for CPU intensive
workloads that can be stranded on a logical processor that will not run again for
a long time. It can also cause a waste of cycles on each running processor
spinning for system locks held by the not running processors.

Relative Nest Intensity (RNI)

Understanding Relative Nest Intensity (RNI) is important for anyone working in
z/OS performance and capacity as RNI should be the basis for capacity planning
when changing processor types.

To understand RNI better, I have extracted the following from the document
‘LSPR for IBM z SC28-1187-24’, page-29. I suggest you refer to the latest LSPR
manual because the technology changes very fast.

LSPR workload categories

Introduction

Historically, LSPR (Large Systems Performance Reference) workload capacity
curves (primitives and mixes) have had application names or have been identified
by a software characteristic. For example, past workload names have included
CICS, IMS, OLTP-T, CB-L, LoIO-mix and TI-mix. However, capacity performance
has always been more closely associated with how a workload uses and interacts
with a particular processor hardware design. With the availability of CPU MF
(SMF 113) data on z10 and later, the ability to gain insight into the interaction of
workload and hardware design in production workloads has arrived. The
knowledge gained is still evolving, but the first step in the process is to produce
LSPR workload capacity curves based on the underlying hardware sensitivities.
Thus, the LSPR introduces three new workload capacity categories which replace
all prior primitives and mixes.

Fundamental Components of Workload Capacity Performance

Workload capacity performance is sensitive to three major factors: instruction
path length, instruction complexity, and memory hierarchy. Let us examine each
of these three.

Instruction Path Length

A transaction or job will need to execute a set of instructions to complete its task.
These instructions are composed of various paths through the operating system,
subsystems and application. The total count of instructions executed across these
software components is referred to as the transaction or job path length. Clearly,
the path length will be different for each transaction or job depending on the
complexity of the task(s) that must be performed. For a particular transaction or
job, the application path length tends to stay the same presuming the transaction
or job is asked to perform the same task each time. However, the path length
associated with the operating system or subsystem may vary based on a number
of factors including:

a) Competition with other tasks in the system for shared resources – as the
total number of tasks grows, more instructions are needed to manage the
resources;

b) The Nway (number of logical processors) of the image or LPAR – as the
number of logical processors grows, more instructions are needed to
manage resources serialized by latches and locks.

Instruction Complexity

The type of instructions and the sequence in which they are executed will interact
with the design of a micro-processor to affect a performance component we can
define as “instruction complexity.” There are many design alternatives that affect
this component such as: cycle time (GHz), instruction architecture, pipeline,
superscalar, out-of-order execution, branch prediction and others. As workloads
are moved between micro-processors with different designs, performance will
likely vary. However, once on a processor this component tends to be quite similar
across all models of that processor.

Memory Hierarchy and “Nest”

The memory hierarchy of a processor generally refers to the caches (previously
referred to as HSB, High Speed Buffer), data buses, and memory arrays that
stage the instructions and data needed to be executed on the micro-processor to
complete a transaction or job. There are many design alternatives that affect this
component such as: cache size, latencies (sensitive to distance from the micro-
processor), number of levels, MESI (management) protocol, controllers,
switches, number and bandwidth of data buses and others. Some of the caches
are “private” to the micro-processor, which means only that micro-processor may
access them. Other caches are shared by multiple micro-processors. We will
define the term memory “nest” for a System z processor to refer to the shared
caches and memory along with the data buses that interconnect them.

Workload capacity performance will be quite sensitive to how deep into the
memory hierarchy the processor must go to retrieve the workload’s instructions
and data for execution. Best performance occurs when the instructions and data
are found in the cache(s) nearest the processor so that little time is spent waiting
prior to execution; as instructions and data must be retrieved from farther out in
the hierarchy, the processor spends more time waiting for their arrival.

As workloads are moved between processors with different memory hierarchy
designs, performance will vary as the average time to retrieve instructions and
data from within the memory hierarchy will vary. Additionally, once on a processor
this component will continue to vary significantly as the location of a workload’s
instructions and data within the memory hierarchy is affected by many factors
including: locality of reference, IO rate, competition from other applications and/or
LPARs, and more.

Relative Nest Intensity

The most performance sensitive area of the memory hierarchy is the activity to
the memory nest, namely, the distribution of activity to the shared caches and
memory. We introduce a new term, “Relative Nest Intensity (RNI)” to indicate the
level of activity to this part of the memory hierarchy. Using data from CPU MF,
the RNI of the workload running in an LPAR may be calculated. The higher the
RNI, the deeper into the memory hierarchy the processor must go to retrieve the
instructions and data for that workload.

Many factors influence the performance of a workload. However, for the most part
what these factors are influencing is the RNI of the workload. It is the interaction
of all these factors that result in a net RNI for the workload which in turn directly
relates to the performance of the workload.

The traditional factors that have been used to categorize workloads in the past
are listed along with their RNI tendency in the following figure.

It should be emphasized that these are simply tendencies and not absolutes. For
example, a workload may have a low IO rate, intensive CPU use, and a high
locality of reference – all factors that suggest a low RNI. But what if it is competing
with many other applications within the same LPAR and many other LPARs on
the processor which tend to push it toward a higher RNI? It is the net effect of the
interaction of all these factors that determines the RNI of the workload which in
turn greatly influences its performance.

Note that there is little one can do to affect most of these factors. An application
type is whatever is necessary to do the job. Data reference pattern and CPU
usage tend to be inherent in the nature of the application. LPAR configuration and
application mix are mostly a function of what needs to be supported on a system.
IO rate can be influenced somewhat through buffer pool tuning.

However, one factor that can be affected, software configuration tuning, is often
overlooked but can have a direct impact on RNI. Here we refer to the number of
address spaces (such as CICS AORs or batch initiators) that are needed to
support a workload. This factor has always existed but its sensitivity is higher with
today’s high frequency microprocessors. Spreading the same workload over a
larger number of address spaces than necessary can raise a workload’s RNI as
the working set of instructions and data from each address space increases the
competition for the processor caches. Tuning to reduce the number of
simultaneously active address spaces to the proper number needed to support a
workload can reduce RNI and improve performance. In the LSPR, we tune the
number of address spaces for each processor type and Nway configuration to be
consistent with what is needed to support the workload. Thus, the LSPR workload
capacity ratios reflect a presumed level of software configuration tuning. This
suggests that re-tuning the software configuration of a production workload as it
moves to a bigger or faster processor may be needed to achieve the published
LSPR ratios.

Calculating Relative Nest Intensity

The RNI of a workload may be calculated using CPU MF data. For z10, three
factors are used:

• L2LP: percentage of L1 misses sourced from the local book L2 cache
• L2RP: percentage of L1 misses sourced from a remote book L2 cache
• MEMP: percentage of L1 misses sourced from memory

These percentages are multiplied by weighting factors and the result divided by
100. The formula for z10 is:

z10 RNI=(1.0xL2LP+2.4xL2RP+7.5xMEMP)/100.

Tools available from IBM (zPCR) and several vendors can extract these factors
from CPU MF data. For z196 and zEC12 the CPU MF factors needed are:

• L3P: percentage of L1 misses sourced from the shared chip-level L3 cache
• L4LP: percentage of L1 misses sourced from the local drawer L4 cache
• L4RP: percentage of L1 misses sourced from a remote drawer L4 cache
• MEMP: percentage of L1 misses sourced from memory

The formula for z196 is:
z196 RNI=1.67x(0.4xL3P+1.0xL4LP+2.4xL4RP+7.5xMEMP)/100
The formula for zEC12 is:
zEC12 RNI=2.3x(0.4xL3P+1.2xL4LP+2.7xL4RP+8.2xMEMP)/100
The formula for z13 is:
z13 RNI=2.3x(0.4xL3P+1.6xL4LP+3.5xL4RP+7.5xMEMP)/100
The formula for z14 is:
z14 RNI=2.4x(0.4xL3P+1.5xL4LP+3.2xL4RP+7.0xMEMP)/100
The formula for z15 is:
z15 RNI=2.9x(0.45xL3P+1.5xL4LP+3.2xL4RP+6.5xMEMP)/100

The RNI of a workload may be calculated using CPU MF data. For IBM z16, four
factors are used:

• L3P: percentage of L1 misses sourced from the shared chip-level VL3 cache
• L4LP: percentage of L1 misses sourced from the local drawer VL4 cache
• L4RP: percentage of L1 misses sourced from a remote drawer VL4 cache
• MEMP: percentage of L1 misses sourced from memory

IBM z16 RNI=4.3x(0.45xL3P+1.3xL4LP+5.0xL4RP+6.1xMEMP)/100

Note these formulas may change in the future.
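To avoid retyping the coefficients, the formulas above can be folded into a small helper. This is an illustrative sketch only: the coefficients are transcribed from this chapter and should be verified against the latest LSPR manual before use.

```python
# Illustrative RNI calculator built from the LSPR formulas quoted above.
# Inputs are CPU MF percentages of L1 misses sourced from each level (0-100).
# Verify coefficients against the latest LSPR manual; they change per machine.

RNI_COEFFICIENTS = {
    # machine: (scale, L3P, L4LP, L4RP, MEMP)
    "z196":  (1.67, 0.40, 1.0, 2.4, 7.5),
    "zEC12": (2.3,  0.40, 1.2, 2.7, 8.2),
    "z13":   (2.3,  0.40, 1.6, 3.5, 7.5),
    "z14":   (2.4,  0.40, 1.5, 3.2, 7.0),
    "z15":   (2.9,  0.45, 1.5, 3.2, 6.5),
    "z16":   (4.3,  0.45, 1.3, 5.0, 6.1),   # VL3/VL4 virtual caches
}

def rni(machine, l3p, l4lp, l4rp, memp):
    """Relative Nest Intensity for z196 and later machine generations."""
    scale, c3, c4l, c4r, cm = RNI_COEFFICIENTS[machine]
    return scale * (c3 * l3p + c4l * l4lp + c4r * l4rp + cm * memp) / 100

def rni_z10(l2lp, l2rp, memp):
    """z10 variant, which uses book-level L2 factors instead."""
    return (1.0 * l2lp + 2.4 * l2rp + 7.5 * memp) / 100
```

For example, `rni_z10(10, 5, 2)` evaluates the z10 formula to 0.37. Tools such as zPCR extract the underlying percentages from CPU MF (SMF 113) data.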

LSPR Workload Categories Based on Relative Nest Intensity

As discussed above, a workload’s relative nest intensity is the most influential
factor in determining workload performance. Other more traditional factors
such as application type or IO rate have RNI tendencies, but it is the net RNI of
the workload that is the underlying factor in determining the workload’s capacity
performance. With this in mind, the LSPR now runs various combinations of
former workload primitives such as CICS, DB2, IMS, OSAM, VSAM, WebSphere,
COBOL and utilities to produce capacity curves that span the typical range of RNI.
The three new workload categories represented in the LSPR tables are described
below.

LOW (relative nest intensity): A workload category representing light use of
the memory hierarchy. This would be similar to past high scaling primitives.

AVERAGE (relative nest intensity): A workload category representing
average use of the memory hierarchy. This would be similar to the past
LoIO-mix workload and is expected to represent the majority of production
workloads.

HIGH (relative nest intensity): A workload category representing heavy use
of the memory hierarchy. This would be similar to the past DI-mix workload.

Hints and Tips
I describe my experiences here. They are not described in any particular order.
You can refer to any topic based on your interest.

Cost - The biggest challenge and my simple solution

Running the mainframe platform effectively and efficiently, upgrading and
patching in a timely manner with minimal or no downtime, identifying and fixing
all issues with no downtime, implementing automation, and creating a robust
Disaster Recovery infrastructure and policy are the principal objectives and goals
of the mainframe zOS infrastructure team.

But the biggest challenge that still remains in any mainframe organization is the
‘cost’, i.e., the green $$ paid to IBM and the vendors.

In general, the overall mainframe cost is complex and includes multiple
components such as:

• Hardware purchases – CPU, CF, DASD, TAPE and many more –
depreciated over 3 to 5 years, in general

• Additional hardware upgrades – e.g., +2 GP engines or nnTB of DASD etc.

• SW License charges – One time and monthly

• HW maintenance

• DR infrastructure

• Floor space, power, air conditioning etc.

• Office space

• People

But the biggest variable cost comes from the monthly SW license charges, which
are based on the MSU-hour consumption under TFP (Tailored Fit Pricing).

So, what is the solution to manage the overall cost of mainframe?

Manage hardware

Make a long-term plan for physical HW provisioning and upgrade for at least the
next three to five years, considering the following:

• Current utilization

• Year to year normal business growth (BAU)

• Future business requirements (new, upgraded, sunset of business
applications)

• Systems zOS and products upgrade plan

• Critical business SLA

• Absolute peak demand of various resources

• Encryption requirements

The short-term capacity requirement should consider month-end, quarter-end,
and year-end processing, Black Friday and Cyber Monday sales, and meet the
demand of special business promotions or fund-raising programs. This should be
fairly easy to forecast based on past experience and historical usage data.

The physical capacity requirements can be forecasted using suitable techniques
such as linear projections, mathematical models, or the forecasting models
available in tools like Excel. You must find a way to do this with an accuracy of
plus or minus 5 percent.
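As a minimal sketch of the linear-projection technique mentioned above (the sample MSU figures and the 12-month horizon are illustrative assumptions, not data from any real installation):

```python
# Least-squares trend line over monthly peak-MSU observations, projected
# forward. A real forecast would layer in agreed BAU growth, known projects,
# and planned optimization savings on top of the raw trend.

def linear_forecast(history, months_ahead):
    """Fit y = a + b*x to monthly observations and project it forward."""
    n = len(history)
    x_mean = (n - 1) / 2                      # mean of x = 0 .. n-1
    y_mean = sum(history) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history)) \
            / sum((x - x_mean) ** 2 for x in range(n))
    intercept = y_mean - slope * x_mean
    return [intercept + slope * (n - 1 + m) for m in range(1, months_ahead + 1)]

# Illustrative input: 13 months of observed peak MSU
history = [900, 910, 905, 920, 930, 925, 940, 950, 945, 960, 970, 965, 980]
next_year = linear_forecast(history, 12)
```

Comparing the projection against actuals month by month is also a simple way to check whether the plus-or-minus 5 percent accuracy target is being met.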

You must ensure the availability of ‘Capacity on Demand’, especially if the
short-term demand is so volatile that it cannot be estimated reliably - however,
this should not normally be the case.

You must ensure adequate Disaster Recovery (DR) capacity based on the
organizational policy such as to support 100% or 90% or 80% of normal
production workload, whether test systems are required during DR or not, and
whether some of the applications can remain down or not.

You should clearly document the hardware requirements over a period of time
(the minimum, the maximum, and the good-to-have), review them with the SMT
(Senior Management Team), and get their approval. Most importantly, highlight
any deviations proactively.

Optimize Software suite

This is not easy. However, we must have a list of the software running under our
zOS. The zOS suite of software is integrated, and you can only control separately
licensed features such as SDSF, RMF and others, which are enabled or disabled
through the Parmlib member IFAPRDxx. Focus on the other products from IBM
and non-IBM vendors to justify why we need each product in our system, do a
cost-benefit analysis, and always look for alternatives, if available. You must have
a process to manage EoS (End of Service) and EoL (End of Life) software. Also
review software products that duplicate the functionality of other products with
minimal added benefit. For example, if we are using the entire Omegamon
product suite, it could duplicate other specific products providing similar functions
for Network and DB2.
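As an illustration, priced features are controlled with PRODUCT entries in IFAPRDxx. The entry below is a sketch only: the owner, product ID, and feature name vary by release and must be taken from your own Parmlib and product documentation.

```
PRODUCT OWNER('IBM CORP')
        NAME('z/OS')
        ID(5650-ZOS)                 /* product ID - verify for your release */
        VERSION(*) RELEASE(*) MOD(*)
        FEATURENAME('SDSF')
        STATE(DISABLED)              /* feature not licensed on this system  */
```

Keeping such entries aligned with what is actually licensed is part of the cost-benefit review described above.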

Get maximum CPU cycles out of the hardware

You must find innovative ways of getting the maximum out of the available
hardware. I have experienced the following:

• Configure the weights and LPs to allocate only Vertical High (VH) engines
to production systems, which can provide almost 10 to 20% more MIPS per
engine

• Implement Simultaneous Multi-Threading (SMT) for specialized engines,
as per availability from IBM

• Explore all possibilities of offloading eligible workload to specialized
engines

Reduce CPU wastage

Reduce any wastage of CPU cycles. For example:

• Keep the average LPAR usage below 90% as much as possible

• Implement auto logoff of dormant TSO sessions

• Shut down rarely used STCs, especially in test systems, and start them
only when needed

• Do not restart failed jobs until you fix the issue; do not allow them to fail
again

• Track and avoid S322 abends (job abends after consuming the allocated
CPU time). Implement a job restart technique to continue the remaining
processing after a job fails

• Distribute workload in the various LPARs. For example, avoid starting HSM
processing or DB2 archive at the same time in all LPARs running in the
same CEC

• Avoid batch processing during absolute online peak window

• Use WLM-managed initiators unless you need JES-managed initiators
because of an installation need; if so, start only the required number

• Suppress the SMF records not required in the installation

• Avoid leaving SLIP traps running over long periods of time

• Avoid unnecessary DD statement allocations if not required for the
processing. In many installations, the application team defines a standard
JCL with hundreds of DD allocations where many are not required for the
specific job processing

• Eliminate CPU-intensive customizations, and use them only when
required, e.g. trace data collection

• Optimize the product suite – discontinue license for unused products

• Reduce the volume of test data in test systems – create your own test
data rather than copying a chunk of production data

• Optimize CF structures configuration

• Manage CF structure duplexing – User or systems managed duplexing

• Exploit Asynchronous systems managed duplexing

• Review the number of LPARs and the number of logical CPs defined

This list can be very long, based on our observations, utilization data analysis,
and the mix of workloads running in our systems.
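Two of the items above, the TSO auto logoff and the SMF record suppression, are typically handled through the SMFPRMxx Parmlib member. The fragment below is a sketch only: the record-type list and the JWT value are illustrative and must be tailored to your installation.

```
JWT(0030)                          /* idle jobs/sessions abended after 30 min */
SYS(TYPE(6,14:19,26,30,70:79,89,
         100:102,110:116,120))     /* collect only the record types needed;   */
                                   /* types omitted here are suppressed       */
```

Review the TYPE list against what your reporting and chargeback processes actually consume before suppressing anything.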

Optimize, Optimize, and Optimize

I very strongly believe that we have opportunities to save MSU-hours. To achieve
this, optimize anything and everything, because every CPU cycle saved is a direct
saving towards the overall MSU-hours. For this, you must generate optimization
opportunities, both in application programs and system tasks.

My technique is very simple. Identify the abnormalities that result in wastage and
take action to fix them.

But, how to identify these abnormalities? For this you need to establish what is
normal. Only then will you be able to identify the abnormal. You need to start
somewhere.

My suggestions:

• Analyse the CPU usage of all tasks running in the system: the system
tasks, online CICS transactions, and the batch jobs.

• Establish a historical trend over at least the last 13 months and determine
a CPU usage pattern such as:

o The average CPU utilization per day of GRS and other system tasks

o Average CPU usage in an hour by CICS transactions

o Number of CICS transactions executed per second

o Usage pattern of DB2 and MQ based on the variation of workload
processing

o Usage by HSM task

Then, use this historical data as a baseline for now and do a step-growth analysis
to identify the abnormalities, if any. If you find any opportunities for optimization,
create tasks and assign them to the respective SMEs for implementation. This
will be a repetitive analysis; refresh your baseline from time to time if a new
normal is established.
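The step-growth check described above can be sketched as a simple baseline comparison (the 20% threshold and the task names are illustrative assumptions, not standards):

```python
# Compare each task's current CPU usage against its 13-month baseline
# and flag abnormalities. The threshold is an illustrative choice only.

def flag_abnormal(baseline_by_task, current_by_task, threshold=0.20):
    """Return {task: percent deviation} for tasks outside the threshold."""
    flagged = {}
    for task, base in baseline_by_task.items():
        cur = current_by_task.get(task)
        if cur is None or base == 0:
            continue          # no data to compare, or baseline unusable
        growth = (cur - base) / base
        if abs(growth) > threshold:
            flagged[task] = round(growth * 100, 1)
    return flagged

# Illustrative daily-average CPU seconds per task
baseline = {"GRS": 120.0, "HSM": 900.0, "CICSPROD": 5400.0}
current  = {"GRS": 118.0, "HSM": 1350.0, "CICSPROD": 5500.0}
```

In this example HSM has grown 50% over its baseline and would be flagged for investigation; the other tasks stay within the threshold.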

Please remember, most business applications are developed in High-Level (HL)
languages (e.g., COBOL, PL/I, Java, REXX). So, unlike assembler programs,
there is always an opportunity to optimize application programs developed in HL
languages. One immediate thought that comes to mind is moving SPACES to a
whole record or to specific fields, and how many times this is repeated in the
program. In my early career, I was a COBOL programmer. I have written
hundreds of programs, but as I gained experience, I found multiple techniques
for writing optimized programs. I gained the same experience while writing an
application in REXX by making use of the rich built-in REXX functions.

And the most common CPU wastage is in writing inefficient SQL. I have noticed
opportunities to tune most SQLs.

I strongly suggest that the application program code be reviewed by an
experienced program developer for each and every program. It is critical to
establish a process to perform performance and stress testing before moving the
application code to production systems. This will help to create a baseline for the
expected CPU utilization of the program when moved to production. I suggest
this because I have often seen that when a program is not tested with proper
test data to exercise all possible logic paths, it behaves differently in production
with certain data that was not considered during testing in the test system.

Nothing is right or wrong

Please note that: nothing is right or wrong. If you want to understand it from a
different perspective, everything is right and everything is wrong.

Generally, the mainframe platform has existed in the organization for a long time,
and lots of so-called legacy has been built into the systems: special systems
configurations through Exits and Usermods, in-house developed programs with
hooks into system components, in-house developed programs performing
specific functions, and so on.

In 1994, I visited an organization that had developed CICS programs in
assembler and had a highly customized VTAM, which had effectively become a
different component altogether and was supported by one specific staff member
at IBM. When that staff member retired from IBM, the organization was in very
big trouble supporting the customized VTAM. They had no choice other than to
explore moving the applications to standard CICS and COBOL so that they could
use standard VTAM functions and remove the in-house customization
dependency. This was a huge technical debt and cost to manage.

When we build (install, patch, and customize) the software, we most often carry
forward all customizations that were there previously. This is perfectly fine in the
short term, because we do not want to introduce any issues to the systems and
applications after the build. But with the advancement of time, zOS, the
languages, and the products have provided many rich features which, I strongly
feel, need to be analysed for applicability. It is also key to prepare a long-term
plan to take advantage of these new features. Also, there should be a focus on
reviewing the old EXITs, USERMODs, and in-house developed tools, developed
and used for a long time, to reduce the technical debt inside the mainframe
organization.

At the same time, from the CPO point of view, everything is wrong unless proven
right. So, create a very systematic approach to review and analyse each and
every component (e.g., zOS and product configuration parameters, all in-house
developed tools, application codes for CICS transactions and batch jobs and all
housekeeping tasks like SMF data collection and processing, systems and
database backup, HSM space management, zOS DR configuration, etc.) and
look for all possible optimizations till you are convinced that this is normal. Then
take it as a baseline and move forward. When you complete the cycle of reviewing
all tasks in the systems, please go back in the loop again. To conserve time and
make the review and analysis process effective, make use of automation tools.
And most importantly, keep your eyes and ears open and note all the changes to
the system, such as: upgrade of a single product, enhancement to a specific
business application, implementation of new business applications. Also, keep an
eye on all systems incidents and problems. This helps generate a large number
of CPO opportunities.

CPO Team

I strongly suggest a dedicated CPO team of at least two or three members under
the zOS infrastructure to perform analysis and identify various opportunities for
optimization. Then follow the CPO process in your organization to execute the
optimizations, either by the CPO team itself or by assigning them to the different
SME areas (systems programmers, application development teams, planners,
the operations team, etc.) for execution and deployment. The CPO team must
provide a periodic update to the SMT on the number of initiatives, the cumulative
savings so far against the target, and the support required from the SMT to push
the optimizations forward.

CPO Objective

I have provided a generic list of objectives for the CPO team.

• To ensure that there is adequate resource capacity available to meet
current and future workload demands

• Ensure effective usage of the resources, that is, maintain a
‘balanced system’

• Generate and publish resource capacity utilization reports

• Generate and publish short-term capacity forecasts during month-end,
quarter-end, year-end, and special processing, and long-term capacity
forecasts for the next 3 to 5 years

• Suggest hardware upgrades as necessary and support the procurement
and actual upgrade

• Manage CBU (Capacity Back Up) to meet capacity demand for business
continuity

• Manage capacity on demand activities such as OOCoD (On Off Capacity
on Demand) to meet short-term unforeseen, one-time special demands

• Accept capacity input from development teams for new or changed
applications which demand additional resource capacity

• Perform capacity analysis, identify optimization opportunities, and follow up
as a task or project for implementation

• Manage all capacity related problems and issues

• Own and manage capacity management related tools/models

• Build skills and mentor your successors in the team

• Prepare documentation on optimization approach, strategy, and
techniques, and educate the application development community and
SMEs to enable them to perform optimization activities efficiently

CPO Benefits

CPO benefits are manyfold and it is very difficult to list them all. But, in my view,
the major benefits are as follows:

• Adequate resource capacity planning and provisioning (within budget),
thereby reducing major performance issues and systems downtime due to
lack of capacity

• Proactive capacity forecasting, both short-term (month-end processing,
special processing like fund raising and business promotions) and
long-term (major capacity upgrades)

• Efficient systems configuration such as #LPARs, #LPs (Vertical High,
Medium, and Low processors), central storage allocation, CF sizing, etc.

• Effort and action in saving every CPU cycle, thereby reducing CPU
utilization and improving processing times

• Determine inefficiencies in the system, middleware, I/O, and application
performance.

• Analyse and identify application execution delays and generate CPO
opportunities.

• Build a thorough understanding of online and batch performance
characteristics.

• Most importantly, understand in detail the workload running in the system,
which is generally not the case in many organizations.

Capacity Management

ITIL (Information Technology Infrastructure Library) process describes the three
layers of Capacity Management.

• Business capacity management.

• Service capacity management.

• Component capacity management.

All three layers are important for the organization and I have included most of
them in this book. However, I will focus more on the ‘Component Capacity
management’ which we all know as ‘Resource Capacity Management’. The major
task under Resource Capacity Management is to provision an adequate resource
capacity to meet all business processing demands i.e., to provision the right
amount of resources at the right time and at the right price to keep operations
running smoothly.

The result of capacity management is the Capacity Forecast and is documented
in a dynamic Capacity Plan. The capacity plan is prepared considering the old
capacity plan, historical capacity utilization, upcoming business requirements,
upcoming seasonal activities, DR requirements and, most importantly, cost.
There are multiple techniques which can be developed to perform your capacity
management. You may use the standard tools available in the market to manage
this.

In my opinion, the capacity plan should be a dynamic document, prepared yearly
with thorough review and final approval from the senior management team. And
the actual resource utilization should be tracked, and the capacity plan may
undergo ad hoc changes depending on the variation in business requirements
and the variation in usage pattern based on multiple operational factors.

The capacity plan must include the following:

Short-term capacity forecast

Capacity forecast for the production environment to meet the demand for month
end, quarter end, year-end, seasonal and special processing.

Long-term forecast

Long-term forecast is performed using the utilization trend for at least the previous
12 months and considering the agreed BAU growth, known projects and
optimization savings. The long-term plan must forecast the capacity for the next
three to five years.

Ad hoc forecast

Capacity forecasts for projects and other one-time requirements are done based
on demand, and the same are taken into the short-term and long-term forecasts,
as applicable. In general, an ad hoc capacity forecast is performed on demand to
address the following:

• To meet the projects or application requirements.

• Abnormal growth observed based on the utilization trend analysis.

• Capacity demand for stress testing.

Capacity provisioning

Based on the capacity forecast, the CPO team should make sure that the required
capacity is available to meet the demand. The CPU capacity provisioning
considers the following aspects:

• Provisioning of enough physical CPU capacity to support the absolute
online peak demand to meet the critical workload SLA

• Manage the overall hourly average CPU utilization below nn%
(suggested: 90%)

• Provisioning of the maximum average CPU requirement during the batch
window (6 to 8 hours) to meet the business SLA

• Provisioning of OOCoD (On Off Capacity on Demand) to meet short-term
transient spikes over a short period, for example due to system or
application issues such as loops or bugs which take time to analyse and fix

• Provisioning of adequate CBU capacity for DR based on installation
requirements

Resources covered under capacity plan

The following resources are considered for the forecasting. In general, the
forecast is performed for GP CPU as this is the most variable resource
component. However, the other components should be reviewed from time to
time and mostly on demand - for example during a major hardware/software
upgrade, major project implementation and to address performance issues.

▪ GP CPU

▪ zIIP CPU

▪ Crypto processor

▪ Central storage

▪ DASD storage

▪ CF CPU & memory

▪ And other resources as needed

Capacity Reporting

A report must be prepared every month or on an ad hoc basis to present the
current capacity status, utilization vs. forecast, and the observations and
recommendations. This must be presented to the Senior Management Team for
their information and action, if any. Our recommendations, if any, must be based
on the resource utilization data, observations, and input received from other
teams like the development community, business entities, finance, chargeback,
and others.

Performance Management

The objective of performance management is to maintain operational efficiency
in all mainframe environments and avoid resource wastage. The following major
activities are performed under this.

• Manage the service definition in WLM (Workload Manager) to define the
relative prioritization of various workloads in different environments

• Performance monitoring and alerting

• Performance data analysis and actions

• Handle performance related issues

• Performance reporting

Systems performance management is performed at different levels. The zOS
systems programmers (zOS, CICS, DB2, MQ, Network and others) perform
performance monitoring and tuning at their respective component levels.
However, the CPO team identifies performance opportunities based on utilization
data analysis. Implementation of Real Time Performance (RTP) monitoring and
analysis is a major activity performed by the CPO team.

Capacity Optimization

I have a very simplistic view on capacity optimization as follows.

The key activities performed are:

• Collect the CPU utilization data, process and analyse

• Identify and report potential savings opportunities

• Follow up for immediate action on short-term opportunities

• Manage long-term opportunities as a project

• Measure and report savings

• Reuse optimized resources against the growth forecast

Objective

Optimization is a very generic term and can be applicable in different contexts.
So, here, I define very specific optimization objectives for the CPO team.

• To reduce CPU consumption (MSU-hours) and thereby reduce costs on
the mainframe monthly license charges

• Identify abnormalities and fix them

• Identify and reduce wastage at all levels

• Define a specific target for saving xxxx MSU-hours per financial year

• Improve business applications performance – efficiency, stability,
throughput, and transaction response time

Optimization Tasks

These tasks are at a very high level, but the CPO team can compile a list of tasks
at a very detailed level to work on.

• Identify opportunities, provide justifications, the way to execute them, and the benefits – discuss and offload tasks to developers and SMEs for implementation

• Workload segregation/spread/shift – move work from high usage to low usage windows and maintain a balanced utilization throughout the day

• Manage the LPAR utilization below 90%

• Maintain the normal utilization baseline for all tasks

Workload

I will focus on all the workloads running in the systems.

• Batch jobs – focus first on all jobs consuming more than 60 sec CPU time
and then go down the list.

• CICS transactions (make a smart strategy for optimization as the number and volume of transactions are very high)

• System and sub-system tasks – All

• TSO users – Focus only on the high consumers and implement time-out for
idling users

Input

The following will be the source of inputs for optimization initiatives.

• Capacity and performance analysis

• Capacity and performance related incidents and problems

• Capacity and performance monitoring and alerts

• Input from development areas and business entities

Savings

I classify optimization savings as follows.

• Direct – Savings in green $$ (paid out to vendor) – mostly the savings towards MSU-hours and reduction in provisioning of physical capacity

• Indirect – Savings in blue $$ – this is internal, meaning reclaim and reuse (e.g., avoid or delay an upgrade, reduction in chargeback)

Optimization outcome

Multiple outcomes are realized from the optimization initiatives.

• Optimization resulting in a direct reduction to the MSU-hour

• Reduction in CPU consumption – savings of every CPU cycle counts

• May not have direct reduction in CPU consumption but results in system
performance improvement – workload balancing

• The optimization savings may be in any component – saving is a saving

• To keep it generic, all savings are averaged over the whole month – nothing is ignored, and be conservative in your calculation

• CPU measurement at least 2 weeks before and 2 weeks after the optimization implementation

• The cost is computed over the whole year (cost per MSU-hour = xxxx $$
per year)

Savings calculation example

I calculate CPU savings by averaging the usage for at least two weeks before implementation and for at least two weeks of stable execution after implementation, as the implementation could be done in steps:

• Calculation of savings for a task, e.g., batch job, CICS transaction, or system task XYZ

o Usage before =

o Usage after =

o Difference =

o Yearly savings =
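As a sketch of filling in this template, here is a hedged Python example. All figures (the CPU times, run counts, and the 202 MSU-per-engine rating from the z14-704 example under Computation of cost) are purely illustrative assumptions:

```python
# Hypothetical savings calculation for a single optimized batch job.
# Substitute your own measured two-week averages and machine rating.
usage_before = 120.0   # avg CPU seconds per execution, measured before the change
usage_after = 85.0     # avg CPU seconds per execution, measured after the change
runs_per_day = 4       # executions per day

daily_saving_sec = (usage_before - usage_after) * runs_per_day
yearly_saving_sec = daily_saving_sec * 365

# Convert CPU seconds to MSU-hours: per-engine MSU / 3600 (z14-704: 202 MSU/engine)
msu_hour_per_cpu_sec = 202 / 3600
yearly_saving_msu_hour = yearly_saving_sec * msu_hour_per_cpu_sec

print(f"Yearly saving: {yearly_saving_sec:.0f} CPU sec "
      f"= {yearly_saving_msu_hour:.1f} MSU-hours")
```

Multiply the yearly MSU-hour figure by your installation's MSU-hour unit rate to express the saving in currency.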

Computation of cost

I suggest calculating the cost based on the MSU-hour unit rate used for IBM billing
and internal chargeback. You need to follow the standards used in your
organization.

For example, you have a z14 model 704.

#GP engines = 4

Per engine MIPS = 1486

LSPR MSU rating of model 704 = 808 MSU

Per engine MSU = 808 / 4 = 202 MSU

Convert CPU Sec to MSU-hour

One engine busy for one hour = 3600 CPU Sec = 202 MSU-hour

1 CPU Sec = 202 / 3600 = 0.056111 MSU-hour

Convert MSU-hour to CPU Sec

1 MSU-hour = 1 / 0.056111 = 17.82 CPU Sec

Convert MIPS to MSU-hour

1486 MIPS averaged over an hour = 202 MSU-hour

1 MIPS averaged over an hour = 202 / 1486 = 0.135935 MSU-hour

Convert MSU-hour to MIPS

1 MSU-hour = 1 / 0.135935 = 7.36 MIPS

Note: Based on your machine configuration, you can compute the values.

Based on the optimized task and frequency of execution you can compute the
amount of MSU-hours saved during a day/month/year and multiply it by the MSU-
hour unit rate to compute the cost savings.
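The conversions above can be wrapped into small helper functions. This is a sketch assuming the z14-704 ratings from the example (202 MSU and 1486 MIPS per engine); substitute your own machine's figures:

```python
# Conversion helpers based on the per-engine rating of the CEC.
# z14-704 example: 808 MSU / 4 GPs = 202 MSU per engine, 1486 MIPS per engine.
PER_ENGINE_MSU = 202
PER_ENGINE_MIPS = 1486

def cpu_sec_to_msu_hour(cpu_sec: float) -> float:
    # One engine fully busy for 3600 CPU seconds delivers PER_ENGINE_MSU MSU-hours.
    return cpu_sec * PER_ENGINE_MSU / 3600

def msu_hour_to_cpu_sec(msu_hour: float) -> float:
    return msu_hour * 3600 / PER_ENGINE_MSU

def mips_to_msu_hour(mips: float) -> float:
    # MIPS sustained over one hour, expressed as MSU-hours.
    return mips * PER_ENGINE_MSU / PER_ENGINE_MIPS

print(round(cpu_sec_to_msu_hour(1), 6))   # 0.056111
print(round(msu_hour_to_cpu_sec(1), 2))   # 17.82
print(round(mips_to_msu_hour(1), 6))      # 0.135935
```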

Please be careful of the computation if you have different models of CECs such
as a combination of 5xx, 6xx, 7xx. But this is easy to manage if you are using a
vendor tool or a tool developed in-house.

Optimization based on CEC models is also possible. For example, you may have a z14-704 CEC rated at 808 MSU, while a z14-511 is rated at 835 MSU with 11 engines. If single-processor speed is not very important for you, then exchanging the z14-704 for a z14-511 will give much more L1 cache and parallelism in the system.

CPO data analysis – A systematic approach

Workload utilization bucketing

For easy utilization analysis, reporting and chargeback, bucketization of workload is very important. Based on the WLM service definition, we can report utilization bucketized at three levels: (1) workload, (2) service class, and (3) report class.
We classify all our tasks running in the system under different service classes
whereas each service class is associated with a workload. For example, if we
define a workload called SYSTEM and associate the service class SYSTEM to it,
it does not mean that all the tasks bucketized under SYSTEM workload are
system tasks, as some of the non-system address spaces might have been
classified under the SYSTEM service class. The same is applicable to the SYSSTC service class, as many of the non-system started tasks are classified here, e.g., DB2 IRLM tasks.

So, I strongly suggest collecting utilization at a more granular level using the report classes. Then, establish the mathematical logic for proper bucketization. For
example, if all your DB2 started tasks are classified under Workload DB2 except
IRLM tasks, then strip off the IRLM utilization through a report class and add it to
utilization under the DB2 Workload to get a complete utilization by DB2 started
tasks.
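The IRLM adjustment described above can be sketched as follows. The report-class name RCIRLM and all figures (MSU-hours) are hypothetical:

```python
# Workload-level totals as reported by WLM, plus a report class capturing the
# IRLM tasks that were classified under SYSSTC (and so land in the SYSTEM side).
workload_usage = {"DB2": 410.0, "SYSTEM": 95.0}
report_class_usage = {"RCIRLM": 12.5}   # hypothetical report class for IRLM

# Strip IRLM out of the system-side numbers and add it to DB2, so the DB2
# bucket reflects the complete cost of running the DB2 started tasks.
adjusted = dict(workload_usage)
adjusted["DB2"] += report_class_usage["RCIRLM"]
adjusted["SYSTEM"] -= report_class_usage["RCIRLM"]
print(adjusted)
```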

Please do take the help of report classes to avoid defining too many service
classes.

Please make sure to classify all tasks running in the system. Anything left out goes under the SYSOTHER service class and runs with a discretionary goal, which is not good. So, it is important to identify the tasks that are not classified and classify them appropriately.

Establish a process to analyse utilization at the workload level.

Workload classification for easy Chargeback

I strongly suggest properly bucketizing the workloads under different business entities to make chargeback easy and accurate. For example, put all CICS regions and MQ tasks belonging to business BUS1 under one bucket.

• Establish a process to classify the workload (batch and online) under different business entities

• Develop an approach to apportion the common workload and system usage

o Apportion the common components (DB2, MQ, CICS, etc.) under the business entities, if you are sharing the same DB2 subsystems and CICS regions for running workloads of different business entities

o Apportion the system components to the business
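One simple approach is to apportion the shared usage in proportion to each entity's directly attributed usage. A sketch with hypothetical entity names and MSU-hour figures:

```python
# Directly classified batch/online usage per business entity (MSU-hours),
# plus shared DB2/CICS/system usage that must be apportioned.
direct = {"BUS1": 300.0, "BUS2": 100.0}
shared = 80.0

# Spread the shared pool in proportion to each entity's direct usage.
total_direct = sum(direct.values())
chargeback = {entity: usage + shared * usage / total_direct
              for entity, usage in direct.items()}
print(chargeback)
```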

Metrics for CPO analysis

I propose a very systematic approach of performing utilization analysis.

• GP CPU utilization – Follow the order for normal analysis

• CEC

• LPAR

• Uncaptured

• Workload

• Service class

• Address space

I recommend a drill-down approach for easy analysis. For example, the CEC utilization should be a stack of all the LPAR utilizations in the CEC. When we click on any LPAR, it should drill down to a stacked chart of all workload utilizations in the LPAR. You can establish your own strategy for performing drill-down analysis.

• Batch workload

• #Jobs

• CPU

• EXCP

• Online workload

• CICS - CPU consumption and #Transaction

• DB2 - Normal and distributed workload

• MQ

• WAS

• zIIP CPU utilization

• zIIP eligible workload, but not offloaded to zIIP

• zIIP workload executed in GP as zIIP engines are busy

• Crypto processor utilization

• Analyse the Crypto processor utilization periodically (at least monthly). I have observed daily spikes in utilization for around one hour during some specific activities.

• CF CPU and Memory utilization

• CF CPU utilization

• Memory allocation to CF LPAR vs. allocation to structures

• DASD and Tape

• Average IO response time

• Total physical capacity vs. allocation and actual utilization

• GP CPU granularity - TCB, SRB, RCT, IIT, HST – You will be required to perform CPU analysis at such a granular level only in some specific cases

CEC CPU Analysis

• CEC CPU utilization

• CEC zIIP utilization

• LPAR management CPU%

▪ Combined LPAR management% for CPs and zIIPs should be less than 2%

▪ White space available in different time windows

• MSU and R4HA MSU
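The R4HA can be computed from interval-level MSU data with a rolling window. A sketch assuming 5-minute SMF intervals (48 intervals per 4-hour window); the sample values are hypothetical:

```python
from collections import deque

def r4ha_series(interval_msu, intervals_per_4h=48):
    # Rolling average over up to 4 hours of interval MSU values.
    window = deque(maxlen=intervals_per_4h)
    out = []
    for msu in interval_msu:
        window.append(msu)
        out.append(sum(window) / len(window))
    return out

# 4 hours at 400 MSU followed by 4 hours at 600 MSU (illustrative)
samples = [400] * 48 + [600] * 48
peaks = r4ha_series(samples)
print(max(peaks))   # the R4HA peak that drives WLC charges
```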

LPAR CPU analysis

• The LPAR utilization

▪ LPAR

▪ Workload / Service class / Report class

▪ CPU usage by address spaces

▪ Uncaptured CPU and capture ratio

▪ Causes of uncaptured CPU

o High page fault rates

o Full preemption

o Suspend lock contention

o Spin lock contention

o Getmain/Freemain activity (recommend cell pools)

o SRM time-slice processing

o Interrupts

o SLIP processing

o Long queues being processed in uncaptured processing

o Excessive swap-out and swap-in activity

o Affinity processing (such as the need for a specific CPU or crypto facility)
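The uncaptured CPU and capture ratio mentioned above follow from a simple calculation; the interval figures below are hypothetical CPU seconds:

```python
# Captured CPU is the time attributed to address spaces / workloads
# (SMF type 30 / type 72); the rest of the physical busy time is uncaptured.
total_cpu_busy = 2500.0   # total LPAR CPU consumed in the interval
captured_cpu = 2200.0     # CPU accounted to workloads

uncaptured_cpu = total_cpu_busy - captured_cpu
capture_ratio = captured_cpu / total_cpu_busy
print(f"Uncaptured: {uncaptured_cpu:.0f} sec, capture ratio: {capture_ratio:.0%}")
```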

Workload CPU Analysis

• CPU Consumption at the importance level

• CPU Consumption at the WLM Service Class and Service Class Period
Level

• CPU consumption under Report Classes

• Other CPU consumption measurements

▪ Prime online window

▪ Batch window

▪ Other time window

Address space CPU Analysis

• Top address spaces consuming CPU

• Any specific address space in focus

• Split the analysis by Batch, TSO and STC

Analysis approach

• Focus (CPU / Memory / DASD response)

• Start from summary

• Workload grouping (CEC/LPAR/BATCH/ONLINE/CICS/DB2/MQ etc.)

• Trend analysis

• Peak vs. Average

• Variable or continuous growth

• Utilization pattern

• Normal day vs. month end/quarter end/year end

• Comparison of same day from week to week and same month over year to
year

• Normal vs. Abnormal

Tricks of performing data analysis

Mainframe resource utilization is extremely variable and changes every second. As I described, all zOS activities are recorded in the SMF records. Depending on
the mainframe configuration such as number of CECs, number of LPARs, the
products suites, number of tasks running in the system, enablement of collection
of groups of SMF records, and the SMF recording interval - the amount of data
collected could be excessive. So, the following are my recommendations for an
effective data analysis under CPO.

1. Set the SMF recording interval to 5 minutes, if possible. This will help to collect more granular data for analysis and result in more accurate CPO recommendations. Nowadays, DASD is very cheap and, with faster z machines, collection of SMF data at 5-minute intervals is not very expensive. Also, deployment of non-mainframe tools or platforms for SMF data processing will eliminate the stress on mainframe costs for SMF data processing.

2. For effective data analysis, focus on the context for which you are doing
the data analysis. And so, it is important to use the right data required for
the analysis.

a. If you are doing a performance data analysis for a specific task having an issue at a specific time, then use the data for this task for the period where you have the issue. Take the same data for other days for comparison.

b. If you are doing data analysis for the online period, you may split the window into two: prime online and normal online.

c. If you are doing data analysis for the batch window, you may split the
window into two: critical batch and non-critical batch.

d. You may split the workload to online and batch throughout the day
so that you can have a better picture of your online and batch
workload distribution.

e. Make special note of your housekeeping tasks such as systems full backup, incremental backup, DB2 log archive, HSM space management, etc.

f. If your charge is based on the R4HA MSU, then focus on the peak utilization window. Move all non-critical workloads running during the peak period to a low utilization window, if possible. Move some of the non-critical workloads or weekly tasks, and heavy CPU-intensive or database clean-up jobs, to the weekend.

g. If your charges are based on MSU-hour, then try to spread the workload to have a smoother systems utilization throughout the day without compromising performance of certain workloads in specific time windows. Here the focus is to reduce wastage as much as possible and try to keep the systems utilization below 90%.

h. When you do the step-growth analysis to identify granular growth (positive or negative), start from top to bottom. Compare CEC and LPAR utilization over the last one month, 3 months, six months or one year, depending on your established analysis approach. If you see an anomaly, then identify the variation in utilization in a workload running under the LPAR. Then deep dive to the task under the workload and finally to the time window. Verify if there were any problems recorded by the zOS systems programmers in the respective systems and time windows. If not, then you have to work with the respective SME or owner of the task for further analysis. It may be necessary to follow up with the vendor of the product for further assistance.

i. From the SMF records, you get data averaged over the SMF interval,
5 minutes, 10 minutes, 15 minutes etc. If you need to get more
granular data to learn the variation in usage during this interval, you
may take help from other sources like RMFMON III, OMEGAMON or
any other tool.

j. If you notice that RMFMON III data is vital for analysing some specific issues, please keep a backup of your RMFMON III VSAM datasets, as the data may get wrapped around soon depending on the number of datasets defined to store the historical data.

k. During your analysis, if you use different tools like SDSF, RMFMON
II, RMFMON III etc. please keep a snapshot of any interesting
observations, which will help you a lot during the later analysis.

l. For step-growth analysis, you may use the average utilization over:

i. SMF interval

ii. Hour(s)

iii. Online window

iv. Batch window

v. Whole day (24 hours)

vi. Weekdays (Monday to Friday)

vii. Weekends

viii. Month

ix. Quarter

x. Year

xi. Or other time intervals

m. For critical utilization like CEC, LPAR, WORKLOAD, specific tasks, etc., you may build the historical data over at least 13 months for easy trend analysis.

n. Make use of trend analysis, which I find is a very powerful method of identifying anomalies. You may have a running trend or a superimposed trend over a specific time window, for which you can develop your own techniques. Comparison of data is vital to identifying abnormalities.

o. You may use cumulative usage data such as CPU sec or MSU-hour, depending on the scenario.

p. Many a time, while doing utilization analysis, we get carried away by the CPU utilization data only. But we forget that the CPU utilization is driven by other factors and workload, such as the number of transactions in CICS, number of DB2 calls, number of batch jobs, amount of EXCP done in a batch program, MB/GB/TB of data written to tape, a looping task, and so on. So, you should perform a thorough analysis before reaching any conclusion or identifying the root cause.

q. You should not have any doubts in your mind after doing the analysis. You must have taken all related considerations into account and have proper justifications based on the data. As I mentioned, performance and optimization analysis is an art based on common sense. So, when you present your recommendations, you will get many questions and challenges, and you should be ready and equipped with your data and observations.

r. Never be in a hurry; take a reasonable amount of time to perform a thorough analysis. Ask for more time if required, but always keep in mind the time sensitivity of the problem: you must support the mainframe infrastructure management on very critical issues by providing your analysis very fast. Prepare charts, PowerPoint decks, etc. for your presentation.

s. Never hesitate to ask for help and assistance from other SMEs. zOS is an ocean, and we are not supposed to know everything, but many things are required to aid us in our analysis. People who work around us have many years of varied experience; they can assist us in many ways and provide invaluable guidance in performing our tasks better.

3. You must analyse the CF CPU utilization, subchannel utilization, Sync and
Async service time, Sync requests converted to Async (CHNGD) and take
appropriate actions for any abnormality against your defined threshold. You
must allocate sufficient Central storage to the CF LPARs and monitor the
amount of storage allocated to the structures. You must maintain enough
free storage in the LPAR to support structures moved from other CF LPARs
during issues, maintenance, and DR.

Note: Please keep special focus on structures moved from this CF LPAR
to other CF LPAR during issues, DR or CF maintenance and make sure to
move them back to the home LPAR, especially when moved to remote CF.

4. I have not touched much upon the issue relating to central storage.
Nowadays, central storage is cheap, and we can be generous in allocating
more central storage to LPARs to maintain a healthy 30 to 40% available
frame queue. However, if you have a central storage constraint, you must
monitor it closely and analyse the central storage usage very frequently.
You must keep an eye on the local page dataset usage and include it as
one of the items under daily health check.

5. I also have not discussed much on IO issues under DASD and Tape because channel speed is very high today and we provision adequate DASD and Tape storage to cater to our needs. But, as a part of CPO activity, we must track the DASD and Tape utilization data to ensure enough buffer is available for at least the next one year. We should not come to a situation requiring an emergency purchase of DASD or Tape storage. Please do keep an eye on the average DASD and Tape response time and IO volume from the RMF SUMMARY post processor report.
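The step-growth comparison described in item (h) above can be sketched as a baseline-versus-current check. The monthly averages and the 10% threshold are illustrative assumptions:

```python
# Compare the current month's average utilization against the baseline of
# the preceding months and flag a step change beyond a threshold.
monthly_avg = [1010, 1025, 990, 1005, 1180]   # MSU-hours/day; last = current month

baseline = sum(monthly_avg[:-1]) / len(monthly_avg[:-1])
current = monthly_avg[-1]
growth_pct = (current - baseline) / baseline * 100

THRESHOLD_PCT = 10   # flag anything beyond +/-10% as an anomaly to drill into
anomaly = abs(growth_pct) > THRESHOLD_PCT
print(f"Baseline {baseline:.0f}, current {current}, growth {growth_pct:+.1f}%")
```

A flagged month is the starting point for the drill-down: LPAR, then workload, then task, then time window.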

Major analysis outcome

• Optimization opportunities

• Short-term capacity forecast

• Long-term capacity forecast

• Hardware/software recommendations

• System configuration recommendations

• Management reports

Tools and data source
Here I describe some common tools used by the CPO function.

SMF

For CPO, we need the right data to perform the right analysis, which is mainly the CPU usage. zOS captures and collects the CPU utilization at various levels in the SMF records. The structure of each type of SMF record is complex by nature. So, we need a tool to format the records into usable data required for analysis. Many organizations use SAS/MXG to process the SMF data or use a tool on a non-IBM platform where the SMF data is sent for formatting.

What is SMF

• System Management Facilities (SMF), a component of zOS, collects and records system and job-related information

• Each record has a standard heading format

• Each record is identified by a one-byte record number (ranging from 0 to 255) – 0-127 is reserved for IBM use

• Some records have subtypes (e.g., a type 30, subtype 4; or a 30(4); or a 30.4)

• SMF files can exceed 1TB of data daily

• 80-90% comes from subsystems data (CICS, DB2, IMS, etc.)

• Raw data structure is complex and has to be converted into useful information

• Process is time and resource intensive
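As a sketch of how the one-byte record number can be read from a raw record, here is a hedged example. It assumes the standard SMF header layout (RDW, flag byte, record type, time, date, EBCDIC system ID); the sample record is fabricated:

```python
import struct

def smf_record_header(record: bytes):
    # Standard SMF header: RDW (2-byte length + 2-byte segment), flag (1),
    # record type (1), time (4), date (4, packed), system ID (4, EBCDIC).
    length, _seg = struct.unpack(">HH", record[0:4])
    rectype = record[5]                          # one-byte record number
    sid = record[14:18].decode("cp037").strip()  # EBCDIC -> ASCII system ID
    return length, rectype, sid

# A fabricated 18-byte header for a type 30 record from system 'SYS1'
hdr = (struct.pack(">HHBB", 18, 0, 0x5E, 30)
       + b"\x00" * 8                 # time + date fields, zeroed for the sketch
       + "SYS1".encode("cp037"))
print(smf_record_header(hdr))        # (18, 30, 'SYS1')
```

Real processing would then dispatch on the record type (and subtype) to a type-specific field mapping.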

Major Uses

• Accounting and Chargeback (Internal or external customers)

• Performance tuning (e.g., devices, jobs, network, data sets, WLM)

• Capacity planning

• Optimization activities

• Managing and reporting resources usage – CPU, storage, memory, network

• Configuration analysis

• Problem identification

• Jobs scheduling

• Managing systems security

• Historical analysis and trend

• Management reporting

• Numerous other reports

• And many more.

Who Needs the Data

• IT Manager (summary of hardware usage)

• Operation manager (configuration changes, exceptions, batch abends, etc.)

• Infrastructure manager (hardware usage, system, and applications performance, etc.)

• CPO team (daily, weekly and monthly trends, WLC reporting, optimization)

• SMEs (z/OS, CICS, DB2, IMS, storage detailed reporting)

• Development team (test and production reporting)

• Others….

Reporting

• The number and type of reports can be quite high, so it is important to decide which reports are important for the installation

• CPU is the most critical and expensive resource and all teams including
the senior management are keen to have this resource usage available to
them always

• The systems configuration data is of course very important

Sources of CPU Information in SMF

Please refer to a presentation by Cheryl Watson.

https://share.confex.com/share/119/webprogram/Handout/Session11309/11309.pdf

• RMF CPU Records (Type 70 - CEC CPU usage, LPAR usage, zIIP usage,
zAAP usage, IFL usage, CF usage)

• RMF Workload Activity Records (Type 72 - CPU usage by service class period)

• SMF Address Space Activity (Type 30 - CPU usage by address spaces, including cross-address space and cross-system usage)

• DB2 Records (Types 100-102)

• CICS Records (Type 110)

• MQ Records (Type 115)

• WAS Records (Type 120)

• WebSphere Message Broker (Type 117)

• HTTP Server (Type 103)

• Hardware (Type 113)

• RMF Monitor II (Type 79 - CPU usage by address spaces and enclaves)

• TSO/E (Type 32)

Data Collection

• Decide what record types and subtypes we need

• Select only the needed fields (SMF records may include hundreds of fields)

• Use an efficient processing approach (read input data once, build a database, avoid duplications, etc.)

• Perform only the needed aggregations (at the smallest possible interval)

• Offload from the mainframe and use less expensive resources and tools
with equal or better performance

• Parsing 1TB or more of SMF data daily on a non-mainframe platform is more efficient

Presentation

• Make a fully automated end-to-end process

• Make processing transparent with no user parameters

• Decide generic thresholds that fit most systems and sites

• Make reports user friendly and well documented

• Easy to navigate using a top-down approach with drill-down paths

Present the CPU utilization in MIPS, MSU, percentage of CEC and number of
standard processors (CPU). The information should be drilled down for further
analysis. The exceptions may be highlighted in red when:

• The CPU used by the LPAR is greater than the allocation

• The CPU used at CEC level exceeds the defined threshold

• The uncaptured CPU utilization is greater than the threshold

• The difference between MVS BUSY utilization and CPU BUSY utilization
is greater than the threshold

• Easy trending and comparison facilities for efficient analysis

• Should support and be displayed in any standard browser

• Reports should be portable to any platform for viewing

• Reports to be easily archived for future on-demand reviews, research and auditing

• Reports’ contents should be accurate

• Information provided in reports has to be fully available for printing, cutting and pasting, and exportable to MS Excel or other external applications

• Easy to find and customize the needed report(s)

• Group reports used to perform specific activities (daily check, workload performance analysis, trending, etc.)

• Allow users to create personal report lists customized for their activities

• Distribute exceptions to specific people or groups
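The exception highlighting described above can be sketched as a simple threshold comparison; the metric names and limits below are illustrative assumptions:

```python
# Installation thresholds versus one interval's measurements (percentages).
thresholds = {"cec_busy_pct": 90.0,
              "uncaptured_pct": 10.0,
              "mvs_minus_cpu_busy_pct": 5.0}
interval = {"cec_busy_pct": 93.5,
            "uncaptured_pct": 6.2,
            "mvs_minus_cpu_busy_pct": 1.1}

# Collect the metrics that breach their limit and should be highlighted in red.
exceptions = [name for name, limit in thresholds.items() if interval[name] > limit]
print(exceptions)
```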

Tools and products processing SMF data

I have prepared a list of tools and products available in the market which help you
in processing SMF data and present you with the information you need for your
analysis. The list is not exhaustive and there could be many more tools available
to perform your task.

• SAS/MXG

• IZDS (IBM Z Decision Support)

• IZPCA (IBM Z Performance and Capacity Analytics)

• IBM SMF Explorer

• Spectrum SMF Writer

• EasySMF

• SMT data

• PerfTechPro zAnalytics

• SAS based (MXG, ITRM, CA-MICS)

• IBM TDSz (Tivoli Decision Support for z/OS)

• IBM REXX

• IBM DFSORT

• Develop program using IBM PLI

• In-house Solutions

• BMC Visualizer

• IBM RMF

• Several smaller vendors …

A few observations on the acquisition and maintenance of these tools.

• All of these tools create good DBs which can be used as inputs to other
applications as well as providing immediately usable information

• Most carry a high price tag

• Overhead, redundancy, and delays in supporting new metrics in the tool

• Complex or cumbersome management

• Upgrades can be both costly and difficult

• Staffing levels may be high

• For some only a total replacement will work

• For some a simple in-house developed tool will help

RMF Data and SMF Records

• SMF records are generated by RMF Monitor I, Monitor II, Monitor III

• Records can now be written out either to the MANxx data sets or, beginning with z/OS V1R9, to the MVS System Logger

• RMF Postprocessor reads the SMF type 7x records to generate the RMF
CPU Activity Report

RMF

RMF is the main tool under zOS which is considered ‘The Best Friend’ of CPO. It
is difficult to provide a summary of RMF. There is plenty to read and you can
enhance your skills by continuously using this tool. For CPO, we need to
remember the rich capabilities of RMF.

• RMF Data Gatherer (RMFMON I, RMFMON II, RMFMON III)

• RMF Sysplex Data Server (SDS) and APIs

• RMF Spreadsheet Reporter

• RMF PM (Performance monitoring)

• RMF Exits (WTO alerts)

One thing to note is the SMF interval set in the installation. For effective analysis and reporting, I suggest configuring the same SMF interval in all LPARs in the installation.

As you know, RMF post processor reports are invaluable for CPO activities. But it depends on the availability of the right SMF records up to the right time. One of the most challenging tasks for me has been to generate post processor reports using current data. I have used the following two methods to achieve this.

1. Generating postprocessor reports using data in the SDS. This data is very limited and gets wrapped around very fast, so you must be really quick.

2. Switch the MAN dataset and use the current SMF file when downloaded.
But you must wait till the end of the current SMF interval before switching.
You have to find similar ways to get data from SMF logstreams.

WLM

zOS executes multiple tasks concurrently. These tasks require system resources.
Workload Manager (WLM), a component of the z/OS system, monitors and
distributes the system resources to the competing workload depending on the
goals defined for it.

You will find lots of technical details and descriptions in IBM manuals, Redbooks,
Share presentations, training sites. So, I will not go into the details of what is
WLM, how it works, and how to define the service definition.

IBM Manual, ‘MVS Planning: Workload Management’ is a good book to refer to for WLM definitions.

WLM service definition at a glance - this is a very old picture but is still very
relevant. (Picture taken from a presentation by Robert Vaupel in 2007)

Here is a summary of workload importance and dispatching priority for reference.

Please refer to the following link for more details, from where I have taken the
picture.

https://share.confex.com/share/118/webprogram/Handout/Session10888/WLM%20Basics.pdf

Here are a few operational tips from me:

a. Follow a good naming convention while defining the service definition. E.g.,
name all service classes with first two characters ‘SC’ such as ‘SCxxxxxx’,
report class with ‘RC’ such as ‘RCxxxxxx’ and so on.

b. I do not make any changes temporarily, unless I am addressing some very critical issues. Rather, follow a discipline for implementing any change, even if it is very minor.

c. Always keep a backup of your service definition (File → Option 4) before making any updates/changes.

• I create a backup file to back up the current definition e.g.

<HLQ>.<PLEXNAME>.<Date>.<Time>.WLM

• I create another backup file to back up the current definition, which I update for changes, e.g.

<HLQ>.<PLEXNAME>.<Date>.<Time>.WLM.CURRENT

d. You may keep a history of your updates to the service definition. Use
‘Notes’ drill-down on the menu or any of your convenient tool (Notes, Word
file, Excel file, MVS file etc.).

e. While using Resource Group Type 1, be careful as the scope is the Sysplex, and the WLM SU/Sec rating for the different LPARs in the Sysplex is based on the number of LPs allocated to the respective LPARs.

For example, under a z14 7xx model CEC, if you allocate 2 LPs to an LPAR, WLM will consider it to be a model 702 and the SU/Sec is 88397.7901. And after some time, if you change the #LPs to 3, WLM will consider the model to be a 703 and the SU/Sec will change to 86021.5054. You can always find the SU/Sec rating of the LPAR in a specific interval in the WLM activity report, printed at the end of any interval.

SYSTEMS

---ID--- OPT SU/SEC CAP% --TIME-- INTERVAL ---ID--- OPT SU/SEC CAP% --TIME-- INTERVAL

P01 00 88397.7 100 15.01.00 00.59.03 P02 00 86021.5 100 15.01.00 00.59.03

f. While updating classification rules, place your entry at the right place for
level-1 entries as WLM will select the first matched classification rule in the
list. Your resource may be already included under a group or generic
definition above it.

g. If not sure, I always include the very specific classification rules at the top
of the list and I would never sort the classification list.

h. I always update the goal based on what is achieved in my system. I do not put an arbitrary high value and leave it like that. I generate the WLM activity report from time to time and tune the goal based on what is actually achieved in the system.

i. I implement the classification for CICS transactions.

j. I create a report class for each and every system task like GRS, JES2, MASTER, etc., which may amount to around 40 to 50 report classes.

k. I define ‘CPU Critical’ and ‘Memory Critical’ wherever needed.

l. Try to avoid creating scheduling environments, where possible. I do not want to create any system affinity in a Sysplex.

m. Try to avoid capping unless you are a service provider where you want to
provision what the customer needs in a shared environment.

n. I issue ‘D WLM’ command before and after the installation of definition and
activation of policy for a record in SYSLOG.

o. Do not define two service classes with the same importance and goal. I have only done this for batch jobs running under JES-managed initiators and WLM-managed initiators to have the same goal.

p. Keep the service definition clean and document the justifications for goal
changes as applicable.

Important Source of Information for CPO

CPO needs a lot of information for its analysis. It is virtually impossible to list
all the information you need for your day-to-day work. So, I look for the
information I need in specific situations and contexts. You must develop your own
way of finding the right information, and its sources, as quickly as possible.
During my work under CPO, I have referred to a lot of information, and the
sources could be counted in the hundreds. You need to read a lot and refer to
many documents while working on specifics such as optimized LPAR
configuration, SMT, WLM, capping, performance, RMF, SMF, optimization, and
capacity forecasts. I refer to IBM manuals, IBM Redbooks, SHARE
presentations, Cheryl Watson Tuning Letters, and lots of papers, presentations
and articles found on the internet. I compile them and store them on my laptop
for later reference.

However, I will provide a flavor of typical sources from where I get invaluable
information for my CPO work. I will copy some screenshots from the IBM ‘RMF
Report Analysis SC34-2665-01’ manual for illustration.

You have to develop an analysis strategy to navigate through and refer to various
sources of information while working on CPO activities and initiatives.

Post processor ‘Summary’ Report

The summary report has been an important source of information to determine
the workload distribution pattern in an LPAR. I have built a historical database for
this and have a look at the trend almost every day. It has helped me to identify
many opportunities for performance improvement. For example:

• I found an increase in DASD average response time in one LPAR, which
was caused by the deactivation of a few paths because of a hardware
error. EMC took a few days to identify the hardware error and fix it.

• I also found that TSO users never logged off and remained in the
system, which gave me the input to implement a TSO session timeout
for idle sessions.

• It has helped me to get a trend of maximum and average JOB, TSO and
STC tasks in the system and the variation in these, especially in the test
systems.

• It has helped me to find a pattern in PAGING and the times when it
occurred.

Picture source: RMF Report Analysis SC34-2665-50 page 513

RMFMON III ‘CPC’ Panel

You will find the LPAR configuration in the CEC at a glance. I have highlighted
in RED some data that I frequently look at.

RMF V2R4 CPC Capacity Line 1 of 31


Command ===> Scroll ===> CSR
Samples: 60 System: P01 Date: 07/27/22 Time: 09.35.00 Range: 120 Sec
Partition: xxxxxxx 3906 Model 725 Boost: N
CPC Capacity: 3644 Weight % of Max: **** 4h Avg: 179
Image Capacity: 583 WLM Capping %: 0.0 4h Max: 288 Group: N/A
MT Mode IIP: 2 Prod % IIP: 79.7 AbsMSUCap: N Limit: N/A

Partition --- MSU --- Cap Proc Logical Util % - Physical Util % -
Def Act Def Num Effect Total LPAR Effect Total
*CP 40.0 1.5 61.5 63.1
LPAR1 0 5 N N N 1.0 3.2 3.3 0.0 0.1 0.1
LPAR2 0 1 N N N 1.0 0.9 1.0 0.0 0.0 0.0
LPAR3 0 7 N N N 1.0 4.6 4.8 0.0 0.2 0.2
LPAR10 0 8 N N N 1.0 5.6 5.7 0.0 0.2 0.2
LPAR11 0 30 N N N 2.0 10.4 10.6 0.0 0.8 0.8

PHYSICAL 0.8 0.8


*IFL 4.0 0.6 56.4 57.1
LPARX N N N 2.0 100 100 0.0 50.0 50.0
LPARY N N N 2.0 12.8 13.4 0.3 6.4 6.7
PHYSICAL 0.4 0.4
*IIP 17.0 0.8 31.0 31.8
LPAR1 N N N 1.0 0.7 0.8 0.0 0.1 0.1
LPAR2 N N N 3.0 17.4 17.6 0.1 8.7 8.8
LPAR10 N N N 3.0 20.2 20.4 0.1 10.1 10.2
PHYSICAL 0.4 0.4

RMFMON III ‘SYSINFO’ Panel

This gives me a good utilization summary.

• Enclave usage = Eappl% - Appl%

• Uncaptured usage = Avg CPU Util% - Eappl%

• Number of total tasks in the system under ‘---Users---’

• MVS Busy (Avg MVS Util %): z/OS has dispatched work on a logical CP
eligible to be executed

• LPAR Busy (Avg CPU Util %): PR/SM has dispatched work on a physical
CP so that it can be executed

• MVS Busy > LPAR Busy when the workload exceeds the LPAR weight and
surplus CPU is unavailable from other LPARs, e.g., because of (soft) capping
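The derived metrics above can be sketched in a few lines; the sample percentages below are assumed values for illustration, not from a real system:

```python
# Derive enclave and uncaptured CPU usage from the SYSINFO fields
# described above. All inputs are percentages of CPU utilization.

def sysinfo_breakdown(avg_cpu_util, eappl, appl):
    enclave = eappl - appl             # Enclave usage = Eappl% - Appl%
    uncaptured = avg_cpu_util - eappl  # Uncaptured = Avg CPU Util% - Eappl%
    return enclave, uncaptured

enclave, uncaptured = sysinfo_breakdown(avg_cpu_util=85.0, eappl=78.0, appl=60.0)
print(enclave, uncaptured)  # 18.0 7.0
```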

Picture source: RMF Report Analysis SC34-2665-50 page 180

RMFMON III ‘PROCU’ Panel

I find the top CPU consumers in a system in any interval.

Picture source: RMF Report Analysis SC34-2665-50 page 148

SDSF ‘JM’ line command

This panel has been very helpful in dynamically tracking storage usage. Once,
our OPC task suddenly started getting frequent S878 abends. We increased the
region size, kept monitoring the storage usage using the SDSF ‘JM’ line
command against the OPC task, and recycled the task before it hit the S878
abend.

Display Filter View Print Options Search Help


----------------------------------------------------------------------------------------------------------------
SDSF JOB MEMORY P01 ASID 0297 OPC JOB40758 LINE 1-20 (50)
COMMAND INPUT ===> SCROLL ===> CSR
PREFIX=* DEST=(ALL) OWNER=* SYSNAME=*
NP TYPE SP Key Fix FP Total Total-24 Total-31 Total-64 Count LargestA LargestF Frag
PRIVATE 0 8 NO YES 60KB 12KB 48KB 36 32KB 4080 40
PRIVATE 1 8 NO YES 668KB 328KB 340KB 196 328KB 3160 221
PRIVATE 2 8 NO YES 5572KB 496KB 5076KB 1936 496KB 2800 1940
PRIVATE 7 8 NO YES 2052KB 2052KB 1 2052KB 4016 2
PRIVATE 13 8 NO YES 72KB 4KB 68KB 4 68KB 4080 9
PRIVATE 14 8 NO YES 20KB 20KB 9 8KB 2768 10
PRIVATE 41 8 NO YES 4KB 4KB 1 4KB 4000 2
PRIVATE 45 8 NO YES 12000KB 12000KB 94249 2728KB 2824 115723
PRIVATE 50 8 NO YES 1680KB 1680KB 676 180KB 676
PRIVATE 59 8 NO YES 5288KB 5288KB 126736 72KB 3960 127021
PRIVATE 62 8 NO YES 13572KB 32KB 13540KB 9216 3600KB 3376 21871
PRIVATE 79 8 NO YES 12KB 12KB 1 12KB 1
PRIVATE 86 8 NO YES 792KB 792KB 16 420KB 16
PRIVATE 91 8 NO YES 52KB 52KB 1 52KB 3992 2
PRIVATE 101 8 NO YES 4KB 4KB 1 4KB 3424 2
PRIVATE 127 8 NO YES 2220KB 2220KB 184041 60KB 184041
LSQA 205 0 DREF NO 688KB 688KB 625 88KB 4KB 664
LSQA 215 0 DREF YES 288KB 288KB 100 80KB 3808 102
LSQA 225 0 YES YES 76KB 76KB 36 32KB 3776 47
PRIVATE 229 0 NO YES 1988KB 1988KB 78 512KB 3984 83

Postprocessor ‘WLM Activity’ report

This has helped me a lot in performing deep-dive analysis of performance-related
issues. The entire report has helped me resolve multiple performance
issues. A few important fields to look at are:

• Transactions – Avg, Ended, End/sec

• Service – Where the CPU is spent

• Service Time – CPU granularity

• The Ex Vel% achieved and the Perf Index (PI)

• The response time distribution

• DASD I/O – to deep dive on IO issues

• Crypto usage and delays

Picture source: RMF Report Analysis SC34-2665-50 page 470

RMFMON III Enclave panel

The ENCLAVE report provides detailed information about the activities of
enclaves. It has helped a lot in finding high utilization in the DB2 DDF workload.
You can also get a lot more information about enclaves under SDSF.

Picture source: RMF Report Analysis SC34-2665-50 page 94

Others

I could keep writing a lot more, but I have just given some examples. You can
document an exhaustive list based on your analysis requirements.

Create my own database for critical data

I have created some databases for quick and easy reference and analysis. In one
of my organizations, we developed one in-house using SAS/MXG to extract the
required data from SMF records and then developed a web-based application on
the Windows platform to present the data. In other organizations, we used a tool
to get the required information. The metrics and amount of data need to be
decided based on your analysis approach. I prefer historical data over the last
13 months to understand what the status was in the same month last year and
the trend over the last 12 months.

In process terms, this is sometimes called a ‘Capacity Management Information
System (CMIS)’, which includes all the data necessary to perform CPO activities.

Analytics Data and Metrics Domain What is presented?


CEC – Physical capacity Data Center History of HW upgrades
Model
#CPU - All types
Total Memory

LPARs and placement Data Center History of configuration and


#LPARs in each CEC changes
Logical Processors LPAR History of configuration and
#LPs - All CPU types changes
System OS Upgrades LPAR History of configuration and
Setup zOS Version changes
Upgrade dates in each
LPAR
Resource allocation LPAR History of configuration and
#CPU - All types changes
Memory
DASD Data Center History of storage
#Volumes configuration and changes
Models
CPU Utilization - All types CEC, LPAR, Current and history of
(CPU sec, MSU, MIPS) SYSPLEX utilization
CEC
CPU and LPAR Includes facility to zoom the
Memory SYSPLEX data to specific time window,
utilization WORKLOAD online window, batch window
SERVICE CLASS and comparison over multiple
REPORT CLASS days

Memory utilization LPAR
Allocated
Used
Uncaptured CPU LPAR Current and history
Disk utilization SYSPLEX, Current and history
LPAR
Channel path utilization SYSPLEX,
LPAR
IO Rate SYSPLEX,
LPAR
DASD
IO response SYSPLEX,
LPAR
Tape Rate SYSPLEX,
LPAR
Top high utilized volumes SYSPLEX,
LPAR
Systems Summary Snapshots LPAR Current and History
Summary (as provided in RMF post
processor summary report
and more, if necessary)

#CICS Transactions SYSPLEX, Current and history


LPAR
#Batch jobs SYSPLEX,
LPAR
Top 50 batch jobs SYSPLEX,
Generic LPAR
statistics per Top 50 CICS transactions SYSPLEX,
day LPAR
Top 20 STCs SYSPLEX,
LPAR
Top 20 TSO users SYSPLEX,
LPAR
MIPS forecast for month end Data Center
CPU (CPU sec, MSU, MIPS) SYSPLEX, Current and history
LPAR
#IO SYSPLEX,
LPAR
Workload SYSPLEX,
LPAR
STC / TSO
Service Class SYSPLEX,
(Interval
LPAR
record)
Report Class SYSPLEX,
LPAR
JES Job # SYSPLEX,
LPAR
SYSPLEX,
LPAR name LPAR

SYSPLEX, Current and history
CICS Region Name LPAR
SYSPLEX,
CICS Region level CPU LPAR
transaction #Total transactions SYSPLEX,
(Interval LPAR
record) Top 50 Transactions and SYSPLEX,
CPU (CPU sec, MSU, MIPS) LPAR
for each transaction
CPU (CPU sec, MSU, MIPS) SYSPLEX, Current and history
for each job LPAR
Start time SYSPLEX,
LPAR
End time SYSPLEX,
LPAR
Elapsed time SYSPLEX,
LPAR
Batch Job Workload SYSPLEX,
LPAR
Service Class SYSPLEX,
LPAR
Report Class SYSPLEX,
LPAR
JES Job # SYSPLEX,
LPAR
SYSPLEX,
LPAR name LPAR
CPU SYSPLEX, Current and history
Other LPAR
workloads
Database Memory SYSPLEX,
Messaging LPAR
Was #Transactions if applicable SYSPLEX,
Systems LPAR
??? WLM Goal SYSPLEX,
LPAR
CPU Business Current and History
Business Entity
workload #Transactions / JOBs Business
(If defined / Entity
Grouped) SLA (Service Level Business
Agreement) Entity
CPU CF Current and History
Coupling Subchannel utilization CF
Facility Service time (Sync / Async) CF
CHNGD CF

Capacity monitoring
Resource capacity utilization monitoring is an important task under CPO. In
general, it is a distributed activity covering the zOS systems programmers, the
CPO team, the operations team, and the development community. It is a very
normal activity in almost all organizations. But what we lack is real-time
performance monitoring, which I will explain in detail here.

Real Time Performance (RTP)

Data in Real Time

The biggest challenge in my career has been to get data in real time. I face
multiple CPO issues every day, and I always look for the most current data and
quite often need some history. I can get some snapshots of current data from
various tools like OMEGAMON, RMFMON, SDSF and other non-IBM products,
but they have very limited historical data, and I have to navigate through multiple
panels to view and collect some of the data. This often does not give me the clarity
that I am looking for, and I always want more data as quickly as possible. One
of the techniques I used is to make a list of the most frequently used data and
store the history in an Excel file or database. But this also has its own challenges,
as Excel has limitations in processing larger amounts of data. So, the best way I
had explored was to store the data in a database and make use of modern web
tools to extract, process and present it quickly in different formats. Currently,
we have many tools available to make our life easy, provided our organization is
ready to invest in them. This undoubtedly solves the problem of getting historical
data to some extent; however, getting real-time data in a central place still
remains the challenge.

In one of my previous organizations, we had developed a tool using REXX to
receive the data from RMF Monitor III, especially to catch abnormally high CPU
usage by an address space and send a mail to the CPO team for action when the
defined threshold was exceeded. The main objective was to trap system tasks
running at high priority (VTAM and TCPIP in that period) impacting the online
workload (CICS and DB2).

In another organization, we developed code on the mainframe in Assembler to
collect the required data in real time at a very low cost (in CPU cycles) and feed
it (some data every minute and some every five minutes) in XML format to a
front-end Linux server, which processed it using Elasticsearch and Kibana for
display. Many dashboards with comparison facilities were developed for effective
monitoring. Using this, around 60 to 70 percent of the monitored real-time
capacity and performance issues were identified within 5 to 10 minutes of
occurrence, and appropriate action was taken to avoid any major issues. The
metrics we considered are: CEC and LPAR CPU utilization, CICS transaction
counts and CPU utilization, subsystem CPU utilization, etc. (other metrics can be
collected as per need). Alert emails were automatically triggered when a
defined threshold was breached, for example when the average LPAR utilization
was consistently above 90% for a period of 15 minutes, or a task was using more
than one engine's worth of capacity for a period of 10 minutes. We also collected
some data from RMF III through APIs.
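The alerting rule described above (for example, average LPAR utilization consistently above 90% for 15 minutes) can be sketched as a simple sliding window. The function name and sample values are illustrative assumptions, not the actual implementation:

```python
from collections import deque

# Fire an alert once every sample in the rolling window exceeds the
# threshold, mimicking "consistently above 90% for 15 minutes" with
# one utilization sample per minute.

def make_alert_checker(threshold=90.0, window=15):
    samples = deque(maxlen=window)
    def check(util_pct):
        samples.append(util_pct)
        return len(samples) == window and all(s > threshold for s in samples)
    return check

check = make_alert_checker(threshold=90.0, window=3)  # 3-minute window for the demo
print([check(u) for u in [95, 97, 88, 92, 96, 99]])
# [False, False, False, False, False, True]
```

In practice, the check would be driven by the per-minute samples from the collector, and a True result would trigger the alert email.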

The Assembler code developed provides a lot of rich functionality. It provides
the entire mainframe configuration at a glance, such as LPARs, weights and LP
allocations, current usage, snapshots of CICS transaction counts with response
times, and lots of other deep-dive information on DB2, CICS, MQ and more. And
the code runs with very little usage of MSU-hours.

To me, this is undoubtedly a great innovation, and I wonder if such a tool could
be developed and made available in the market for use by most customers.

Here, I am giving a flavor of only a few examples that we generated and
presented in real time. As I see it, we have the opportunity to explore, generate
and present a lot of information in real time, in user-friendly and easy-to-understand
dashboards. Our operations team works 24 x 7 with many displays
available in the command center. They are our perfect partners from whom we
can seek help for real-time monitoring. With a little effort at our end and a little
education for our operations specialists, we can make use of the information to
reap the benefits now and in the time to come.

Example 1

The following information shows a very high-level snapshot of the CECs’ MIPS.
This data is refreshed every 5 minutes.

This gives a snapshot of the current status of all CECs and the utilization can be
plotted in real time.

Time: 12 December 2022, 12:45


CEC1
3906-724 GP MIPS zIIP MIPS IFL MIPS
Activated 30049 18320 9160
Consumed 27500 6500 1200
Utilization % 92% 35% 13%
Unused 2549 11820 7960

CEC2
3906-738 GP MIPS zIIP MIPS IFL MIPS
Activated 40324 18320 9160
Consumed 32500 9000 1200
Utilization % 81% 49% 13%
Unused 7824 9320 7960
Reserved 4007 0 0

Example 2

The following information shows a very high-level snapshot of some of the
LPAR information in a Sysplex, refreshed every 5 minutes.

Time: 12 December 2022, 12:45


Sysplex: ABC
LPAR   IBM GP CPU Speed  Curr GP MIPS  Curr zIIP MIPS  GP CPU % Usage  R4HA MSU  Curr MSU  Avail Frame Queue  CS Page DS used  CICS Trans/s
LPAR1  1,220             1.622         2.741           71%             759       1051      18079539           0,00%            1.153
LPAR2  1,550             1.984         3.636           74%             844       1000      22590472           0,00%            1.165
LPAR3  1,220             1.607         2.16            65%             629       889       21846182           0,00%            1.211

Please note that ‘IBM GP CPU Speed’ is the IBM-rated GP MIPS per engine
based on the model, while ‘Curr GP/zIIP MIPS’ is the actual MIPS output per
engine at this time.

SMF record 113 provides the necessary information to compute the current
MIPS output. To compute MIPS, divide the CPU speed, i.e., the CPU cycles per
second of your CEC, by the number of cycles per instruction (CPI) and then
divide by 1 million.

For example, if we are using a z15, which has a CPU speed of 5.2 GHz (S), MIPS
is calculated as follows based on the information collected from SMF 113 for a
specific interval.

CPU cycles in the interval (C) = 1,247,568,108,281
Instructions executed in the interval (I) = 440,589,240,343
Cycles per instruction (CPI) = C / I = 2.83
MIPS = S / CPI / Million = 5.2 * 1000000000 / 2.83 / 1000000 = 1837.46
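The same calculation can be expressed as a small function; the counter values below are the figures from the text:

```python
# Compute CPI and effective MIPS from SMF 113 counter deltas for an interval.

def mips_from_smf113(cpu_hz, cycles, instructions):
    cpi = cycles / instructions      # cycles per instruction
    mips = cpu_hz / cpi / 1_000_000  # effective MIPS at this CPI
    return cpi, mips

cpi, mips = mips_from_smf113(5.2e9, 1_247_568_108_281, 440_589_240_343)
print(f"CPI={cpi:.2f} MIPS={mips:.2f}")
# CPI is ~2.83; MIPS is ~1836.4 exactly, or 1837.46 when CPI is rounded to 2.83
```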

You may consider adding many more information to the above table.

Example 3

The following example shows the display of configuration in a CEC and the
current status. This information is refreshed every 5 minutes.

Time: 12 December 2022, 12:45


CEC: CEC1 3906-719
GP MIPS : 24846 MSU : 2953
Lpar      Online GP  Weight  Wt MIPS  Max MIPS  Curr MIPS  Cur MSU  CSTOR MB  PRC VH  PRC VM  PRC VL
DRLP1 0 0 0 0 0 0 0 0 0 0
DRLP2 0 0 0 0 0 0 0 0 0 0
DUMMY1 0 0 0 0 0 0 0 0 0 0
ICF1 0 0 0 0 0 1 49152 0 0 0
ICF2 0 0 0 0 0 1 8192 0 0 0
zOS1 7 380 9441 9154 8173 971 368640 7 0 0
zOS2 2 8 199 2615 58 7 10240 0 1 1
zOS3 8 446 11081 10461 7954 945 368640 8 0 0
zOS4 2 62 1540 2615 2644 314 329728 0 2 0
zOS5 2 7 174 2615 55 7 10240 0 1 1
zOS6 2 7 174 2615 249 30 69632 0 1 1
zOS7 4 71 1764 5231 1235 147 135168 0 2 2
zOS8 2 10 248 2615 229 27 135168 0 1 1
zOS9 2 3 75 2615 37 4 10240 0 1 1
zOS10 1 6 149 1308 13 2 16384 0 1 0
PHYSICAL - - - - 261 - -
TOTAL 32 1000 - 41846 20908 2456 1511424

You may use this dashboard to track any changes to weights and #LPs in real
time and trigger alarms or mails for any changes to the configuration.

Many times, we create DUMMY LPARs to reserve excess capacity, if we have
any, either through advance purchase or capacity released due to optimization
or the sunset of applications. For example, if we have purchased 2 additional GP
engines to meet a very specific demand over a short period of the year, then
we would like to reserve them by dedicating the 2 GP engines to a DUMMY LPAR
and release them only during the specific short time window. This will help us
control the R4HA MSU and save on the monthly maintenance charge. However,
an alternative to advance purchase is On/Off Capacity on Demand (OOCoD).

Example 4

The following example shows the CICS transaction rate and average response
time of the most critical transactions in the agreed SLA. This information is
refreshed every 5 minutes.

Time: 12 December 2022, 12:45


Sysplex: ABC

Transaction  Trans/sec  Avg Resp time (ms)  SLA (ms)
TRN1         60         400                 500
TRN2         80         250                 500
TRN3         45         350                 500
TRN4         111        450                 500

The above will help us learn of any SLA breach and investigate at our end
before the business entities or problem management knock at our door with
response time issues. As I have seen, nowadays business unit helpdesks use
very smart tools at their end to track business transaction response times.

Example 5

The following example shows the R4HA MSU tracking in real time. This
information is refreshed every 5 minutes.

Time: 12 December 2022, 12:45


Environment  Current MSU  Current R4HA MSU  R4HA MSU Target
TOTAL        5066         5760              5100
Prodplex1    4222         5002              4050
prodplex2    250          277               400
Prodplex3    275          180               300
Test         500          545               600
Sandbox      45           45                50
Others       nnn          nnn               nnn

Highlight legend: =<90% Green, >90% & <95% Yellow, =>95% Red

We get a lot of information just by looking at the screen. Nowadays, we make
effective use of CPU resources through a shared configuration of LPARs in a
CEC, without dedicated processors or resource capping. So, the RED color for
some specific Sysplexes is perfectly normal, as they are using more when other
LPARs are not using their guaranteed CPU, and they do not impact the overall
MSU usage. But if the TOTAL usage is RED, then it is critical to investigate
what is happening and take appropriate action. In this situation, we have many
times taken action to shut down Sandbox LPARs, shut down some housekeeping
tasks, stop HSM space management and, in the worst case, activate On/Off
Capacity on Demand (OOCoD) to address the high demand.
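The highlight rule from the legend above is trivial to codify; this sketch assumes the percentage is R4HA MSU relative to its target:

```python
# Classify R4HA MSU usage against the target, following the legend:
# <=90% Green, >90% and <95% Yellow, >=95% Red.

def r4ha_highlight(r4ha_msu, target_msu):
    pct = 100.0 * r4ha_msu / target_msu
    if pct <= 90.0:
        return "Green"
    if pct < 95.0:
        return "Yellow"
    return "Red"

print(r4ha_highlight(5760, 5100))  # TOTAL row: ~113% -> Red
print(r4ha_highlight(277, 400))    # prodplex2: ~69% -> Green
```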

If your billing is based on R4HA MSU, there are tools available in the market,
such as BMC Compuware ‘ThruPut Manager’, to control your R4HA MSU peak,
provided you are ready to invest in the software.

Example 6

The following example shows the daily MSU tracking.

Daily MSU-hour tracking - June 2022


Daily Accumulated
Date Budget Actual % Budget Actual %
2-Jun-22 95291 103514 109% 95291 103514 109%
3-Jun-22 99798 103860 104% 195089 207374 106%
4-Jun-22 94463 90169 95% 289553 297544 103%
5-Jun-22 68065 63675 94% 357618 361219 101%
6-Jun-22 90140 85305 95% 447758 446523 100%
7-Jun-22 95383 91187 96% 543141 537710 99%
8-Jun-22 93083 96246 103% 636224 633956 100%
9-Jun-22 93452 94685 101% 729676 728642 100%
10-Jun-22 97959 99878 102% 827635 828519 100%
11-Jun-22 82506 81259 98% 910140 909779 100%
12-Jun-22 56109 53445 95% 966249 963223 100%
13-Jun-22 88301 91262 103% 1054550 1054485 100%
14-Jun-22 95383 99687 105% 1149933 1154172 100%
15-Jun-22 93083 101637 109% 1243016 1255809 101%
16-Jun-22 93452 95646 102% 1336468 1351454 101%
17-Jun-22 97959 99977 102% 1434426 1451432 101%
18-Jun-22 72388 86011 119% 1506814 1537443 102%
19-Jun-22 45990 55241 120% 1552804 1592684 103%
20-Jun-22 88301 93633 106% 1641105 1686317 103%
21-Jun-22 95383 92359 97% 1736488 1778676 102%
22-Jun-22 93083 92775 100% 1829571 1871451 102%
23-Jun-22 93452 96890 104% 1923023 1968341 102%
24-Jun-22 97959 2020982
25-Jun-22 72388 2093369
26-Jun-22 45990 2139359
27-Jun-22 88301 2227660
28-Jun-22 95383 2323043
29-Jun-22 93083 2416127
30-Jun-22 121689 2537816
1-Jul-22 91152 2628968
Total 2628968 1968341 75%

This report tells us every morning how we are doing with our MSU-hours and
what we expect our bill to be for this month.

The vital data here is the forecast value. It is easy to generate this using some
techniques of your own, taking into consideration the historical data for the last
couple of years and current-year information such as weekends, month ends,
quarter ends, special processing days (e.g., Black Friday and Cyber Monday
sales, fund-raising activities, special concerts, etc.) and holidays.

Once we have the forecast, we track it every day. That way, we will have
enough time in hand to take proactive actions to control it, if possible. If it is not
possible to control, then we identify the cause of the surge in MSU-hours
and report to senior management and Finance well in advance, so that the
organization is ready to meet the cost challenge.
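A minimal sketch of the daily tracking with a naive run-rate projection follows. The figures are the first days from the table above; the projection method is an assumption for illustration, not the author's forecasting technique:

```python
# Accumulate budget vs. actual MSU-hours and project the month end by
# applying the actual/budget ratio observed so far to the remaining
# budgeted days.

def track(budget, actual):
    elapsed = len(actual)  # days with actuals so far
    used_budget = sum(budget[:elapsed])
    used_actual = sum(actual)
    ratio = used_actual / used_budget
    projected_month_end = used_actual + sum(budget[elapsed:]) * ratio
    return 100.0 * ratio, projected_month_end

pct, projected = track(budget=[95291, 99798, 94463, 68065],
                       actual=[103514, 103860])
print(f"accumulated {pct:.0f}% of budget, projected total {projected:.0f}")
# The accumulated 106% matches the 3-Jun-22 row in the table above.
```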

Example 7

The following example shows the MIPS utilization of the CICS workload and
compares it against the previous day(s)’ utilization. This information is refreshed
every 5 minutes.

Time: 12 December 2022, 12:45


Sysplex: ABC
[Chart: ‘CICS Workload utilization & comparison’ plotting MIPS (0 to 30000)
against time of day, with one line for Yesterday and one for Today]

Example 8

The following example shows the MIPS utilization of a specific LPAR, LP1, and
compares the same against the previous day(s) utilization. This information is
refreshed every 5 minutes.

Time: 12 December 2022, 12:45


LPAR: LP1
[Chart: ‘LP1 LPAR Utilization’ plotting CPU% (0 to 400) against time of day,
with one line for the previous day (N-1 Day) and one for today (N Day)]

Example 9

The following example shows a simple tracking of critical-path batch processing
during the batch window. The actual tracking may contain a lot more information.

Date: 12 December 2022

Event Job name SLA Start/End Current status

Start of batch ABCD 21:00 21:00 Green

Milestone 1 MNOP 01:00 00:30 Green

Milestone 2 WXYZ 03:00 03:30 YELLOW

SMF Record 99 and 113

SMF record type 99

SMF record type 99 (System Resource Manager Decisions) provides detailed
audit information for work run on z/OS. You can use the type 99 records to
analyze the performance characteristics of work. The records contain
performance data for each service class period, a trace of SRM actions, the data
SRM used to decide which actions to take, and the internal controls SRM uses to
manage work. This can help you determine in detail what SRM is doing to meet
your work's goals with respect to other work, and the types of delays the work is
experiencing.

The SMF 99 records contain a wealth of information related to WLM algorithm
decisions. They were originally developed to trace WLM decisions, but over the
years they have been expanded to provide insights into HiperDispatch, Capping,
Group Capacity Limits, machine topology, and more. Most customers have the
SMF 99 WLM decision records turned off due to their high volume.

SMF record type 99 has 14 subtypes. Please refer to the IBM manual ‘SMF
SA38-0667-50’, page 992, for details on SMF record 99, from which I have
extracted the following information.

Subtype 1 Contains system level data, the trace of SRM actions, and data about
resource groups. The SRM actions are recorded in trace codes. All trace codes
are described in z/OS MVS Programming: Workload Management Services. A
subtype 1 record is written every policy interval.

Subtype 2 Contains data for service classes. A subtype 2 record is written every
policy interval for each service class if any period in the service class had recent
activity.

Subtype 3 Contains service class period plot data. A subtype 3 record is written
every policy interval for each service class if any period in the service class had
recent activity and plot data.

Mainframe zOS Capacity, Performance and Optimization (CPO)

103
Subtype 4 Contains information about a device cluster. A device cluster is a set
of service classes that compete to use the same non-paging DASD devices. A
subtype 4 record is written every policy interval for each device cluster in the
system.

Subtype 5 Contains data about monitored address spaces. A subtype 5 record is
written each policy interval for each swapped-in monitored address space.

Subtype 6 Contains summary information about each service class period,
including the resource control settings for the next policy interval. A subtype 6
record is written each policy interval.

Subtype 7 Contains summary information for the Enterprise Storage Server
(ESS) with the Parallel Access Volume (PAV) feature. A subtype 7 record is written
every third policy interval.

Subtype 8 Contains summary information for LPAR CPU management. A subtype
8 record is written each policy interval, when in LPAR mode.

Subtype 9 Contains summary information for dynamic channel path
management. A subtype 9 record is written each policy interval.

Subtype 10 Contains information about dynamic processor speed changes. A
subtype 10 record is written for every processor speed change.

Subtype 11 Contains information about Group Capacity Limits. A subtype 11
record is written every 5 minutes.

Subtype 12 Contains HiperDispatch interval data. A set of subtype 12 records is
written each policy interval.

Subtype 13 Contains information about HiperDispatch. This information is for IBM
internal use only.

Subtype 14 Contains HiperDispatch topology data. Subtype 14 records are
written every 5 minutes, or in the current policy interval if a HiperDispatch
topology change happened.

Record type 113

Please refer to IBM manual ‘SMF SA38-0667-50’ page-1108 for details on SMF
record 113.

The system writes record type 113 to record hardware capacity, reporting, and
statistics for IBM System z10 or later machines. With the Extended Monitor
Facility released with the IBM z196, the SMF record can be utilized to capture
software events. We have to configure the collection of data using Hardware
Instrumentation Services (HIS), for which you can find more details at the
following link.

https://www.vm.ibm.com/perf/tips/cpumf.html

The SMF 113 measurements are designed to provide insight into the movement
of data and instructions among the processor cache and memory areas. These
measurements are invaluable for quantifying the net effect of the usage that the
processor caches have on the MIPS capacity of a processor. The SMF 113
measurements have become the basis for IBM’s LSPRs for processor sizing.

We can generate reports on Cycles per Instruction or actual MIPS per processor,
Level 1 Miss Percentage and Relative Nest Intensity calculated from data from
the SMF 113 records.

Cycles Per Instruction (CPI) - The number of cycles divided by the number of
instructions executed. An indication of how fast the CPU is running the work. It is
influenced by factors such as instruction stream complexity and cache and
memory access (RNI).

MIPS - Millions of instructions per second. The rate at which instructions were
being executed when the CPU was processing. This is effectively the inverse of
the CPI, so they show the same information. The MIPS value increases when the
work is being processed faster, so the MIPS display may be more intuitive than
CPI.

This MIPS measurement only has a loose relationship with the MIPS rating of the
processor. A wide variety of workloads is used to come up with a MIPS rating.
The typical variation in this MIPS measurement is an illustration of how much it is
influenced by workload and why care needs to be taken when comparing MIPS
ratings.

Level 1 Miss per 100 Instructions (L1MP) - The number of times data or
instructions were not found in level 1 cache. Fetches from lower cache levels are
slower, so in general as the L1MP rises, the work runs more slowly.

Relative Nest Intensity (RNI) - An indication of the relative usage of the memory
hierarchy, using formulas published by IBM for the specific processor type.

LSPR Workload - The LSPR workload hint over time. Derived from the L1MP and
RNI using the table published by IBM. Can be Low, Average or High.
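The L1MP definition above is straightforward to compute from counter deltas. The miss and instruction counts below are assumed examples, and the RNI formula itself is processor-specific (published by IBM), so it is not reproduced here:

```python
# Level 1 Miss per 100 Instructions (L1MP) from SMF 113 counter deltas.

def l1mp(l1_misses, instructions):
    return 100.0 * l1_misses / instructions

print(f"L1MP = {l1mp(18_500_000_000, 440_589_240_343):.2f}")  # L1MP = 4.20
```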

SMF 113 needs to be collected to calculate RNI, and it should be input to zPCR
for evaluating processor model changes. If you do nothing else with the data,
do this:

• You can gain a deeper understanding of how your workloads are utilizing
the hardware

• You might detect anomalies

• You can test the result of LPAR configuration changes

So, I suggest:

• Start capturing 113 data if you are not already

• Use it with zPCR for planning processor changes

• Examine the data to better understand how your workload is using your
hardware

The following two examples take data from SMF record types 99, 113 and 70. As you can see, this simple data provides a wealth of information for the CPO team. You can explore generating more data based on your needs.

You may refer to the following links which will provide you more technical details.

https://www.vm.ibm.com/perf/tips/burg-z16.pdf

https://share.confex.com/share/117/webprogram/Handout/Session9592/Peter.Enrico.zOS.SMF113.and.CPU.Counters.pdf

Example 10

The following example shows the hardware topology for one of the LPARs. This
information is refreshed every 5 minutes.

Time: 12 December 2022, 12:45


Sysplex: ABC
WLM Data for LPAR: LP1 on CEC: CEC1

WLM node: 0001 Type: 00 VH:02 VM:00 VL:00


Proc #: Act: Instructions CPI MIPS Drawer Cluster Chip Engine Core Type
4 79.93% 440,589,240,343 2.83 1,837.45 1 3 2 GP 02 VH
12 72.27% 405,745,797,152 2.78 1,870.50 1 3 2 GP 06 VH
Total: 1.5 846,335,037,495 2.8 1,857.14

WLM node: 0002 Type: 00 VH:03 VM:00 VL:00


Proc: Act: Instructions CPI MIPS Drawer Cluster Chip Engine Core Type
0 85.13% 500,386,588,233 2.65 1,962.26 1 3 1 GP 00 VH
2 81.27% 484,579,325,458 2.61 1,992.33 1 3 1 GP 01 VH
10 74.45% 466,134,115,856 2.49 2,088.35 1 3 1 GP 05 VH
Total: 2.4 1,451,100,029,547 2.59 2,007.72

WLM node: 0003 Type: 00 VH:02 VM:00 VL:00


Proc: Act: Instructions CPI MIPS Drawer Cluster Chip Engine Core Type
6 80.42% 483,319,794,238 2.59 2,007.72 1 3 1 GP 03 VH
8 74.63% 450,924,623,164 2.58 2,015.50 1 3 1 GP 04 VH
Total: 1.5 934,244,417,402 2.59 2,007.72

WLM node: 0005 Type: 05 VH:04 VM:02 VL:00

Proc: Act: Instructions CPI MIPS Drawer Cluster Chip Engine Core Type
14 33.02% 285,083,929,266 1.8 2,888.88 1 3 2 ZIIP 07 VH
15 19.27% 203,477,368,678 1.47 3,537.41 1 3 2 ZIIP 07 VH
16 14.71% 194,095,402,091 1.18 4,406.77 1 3 2 ZIIP 08 VH
17 10.32% 124,978,949,566 1.28 4,062.50 1 3 2 ZIIP 08 VH
18 4.75% 64,029,429,706 1.15 4,521.73 1 3 2 ZIIP 09 VM
19 3.40% 41,997,632,806 1.26 4,126.98 1 3 2 ZIIP 09 VM

Total: 0.8 913,662,712,113 1.46 3,561.64
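In a report like this, the per-processor MIPS figure is just the processor clock rate divided by the CPI, and a node's aggregate CPI is total cycles over total instructions. A small sketch of both relations, assuming a 5.2 GHz clock rate (consistent with the figures shown; substitute your CEC's actual rate):

```python
CLOCK_MHZ = 5200.0  # assumed processor clock (5.2 GHz); use your machine's actual rate

def mips_from_cpi(cpi):
    """MIPS while busy = clock cycles per microsecond / cycles per instruction."""
    return CLOCK_MHZ / cpi

def pool_cpi(per_proc):
    """Aggregate CPI for a WLM node: total cycles / total instructions."""
    cycles = sum(instr * cpi for instr, cpi in per_proc)
    instrs = sum(instr for instr, _ in per_proc)
    return cycles / instrs

# WLM node 0001 from the example above: (instructions, CPI) per processor
node = [(440_589_240_343, 2.83), (405_745_797_152, 2.78)]
print(f"Node CPI : {pool_cpi(node):.2f}")                  # ~2.81
print(f"Node MIPS: {mips_from_cpi(pool_cpi(node)):,.1f}")
```

For example, 5200 / 2.83 gives the 1,837.45 MIPS shown for processor 4; small differences from the printed totals are just rounding in the report.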

Example 11

From the data presented in Example 10, this is deep-dive information for GP processor 04 in LP1. You can find more granular data at the LP level. I am not suggesting that you keep analyzing this data regularly, but this information may help you in doing performance tuning at a more granular level.

Data for LPAR: LP1

Processor 04 at 12:45

Processor report:

Total Supervisor state Problem state


Cycles in interval: 1,247,568,108,281 823,928,875,663 423,639,232,618 (33.95%)
Instructions executed: 440,589,240,343 269,275,887,598 171,313,352,745 (38.88%)
Cycles per instruction: 2.83 3.05 2.47
Mips rate: 1,837.45 1,704.91 2,105.26
L1 miss per 100 instr. 4.78
- L1 INSTR miss per 100 2.40
- L1 DATA miss per 100 2.37

- SCPL1M cycl/MISS 33.58

- SIIS indicator 0.44

Cache data from: Instruction Data


- % From L2 (in CPU) 93.76% 78.71%
- % From L3 4.26% 10.12%
- % From L4 (on DRAWER) 3.79%
- % From L4 (off DRAWER) 0.12%
- % From Mem(on DRAWER) 1.05% 4.22%
- % From Mem(off DRAWER) 0.00% 0.00%

Logical processor performance


- Affinity: GP VH
- Avg wait: 48
- Avg active: 197
- Disp per act: 7.86
- usec per dispU: 25

Processor topology:
Drawer: 01 CLuster: 03 Chip: 02

Example 12

This example describes the automatic email that is triggered to report anomalies needing immediate action.

From: RTP-Automation

To: CPO, Operations

Sub: LPAR: LP1 - HIGH CPU consumption – for immediate action

Hello Team,

The task ABCDEFG in LPAR LP1 is consuming 101% of a GP engine (threshold 95%) for the past 15 minutes. This needs your immediate action to review the utilization.

Thank you: Automation.

Note: This mail is generated by automation; please do not reply to this mail.

This utilization could be normal or abnormal. If you find it normal, then ignore it; if it is abnormal, you need to perform an analysis immediately. This automatic email helps to reduce CPU wastage. In its absence, a task with a problem may keep consuming CPU for hours and go unnoticed. This email gives the teams an opportunity for a quick review; if the behaviour is found abnormal, take appropriate action to fix it as soon as possible, thereby controlling the wastage.
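The underlying check is straightforward: sample each task's engine consumption, and raise the alert once it has stayed above the threshold for the whole window. A minimal sketch of that logic (the threshold, window length and task names are illustrative; the real implementation would live in your automation tool):

```python
from collections import defaultdict, deque

THRESHOLD_PCT = 95.0    # % of one GP engine
WINDOW_SAMPLES = 15     # e.g. 15 one-minute samples

history = defaultdict(lambda: deque(maxlen=WINDOW_SAMPLES))

def check_task(lpar, task, engine_pct):
    """Record a sample; return an alert string once the task has been above
    the threshold for the entire window, otherwise None."""
    samples = history[(lpar, task)]
    samples.append(engine_pct)
    if len(samples) == WINDOW_SAMPLES and min(samples) > THRESHOLD_PCT:
        samples.clear()  # avoid re-alerting on every subsequent sample
        return (f"LPAR: {lpar} - task {task} consuming {engine_pct:.0f}% of a GP "
                f"engine (threshold {THRESHOLD_PCT:.0f}%) for past {WINDOW_SAMPLES} minutes")
    return None

# Simulate 15 consecutive samples above the threshold
alert = None
for _ in range(WINDOW_SAMPLES):
    alert = check_task("LP1", "ABCDEFG", 101.0)
print(alert)
```

Requiring the whole window to breach the threshold, rather than a single sample, keeps transient CPU spikes from generating noise.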

Systems Configuration
Manage resource configuration

You must maintain an exhaustive, always up-to-date systems configuration file (or files), and this should be the golden source for all systems configuration and history. I would include the following information under this.

1. Configuration details under each GP CEC. I propose creating a new version of the file with every change (large or small)

• CEC Name

• CEC physical configuration – Model, #GP, #specialized engines (zIIP, IFL, ICF), total central storage, MIPS, MSU

• List of LPARs – All LPARs configured in the CEC

• LPAR details – Allocated central storage, processor weight (all types of engines), guaranteed capacity, maximum capacity, #LPs (all types of engines), allocation of VH/VM/VL processors

• CBU and OOCoD capacity entitlement in the CEC

2. Configuration of external CF CECs

• LPAR names

• Resource allocation - #ICFs or #shared ICFs, central storage

3. Hardware upgrade history – All including major and minor

4. CBU history - #tests, date and time of activation and deactivation

5. OOCoD history – date of activation and deactivation

6. Systems configuration diagram – a consolidated diagram covering all CECs and CFs

7. Naming convention

• CEC Naming

• LPAR naming – All LPARs (zOS, zVM, zLinux, Internal CF, External CF)

I have placed a sample mainframe configuration topology diagram here.

[Sample mainframe configuration topology diagram: three CECs (CEC1, CEC2, CEC3) hosting Production Sysplexes 1 and 2, Test Sysplexes 1 and 2, sysprog, network and zVM LPARs; external CFs ECF1, ECF2 and ECF3; primary and secondary data centres; LPAR legend distinguishing Normal, Availability, Disaster and Recovery LPARs.]

I have used the following naming conventions.

1st character: L – zOS LPARs; V – VM LPARs

2nd character: P – Production LPARs; T – Test LPARs; S – Sysprog LPARs

3rd character: N – Normal LPARs; A – Availability LPARs (local recovery); D – Disaster LPARs (remote recovery); R – Recovery LPARs (recovery from Safeguarded Copy)

CPO Health Check

I strongly recommend generating a system health check report every day to make sure that there is no accidental change to the system configuration and performance due to a manual change, a system IPL, or any changes that are generally done at night or in the early morning. The health check metrics may include the following, and the list can be customized to your needs.

• LPAR weights

• LPs allocation – GP and specialized engines

• Reserved engines, if any – not released to the pool by mistake

• Current system status – any abnormality

• Any SLA miss – especially batch

• CPU speed – GP and specialized engines

• Local page dataset usage

• Available central storage frames queue

• Any change to WLM definition or Policy

• Any CPO related issues
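Most of these checks reduce to comparing today's snapshot of the configuration against the golden copy and flagging any deviation. A minimal sketch of that comparison (the metric names and values are illustrative, not real configuration fields):

```python
def health_check_diff(golden, today):
    """Return a list of findings where today's configuration deviates
    from the golden (expected) snapshot."""
    findings = []
    for key, expected in golden.items():
        actual = today.get(key, "<missing>")
        if actual != expected:
            findings.append(f"{key}: expected {expected}, found {actual}")
    return findings

# Hypothetical golden snapshot vs. today's collected values
golden = {"LP1.weight": 400, "LP1.gp_lps": 7, "LP1.ziip_lps": 6, "LP1.wlm_policy": "PROD01"}
today  = {"LP1.weight": 350, "LP1.gp_lps": 7, "LP1.ziip_lps": 6, "LP1.wlm_policy": "PROD01"}

for finding in health_check_diff(golden, today):
    print("HEALTH CHECK:", finding)   # flags the changed LPAR weight
```

An empty findings list means the overnight changes left the configuration as expected; anything else warrants a look before the business day starts.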

Rules of Thumb (RoT)

To effectively manage the CPO activities, it is very important to identify CPO opportunities for implementation. One of the most challenging tasks is to distinguish the 'normal' from the 'abnormal' for various metrics. I strongly suggest first listing all the metrics for consideration in CPO analysis. Then create a threshold or RoT against each metric and take it as a guide for your analysis. There is no right or wrong, but you have to start somewhere. I had collected a list of RoTs for my reference. These were not absolute values for me; I continuously adjusted them for my organization as I moved forward with my analysis over time.

I have compiled the list as follows.

a. Channel Utilization > 30% is not good

b. Device Queuing intensity > 300 is not good

c. WLM service class’s PI > 3 is not good

d. For service classes with velocity goals, maintain a difference of at least 5 between the goals

e. LPAR management time – LPAR management % for CPs and zIIPs (combined) should be less than 2%

f. CPU per transaction will vary 3-5% for every 10% change in CPU utilization

g. A job run in a larger LPAR will take less CPU time than when run in a smaller LPAR

h. Uncaptured CPU time. Uncaptured CPU time is CPU time that the LPAR consumed but which is not charged back to any address space or enclave. Many organizations measure this as the 'capture ratio'. You need to establish a threshold for the capture ratio. As I have noticed, a production LPAR with a >20,000 MIPS allocation running a stable workload has a capture ratio >95%, whereas a test system with a 3,000 MIPS allocation running lots of variable workloads can have a capture ratio as low as 85%.

There are many causes of uncaptured CPU time. Common causes are as follows:

• High page fault rates

• Full pre-emption

• Suspend lock contention

• Spin lock contention

• Getmain/Freemain activity (recommend cell pools)

• SRM time-slice processing

• Interrupts

• SLIP processing

• Long queues being processed in uncaptured processing

• Affinity processing (such as need for a specific CPU or crypto facility)

• Initiator CPU (SMF30ICU and SMF30ISB)

i. If not using Intelligent Resource Director (IRD) or HiperDispatch (HD), #LCP to #PCP > 3 is not good

j. Combined CF CPU utilization not > 80%

k. CF storage utilization not > 45%

l. CF Sub-channel busy conditions should not be > 30%

m. Changed CF Async requests (CHNGD) should not be > 10% of all requests

n. Delayed request % should not be > 10% of total requests

o. DASD NVS (Non-Volatile Storage) bypass condition should not be > 5

p. Lock structures real lock contention - not > 2% of total CF request for
structure

q. Lock structures false lock contention - not > 1% of total CF request for
structure

r. CF Path Busy % - not > 10%

s. LPAR GP CPU utilization not > 90%
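Once the RoTs are listed, they can be encoded as simple threshold checks and run against each day's metrics. A minimal sketch using a few of the values above (the metric names are my own; map them to whatever fields your reporting tool produces):

```python
# Each entry: metric name -> threshold (a value above it breaches the RoT)
ROT_THRESHOLDS = {
    "channel_util_pct":     30.0,  # a. Channel utilization > 30% is not good
    "wlm_pi":                3.0,  # c. Service class PI > 3 is not good
    "lpar_mgmt_pct":         2.0,  # e. LPAR management % (CPs + zIIPs combined)
    "cf_cpu_util_pct":      80.0,  # j. Combined CF CPU utilization not > 80%
    "cf_storage_util_pct":  45.0,  # k. CF storage utilization not > 45%
    "lpar_gp_util_pct":     90.0,  # s. LPAR GP CPU utilization not > 90%
}

def evaluate_rots(metrics):
    """Return the metrics that breach their RoT threshold, with (value, limit)."""
    return {name: (value, ROT_THRESHOLDS[name])
            for name, value in metrics.items()
            if name in ROT_THRESHOLDS and value > ROT_THRESHOLDS[name]}

# Illustrative daily sample
sample = {"channel_util_pct": 12.5, "wlm_pi": 4.2,
          "cf_cpu_util_pct": 55.0, "lpar_gp_util_pct": 93.1}
for name, (value, limit) in evaluate_rots(sample).items():
    print(f"RoT breach: {name} = {value} (limit {limit})")
```

Because the thresholds live in one table, adjusting a RoT for your own environment over time is a one-line change.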

CPO Process Document – a sample
For effective CPO management, I strongly recommend creating a CPO process
document following the installation standards. I review and update this document
yearly. However, I make ad hoc updates based on the feedback from audits,
change in the organization process document standards, and any major changes
to my CPO process.

In addition, I create work instructions for different CPO activities such as:

• OOCoD activation/deactivation process

• CBU activation/deactivation process

• Changes to WLM definition

• System configuration changes e.g., addition/deletion of LPARs

• Resource allocation changes to LPARs

I include at least the following in my CPO process document:

Name, contents, and change history

The process document must have a proper name approved by your organization's process team.

You must include a table of contents.

You must add a ‘Change History’ table containing Date, Version number, a short
description of changes in this version, owner of the document and the approver.

Introduction

Write an introduction to the process document.

(Example: The mainframe Capacity, Performance and Optimization (CPO) function is performed by a dedicated team under the 'Mainframe Infrastructure' department. This document describes the process followed to manage the capacity,

performance, and optimization for infrastructure (systems, applications, and users) under the mainframe z/OS operating system. Unless otherwise stated, all the descriptions in this document refer to mainframe hardware, software, and capabilities under mainframe z/OS only.)

Definitions and terminologies

Describe the definitions and terminologies, if any, you would like to include in the document.

(Example:

• Capacity management is …...

• Performance management is …...

• Optimization management is ….

• CPU resource is ….)

Objective

Describe the CPO objectives here. The objectives should be challenging and stretched, yet practicable to achieve. I recommend getting approval from the Mainframe Infrastructure head to align the objectives with the overall mainframe goal.

Mainframe environments

Provide a high-level overview of the mainframe configuration, such as #Data centers, #CECs, #LPARs, etc.

CPO activities

(Example: The CPO team performs all activities related to mainframe capacity, performance and optimization under z/OS. The following sections describe the various capacity and performance related functions and activities…..)

Capacity and performance data collection

(Example: Capacity and performance utilization data is collected using the standard IBM facility called SMF (System Management Facility). It is the responsibility of the z/OS support team to configure, collect and organize the storage of SMF data…..)

Capacity and performance data processing

(Example: The required SMF data is extracted, and the utilization statistics are processed outside the mainframe by the xxxx tool. The output from the xxxx tool is the golden source for the CPO team to perform most of the CPO activities. However, on a need basis, some ad hoc data processing is performed on the mainframe using SAS programs…..)

Capacity and performance monitoring tools

(Example: The following applications/tools are used for resource monitoring, alerting and analysis.

• Tool 1….

• Tool 2….

• RMF Monitor III – IBM-supplied monitoring tool on the mainframe)

Capacity forecast and provisioning

Describe in detail the process of performing the capacity forecast and of preparing and tracking the capacity plan.

Capacity provisioning

(Example: Based on the capacity forecast, the CPO team makes sure that the
required capacity is available to meet the demand.)

Resource capacity considered under capacity forecast

(Example: The following resources are considered for the forecasting. In general,
the forecast is performed for GP CPU as this is the most variable resource
component. However, the other components are reviewed from time to time and
mostly on demand, for example during a major hardware/software upgrade, major
project implementation and to address performance issues.

• GP CPU

• zIIP CPU

• other resources)

Performance management

(Example: The objective of performance management is to maintain operational efficiency in all mainframe environments and avoid resource wastage. The following major activities are performed under this:

• Manage the service definition in WLM (Workload Manager) to define the relative prioritization of various workloads in different environments

• Performance monitoring and alerting using the following tools

o Tool1

o Tool2

o RMF Monitor III

o Omegamon

• Performance analysis

• Handle performance related issues)

Capacity and performance analysis

(Example: Capacity and performance analysis is a regular activity, mostly done on demand to address capacity and performance issues and to ensure adequate capacity availability to meet the overall demand in different environments.

The CPU usage trend analysis is also performed periodically to determine the step growth and to generate a monthly capacity report. The following drill-down CPU usage analysis is performed to identify the major contributors.

• CEC – Individual or combined

• LPARs – All LPARs in an environment or individual LPAR

• Workload – Workload, service class, report class or a group of jobs

• Jobs – STC, TSO user, Batch jobs

• CICS transactions)

Track and manage CPU capacity growth

(Example: The objective of this activity is to track and take all possible actions to control the year-to-year (YTY) BAU CPU capacity growth within x%.

The average hourly CPU usage is cumulated over the entire month (including weekends and holidays) to determine the average CPU usage per day in the month. The difference between the daily average usage in the current year and that in the last year represents the growth (positive or negative).

It is often a challenge to separate the actual BAU growth from the incremental demand due to new applications, functional changes and optimization. But all possible observations are put together during the monthly analysis to justify the growth, and it is presented in the monthly report.

Optimization is a continuous activity to identify opportunities and implement the necessary changes to reclaim and reuse the CPU resource.)
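The calculation described above can be sketched as follows: cumulate the average hourly CPU usage over the month to get the average per day, then compare the same month year over year. The MIPS figures below are illustrative:

```python
def avg_daily_usage(hourly_mips):
    """Average CPU usage per day for a month: cumulate the hourly averages
    (including weekends and holidays) and divide by the number of days."""
    hours_per_day = 24
    days = len(hourly_mips) / hours_per_day
    return sum(hourly_mips) / days

def yty_growth_pct(this_year_avg, last_year_avg):
    """Year-to-year growth of the daily average usage, as a percentage."""
    return 100.0 * (this_year_avg - last_year_avg) / last_year_avg

# Illustrative: flat 10,000 MIPS last year vs 10,450 MIPS this year, 30-day month
last_year = [10_000.0] * 24 * 30
this_year = [10_450.0] * 24 * 30
growth = yty_growth_pct(avg_daily_usage(this_year), avg_daily_usage(last_year))
print(f"YTY BAU growth: {growth:.1f}%")   # 4.5%
```

The hard part is not the arithmetic but attributing how much of that percentage is genuine BAU growth versus new applications and optimizations, which is where the monthly observations come in.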

Track and manage CPU resources (MSU-hour)

(Example: The objective of this activity is to track the MSU-hour usage against
the target for the year and take various actions to keep the yearly average usage
below the target.

The MSU-hour is calculated based on system utilization in Millions of Service Units (MSU). Each IBM machine has an MSU rating published by IBM. This defined capacity is consumed by all the LPARs and applications on the machine on an on-demand basis. As application consumption is based on demand and often unpredictable, the MSU usage varies from month to month.

The following actions are taken to control the MSU-hour.

• Continuous optimization of application and systems components

• Manage the peak usage

• Batch MSU control through proper batch scheduling

• Workload distribution (action to distribute/spread batch workload)

• Work with planners to manage batch scheduling (as applicable)

• Control #LPs (Logical Processors)

• Alert on CPU usage crossing the specified threshold

• Monitoring and alerts on heavy usage and looping and action to stop it)
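Tracking against the yearly target can be sketched as a running average over the hourly MSU figures (the target and samples below are illustrative; your actual hourly MSU values come from your SCRT/RMF reporting):

```python
def msu_hours(hourly_msu):
    """Total MSU-hours consumed: one MSU-hour per MSU sustained for one hour."""
    return sum(hourly_msu)

def on_track(hourly_msu, yearly_target_avg):
    """True if the average hourly MSU usage so far is at or below the target."""
    return msu_hours(hourly_msu) / len(hourly_msu) <= yearly_target_avg

samples = [820.0, 910.0, 1_050.0, 760.0]   # illustrative hourly MSU figures
print(f"MSU-hours so far  : {msu_hours(samples):,.0f}")
print(f"Within 900 target : {on_track(samples, 900.0)}")
```

Running this check continuously through the year shows early whether the peak-management and scheduling actions listed above are keeping the average under the target, rather than discovering a miss at year end.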

Optimization

(Example: The objective is to identify opportunities and act on them to reduce CPU resource usage and reclaim resources for reuse.

CPU resource optimization is performed for the following workloads.

• Batch, Online (CICS transactions), System components

• The following capacity and performance activities provide input to identify optimization opportunities:

o Capacity and performance analysis

o Capacity and performance related incidents and problems

o Capacity and performance monitoring and alerts

• Input from development areas and applications)

Capacity tools

(Example: The following capacity tools are used to manage capacity and performance on the mainframe. A few of these tools are owned by the CPO team.

• Tool1

• Tool2

• RMF Monitor III

• Omegamon

• Others…..)

Capacity and performance reporting

(Example: The capacity and performance data is processed, and the following
reports are published.

• Report1 - Process and publish resource utilization reports

• Report2 – The utilization and configuration statistics are collected from various sources and presented in near real time

• SCRT report – Generate the report monthly and send it to IBM

• Monthly capacity report – generated and reviewed every month

o This report includes a capacity analysis summary, the capacity forecast for month-end and capacity recommendations.

• Ad hoc reports – Generated on demand

• Management reports

• Request from application/business areas

• Ad hoc data analysis)

Incidents and problems

(Example: The CPO team handles all the capacity and performance related issues.

• Proactive

o Monitoring and alerts

o Appropriate action based on the alert

• Reactive

o Emergency calls

o Issues reported by operations, the mainframe support team, developers and business

o RCA

o Participate as required)

Projects and initiatives

(Example: The CPO team estimates the capacity requirements for projects and special initiatives on demand. The following activities are performed.

• Capacity estimation

• Performance evaluation

• Special initiatives

• Performance and optimization recommendations)

Operational activities – on demand

(Example: The CPO team gets involved in the following operational activities on
demand.

• Make changes in Tool1….

• Online/Offline LPs

• Control the number of JES initiators

• Create, change and update WLM definitions

• Prepare the capacity allocation changes (CPU weight, memory, LPs) and recommend them to mainframe SMEs for implementation

• Recommend activation/deactivation of reserve capacity, if any

• Handle capacity and performance related incident records

• On demand 24x7 support)

Conclusion
If you ask me a question on CPO or want me to work on something, the first
question that jumps to my mind is: is this request ‘Generic’ or ‘Specific’? If it is
specific, then I try to understand the context and make a plan and approach to
perform the task accordingly.

Sometimes the task is simple, and I am able to deliver the result based on my knowledge and past experience. But, when required, I do not hesitate to look for additional information from IBM manuals, Redbooks, and articles published on the internet to read and research. Most importantly, I look at the latest information related to the version of zOS, the product and APARs, and the hardware configuration. Getting hold of the correct information, and the speed at which you access it, prove your capability to perform the task in time and deliver the most accurate result.

Working under the CPO has always been very fascinating and interesting, and at the same time very challenging. We have used many innovative ways to approach an issue, an analysis, or a query from the SMT, business and user community. Once you start working on a CPO query or issue, you will start thinking from many points of view, and if you continue working patiently, you will definitely be able to deep dive to the root. And undoubtedly, you will learn something new from each task.

In our professional career, unless we are in the field of research and development doing new innovations, in more than 99% of cases we do what someone else has already done elsewhere or knows how to do. So, my philosophy is: if someone else is able to do it, then I will be able to do it. And if I am able to do it, you will also be able to do it. You just need to build confidence in yourself to perform all the tasks in front of you with speed and accuracy, and develop ways to do it with the least possible effort.

