Tivoli Netcool OMNIbus 7.3.1 Large Scale and Geographically Distributed Architectures - Best Practice - v1.0
Large scale and geographically
distributed architectures
Best Practices
Contents
About this publication .......................................................................................................................iv
Intended audience ......................................................................................................................................... iv
What this publication contains..................................................................................................................... iv
Conventions used in this publication.......................................................................................................... iv
Typeface conventions ........................................................................................................................... v
Operating system-dependent variables and paths ........................................................................... v
Chapter 1 Executive summary ........................................................................................................... 6
Chapter 2 Introduction ........................................................................................................................ 7
Chapter 3 Details of the new architecture model ......................................................................... 10
Architecture model building block............................................................................................................. 10
Event aggregation at the dashboard layer ................................................................................................. 10
Network bandwidth considerations .......................................................................................................... 12
Chapter 4 Lab test environment ...................................................................................................... 14
Test architecture ............................................................................................................................................ 14
Test hardware................................................................................................................................................ 15
Composition and distribution of the test events....................................................................................... 15
Users and filters ............................................................................................................................................ 16
Metrics gathered ........................................................................................................................................... 17
Metric 1: How long does the AEL initially take to load the whole data set? .............................. 17
Metric 2: How long do AEL auto-refreshes take to execute? ........................................................ 17
Test results ..................................................................................................................................................... 17
Results analysis ............................................................................................................................................. 19
A note about data caching ................................................................................................................. 19
Notices.................................................................................................................................................. 20
Trademarks .................................................................................................................................................... 22
Licensed Materials – Property of IBM
Intended audience
This publication is intended for anyone preparing to deploy a large scale or geographically
distributed Tivoli Netcool/OMNIbus solution.
Typeface conventions
This publication uses the following typeface conventions:
Bold
Lowercase commands and mixed case commands that are otherwise difficult to
distinguish from surrounding text
Interface controls (check boxes, push buttons, radio buttons, spin buttons, fields,
folders, icons, list boxes, items inside list boxes, multicolumn lists, containers, menu
choices, menu names, tabs, property sheets), labels (such as Tip: and Operating
system considerations:)
Keywords and parameters in text
Italic
Citations (examples: titles of publications, diskettes, and CDs)
Words defined in text (example: a nonswitched line is called a point-to-point line)
Emphasis of words and letters (words as words example: "Use the word that to
introduce a restrictive clause."; letters as letters example: "The LUN address must
start with the letter L.")
New terms in text (except in a definition list): a view is a frame in a workspace that
contains data
Variables and values you must provide: ... where myname represents....
Monospace
Examples and code examples
File names, programming keywords, and other elements that are difficult to
distinguish from surrounding text
Message text and prompts addressed to the user
Text that the user must type
Values for arguments or command options
http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.netcool_OMNIbus.doc_7.3.1/omnibus/wip/install/concept/omn_esf_configuringdeploymultitieredarch.html
The new model extends the standard multitier architecture and allows operators to have
consolidated views of events from multiple Tivoli Netcool/OMNIbus instances ― either
collocated on the same site or geographically distributed ― even potentially on different
continents.
Since there is no programmatic limit to the number of datasources that WebGUI can connect
to, this new model provides a method of deploying Tivoli Netcool/OMNIbus in a manner
that is truly "ultra scalable".
ALREADY IN PRODUCTION
IBM can confirm that this new architecture model is currently in use in production by a large
North American Netcool customer in the financial sector. "Customer A" has 4 Network
Operation Centres (NOCs) geographically distributed equidistantly around the world. Each
NOC draws events from three different globally distributed Tivoli Netcool/OMNIbus
partitions.
Customer A enjoys seamless, race-condition-free performance from their globally distributed
Tivoli Netcool/OMNIbus deployment. All of the regional NOCs can continue to operate in
the event of a disconnect with one or more of the other regional partitions ― without manual
intervention. Similarly, recovery of the outage is automatic and just as seamless.
LAB TEST RESULTS
In addition to outlining how to deploy the new architecture, this document provides test
results from a test environment that was set up over the IBM WAN spanning 3 continents to
prove the concept.
The test system was loaded up with 11,300 events ― a typical number of events found in the
production system of Customer A. 30 unique users were logged in and an AEL opened for
each user with the specified filter applied. Even with relatively low specification hardware
and relatively slim network pipes, the tests returned favourable results.
The new architecture model described in this document is recommended for use by anyone
contemplating deploying Tivoli Netcool/OMNIbus on an ultra large scale ― or where the
requirements are such that a geographically distributed model is necessary.
Chapter 2 Introduction
Tivoli Netcool/OMNIbus has long been a byword for high availability and
scalability in the event management space. To provide both out of the box, Tivoli
Netcool/OMNIbus ships with a pre-canned configuration to support multitiered
environments ― called the standard multitier architecture configuration ― for the
purpose of supporting higher numbers of events, users or both. Using the standard
multitier configuration alone, however, is not always an ideal fit in an environment that is
either required to be globally distributed or is of such a large scale that it exceeds the
standard multitier architecture model's capabilities.
ADDITIONAL REQUIREMENTS
First, although the traditional multitiered architecture model is highly scalable, there is a
limit to the load a single instance can handle. There are occasions where the number of
events to be handled exceeds the capabilities of even a 3-tiered system ― particularly one
with load-intensive custom functionality; a very high number of events or a large volume of
custom table data; or a combination of these or other factors.
Second, as the geographical boundaries of businesses operating in a global marketplace
continue to expand, so too do the business requirements of the network management
monitoring systems that support them. The need to have 24 hour or "follow-the-sun"
operations within a globally shared system is becoming increasingly important and
commonplace to customers that support a globally distributed infrastructure. A globally
distributed architecture model is needed to support such an infrastructure. In such cases, an
architecture model based purely on the primary/backup concept is not a natural fit.
In both cases, an augmented solution is needed.
This document provides a best practice architecture model for the two scenarios described
above, using the Tivoli Netcool/OMNIbus standard multitier architecture configuration
in conjunction with new functionality available in the WebGUI.
COMBINED DATASET VIEWS
Within the context of Tivoli Netcool/Webtop or the WebGUI component of Tivoli
Netcool/OMNIbus, a datasource is defined as an ObjectServer or failover pair with zero or
more Display ObjectServers configured to provide user load sharing capability. New
functionality introduced in WebGUI provides the ability of the WebGUI server to pull event
data from multiple datasources and seamlessly combine the data into common views ― such as
Active Event Lists (AELs), Monitor Boxes and map page elements. This means there is no
longer any technical need to combine the events within the underlying Tivoli
Netcool/OMNIbus infrastructure in order to obtain such views ― as was the case with Tivoli
Netcool/Webtop.
Given this new freedom, the underlying events can now be partitioned off into multiple,
disparate Tivoli Netcool/OMNIbus systems and the events from any number of these
systems combined by WebGUI into common views, as required.
Note: Depending on the context, the term "partition" is used in this document either to
describe the grouping of events by functional need or to refer to an individual Tivoli
Netcool/OMNIbus system ― which could itself be 1-, 2- or 3-tiered.
PARTITION SCENARIOS
The decision whether or not to split incoming events over multiple Tivoli Netcool/OMNIbus
partitions and also how best to divide the events up, depends on the scale of the overall
solution, the physical distribution of the overall solution or a combination of the two.
Also included are some performance metrics on WebGUI's ability to pull data across a WAN
from globally distributed datasources.
Note: The performance tests described in this document were carried out on
IBM networks and the results do not constitute any sort of guarantee of performance ― they
merely provide an indication of what can be achieved given the stated parameters.
http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/topic/com.ibm.netcool_OMNIbus.doc_7.3.1/omnibus/wip/install/concept/omn_esf_overviewstandardarch.html
Whether to use 1, 2 or 3 tiers in each partition should be decided on a case by case basis ―
and will depend on the individual loading characteristics within each partition.
Each partition will exist and operate independently of the other partitions. This autonomy
is an important feature ― particularly within a geographically distributed environment ―
because it ensures that individual partitions can continue to operate even when isolated
from the others. It also means that an outage in one partition will not directly affect the
performance or availability of the other partitions.
Note: The "collection" of ObjectServers depicted in each of the datasources in the following
diagrams generically represents a 1-, 2- or 3-tiered standard multitier architecture
system and is not intended to literally represent any particular number of ObjectServers or
tiers.
[Diagram: a WebGUI server providing the combined event view layer, with read/write connections to multiple datasources.]
EXAMPLE:
Widgetcom have three data centres: one in London, one in Bangalore and one in Wellington;
each one with a 3-tier Tivoli Netcool/OMNIbus installation. Widgetcom wish to set up a
"follow the sun" support model ― where managed systems around the world are monitored
on a 24 hour basis by the three globally distributed NOCs during each respective data
centre's business hours. Each of the three data centres must have visibility of events from the
other two ― and each one must also be able to function in isolation if cut off from the other
two.
The Netcool Administrator elects to include a datasource definition for each region in the
WebGUI datasource definitions files on all servers. Active Event Lists and other dashboards
can then be constructed using the combined event sets from all three datasources. The
resulting architecture is shown below:
With the new datasources and event views in place, the operators in each of the three data
centres are then able to see and even deal with events from any of the three regions. All of
the data centres can operate independently of each other ― including in isolation.
NOTES:
It is recommended to enable caching for remote datasources to minimise the amount of
data moving over the WAN.
When operators use common filters or views, this reduces the data transfer further when
caching is enabled. Whether to use data caching for the local datasource should be
decided based on the variance of filters and views in use by operators.
EXAMPLE:
Within a year, Widgetcom's London data centre expands its operations to the point where the
business requirement is to hold more events at any one time than a single Tivoli
Netcool/OMNIbus Aggregation pair can accommodate. This is primarily due to the large
increase in the number of standing rows at the Aggregation layer combined with the complex
custom correlations and event processing operations being carried out on the Aggregation
layer ObjectServers on an on-going basis.
After analysis, the Netcool Administrator identifies that the events are made up of roughly
50% application X events and 50% application Y events. Further, the business requirements
have no custom correlation needs that involve events of both types.
The Netcool Administrator elects to install a second Tivoli Netcool/OMNIbus multitier
partition and relocates application Y events to the new system. This has the effect of halving
the event number from the incumbent system and evenly spreads the load across the two
Tivoli Netcool/OMNIbus partitions. The WebGUI datasource definitions file is updated to
include the new datasource that holds the application Y events ― and the WebGUI operator
views are updated to include events from the new datasource.
With the additional partition in place, the London data centre is now able to handle a much
higher volume of events than before. Additionally, the maximum event handling headroom
is elevated significantly ― affording Widgetcom more leeway in an event storm scenario.
NOTES:
There is no programmatic limit to the maximum number of datasources (i.e. Tivoli
Netcool/OMNIbus multitier partitions) a WebGUI server can connect to; hence this
partitioning technique allows for extensive lateral scalability.
Inter-partition event correlation and processing can be achieved by using Tivoli
Netcool/Impact, if required.
The average size of an event, including an allowance for the occasional journal and/or
detail per event, should be calculated. A typical value of 3 KB can generally be used as a
multiplier if a more specific value cannot be calculated at the time;
The average number of events the WebGUI server will retrieve per minute from the
target Display layer ObjectServer(s) should be calculated. If data caching is enabled for
the remote datasource, it will be the count of all matching events for all filters that are in
use. If data caching is not enabled for the datasource, this figure would then need to be
multiplied by the maximum number of logged in users at any one time;
The number of WebGUI servers to be deployed.
Multiplying these three values together will give you the number of kilobytes that will need
to be transferred per minute from the remote datasource to the local site ― which can then be
converted to bits per second ― and therefore used as a bandwidth provisioning requirement.
Note: Such calculations typically give an indication only of the required bandwidth.
Any bandwidth provisioning would have to include significant contingency for
peak event loads and would likely have to be revisited if any of the above parameters
significantly changed.
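The three-factor calculation above can be sketched as follows. This is an illustrative sketch only; the function name and the sample parameter values are assumptions, not from the source.

```python
# Hypothetical bandwidth estimate for a remote datasource link, following
# the three-factor calculation described above. All values are illustrative.

def estimate_bandwidth_bps(avg_event_kb, events_per_minute, webgui_servers):
    """Return an estimated steady-state bandwidth requirement in bits per second."""
    kb_per_minute = avg_event_kb * events_per_minute * webgui_servers
    # 1 KB = 1024 bytes = 8192 bits; divide by 60 for a per-second rate
    return kb_per_minute * 1024 * 8 / 60

# Example: 3 KB average event size, 565 matching events per minute
# (caching enabled), one WebGUI server
bps = estimate_bandwidth_bps(3, 565, 1)
print(f"{bps / 1000:.0f} Kbps")  # ~231 Kbps, before peak-load contingency
```

As the note above stresses, a figure like this is a starting point only; real provisioning must add significant headroom for event storms.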
Test architecture
Three servers were set up in three roughly equidistant parts of the world: one in London, UK
(rheles41); one in Austin, Texas, USA (emonster); one in Perth, Australia (snapper).
Tivoli Netcool/OMNIbus 7.3.1 was then installed on all three servers and the standard
multitier architecture configuration used to construct a simple 2-tier Aggregation/Display
system on each one. Each server installation comprised a single Aggregation layer
ObjectServer, a single Display layer ObjectServer and a unidirectional Display ObjectServer
Gateway connecting the two.
Note: Since failover scenarios were not going to be included in these tests, failover
components were not included in the environment. It is recommended however ― and,
indeed, best practice ― to include failover components in a real, production system.
[Diagram: an Aggregation layer ObjectServer feeding a Display layer ObjectServer via a unidirectional Display ObjectServer Gateway.]
WebGUI was then installed on the London test server (rheles41) and was configured to
connect to the datasource in each of the three regions.
The WebGUI server was configured to connect to each datasource in standard Dual Server
Desktop (DSD) mode ― that is, a read/write connection to the Display layer ObjectServer
and a write connection to the Aggregation layer ObjectServer.
The measured bandwidth of the London/Perth link was ~230 Kbps. The measured
bandwidth of the London/Austin link was ~400 Kbps.
A diagram of the test environment is shown below:
[Diagram: the WebGUI server on rheles41 (London) connected to the Perth datasource over a ~230 Kbps link and to the Austin datasource over a ~400 Kbps link.]
Test hardware
The hardware specifications of the machines used in the tests are as follows:
[Table: hardware specifications of the test machines.]

Composition and distribution of the test events

[Table: synthetic event counts by type; the final row, Type F, was 4,000 events.]
The overall number of events across all systems in the test environment therefore was 11,300.
The synthetic events were created via an ObjectServer trigger located within the Aggregation
ObjectServer within each region. The trigger runs once every 60 seconds and carries out the
following tasks:
Inserts 10% of the total number of events for each event type;
Deletes any events older than 10 minutes.
This ensures that:
After an initial 10 minute period, the total number of events remains constant to the
numbers specified in the table above;
There is 10% event turnover per minute.
The purpose of the 10% event turnover is to simulate event churn ― which would be present
in a real environment. This is important because it reproduces the need for the event views
to refresh with new data on a constant basis. A churn rate of 10% per minute was deemed a
conservatively high estimate; real-world churn would likely be considerably lower.
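The steady-state behaviour the trigger produces can be sketched with a small simulation. This is illustrative only: the cycle arithmetic mirrors the description above (insert 10% per minute, delete events older than 10 minutes), not actual product code.

```python
# Minimal simulation of the synthetic-event trigger described above:
# every cycle (minute), insert 10% of the target count and delete
# anything older than 10 minutes. The population ramps up for 10
# minutes and then stays constant at the target total.
from collections import deque

target_total = 11_300
insert_per_cycle = target_total // 10   # 10% of the total each minute
max_age_cycles = 10                     # delete events older than 10 minutes

events = deque()  # each entry records the cycle in which it was inserted
for cycle in range(20):
    # delete events that have reached the 10-minute age limit
    while events and cycle - events[0] >= max_age_cycles:
        events.popleft()
    # insert this minute's batch
    events.extend([cycle] * insert_per_cycle)

print(len(events))  # 11300: constant after the initial 10-minute ramp-up
```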
The filter applied for all test users was:

Acknowledged = 0 and
Flash = 0 and
FirstOccurrence < (getdate() - 300) and
Node <> 'server01'
Since none of the synthetic events were acknowledged or flashing, and none had the
Node field set to server01, the filtered event set that all users were viewing consisted of
every event whose first occurrence was more than 5 minutes in the past. The filter included
all of these field comparisons so that it would be comparable to a "real world" filter in
terms of complexity, and hence in the load it induced on the ObjectServer during execution.
© Copyright IBM Corporation 2011, 2012.
Since the events are replaced at a rate of 10% per minute, each user AEL was
displaying approximately 5,650 events. This can be calculated as the total events (11,300)
minus the number of new events (i.e. those less than 5 minutes old).
This number of AEL events was deemed as a typical number of events that an operator
would have in their event list.
The AEL timed refresh was set to the default of 60 seconds for all users. Since the event
churn rate was 10% per minute, approximately 565 new events were inserted and
approximately 565 events were deleted with each AEL timed refresh.
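The arithmetic above can be checked directly; the figures below are those quoted in the text.

```python
# Worked arithmetic for the AEL event counts described above.
total_events = 11_300        # steady-state events across all partitions
churn_percent = 10           # 10% of events replaced each minute
filter_age_minutes = 5       # the filter matches events older than 5 minutes

# Events younger than 5 minutes: five minutes' worth of churn
new_events = total_events * churn_percent * filter_age_minutes // 100
ael_events = total_events - new_events            # events shown in each AEL
per_refresh_churn = ael_events * churn_percent // 100

print(ael_events)         # 5650 events displayed per AEL
print(per_refresh_churn)  # 565 events inserted (and deleted) per 60 s refresh
```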
Metrics gathered
It was deemed that the success of the tests would ultimately be judged on how good the end-
user experience is for the operators. That being the case, the following items were identified
as key metrics to be collected for this exercise. 100 measurements were taken for each of the
following two metrics:
Metric 1: How long does the AEL initially take to load the whole data set?
This metric measures the amount of time it takes the AEL to do a full load of the events when
the filter is first selected.
In order to exclude the length of time it takes for the AEL applet to load (which is heavily
client dependent), this measurement was taken when the filter selection was changed from
one filter to the target filter.

Metric 2: How long do AEL auto-refreshes take to execute?

This metric measures the time taken for each 60-second AEL timed refresh to fetch and
display the changed event data.
Test results
The following table shows a summary of the measurements taken for the two metrics. The
values have been averaged and the standard deviation calculated. All values are shown in
seconds:
Metric 1: filter select change (full data reload) | Metric 2: AEL auto-refresh (partial data reload)

[Table: 100 raw measurements per metric, in seconds. The large majority of values were 1 second, with occasional outliers ranging up to 9 seconds.]
Results analysis
The average load time for a full AEL load was 1.74 (±1.45) seconds, whereas the average
time for an AEL auto-refresh was 1.29 (±1.11) seconds. The time taken for an AEL to do a
full load of the data set was therefore typically around a third more than that of an
auto-refresh. Both metrics returned relatively low standard deviations, indicating that the
average values were fairly typical.
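For reference, summary statistics of this kind would typically be computed as below. The sample values here are illustrative only, not the actual test measurements.

```python
# Computing a mean and sample standard deviation for a set of timing
# measurements, as in the summary above. Values are illustrative.
import statistics

full_load_times = [1, 1, 4, 7, 2, 1, 3, 5, 1, 1]   # seconds; sample data only
mean = statistics.mean(full_load_times)
stdev = statistics.stdev(full_load_times)           # sample standard deviation
print(f"{mean:.2f} (±{stdev:.2f}) seconds")
```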
There were a small number of high values within the test results. Generally speaking, the
lower values will likely occur when the AEL is accessing the result set from the WebGUI
server's cache and the higher values when the AEL invokes the WebGUI server to access the
event data from the datasources directly (ie. when cache results have expired in each case).
In a WAN scenario, occasional high return values would be expected due to network
latency, which factors into AEL responsiveness during WAN-based queries ― for
example, queries the WebGUI server makes to the remote datasources. The response time
ultimately depends on the reliability of the WAN link.
Caching was enabled for the remote datasources during these tests and the WebGUI trace file
reported that the cache was being accessed by the AEL clients about 63% of the time. This
means that AEL refreshes were only creating WAN traffic around a third of the time. This
highlights the value of intelligent filter construction and cache use.
Each datasource had only one Display layer ObjectServer serving event data to
WebGUI. It is expected that results would be more favourable on a "real" system where more
powerful hardware and more Display ObjectServers supporting the user load were
provisioned.
Notices
This information was developed for products and services offered in the U.S.A.
IBM® may not offer the products, services, or features discussed in this document in other
countries. Consult your local IBM representative for information on the products and
services currently available in your area. Any reference to an IBM product, program, or
service is not intended to state or imply that only that IBM product, program, or service may
be used. Any functionally equivalent product, program, or service that does not infringe any
IBM intellectual property right may be used instead. However, it is the user's responsibility
to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in
this document. The furnishing of this document does not grant you any license to these
patents. You can send license inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.
For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual
Property Department in your country or send inquiries, in writing, to:
IBM World Trade Asia Corporation
Licensing
2-31 Roppongi 3-chome, Minato-ku
Tokyo 106-0032, Japan
The following paragraph does not apply to the United Kingdom or any other country where
such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES
CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF
ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied
warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are
periodically made to the information herein; these changes will be incorporated in new
editions of the publication. IBM may make improvements and/or changes in the product(s)
and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only
and do not in any manner serve as an endorsement of those Web sites. The materials at those
Web sites are not part of the materials for this IBM product and use of those Web sites is at
your own risk.
IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose of enabling:
(i) the exchange of information between independently created programs and other
programs (including this one) and (ii) the mutual use of the information which has been
exchanged, should contact:
IBM Corporation
958/NH04
IBM Centre, St Leonards
601 Pacific Hwy
St Leonards, NSW, 2069
Australia
IBM Corporation
896471/H128B
76 Upper Ground
London
SE1 9PZ
United Kingdom
IBM Corporation
JBF1/SOM1 294
Route 100
Somers, NY, 10589-0100
United States of America
Such information may be available, subject to appropriate terms and conditions, including in
some cases, payment of a fee.
The licensed program described in this document and all licensed material available for it are
provided by IBM under terms of the IBM Customer Agreement, IBM International Program
License Agreement or any equivalent agreement between us.
Any performance data contained herein was determined in a controlled environment.
Therefore, the results obtained in other operating environments may vary significantly. Some
measurements may have been made on development-level systems and there is no guarantee
that these measurements will be the same on generally available systems. Furthermore, some
measurements may have been estimated through extrapolation. Actual results may vary.
Users of this document should verify the applicable data for their specific environment.
Information concerning non-IBM products was obtained from the suppliers of those
products, their published announcements or other publicly available sources. IBM has not
tested those products and cannot confirm the accuracy of performance, compatibility or any
other claims related to non-IBM products. Questions on the capabilities of non-IBM products
should be addressed to the suppliers of those products.
All statements regarding IBM's future direction or intent are subject to change or withdrawal
without notice, and represent goals and objectives only.
All IBM prices shown are IBM's suggested retail prices, are current and are subject to change
without notice. Dealer prices may vary.
This information is for planning purposes only. The information herein is subject to change
before the products described become available.
This information contains examples of data and reports used in daily business operations. To
illustrate them as completely as possible, the examples include the names of individuals,
companies, brands, and products. All of these names are fictitious and any similarity to the
names and addresses used by an actual business enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate
programming techniques on various operating platforms. You may copy, modify, and
distribute these sample programs in any form without payment to IBM, for the purposes of
developing, using, marketing or distributing application programs conforming to the
application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions.
IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these
programs.
If you are viewing this information softcopy, the photographs and color illustrations may not
appear.
Trademarks
These terms are trademarks of International Business Machines Corporation in the United
States, other countries, or both:
IBM
Tivoli
Netcool
SunOS, Sun, Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in
the United States, other countries, or both.
Red Hat, RHEL are trademarks or registered trademarks of Red Hat in the United States,
other countries, or both.
Adobe, Acrobat, Portable Document Format (PDF), PostScript, and all Adobe-based
trademarks are either registered trademarks or trademarks of Adobe Systems
Incorporated in the United States, other countries, or both.
Java and all Java-based trademarks and logos are trademarks or registered trademarks of
Oracle, Inc. in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft
Corporation in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other
countries.