CSE510 Monitoring

Grid Monitoring
By
Zoran Obradovic
CSE-510
October 2007
Grid Monitoring
Reasons for monitoring
Authorization, scheduling, sense of control
Monitoring systems
Globus (Monitoring and Discovery System, MDS), Ganglia, Nagios, Inca, MonALISA
Standards
GIPS compliance verification
Reasons
Monitoring the state of grid resources, services and job activity
is an important part of managing a grid environment.
Administrators need a sense of control over the resources
provided in such a distributed computing environment.
It is important for grid administrators to know the current state
of the grid in order to provide operations and support.
It is also an important tool for grid users.
The goal is a system that gives administrators the ability to look
at the grid and administer it as if it were a single workstation.
Monitoring can provide grid administrators, as well as users, with
significant information about what resources are available in the
grid and what state they are in.
Job monitors gather vital information about job submissions
on specific resources by harvesting data from local cluster job
managers.
Resource allocation: Monitoring allows various resources to be
dynamically instantiated and adjusted using constantly running
background processes.
Security: Keeps track of who is using the grid and with what
permissions, protects data integrity, and minimizes the possibility
of malicious activity, threats, and accidents.
Monitoring Systems
MonALISA
Monitoring Agents using a Large Integrated Services Architecture
Built by Caltech and its partners with the support of the U.S. CMS
software and computing program.
The design is built on a Dynamic Distributed Service Architecture
able to provide complete monitoring, control and global
optimization services for complex systems.
It is a group of independent, multi-threaded, self-describing
agent-based subsystems which are registered as dynamic
services and are able to communicate and work together in
performing a range of information gathering and processing tasks.

A pool of threads is created once, and each thread is reused
when the task assigned to it completes.
If a monitoring task fails or hangs due to I/O errors, the other
tasks are not delayed or disrupted, since they are executing in
other, independent threads.
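The thread-pool behaviour described here can be illustrated with a minimal Python sketch. It is not MonALISA code (MonALISA itself is a Java system); the check functions, host names and the 0.5 second timeout are all hypothetical, chosen only to show that a failing or hanging task does not hold up the others.

```python
from concurrent.futures import ThreadPoolExecutor, wait
import time

# Hypothetical monitoring tasks: one succeeds, one hangs, one fails.
def check_ok(name):
    return f"{name}: ok"

def check_hangs(name):
    time.sleep(1.0)  # simulates a task stuck on slow I/O
    return f"{name}: late"

def check_fails(name):
    raise IOError(f"{name}: read error")

def run_checks(checks, timeout=0.5, workers=4):
    """Run every check on one shared pool and report each result separately."""
    pool = ThreadPoolExecutor(max_workers=workers)  # threads are reused across tasks
    futures = {pool.submit(fn, name): name for name, fn in checks.items()}
    done, not_done = wait(futures, timeout=timeout)
    results = {}
    for fut in done:
        name = futures[fut]
        try:
            results[name] = fut.result()
        except Exception as exc:  # a failed task only affects its own entry
            results[name] = f"error: {exc}"
    for fut in not_done:  # still running after the timeout
        results[futures[fut]] = "timeout"
    pool.shutdown(wait=False)
    return results
```

Calling `run_checks({"node-a": check_ok, "node-b": check_hangs, "node-c": check_fails})` reports a result for `node-a` even though `node-b` is still sleeping and `node-c` raised an error.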
Each MonALISA service registers itself with a set of Lookup
Services (LUSs) as part of one or more groups and it publishes
some attributes that describe itself.
Lookup services have replicated information.
MonALISA LUSs restrict service registration based on an
authorized X.509 certificate.
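As a rough illustration of registration with a lookup service, the sketch below keeps an in-memory registry of services, their groups and their self-describing attributes. It is a toy model, not the MonALISA LUS interface: the class and method names are invented, and the certificate check is reduced to comparing a subject string against an allow-list, which is only a stand-in for real X.509 validation.

```python
from dataclasses import dataclass, field

@dataclass
class LookupService:
    """Toy in-memory registry; names and behaviour are assumptions."""
    authorized_subjects: set
    entries: dict = field(default_factory=dict)

    def register(self, name, groups, attributes, cert_subject):
        # Stand-in for X.509 validation: only the subject string is compared.
        if cert_subject not in self.authorized_subjects:
            raise PermissionError(f"{cert_subject} is not authorized")
        self.entries[name] = {
            "groups": set(groups),
            "attributes": dict(attributes),
        }

    def find(self, group):
        """Return the names of all services registered in the given group."""
        return sorted(n for n, e in self.entries.items() if group in e["groups"])
```

A service would call `register` once per lookup service it knows about; clients then call `find` with a group name to discover the services in it.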
The combination of the service architecture and code
mobility makes it possible to build an extensible
hierarchy of services that is capable of managing
very large systems.
Monitoring all aspects of complex systems:
-System information for computer nodes and clusters.
-Network information (traffic, flows, connectivity, topology) for WAN and LAN.
-Performance of applications, jobs or services.
-End-user systems and end-to-end performance measurements.
Globus
The Monitoring and Discovery System (MDS) is a suite
of web services to monitor and discover resources and services
on Grids.
It allows users to discover what resources are considered
part of a Virtual Organization.
It offers trigger and indexing services:
Trigger Service: Gathers information and evaluates that data against a
set of conditions defined in a configuration file. When a condition
is met, an action takes place, such as emailing a system administrator
when the disk space on a server reaches a threshold.
Indexing Service: Gathers information and publishes that information
as resource properties. Clients use the resource property query and
subscription/notification interfaces to retrieve information from an Index.
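The trigger idea (evaluate gathered data against configured conditions, then act) can be sketched in a few lines of Python. This is not the MDS Trigger Service or its configuration format; the rule fields, metric names and the `email_admin` action are hypothetical and stand in for whatever the real configuration file defines.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

# Hypothetical rule set standing in for the configuration file.
RULES = [
    {"metric": "disk_used_pct", "op": ">=", "threshold": 90, "action": "email_admin"},
    {"metric": "free_cpus", "op": "<", "threshold": 1, "action": "email_admin"},
]

def evaluate(rules, sample, actions):
    """Run the action of every rule whose condition holds for this sample."""
    fired = []
    for rule in rules:
        value = sample.get(rule["metric"])
        if value is None:  # metric not reported in this sample
            continue
        if OPS[rule["op"]](value, rule["threshold"]):
            actions[rule["action"]](rule, value)
            fired.append(rule["metric"])
    return fired
```

In use, `actions` maps an action name to a callable, for example one that appends to an outbox list or sends mail, so that the rules stay pure data and can be changed without touching the code.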
Information Providers For Globus Monitoring Toolkit
Hawkeye Information Provider
Ganglia Information Provider
WS GRAM
Reliable File Transfer Service (RFT)

What do they provide?
-basic host data (name, ID)
-processor information
-memory size
-OS name and version
-file system data
-processor load data
-queue information
-number of CPUs available and free
-job count information
-some memory statistics
-status data of the server
-transfer status for a file or set of files
-number of active transfers
Ganglia
Scalable distributed monitoring system for high-performance
computing systems.
It uses XML for data representation, XDR (External Data
Representation) for portable data transport, and RRDtool for
data storage and visualization.
It uses carefully engineered data structures and algorithms to
achieve very low per-node overheads and high concurrency.
It has been used to link clusters across university campuses and
around the world and can scale to handle clusters with 2000 nodes.
Current support comes from PlanetLab, an open platform for developing,
deploying, and accessing planetary-scale services.
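To make the XML data representation concrete, here is a small Python sketch that reads per-host metrics from a Ganglia-style XML document. The sample is hand-written and heavily simplified: it borrows the CLUSTER, HOST and METRIC element names used by Ganglia's XML output, but the hosts and values are invented and a real dump carries many more attributes.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample; hosts and values are made up.
SAMPLE = """<GANGLIA_XML VERSION="3.0" SOURCE="gmond">
  <CLUSTER NAME="demo">
    <HOST NAME="node01" IP="10.0.0.1">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      <METRIC NAME="mem_free" VAL="2048000" TYPE="uint32" UNITS="KB"/>
    </HOST>
    <HOST NAME="node02" IP="10.0.0.2">
      <METRIC NAME="load_one" VAL="1.50" TYPE="float" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""

def metrics_by_host(xml_text):
    """Return {host name: {metric name: value string}} for every HOST element."""
    root = ET.fromstring(xml_text)
    table = {}
    for host in root.iter("HOST"):
        table[host.get("NAME")] = {
            m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")
        }
    return table
```

Values are kept as strings here; a consumer would convert them using the TYPE attribute if it needs numbers.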
Nagios
“Nagios is a host and service monitor designed to inform you of
network problems before your clients, end-users or managers do.”
It is designed to run on Linux
-works fine under most *nix variants
The monitoring daemon runs intermittent checks on the hosts and
services an administrator specifies, using external "plugins" which
return status information to Nagios.
If a problem arises in a cluster or a grid, the daemon can send notifications
to administrative contacts in a variety of ways (email, instant message).
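A plugin in this sense is just an external program that prints a status line and exits with a status code; by Nagios convention 0 means OK, 1 WARNING, 2 CRITICAL and 3 UNKNOWN. The Python sketch below shows the shape of such a check for disk usage. The thresholds, the message wording and the function names are assumptions made for illustration, not part of any shipped plugin.

```python
import shutil

# Conventional Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

def classify(used_pct, warn=80.0, crit=90.0):
    """Map a disk usage percentage to a plugin return code."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK

def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (code, one-line message) for the file system holding path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:  # path missing or unreadable
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    used_pct = 100.0 * usage.used / usage.total
    code = classify(used_pct, warn, crit)
    return code, f"DISK {LABELS[code]} - {used_pct:.1f}% used on {path}"

def main():
    """Print the status line and return the code for use as the exit status."""
    code, message = check_disk()
    print(message)
    return code
```

Run as a standalone plugin, the script would finish with `sys.exit(main())` so that the daemon can read the return code.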
Global Investment Performance Standards
“The principal goal of the Investment Performance Council is to have all countries
adopt the GIPS standards as the standard for investment firms seeking to present
historical investment performance”
GIPS compliance, acting as a “passport”, allows firms to enter the arena of
investment management competition on a global basis and to compete on an equal
footing.
Today, 25 countries throughout North America, Europe, Africa, and the Asia
Pacific Region have adopted the GIPS standards.
-Standard interface for presenting monitoring information about a resource
-GIP sensor suite used as reference implementation
-Information about grids to be returned in LDIF (LDAP Data Interchange Format),
the standard data interchange format for representing LDAP directory content as
well as directory update requests
-GLUE Schema: abstract modeling for Grid resources and mapping to concrete
schemas that can be used in Grid Information Services
-Monitoring and Discovery System (MDS) 2.4 GRIS
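Since the monitoring information is returned as LDIF, a consumer has to turn "attribute: value" records into something it can query. The Python sketch below parses a simplified subset of LDIF (comments, blank-line record separators and folded continuation lines, but not base64 values or change records). The sample record is illustrative only: it uses GLUE-style attribute names with an invented host, and is not output from a real site.

```python
# Illustrative record with GLUE-style attribute names; the host is made up.
SAMPLE = """\
# illustrative record, not output from a real site
dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-pbs,mds-vo-name=local,o=grid
objectClass: GlueCE
GlueCEUniqueID: ce.example.org:2119/jobmanager-pbs
GlueCEStateFreeCPUs: 12
GlueCEStateRunningJobs: 30
"""

def parse_ldif(text):
    """Parse a simplified LDIF subset into a list of {attribute: [values]}."""
    # Unfold continuations: a line starting with one space extends the previous line.
    unfolded = []
    for line in text.splitlines():
        if line.startswith(" ") and unfolded:
            unfolded[-1] += line[1:]
        else:
            unfolded.append(line)
    records, current = [], {}
    for line in unfolded:
        if not line.strip():  # a blank line ends the current record
            if current:
                records.append(current)
                current = {}
            continue
        if line.startswith("#"):  # comment line
            continue
        attr, _, value = line.partition(":")  # split on the first colon only
        current.setdefault(attr.strip(), []).append(value.strip())
    if current:
        records.append(current)
    return records
```

Each attribute maps to a list because LDIF allows an attribute to appear more than once in a record.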

Sources:
http://www.sura.org/cookbook
http://monalisa.cacr.caltech.edu/monalisa.htm
http://www.globus.org/toolkit/docs/4.0/info/key-index.html
http://ganglia.sourceforge.net/
http://www.nagios.org/about/
osg-docdb.opensciencegrid.org/0004/000499/001/OSGMiddleware.pp
