
Operational Excellence in IT Service Management

Mehmet Özgür Depren


Technical Sales Manager - IBM Middleware
The Next IT Operations Focus: Big Data

“Focus on operational objectives has seen significant uptick since 2013”
IBM Continues to Invest Heavily in Analytics
• More than $17B in acquisitions since 2005; more than any other company
• Most comprehensive portfolio, from business to IT analytics, while most other vendors offer only point solutions
• C&SI’s suite of analytics products leverages best-of-breed capabilities from across all of IBM’s portfolio

[Timeline, 2005 to 2015: Autonomic Operations, Developer Productivity, Deep Compression, pureXML, pureScale, Pervasive Content, Stream Computing, Decision Management, Content Analytics, Advanced Case Management, Workload Optimized Systems, Social Analytics/Consumer Insight]
IT Operations Analytics Solves New Challenges
Reducing & Preventing Outages and Slowdowns for the 24/7 Application World

[Figure: End users, Devices, The Network, Web Servers, App Servers, Databases]

IT Operations Analytics can help

1 Never set a performance threshold manually again
2 Identify potential issues before customers are impacted
3 Isolate the problem through analysis of all your IT data
Understanding IBM Operations Analytics
Business Outcome: Proactive Outage Avoidance | Faster Problem Resolution | Optimized Performance

Capabilities:
• Predict: Predict problems before they occur
• Search: Search quickly across massive amounts of data
• Optimize: Optimize across your IT app infrastructure

Operations Analytics

IBM Big Data Platform: Streams | SPSS | Cloud Insights | InfoSphere BigInsights | Rave | Watson

IBM or 3rd Party Solutions: Application Performance Monitoring | System & Log | Transactions | Assets & Workorders | Alerts, Alarms & Events | Documentation

Operational Environment: Applications | Systems | Workloads | Wireless | Network | Voice | Security | Mainframe | Storage | Assets
IBM Solution for IT Operations Analytics
Our Capabilities

Predict
Predict problems before they become service impacting
Avoid Outages While Reducing Threshold Management Costs
Consolidated Communications detects 100 percent of their major incidents, including silent failures, and eliminated the human intensive task of managing manual thresholds, saving $300,000 annually

Search
Diagnose application & infrastructure issues using all your operational data
Resolve Problems Faster
Barclays Bank was able to search and diagnose problems 60% faster to quickly resolve application and infrastructure issues. In addition, they identified customer patterns from log data and applied this to channel intelligence

Optimize
Ensure your IT infrastructure is operating as efficiently as possible
Improve Operational Efficiency
Advanced events analytics has allowed Claranet to reduce the number of trouble tickets and focus more time and resources on what truly matters to their customers.

Why IBM?
• 60%: Faster creation of custom high impact mobile ready operations dashboards
• 50%: Faster application diagnostics
• 30%: Reduction in operator event load
• 20%: Reduction in storage requirements over competitive offerings
• #1: Leadership position in Operations Management solutions
IBM Operations Analytics – Predictive Insights
Predict

Challenge: Reacting to performance thresholds is not enough. IT staff must become proactive to ensure mission-critical apps never go down.

Automated Threshold Maintenance
No complex manual intervention to set up and maintain, with 5 times faster processing

Anomaly Detection
Alerting before potential issues become service impacting, enabling IT to shift from reactive to proactive

On-Prem and SaaS
Predictive Insights now available as a Service, providing additional value to our Performance Management solutions

Supports Heterogeneous Environments
Out-of-the-box integrations to IBM APM/ITM or 3rd-party monitoring solutions
Why aren’t operations teams proactive today?
• Too much data to analyze manually
• Existing analytic techniques, such as standard thresholds, are not up to the task
• They cannot detect problems while they are emerging (before business impact)
• Set the performance threshold too high and there is insufficient warning before total failure
• Set the performance threshold too low and there is so much noise that everything is ignored

If there is no early detection before the outage, operations teams can only react while the outage is already in effect and already costing money.
Learn relationships between metrics without static thresholds

• Predictive Insights learns the normal historical range of each metric

• It raises an alarm if the metric falls outside this range

Watson DNA inside
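The idea of learning a normal range rather than fixing a static threshold can be sketched in a few lines. This is only an illustration and not the Predictive Insights algorithm, which the slides do not describe: it assumes a simple band of mean plus or minus `k` standard deviations, learned separately for each hour of the day. The names `learn_baseline` and `is_anomalous` and the default `k=3.0` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_baseline(history, k=3.0):
    """Learn a (low, high) normal range per hour of day.

    history: iterable of (hour_of_day, value) samples.
    k: band width in standard deviations (an assumed tuning choice).
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)

    baseline = {}
    for hour, values in by_hour.items():
        mu, sigma = mean(values), pstdev(values)
        baseline[hour] = (mu - k * sigma, mu + k * sigma)
    return baseline


def is_anomalous(baseline, hour, value):
    """Return True when value falls outside the learned range for that hour."""
    low, high = baseline[hour]
    return value < low or value > high


# Toy history: quiet at 03:00, busy at 12:00.
history = [(3, v) for v in (10, 12, 11, 9, 10)]
history += [(12, v) for v in (95, 100, 105, 98, 102)]
baseline = learn_baseline(history)

print(is_anomalous(baseline, 12, 101))  # False: normal midday load
print(is_anomalous(baseline, 3, 101))   # True: same value is abnormal at 03:00
```

A single static threshold of, say, 110 would miss the 03:00 spike entirely, while a threshold of 15 would alarm all day; a per-hour range avoids both.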

European Telco – Flatline
Stopped (crashed) application: regular load absent.

Targeting Situation Detections

Customer relationship management system for a large telco. 100 applications monitored by a Compuware system (40 million metrics). In this example, the regular load on one of the servers has changed, indicating an application problem.
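A flatline of this kind, where load that is normally present disappears, can be caught with a very small check. The sketch below is a hypothetical illustration, not the product's detection logic: `detect_flatline`, `expected_min`, and `min_run` are assumed names, and the floor for normal load is assumed to have been learned elsewhere.

```python
def detect_flatline(samples, expected_min, min_run=3):
    """Return the index where a flatline starts, or None if there is none.

    samples: metric values in time order.
    expected_min: lowest value seen under normal load (assumed known).
    min_run: consecutive low samples required before alarming (assumed).
    """
    run = 0
    for i, value in enumerate(samples):
        if value < expected_min:
            run += 1
            if run == min_run:
                # Report the first sample of the low run, not the last.
                return i - min_run + 1
        else:
            run = 0
    return None


requests_per_min = [120, 118, 125, 0, 0, 0, 0]
print(detect_flatline(requests_per_min, expected_min=50))  # 3
```

Requiring several consecutive low samples keeps a single dropped measurement from raising an alarm.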
European Gambling Website – Adaptive Threshold
High disk latency

Automated Dynamic Thresholds and Early Detection

A gambling website application monitored by HP. In the run-up to a busy sporting event, traffic increased, putting stress on the system and degrading the customer experience. With PI's early detection, the latency issue could have been tackled before customers were affected.
Large US Bank– Adaptive Threshold
Connection Leak

Automated Dynamic Thresholds and Early Detection

These are WebSphere metrics taken from the CA Wily performance management system. The number of actual connections to the WebSphere application server has increased dramatically. The poolsize and bytesInUse metrics are also affected, indicating either increased demand or a problem with connections not being freed up.

Insight
Poolsize and bytesInUse on the same node are also behaving anomalously at the same time and are related to each other.
European Bank – Significant trend.
Disk Thrashing

Targeting Situation Detections

File server under stress as file control operations and bytes per second increase. This sudden change can be traced back to a patch that was applied.
A Sample of technologies Predictive Insights integrates with

IBM ITM/TDD & IBM APM

IBM OMEGAMON

HP BAC, IBM TNPM


Topaz

Aircom Optima
Predictive Insights as a Service
Performance Management + Predictive Insights

• Integrated threshold automation and maintenance
• Anomaly detection
• Get ahead of potential application and resource outages
• Learn, Explore, and Try
• Continuous Delivery
Operations Analytics – Log Analysis
Search

Challenge: Diagnosing service problems in applications and the infrastructure supporting them involves quickly analyzing incredible amounts of both structured and unstructured data

Breadth of Searchable Data
Search across all of your IT operational data to quickly resolve issues

Expert Advice
Any competitor can isolate problems. IBM helps clients quickly resolve them.

Mainframe Support
Search System z (zLinux & zOS) logs in addition to all your other data

Embedded Analytics
Out-of-the-box integrations to IBM APM/ITM or 3rd-party monitoring solutions
IBM Operations Analytics – Log Analysis
Search
Collects large volumes of structured and semi-structured data and transforms it through analytics into actionable intelligence.

Stages: Collect, Consolidate, Normalize, Search and Visualize (with Insight Packs)
Data sources: Documentation, Logs, Metrics, Events
Users: IT Operations, App Support, Service Desk
Application owner: I got a trouble ticket on my application. I want to quickly find the root cause, fix it, and restore the app/service ASAP.

Current challenge: A large volume of data to collect and analyze, with manual correlation taking hours or days to find the root cause of the problem. Logs for the problem window cannot be found. Highly dependent on SME skills. It's an art.

• Core files (binary data)
• Logs, Traces, ..: [10/9/12 5:51:38:295 GMT+05:30] 0000006a servlet E com.ibm.ws.webcontainer.servlet.ServletWrapper service SRVE0068E:
• Events
• Metrics
• Transactions
• Config
Application owner: I got a trouble ticket on my app. I want to quickly find the root cause, fix it, and restore service ASAP.
Solution: IBM Operations Analytics – Log Analysis can provide insights from all data in clicks. The app owner can search through the data and leverage dashboards to find the root cause in minutes.

IBM Operations Analytics Log Analysis

• Metrics
• Expert knowledge
• Tickets
• Events
• Transaction details from App DB:

Tx#     date         status
108978  23-Jul-2013  started
108978  23-Jul-2013  To IN

• Logs:

[10/9/12 5:51:38:295 GMT+05:30] 0000006a servlet E com.ibm.ws.webcontainer.servlet.ServletWrapper service SRVE0068E: Uncaught exception created in one of the service methods of the servlet TradeAppServlet in application DayTrader2-EE5. Exception created : javax.servlet.ServletException: TradeServletAction.doSell(...)
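To make a log line like the WebSphere sample above searchable, it first has to be split into fields. The sketch below shows one way to do that with a regular expression. It is an illustration only and says nothing about how Log Analysis or its Insight Packs parse logs; the field names (`timestamp`, `thread_id`, `component`, `level`, `text`, `message_id`) are labels chosen here, and the patterns are assumptions fitted to the single sample line on the slide.

```python
import re

# Assumed layout, based only on the sample line:
# [timestamp] thread-id component level free-text
LOG_PATTERN = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s+"
    r"(?P<thread_id>[0-9a-fA-F]{8})\s+"
    r"(?P<component>\S+)\s+"
    r"(?P<level>[A-Z])\s+"
    r"(?P<text>.*)"
)

# Message codes such as SRVE0068E: letters, four digits, one severity letter.
MESSAGE_ID = re.compile(r"\b[A-Z]{4,5}\d{4}[A-Z]\b")


def parse_line(line):
    """Split one log line into a dict of fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    found = MESSAGE_ID.search(record["text"])
    record["message_id"] = found.group(0) if found else None
    return record


sample = (
    "[10/9/12 5:51:38:295 GMT+05:30] 0000006a servlet E "
    "com.ibm.ws.webcontainer.servlet.ServletWrapper service SRVE0068E: "
    "Uncaught exception created in one of the service methods of the "
    "servlet TradeAppServlet in application DayTrader2-EE5."
)
record = parse_line(sample)
print(record["level"], record["message_id"])  # E SRVE0068E
```

Once lines are reduced to records like this, searching for every `SRVE0068E` error in a problem window becomes a simple filter instead of a manual read through raw files.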
Out of the Box Insight Packs
• IBM Provided:
• IBM WebSphere Application Server
• IBM DB2
• Web Access Logs
• Windows Events
• SysLog
• Java Core
• IBM MQ Series
• IBM Integration Bus (Message Broker)
• Delimiter Separated Value (DSV) log files
• Partner Provided:
• Microsoft SharePoint, Microsoft Exchange, Microsoft SQL Server, Microsoft Active Directory
• Tivoli Storage Manager
• IBM Systems Disk Storage 8000
• IBM AIX Errpt
• IBM HTTP Server
• HP LiveSite, HP TeamSite
• Oracle Database
• VMware ESXi
• Oracle Siebel
https://developer.ibm.com/itoa/
IBM Netcool Operations Insight
Optimize
Modern Dashboards, Fully Mobile
Visualize the performance and health of your entire operations environment.
Out of the box Integration

• 98% Reduction in Critical events: ~22 critical & ~100 major events per week
• Improved focus and utilization of first- and second-line staff

Analytics to increase event value
• v1.1: 30% reduction in events to Operations
• v1.2: Almost 50% reduction in repeating events
• v1.3: 90% reduction for known event classes
Event Analytics – Seasonal Event Identification
Improve efficiency by identifying and resolving recurring problems
Large Bank
• 7% of Priority 1 tickets were raised by events that were highly seasonal
• 30% of lower severity tickets

• Report on event history identifies seasonal events sorted by confidence level and frequency
• Drill down shows time distributions of events; investigate peaks
• Can better align thresholds to seasonal peaks, reducing events
Seasonality Analysis of events
1 MS SCOM Health Service Heartbeat failures happen often on Sunday at 06:00, probably due to regular maintenance

2 A specific Oracle database is not accessible every day at 21:00, probably due to a daily restart or backup

3 A node is giving file system alerts every day around 01:00, probably due to a daily batch job
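Findings like these come from looking at when each event type occurs. A minimal sketch of that idea follows: it counts occurrences per hour of day and flags event types whose occurrences concentrate in one hour. This is an assumed, simplified stand-in for the product's seasonality analysis, which also considers weekdays and confidence levels; `seasonal_events`, `min_share`, and `min_count` are hypothetical names and values, and the event names and dates are made up for the example.

```python
from collections import Counter, defaultdict
from datetime import datetime


def seasonal_events(events, min_share=0.8, min_count=4):
    """Find event types concentrated in a single hour of the day.

    events: iterable of (event_name, datetime) pairs.
    Returns {event_name: (hour, share)} for event types whose most common
    hour accounts for at least min_share of their occurrences.
    """
    hours = defaultdict(Counter)
    for name, timestamp in events:
        hours[name][timestamp.hour] += 1

    result = {}
    for name, counter in hours.items():
        total = sum(counter.values())
        hour, count = counter.most_common(1)[0]
        if total >= min_count and count / total >= min_share:
            result[name] = (hour, count / total)
    return result


# Made-up history: one event fires daily at 21:00, the other at random times.
events = [("oracle_db_unreachable", datetime(2015, 3, day, 21, 0)) for day in range(1, 6)]
events += [
    ("disk_full", datetime(2015, 3, day, hour, 0))
    for day, hour in [(1, 2), (2, 9), (3, 14), (4, 20)]
]
print(seasonal_events(events))  # {'oracle_db_unreachable': (21, 1.0)}
```

An event flagged this way is a candidate for a maintenance-window suppression rule or a threshold aligned to the seasonal peak.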
Related Events Grouping
Relationships I know about
Known Event Analysis: Grouping and Correlation providing powerful situation management of active events
• Out of the box domain expertise for known event relationships
• Vendor and technology dependent
• Significant reduction of incidents presented to the operator
• Extendable by Business Partners and clients with no coding required
Event Analytics – Related Event Analytics
Relationships I don’t know about
Improve efficiency: reduce actionable events by grouping events that always occur together
Automatic detection of event clusters

Leverages machine learning to analyze the historical event archive and identify groups of events that always occur together
• Presents identified relationship to the Administrator
• Presents proposed automated actions
• Watch, Deploy, Archive or Do nothing
• Groups events in the Event Viewer

“It is very beneficial to have a tool that can turn historical event data into an event group with a single root event. It helps us turn the data into logic”

Increase operator efficiency by up to 90% with out-of-the-box alert reduction and advanced alert analytics
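Grouping events that always occur together can be illustrated with a simple co-occurrence check over a historical archive. The sketch below is not the machine learning used by the product, which the slides do not detail; it assumes fixed time windows and treats two event types as related only when they appear in exactly the same windows. `related_event_pairs`, `window_seconds`, and `min_windows` are hypothetical, as are the event names.

```python
from collections import defaultdict
from itertools import combinations


def related_event_pairs(events, window_seconds=300, min_windows=3):
    """Find pairs of event types that always appear in the same time windows.

    events: iterable of (event_name, epoch_seconds) pairs.
    Returns a list of (name_a, name_b) pairs in alphabetical order.
    """
    windows = defaultdict(set)  # event name -> window indexes it appeared in
    for name, timestamp in events:
        windows[name].add(int(timestamp // window_seconds))

    pairs = []
    for a, b in combinations(sorted(windows), 2):
        # Require identical window sets and enough history to trust the match.
        if windows[a] == windows[b] and len(windows[a]) >= min_windows:
            pairs.append((a, b))
    return pairs


# Made-up archive: two events fire together every hour, a third is unrelated.
events = []
for start in (0, 3600, 7200, 10800):
    events += [("link_down", start + 5), ("bgp_peer_lost", start + 20)]
events += [("cpu_high", 100), ("cpu_high", 5000), ("cpu_high", 9000)]

print(related_event_pairs(events))  # [('bgp_peer_lost', 'link_down')]
```

Fixed windows are a simplification: two related events that straddle a window boundary would be missed, which is one reason a real implementation would need something more robust.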
Future of Service Management
Visibility Control Automation

Real-time Analytics and Visualization

Problem Isolation Outage avoidance Optimization Insight & Care

Data Correlation Integration Predictive Analytics


Thank You
