Root Cause Analysis

Root Cause Analysis for Bottlenecks in Web Applications
www.symphonysv.com
By Sunil Gupta Symphony Services, Bangalore
Sunil Gupta is a Vice President at Symphony Services and leads the company's product verification and validation services, helping clients achieve global market leadership. Sunil has 23 years of product and system engineering experience.
WHITE PAPER
Abstract
As we enter the age of Cloud Computing, where more and more software is delivered as a Web-based service, user response time becomes the most common bellwether for software quality. If a user's wait time stretches beyond a typical three-second comfort zone, we all know that something's wrong.
There are two distinct approaches to this problem, and software engineers should be adept at both. The first is to develop the best and most efficient techniques for finding the bottleneck and fixing both the bottleneck and its root cause. The second discipline is to use our skills in development and innovation to build prevention into the DNA of every product. Each skill, tip, technique and bit of knowledge we acquire in problem analysis reinforces our ability to design prevention into every stage of a life cycle.
In this paper, we will show how one specific type of testing adds to the basic skill set of detecting a problem, fixing the problem, and then fixing the root cause of the problem.
While this paper focuses on techniques and best practices for testing response time bottlenecks at the application level, it should be considered as one of many analysis tools that adds to the ultimate goal of design-level prevention.
1. How to approach application performance

While several things affect application performance, such as throughput, CPU utilization, user load, and memory consumed, Website Response Time is what the user actually sees and reacts to. The commonly cited industry target for loading Web page data is three seconds, as users often break off their activity when the response time exceeds this limit. This holds true both for simple activities, such as logging on to an application, and for complex ones, such as creating an online report with multiple data dependencies.
Here we examine the probable causes of high response times, and demonstrate proven techniques for resolving them. The examples provided are from my first-hand experience.
Application performance should be improved by looking at these five performance criteria at the same time:
- Response time
- Throughput
- CPU utilization
- Memory consumed
- User load
Even though these criteria are interdependent, response time is what the user sees and reacts to, and as such it becomes the glaring issue.
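As a minimal sketch of how response time can be tracked against the three-second comfort zone, the snippet below summarizes a list of measured response times. The function name, the dictionary keys, and the choice of a nearest-rank 90th percentile are illustrative assumptions, not part of the original study.

```python
import statistics

# Threshold taken from the three-second comfort zone described above.
COMFORT_ZONE_SECONDS = 3.0


def summarize_response_times(samples, threshold=COMFORT_ZONE_SECONDS):
    """Summarize response-time samples (in seconds) against a threshold."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    # Nearest-rank 90th percentile: the smallest sample with at least
    # 90% of all samples at or below it (ceiling division for the rank).
    rank = max(1, -(-len(ordered) * 90 // 100))
    over = [s for s in ordered if s > threshold]
    return {
        "mean": statistics.mean(ordered),
        "p90": ordered[rank - 1],
        "max": ordered[-1],
        "fraction_over_threshold": len(over) / len(ordered),
    }
```

Reporting a percentile alongside the mean is a design choice: a handful of very slow pages can hide behind an acceptable average.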
Many applications have response time issues, but not all of them have gone through performance testing cycles that would show a poor response time for the whole application even with a single user. Other applications experience response time issues only under load.
In the performance testing cycle for any application, the user scenarios need to be studied. The scenarios which are the most important or frequently used are usually tested.
The user load is studied. Load can be characterized as either peak or normal load. Peak load is when the maximum number of users access the application; normal load is when there is not as much stress on the system.
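A rough sketch of how a given user load might be simulated is shown below. It assumes the scenario under test can be wrapped in a Python callable (here called `scenario`, a hypothetical name) and uses one thread per simulated user; a real performance lab would normally rely on a dedicated load-testing tool instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(scenario, users, iterations_per_user):
    """Run `scenario` concurrently and return per-call response times in seconds."""

    def one_user(user_index):
        # Each simulated user repeats the scenario and times every call.
        timings = []
        for _ in range(iterations_per_user):
            start = time.perf_counter()
            scenario()
            timings.append(time.perf_counter() - start)
        return timings

    # One worker thread per simulated user, so all users are active at once.
    with ThreadPoolExecutor(max_workers=users) as pool:
        per_user = list(pool.map(one_user, range(users)))
    # Flatten the per-user lists into one list of response times.
    return [t for timings in per_user for t in timings]
```

Running the same scenario once with the normal user count and once with the peak user count gives two sets of timings that can be compared directly.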
For this testing, the architecture as well as the network topology need to be understood. The network topology is needed to identify whether all the tiers are on the same machine or on different machines, whether the different machines are in different networks, and whether there is latency between the machines.
The hardware configurations of the machines are also studied. In the performance lab, engineers use this information to benchmark the application, simulate its functions, and then try to optimize the performance. Based on these results, engineers can determine an application's performance. This information is used for capacity planning for the organization.
If a Web application or a scenario, or an action within a scenario, is observed to have a high response time, how does the performance engineer, or in some cases the end user, determine the cause?
All Web applications are N-Tiered. N-Tiered applications have a logical split between the client, server, and database. All the business logic could be at the application server level. How then does the performance engineer or end user know in which tier lies the actual cause of the issue?
The challenge is to find not only the cause but the location of the problem. The problem could be at the Web server, or at the application server, or at the database layer, or it could be the application itself, or the network. It could even be at the browser level. We also have to watch that the other performance objectives, such as CPU utilization and memory consumed, do not degrade while trying to solve the response time issue.
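One way to narrow down the location is to break a single slow request into the time spent in each tier. The sketch below assumes such per-tier timings are already available (for example from server logs or browser timing data); the tier names and numbers in the usage note are hypothetical.

```python
def locate_slowest_tier(tier_seconds):
    """Return the tier with the largest share of response time, plus all shares.

    `tier_seconds` maps a tier name to the seconds spent in that tier
    for one request or scenario.
    """
    total = sum(tier_seconds.values())
    if total <= 0:
        raise ValueError("total time must be positive")
    # Share of the end-to-end response time attributable to each tier.
    shares = {tier: secs / total for tier, secs in tier_seconds.items()}
    slowest = max(shares, key=shares.get)
    return slowest, shares
```

For example, with 0.4 s in the browser, 0.2 s on the network, 0.3 s in the Web server, 0.9 s in the application server, and 2.2 s in the database, the function points to the database as the first place to look.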
2. Solutions for analyzing response time problems

The following schematic summarizes some of the techniques performance engineers can use to drill down into response time issues:

Start: Single or multiple users?

Multiple? No (the issue appears with a single user). Possible solutions:
  Database issue?
    a. Tune queries
    b. Tune database parameters
    c. Others
  Client issue?
    a. JavaScript or image file download size
    b. Object or image caching
  Application server issue?
    a. Application consuming huge memory at startup
    b. Object pool settings
  Network issue?
    a. All tiers on the same subnet
    b. All tiers on the same link speed and duplex setting (not half duplex; ideally all should be full duplex)
  Others

Multiple? Yes (the issue appears under multi-user load). Possible solutions:
  Database issue?
    a. Tune database parameters
    b. Statistics missing and need to be collected again
  Application / Web server issue?
    a. HTTP connections on the Web server to be tuned as per the number of users
    b. Database connection and object pool
  Files handled in code?
    a. Parsing logic needs to be looked at
    b. The limit on the number of file handles
  Others
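The schematic above can be encoded as a simple lookup table, sketched below. The check texts are taken from the schematic; reading the first branch as the single-user case and the second as the multi-user case is an assumption about the original diagram, and the key names and function name are illustrative only.

```python
# Checks copied from the schematic. The single-user / multi-user split of the
# two branches is an assumption about how the original diagram was laid out.
CHECKS = {
    ("single", "database"): ["Tune queries", "Tune database parameters"],
    ("single", "client"): [
        "JavaScript or image file download size",
        "Object or image caching",
    ],
    ("single", "application server"): [
        "Application consuming huge memory at startup",
        "Object pool settings",
    ],
    ("single", "network"): [
        "All tiers on the same subnet",
        "All tiers on the same link speed, ideally full duplex",
    ],
    ("multiple", "database"): [
        "Tune database parameters",
        "Collect missing statistics again",
    ],
    ("multiple", "application/web server"): [
        "Tune HTTP connections on the Web server to the number of users",
        "Database connection and object pool",
    ],
    ("multiple", "file handling"): [
        "Review parsing logic",
        "Check the limit on the number of file handles",
    ],
}


def possible_solutions(concurrent_users, area):
    """Return the schematic's suggested checks for a load level and problem area."""
    load = "multiple" if concurrent_users > 1 else "single"
    # Anything not listed falls into the schematic's "Others" bucket.
    return CHECKS.get((load, area), ["Others: no specific check in the schematic"])
```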
Let's work through an example to see how this works.

We're going to demonstrate how our analysis helps find the root cause of a performance bottleneck, quickly generating the best resolution of the performance issues.
In this real-life example, one of our clients had severe performance issues with one of their flagship products. Specifically, they were experiencing the following problems:
1. Customers were concerned that the application would not scale properly, and the client had no data.
2. Customers were experiencing performance degradation, and our client could not tell them what the cause was and whether or not it might happen again.
3. End users were frustrated that pages would not load.
4. The client did not know the performance limits of their product.
The client asked Symphony to address these performance issues. Our Performance, Scalability, and Reliability team collected relevant data from previous performance runs, and ran additional tests to gather more data for the analysis. The following charts show the results of the performance runs.
[Charts: CPU utilization during the performance runs, broken down into User CPU, System CPU, Idle time, and Wait time]
From these charts we made the following observations:
1. High CPU usage throughout the run (>95%)
   a. High response times were due to high CPU usage
   b. Low transactions per second
   c. Low hits per second and low throughput
   d. Requests were getting queued
   e. High page swaps
   f. A few operations were taking as long as 24 to 35 seconds
2. System CPU was taking ~45%, leaving the application with ~55%
3. Wait time was ~15%, spent waiting for resources to process the application request
Based on the above observations, we made the following analysis:
1. Business processes were highly CPU intensive
   a. Analysis
      i. We ran tests to profile the different business processes
      ii. We isolated two business processes from the group that were taking more CPU cycles for processing. The purple and green curves in Chart-I show unusually high CPU time for these two business processes
      iii. We then ran tests after isolating those two business processes, confirming that CPU utilization had come down to 40% and the process queue length had come down from 45 to 7
   b. Recommendation
      i. Optimize the code of the two business processes
   c. Results
      i. After optimizing the code of the two business processes that had been causing the problems, we ran tests exercising all the business processes. This time overall CPU utilization remained below 50%
2. Higher page swaps due to low memory
   a. Analysis
      i. High page swapping was due to high garbage collection leading to a low-memory condition
      ii. We monitored garbage collection during the test run, and noted it was very high
   b. Recommendation
      i. Increase RAM size by another 500 MB and increase the JVM heap size
   c. Results
      i. This resulted in reduced garbage collection and a significant reduction in page swaps
3. Once the performance issues were resolved, Symphony ran further tests to baseline the application performance with different load types and mixes
4. We also ran capacity planning tests for the Application Server so we could recommend different requirements for the client's enterprise customers
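The profiling step in the analysis, picking out the business processes that consume far more CPU than the rest, can be sketched as follows. The "more than twice the median" rule and the process names in the usage note are assumptions made for illustration; the original engagement identified the two processes from the charts.

```python
import statistics


def cpu_heavy_processes(cpu_seconds_by_process, factor=2.0):
    """Return (name, cpu_seconds) pairs whose CPU time exceeds factor x median.

    The result is sorted with the heaviest process first.
    """
    if not cpu_seconds_by_process:
        return []
    median = statistics.median(cpu_seconds_by_process.values())
    heavy = [
        (name, secs)
        for name, secs in cpu_seconds_by_process.items()
        if secs > factor * median
    ]
    return sorted(heavy, key=lambda item: item[1], reverse=True)
```

Given hypothetical profile data such as login 12 s, search 15 s, create_report 95 s, export 80 s, view_item 10 s, and logout 5 s, the function singles out create_report and export as the candidates for code optimization.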
4. Top Five Response Time Bottlenecks

Based on our years of performance testing more than 50 major product releases for major corporations, we have learned that the top five causes of response time bottlenecks are:
1. Inefficient session and state management
2. Thread contention
3. Poor logical database design, such as bad table design
4. Improper normalization and inefficient indexing of tables
5. Inefficient queries
We've also developed a broad base of data and experience that tells us that the top five indicators of response time bottlenecks are:
1. CPU time often exceeding the 80% threshold, or CPU idle time falling below the 20% threshold
2. Processor queue length of two or more processes sustained over time
3. Unusually high context switches per second
4. Low throughput or transactions per second, indicating that the CPU is unavailable to service a request
5. Recurring timeout errors (such as HTTP 408 or 504), indicating high response times leading to the expiration of timers in applications and servers
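Two of these indicators have explicit thresholds and can be checked mechanically from monitoring samples, as in the sketch below. The 80% CPU and queue-length-of-two thresholds come from the list above; treating "often" as at least half of the samples and "sustained" as three consecutive samples are assumptions that would need tuning per system.

```python
def sustained(samples, predicate, min_run):
    """Return True if `predicate` holds for at least `min_run` consecutive samples."""
    run = 0
    for sample in samples:
        run = run + 1 if predicate(sample) else 0
        if run >= min_run:
            return True
    return False


def bottleneck_indicators(cpu_percent, queue_length, cpu_fraction=0.5, min_run=3):
    """Flag the CPU and processor-queue indicators from monitoring samples.

    cpu_fraction and min_run are assumed interpretations of "often" and
    "sustained"; they are not values given in the paper.
    """
    if not cpu_percent:
        raise ValueError("cpu_percent must not be empty")
    hot_samples = sum(1 for c in cpu_percent if c > 80)
    return {
        "cpu_often_over_80": hot_samples / len(cpu_percent) >= cpu_fraction,
        "queue_sustained_2_or_more": sustained(queue_length, lambda q: q >= 2, min_run),
    }
```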
And finally, we know very well that any one or more of these indicators can lead us to any one or more root causes. This is why developing repeatable, predictable methods for analyzing performance response time problems has become a key part of our engineering performance testing and design work.
5. Future Direction and Conclusion

All applications should go through rigorous performance testing right from the start of the development cycle. This avoids having to solve architectural or design issues later on, when changes to the product cost more in time and money. The data collected during performance tests also becomes an important and useful reference later.
2475 Hanover St, Palo Alto, CA 94304 Phone: 650.935.9500, Fax: 650.935.9501, info@symphonysv.com, www.symphonysv.com Copyright 2009 All rights reserved. Any reproduction, modification, publication, transmission, transfer, sale, distribution, performance, display, dissemination or exploitation of this document or the information contained herein, directly or indirectly, and in any other medium, wholly or in part, without the express written permission of Symphony Services Corp. is strictly prohibited. Symphony Services is the owner of the intellectual property rights contained in the document.
