
2008 IEEE International Conference on Web Services

Dynamic Thread Count Adaptation for Multiple Services in SMP Environments

Takeshi Ogasawara
IBM Tokyo Research Lab
Shimo-tsuruma 1623-14, Kanagawa, Japan
E-mail: takeshi@jp.ibm.com

Abstract

We propose a dynamic mechanism, thread count adaptation, that adjusts the thread counts that are allocated to services for adapting to CPU requirement variations in SMP environments. Our goal is to increase the maximum throughput available on a system that has multiple dynamic content services while meeting different service time criteria for these services in dynamic workloads. Our challenge is to significantly improve response times for dynamic content on a busy, well-tuned, thread-pool-based system without prioritizing any specific services. Our experiments demonstrate that a prototype using our approach on J2EE middleware quickly (around every 20 ms) adjusted the thread counts for the services and that it improved the average 90th-percentile response times by up to 27% (and 22% on average) for the SPECjAppServer2004 benchmark.

1. Introduction

Multiple application services including Web services can be supported on a single shared-memory multiprocessor (SMP) server. The capacity of such a server is increased by recent processors that run many hardware threads on a single chip. Recently, even an entry-level server with tens of hardware threads (for example 64 threads on a Sun Niagara 2) can handle tens of thousands of clients. Many modern application services provide dynamic content, which is created at runtime and which consumes variable amounts of the system's resources. A typical example of dynamic content is webpages that are generated by application code that follows the Java Enterprise Edition standard. The code is executed by threads on application servers [41, 42]. Application servers receive requests from clients, dispatch them to threads to run the code that corresponds to the request types, and send the calculated data back to the clients.

To optimize the performance, many application servers [4, 28, 20, 35, 36] use thread pools for services [25]. A set of threads is created and reused for each service. With help from asynchronous I/O [39, 9], the arriving requests are delivered to a limited number of threads in the thread pools. Requests that are dispatched to a service but not immediately handled by a thread wait for a thread in a service queue. An example application server is shown in Figure 1.

[Figure 1. Multiple services and thread pools: clients send service requests over the network to the server, where requests wait in per-service queues and are processed by threads from the corresponding thread pools.]

Our goal is to improve the maximum total throughput of multiple dynamic content services of a dynamic workload that can be handled by a given multiprocessor server, where thread pool sizes are optimal on average while still meeting the time service factor (TSF) targets¹ (or QoS criteria) of these services. In other words, for a busy server that currently can handle M clients at maximum, we want to support an additional N clients while still meeting the TSF targets by improving the QoS. We seek to achieve this without prioritizing any requests of specific types (or changing the transaction mix).

¹ A typical example of a TSF target is the 90th-percentile response time, which specifies a time limit for 90% of the total responses.

Our challenge is to significantly improve response times for dynamic content on a busy, well-tuned, thread-pool-based system without prioritizing any specific services. According to queuing theory, additional clients on a busy system will exponentially degrade the response times. Therefore, to continue meeting the TSF targets with additional clients, the response times must be significantly improved. For application code that creates dynamic content, threads can be blocked when they access the remote database, and such blocked threads affect the CPU resources actually allocated to services. The response times of the database accesses are unpredictable. Also, it is difficult to predict the execution times of the application code at runtime. Therefore, along with dynamic changes in the request arrival rate and transaction mix, it is a challenge to adapt CPU shares for dynamic content compared to adapting CPU shares for static content [2]. Also, we need to frequently adapt the distribution of the shared CPU resource to the changes in CPU demands by controlling the threads, since changes in CPU demand can be observed frequently, such as every 20 ms.
However, we should not increase the total number of threads to counteract decreases in the number of threads, since the total number of threads should be minimized for optimal performance [39, 40]. Furthermore, an improvement that uses only middleware is useful, since middleware, in particular when running on a virtual machine such as a Java VM, is portable to multiple platforms. Prior research on QoS management controlled the rate of requests that can enter the service queues [39], terminated heavy requests early [41], degraded the service level [1], prioritized specific services [2, 38], allocated threads to bottleneck stages [39], ensured process bandwidth by the CPU scheduler [22], or isolated the performance of services when using the excess capacity [2]. These approaches alone are not sufficient to resolve our problem.

We propose a dynamic mechanism, thread count adaptation, that detects overloaded services and quickly resolves overloads by moving threads from the pools for non-overloaded services to the pools for overloaded services. This approach is based on our observation that, on a busy application server, overload constantly occurs and the overloaded services change quickly over time. Such overloads are due to blocked threads as well as the dynamic nature of the workload. Addressing such overloads will become more important as applications and middleware become more dynamic and complex.

In summary, the contributions of our paper are as follows:

• We designed a dynamic mechanism that can quickly address overloads that constantly occur in multiple services on a busy multithreaded server in an SMP environment.

• We demonstrated that a prototype of our approach actually responds to such overloads by controlling the threads.

• We demonstrated that addressing such overloads can significantly improve the average 90th-percentile response time by up to 27% (22% on average) for an industry-standard benchmark, SPECjAppServer2004, on a practical 3-tier system. With this extra room under the TSF targets, we successfully supported additional clients on a busy server and sustained 3% more throughput.

The rest of this paper is organized as follows. We first discuss related work. We next explain the model of our target thread-pool-based system. Then we describe the main transient-overload problem occurring in the system, followed by our approach to solving the problem. Finally, we evaluate a prototype of our approach using a benchmark and conclude the paper.

2. Related Work

In this section, we first discuss the prior research on QoS management. Next we focus on the prior work that addressed thread pool control, which is the topic most closely related to our approach.

Admission control limits the number or rate of requests entering the server [39, 33, 2, 38, 8, 12, 5, 32, 10]. This approach will reject additional requests as we increase the workload. Service differentiation prioritizes specific services over the other services [43, 33, 27, 2, 38, 37, 34, 3]. This approach is not applicable in our scenario, in which the services have different TSF targets but have equal priority. Selective early request termination [41] aborts heavy requests. Since it can change the transaction mix, we cannot take this approach. Service degradation [1, 2, 38, 19] avoids server overload by reducing the number of transferred data objects. It is not applicable to our services, whose service levels cannot be degraded. Dynamic resource share adjustment [22, 6] periodically adjusts the CPU shares for services in the CPU scheduler so that the TSF targets can be met. This cannot appropriately change CPU shares in many scenarios covered by our problem description, because threads that create dynamic content can be blocked for many reasons (such as remote database operations, lock contention, or garbage collection [15]). Excess-capacity sharing [2] ensures performance isolation among services when it allocates unused system capacity to the services that use the system beyond their pre-allocated shares. It is not applicable to our scenario, where there is no unused capacity and the required CPU shares are changing. We accept that request scheduling [7, 11, 8, 26, 42, 21, 34], which reorders requests in the queues to improve the response time of short services, is already being done. Though this approach has been studied mainly for static data, it could complement our approach. We also accept that approaches that reduce the execution time by using various techniques such as content caching are already being used.

Schmidt and Vinoski presented a design for CORBA based on thread pools to improve server run-time performance by removing the overhead of thread creation [25]. Thread pools are widely used in many servers [4, 24, 29, 18, 28, 20, 35, 36].

Ling et al. analyzed the optimal fixed size of a single thread pool for a given probability distribution of requests [16]. Their system suffers from the overhead of creating additional threads whenever the number of arriving requests exceeds the thread pool size. Our target systems do not incur this overhead, since requests will wait for pool threads in service queues if the CPU is busy.

Xu and Bode showed an approach for dynamic adjustment of a thread pool size from a given initial size [40]. They changed the thread pool size so that the total time in which threads are active but not dispatched to processors was reduced. Their approach can find an optimal pool size, starting from any initial pool size. However, the benefit is small if the initial pool size is already near optimal.
Welsh et al. proposed thread pool controllers [39] on a staged event-driven architecture (SEDA). They increase the thread pool size for each stage up to the maximum number in order to allocate more CPU time whenever the TSF target of the stage is not met. Threads are removed if they become idle. Li et al. adjusted the thread pool sizes on the SEDA so that the average response time met the TSF targets, by supposing that the throughput was proportional to the thread pool size and that the average response time could be controlled by the throughput [15]. These approaches control the thread pools for stages that depend on each other. Since a single task is split into stages on the SEDA, the output rate of each stage is matched with the input rate of the next stage [14]. In contrast, we are controlling thread pools for independent services. Also, the throughput is saturated on busy servers.

Lu et al. proposed a control-theoretic approach based on linear system models for controlling a pool of Web server processes [17]. They reported that they could not control fluctuations in the waiting times of requests because of nonlinear properties. Also, they showed that a system model based on low-load conditions will break down [38]. Our approach can respond to nonlinear properties such as thread blocking on a busy server.

Pyarali et al. proposed thread borrowing, which moves threads from pools with low priorities to pools with high priorities in a real-time system, RT-CORBA, on a single server [23]. This solves a priority inversion problem in which services with high priorities have no pool threads, causing them to stop functioning even though services with low priorities are still running. Since we dynamically determine which service should be prioritized based on the TSF, the prioritized services can change in our approach, while the prioritized services are fixed in RT-CORBA.

Kourai et al. proposed a framework for application-level thread scheduling for J2EE applications [13]. They can insert scheduler code into the existing code without rewriting programs. They demonstrated that a scheduler can prioritize threads in a high-priority group over those in a low-priority group. In our prototype, we modified the middleware to obtain middleware-specific data such as the average wait times for the service queues. Our mechanism that decreases the number of active threads is similar to theirs.

Much work has been done in the area of proportional-share scheduling to assign the CPU resources to services in the proportions specified by applications, at the kernel level and at the application level. The fixed proportions of the CPU resource allocations among the services work efficiently if the workload of each service is static and the CPU resources assigned to the services do not change. For our target applications, the workload is dynamic and the CPU resource allocations among the services can change. We can achieve better performance than these approaches by dynamically adjusting the CPU resource allocations among the services.

3. Overview of Thread-Pool Based Middleware

Before discussing our problem in CPU sharing on thread pools, we first explain how threads handle requests, how the thread pool size is set, and how the actual CPU shares for services constantly change.

3.1. Thread Pool Bound to Service

Services have service queues and their own thread pools. A thread pool is a set of threads that are created in advance and reused by the middleware to reduce the overhead of thread creation, which is generally large for an operating system.

Threads in each thread pool obtain requests from the service queue, execute the application code that provides the service, and send the results back to the clients. Threads are not bound to connections, in contrast to the process-per-connection model of the widely used Apache. After threads complete the requests, they obtain the next requests and continue processing. On a busy server, where the CPUs are fully used, threads can usually obtain requests immediately, since the service queues tend to have requests that are waiting for threads (or CPUs).

A server can have multiple thread pools. In practice, many vendors [20, 35, 28, 36] use multiple thread pools for multiple services on a single server when submitting their SPECjAppServer2004 results [30].
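To make the request-handling loop described in this subsection concrete, the following is a minimal sketch of a pool thread bound to one service. The types and names (PoolWorker, Request, Handler, requestStop) are ours for illustration only and are not taken from WAS or any other product; the stop check between requests anticipates the thread removal protocol described later in Section 5.3.

import java.util.concurrent.BlockingQueue;

// Sketch of a pool thread bound to one service (hypothetical types, not WAS code).
final class PoolWorker implements Runnable {
    interface Request { void reply(Object result); }  // hypothetical request abstraction
    interface Handler { Object execute(Request r); }  // runs the application code of the service

    private final BlockingQueue<Request> serviceQueue; // the service queue of this pool
    private final Handler handler;
    private volatile boolean stopRequested = false;    // set by the thread count adaptor (Section 5.3)

    PoolWorker(BlockingQueue<Request> queue, Handler handler) {
        this.serviceQueue = queue;
        this.handler = handler;
    }

    /** Asks this worker to become idle after it finishes its current request. */
    void requestStop() { stopRequested = true; }

    @Override public void run() {
        while (!stopRequested) {                        // checked only between requests
            final Request req;
            try {
                req = serviceQueue.take();              // rarely blocks on a busy server
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            Object result = handler.execute(req);       // may block on database calls or locks
            req.reply(result);                          // send the calculated data back to the client
        }
    }
}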
3.2. Thread Pool Size

The thread pool size can be larger than the number of processors if threads can be blocked waiting for remote data and other threads. For example, the application code can communicate with a database to create dynamic content and acquire the lock of a hash table for mutual exclusion. If the thread pool size is the same as the number of CPUs, then the CPUs with blocked threads become idle. Therefore, we need additional threads to fully utilize the CPUs.

The thread pool size tends to reach the maximum value and stay there on a busy server even if the thread pool size is dynamic. The thread pool size can start at a minimum value and increase to a maximum value. Threads are removed if they become idle. On a busy server, however, threads rarely become idle. Therefore, application servers may fix the thread pool size for a busy server to reduce the overhead of managing the number of threads.
The thread pool size should be minimized as long as all of the CPUs are fully utilized. Excess threads can degrade the throughput and response time [39, 40]. The thread pool size is initially estimated based on experience (e.g., two times the number of processors [16]) and tuned via experiments. An optimal thread pool size depends on many factors in the system environment: the characteristics of the application code, the implementation of the runtime system (middleware, Java VM, and operating system), and the transaction mix, as well as the number of processors.

3.3. CPU Time Sharing among Thread Pools

The CPU shares that are allocated to services constantly change, although the thread pool sizes tend to be fixed as explained in Section 3.2. On a server with multiple thread pools, the thread pool sizes greatly affect how much CPU time is allocated to services. The total CPU time is shared among multiple thread pools (or services). When a thread in the thread pool for service A is blocked and a thread in the thread pool for service B is in turn dispatched to a CPU, a slot of CPU time moves from service A to B. Suppose that many threads for service A are blocked waiting for the remote database during some period. Service A then cannot use its average fraction of the total CPU time. When the blocked threads receive the responses from the database, service A can again use its average fraction of the total CPU time.

4. QoS Problem for Multiple Thread Pools

To achieve the optimal maximum performance on a server, one problem in the current CPU allocation methods is that the CPU shares for services are not proportional to the actual computational power required by those services. For each service that cannot obtain a sufficient CPU share, the QoS (the TSF targets) is degraded, since the waiting times in the service queues become long.

There are two major sources of this problem. First, the computational power required by multiple services constantly changes. The required computational power depends on the workload characteristics, such as the mixture ratios of request types and the arrival rates of requests. For each service, the workload characteristics constantly change in the real world. With multiple independent services, such dynamic workloads are combined together and the characteristics become more complex. Second, the number of threads that can be dispatched to each CPU constantly changes, as explained in Section 3.3. The number of ready threads determines the actual CPU shares. However, if we simply add threads to compensate for each blocked thread, the excessive number of running threads will cause performance degradation, since the threads that calculate dynamic content can be blocked many times during the processing of each request.

By addressing this problem with our approach, thread count adaptation, we can optimize the maximum total throughput of multiple services, which is currently done by tuning the thread pool sizes based on the averaged behavior of threads and workloads.

5. Dynamic Thread Count Adaptation

We adjust the number of pool threads based on the CPU requirements of the services. Figure 2 shows the structure of our approach. There are three main components: the average wait time profiler (AWTP), the overload monitor (OM), and the thread count adaptor (TCA). We explain how these components work to change thread pool sizes depending on the demands for CPU resources.

[Figure 2. Thread count adaptation structure: the average wait time profiler observes the enqueue and dequeue events of each service queue and reports the average wait times W_A and W_B; the overload monitor derives the changes ΔW_A and ΔW_B; and the thread count adaptor applies the adjustments b_A and b_B to thread pools A and B.]

5.1. Average Wait Time Profiler

The average wait time profiler (AWTP) tracks the average wait time of the requests waiting in each service queue. It needs to maintain only the total number of requests, N, and the total of the arrival times of the requests in the queue, T_arrival, for this purpose. The average wait time W is calculated as W = (Σ_{k=1}^{N} (T_current − t_arrival(k))) / N, where T_current is the current time and t_arrival(k) is the time when the k-th request in the queue arrived at the server. This formula can be simplified to W = T_current − T_arrival / N, where T_arrival = Σ_{k=1}^{N} t_arrival(k). We add the current time to T_arrival and associate it with the request as t_arrival every time a request is enqueued. Also, we subtract the associated t_arrival from T_arrival every time a request is dequeued.
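A minimal sketch of this bookkeeping is shown below, assuming a millisecond clock and hypothetical class and method names (AvgWaitTimeProfiler, onEnqueue, onDequeue); it maintains only N and T_arrival and derives the average wait time on demand, exactly as in the simplified formula above. Keeping the per-operation work to a few arithmetic instructions is the basis of the overhead argument in Section 5.4.

// Sketch of the average wait time profiler for one service queue (hypothetical names).
// It keeps only the request count N and the sum of arrival times T_arrival, so the
// average wait time is T_current - T_arrival / N (Section 5.1).
final class AvgWaitTimeProfiler {
    private long count = 0;          // N: requests currently waiting in the queue
    private long arrivalTimeSum = 0; // T_arrival: sum of their arrival times (ms)

    /** Called when a request is enqueued; returns the arrival time to remember with the request. */
    synchronized long onEnqueue() {
        long tArrival = System.currentTimeMillis();
        arrivalTimeSum += tArrival;
        count++;
        return tArrival;
    }

    /** Called when a request is dequeued, passing the arrival time recorded at enqueue. */
    synchronized void onDequeue(long tArrival) {
        arrivalTimeSum -= tArrival;
        count--;
    }

    /** Average wait time of the requests still in the queue, in milliseconds. */
    synchronized double averageWaitMillis() {
        if (count == 0) return 0.0;
        return System.currentTimeMillis() - (double) arrivalTimeSum / count;
    }
}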
5.2. Overload Monitor

The overload monitor (OM) looks for additional CPU resources for each service by detecting increases in the waiting times. It is invoked at uniform time intervals (around 20 milliseconds on our prototype system) and obtains the average waiting times of the waiting requests for each queue from the average wait time profiler. The OM calculates the change from the previous value of the average wait time to the current value for each service. After the OM calculates these changes for all of the queues, if the change for some queue exceeds the sensitivity thresholds (the err parameters in [38]), it sends these changes to the controller, the thread count adaptor, to make adjustments.
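The following sketch illustrates this monitoring step under the same hypothetical naming. It reuses the AvgWaitTimeProfiler sketched in Section 5.1, samples each queue at a fixed interval, and notifies the adaptor when some change exceeds its per-service sensitivity threshold; the exact scheduling and threshold handling inside WAS may differ.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the overload monitor (hypothetical names, not WAS code).
final class OverloadMonitor {
    interface Adaptor { void adapt(double[] waitTimeChanges); } // the thread count adaptor (Section 5.3)

    private final AvgWaitTimeProfiler[] profilers; // one profiler per service queue
    private final double[] thresholds;             // sensitivity threshold per service (ms)
    private final double[] previousWait;           // previously observed average wait times (ms)
    private final Adaptor adaptor;

    OverloadMonitor(AvgWaitTimeProfiler[] profilers, double[] thresholds, Adaptor adaptor) {
        this.profilers = profilers;
        this.thresholds = thresholds;
        this.previousWait = new double[profilers.length];
        this.adaptor = adaptor;
    }

    /** Starts periodic sampling, e.g. every 20 ms as in our prototype. */
    void start(long intervalMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(this::sample, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    private void sample() {
        double[] change = new double[profilers.length];
        boolean overloadDetected = false;
        for (int i = 0; i < profilers.length; i++) {
            double current = profilers[i].averageWaitMillis();
            change[i] = current - previousWait[i];      // change since the previous invocation
            previousWait[i] = current;
            if (change[i] > thresholds[i]) overloadDetected = true;
        }
        if (overloadDetected) adaptor.adapt(change);    // hand all changes to the thread count adaptor
    }
}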
5.3. Thread Count Adaptor

The thread count adaptor (TCA) adapts the thread counts to the dynamic changes in the CPU resource demands. In practice, it redistributes some of the threads among the thread pools depending on the wait time changes detected by the OM. Table 1 summarizes the parameters used in the following calculations for the redistributions.
Table 1. Definitions of parameters

N_svc — the number of service queues
N_thd — the total number of threads
B — the number of threads that are used for thread redistribution out of the total N_thd threads
ΔW_i — the change in the average wait time for service i
w_i — a weight value for normalizing the absolute values of ΔW_i
Ẇ_i — the weighted wait time change, w_i * ΔW_i
M — the median point of the Ẇ_i
b_i — the number of threads to be added or removed
P_+ / P_− — the sets of services that can obtain or lose threads
N_+ / N_− — the numbers of services in P_+ and P_−, respectively

First, the TCA calculates the median point of the weighted wait time changes, M, as M = (Σ_{i=1}^{N_svc} Ẇ_i) / N_svc. The B redistributable threads can be moved from services whose Ẇ_i values are less than M to services whose Ẇ_i values are larger than M. For thread pools that can obtain threads (P_+), the number of additional threads is proportional to the distance of Ẇ_i from M and is calculated as b_i = B * |Ẇ_i − M| / Σ_{j=1}^{N_+} |Ẇ_j − M|. Similarly, the number of removed threads for P_− can be calculated using N_−. The calculated value of b_i is a real number, and therefore we round it to the nearest integer. Next, using b_i, the TCA adjusts the thread pool sizes. To add threads to thread pools, the TCA activates idle threads by sending instructions to them. To remove threads from thread pools, the TCA sends requests to stop threads in those thread pools. Each thread checks for such a request after processing its task and, if found, that thread accepts the request and becomes idle.
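A sketch of this redistribution arithmetic is given below (hypothetical names, not the WAS implementation): it computes M as the mean point of the weighted changes, splits the services into those above and below M, and sizes each adjustment b_i in proportion to |Ẇ_i − M|, rounding to the nearest integer. With B = 1, as in our prototype (Section 6.1.4), at most one thread is moved per adaptation step.

// Sketch of the thread count adaptor's redistribution step (hypothetical names).
// Given the weighted wait time changes W'_i, it computes M and the adjustment b_i
// for each pool; positive b_i means threads to add, negative means threads to remove.
final class ThreadCountAdaptor {
    static int[] redistribute(double[] weightedChange, int redistributableThreads /* B */) {
        int nSvc = weightedChange.length;
        double m = 0.0;                                // M: mean point of the weighted changes
        for (double w : weightedChange) m += w;
        m /= nSvc;

        double gainDistance = 0.0, loseDistance = 0.0; // normalizers over P+ and P-
        for (double w : weightedChange) {
            if (w > m) gainDistance += w - m;
            else if (w < m) loseDistance += m - w;
        }

        int[] b = new int[nSvc];
        for (int i = 0; i < nSvc; i++) {
            double w = weightedChange[i];
            if (w > m && gainDistance > 0) {
                b[i] = (int) Math.round(redistributableThreads * (w - m) / gainDistance);
            } else if (w < m && loseDistance > 0) {
                b[i] = -(int) Math.round(redistributableThreads * (m - w) / loseDistance);
            }
        }
        return b; // the adaptor then activates idle threads or asks pool workers to stop accordingly
    }
}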
5.4. Discussion of Overhead

The overhead of our approach is very small. Our approach has overhead for (1) calculating the average wait time, (2) calculating the wait time changes and the requirements for thread count adaptation, and (3) the actual control of the thread pool size. The first cost is the cost of maintaining T_arrival and N for each queue at every queue operation. Though this is the most frequently incurred of the three costs, it is negligible, since only a few machine instructions are performed and they are trivial compared to the number of instructions for calculating the dynamic content. The second cost is proportional to the number of thread pools. When we measured the cost on our machine (with POWER5 1.65 GHz CPUs), the cost was under 1 µs for up to 128 thread pools, which are sufficient for the SMP environments that are currently available. The second cost that is incurred every second is calculated as C_2 / I / P, where C_2 is the second cost, I is the time interval between invocations of the OM, and P is the number of hardware threads. Supposing that C_2 is at most 1 µs for our experimental environment (I = 0.02 s and P = 4), the second cost consumes at most only 0.00125% of the total execution time and is negligible. The majority of the third cost is the context-switch overhead. The third cost that is incurred every second is calculated as C_3 * B / I / P, where C_3 is the context-switch overhead between threads. Compared with the context-switch overhead between processes (less than 1 µs [31]), C_3 is smaller. Supposing that C_3 is at most 1 µs in our experimental environment (B = 1, I = 0.02 s, and P = 4), the third cost also consumes at most only 0.00125% of the total execution time and is negligible².

² The third cost will still be low for a large B, since P is also large for such B.

6. Evaluation

We prototyped our approach on our J2EE middleware, the IBM WebSphere Application Server (WAS), version 6.0 [36]. We used the workload of an industry-standard J2EE benchmark, SPECjAppServer2004 (SjAS) [30], for our tests. We ran the tests on the original and modified versions of the J2EE middleware. We first evaluated improvements in the response time by comparing the average response times between the original and modified versions. For the tests, we used the injection rate³ that yielded the maximum throughput on the original version. Then we evaluated how the improved response time contributed to improvement of the overall throughput by increasing the injection rate. The following sections explain the experimental environment and discuss the results in detail.

³ The number of clients is proportional to the injection rate.

6.1. Experimental Environment

In this section, we describe the SjAS benchmark, the machine configuration, the J2EE middleware, and the prototype code used for our tests.
6.1.1. SPECjAppServer2004. The SjAS benchmark is a J2EE application that emulates an automobile manufacturing company and its associated dealerships. SjAS calls for a 3-tier machine configuration: driver machines, J2EE server machines, and database machines. The application has multiple services that manage information about customers, manufacturers, and suppliers and that display the results via a Web interface. The workload is dynamic. The request arrival rates are dynamic because of the varying think times. The various request types also affect the execution times for request processing. The numbers of database accesses also vary randomly. The J2EE code that calculates the dynamic content consumes most of the computational power of the J2EE server machines. The driver requests two types of transactions: one via the ORB and the other via the Web. The 90th-percentile response time (RT) is specified for each type. The response time for each transaction is calculated based on the total of the response times for multiple requests. We ran the unmodified code of SjAS five times and calculated the average 90th-percentile RT. When we evaluated how the improved response times contributed to improvements of the overall throughput, we compared the maximum JOPS (jAppServer Operations Per Second) whose average 90th-percentile RT was satisfied between the original and the modified versions of WAS.

6.1.2. Three-Tier Machine Configuration. We set up a three-tier environment, including a driver machine (3 CPUs with 1,920 MB physical memory), a J2EE server machine (2 CPUs with 2,368 MB), and a database machine (2 CPUs with 6,400 MB), using three partitions created on a pSeries 570 (1.65 GHz POWER5 with 2-way SMT). The AIX 5.3 operating system was installed on each machine.

6.1.3. J2EE Middleware. WAS is a J2EE application server based on a thread-pool-based architecture, as described in Section 3. Services have their own thread pools. For SjAS, the two services, ORB and Web, are the major services that consume most of the CPU time. By experimenting with many combinations of thread pool sizes, we determined initial sizes for these thread pools (5 for ORB and 16 for Web) that are sufficiently large to fully utilize the CPUs and to show the maximum JOPS satisfying the TSF targets. The J2EE applications access databases as they process each request. The J2EE applications running on the worker threads send requests to the database servers and then wait for responses from the database servers. To fully utilize the CPUs of the J2EE servers, more threads than the number of CPUs are used, since threads can be blocked while accessing the database.

6.1.4. Prototype of Our Approach. We implemented thread count adaptation by modifying the code of WAS. The OM is invoked at approximately 20-ms intervals. Our approach monitors the average wait time of requests even though the TSF targets of SjAS are specified not for requests but for the transactions, which consist of multiple requests. We cannot monitor the average wait times of transactions by correlating the requests to transactions, since the correlation is only known by the clients. However, addressing the overloads that are monitored at the request level reduced the wait times at the transaction level, as demonstrated in the next section. We controlled the thread pools for the ORB and Web container services, because these services consume most of the CPU resources. Based on experiments, we determined that the sensitivity thresholds are 100 ms for ORB and 50 ms for the Web container and that one thread is sufficient for thread redistribution. The thread pool size has a minimum value of two to avoid starving requests for the service. Since our server with 4 hardware threads is not a large system, the initial thread pool size for ORB of five threads is rather small. Therefore, only up to three threads can be removed from ORB. However, we sometimes observed situations calling for larger changes in the thread pool sizes by moving threads from ORB even though its pool size had already reached the minimum. To address such situations, we created extra threads in addition to the original pool threads. If the situation calls for removing threads from a pool that has reached its minimum, then we can add up to four additional threads to the other service. If we later move threads away from a service with such additional threads, then we remove those extra threads but do not add them to any other service, since they were only created to compensate for the lower limits on the number of threads that could be moved. Once all of these extra threads have been removed, the total number of pool threads returns to its initial value.
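For reference, the adaptation parameters stated in this subsection can be collected into a small configuration sketch; the constant names are ours for illustration, not WAS settings, while the values are the ones given above.

// Adaptation parameters used by our prototype (Section 6.1.4); names are illustrative.
final class AdaptationConfig {
    static final long   MONITOR_INTERVAL_MS   = 20;  // overload monitor period
    static final int    ORB_INITIAL_POOL_SIZE = 5;   // tuned initial size for the ORB service
    static final int    WEB_INITIAL_POOL_SIZE = 16;  // tuned initial size for the Web container
    static final double ORB_THRESHOLD_MS      = 100; // sensitivity threshold for ORB
    static final double WEB_THRESHOLD_MS      = 50;  // sensitivity threshold for the Web container
    static final int    REDISTRIBUTABLE_B     = 1;   // threads moved per adaptation
    static final int    MIN_POOL_SIZE         = 2;   // lower bound to avoid starving a service
    static final int    MAX_EXTRA_THREADS     = 4;   // extra threads created when a pool is at its minimum
}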
6.2. Experimental Results

We first show how the numbers of threads that can process requests constantly change. As explained in Section 4, one reason that services cannot obtain CPU shares in proportion to their workloads is the blocking of threads. To investigate how threads are blocked during the tests, we tracked the number of threads that are not blocked due to I/O or resource contention (the runnable threads). Figures 3(a) and 3(b) show how the numbers of threads that are not blocked change for the ORB service and the Web container service, respectively. The x-axis shows the execution time in milliseconds and the y-axis shows the number of runnable threads. Though the pool sizes are a constant 5 for ORB in Figure 3(a) and 16 for the Web container in Figure 3(b), the numbers of runnable threads are constantly changing. As a result, the ratio of runnable threads between ORB and the Web container fluctuates a great deal, even though the initial ratio is 5/16 ≈ 0.3, as shown in Figure 4. The runnable threads are scheduled to hardware threads, thus determining the CPU resources allocated to the services.

[Figure 3. Numbers of non-blocked threads over the execution time: (a) ORB service, (b) Web container service.]

[Figure 4. Ratio of non-blocked threads between the ORB and Web container services over the execution time.]

Next we show that our approach improved the average wait times of the queued requests. We first show snapshots of the variations in the average wait times of the queued requests measured for ORB and the Web container on the original WAS in Figures 5(a) and 5(b), respectively. The x-axis shows the execution time in milliseconds and the y-axis shows the average wait time in milliseconds. There are clear variations, especially for the Web container, which communicates with many clients and consumes more CPU resources than ORB.
[Figure 5. Wait time without adaptation: (a) ORB service, (b) Web container service.]

Next we show the results with our approach. Figures 6(a) and 6(b) show snapshots for the same period as Figures 5(a) and 5(b) for the modified WAS with our prototype. As the figures show, our approach significantly reduced the variations in wait times. Table 2 summarizes the fundamental statistics of the wait times corresponding to Figures 5 and 6. Note that SjAS uses these wait times for its TSF targets, as explained in Section 6.1.4.

[Figure 6. Wait time with adaptation: (a) ORB service, (b) Web container service.]

Table 2. Statistics for Figures 5 and 6 (wait times in ms)

                 Service         Average   Std. deviation
Original WAS     ORB             52.2      41.4
                 Web container   131.6     51.4
Our approach     ORB             53.7      39.1
                 Web container   40.1      32.4

For the TSF targets, our approach reduced the average 90th-percentile response time by up to 27%, with an average of 22%. By improving the 90th-percentile response time, WAS with our approach can still satisfy the TSF targets as we increase the workload or the number of clients. Summarizing our results, we were able to support a 3% higher throughput, even though such an additional workload on a busy server would normally degrade the response time exponentially. How we controlled the thread pool sizes is shown in Figure 7. The x-axis shows the execution time and the y-axis shows the thread pool sizes that our prototype used during each interval. Note that more threads than the number specified by our prototype are running. As explained in Section 5.3, threads cannot stop immediately when the pool size is decreased. For example, there is a period near 520,000 ms of execution time where we used extra threads and the thread pool size exceeded the total number of the initial and movable threads.

[Figure 7. Adapting the thread pool sizes: the thread pool sizes chosen by our prototype during each interval over the execution time.]

We still observe spikes exceeding 300 ms in the average wait times of the queued requests in Figures 6(a) and 6(b), even with our approach. Most of these spikes were due to the garbage collection (GC) of Java objects. Since GC stops applications, it increases the wait times of the requests that are queued when GC occurs. For the remaining spikes, the threads were waiting for responses from the database or contending for locks on shared resources.

7. Conclusions

We proposed a dynamic mechanism, thread count adaptation, that can adapt thread pool sizes to respond to overloads that frequently and randomly occur among multiple services. Our approach can detect such transient overloads by monitoring the average wait time of the queued requests for each service and by redistributing the threads among the thread pools depending on the CPU demands of the services. We demonstrated that a prototype of our approach actually responds to the frequently occurring overloads of services and can adjust the thread pool sizes. We confirmed that the quick responses significantly improved the average 90th-percentile response time by up to 27% (22% on average) by using an industry-standard J2EE benchmark, SPECjAppServer2004. With the extra room created underneath the TSF targets, we successfully supported more clients on a server that was optimally tuned to show the maximum throughput and gained 3% more throughput.
References

[1] T. F. Abdelzaher and N. T. Bhatti. Web content adaptation to improve server overload behavior. Computer Networks, 31(11-16):1563–1577, 1999.
[2] T. F. Abdelzaher, K. G. Shin, and N. T. Bhatti. Performance guarantees for Web server end-systems: A control-theoretical approach. IEEE Trans. Parallel Distrib. Syst., 13(1):80–96, 2002.
[3] J. Alonso, J. Guitart, and J. Torres. Differentiated quality of service for e-commerce applications through connection scheduling based on system-level thread priorities. In PDP, pages 72–76. IEEE, 2007.
[4] Apache Tomcat 6.0. The Executor (thread pool), 2006.
[5] J. M. Blanquer, A. Batchelli, K. E. Schauser, and R. Wolski. Quorum: Flexible quality of service for internet services. In NSDI. USENIX, 2005.
[6] A. Chandra, W. Gong, and P. J. Shenoy. Dynamic resource allocation for shared data centers using online measurements. In IWQoS, volume 2707 of LNCS, pages 381–400. Springer, 2003.
[7] M. Crovella, R. Frangioso, and M. Harchol-Balter. Connection scheduling in Web servers. In USITS, 1999.
[8] S. Elnikety, E. M. Nahum, J. M. Tracey, and W. Zwaenepoel. A method for transparent admission control and request scheduling in e-commerce Web sites. In WWW, pages 276–286. ACM, 2004.
[9] S. M. Fontes, C. J. Nordstrom, and K. W. Sutter. WebSphere connector architecture evolution. IBM Syst. J., 43(2):316–326, 2004.
[10] P. Furtado and R. Antunes. Deadline and throughput-aware control for request processing systems. In ISPA, volume 4742 of LNCS, pages 383–394. Springer, 2007.
[11] M. Harchol-Balter, B. Schroeder, N. Bansal, and M. Agrawal. Size-based scheduling to improve Web performance. ACM Trans. Comput. Syst., 21(2):207–233, 2003.
[12] A. Kamra, V. Misra, and E. M. Nahum. Yaksha: a self-tuning controller for managing the performance of 3-tiered Web sites. In IWQoS, pages 47–56. IEEE, 2004.
[13] K. Kourai, H. Hibino, and S. Chiba. Aspect-oriented application-level scheduling for J2EE servers. In AOSD, pages 1–13. ACM, 2007.
[14] Z. Li, D. Levy, S. Chen, and J. Zic. Auto-tune design and evaluation on staged event-driven architecture. In MODDM, pages 1–6. ACM, 2006.
[15] Z. Li, D. Levy, S. Chen, and J. Zic. Explicitly controlling the fair service for busy Web servers. In ASWEC, pages 159–168. IEEE, 2007.
[16] Y. Ling, T. Mullen, and X. Lin. Analysis of optimal thread pool size. ACM SIGOPS Operating Systems Review, 34(2):42–55, 2000.
[17] C. Lu, Y. Lu, T. F. Abdelzaher, J. A. Stankovic, and S. H. Son. Feedback control architecture and design methodology for service delay guarantees in Web servers. IEEE Trans. Parallel Distrib. Syst., 17(9):1014–1027, 2006.
[18] Microsoft® Windows Server® 2003 TechCenter. Web and application server infrastructure - performance and scalability, Apr. 2003.
[19] H. Naccache, G. C. Gannod, and K. A. Gary. A self-healing Web server using differentiated services. In ICSOC, volume 4294 of LNCS, pages 203–214. Springer, 2006.
[20] Oracle® Containers for J2EE Configuration and Administration Guide 10g (10.1.3.1.0). Configuring OC4J thread pools, Oct. 2006.
[21] L. F. Orleans and P. Furtado. Fair load-balancing on parallel systems for QoS. In ICPP, page 22. IEEE, 2007.
[22] P. Pradhan, R. Tewari, S. Sahu, A. Chandra, and P. Shenoy. An observation-based approach towards self-managing Web servers. In IWQoS, pages 13–22. IEEE, 2002.
[23] I. Pyarali, M. Spivak, R. Cytron, and D. C. Schmidt. Evaluating and optimizing thread pool strategies for real-time CORBA. In LCTES, pages 214–222. ACM, 2001.
[24] Rock Web Server User Guide. Worker thread configuration, 2007.
[25] D. C. Schmidt and S. Vinoski. Object interconnection: comparing alternative programming techniques for multi-threaded CORBA servers: thread pool (column 6). SIGS C++ Report Magazine, 8:1–12, 1996.
[26] B. Schroeder and M. Harchol-Balter. Web servers under overload: How scheduling can help. ACM Trans. Internet Techn., 6(1):20–52, 2006.
[27] K. Shen, H. Tang, T. Yang, and L. Chu. Integrated resource management for cluster-based Internet services. In OSDI, 2002.
[28] Sun Java System Application Server 9.1 Administration Guide. Thread pools, July 2007.
[29] Sun Java System Web Server 7.0 Performance Tuning, Sizing, and Scaling Guide. Understanding threads, processes, and connections, 2007.
[30] The Standard Performance Evaluation Corporation (SPEC®). SPECjAppServer®2004, 2004.
[31] D. Tsafrir. The context-switch overhead inflicted by hardware interrupts (and the enigma of do-nothing loops). In Experimental Computer Science. ACM, 2007.
[32] B. Urgaonkar and P. J. Shenoy. Cataclysm: policing extreme overloads in internet applications. In A. Ellis and T. Hagino, editors, WWW, pages 740–749. ACM, 2005.
[33] T. Voigt, R. Tewari, D. Freimuth, and A. Mehra. Kernel mechanisms for service differentiation in overloaded Web servers. In Y. Park, editor, USENIX Annual Technical Conference, General Track, pages 189–202. USENIX, 2001.
[34] W. Wang, W. Zhang, L. Zhang, and T. Huang. WMQ: Towards a fine-grained QoS control for e-business servers. In ICEBE, pages 139–146. IEEE, 2007.
[35] WebLogic Server® Performance and Tuning. Tune pool sizes, Nov. 2006.
[36] WebSphere® Application Server Network Deployment, Version 6.1. Thread pool settings, June 2007.
[37] J. Wei and C.-Z. Xu. eQoS: Provisioning of client-perceived end-to-end QoS guarantees in Web servers. IEEE Trans. Computers, 55(12):1543–1556, 2006.
[38] M. Welsh and D. E. Culler. Adaptive overload control for busy Internet servers. In USITS, 2003.
[39] M. Welsh, D. E. Culler, and E. A. Brewer. SEDA: An architecture for well-conditioned, scalable Internet services. In SOSP, pages 230–243, 2001.
[40] D. Xu and B. Bode. Performance study and dynamic optimization design for thread pool systems. In CCCT, 2004.
[41] J. Zhou and T. Yang. Selective early request termination for busy Internet services. In WWW, pages 605–614. ACM, 2006.
[42] J. Zhou, C. Zhang, T. Yang, and L. Chu. Request-aware scheduling for busy internet services. In INFOCOM. IEEE, 2006.
[43] H. Zhu, H. Tang, and T. Yang. Demand-driven service differentiation in cluster-based network servers. In INFOCOM, pages 679–688, 2001.
