Dissertation 2014

Teknik och samhälle

Datavetenskap

Examensarbete
15 högskolepoäng, grundnivå

Performance impacts of profiling multi-threaded applications with flexible analysis tools

Prestandapåverkan av profilering på flertrådade applikationer vid användning av flexibla analysverktyg

Alexander Hardwicke

Examen: Kandidatexamen 180 hp

Huvudområde: Datavetenskap
Program: Data och Telekom
Datum för slutseminarium: 2014-05-xx

Handledare: Bengt J. Nilsson

Andrabedömare: Banafsheh Hajinasab

Abstract
The aim of this study was to investigate how the use of profilers impacted the performance of multi-threaded applications. The study was undertaken in the framework of a project for Edument AB, creators of one of the profilers investigated. Specifically, the study aimed to find out what additional overhead can result from profiling multi-threaded applications, with regard to CPU use, memory use, and additional time taken to complete execution. The paper hypothesised that the selection of data each profiler recorded would affect performance, with those profilers that recorded a very large amount of detail about the profiled application introducing a higher overhead.
A selection of five profilers was made, each offering a range of features and functionality, and a test application was written to simulate a multi-threaded application. A very lightweight application, referred to as the monitoring application, was also written; it recorded the overhead each profiler used, along with the overhead of the test application and the time taken to run. Essentially, each profiler was itself profiled by the monitoring application, ensuring that consistent overhead data was gathered for each profiler.
The results of the study showed that the choice of profiler can have a substantial impact on the performance of the application being profiled. One profiler resulted in execution of the test application taking 513% as much time to run, and added a memory overhead of 1400%. After analysing this data, there appeared to be a correlation between what features the profilers recorded and the overhead, consistent with the hypothesis put forward at the beginning of the study.

Sammanfattning
Syftet med denna studie var att undersöka hur användningen av profilers påverkade prestandan hos flertrådade applikationer. Studien genomfördes inom ramen för ett projekt med Edument AB som skapat en av de profilers som undersökts i studien. Mer specifikt så syftade studien till att ta reda på vilken ytterligare CPU-tid och RAM som användes och hur mycket längre tid det tog att exekvera en flertrådad applikation vid användningen av profilers. Uppsatsens hypotes var att valet av data som varje profiler registrerade skulle påverka prestandan, och att de som registrerade fler detaljer om applikationen skulle ha störst påverkan.
Fem profilers valdes ut med olika egenskaper och funktioner och en testapplikation skrevs för att simulera en flertrådad applikation. En väldigt minimal applikation skrevs också och användes för att registrera varje profilers påverkan på RAM och CPU, samt hur testapplikationen påverkades av profilern. Alltså, varje profiler har var för sig blivit profilerad för att försäkra att samma data konsekvent samlats in.
Resultaten visade att valet av profiler kan ha stor påverkan på den profilerade applikationens prestanda. Användningen av en av profilerna ledde till att testapplikationen tog 513% så lång tid att exekvera och lade också till 1400% ytterligare RAM-användning.
Efter en analys av insamlade data verkade det finnas ett samband mellan de funktioner som varje profiler erbjöd och påverkan på applikationens prestanda, vilket stämmer överens med uppsatsens hypotes.

Acknowledgements
I should like to thank Bengt J. Nilsson for all of the help and assistance he has given me
when writing this dissertation, as well as Tore Nestenius for the inspiration to perform this
study.
On a more personal note, my thanks go out to my wife for all the support and help, and for not entering labour before I finished writing my dissertation.

Contents

1 Introduction
  1.1 Background
  1.2 Previous Research
  1.3 Definition of the Problem
  1.4 Aim
  1.5 Hypothesis
  1.6 Limitations
  1.7 Competing Interests

2 Method
  2.1 Description
  2.2 Selection of Profilers
  2.3 Analysis
  2.4 Reliability

3 Results

4 Discussion
  4.1 Time Taken
  4.2 CPU Usage
  4.3 Memory Usage
  4.4 Conclusion
  4.5 Limitations

5 Conclusions
  5.1 Future Work

A Appendix: Tables


1 Introduction
1.1 Background
Software is becoming more and more common in society across an ever-growing range of devices, most recently handheld devices like smartphones and tablets. Much of this software is exposed to lay users and is typically expected to be easy to use, powerful, and to always have a responsive user interface.
The ever-growing capabilities and user experiences in software require more and more processing power. Developers have traditionally relied on the fact that the central processing units (CPUs) in these devices have, with great regularity, increased in processing power, for decades following the predictions made by Moore [13], known as Moore's Law. However, this trend is slowing and the clock speed of CPUs is no longer increasing as rapidly, as CPUs are hitting what is described by Hennessy and Patterson [8] as the Power Wall, and performance is improving less and less each year [1].
Manufacturers of CPUs have moved their production model to having multiple cores in each CPU, as described by Asanovic, Bodik, Catanzaro et al. [1], allowing the CPU to execute multiple commands in parallel, rather than concurrent tasks having to run serially as on a single-core CPU [8]. However, traditionally written software that executes its code in serial and does not take advantage of multiple cores does not gain any increase in performance; to take advantage of the multiple cores, developers need to write their applications to run in parallel via the use of multi-threading techniques [8].
Multi-threaded software is simply defined as software in which different parts of the application are executed concurrently instead of all code being run serially. Many computers and devices run multiple pieces of software concurrently, as users expect multiple applications to be running simultaneously, and each application should be able to perform multiple tasks in parallel. If the CPU has multiple cores, these different applications and threads can run their commands in parallel, which means that the commands can be executed simultaneously. This can provide an increase in performance and an improved user experience [8]. According to Asanovic, Bodik, Catanzaro et al. [1], a feasible future could involve what they call manycore devices, with hundreds or even thousands of cores on a CPU. Taking advantage of this will become something future developers will have to do if they wish to program efficiently.
Many developers find that writing multi-threaded applications is much more difficult than writing single-threaded ones, and it introduces a large variety of problems [6, 8]. Barberis, Cuva and Malnati [2], as well as Larson and Palting [7], have documented the development of multi-threaded profiling tools to help teach students about the problems of threading due to this difficulty. This is because as soon as an application becomes multi-threaded, the developer cannot guarantee the order in which code executes, which can cause severe problems in badly written multi-threaded software, such as deadlocks and race conditions [6, 7, 15]. Various techniques have been developed to help write more efficient multi-threaded code and to avoid the pitfalls. However, monitoring what multi-threaded code does, how it performs, and observing the potential improvements to performance can still be difficult and requires specialised tools, referred to in this paper as profilers.
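A race condition of the kind referred to above can be shown with a minimal C# sketch (not taken from the dissertation's test application): two threads increment a shared counter, once without synchronisation and once atomically.

```csharp
using System;
using System.Threading;

class RaceDemo
{
    static int unsafeCount; // incremented without synchronisation
    static int safeCount;   // incremented atomically via Interlocked

    static void Main()
    {
        const int perThread = 1_000_000;

        void Work()
        {
            for (int i = 0; i < perThread; i++)
            {
                unsafeCount++;                        // read-modify-write: updates can be lost
                Interlocked.Increment(ref safeCount); // atomic: never loses an update
            }
        }

        var t1 = new Thread(Work);
        var t2 = new Thread(Work);
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();

        // safeCount is always 2,000,000; unsafeCount is frequently lower,
        // and the exact shortfall differs from run to run (non-determinism).
        Console.WriteLine($"unsafe: {unsafeCount}, safe: {safeCount}");
    }
}
```

Because the lost updates depend on thread scheduling, the unsafe result varies between runs, which is exactly why such bugs are hard to observe without tooling.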
Profilers are tools that can monitor and record various data about software running on a computer. They can profile applications in different ways (described by Diep and Elbaum [5] as Full Profiling, Targeted Profiling and Profiling with Sampling), but what they all have in common is that they record data about the profiled application. This typically means recording numerical data such as how much of the processor or memory the application is using, or recording what method calls the application makes, often visualising them in UML [2, 3]. Larson and Palting [7] describe the problems with many profilers and the lack of support for multi-threading, and note that others have been deprecated and are no longer supported.

1.2 Previous Research


Investigations into alternative methods of profiling that record a narrower set of data have been undertaken. Kazi, Jose, Ben-Hamida et al. [3] describe the creation of the JaViz profiler, which displays graphical representations of the application's execution tree and highlights potential problems such as threading deadlocks. Barberis, Cuva and Malnati [2] document the creation of a similar tool, JThreadSpy, which has a particular focus on multi-threading and on visualising the problems that can occur, such as deadlocks. They also discuss the potential overhead of profiling, and conclude that using an external library in the application to be profiled adds the lowest amount of overhead.
Chen and Tsao [4] investigated profiling embedded systems and discuss in depth the importance of a low overhead when profiling, particularly on embedded devices. As software overhead increases, the power consumption increases, and if the overhead grows too much, embedded devices will not be capable of running the profiling tools. Their conclusion is that, particularly for embedded devices, keeping overhead low is of such importance that they recommend combining hardware and software profiling, so that as little software as possible is needed while still keeping the advantages software brings.
Diep and Elbaum [5] discuss in detail the problems of profiling tools and the cost of overhead. Their presentation of different types of profilers (Full Profiling, Targeted Profiling, Profiling with Sampling and Triggered Profiling) shows that each way of profiling presents a tradeoff in terms of overhead and features. They conclude that using targeted and sampling profiling techniques can reduce the number of probes by an order of magnitude, which greatly increases profiling's viability.
Krintz, Mousa, Nagpurkar et al. [6] come to similar conclusions, finding that sample-based profiling is the most effective, as it achieves the highest accuracy with the least overhead. They discuss how profiling software that uses sampling offers flexibility, portability and great accuracy, but that significant sampling introduces large overheads, which can affect the software's behaviour and reduce the accuracy of the gathered information. They conclude that selective sampling provides accurate data while introducing significantly less overhead. Thain and Wheeler [15] come to similar conclusions.


1.3 Definition of the Problem


As mentioned, previous studies regarding profilers have typically focused on writing a profiler for a specific language or problem [2, 3, 7]. However, these tools are becoming more and more necessary, due to the increase in devices with multiple cores [8] and in software written with multiple threads to take advantage of the hardware. The choice of profiler is a key decision in the development process, as a wrong choice could hinder the developer. There is also little research comparing different profilers and how the profilers themselves can affect the application they are profiling.
The degree to which the profiler affects the software it is profiling can lead to a variety of impacts and potential problems. If the profiler uses too much processing power, the software being profiled may use a differing number of cores or cache more data than it would without the profiler. Especially when dealing with multiple threads, issues like these can change the behaviour of the threads and the order in which they execute, because thread execution is non-deterministic [2], and potentially conceal problems that a less invasive profiler would reveal. As such, awareness of the impact of the profiling software is key to efficient profiling of multi-threaded software.
Being able to investigate and discover performance problems quickly is, as described by Krintz, Mousa, Nagpurkar et al. [6], becoming a requirement for many developers, who even profile software that is actually deployed and not in development [5]. This is especially true with embedded systems and mobile devices [4, 6].
A study investigating this impact should help lead to better profiling and help developers use the tools in a more efficient way, leading to software that is improved with regard to performance and the way it uses threading. As described by Diep and Elbaum [5], evaluating the benefit and applicability of profiling techniques is a significant challenge, and research into this field can help to reveal new techniques and knowledge about profiling.


1.4 Aim
This dissertation investigated the differences between various profilers, with the aim of finding out how large profiler overhead is and whether it affects the profiled software. The overhead was measured by how much the use of the CPU (which, if high, can change the way the application's threads are handled and severely affect results) increased, how much RAM overhead was added, and how much more time it took to execute a test application.

1.5 Hypothesis
The hypothesis of the paper is that the profilers that offer more features and more detailed data results will have the largest impact on the overhead and running time of the profiled software.
This hypothesis will be tested by attempting to answer how much impact, in terms of CPU time, memory overhead and additional time taken, the use of profiling software introduces.

1.6 Limitations
The research undertaken in this dissertation is aimed at the use of profiling applications. Other profiling methods, such as probing the CPU directly and monitoring what happens at a hardware level, are not in the scope of the dissertation and will not be examined.


1.7 Competing Interests


This study was performed within the framework of a project for Edument AB, creators of one of the profilers investigated in this study.


2 Method
2.1 Description
This dissertation involved an investigation of the impact of various profiling tools in terms of the additional memory and processor overhead their use resulted in.
One possible way to investigate the impact of profilers is to analyse the source code of the software itself. As described by McQuaid [9], the more interesting and complex parts of source code typically give more qualitative results and do not have precise meanings that can be described mathematically. This would make direct comparisons between profilers more complex and introduce bias, particularly if the profilers are written in different programming languages. It would also be difficult because many profilers are professional software whose source code is not available to be investigated.
Another way to investigate, independent of each profiler's programming language, would be to perform a statistical survey amongst programmers who have used a selection of profilers. Surveys are, as stated by Oates [14], common and useful for empirical research. Pertinent information, such as the overhead each profiler adds, could be gathered via surveys and the data investigated. However, there are a large number of factors that a survey would struggle to take into account, such as user bias, or users' experiences with profilers involving different applications being profiled or different hardware being used for different profilers.
The very nature of this sort of investigation leads one to gather empirical evidence, by experimenting on and observing each profiler being tested, which can then be used to compare the profilers in a clear, unbiased and verifiable way [14]. Once this became clear and an experiment had been chosen as the method, a selection of profilers was made, with a range of capabilities and from a range of manufacturers. The choice of profilers is further discussed in the following section of this paper.
A test application, written in C# and using .NET, was written as an example of a multi-threaded application; it will hereafter be referred to as the test application.
The test application creates and starts a certain number of thread groups serially, starting each group when the previous group finishes. Each thread group contains three threads that execute simultaneously. Each thread group reads data (a set of 1,000,000 integers) from a data file on disk, processes the data and sorts it into an ascending list. Each thread then executes some high-intensity CPU work by running a crunch method, which performs 10,000 loops of several high-complexity mathematical operations. The final result of each mathematical operation is synchronised with the rest of the thread group.
Once all of the thread groups complete, the synchronised data is output by the main application thread, which then repeats the process with a different number of thread groups. The number of thread groups is written into the application and averages around ten thread groups (resulting in 30 threads) running simultaneously, with peaks as high as 32 thread groups and as low as 3 thread groups. Once all of these thread groups have completed, the test application exits.
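The structure described above can be sketched as follows. This is a simplified illustration, not the dissertation's actual source: the data set is much smaller than 1,000,000 integers and is generated in memory rather than read from disk, and the arithmetic in the crunch stand-in is invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

class TestAppSketch
{
    static readonly object Sync = new object();
    static readonly List<double> Results = new List<double>();

    // One thread group: three threads that each sort the data, run the
    // CPU-heavy crunch work, and synchronise their result with the group.
    static List<Thread> StartThreadGroup(int[] data)
    {
        var threads = new List<Thread>();
        for (int i = 0; i < 3; i++)
        {
            var t = new Thread(() =>
            {
                int[] sorted = data.OrderBy(x => x).ToArray(); // ascending sort
                double result = Crunch(sorted[0] % 100);
                lock (Sync) { Results.Add(result); }           // group synchronisation
            });
            t.Start();
            threads.Add(t);
        }
        return threads;
    }

    // Stand-in for the crunch method: 10,000 loops of mathematical operations.
    static double Crunch(double seed)
    {
        double v = seed + 1;
        for (int i = 0; i < 10_000; i++)
            v = Math.Sqrt(Math.Abs(Math.Sin(v) * Math.Cos(v + 1)) + 1);
        return v;
    }

    static void Main()
    {
        var rng = new Random(42);
        int[] data = Enumerable.Range(0, 10_000).Select(_ => rng.Next()).ToArray();

        // Batches of thread groups run serially; the groups within a batch
        // (and their threads) run simultaneously.
        foreach (int groups in new[] { 3, 10, 32 })
        {
            var all = new List<Thread>();
            for (int g = 0; g < groups; g++)
                all.AddRange(StartThreadGroup(data));
            all.ForEach(t => t.Join()); // wait before starting the next batch
            Console.WriteLine($"Batch of {groups} groups done; results so far: {Results.Count}");
        }
    }
}
```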
A second application, referred to as the monitoring application, was written to monitor both the test application and the profiler being used at the time. This application, when launched, ceases further execution until both the current profiler and the test application are open.
When both the test application and profiler have opened, the monitoring application records the memory usage and CPU usage of the profiler and the test application, separately. It also records the exact time the test application starts and finishes, and calculates the average, minimum and maximum values for both CPU usage and memory usage, separately for the profiler and test application.
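A sampling loop of the kind the monitoring application would need can be sketched with System.Diagnostics.Process. To keep the example self-contained it samples its own process; the real monitor would instead locate the profiler and test-application processes (e.g. via Process.GetProcessesByName) and record each one separately, along with start and finish times.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class MonitorSketch
{
    static void Main()
    {
        Process proc = Process.GetCurrentProcess();
        TimeSpan lastCpu = proc.TotalProcessorTime;
        long minMem = long.MaxValue, maxMem = 0;
        double memSum = 0;
        int samples = 0;

        for (int i = 0; i < 3; i++) // the real monitor loops until the process exits
        {
            Thread.Sleep(1000);
            proc.Refresh(); // discard cached process information

            long mem = proc.WorkingSet64;
            minMem = Math.Min(minMem, mem);
            maxMem = Math.Max(maxMem, mem);
            memSum += mem;
            samples++;

            // CPU time consumed in the last second, as a percentage of one core.
            TimeSpan cpu = proc.TotalProcessorTime;
            double corePercent = (cpu - lastCpu).TotalMilliseconds / 1000.0 * 100.0;
            lastCpu = cpu;

            Console.WriteLine($"mem={mem / (1024 * 1024)} MB, cpu={corePercent:F1}% of one core");
        }

        Console.WriteLine($"memory min/avg/max = {minMem}/{memSum / samples:F0}/{maxMem} bytes");
    }
}
```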
A final set of tests was run, using the monitoring and test applications but with no profiler running. This provided a set of baseline data for the test application with which to compare the results of each profiler test.
Two of the profilers (Profiler 1 and Profiler 3) only record data when the user requests that the profiler do so. In the case of Profiler 1, a snapshot of memory is taken and stored, and in the case of Profiler 3, the data sampling is programmed in the application's code and is gathered by the profiling client on user request. These profilers are what Diep and Elbaum [5] refer to as Profiling with Sampling. The other three profilers fit the pattern of Full Profiling.
In the case of Profiler 3, unlike all of the other profilers, the application being profiled is expected to expose the data to the profiler, rather than the profiler using .NET to get the information it requires. This means that, when being tested with Profiler 3, the test application included the library that Profiler 3 provides. It also had a few extra lines of code used to tell the library to record the application's CPU and memory use, and to record when each method in the test application started and finished execution. This data is exposed to Profiler 3 by a small HTTP web service that Profiler 3's library automatically starts when it is used.
For both of these profilers, a snapshot was taken when every fourth thread group completed.
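The instrumentation described for Profiler 3 might look roughly like the following. The ProfilingLibrary type and its method names are entirely hypothetical stand-ins invented for illustration; CodeProbe's actual API is not reproduced in this dissertation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the profiler's library; the real API may differ entirely.
class ProfilingLibrary
{
    private readonly List<string> events = new List<string>();

    public void RecordCpuAndMemory() => events.Add("cpu+memory sampling enabled");
    public void MethodStarted(string name) => events.Add($"start {name}");
    public void MethodFinished(string name) => events.Add($"finish {name}");

    // In the real tool, this data is exposed via a small HTTP web service that
    // the library starts automatically; here we simply print it.
    public void Dump() => events.ForEach(Console.WriteLine);
}

class InstrumentedApp
{
    static void Main()
    {
        var probe = new ProfilingLibrary();
        probe.RecordCpuAndMemory();

        probe.MethodStarted("ThreadGroup.Run");
        // ... the test application's actual work would happen here ...
        probe.MethodFinished("ThreadGroup.Run");

        probe.Dump();
    }
}
```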

2.2 Selection of Profilers


Choosing a good selection of profilers was a vital part of this experiment. As the hypothesis of the paper supposes that the capabilities of the profiler affect the overhead, the selection needed to contain several different profilers, each with different capabilities and methods of working.
As this paper focused on the impact of the choice of profiler, as many other variables as possible needed to be kept constant. As such, the choice was made to have only one test application written in one language. Multiple test applications could have been written, but the differences this would have introduced (e.g. the different implementations of threading, the difference between compiled and interpreted languages, and whether interpreted languages use features like JIT) would have introduced an unacceptable amount of variation that would have distorted the results and made them much less accurate. Therefore, to ensure that the only varying source of data was the profilers, each profiler had to work with the same test application, which means the profilers all needed to work with one programming language or framework. As such, a popular language or framework needed to be chosen; languages that are used less frequently amongst developers are less likely to have a selection of profilers to choose from.


The best candidate was C#: it provides modern, easy-to-use threading APIs that permit an efficient, quick testing application to be written. C# is one of several languages that support .NET and all use the Common Language Infrastructure (CLI), and there is a large range of profilers that support any CLI application, as described by Microsoft [10]. This broad selection of profilers and the capability to write a good testing app meant that the final choice was C#. The selection of profilers contains commercial products as well as free alternatives, to ensure a broad spectrum of data sources.
The profilers that were chosen, along with the numbers that will be used to refer to them throughout this paper, were:
1. ANTS Memory Profiler
2. ANTS Performance Profiler
3. Edument CodeProbe
4. Microsoft Concurrency Visualiser Tools as a part of Visual Studio
5. YourKit .NET Profiler
These five profilers provided a good sample of the different features that profilers can offer. Profilers 1 and 2 are profilers from the same source, but each with a different focus: one focuses on the memory usage of the profiled app, looking to find memory leaks and how efficiently the memory is used, while the other looks at processor performance, looking for inefficient code or threading problems. Both share a common code base and functionality, and as such comparing both profilers can show the difference between two very similar profilers based purely on what type of data they gather.
This selection of profilers covers both Full Profiling and Profiling with Sampling as described by Diep and Elbaum [5]. Profilers 1 and 3 use sampling techniques, while Profilers 2, 4 and 5 use full profiling and record all data at all times.
The following table provides the pertinent details about each profiler:

Profiler 1
  Data gathered: number and size of objects in memory; number and size of used assemblies; difference in memory usage between snapshots.
  Real-time visualisation/recording: No. Only records and displays when the user elects to take a snapshot.

Profiler 2
  Data gathered: call tree; file I/O; processor use over time.
  Real-time visualisation/recording: Yes.

Profiler 3
  Data gathered: when programmer-defined events begin and end; memory use over time; processor use over time.
  Real-time visualisation/recording: Recording is in real-time, but visualisation happens only when the user has elected to connect and receive data.

Profiler 4
  Data gathered: number of cores used over time; number of threads used over time; what each thread is doing at any time; processor use over time. The application is also a full programming IDE, with full debugging functionality.
  Real-time visualisation/recording: No. Profiling gathers data until either the profiled application ends or the profiler stops, after which the data is processed and rendered.

Profiler 5
  Data gathered: call tree; method list; number of threads over time; processor use over time; memory use over time; garbage collection over time.
  Real-time visualisation/recording: Yes.

Table 1: Description of Selected Profilers

2.3 Analysis
The nature of the experiments performed resulted in data on a ratio scale, which permitted direct comparisons to be made between the different profilers.
The baseline tests performed on the test application showed the memory and processor use that the application, the threads and the .NET Framework required. Each profiler was then compared to the baseline data, giving the amount of overhead each profiler added. The data was collated and presented in various graphs. As described by Oates [14], graphs and charts enable the researcher and the reader to quickly see patterns amongst the data and to notice extreme values, which was ideal. The data was then analysed mathematically, calculating percentage overheads and total CPU usage by combining the per-second CPU values with the time the test took to run.
This data was then further analysed to find out what the ultimate overhead costs of each profiler were, and the results were compared to the paper's hypothesis.


2.4 Reliability
Each test of each profiler was performed three times and the data sets were averaged, to diminish the effect of minor inconsistencies or outliers. Any result set with a large inconsistency in its data was discarded and the test performed again. This helps ensure that the data is reliable and consistent.
All of the profilers run on the .NET Framework. To reduce unpredictable behaviour, the .NET optimization features were all disabled via the .NET Framework Debugging Context options, as described by Microsoft [12]. This disables features such as caching and other optimizations, and guarantees that each time the tests were run, they ran in the same way and that no data could be cached and reused by other profilers.
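One way to establish the debugging context Microsoft [12] describes is an .ini file placed next to the executable; assuming the test application were built as TestApp.exe, a file named TestApp.ini would contain:

```ini
[.NET Framework Debugging Control]
GenerateTrackingInfo=1
AllowOptimize=0
```

With these settings the runtime generates tracking information and disables JIT optimizations for that executable, so each run executes the same unoptimized code.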
Rather than running the test of each profiler three times consecutively, each profiler was tested in turn, three times, in a unique order each time, with each profiler being closed after each test. This ensures that the profilers themselves cannot cache data or otherwise affect the results.

Batch One                Batch Two                Batch Three

Profiler 5               Profiler 1               Profiler 4
Profiler 4               Profiler 3               Baseline (No Profiler)
Profiler 2               Baseline (No Profiler)   Profiler 2
Profiler 1               Profiler 4               Profiler 5
Baseline (No Profiler)   Profiler 2               Profiler 3
Profiler 3               Profiler 5               Profiler 1

Table 2: Order of Test Groups


3 Results
For each profiler, the following data was gathered:

- Time taken to run the test application (seconds)
- RAM use of the test application
- CPU use of the test application
- RAM use of the profiler
- CPU use of the profiler

Throughout the results section, CPU usage is displayed as the percentage used per core. The computer the tests were run on has a dual-core Intel processor with Hyper-Threading, which means that the operating system reports that the computer has four CPU cores. As such, the maximum percentage that could be reported is 400%, if all four cores were fully utilized.
Each chart contains baseline data. This is the recorded data for the test application without any profiler running.
All of the data in the charts has been averaged over the three times each test was performed. All of the data (except for the total time taken to run) was then averaged further: each group of fifteen data points was averaged into one data point. This was done because some of the profilers resulted in over 1000 data points, which was far too high a degree of detail for the charts.

[Figure 1: Time Taken to Complete Tests for Each Profiler. Y-axis: time in seconds (0-600); x-axis: Baseline and Profilers 1-5.]

[Figure 2: Average CPU Usage by Test and Profiler. Y-axis: percentage of one core used (0-400); x-axis: Baseline and Profilers 1-5; series: Average Used CPU: Test App and Average Used CPU: Profiler.]

The darker colour shows the CPU usage of the test application, and the lighter colour shows the additional CPU used by the profiler itself.

[Figure 3: Average Memory (RAM) Usage by Test and Profiler. Y-axis: RAM in MB (0-600); x-axis: Baseline and Profilers 1-5; series: Average Used RAM: Test App and Average Used RAM: Profiler.]

The darker colour shows the RAM usage of the test application, and the lighter colour shows the additional RAM used by the profiler itself.

[Figure 4: CPU Usage over Time. Y-axis: percentage of one core used (100-400); x-axis: time in seconds; series: Baseline and Profilers 1-5.]

This chart shows the CPU usage of each profiler over time.

[Figure 5: Memory (RAM) Usage over Time. Y-axis: RAM in MB (0-650); x-axis: time in seconds (0-510); series: Baseline and Profilers 1-5.]

This chart shows the RAM usage of each profiler over time.

4 Discussion
In this section, the results of the experiment will be analysed to find the answers to the posed questions. The section is divided into several sub-sections, separating the time, CPU and memory data for a more detailed analysis, and combining the data points at the end for a conclusion.

4.1 Time Taken


Figure 1 shows that the choice of profiler can have a large impact on the time it takes the test application to run. The following table shows the profilers ordered by time taken.

Profiler     Time Taken (min:sec)   Relative to Baseline

Baseline     1:42.1                 100%
Profiler 3   1:43.4                 101.2%
Profiler 4   1:50.1                 107.8%
Profiler 1   2:02.5                 120%
Profiler 5   6:17.2                 369.4%
Profiler 2   8:44.6                 513.7%

Table 3: Time Taken, Ordered


From this data, we can see that the additional time taken ranges from 1.2% longer (Profiler 3) to 413.7% longer (Profiler 2). This suggests that there is a correlation between the number of features offered by the profiler and the time taken, as Profilers 2 and 5, which are the profilers with the widest range of features (such as real-time visualisation and code stack recording), are those that have by far the largest impact.

4.2 CPU Usage


Figure 2 shows that the average CPU core usage is, for the most part, not affected by the
choice of profiler. In the following table, the relative percentage has been calculated and
the profilers put in ascending order.
Again, Profilers 2 and 5 are those that deviate most from the baseline. Profiler 5 uses
the most CPU per second and takes longer to execute the test application than any other
profiler, which means that its total CPU usage is very high. Interestingly, Profiler 2 uses
the lowest amount of CPU time per second, actually using significantly less CPU per second
than the test application does on its own. This implies that a non-CPU-intensive task is
occurring while the test application executes, which is delaying how quickly the test
application can execute further code.
All of the other profilers use within 2% as much CPU as the test application on its
own, with Profilers 1 and 3 using 0.2% and 0.4% more CPU per second respectively. The
figure also shows that the CPU use of the profiler processes themselves, even for Profilers
2 and 5, is minimal; it is the amount of CPU that the test application itself uses that they
affect.
Profiler     Average CPU Core Usage /s   Relative to Baseline

Profiler 2   169.8%                      70.5%
Baseline     240.8%                      100.0%
Profiler 1   241.3%                      100.2%
Profiler 3   241.6%                      100.4%
Profiler 4   244.9%                      101.7%
Profiler 5   344.5%                      143.1%

Table 4: CPU Core Usage, Ordered


Figure 4 shows that Profilers 2 and 5, which have stood out before, are clearly those
that have the most impact. The other profilers have very similar CPU use and time scales
to the test application itself.
Profiler 2 uses roughly 70% of the CPU per second that the test application itself uses,
but takes over 500% as long to complete the test run as without a profiler. Profiler 5
uses roughly 140% of the CPU per second that the test application uses and takes roughly
370% as long to complete.
If we multiply the average CPU core usage per second of each profiler by the time each
profiler takes to run, the results let us compare how much CPU each profiler used in total
over the entire run. This data is presented in the following table.

Profiler     Average CPU Use   Time Taken (s)   Additional CPU Time

Baseline     240.8%            102.1            0.0%
Profiler 1   241.3%            122.5            20.2%
Profiler 2   169.8%            524.6            262.3%
Profiler 3   241.6%            103.4            1.6%
Profiler 4   244.9%            110.1            9.7%
Profiler 5   344.5%            377.2            428.5%

Table 5: Additional CPU Usage Over Time
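The total-CPU comparison just described is a simple product of the two earlier measurements. The following sketch reproduces the arithmetic behind Table 5 (the figures are the ones reported in this section; the helper function is illustrative and not part of any profiler's tooling):

```python
# Reproducing the "Additional CPU Time" column of Table 5 from the
# average CPU core usage per second (Table 9) and run time (Table 8).
measurements = {
    # name: (average CPU core usage /s in %, run time in seconds)
    "Baseline":   (240.8, 102.1),
    "Profiler 1": (241.3, 122.5),
    "Profiler 2": (169.8, 524.6),
    "Profiler 3": (241.6, 103.4),
    "Profiler 4": (244.9, 110.1),
    "Profiler 5": (344.5, 377.2),
}

def additional_cpu_time(name):
    """Total CPU over the run, as extra % relative to the baseline run."""
    base_cpu, base_time = measurements["Baseline"]
    cpu, seconds = measurements[name]
    return ((cpu * seconds) / (base_cpu * base_time) - 1.0) * 100.0

for name in measurements:
    print(f"{name}: {additional_cpu_time(name):+.1f}%")
```

Profiler 2 illustrates why the per-second figure alone is misleading: its 169.8% average is the lowest of all, yet sustaining it over 524.6 seconds amounts to roughly 262% more total CPU time than the baseline run.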


Profilers 2 and 5 are again by far the largest values, and here we can see that although
Profiler 1 is the closest to the test application in CPU use, using only 0.2% more per
second, the additional time it takes means that Profiler 3 has the least CPU impact of all
of the profilers. This also shows that despite Profiler 2 using the least CPU time per
second, the additional CPU time it consumes over the course of the test execution is
substantial.

4.3 Memory Usage


Figure 3 shows that the memory overhead is dramatically affected by the choice of profiler.
This data has been collated, the overhead calculated, and the result is shown in the table below.
Profiler     Average Memory Usage (MB)   Relative to Baseline

Baseline     34.8                        100.0%
Profiler 5   68.7                        197.6%
Profiler 3   92.7                        266.8%
Profiler 1   289.4                       832.6%
Profiler 4   406.7                       1170.2%
Profiler 2   487.4                       1402.4%

Table 6: Memory Usage, Ordered


Yet again, Profiler 2's usage is significant, using 1402.4% as much RAM as the test
application. The profiler that uses the least additional RAM is Profiler 5, using 97.6%
more. Profiler 3 again has a low overhead, second only to Profiler 5.
None of the profilers except Profiler 2 affect the RAM use of the test application itself
to any large degree, adding between 0% and 60.9% to it; Profiler 2, by contrast, increases
it by 560.9%. The RAM overhead mostly comes from the profiler itself. Profiler 5 has a
very low overhead here, taking only 12.8 MB more RAM. Profilers 1, 2 and 4 have very large
overheads, at 245.5 MB, 257.5 MB and 372.0 MB of additional RAM on top of what the test
application already uses with each profiler attached.
That Profiler 2 has such an excessive memory overhead suggests that it is using large
amounts of memory to profile and record what the test application is doing. This recording
is likely what is delaying the execution of the test application so substantially, which in
turn causes its CPU usage per second to be so low.
Figure 5 shows that although Profiler 5 doesn't add much memory overhead, the additional
time it takes to run means that much more RAM is consumed by the profiler and test
application together than would otherwise be the case. It also shows that the memory usage
of all of the profilers over time, except for Profiler 2, is fairly level and constant.
Profiler 2's memory use appears to grow steadily, dropping substantially halfway through
the execution before beginning to grow again.
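Figure 5's point about Profiler 5 can be quantified with an illustrative megabyte-seconds metric (this metric is not used elsewhere in the dissertation): multiplying the average memory use from Table 6 by the run time from Table 3 approximates the total RAM held over the whole run, mirroring the total-CPU comparison made for Table 5. A sketch:

```python
# Illustrative MB*s metric: average memory held (Table 6) times run time
# in seconds (Table 3), approximating total RAM consumption over a run.
runs = {
    # name: (average memory usage in MB, run time in seconds)
    "Baseline":   (34.8, 102.1),
    "Profiler 5": (68.7, 377.2),
}

def mb_seconds(name):
    mem, seconds = runs[name]
    return mem * seconds

ratio = mb_seconds("Profiler 5") / mb_seconds("Baseline")
print(f"Profiler 5 consumes {ratio:.1f}x the baseline's RAM over time")
```

So although Profiler 5 holds less than twice the baseline's RAM at any instant, its much longer run means it occupies over seven times as much memory measured over time.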
Microsoft [11] discuss how memory consumption can affect the performance of applications.
As more RAM is used, less data can be kept in the CPU's cache; data is instead read from
RAM, or even from a virtual memory page file on the computer's internal storage. Each of
these stages adds delays to reading memory, particularly internal storage, which is 10,000
times slower than RAM. This could explain why Profilers 2 and 5 take a significant amount
of extra time to execute the test. This also agrees with the previous research by C. Krintz,
H. Mousa, P. Nagpurkar et al. [6], who further discuss the problems of additional memory
overhead.
Lastly, it also appears that, again, Profiler 3 has the least impact on the execution of
the test application and is the closest in memory usage to the baseline data. Overall, the
large overheads are caused by Profilers 2 and 5, and Profiler 3 seems to have the least
impact.

4.4 Conclusion
The following table lists the profilers, first ordered by their CPU impact (additional CPU
time consumed) and then by their memory impact (average additional RAM used).

Profiler     CPU Impact     Profiler     Memory Impact

Profiler 3   1.6%           Profiler 5   97.6%
Profiler 4   9.7%           Profiler 3   166.8%
Profiler 1   20.2%          Profiler 1   732.6%
Profiler 2   262.3%         Profiler 4   1070.2%
Profiler 5   428.5%         Profiler 2   1302.4%

Table 7: Impact of Profilers


From all of the gathered data, we can see that Profiler 3 has the lowest CPU impact
once time is taken into account (and the impact is almost negligible) and has the second
lowest memory overhead. As such, it appears to be the most lightweight profiler. This
agrees with the original hypothesis, as Profiler 3 provides the fewest features, and the
profiling is done by the application itself rather than by an external profiler: the external
client simply retrieves the data from the application. This agrees with the research by
Barberis, Cuva and Malnati [2], who find that using an external library introduces the
least overhead.
However, Profiler 3 requires substantial work on the part of the developer, as an external
library has to be used and the developer must state in the application's code exactly what
data should be gathered and where. In a larger application, particularly one that is already
established with a lot of source code, Profiler 3's low overhead may not be worth the
additional developer hours it would take to implement the library.
At the other end, Profiler 2 uses the second highest amount of CPU during execution,
and has by far the largest memory overhead. As such, it appears to be the profiler with
the largest impact.
Both Profiler 2 and Profiler 5 substantially increase the time it takes to execute the
application being profiled. This can have drastic effects on the way the application runs,
particularly with multi-threading: in a non-profiled context, a thread could complete
execution before an I/O operation completes, but the large execution delays these profilers
introduce could cause the thread to complete afterwards. This could conceal bugs in the
application or give misleading performance data, which defeats the purpose of using a
profiler.
However, one limitation of this experiment is that the test application was run with a
rather large number of threads, substantially more than one might expect in small consumer
applications. This high number of threads may have affected the performance of Profilers
2 and 5, which record extensive threading data. For applications that frequently run many
threads (such as web applications), however, this data is extremely relevant.
Additionally, the large overhead added by Profilers 2 and 5 would, particularly for
remote profiling and embedded devices, likely make them unusable, as both Chen and
Tsao [4] and Krintz, Mousa, Nagpurkar et al. [6] found in their previous research.
As Profiler 1 and Profiler 2 share a common core code base, but provide different
features and record in different ways, the difference between the two profilers is
interesting. The substantial differences show that Profiler 1, which profiles via sampling,
has a substantially smaller overhead than Profiler 2, which uses full profiling. That
Profiler 3 has the lowest overhead of all of the tested profilers also reinforces this
conclusion, as it too is a sampling profiler. This agrees with the research by Diep and
Elbaum [5], Krintz, Mousa, Nagpurkar et al. [6] and Thain and Wheeler [15], who all found
that profiling through sampling could substantially decrease the overhead of profiling.
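The mechanism behind sampling's lower overhead can be illustrated with a toy sampling profiler (a generic sketch, not how Profilers 1 or 3 actually work, and written in Python rather than .NET): a background thread periodically snapshots the current stack frame of every other thread and tallies what it sees, while the work between samples runs completely undisturbed. A full profiler, by contrast, must intercept every call.

```python
# Toy sampling profiler: a background thread wakes at a fixed interval,
# snapshots the top stack frame of every other thread in the process,
# and tallies the function names it sees. Work between samples runs at
# full speed, which is why sampling is far cheaper than instrumenting
# every call.
import collections
import sys
import threading
import time

def profile(target, interval=0.005):
    counts = collections.Counter()
    stop = threading.Event()

    def sampler():
        me = threading.get_ident()
        while not stop.is_set():
            # Snapshot the current frame of every thread except our own.
            for ident, frame in sys._current_frames().items():
                if ident != me:
                    counts[frame.f_code.co_name] += 1
            time.sleep(interval)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        target()
    finally:
        stop.set()
        thread.join()
    return counts

def busy_work():
    total = 0
    for i in range(5_000_000):
        total += i * i
    return total

samples = profile(busy_work)
# busy_work should account for nearly all of the recorded samples.
print(samples.most_common(3))
```

The sample counts only approximate where time is spent, which is the trade-off the cited research describes: lower overhead in exchange for statistical rather than exact data.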
Profiler 4 has the second lowest impact on CPU, but the second highest impact
on memory use. A large portion of the memory overhead is likely due to the fact that
Profiler 4 is a fully functional IDE and provides a mass of features far beyond any of the
other profilers. Being built in to the IDE also makes it much easier to use the profiled
data, as the application can be written and profiled in the same tool, sharing performance
data between different areas of the tool. Profiler 4 had no memory impact on the test
application at all, with all of the memory overhead coming from the profiler itself. As
such, for development purposes it appears to be the best low-overhead choice alongside
Profiler 3, as a static memory overhead coming from the development tools used to write
the application can be taken into account and is not likely to affect the application being
profiled. In particular, it doesn't have the disadvantages of Profiler 3 and can be used at
any point in development without any additional work by the developer.


4.5 Limitations
This study only investigated one test application. The application used a very small
amount of memory, a large number of short-lived threads, and a high amount of CPU. The
high number of short-lived threads and the CPU usage may have been problematic for the
profilers that track thread life and the call stack, as many threads running at the same
time produce a very complicated and deep set of call stacks. As such, the analysed data
may not be representative of common real-world applications.


5 Conclusions
This dissertation aimed to investigate how much profilers impact the performance of the
software they are profiling and how much overhead they add, with a focus on multi-threaded
applications. This was motivated by the increasing popularity of multi-core devices and
the expectation that software be more responsive, which requires threading.
The hypothesis was that the profilers that offer the largest amount of data about the
software they profile would add the largest overhead. To test this, a test application was
written that used varying amounts of CPU and threads to provide a range of data for
the profilers to gather. A selection of profilers, both commercial and free, was chosen to
profile this test application. These profilers had varied degrees of functionality, ranging
from minimal profiling to extensive tracking. The way they recorded data also differed,
some fully profiling all data, some taking samples [5]. The test application was then run
while profiled by each profiler.
The tests showed that the profilers do add overhead, the degree appearing to depend
upon the features they offer. All of the profilers added a substantial memory overhead,
with the profiler with the lowest RAM usage still doubling what is required. The CPU
usage per second was not dramatically affected by the choice of profiler, but the time the
test application took to complete was substantially increased for certain profilers,
resulting in a much larger amount of CPU usage over the duration of the test application's
execution. Two of the profilers use the same code base but record different data in
different ways: the one that recorded more data was a full profiler, while the other was a
sampling profiler. The full profiler's footprint was substantially larger than the sampling
profiler's in every metric, which is what the hypothesis predicted.

5.1 Future Work


The data presented in this dissertation reveals that certain profilers, and the techniques
they use, can have a substantial impact on the applications they are profiling. However,
more questions have been raised than answered, and further research is needed.
The test application can be expanded and improved so that it provides a better
representation of a real-world multi-threaded application. There could be more file I/O,
memory management and simulated reactions to incoming data. Interactions with databases and
other systems would also add realism to the test application. A larger test application
with a more realistic memory footprint could show whether the profilers' memory overhead
is static and independent of the size of the test application, in which case it would
become insignificant for larger applications, at least with some of the profilers.
Additionally, the test application could be run multiple times using a varying number
of threads, to see to what degree each additional thread impacts each profiler. This could
reveal which profiling features cause the largest increase in overhead, which could assist
with profiling massively multi-threaded software.
Alternatively, one or several test applications could be written in different programming
languages, particularly languages of a different paradigm, such as an application written
in a functional programming language. This could use the same profilers and use F#, a
functional programming language that uses the .NET Framework, or could use an entirely
different set of profilers. Using a different language or framework would reveal whether
the choice of .NET had a particular impact on the profiler performance and overhead.
Lastly, a wider range of profilers could be tested. Open source profilers would not
only provide extra information, but also permit a code review to examine which specific
parts of each profiler cause the overhead. Additionally, comparing multiple profilers
that provide very similar or identical features would reveal to what degree the choice of
profiler matters, as it is possible that the profilers with the highest overhead were simply
not as well written as those with less.


References
[1] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A.
Patterson, W. L. Plishker, J. Shalf, S. W. Williams, K. A. Yelick. The Landscape of
Parallel Computing Research: A View from Berkeley. EECS Department, University of
California, Berkeley (2006).

[2] C. Barberis, C. M. Cuva, G. Malnati. JThreadSpy: Teaching Multithreading Programming
by Analyzing Execution Traces. PADTAD '07: Proceedings of the 2007 ACM workshop on
Parallel and distributed systems: testing and debugging (2007) 3-13.

[3] B. Ben-Hamida, C. J. Hescott, D. P. Jose, J. A. Konstan, I. H. Kazi, C. Kwok, D. J.
Lilja, P. C. Yew. JaViz: A client/server Java profiling tool. IBM Systems Journal,
Vol. 39, No. 1 (2000) 96-117.

[4] J. J. Chen, S. Tsao. SEProf: A high-level software energy profiling tool for an
embedded processor enabling power management functions. Journal of Systems and Software,
Vol. 85, No. 8 (2012) 1757-1769.

[5] M. Diep, S. Elbaum. Profiling Deployed Software: Assessing Strategies and Testing
Opportunities. IEEE Transactions on Software Engineering, Vol. 31, No. 4 (2005) 312-327.

[6] C. Krintz, H. Mousa, P. Nagpurkar, T. Sherwood. Efficient remote profiling for
resource-constrained devices. ACM Transactions on Architecture and Code Optimization,
Vol. 3, No. 1 (2006) 35-66.

[7] E. Larson, R. Palting. MDAT: A Multithreaded Debugging and Testing Tool. SIGCSE '13:
Proceedings of the 44th ACM technical symposium on Computer science education (2013)
403-408.

[8] J. L. Hennessy, D. A. Patterson. Computer Organization and Design, 5th ed. Oxford:
Elsevier Inc. (2003).

[9] A. McQuaid. Profiling Software Complexity. ProQuest Dissertations and Theses (1996).

[10] Microsoft. C# Language Specification. Microsoft Press (2006).

[11] Microsoft [Internet]. Memory Usage Auditing For .NET Applications. [cited 2014-04-20].
Available from: http://msdn.microsoft.com/en-us/magazine/dd882521.aspx

[12] Microsoft [Internet]. Making an Image Easier to Debug. [cited 2014-04-20].
Available from: http://msdn.microsoft.com/en-us/library/9dd8z24x%28v=vs.110%29.aspx

[13] G. E. Moore. Cramming More Components onto Integrated Circuits. Proceedings of the
IEEE, Vol. 86, No. 1 (1998) 82-85.

[14] B. J. Oates. Researching Information Systems and Computing. London: Sage Publications
Ltd (2003).

[15] D. Thain, K. B. Wheeler. Visualizing massively multithreaded applications with
ThreadScope. Concurrency Computation: Practice and Experience (2010) 45-67.


A Appendix: Tables
Profiler     Time Taken (minutes)

Baseline     1:42.1
Profiler 1   2:02.5
Profiler 2   8:44.6
Profiler 3   1:43.4
Profiler 4   1:50.1
Profiler 5   6:17.2

Table 8: Time Taken to Run Tests

Profiler     Average CPU Core Usage /s

Baseline     240.8%
Profiler 1   241.3%
Profiler 2   169.8%
Profiler 3   241.6%
Profiler 4   244.9%
Profiler 5   344.5%

Table 9: CPU Core Usage per Second

Profiler     Test Application (MB)   Profiler (MB)   Total (MB)

Baseline     34.8                    0               34.8
Profiler 1   43.9                    245.5           289.4
Profiler 2   230.0                   257.5           487.4
Profiler 3   47.9                    44.8            92.7
Profiler 4   34.7                    372.0           406.7
Profiler 5   55.9                    12.8            68.7

Table 10: Memory Usage

Profiler     Minimum CPU Core Usage   Maximum CPU Core Usage

Baseline     218.0%                   251.3%
Profiler 1   227.8%                   263.6%
Profiler 2   166.2%                   175.3%
Profiler 3   229.8%                   252.2%
Profiler 4   229.4%                   255.5%
Profiler 5   318.9%                   360.4%

Table 11: Minimum and Maximum CPU Core Usage per Second

Profiler     Minimum Memory Usage (MB)   Maximum Memory Usage (MB)

Baseline     30.3                        39.2
Profiler 1   273.8                       310.8
Profiler 2   330.6                       628.9
Profiler 3   78.8                        103.1
Profiler 4   403.1                       415.3
Profiler 5   62.3                        76.9

Table 12: Minimum and Maximum Memory Usage per Second
