Supporting the Memory System Evaluation with a Monitor Simulator

Jie Tao
LRR-TUM, Institut für Informatik,
Technische Universität München, 85748 Garching, Germany
E-mail: tao@in.tum.de

Abstract

The memory system is one of the most important factors with a critical impact on the performance of applications running on both uni- and multiprocessors. This paper presents a simulation approach capable of collecting detailed information about the complete memory hierarchy, thereby providing feedback that can support explicit optimizations as well as automatic tuning of programs with respect to memory efficiency. The deployed simulator is a flexible and independent component modeling an existing hardware monitor that observes inter-node communications. The monitor simulator is also capable of tracing the transactions within a single processor. In addition, it can be optionally attached to any level of the memory system and used in combination with various simulation tools. This allows a collection of accurate and comprehensive performance data, while keeping the monitoring component extensible, portable, and widely applicable.

1 Motivation

Memory access latency is a general performance issue on various computer systems, including uniprocessors and multiprocessors, especially those with distributed shared memory. For the former, the gap between processor and memory speeds continues to grow, incurring a drastic impact on the overall performance. For the latter, the problem is more severe since accesses to remote memories usually exhibit significantly higher latency. Data locality is therefore receiving increasing attention. In uniprocessors, it appears as cache locality, which can be improved by reducing cache misses [8, 14]. In shared memory multiprocessors, additional locality issues exist which are related to the data distribution and can be optimized by limiting the necessity to access remote memories [11, 13].

The basis for such optimizations, however, is the ability to acquire performance data. Commonly used approaches are hardware counters [7], software profiling [3], and simulation [5]. Compared with the first two approaches, simulation has the advantage of supporting not only the evaluation of existing systems but also the prediction of new architectures. This approach is hence used within the work presented here.

The simulation approach has long been used for predicting and analyzing performance on target architectures. Among the developed simulation tools, the most comprehensive system is SimOS [4], which models entire uniprocessor and multiprocessor computer systems, allowing a collection of performance data on different hardware components. The data collection is done by annotations, which are simple Tcl scripts executed when an event of interest occurs in the simulator. Another well-known simulation system is SIMICS [9], which accurately models the system view of the target machine. A profiling approach is deployed within SIMICS for data acquisition. Besides these simulation tools, which usually target the complete architecture, several simulators address special issues like network routing [6] and I/O devices [1].

However, all existing systems acquire performance data by inserting annotations or instrumentation instructions into the simulator. As full instrumentation is not feasible, the collected data can be incomplete. In addition, performance data can only be provided at the end of the simulation process, rendering any on-line performance analysis impossible.

Hence, we develop a monitor simulator specially for collecting performance data about the complete memory system, including caches, memory accesses, and inter-node communications. The monitor concept is based on a hardware monitor [5] implemented for observing the interconnection traffic on a NUMA-like cluster. It is designed to snoop a local bus over which inter-node transactions are transferred and then extracts information from the transactions. The acquired information, i.e. access source, destination, and access address, is first stored in the monitor's registers and further delivered to a ring buffer. From the ring buffer, the data can be accessed by users and system software.

This concept is extended in the monitor simulator in order to also observe the buses between the CPU and the L1 cache, between L1 and the L2 cache, and between L2 and the local main memory. In addition, the monitor simulator post-processes the monitoring information to generate coarse-grained performance data such as access histograms. This enables an easy evaluation of the target component and the detection of access hot spots.

More specifically, the monitor simulator is an independent component and can be combined with any simulation tool capable of modeling the memory system. It can also be selectively inserted at any level of the memory hierarchy, allowing the user to flexibly choose the hardware of interest. In addition, monitoring data is stored in a specific physical memory space and periodically refreshed, allowing on-line analysis by performance tools. Besides this, the monitor simulator can be restarted during the simulation procedure. This enables the acquisition of per-phase information, which can be used to optimize synchronization operations, like locks and barriers, that generally have a critical performance impact on shared memory applications.

The monitor simulator has been combined with a simulation tool called SIMT [12]. SIMT is an event-driven multiprocessor simulator modeling the parallel execution of shared memory applications on SMPs and NUMA machines, with a focus on the memory hierarchy. For the monitor simulator, SIMT serves as a base simulation platform contributing memory references. First results have shown the feasibility and applicability of this monitoring approach.

The remainder of this paper is structured as follows. Section 2 describes the monitor simulator in detail, including the monitoring concept, the simulation implementation, the processing of the monitoring information, and the interface to a simulation tool. Section 3 briefly outlines the simulation environment SIMT on which the monitor simulator is validated. In Section 4, first experimental results with an initial evaluation of the memory system are discussed. Section 5 concludes the paper.

2 The Monitor Simulator

Simulation is regarded as a valuable approach for studying target architectures. Existing simulation systems, however, usually embed the data collection in the simulators themselves, preventing a full acquisition of performance data as well as on-line analysis.

This work deploys a single data collection mechanism, which uses a monitor simulator to allow the generation of complete information about the hardware. As the focus is on the memory system, this monitoring component is designed to observe the memory traffic on the caches and the distributed memories. The benefits of using a monitor simulator are feasibility, flexibility, and portability. Feasibility means that the gathered performance data is accurate and complete due to the full trace of memory transactions. Flexibility means that the user can optionally insert the monitor simulator at different levels of the memory hierarchy in order to focus the study on a specific component or the entire memory system. Finally, the monitor simulator is an independent software package and can be used with various simulation tools, forming another feature: portability.

2.1 Hardware Model

The monitor simulator uses a monitoring concept based on the hardware device [5] developed for a PC cluster based on SCI (Scalable Coherent Interface), an IEEE-standardized interconnection technology with high bandwidth, low latency, and a global physical access space. This monitoring facility was developed for tracing NUMA interconnection transactions in order to provide information about inter-node communications.

The hardware monitor comprises three components: a link interface, a counter array, and a PCI interface. The link interface is used to snoop a local bus, which connects a single node to the actual interconnection fabric, and to extract information from the transactions transferred over this bus. The counter array, the primary part of the hardware monitor, consists of registers for storing the information extracted by the link interface. The PCI interface, an interface between the PCI bus and the monitor, offers direct access to the host node, enabling users or system software modules to configure the hardware monitor and read the gathered monitoring data.

The hardware monitor can be configured into two working modes: static mode and dynamic mode. The static mode allows the hardware to be explicitly programmed for event triggering and action processing on special memory regions of interest, while the dynamic mode is based on histogram-driven monitoring, in which all memory transactions through the local interconnect bus are monitored in order to provide fine-grained monitoring statistics across the complete working set of an application. In order to record all remote memory accesses to and from a node with only limited hardware resources, a large user-defined ring buffer is maintained in the main memory. Whenever all counters of the hardware monitor are filled or one counter is about to overflow, a counter is evicted from the hardware monitor and stored in the ring buffer. The freed counter is then reclaimed by the monitoring hardware for the further monitoring process.

As the goal of this work is to provide comprehensive information about memory operations, it is not necessary to simulate the static mode, which is specifically intended for event triggering. In addition, the detailed hardware design of the link and the PCI interface is not relevant to the accuracy of the monitoring data. Both interfaces are hence simplified using flexible models.

2.2 Simulation Concept

The simulation concept is adopted directly from the hardware facility, with a few extensions. First, the link interface of the hardware monitor is modeled as a packet generator. It receives the access address and transaction type as input and, for each transaction, creates a packet, which is in fact a data structure holding the information contained in the transaction. The stored information includes the access source, the transaction type, the destination, and the access address, which is a page number and offset for memory references and a cache line number for accesses to cache memories.

The packets generated by the packet generator form the input to the module simulating the counter array. As the counter array is the main component of the monitoring facility, it is also the primary part of the monitor simulator. Besides inheriting the full functionality of the hardware, the simulated counter array is capable of distinguishing between memory accesses and cache transactions. In addition, it collects more information, e.g. the transaction type, which is not contained in the hardware registers.

The structure of the monitor simulator is shown in Figure 1. As can be seen, the packet generator handles the transactions delivered by the base simulation system. The result is information in the form of single items including the transaction type, the source node or process ID (depending on the location of the monitor), the virtual page or cache line number, and the page offset, which is empty for caches. The transaction type can be read, write, lock, and unlock for memory accesses, and load, hit, and replace for caches.

Whenever a transaction is issued, this information is compared with the corresponding entries of the counter array. If there is a matching entry, the corresponding counter is incremented. For example, assume a new packet is generated which corresponds to a read operation issued by node 2 to page 15 with an offset of 32. Since a corresponding entry (the first one shown in Figure 1) already exists in the counter array, this transaction is recorded in the same counter. Otherwise a new counter is initialized. In the case that no more space is available within the counter array, the least recently used counter is flushed to the ring buffer and the write pointer of the buffer is incremented.

As described above, each transaction is recorded using a single counter entry. In practice, the user does not need this fine granularity, which can lead to large histograms. The monitor simulator therefore includes the ability to influence the granularity of the monitoring data and thereby to adjust the monitor behavior to the intended use of the data. This granularity control is done in two steps. First, each transaction is filtered using a predefined mask, allowing unimportant transactions to be discarded. In the second step, transactions on neighboring memory addresses or cache lines are aggregated during the monitoring process. This aggregation is controlled by a user-definable parameter specifying the maximal range of addresses that may be combined.

In summary, the simulation concept is close to the hardware design. This can be used to guide the hardware implementation with respect to the trade-off between hardware costs and the runtime overhead of delivering the monitoring data. In addition, a few extensions have been added to the monitor simulator. These extensions concern mainly the monitoring of cache transactions, which is beyond the capability of the hardware. For this purpose, corresponding mechanisms have been added to both the packet generator and the counter array in order to record cache-specific information.
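A compact sketch of this matching and eviction logic is given below. It follows the description above, but all identifiers are invented for this illustration, the LRU bookkeeping is reduced to a per-entry timestamp, and the granularity control is only hinted at in a comment; it is not the simulator's actual source code.

    #include <stdint.h>

    typedef enum { T_READ, T_WRITE, T_LOCK, T_UNLOCK,   /* memory accesses    */
                   T_LOAD, T_HIT, T_REPLACE } trans_t;  /* cache transactions */

    typedef struct {              /* one packet produced by the packet generator */
        trans_t  type;
        int      source;          /* node or process ID, depending on location   */
        uint64_t page_or_line;    /* virtual page or cache line number           */
        uint32_t offset;          /* page offset; unused for cache transactions  */
    } packet_t;

    typedef struct {
        packet_t key;             /* what this counter is counting       */
        uint64_t count;
        uint64_t last_used;       /* for least-recently-used eviction    */
        int      valid;
    } counter_t;

    #define NCOUNTERS 64

    /* Record one packet: increment a matching counter, claim a free one, or
     * evict the least recently used counter to the ring buffer.  Offsets are
     * ignored here; with aggregation enabled, neighboring addresses within
     * the user-defined range would map to the same counter. */
    static void monitor_record(counter_t c[NCOUNTERS], const packet_t *p,
                               uint64_t now,
                               void (*flush_to_ring)(const counter_t *))
    {
        int free_idx = -1, lru_idx = -1;

        for (int i = 0; i < NCOUNTERS; i++) {
            if (c[i].valid && c[i].key.type == p->type &&
                c[i].key.source == p->source &&
                c[i].key.page_or_line == p->page_or_line) {
                c[i].count++;                    /* matching entry: just count */
                c[i].last_used = now;
                return;
            }
            if (!c[i].valid && free_idx < 0)
                free_idx = i;
            if (c[i].valid && (lru_idx < 0 || c[i].last_used < c[lru_idx].last_used))
                lru_idx = i;
        }

        if (free_idx < 0) {                      /* array full: evict LRU entry */
            flush_to_ring(&c[lru_idx]);
            free_idx = lru_idx;
        }
        c[free_idx] = (counter_t){ .key = *p, .count = 1,
                                   .last_used = now, .valid = 1 };
    }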
2.3 Processing the Monitoring Data

The performance data collected by the monitor simulator is finally stored in the ring buffer, which is a specific physical space in the main memory. This enables the monitoring information to be accessed by users and various performance tools. The monitor simulator itself provides functionality to process the monitoring data and generate access histograms in the following way.

The individual data items gathered during the monitoring process are first sorted into a buffer organized as a histogram chain. Within this buffer, data is indexed using page (cache line) numbers, where all items with the same page/line number are linked into the same chain. This enables a fast search of the buffer, since information about all transactions performed on a page or cache line can be found in a single chain.

Based on the histogram chain, the low-level monitoring information can be transformed into a higher-level and more readable form at various granularities. For this, a number of functions have been provided which compute the number of accesses to a specific memory region, to a complete virtual page or cache line, and to a whole processor node. The type of access can be specified, allowing statistical data about specific transactions to be obtained.
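To illustrate this indexing scheme, the following sketch builds such per-page chains and computes a per-page access count. The hash-bucket layout and all names are assumptions made for this example rather than the simulator's real data structures.

    #include <stdint.h>

    typedef struct item {          /* one raw data item taken from the ring buffer */
        uint64_t page;             /* page or cache line number (the index key)    */
        int      type;             /* read, write, lock, ...                       */
        int      source;           /* accessing node or process                    */
        uint64_t count;
        struct item *next;         /* next item with the same page number          */
    } item_t;

    #define NBUCKETS 1024          /* chains are grouped into hash buckets */

    typedef struct {
        item_t *chain[NBUCKETS];
    } histogram_t;

    /* Link a raw item into the chain belonging to its page number. */
    static void histogram_add(histogram_t *h, item_t *it)
    {
        unsigned b = (unsigned)(it->page % NBUCKETS);
        it->next = h->chain[b];
        h->chain[b] = it;
    }

    /* Number of accesses of a given type to one page or cache line;
     * a negative type counts all access types. */
    static uint64_t accesses_to_page(const histogram_t *h, uint64_t page, int type)
    {
        uint64_t sum = 0;
        for (const item_t *it = h->chain[page % NBUCKETS]; it; it = it->next)
            if (it->page == page && (type < 0 || it->type == type))
                sum += it->count;
        return sum;
    }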

[Figure 1: the packet generator receives the access address and transaction type from the base simulator and produces entries of the form (trans_type, source/PID, page_num/line_num, offset), e.g. (read, 2, 15, 32), (write, 1, 24, 4), (lock, 0, 10, 8), (unlock, 4, 15, 4), (load, 1, 30), (read hit, 2, 3), (write hit, 1, 5), (replace, 1, 10). These entries are matched against the counter array (counter 1 to counter 8 in the example), and evicted counters are written, via read and write pointers, to the ring buffer held in a specific physical memory space.]
Figure 1. Working principle of the monitor simulator.

2.4 Interface to Simulation Tools

The monitor simulator is intended to be used in combination with a simulation tool that is capable of generating memory transactions. For this, the access address and the transaction type are needed from the base simulation tool. On the side of the monitor simulator, a C API defines functions for operating the monitor. For example:

- Monitor_config(void *monin, int gran, int mask, int counters): allocates space for the counters and initializes the parameters for granularity control.

- Monitor_on(void *monin): switches the specified monitor on or off. In the case of an off operation, all monitoring data in the counter array is delivered to the ring buffer.

- Packet_generate(int source, unsigned long addr, int transtype, Paket *paket): generates an information packet for the corresponding transaction.

- Monitor(Paket paket, void *monin): records the packet in the corresponding monitor.

Besides these functions for operating the monitor, the C API contains further routines for processing the performance data obtained during the monitoring process. These routines cover the complete data handling described in the previous subsection, from reading the raw data up to the creation of memory access histograms. The latter can be used to analyze the access pattern and improve the data locality of applications.
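The following hypothetical usage sketch shows how a base simulator might drive this interface. The four function names and parameter lists are taken from the list above (with underscores assumed, since the original formatting was lost); the Paket layout, the stub bodies, the return types, and the call sequence are our own guesses, not code from the actual package.

    #include <stdio.h>

    /* The Paket layout is not published; this minimal definition only makes
     * the sketch self-contained. */
    typedef struct Paket {
        int           source;
        unsigned long addr;
        int           transtype;
    } Paket;

    /* Stub implementations standing in for the real monitor-simulator package,
     * using the prototypes suggested by the API list above. */
    static void Monitor_config(void *monin, int gran, int mask, int counters)
    {
        (void)monin;
        printf("config: granularity=%d mask=%d counters=%d\n", gran, mask, counters);
    }
    static void Monitor_on(void *monin) { (void)monin; puts("monitor toggled"); }
    static void Packet_generate(int source, unsigned long addr,
                                int transtype, Paket *paket)
    {
        paket->source = source; paket->addr = addr; paket->transtype = transtype;
    }
    static void Monitor(Paket paket, void *monin)
    {
        (void)monin;
        printf("record: src=%d addr=%#lx type=%d\n",
               paket.source, paket.addr, paket.transtype);
    }

    /* How a base simulator might feed one memory transaction to the monitor. */
    static void on_memory_event(void *monin, int node, unsigned long addr, int type)
    {
        Paket p;
        Packet_generate(node, addr, type, &p);
        Monitor(p, monin);
    }

    int main(void)
    {
        void *monin = 0;                       /* opaque monitor handle (assumed) */
        Monitor_config(monin, 32, 0, 64);      /* example parameter values        */
        Monitor_on(monin);
        on_memory_event(monin, 2, 0x0f20UL, 0 /* read */);
        return 0;
    }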
3 SIMT as an Evaluation Platform

The monitor simulator has been combined with a simulation system called SIMT (SIMulation Tool) [12]. SIMT simulates NUMA-characterized PC clusters with a focus on the behavior of remote memory accesses and with the goal of aiding data locality optimization on the distributed shared memory. SIMT runs applications from the SPLASH-2 benchmark suite [15] and any other application written in C/C++ using m4 macros [2] in a fashion similar to the SPLASH and SPLASH-2 applications. It supports a thread-based programming model with a shared address space and a private stack space for each thread.

SIMT comprises a front-end (a memory reference generator) and a backend modeling the target system, as shown in Figure 2. The front-end is based on Augmint [10], a multiprocessor simulation toolkit for Intel x86 architectures. It simulates the parallel execution of multiple processes running on multiple processors and generates events of interest. The main contribution of the front-end is the stream of memory references, which can be captured by the backend. The latter is a user-defined target system simulator invoked every time a significant event (a memory access or a synchronization operation) occurs.

[Figure 2: the memory reference generator (frontend) delivers memory events to the target system simulator (backend), which in turn exercises process control over the frontend.]

Figure 2. The two simulation components of SIMT.

[Figure 3: monitor simulators are attached between the CPU and the L1 cache, between the L1 and L2 caches, between L2 and the local memory, and between the local memory and the interconnection network. They produce an L1 access histogram, an L2 access histogram, a histogram of local accesses, and a histogram of remote accesses, which together form the performance data: memory access histograms.]

Figure 3. Using multiple monitor simulators to observe the memory system.

The backend consists of functionality representing the handling of these events on real hardware. As SIMT is intended primarily for research in the area of memory system design, the main components of the backend are a cache simulator modeling a two-level cache hierarchy with several cache consistency protocols, a shared memory simulator modeling the distributed shared memory with a spectrum of data distribution policies, and a network mechanism modeling the data transfer across processor nodes.

Memory references delivered by the front-end are first filtered by the cache simulator. Those references which are not satisfied within L1 or L2 are handled by the memory simulator as accesses to the main memory. Depending on the data distribution policy, these accesses can be either local or remote. For a local access, a bus request is issued and scheduled to be completed in a constant time corresponding to the simulated CPU. For a non-local memory reference, a network request with a timestamp is generated and inserted into a priority queue. This request is handled when it has the lowest timestamp in the queue and is arranged to be completed in the time specified for remote memory accesses, according to the properties of the simulated interconnection fabric. The numbers of local and remote accesses across the whole shared virtual space are counted, and the memory access histogram of an application is formed.
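A small sketch may help to picture this timestamp-ordered handling of remote references. SIMT's internals are not given in this paper, so the array-based priority queue, the fixed latency, and all names below are assumptions for illustration only.

    #include <stdint.h>

    #define MAX_PENDING    256
    #define REMOTE_LATENCY 100    /* cycles charged per remote access (example) */

    typedef struct {
        uint64_t timestamp;       /* simulated time at which the request was issued */
        int      src_node;
        int      home_node;
        uint64_t addr;
    } net_request_t;

    typedef struct {
        net_request_t req[MAX_PENDING];
        int           n;
    } prio_queue_t;

    /* Insert a remote memory reference; the queue is kept unsorted and
     * searched on removal, which is simple but O(n). */
    static int pq_push(prio_queue_t *q, net_request_t r)
    {
        if (q->n == MAX_PENDING)
            return -1;
        q->req[q->n++] = r;
        return 0;
    }

    /* Remove and complete the request with the lowest timestamp, returning
     * its completion time (issue time plus the fixed remote latency). */
    static uint64_t pq_handle_next(prio_queue_t *q)
    {
        if (q->n == 0)
            return 0;                          /* nothing pending */
        int best = 0;
        for (int i = 1; i < q->n; i++)
            if (q->req[i].timestamp < q->req[best].timestamp)
                best = i;
        uint64_t done = q->req[best].timestamp + REMOTE_LATENCY;
        q->req[best] = q->req[--q->n];         /* remove by swapping with the last */
        return done;
    }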
As SIMT aims at supporting the study of remote memory accesses, it offers only hit and miss ratios with respect to caches. In order to supply complete information about the various levels of the memory hierarchy, including CPU–L1, L1–L2, L2–memory, and local memory–remote memory, the monitor simulator is combined with SIMT and added at each level of the memory hierarchy (see Figure 3) in the same way.

As shown in Figure 3, each monitor simulator generates an independent histogram of the memory accesses occurring at the corresponding location in the system. These access histograms can be used individually for analyzing a specific component, and can also be combined into a final result which shows the combined memory access information, serving as the basis for evaluating the complete memory system.
4 First Experimental Results

Using the monitor simulator in combination with SIMT, a set of benchmark applications taken from the SPLASH-2 suite [15] has been simulated. These codes are intended for shared memory machines and can be executed on top of SIMT. A few of them have been chosen for presentation: FFT, a fast Fourier transformation; LU, an LU decomposition for dense matrices; RADIX, an integer radix sort; and WATER, which evaluates systems of water molecules.

For the following experiments, a cluster system is used on which shared data is distributed over the system in a round-robin fashion. Each processor node has an 8 KB L1 cache and a 64 KB L2 cache, each with 32 byte cache lines and a two-way associative organization. L1 uses a write-through scheme, while L2 uses a write-back policy. Caches in different processors are kept coherent with a hardware-like coherence protocol which invalidates all cache copies on a write operation to shared data.
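For reference, the simulated memory-system parameters of this setup can be collected in a small configuration record. The structure and field names are invented for this sketch; only the values are taken from the description above.

    /* Simulated per-node memory-system parameters used in the experiments;
     * the struct is illustrative, the values are those stated above. */
    typedef struct {
        unsigned l1_size_bytes;       /* 8 KB L1 cache                       */
        unsigned l2_size_bytes;       /* 64 KB L2 cache                      */
        unsigned line_size_bytes;     /* 32 byte cache lines                 */
        unsigned associativity;       /* two-way associative                 */
        int      l1_write_through;    /* L1: write-through                   */
        int      l2_write_back;       /* L2: write-back                      */
        int      round_robin_pages;   /* shared data distributed round-robin */
    } memsys_config_t;

    static const memsys_config_t experiment_config = {
        .l1_size_bytes     = 8 * 1024,
        .l2_size_bytes     = 64 * 1024,
        .line_size_bytes   = 32,
        .associativity     = 2,
        .l1_write_through  = 1,
        .l2_write_back     = 1,
        .round_robin_pages = 1,
    };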

[Figure 4: bar chart; the x-axis shows the address of the accessed location and the y-axis the number of accesses, with separate bars for accesses hit in L1, L2, the local memory (LM), and remote memories (RM).]

Figure 4. Parts of a complete memory access histogram (FFT on an 8 node system).

4.1 Access Histogram on the Complete Memory System

The first experiment aims at acquiring complete access histograms of the whole memory system. For this, each module in the memory hierarchy is connected with a monitor simulator. The granularity has been set to a single word, and the numbers of accesses to different memory words are computed.

Figure 4 visualizes part of the complete access histogram obtained by simulating the FFT code on an 8 node system. The x-axis shows the accessed locations, while the corresponding numbers of accesses hit in L1, L2, the local memory (LM), and remote memories (RM) are presented on the y-axis. While a high L1 hit ratio can be observed, it can also be seen that, among the accesses targeting the main memories, more are remote than local or satisfied by the L2 cache.

4.2 Individual Access Histograms

The complete access histogram described above aims at giving a first overview of the memory behavior. For optimizing specific components, the various individual access histograms, which can be acquired by a single monitor, are more useful.

Table 1 shows an access histogram of the L1 cache acquired by simulating the WATER code on a uniprocessor. The table presents the first several cache lines, with the line number in the first column, the total number of accesses in the second, the numbers of hits and loads in the third and fourth, and, in the last column, the percentage of cache hits relative to the total accesses to a single cache line.

    Cache line   No. of accesses   Hits   Loads   Hit ratio
    ----------   ---------------   ----   -----   ---------
         0             121          117      4       0.96
         1              96           91      5       0.94
         2              90           54     36       0.6
         3             122          119      3       0.97
         4              82           38     44       0.46
         5              87           48     39       0.55
         6              48           41      7       0.85
         7              74           22     52       0.29
        ...             ...          ...    ...       ...

    Table 1. L1 access histogram for WATER.

It can be seen that some cache lines have a low number of hits and a high number of loads, and hence a low hit ratio. This indicates that such cache lines are heavily required, but frequently replaced and reloaded without continuous residence in the cache. This scenario can cause significant performance loss and has to be avoided.
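The hit ratio in the last column is simply the fraction of accesses satisfied in L1. The following minimal snippet, with invented variable names, recomputes it for cache line 2 of Table 1:

    #include <stdio.h>

    int main(void)
    {
        /* Cache line 2 from Table 1: 90 accesses in total, 54 hits, 36 loads. */
        unsigned accesses = 90, hits = 54, loads = 36;

        double hit_ratio = (double)hits / (double)accesses;   /* 54/90 = 0.6 */
        printf("hits + loads = %u, hit ratio = %.2f\n", hits + loads, hit_ratio);
        return 0;
    }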
While the cache histogram supports the optimization of data distribution in caches, the histogram of accesses to the distributed memory aids in improving memory locality on shared memory systems. Such a histogram can be obtained by monitoring the interconnection network together with the local memories.

Figure 5 shows such a histogram acquired by simulating the RADIX code on a 4 node system. The figure presents all virtual pages located on node 0. The number of accesses to these pages is illustrated using columns, each representing a single access source.

[Figure 5: bar chart; the x-axis shows the page number and the y-axis the number of accesses, with separate bars for accesses from the local node and from node 1, node 2, and node 3.]

Figure 5. Histogram of accesses to the distributed memory (RADIX).

It can be seen that most pages are required by multiple nodes. They are not equally accessed, however; rather, one node performs the dominant share of the accesses. This information is therefore capable of directing an optimized placement of data on a distributed system.

5 Conclusions

The performance of the memory system generally has a significant impact on the overall performance of a system. In order to support the evaluation and optimization of this component, various approaches have been proposed, among which simulation is particularly useful due to its ability to model new architectures. Such a simulation approach is presented in this paper.

Common simulation systems, however, usually integrate the data collection with the simulation itself and can thereby provide performance data only at the end of the simulation. In order to allow a flexible and on-line acquisition of performance data, this work deploys a single monitoring component which can deliver the monitoring data at run-time during the execution and can be inserted at any level of the memory hierarchy. As this monitor simulator is independent of the target system, it can be selectively added and used with any simulation tool capable of modeling the memory system. This allows the investigation of various memory architectures as well as the optimization of individual modules and of the complete memory system.

The first experimental results show that the monitor simulator is able to supply a variety of access histograms at different granularities. A few sample results, which can be used to optimize cache locality and memory locality, have been presented in this paper. Besides these full access histograms, the monitor simulator allows the generation of per-phase monitoring results, which is useful for understanding the temporal behavior of applications with distinct phases. Overall, these results have proven the applicability and efficiency of the explored monitoring approach.

References

[1] R. Bagrodia, E. Deeljman, S. Docy, and T. Phan. Performance Prediction of Large Parallel Applications Using Parallel Simulations. ACM SIGPLAN Notices, 34(8):51–162, August 1999.

[2] J. Boyle, R. Butler, T. Disz, B. Glickfeld, E. Lusk, R. Overbeek, J. Patterson, and R. Stevens. Portable Programs for Parallel Processors. Holt, Rinehart, and Winston, New York, NY, 1987.

[3] D. Cortesi. Origin2000 and Onyx2 Performance Tuning and Optimization Guide, chapter 4. Silicon Graphics Incorporation, 1998.

[4] S. A. Herrod. Using Complete Machine Simulation to Understand Computer System Behavior. PhD thesis, Stanford University, February 1998.

[5] R. Hockauf, W. Karl, M. Leberecht, M. Oberhuber, and M. Wagner. Exploiting Spatial and Temporal Locality of Accesses: A New Hardware-based Monitoring Approach for DSM Systems. In Proceedings of Euro-Par'98 Parallel Processing / 4th International Euro-Par Conference, Southampton, volume 1470 of Lecture Notes in Computer Science, pages 206–215, UK, September 1998.

[6] H. C. Hsiao and C. T. King. MICA: A Memory and Interconnect Simulation Environment for Cache-based Architectures. In Proceedings of the 33rd IEEE Annual Simulation Symposium (SS 2000), pages 317–325, April 2000.

[7] Intel Corporation. Intel Architecture Software Developer's Manual for the Pentium II, volume 1–3. Published on Intel's developer website, 1998.

[8] C. Luk and T. C. Mowry. Memory Forwarding: Enabling Aggressive Layout Optimizations by Guaranteeing the Safety of Data Relocation. In Proceedings of the 26th International Symposium on Computer Architecture (ISCA'99), pages 88–99, May 1999.

[9] P. S. Magnusson and B. Werner. Efficient Memory Simulation in SimICS. In Proceedings of the 28th Annual Simulation Symposium, Phoenix, Arizona, USA, April 1995.

[10] A.-T. Nguyen, M. Michael, A. Sharma, and J. Torrellas. The Augmint Multiprocessor Simulation Toolkit for Intel x86 Architectures. In Proceedings of the 1996 International Conference on Computer Design, pages 486–491. IEEE Computer Society, October 1996.

[11] S. Tandri and T. S. Abdelrahman. Automatic Partitioning of Data and Computations on Scalable Shared Memory Multiprocessors. In Proceedings of the 1997 International Conference on Parallel Processing (ICPP '97), pages 64–73, Washington - Brussels - Tokyo, August 1997.

[12] J. Tao, W. Karl, and M. Schulz. Using Simulation to Understand the Data Layout of Programs. In Proceedings of the IASTED International Conference on Applied Simulation and Modeling (ASM 2001), pages 349–354, Marbella, Spain, September 2001.

[13] J. Tao, W. Karl, and M. Schulz. Visualizing the Memory Access Behavior of Shared Memory Applications on NUMA Architectures. In Proceedings of the 2001 International Conference on Computational Science (ICCS), volume 2074 of LNCS, pages 861–870, San Francisco, CA, USA, May 2001.

[14] D. Truong. Considerations on Dynamically Allocated Data Structure Layout Optimization. In Proceedings of the First Workshop on Profile and Feedback-Directed Compilation, in conjunction with PACT'98, Paris, France, October 1998.

[15] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24–36, June 1995.
