
Organization of 3D-stacked memory

Arkaprava Basu Rathijit Sen


Department of Computer Sciences, University of Wisconsin-Madison
E-mail: {basu,rathijit}@cs.wisc.edu

Abstract

The three-dimensional stacking of memory on top of the processor is emerging as one of the most promising technologies for mitigating the memory-wall problem in computer systems. Here we present an efficient way to organize and utilize 3D die-stacked memory that also addresses a few drawbacks of the technology. We describe a novel way of partitioning the physical address space of the system between the faster 3D die-stacked DRAM and the traditional 2D main memory. Based on the page usage pattern, we dynamically alter the virtual-to-physical address translation mechanism to contain most of the application's memory accesses within the faster 3D DRAM. We found that our scheme can almost double the speedup compared to a dumb organization of 3D DRAM for moderately sized 3D DRAM on some of the benchmarks, and on average it increases the speedup by around 30% for the set of four commercial workloads studied here. More interestingly, we demonstrate in this work that a memory hierarchy consisting only of 3D DRAM may not be the best possible way to efficiently utilize the new technology.

Figure 1. 3D die-stacked memory

1 Introduction

Over the last decade, DRAM latency and bandwidth have emerged as among the biggest performance bottlenecks in the evolution of high-end computers. The advent of multi-core processors has further fueled the need for higher memory bandwidth. 3D die-stacked DRAM has emerged as a promising new technology that can address this grand challenge. In contrast to conventional memory, 3D memory is stacked on top of the processor through face-to-face bonding, reducing the wire delay between the two and alleviating the need to go off-chip on a last-level cache miss [9]. Figure 1 logically depicts the vertical stacking of memory on top of the processor. 3D die-stacking technology allows multiple layers of active silicon to be bonded with dense vertical interconnect, enabling higher memory bandwidth [7][5]. Thus 3D DRAM is emerging as a natural way to address the memory-wall problem.

As with any emerging technology, however, 3D memory stacking comes with its own set of possible bottlenecks. Foremost among them is the thermal problem. Due to the higher power density of vertical stacking and the relative increase in thermal resistance, since some layers sit farther from the heat sink, the peak temperature of 3D die-stacked designs tends to increase. Moreover, the still-fledgling 3D integration process is also reported to have lower yield. Both of these problems are aggravated as more and more layers of vertical memory are added. In this work we explore a holistic way of organizing 3D DRAM in the memory hierarchy that tries to better leverage the benefits of the new technology while limiting the effect of its drawbacks. We look into how 3D DRAM can be kept as part of the processor's physical address space, and how we can dynamically modify the virtual-to-physical address mapping to contain most of the application's memory requests in the faster 3D DRAM without having to fetch data from the slower main memory. In this way we get the hit latency of a direct-mapped structure instead of the larger associative structure we would need if the 3D DRAM were used as a cache. On the other hand, a relatively small 3D DRAM, compared to a main memory built entirely of 3D DRAM, helps mitigate many of the drawbacks faced by today's 3D integration technology and is thus likely to enable faster but incremental adoption of this new technology.
Figure 2. Basic organization of the memory hierarchy

2 Related Work

Multi-level memory hierarchy has been studied by Ekman et al. [4]. The authors showed that a large proportion of an application's memory footprint can sustain long-latency accesses, making it possible to keep most of the application's working set in a relatively smaller and faster main-memory level. This work thus argues for having multiple levels of main memory, analogous to caches, for better performance. There is also a rich body of work on NUMA architectures with dynamic page migration [3][11], which relates to our scheme of page movement between the 3D and the traditional main memory. COMA architectures go a step further and organize all of the available main memory in a cache-like fashion, allowing transparent data movement among the memories of different nodes of the system. Among the different flavors of COMA, S-COMA relates most closely to our proposed design, as in S-COMA the page allocation and replacement is handled in the Operating System.

3 Our Approach

In this work we assume a single-core processor with a standard two-level cache hierarchy, backed by a 3D die-stacked DRAM as well as traditional main memory. The basic architecture is shown in figure 2. The physical address space is partitioned between the faster on-chip 3D DRAM and the traditional off-chip 2D main memory. A last-level cache miss can thus be serviced directly by either the 3D DRAM or the traditional main memory, depending on the physical page number of the requested data. Exclusion between the data contained in the main memory and the 3D DRAM is maintained automatically. A 1:2 decoder decides where a particular memory request will go, based on a single bit of the page number. Notice that both the 3D DRAM and the traditional 2D main memory lie at the same level of the memory hierarchy and there is no data duplication between them. We assume that the OS maintains a set of free physical pages at any particular point in time. User applications are oblivious to the presence of 3D memory, although the Operating System is aware of it. The physical pages mapped to the 3D memory (say page numbers 0..X-1) can be accessed faster and can support higher bandwidth than pages mapped to main memory (page numbers ≥ X). Hence, the goal of the proposed system is to keep as many pages as possible from the current working set of the process in the 3D memory. Conceptually this is done by finding hot pages that currently lie in the slower 2D main memory and bringing them into the faster on-chip 3D DRAM, sacrificing one of the pages currently present in the 3D DRAM. In this way the scheme tries to keep most of the working set of the application in the 3D DRAM. We implemented simple demand-paging-based virtual memory, and as we keep the 3D DRAM in the lower portion of the physical address space, we implicitly fill up the 3D DRAM before overflowing to the slower traditional main memory.

The proposed mechanism responds to two events: a page fault, and an access to a page in the slower memory (main memory). In the following we describe the events, the possible circumstances and the requisite action from the system.

1. PAGE FAULT: There may be two different conditions, under which the system behaves accordingly.

   (a) There is a free page in 3D DRAM (i.e. physical page number < X): allocate the page in the 3D memory and return.

   (b) There is no free page with page number < X, but there is a free page in the slower main memory: allocate an available page in the main memory and return.

2. PAGE ACCESS: Access to a page with page number ≥ X, say Y:

   (a) Service the data request.

   (b) Update the access information for Y.

   (c) Decide whether the page Y is a candidate for placement in the 3D memory. If yes,

       i. Find a victim page number in the 3D memory. Let it be Z.

       ii. Trigger a TLB shootdown and swap the virtual-to-physical mappings of Z and Y.

       iii. Invalidate the cache blocks corresponding to the swapped pages.

       iv. Swap the data between Z and Y.

       Else, do nothing.
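To make the control flow of these two events concrete, the following is a minimal C sketch. All helper routines and the constant X are hypothetical stand-ins for OS and hardware services; the paper does not prescribe an API, so this is only an illustration of the algorithm above, not the authors' implementation.

#include <stdint.h>

typedef uint32_t pfn_t;                /* physical frame number */
#define X         4096u                /* frames 0..X-1 live in 3D DRAM (assumed) */
#define NO_FRAME  ((pfn_t)-1)

/* Hypothetical OS/hardware services, not part of the paper. */
extern pfn_t alloc_free_frame_below(pfn_t limit);
extern pfn_t alloc_free_frame_at_or_above(pfn_t limit);
extern void  service_request(pfn_t y);
extern void  bump_access_counter(pfn_t y);
extern int   is_migration_candidate(pfn_t y);
extern pfn_t select_victim_in_3d(void);
extern void  tlb_shootdown(void);
extern void  swap_mappings(pfn_t a, pfn_t b);
extern void  invalidate_page_blocks(pfn_t p);
extern void  dma_swap_pages(pfn_t a, pfn_t b);

/* 1. PAGE FAULT: prefer a free 3D frame, else fall back to slow memory. */
pfn_t handle_page_fault(void)
{
    pfn_t p = alloc_free_frame_below(X);          /* case (a) */
    if (p != NO_FRAME)
        return p;
    return alloc_free_frame_at_or_above(X);       /* case (b) */
}

/* 2. PAGE ACCESS to a slow-memory page y (page number >= X). */
void handle_slow_access(pfn_t y)
{
    service_request(y);                  /* (a) satisfy the demand access  */
    bump_access_counter(y);              /* (b) update access information  */
    if (!is_migration_candidate(y))      /* (c) hot enough to promote?     */
        return;                          /*     else, do nothing           */

    pfn_t z = select_victim_in_3d();     /* i.   victim frame Z (< X)      */
    tlb_shootdown();                     /* ii.  swap V->P mappings of Z,Y */
    swap_mappings(z, y);
    invalidate_page_blocks(z);           /* iii. drop stale cached blocks  */
    invalidate_page_blocks(y);
    dma_swap_pages(z, y);                /* iv.  exchange page contents    */
}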
We note that whether a software handler is required to enforce the victim selection policy on a swap, or whether modest hardware is good enough to do the job, depends on the victim selection policy, which is dealt with in section 4. Moreover, we assume that a DMA engine supervises the data swapping between the two pages in the 3D DRAM and the off-chip main memory. We also note that we must invalidate any cache blocks residing in the cache hierarchy that belong to the pages participating in the swap operation. The other way to achieve the same effect would be to use a software handler that explicitly writes the two pages to swap the data, in which case explicit invalidation of cache blocks is not required. But we decided against this option, as it may unnecessarily pollute the cache with data that will not be used by the program in the near future. Moreover, the DMA transfer is more efficient, as it relieves the processor, allowing it to carry on with its other tasks. A sketch of the swap operation appears below.
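The sketch below makes the swap sequence concrete. It is only illustrative: the platform hooks (flush_and_invalidate_line, dma_copy, tlb_shootdown_page), the bounce buffer, and the assumed 64-byte line size are ours, not the paper's, and a real implementation would overlap these steps rather than run them serially.

#include <stdint.h>

#define PAGE_SIZE  4096u      /* matches the 4KB pages in Table 1 */
#define LINE_SIZE  64u        /* assumed cache line size; not given in the paper */

/* Hypothetical platform hooks. */
extern void flush_and_invalidate_line(uintptr_t paddr);
extern void dma_copy(uintptr_t dst, uintptr_t src, uint32_t bytes);
extern void tlb_shootdown_page(uintptr_t vaddr);

/*
 * Swap a hot slow-memory page with a 3D-DRAM victim page through a
 * bounce buffer. Invalidating every line of both pages first keeps the
 * cache hierarchy coherent with the relocated data.
 */
void swap_page_data(uintptr_t page_3d, uintptr_t page_2d, uintptr_t bounce)
{
    for (uint32_t off = 0; off < PAGE_SIZE; off += LINE_SIZE) {
        flush_and_invalidate_line(page_3d + off);
        flush_and_invalidate_line(page_2d + off);
    }
    dma_copy(bounce,  page_3d, PAGE_SIZE);  /* 3D victim -> bounce buffer */
    dma_copy(page_3d, page_2d, PAGE_SIZE);  /* hot page  -> 3D DRAM      */
    dma_copy(page_2d, bounce,  PAGE_SIZE);  /* victim    -> slow memory  */
}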
4 Design and implementation Issues

As mentioned earlier, realizing the project goals requires seamlessly modifying the virtual-to-physical mapping to allow pages to move between the 3D memory and the traditional main memory. This requires modifying the corresponding code in the Operating System that handles virtual memory. But we hypothesize that this can also be modeled in the simulator with enough fidelity, without modifying the operating system kernel, by defining another level of mapping: a physical-to-real address mapping. The memory hierarchy, including the caches, then sees only real addresses rather than the physical addresses generated by the operating system. This allows an easier and faster implementation of our ideas inside the simulator. We also identified several policy and design choices that might impact the ultimate performance gain of the overall system. In the following we discuss them in the context of our proposal; a sketch of the resulting policy machinery follows the list.

• We implemented a simple demand-paging-based physical-to-real address mapping. Here we keep a counter which is initialized to zero. A physical page accessed for the first time gets a real page number equal to the current value of this counter, and we increment the counter thereafter. This simple scheme thus assigns sequential real page numbers in the order in which physical pages are first accessed. One implication of this scheme for our system is that it always fills the memory allocated to 3D DRAM first, before overflowing to the traditional two-dimensional main memory, as the 3D DRAM occupies the lower portion of the real address space (0 to X-1).

• On an access to a page in slower memory (traditional off-chip main memory), we might decide to fetch this page to the 3D memory immediately, or only after we see a few more subsequent accesses to the same page. We argue that blindly fetching any page from the main memory to the 3D memory on its first touch may be counter-productive, given the overhead of moving the page and changing the page-address mapping, both in terms of latency and power. This, in turn, requires a per-page access counter that keeps track of the frequency of accesses to a page in the traditional main memory. The size of this per-page saturating counter and the threshold (UPGRADE THRESHOLD) that triggers migration of a page from slower memory to faster 3D DRAM are parameters to be tuned through empirical results.

• Once we decide to migrate a page from the slower memory to the faster memory, we also need to select a victim page in the faster memory that will make space for the newly arriving page in the 3D memory. This victim selection can be done by employing various policies; Random and LRU are the two that we explore in this work. Notice that the per-page access counters described above might be reused here as well. We also note that Random replacement can be easily implemented in software, but implementing pure LRU over the whole 3D DRAM might require complicated hardware. We could approximate LRU via non-MRU or pseudo-LRU schemes, or a software-handled second-chance algorithm, but an elaborate discussion of this is beyond the scope of this work. Our work primarily implements the Random algorithm, but compares it with pure LRU to explain the tradeoff in section 6.2.

• We expect the cost of migrating a page from the slower main memory to the faster 3D DRAM to be non-trivial. To amortize this cost, we hypothesize that, rather than migrating a page immediately once it is considered a candidate for migration, we may wait for a few more pages to be selected as candidates and then pipeline the process of migrating them. The primary costs that can be amortized in this way are the cost of TLB shootdown and the DMA transfer initiation cost. We term this the LAZY policy of migration, in contrast to the simpler IMMEDIATE policy. We plan to explore the effect of both the LAZY and the IMMEDIATE policy on overall system performance as part of future work. We note that the algorithm in section 3 actually describes the IMMEDIATE policy rather than the LAZY policy.

• One more subtle issue regarding victim selection and the migration decision policy is when to re-initialize the per-page access counters. Not clearing the counters regularly enough would make all of them saturate, triggering swap operations too frequently. In our implementation, whenever a page is swapped out to the slower main memory from the on-chip 3D memory, we re-initialize its access counter.
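The sketch below ties these choices together: sequential physical-to-real allocation, saturating per-page counters with an UPGRADE THRESHOLD, RANDOM victim selection, and counter reset after a swap. It is a minimal illustration under assumed parameter values (a 3-bit counter, the paper's default threshold of 3); none of the identifiers come from the actual implementation.

#include <stdint.h>
#include <stdlib.h>

#define NUM_3D_PAGES       4096u        /* X: assumed 3D DRAM size in pages */
#define NUM_PHYS_PAGES     (1u << 20)   /* assumed 4GB of 4KB pages         */
#define UNMAPPED           0xFFFFFFFFu
#define COUNTER_MAX        7u           /* assumed 3-bit saturating counter */
#define UPGRADE_THRESHOLD  3u           /* the paper's default value        */

static uint32_t real_of_phys[NUM_PHYS_PAGES];  /* physical -> real mapping  */
static uint8_t  access_ctr[NUM_PHYS_PAGES];    /* per-page access counters  */
static uint32_t next_real;                     /* sequential allocation ptr */

/* Hypothetical helper wrapping the shootdown/invalidate/DMA sequence. */
extern void migrate_and_remap(uint32_t hot_ppn, uint32_t victim_rpn);

void init_mapping(void)
{
    for (uint32_t i = 0; i < NUM_PHYS_PAGES; i++)
        real_of_phys[i] = UNMAPPED;
}

/* First touch of a physical page gets the next sequential real page, so
 * real pages 0..X-1 (the 3D DRAM) fill up before slow memory is used. */
uint32_t real_page(uint32_t ppn)
{
    if (real_of_phys[ppn] == UNMAPPED)
        real_of_phys[ppn] = next_real++;
    return real_of_phys[ppn];
}

/* Called on every access whose real page lands in slow memory (>= X). */
void on_slow_access(uint32_t ppn)
{
    if (access_ctr[ppn] < COUNTER_MAX)          /* saturating update */
        access_ctr[ppn]++;
    if (access_ctr[ppn] < UPGRADE_THRESHOLD)
        return;                                 /* not hot enough yet */

    /* RANDOM victim selection over the 3D portion of the real space. */
    uint32_t victim = (uint32_t)rand() % NUM_3D_PAGES;

    /* TLB shootdown, cache invalidation and the DMA data swap from the
     * earlier sketches are hidden inside this helper. */
    migrate_and_remap(ppn, victim);

    access_ctr[ppn] = 0;  /* promoted page starts fresh; per the last bullet,
                             the demoted page's counter is also re-initialized
                             (needs a real->physical reverse map, elided). */
}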
5 Current Implementation

In the following we describe the current implementation of the proposed system in the GEMS full-system simulation framework [8].

• We implemented the physical-to-real address mapping, and every memory access uses its physical address to look up the corresponding real address before accessing the cache hierarchy; a sketch of this lookup follows the list.

• Our current implementation supports both the RANDOM and the LRU victim selection policy to decide which page is swapped from the 3D memory to the slower main memory to make space for a hot page from the main memory. The framework can support both the IMMEDIATE and LAZY swap policies (see section 4), but simulation results are reported only for the IMMEDIATE policy due to time constraints.

• The current implementation also supports static modification of the UPGRADE THRESHOLD across simulation runs (see section 4).

• One caveat of the current implementation is that it does not fully implement the cache invalidation on the swapping of pages. This leads to a small amount of infidelity in the results, but it is upper-bounded by a constant multiple of the number of swaps.
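As a concrete illustration of the first bullet, a translation hook of roughly the following shape suffices. The naming is ours (it reuses real_page from the earlier sketch), not taken from the GEMS code.

#include <stdint.h>

#define PAGE_SHIFT 12                         /* 4KB pages (Table 1) */

extern uint32_t real_page(uint32_t ppn);      /* mapping from the sketch above */

/* Translate a physical address to a real address before the cache lookup.
 * The cache hierarchy (and the 1:2 memory decoder) only ever sees real
 * addresses; a single page-number bit then routes the request to the 3D
 * DRAM or the off-chip 2D DRAM. */
static inline uint64_t phys_to_real(uint64_t paddr)
{
    uint64_t ppn    = paddr >> PAGE_SHIFT;
    uint64_t offset = paddr & (((uint64_t)1 << PAGE_SHIFT) - 1);
    return ((uint64_t)real_page((uint32_t)ppn) << PAGE_SHIFT) | offset;
}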
6 Evaluation

6.1 Simulation Methodology

As mentioned above, we implemented our proposed system in the GEMS [8] full-system simulation framework. We evaluated the system with four different commercial workloads, namely Apache, Jbb, Oltp and Zeus, running 10000 transactions for each of these benchmarks for the primary graphs showing performance improvements. We ran simulations of 1000 transactions each for the runs that demonstrate the tradeoffs of choosing different values of the tunable parameters or of choosing different possible policies. For better comparison, we have two baseline systems: one is a system with full off-chip 2D main memory, and the other has the full physical memory implemented as 3D on-chip memory. In both the baselines and in the proposed system we used our physical-to-real address mapping to ensure comparability of the data. We evaluated our system only for a single-core system, but there is no fundamental issue in our proposed system that would cause it to behave differently in a CMP environment. The important parameters of the simulated system are listed in table 1. Unless otherwise mentioned, our default configuration for all experimental data presented henceforth sets the victim selection policy to RANDOM and the UPGRADE THRESHOLD to 3.

For the calculation of the different latency numbers we use the CACTI 5.3 tool [1]. We assumed that all the 3D DRAM sizes in our experiments have only one layer, as their sizes are relatively small (up to a maximum of 128 MB), except one case, where we model a full 3D DRAM of size 4GB. In the all-3D-DRAM configuration, with 4GB of 3D DRAM stacked on top of the die, we assume 4 layers with each layer having a capacity of 1 GB. In all these cases we conservatively assume that the latency of a 3D DRAM access is 2/3rd of the normal DRAM access latency, as in [7]. In the case where we simulated 4GB of 3D DRAM in 4 layers, we optimistically made its access latency equal to 2/3rd of the normal DRAM access latency for 1 GB, ignoring the latency due to the vertical vias, to put our proposed system at a relative disadvantage. This makes our study more robust. For each swapping operation we stall the whole processor for 150 cycles to account for the overhead of the swapping operation.
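As a check of this 2/3rd assumption against Table 2 below, the largest single-layer configuration works out exactly:

t_{3D}(128\,\mathrm{MB}) = \tfrac{2}{3} \times t_{2D}(128\,\mathrm{MB}) = \tfrac{2}{3} \times 27\,\mathrm{ns} = 18\,\mathrm{ns},

while the smaller sizes round to the nearest nanosecond (e.g., 2/3 × 22 ≈ 14.7, tabulated as 14 ns for 64MB).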
Table 1. Basic parameters of simulations

Processor      In-order, single-issue, 2.0 GHz frequency
L1 cache       32KB, 4-way, split I/D
L2 cache       1MB, 8-way, unified
Main memory    4GB
Page size      4KB

Table 2. Memory Access Latencies in ns

Size      2D     3D
8MB       14      8
16MB      17     10
32MB      19     12
64MB      22     14
128MB     27     18
4GB      102     40

Figure 3. Percentage of all memory accesses contained within the 3D DRAM

6.2 Simulation results

In this section we present the various results obtained from studying the proposed system. The graphs presented in figure 3, figure 4 and figure 5 are for simulation runs of 10000 transactions each for the four commercial workloads: Apache, Jbb, Oltp and Zeus. The results presented in these three figures use the RANDOM victim selection policy with the UPGRADE THRESHOLD value set to 3. In figure 3 we present four graphs, one per commercial benchmark, showing what percentage of the total memory accesses is contained within the 3D DRAM as its size varies. The X axis of each graph gives the different sizes of the 3D DRAM, while the Y axis gives the corresponding percentage of memory accesses contained within the on-chip 3D DRAM. The two lines in each graph represent the percentage of accesses contained in 3D with and without our swapping mechanism: "SW" and "NS" stand for the data with our page swapping scheme enabled and without our swapping policy, respectively. Thus these graphs try to capture the efficacy of our swapping scheme over a dumb organization of the 3D DRAM. Here, the larger the gap between the two lines the better, with "SW" being the higher one. From the graphs we see a substantial increase in the 3D DRAM hit rate with our swapping policy. For example, for Apache with 16MB of added on-chip 3D DRAM, the dumb organization ("NS") contains only 32% of accesses in the 3D DRAM, while with the same configuration and our swapping policy in place, the accesses to 3D DRAM go up to 88% of all memory accesses. We may recall that the primary objective of this work is to capture as much of the working set of the application as possible in the faster 3D DRAM. We also observe a uniform trend across all the applications: for smaller sizes of the 3D DRAM there is more benefit from employing our swapping policy, while at around 128 MB there is hardly any benefit from swapping except for Oltp. This is to be expected, as our simple demand paging scheme automatically places any new page in the 3D DRAM before overflowing to the slower off-chip main memory. So it appears that for most of the applications except Oltp, the data set hardly crosses the 128 MB mark. It can also be noted that smaller sizes of the 3D DRAM provide a bigger opportunity for the swapping policy to show its efficacy over the dumb organization, as it can much better utilize the scarce on-chip memory.

Figure 4. Speedup over pure 2D main memory model

In figure 4, we demonstrate how much these savings in off-chip memory accesses translate into actual speedup over the baseline with no 3D main memory. The data is thus normalized by the number of execution cycles required by the full off-chip 2D main memory implementation; the normalization is stated as a formula below. This also shows impressive gains. For example, for Zeus with 32MB of on-chip 3D DRAM, the dumb 3D DRAM organization without any swapping obtains a speedup of 1.56, while the same 3D DRAM configuration with our swapping scheme improves the speedup to as much as 2.44.

In figure 5, we normalize the speedup over the configuration with full 3D DRAM and no off-chip main memory. In this graph most of the data is counter-intuitive: the graphs suggest that with a much smaller 3D DRAM, but with our intelligent swapping scheme, we are likely to outperform even the scheme where the whole memory is organized as on-chip 3D memory. Closer inspection suggests that this is a classic case of the capacity versus access time tradeoff. The applications do not gain much when the 3D DRAM is extended beyond a certain size, as most of their accesses are contained within the given 3D DRAM size anyway; but due to the larger size of the 3D DRAM, each access to it becomes costlier. So we can observe a break-even point after which it does not make sense to increase the available size of the 3D DRAM.

In figure 6, we show the geometric mean of the speedups we obtained over the baseline full 2D main memory and full 3D main memory. For moderate sizes of 3D DRAM, we achieve significantly more speedup over the full 2D memory configuration with our swapping mechanism than with the dumb organization of 3D DRAM. More interestingly, we even show a non-trivial performance improvement over the full 3D main memory implementation with a much smaller 3D memory size, aided by our swapping mechanism.
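Stated explicitly (in our notation, not the paper's), the two normalizations used in figures 4 and 5 for a configuration c are

S_{2D}(c) = \frac{\mathrm{cycles}(\text{2D-only baseline})}{\mathrm{cycles}(c)}, \qquad S_{3D}(c) = \frac{\mathrm{cycles}(\text{full-3D baseline})}{\mathrm{cycles}(c)},

so the 2.44 quoted above for Zeus means the 32MB-plus-swapping run finished in 1/2.44 of the execution cycles of the 2D-only baseline.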
Figure 5. Speedup over full 3D DRAM memory model

Figure 6. Geometric means of speedups

Table 3. Ratio of Swaps to Memory Accesses

3D Size    8MB     16MB    32MB    64MB    128MB
apache     0.064   0.045   0.024   0.006   0.001
jbb        0.088   0.062   0.030   0.002   0.000
oltp       0.057   0.042   0.028   0.018   0.012
zeus       0.058   0.036   0.016   0.003   0.001

Figure 7. Total number of memory accesses

Figure 8. Effect of varying the UPGRADE THRESHOLD

Table 3 shows the ratio of the number of swaps to the total number of memory accesses. As the size of the 3D memory increases, a larger portion of the working set of the process is contained within the 3D memory, so the number of swaps decreases with increasing 3D memory size. The data shows that the number of swaps reaches at most 8.8% of total memory accesses, with most configurations below 6.5%.

In figure 7 we present the absolute number of memory accesses for different benchmarks and sizes of 3D memory. It is observed that in most cases, at least for the smaller sizes of 3D memory, the number of memory accesses increases with swapping enabled. We note that the cache behavior can change due to the virtual-to-physical remapping done during swapping. For example, let us assume that we swap a page X, not residing in 3D memory, with a 3D page Y selected as the victim.
Figure 9. Comparison between RANDOM and LRU victim selection policy

Now, if more of the cache blocks corresponding to page Y were present in the cache than for page X before the swap, then after the swap fewer blocks of Y appear to be present in the cache. If we continue to access blocks of Y, then the number of L2 misses will increase. The exact opposite can happen as well if the roles of X and Y are interchanged. We observe that as we increase the size of the 3D memory, there are fewer swaps and so the two lines in the graphs converge. Note that in the graphs of figure 7 the number of memory accesses with swapping disabled does not remain constant, although the change is small. This is attributable to the fact that, with changes in the latency of the memory hierarchy, the number of instructions executed varies slightly due to the presence of synchronization primitives in the commercial workloads. This leads to a small variation in the number of L2 cache misses.

In figure 8, we demonstrate the effect of varying the UPGRADE THRESHOLD in terms of both the percentage of accesses contained in the 3D DRAM and the execution time (measured in millions of Ruby cycles). We can observe a mostly steady but small decrease in the percentage of memory accesses contained within the 3D DRAM with increasing size of the 3D memory.

In figure 9 and figure 10, we demonstrate the effect of the choice of victim selection policy in terms of both the percentage of accesses contained in 3D memory and the execution time (measured in millions of Ruby cycles). We can see that there is improvement in both metrics if LRU is used instead of the RANDOM victim selection policy. But this improvement is not significant, being mostly confined to 2-3%. We attribute this to the fact that most of the temporal locality is already filtered out by the cache hierarchy, leaving not enough room for improvement for LRU to exploit. We have to bear in mind that RANDOM is significantly simpler to implement, whereas pure LRU can be practically infeasible to implement. Also, in this experiment we do not penalize the LRU scheme with any latency for victim selection, as we only wanted to demonstrate the scope of improvement possible with a more complex victim selection policy.

7 Future Work

In this section we present some of the future work required to take this work to its logical conclusion, along with work that is in progress but could not be completed due to time constraints.
Figure 10. Comparison between RANDOM and LRU victim selection policy

• The current implementation of the proposed system has a small amount of infidelity in the number of memory accesses it reports. This is due to the fact that on swapping a page from traditional main memory to the faster 3D memory, we need to invalidate all the corresponding cache blocks resident in the cache hierarchy of the system; otherwise the processor may consume wrong values, as the addresses of the data have been altered. Full support for this is currently missing in the implementation, but is in progress. Note that this does not prevent us from simulating the benchmarks, as GEMS merely does the performance modeling while the functionality is taken care of by Simics. We also note that the impact of this is not likely to be large, as the worst-case number of extra cache misses due to it is limited by a constant multiplier on the number of swaps done.

• We plan to collect data on the effect of the "LAZY" policy (see section 4) in amortizing the cost of swapping on the system. This portion is already mostly complete in terms of implementation.

• We have mostly explored the latency advantages of the 3D DRAM in this work. We plan to extend this work to better model the bandwidth advantages of 3D DRAM as well.

• To make the case more cogent, we plan to do an in-depth thermal and power study of the proposed system.

• We also plan to extend the work to a CMP environment.

8 Conclusions

We clearly observe the efficacy of our scheme in terms of performance gain over both the dumb 3D DRAM organization and the full on-chip 3D DRAM implementation of the memory. Apart from the performance gain, our proposed system has the following advantages as well.

• As we use smaller sizes of 3D DRAM, and thus fewer vertical layers of active silicon, the thermal issue of 3D DRAM is largely mitigated, since peak temperature increases with the number of layers of active silicon.

• Similarly, another problem of 3D integration technology is lower yield, which also worsens with an increasing number of layers. Our proposal helps in this regard as well.

• With the entire memory implemented as on-chip 3D DRAM, it is never possible to upgrade the memory system. In our proposed system we have a fair amount of traditional off-chip memory, which is easily extendable. Thus we can at least upgrade the off-chip memory of the system.

Acknowledgments

We thank Jayaram Bobba of the Multifacet group for enlightening us about different aspects of the GEMS simulation framework. We also thank Andy Phelps of Google, Madison for his initial comments about the work. And last but not least, we thank Professor David Wood of UW-Madison for the project idea and for teaching us "Computer Architecture".

References

[1] CACTI 5.3. http://www.hpl.hp.com/research/cacti/.
[2] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCaule, P. Morrow, D. W. Nelson, D. Pantuso, P. Reed, J. Rupley, S. Shankar, J. Shen, and C. Webb. Die stacking (3D) microarchitecture. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 469-479, Washington, DC, USA, 2006. IEEE Computer Society.
[3] F. Dahlgren and J. Torrellas. Cache-only memory architectures. Computer, 32(6):72-79, 1999.
[4] M. Ekman and P. Stenstrom. A case for multi-level main memory. In WMPI '04: Proceedings of the 3rd Workshop on Memory Performance Issues, pages 1-8, New York, NY, USA, 2004. ACM.
[5] P. Jacob, O. Erdogan, A. Zia, P. M. Belemjian, R. P. Kraft, and J. F. McDonald. Predicting the performance of a 3D processor-memory chip stack. IEEE Design & Test, 22(6):540-547, 2005.
[6] P. Jacob, O. Erdogan, A. Zia, P. M. Belemjian, R. P. Kraft, and J. F. McDonald. Predicting the performance of a 3D processor-memory chip stack. IEEE Design & Test, 22(6):540-547, 2005.
[7] G. H. Loh. 3D-stacked memory architectures for multi-core processors. SIGARCH Computer Architecture News, 36(3):453-464, 2008.
[8] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Computer Architecture News, 33(4):92-99, 2005.
[9] L. A. Polka, H. Kalyanam, G. Hu, and S. Krishnamoorthy. Package technology to address the memory bandwidth challenge for tera-scale computing. Intel Technology Journal, 2007.
[10] K. Puttaswamy and G. H. Loh. Thermal herding: Microarchitecture techniques for controlling hotspots in high-performance 3D-integrated processors. In HPCA '07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 193-204, Washington, DC, USA, 2007. IEEE Computer Society.
[11] A. Saulsbury, T. Wilkinson, J. Carter, and A. Landin. An argument for simple COMA. Future Generation Computer Systems, 11(6):553-566, 1995.
