Organization of 3d-Stacked Memory
the difference between these two lines is better, with “SW” being the higher value. From the graph we see a substantial increase in 3D DRAM hit rate with our swapping policy. For example, for Apache with 16 MB of added on-chip 3D DRAM, the dumb organization (“NS”) ensures that only 32% of accesses are contained in the 3D DRAM, while with the same configuration and our swapping policy in place, the accesses to 3D DRAM go up to 88% of all memory accesses. We may recall that the primary objective of this work is to capture as much of the working set of the application as possible in the faster 3D DRAM. We also observe a uniform trend across all the applications: for smaller sizes of the 3D DRAM there is more benefit in employing our swapping policy, while at around 128 MB there is hardly any benefit from swapping, except for Oltp. This can be expected, as our simple demand paging scheme automatically places any new page in the 3D DRAM before overflowing it to the slower off-chip main memory. So it seems that for most of the applications except Oltp, the data set hardly crosses the 128 MB mark. It can also be noted that smaller sizes of the 3D DRAM provide a bigger opportunity for the swapping policy to show its efficacy, since it can utilize the scarce on-chip memory much better than the dumb organization can.

Figure 4. Speedup over pure 2D main memory model

In Figure 4, we demonstrate how much this saving in off-chip memory accesses translates into actual speedup over the baseline where we have no 3D main memory. The data is thus normalized by the number of execution cycles required by the full off-chip 2D main memory implementation. This also shows impressive gains. For example, for the application Zeus with 32 MB of on-chip 3D DRAM, the dumb 3D DRAM organization without any swapping obtains a speedup of 1.56, while the same 3D DRAM configuration with our swapping scheme improves the speedup to as much as 2.44.

In Figure 5, we normalize the speedup over the configuration with full 3D DRAM and no off-chip main memory. In this graph most of the data are counterintuitive: the graphs seem to suggest that a much smaller 3D DRAM with our intelligent swapping scheme is likely to outperform even the scheme where the whole memory is organized as on-chip 3D memory. Closer introspection suggests that this is the classic case of a capacity versus access time tradeoff. The applications do not gain much when the 3D DRAM is extended beyond a certain size, as most of their accesses are contained within the given 3D DRAM size anyway. But due to the larger size of the 3D DRAM, each access to it now becomes costlier. So we can observe a break-even point after which it does not make sense to increase the available size of the 3D DRAM.

Figure 5. Speedup over full 3D DRAM memory model

In Figure 6, we show the geometric mean of the speedups obtained over the baselines of full 2D main memory and full 3D main memory. For moderate sizes of 3D DRAM, we achieve significantly more speedup over the full 2D memory configuration with our swapping mechanism than with the dumb organization of 3D DRAM. More interestingly, we even show a non-trivial performance improvement over
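The break-even behavior described above can be illustrated with a minimal average-access-time model. All latencies and hit rates below are made-up placeholders for illustration only, not measured values from our experiments.

```python
# Illustrative model of the capacity vs. access-time tradeoff:
# a larger stacked DRAM captures more accesses (higher hit rate)
# but each access to it becomes slower.

def avg_access_time(hit_rate, t_3d, t_offchip):
    """Average memory access time when a 3D DRAM fronts off-chip DRAM."""
    return hit_rate * t_3d + (1.0 - hit_rate) * t_offchip

# (size in MB, assumed 3D hit rate, assumed 3D access time in cycles)
configs = [(16, 0.70, 30), (32, 0.80, 34), (64, 0.88, 38),
           (128, 0.90, 44), (256, 0.91, 52)]
T_OFFCHIP = 100  # assumed off-chip access time in cycles

for size, hr, t3d in configs:
    print(size, "MB:", round(avg_access_time(hr, t3d, T_OFFCHIP), 2), "cycles")
```

With these placeholder numbers the average access time is minimized at the 64 MB point and rises again for larger sizes, mirroring the break-even point observed in Figure 5.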
with 3D page Y selected as victim. Now, if more of the cache blocks corresponding to page Y were present in the cache than for page X before the swapping, then after swapping fewer blocks of Y appear to be present in the cache. If we continue to access blocks of Y, then the number of L2 misses will increase. The exact opposite can happen as well if the roles of X and Y are exchanged. We observe that as we increase the size of the 3D memory, there are fewer swaps, and so the two lines in the graphs converge. Note that in the graphs of Figure 7 the number of memory accesses with swapping disabled does not remain constant, although the change is small. This is attributable to the fact that with changes in the latency of the memory hierarchy, the number of instructions executed varies slightly due to the presence of synchronization primitives in the commercial workloads. This leads to a small variation in the number of L2 cache misses.

In Figures 9 and 10, we demonstrate the effect of the choice of victim selection policy on both the percentage of accesses contained in 3D memory and the execution time (measured in millions of Ruby cycles). We can see an improvement in both metrics if LRU is used instead of the RANDOM victim selection policy. But this improvement is not significant, mostly confined to 2–3%. We attribute this to the fact that most of the temporal locality is already filtered out by the cache hierarchy, leaving little room for LRU to exploit. We have to bear in mind that RANDOM is significantly simpler to implement, while LRU can actually be practically infeasible to implement. Also, in this experiment we do not penalize the LRU scheme for the latency of victim selection, as we only tried to demonstrate the scope of improvement possible with a more complex victim selection policy.
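The two victim selection policies compared above can be sketched as follows. This is a toy stand-in for illustration, not the simulator's actual page-table machinery; the class and method names are our own.

```python
import random
from collections import OrderedDict

class VictimSelector:
    """Toy model of a 3D-DRAM page pool with RANDOM or LRU victim selection."""

    def __init__(self, capacity, policy="RANDOM", seed=0):
        self.capacity = capacity
        self.policy = policy
        self.pages = OrderedDict()        # pages currently in 3D DRAM, LRU order
        self.rng = random.Random(seed)

    def touch(self, page):
        """Access a page; returns the evicted victim page, or None on a hit."""
        if page in self.pages:
            self.pages.move_to_end(page)  # refresh recency on every access
            return None
        victim = None
        if len(self.pages) >= self.capacity:
            if self.policy == "LRU":
                victim, _ = self.pages.popitem(last=False)   # least recent
            else:                                            # RANDOM
                victim = self.rng.choice(list(self.pages))
                del self.pages[victim]
        self.pages[page] = None
        return victim
```

Note that the LRU variant must update recency state on every 3D-DRAM access, while RANDOM needs no per-access bookkeeping at all, which is why a hardware LRU over thousands of pages can be practically infeasible.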
In Figure 8, we demonstrate the effect of varying the UPGRADE THRESHOLD on both the percentage of accesses contained in 3D DRAM and the execution time (measured in millions of Ruby cycles). We can observe a mostly steady but small decrease in the percentage of memory accesses contained within the 3D DRAM with increasing size of the 3D memory.

Figure 10. Comparison between RANDOM and LRU victim selection policy

7 Future Work

In this section we present some of the proposed future work required to take this work to its logical conclusion, along with work that is in progress but could not be completed due to time constraints.
• The current implementation of the proposed system has a small infidelity in the number of memory accesses it reports. This is due to the fact that on swapping a page from traditional main memory to the faster 3D memory, we need to invalidate all the cache blocks resident in the cache hierarchy of the system; otherwise the processor may consume wrong values, as the addresses of the data have been altered. Full support for this is currently missing in the implementation, but is in progress. Note that this does not prevent us from simulating the benchmarks, as GEMS merely does the performance modeling while the functionality is taken care of by Simics. We also note that the impact of this is not likely to be large, as the worst-case number of extra cache misses it causes is limited by a constant multiplier on the number of swaps done.

• We plan to collect data on the effect of “LAZY” (see section 4) in amortizing the cost of swapping on the system. This portion is already mostly complete in terms of implementation.

• We have mostly explored the latency advantages of the 3D DRAM in this work. We plan to extend this work to better model the bandwidth advantages of 3D DRAM as well.

• To make the case more cogent, we plan to do an in-depth thermal and power study of the proposed system.

• We also plan to extend the work to a CMP environment.

8 Conclusions

We clearly observe the efficacy of our scheme in terms of performance gain over both the dumb 3D DRAM organization and the full on-chip 3D DRAM implementation of the memory. Apart from the performance gain, our proposed system has the following advantages as well.

• As we use smaller sizes of 3D DRAM, and thus fewer vertical layers of active silicon, the thermal issue of 3D DRAM is largely mitigated, since peak temperature increases with the number of layers of active silicon.

• Similarly, another problem of 3D integration technology is lower yield, which also worsens with an increasing number of layers. Thus our proposal helps in this regard as well.

• With the entire memory implemented as on-chip 3D DRAM, it is never possible to upgrade the memory system. But our proposed system has a fair amount of traditional off-chip memory, which is easily extendable. Thus we can at least upgrade the off-chip memory of the system.
Acknowledgments 547, 2005.
We thank Jayaram Bobba of the Multifacet group for enlightening us about different aspects of the GEMS simulation framework. We also thank Andy Phelps of Google, Madison, for his initial comments about the work. And last but not least, we thank Professor David Wood of UW-Madison for the project idea and for teaching us “Computer Architecture”.