Jonathan Woodruff - Memory Tech and Arch

Efficient Tagged Memory
for the
CHERI Capability System
Jonathan Woodruff, Alexandre Joannou, Simon W. Moore,
Robert Kovacsics, Hongyan Xia, Robert N. M. Watson,
David Chisnall, Michael Roe, Brooks Davis, Peter G. Neumann,
Edward Napierala, John Baldwin, A. Theodore Markettos, Khilan Gudka,
Alfredo Mazzinghi, Alexander Richardson, Stacey Son and Alex Bradbury
University of Cambridge and SRI International
Approved for public release; distribution is unlimited. This research is sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Air
Force Research Laboratory (AFRL), under contract FA8750-10-C-0237. The views, opinions, and/or findings contained in this article/presentation are those of
the author(s)/presenter(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
Johnny Proposes Tagged Memory
1-bit Tag Per Word!
• Tag pointers?
• Enable unforgeable pointers!
• Protect both data and control flow!
• Tag allocated memory?
Only 1-bit per word!
2
1-bit Tag Per Word!
• Tag capabilities?
• Enable unforgeable pointers!
• Enable natural compartmentalization!
• Tag allocated memory?
Only 1-bit per word!
Non-standard Memory?
• Custom cache width is possible
• Registers could preserve the bit
• But custom DRAM is a non-starter
We can’t even afford ECC!
• Security must be free!
3
A Tag Table in DRAM!
• Put table in standard DRAM
• It will be really small (1-bit per word!)
• Emulate wider memory, fetch tag and data
on cache miss
• Keep them together on-chip
4
A Tag Table in DRAM!
• Put table in standard DRAM
• It will be really small (1-bit per word!)
• Emulate wider memory, fetch tag and data
on cache miss
• Keep them together on-chip
Double the Memory Accesses?

• Access both the table and the data on every
cache miss?
• No
5
A Cache for the Tag Table!
• Use a dedicated cache for the tags!
• It will hold tags for loads of data
(1-bit per word! Covers megabytes of data!)
• Only do DRAM table lookup on a miss
6
A Cache for the Tag Table!
• Use a dedicated cache for the tags!
• It will hold tags for loads of data.
(1-bit per word! Covers megabytes of data!)
• Only do DRAM table lookup on a miss.
Last-level Caches Aren’t that Effective

• This is logically a last-level cache
• LLC has low hit-rates: 40-60% for SPEC
We only see accesses that have missed in primary caches…
• +50% memory accesses isn’t going to fly
7
The Tagged Memory Challenge
1. Add 1 bit per word of memory
2. Make it “free”
8
Re-cap Simple Tag Hierarchy
CPU
• Store tags with data in cache
hierarchy I$ D$
• Tag controller does tag table L2$

lookup on DRAM access
Tag Table
Manager
• Cache lines of tags from DRAM
Tag $
DRAM
9
An Experiment in Gem5
• Trace all DRAM accesses
• Replay against a tag controller + cache model
• Measure tag-cache hit-rate
• Using ARMv8 Gem5
• Google v8 engine running Earley-Boyer Octane (x3)
• FFMPEG
• 4-core, 8MiB L3 with prefetching
10
Tag Table Cache Properties
DRAM traffic overhead vs. tag cache size, 64-byte lines
50%
Tag DRAM traffic overhead
Earley-Boyer (big) FFMPEG (big)

40%
30%
Only 12% overhead with same
reach as last level cache!
20%
<5% traffic overhead at
256KiB with 16MiB capacity
10%
0%
64KiB (for 4MiB of data) 128KiB (for 8MiB of data) 256KiB (for 16MiB of data)
Tag cache size

Why is tag cache more effective than a traditional last-level cache?
11
Tag Table Cache Locality Analysis
Temporal and Spatial Hits vs. Line Size
for Earley-Boyer, 256KiB tag cache, 8-way set associative
100%
misses spatial hits temporal hits
80%
Tag cache accesses
60%
40%
64-byte line of tags
= covers
64-byte line
20% 4KiB
4KiB Pageof data
of data
1-byteline
1-byte lineofcovers
tags =
64 data
64B Line bytes
of data
0%
1 2 4 8 16 32 64 128 256 512 1024
Tag cache line size (bytes)
12
boot up, we must clear only the root level of the table to clear
the tag bits on all of memory.
Tag Compression
1 bit per page of data: 0tag-cache
for no tagsline
set
root table
Tags for a 1 bit ...
page of data
tag-cache line leaf table
512 bits
64 bytes ...
2-level
• Fig. tag table table structure for grouping factor of 512
4. Hierarchical
• Each bit in the root level indicates all zeros in a

Crucially, this scheme eliminates table-cache pressure for
leaf group
applications that do not use tagged pointers. In addition, this
• Reduces tag cache footprint
1 WHISK demonstrates that the root level of a two-level tag table can be
• Amplifies
managed in softwarecache capacity
at the cost of flushing tag caches on root updates [31].
13
Use-case 1: Pointer Integrity
• All virtual addresses are tagged
All words that match successful TLB translations
• Similar to our CHERI FPGA implementation

100
90 Cache Lines Containing Pointers
Percent of Cache Lines
80
70
60
50
40
30
20
10
0
bzip2 vlc chromium Octane Earley- Octane Splay
Boyer
C C++ C++ & Javascript
14 Pointer-heavy Javascript
Use-case 2: Zero Elimination
• Tag cache lines that contain zeros

• Eliminate zero cache lines from DRAM traffic
• Can we eliminate more data traffic than the tag
table generates?
• 1.5-2.5% of lines in DRAM traffic are all zero
(in our workloads)
• If we use less than 1% for table traffic, we

improve performance!
15
Overhead with Compression
4.0%
3.5%
DRAM Traffic Overhead
FFMPEG Earley-Boyer
3.0%
2.5%
2.0%
1.5%
1.0%
0.5%
0.0%
No Compression Compressed
Pointers Compressed
Zeroes
No Compression
Pointers Zeros
16
CHERI FPGA Implementation
• 64-bit MIPS implementation with tagged pointers
• 256KiB, 4-way set associative L2 cache
• Parameterisable hierarchical tag controller
backed by 32KiB 4-way associative tag cache
17
Benchmarks in Hardware
DRAM Traffic Overhead in FPGA Implementation
Note: MiBench overheads with compression are approximately zero
Uncompressed
8% Compressed
DRAM traffic overhead
6%
4%
2%
0%
o rt peg tra ha C32 FT yer lay dfjs mu
s j s s
q ijk R F
-Bo sp p
gbe
d C y
MiBench rle Octane
Ea
18
Things We’ve Learned
• A tag table caches extremely well
Spatial locality pays off for very wide lines
• Simple compression works well for sparse tags
Only pay for the cost of tags when used
• Single-bit tags in standard memory can require nearly zero
overhead in the common case
Pointer tags + zero line elimination could actually net reduce
memory accesses for most cases!
• Tagged memory should not be a barrier to adoption of CHERI
capabilities
Questions?
Jonathan.Woodruff@cl.cam.ac.uk
19

Jonathan Woodruff - Memory Tech and Arch

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Jonathan Woodruff - Memory Tech and Arch

Uploaded by

Copyright:

Available Formats

Efficient Tagged Memory

University of Cambridge and SRI International

Double the Memory Accesses?

Last-level Caches Aren’t that Effective

1. Add 1 bit per word of memory

• Tag controller does tag table L2$

Earley-Boyer (big) FFMPEG (big)

Tag cache size

• Each bit in the root level indicates all zeros in a

• Similar to our CHERI FPGA implementation

• Tag cache lines that contain zeros

• If we use less than 1% for table traffic, we

You might also like