Download as pdf or txt
Download as pdf or txt
You are on page 1of 19

Efficient Tagged Memory

for the
CHERI Capability System
Jonathan Woodruff, Alexandre Joannou, Simon W. Moore,
Robert Kovacsics, Hongyan Xia, Robert N. M. Watson,
David Chisnall, Michael Roe, Brooks Davis, Peter G. Neumann,
Edward Napierala, John Baldwin, A. Theodore Markettos, Khilan Gudka,
Alfredo Mazzinghi, Alexander Richardson, Stacey Son and Alex Bradbury

University of Cambridge and SRI International

Approved for public release; distribution is unlimited. This research is sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Air
Force Research Laboratory (AFRL), under contract FA8750-10-C-0237. The views, opinions, and/or findings contained in this article/presentation are those of
the author(s)/presenter(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
Johnny Proposes Tagged Memory
1-bit Tag Per Word!
• Tag pointers?
• Enable unforgeable pointers!
• Protect both data and control flow!
• Tag allocated memory?
Only 1-bit per word!

2
Johnny Proposes Tagged Memory
1-bit Tag Per Word!
• Tag capabilities?
• Enable unforgeable pointers!
• Enable natural compartmentalization!
• Tag allocated memory?
Only 1-bit per word!

Non-standard Memory?
• Custom cache width is possible
• Registers could preserve the bit
• But custom DRAM is a non-starter
We can’t even afford ECC!
• Security must be free!
3
Johnny Proposes Tagged Memory
A Tag Table in DRAM!
• Put table in standard DRAM
• It will be really small (1-bit per word!)
• Emulate wider memory, fetch tag and data
on cache miss
• Keep them together on-chip

4
Johnny Proposes Tagged Memory
A Tag Table in DRAM!
• Put table in standard DRAM
• It will be really small (1-bit per word!)
• Emulate wider memory, fetch tag and data
on cache miss
• Keep them together on-chip

Double the Memory Accesses?


• Access both the table and the data on every
cache miss?
• No

5
Johnny Proposes Tagged Memory
A Cache for the Tag Table!
• Use a dedicated cache for the tags!
• It will hold tags for loads of data
(1-bit per word! Covers megabytes of data!)
• Only do DRAM table lookup on a miss

6
Johnny Proposes Tagged Memory
A Cache for the Tag Table!
• Use a dedicated cache for the tags!
• It will hold tags for loads of data.
(1-bit per word! Covers megabytes of data!)
• Only do DRAM table lookup on a miss.

Last-level Caches Aren’t that Effective


• This is logically a last-level cache
• LLC has low hit-rates: 40-60% for SPEC
We only see accesses that have missed in primary caches…
• +50% memory accesses isn’t going to fly

7
The Tagged Memory Challenge

1. Add 1 bit per word of memory

2. Make it “free”

8
Re-cap Simple Tag Hierarchy
CPU
• Store tags with data in cache
hierarchy I$ D$

• Tag controller does tag table L2$


lookup on DRAM access
Tag Table
Manager
• Cache lines of tags from DRAM
Tag $

DRAM
9
An Experiment in Gem5
• Trace all DRAM accesses
• Replay against a tag controller + cache model
• Measure tag-cache hit-rate
• Using ARMv8 Gem5
• Google v8 engine running Earley-Boyer Octane (x3)
• FFMPEG
• 4-core, 8MiB L3 with prefetching

10
Tag Table Cache Properties
DRAM traffic overhead vs. tag cache size, 64-byte lines
50%
Tag DRAM traffic overhead

Earley-Boyer (big) FFMPEG (big)


40%

30%
Only 12% overhead with same
reach as last level cache!
20%
<5% traffic overhead at
256KiB with 16MiB capacity
10%

0%
64KiB (for 4MiB of data) 128KiB (for 8MiB of data) 256KiB (for 16MiB of data)

Tag cache size


Why is tag cache more effective than a traditional last-level cache?
11
Tag Table Cache Locality Analysis
Temporal and Spatial Hits vs. Line Size
for Earley-Boyer, 256KiB tag cache, 8-way set associative
100%
misses spatial hits temporal hits

80%
Tag cache accesses

60%

40%
64-byte line of tags
= covers
64-byte line
20% 4KiB
4KiB Pageof data
of data
1-byteline
1-byte lineofcovers
tags =
64 data
64B Line bytes
of data
0%
1 2 4 8 16 32 64 128 256 512 1024
Tag cache line size (bytes)
12
boot up, we must clear only the root level of the table to clear
the tag bits on all of memory.
Tag Compression
1 bit per page of data: 0tag-cache
for no tagsline
set
root table
Tags for a 1 bit ...
page of data
tag-cache line leaf table
512 bits
64 bytes ...

2-level
• Fig. tag table table structure for grouping factor of 512
4. Hierarchical

• Each bit in the root level indicates all zeros in a


Crucially, this scheme eliminates table-cache pressure for
leaf group
applications that do not use tagged pointers. In addition, this
• Reduces tag cache footprint
1 WHISK demonstrates that the root level of a two-level tag table can be
• Amplifies
managed in softwarecache capacity
at the cost of flushing tag caches on root updates [31].

13
Use-case 1: Pointer Integrity
• All virtual addresses are tagged
All words that match successful TLB translations

• Similar to our CHERI FPGA implementation


100
90 Cache Lines Containing Pointers
Percent of Cache Lines

80
70
60
50
40
30
20
10
0
bzip2 vlc chromium Octane Earley- Octane Splay
Boyer
C C++ C++ & Javascript
14 Pointer-heavy Javascript
Use-case 2: Zero Elimination

• Tag cache lines that contain zeros


• Eliminate zero cache lines from DRAM traffic
• Can we eliminate more data traffic than the tag
table generates?
• 1.5-2.5% of lines in DRAM traffic are all zero
(in our workloads)

• If we use less than 1% for table traffic, we


improve performance!
15
Overhead with Compression
4.0%

3.5%
DRAM Traffic Overhead

FFMPEG Earley-Boyer
3.0%

2.5%

2.0%

1.5%

1.0%

0.5%

0.0%
No Compression Compressed
Pointers Compressed
Zeroes
No Compression
Pointers Zeros
16
CHERI FPGA Implementation
• 64-bit MIPS implementation with tagged pointers
• 256KiB, 4-way set associative L2 cache
• Parameterisable hierarchical tag controller
backed by 32KiB 4-way associative tag cache

17
Benchmarks in Hardware
DRAM Traffic Overhead in FPGA Implementation
Note: MiBench overheads with compression are approximately zero
Uncompressed
8% Compressed
DRAM traffic overhead

6%

4%

2%

0%
o rt peg tra ha C32 FT yer lay dfjs mu
s j s s
q ijk R F
-Bo sp p
gbe
d C y
MiBench rle Octane
Ea
18
Things We’ve Learned
• A tag table caches extremely well
Spatial locality pays off for very wide lines
• Simple compression works well for sparse tags
Only pay for the cost of tags when used
• Single-bit tags in standard memory can require nearly zero
overhead in the common case
Pointer tags + zero line elimination could actually net reduce
memory accesses for most cases!
• Tagged memory should not be a barrier to adoption of CHERI
capabilities
Questions?
Jonathan.Woodruff@cl.cam.ac.uk
19

You might also like