
1. Multi-Ported Random Access Memories (MPRAMs)

Multi-ported RAMs are a cornerstone of all high-performance Central Processing Unit
(CPU) designs. They are often used as shared-memory structures such as register files,
Translation Lookaside Buffers (TLBs), caches, and coherence tags, and serve as
high-bandwidth memories with multiple parallel read and write ports. For example, the
second-generation Itanium processor architecture employs a 20-port register file
constructed from SRAM bit cells, with 12 read ports and 8 write ports.
In particular, multi-ported RAMs are often used by
Wide superscalar processors,
Very Large Instruction Word (VLIW) processors,
Multi-core processors,
Vector processors,
Coarse-Grained Reconfigurable Arrays (CGRAs), and
Digital Signal Processors (DSPs).
The key requirement for all of these designs is fast, concurrent, single-cycle access from
multiple requestors. Because memories are usually the bottleneck of computation
performance, these designs demand highly parallel memory structures that can keep pace
with their concurrent nature.
The two leading multi-port RAM techniques in FPGAs incur relatively large overhead in
either
Register usage or
Total SRAM block count.

1.1. Content Addressable Memories (CAMs)


CAMs, being a hardware implementation of associative arrays, are massively parallel
search engines accessing all memory content to compare with the searched pattern
simultaneously.
To find a specific pattern in the CAM, all CAM patterns are read simultaneously and
compared to the searched pattern. These comparisons generate match indicators (or match
lines), which indicate for each pattern in the CAM whether it matches the searched
pattern. A priority encoder then detects whether there is a match and generates the
address of the first matching pattern.
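The lookup flow above can be sketched behaviorally. This is a hypothetical software model, not a hardware netlist: the parallel comparisons are represented as a list of boolean match lines, and the priority encoder simply returns the lowest matching address.

```python
# Behavioral sketch of a CAM lookup (illustrative model only).
# In hardware, all stored patterns are compared against the key
# simultaneously; here the match lines are modeled as a boolean list.

def cam_search(patterns, key):
    """Return the address of the first stored pattern equal to `key`,
    or None if no match line is asserted."""
    match_lines = [p == key for p in patterns]  # all comparisons "in parallel"
    # Priority encoder: the lowest matching address wins.
    for addr, hit in enumerate(match_lines):
        if hit:
            return addr
    return None

cam = [0x1A, 0x3C, 0x3C, 0x07]
print(cam_search(cam, 0x3C))  # first match at address 1
print(cam_search(cam, 0xFF))  # no match -> None
```

Note that when multiple entries hold the same pattern, only the first matching address is reported, exactly as the priority encoder does in hardware.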
CAMs are considered heavy power consumers due to their very wide memory bandwidth
requirement and the concurrent compare. While a standard RAM returns the data located at
a given address, a CAM returns an address containing a specific given datum, which
requires a memory-wide search for that value; multiple addresses may match the searched
data. Since a CAM is a high-performance implementation of a very basic associative
search, it can be used in many fields:
Network processors: IP lookup engines for packet forwarding, intrusion detection,
packet filtering and classification.
High-Performance Processors: Memory management as coherence tag arrays for
highly-associative caches and Translation Lookaside Buffers (TLBs). Load and store queues
in out-of-order instruction schedulers with a wide scheduling window.
Pattern matching, data compression, DSP, databases, bioinformatics, logic
minimization. CAMs as single-cycle associative search accelerators with millions of search
entries.
The main goal is to address scaling issues in both designs, permitting deeper MPRAMs with
more ports, as well as deeper and wider CAMs that scale better. These new designs must
also be practical, meaning they achieve high clock rates and provide functionality such
as initialization, bypassing, and fast updates.

1.2. Limitations
1. MPRAM
An SRAM cell is a Complementary Metal-Oxide-Semiconductor (CMOS) bi-stable circuit built
out of cross-coupled CMOS inverters and access pass-gates. To provide more access ports,
the basic SRAM bit cell can be altered to provide more bit lines, word lines, and access
transistors; however, the area grows quadratically with the number of ports. Furthermore,
this requires a custom design for each unique set of parameters (e.g., number of ports,
width and depth of the RAM). In FPGAs, one way of synthesizing a multi-ported RAM is to
build it from registers and logic (which works for small memories); the other is to build
it from RAM blocks.
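The register-and-logic approach can be illustrated with a simple behavioral model. This is a sketch under my own naming (the class and method names are hypothetical): the storage is a plain array standing in for flip-flops, every read port is served independently in the same cycle, and writes commit afterwards (a read-before-write policy, one of several possible port-conflict choices).

```python
# Behavioral model of a multi-ported RAM built from registers (a sketch;
# in an FPGA the storage would be flip-flops with per-port read muxes).

class MultiPortedRAM:
    def __init__(self, depth, n_read, n_write):
        self.mem = [0] * depth   # register file contents
        self.n_read = n_read     # number of concurrent read ports
        self.n_write = n_write   # number of concurrent write ports

    def cycle(self, writes, reads):
        """Simulate one clock cycle.
        writes: list of (addr, data) pairs, one per active write port.
        reads:  list of addresses, one per active read port.
        Returns the data seen on each read port (read-before-write)."""
        assert len(writes) <= self.n_write and len(reads) <= self.n_read
        out = [self.mem[a] for a in reads]  # all read ports in parallel
        for addr, data in writes:           # then commit the writes
            self.mem[addr] = data
        return out

ram = MultiPortedRAM(depth=16, n_read=2, n_write=2)
ram.cycle(writes=[(3, 42), (7, 99)], reads=[])
print(ram.cycle(writes=[], reads=[3, 7]))  # [42, 99]
```

This register-based style scales poorly: the per-port read multiplexers and write decoders grow with both depth and port count, which is why it is only practical for small memories.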
2. CAM
CAMs are usually custom-designed at the transistor level. The four additional transistors
over the standard six-transistor SRAM cell form a comparison circuit, an XOR built from
an NMOS stack only, since its output (the match line) is pulled up. The area growth of
traditional CAM techniques in FPGAs is high, and capacity is currently limited to 64K
entries.
Wide and shallow RAMs are needed to efficiently implement brute-force CAMs. Shallow
RAMs are required because each extra bit in the CAM pattern width doubles the required
RAM depth, resulting in poor efficiency. In addition, deeper CAMs can be built by
increasing RAM width. However, FPGA RAM block width is growing slowly. For example, M4K
blocks in Stratix II devices have a minimal depth of 128 with a maximal width of 36; M9K
blocks in Stratix III and Stratix IV devices have a minimal depth of 256 and a maximal
width of 36; and M20K blocks in Stratix V devices have a minimal depth of 512 and a
maximal width of 40. With increasing RAM depth and limited width growth, the brute-force
approach is becoming less efficient.
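The brute-force mapping described above can be sketched as follows. This is a behavioral model under my reading of the technique (function names are my own): the RAM is addressed by the search pattern itself, so its depth is 2 to the power of the pattern width, and each bit of the RAM word is the match line for one CAM entry. This makes explicit why every extra pattern bit doubles the required RAM depth, while adding entries only widens the RAM word.

```python
# Sketch of a brute-force RAM-based CAM (behavioral model, not FPGA code).
# depth = 2 ** pattern_width; each RAM-word bit is one entry's match line.

def build_brute_force_cam(entries, pattern_width):
    depth = 1 << pattern_width        # one word per possible search pattern
    ram = [0] * depth
    for entry_idx, pattern in enumerate(entries):
        ram[pattern] |= 1 << entry_idx  # assert this entry's match bit
    return ram

def cam_lookup(ram, key):
    """Return the match-line vector for `key` (bit i set => entry i matches)."""
    return ram[key]

entries = [0b0101, 0b1100, 0b0101]    # three stored 4-bit patterns
ram = build_brute_force_cam(entries, pattern_width=4)
print(bin(cam_lookup(ram, 0b0101)))  # 0b101: entries 0 and 2 match
```

The single RAM read returns all match lines at once; a priority encoder on that vector would then recover the first matching address, as in the transistor-level CAM.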
The Altera Stratix V 5SGXMABN1F45C2 device is a high-end, performance-oriented,
speed grade 2 device with 360K ALMs, 2640 M20K blocks, and 1064 I/O pins. Half of the
ALMs can be used to construct Memory Logic Array Blocks (MLABs), where a single MLAB
consists of 10 ALMs.
Altera’s M20K blocks can be configured into several RAM depth and data width
configurations, utilizing a total of either 16K or 20K SRAM bits. Assuming that the RAM
packing process minimizes the number of blocks cascaded in depth to avoid additional
address decoding, each run of 16K lines will be packed into single-bit-wide blocks, and
the remainder will be packed into the minimal required configuration. An estimate of the
number of packed M20K blocks required to construct a RAM with a specific depth d and
data width w is denoted nM20K(d, w).
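The packing rules above can be turned into a small estimation sketch. This is my reading of the described process, not a vendor formula: the configuration table below (512x40 down to 16Kx1) is an assumption consistent with the stated minimal depth of 512 and maximal width of 40, and the exact mapping performed by the tools may differ.

```python
# Sketch of the nM20K(d, w) estimate (assumed config table, see lead-in).
# Full runs of 16K lines take one 16Kx1 block per data bit; the remaining
# depth is packed into the shallowest configuration that still covers it.

# Assumed M20K (depth, width) configurations, shallowest first.
M20K_CONFIGS = [(512, 40), (1024, 20), (2048, 10),
                (4096, 5), (8192, 2), (16384, 1)]

def n_m20k(d, w):
    """Estimate the number of M20K blocks for a RAM of depth d, width w."""
    blocks = w * (d // 16384)          # 16K runs: one 16Kx1 block per bit
    rem = d % 16384
    if rem:
        # Minimal configuration deep enough for the remaining lines.
        depth, width = next(c for c in M20K_CONFIGS if c[0] >= rem)
        blocks += -(-w // width)       # ceil(w / width) blocks side by side
    return blocks

print(n_m20k(512, 40))    # 1 block
print(n_m20k(32768, 8))   # 16 blocks
```

For example, a 32K-deep, 8-bit-wide RAM packs as two 16K runs of 8 one-bit blocks each, giving 16 blocks with no remainder.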
1. Cross Connect
