A One Cycle FIFO Buffer For Memory Management Units in Manycore Systems
Abstract—We present an efficient synchronous first-in first-out (FIFO) buffer for enhanced memory management units and inter-core data communication in manycore systems. Our design significantly reduces hardware overhead and eliminates latency delays by using both the rising and falling clock edges during read and write, which makes our design suitable for increased processing element (PE) utilization by increasing the memory bandwidth in complex network and system-on-chip solutions. Compared to prior work, our design can operate 5X faster at the same supply voltage, or up to 44X faster with a 2.5X increase in supply voltage. Our design's total power consumption is 7.8 mW with a total transistor count of 34,470.

I. INTRODUCTION

Traditional microprocessors have increasing numbers of processing elements (PEs) on a single die in order to afford continued performance improvements. To leverage these performance improvements, applications must be written such that the work can be efficiently divided across the PEs, often resulting in different cores operating on the same data or coordinating data passing between PEs, presenting a significant challenge in keeping up with memory bandwidth requirements in order to eliminate/reduce PE idleness while waiting for data transfers. The memory management unit that orchestrates this data movement can introduce significant performance degradation, which worsens as the number of PEs and data sharing/passing increases. One way to alleviate this bottleneck is to migrate a portion of the most frequently used memory management software units to hardware components [1][2].

A frequently used memory management unit is the buffer management software defining the first-in first-out (FIFO) buffer rules between PEs. Fig. 1 depicts a generic circuit architecture for a conventional FIFO buffer design comprised of three stages. The address-pointer stage generates the address pointers, the address-decoder stage generates the memory address word line, and the isolate stage isolates these two operations. We note that this architecture can have different circuit components depending on the power consumption and speed optimization requirements [6][11][7][8][13].

One of the drawbacks of this generic architecture is the N registers required to isolate the memory address line updates and increment operations to ensure that the correct memory address lines are active at the correct times. These registers impose a large power consumption and area overhead [4]. To reduce switching activity, prior works (e.g., [11][8]) introduced a gray counter in the address-pointer stage while other works [7][13] introduced a shift register; however, even though these methods are efficient, they still require a large number of registers.

Another major drawback is the delay control circuit needed to compensate for the address-decoder stage's setup/hold time by delaying the clock signal. To address these drawbacks, some works used a counter-based pointer [11], or replicated the write and read detector pointers [6]. However, both approaches had complex layout designs to align with and meet the setup/hold timing criteria between the address-decoder stage and input/output for the data memory, requiring complex control units and self-timing, which imposed additional design
Fig. 2. (a) FIFO block diagram; (b) Write operation; (c) Read operation
Fig. 3. Detailed hardware diagram for our proposed FIFO design
read operation ends even if there are outstanding read requests.

Fig. 3 depicts the read and write operations' hardware with respect to CLK. The FIFO's core storage unit is constructed using an 8T-Cell SRAM circuit. The write and read decoders are divided into two separate components: the address-pointer and address-decoder. The circuit structure of the address-pointer is the up counter and is based on the state look-ahead parallel counter logic presented in [9].

The two-phase clock signals, CLKN and CLKP, are derived from CLK and represent CLK's negative and positive versions, respectively. CLKP and CLKN are at opposite phases of each other and their transition edges are non-overlapping, which ensures that there are no race conditions between address-pointer and address-decoder, such that while address-pointer is disabled/enabled, address-decoder is
enabled/disabled, respectively. The clock signal source and
traditional back-to-back NAND two-phase clocking systems
[9] are usually susceptible to fabrication process, voltage, and
temperature (PVT) variation, as well as noise jitter variation.
All these variations can have a negative impact on the duty
cycle of the two phases and non-overlapping margin, thus
negatively impacting overall circuit timing performance.
Usually, a delay-locked-loop (DLL) circuit [14] is used to control the PVT variations and preserve clock skew; however, due to its complexity, power consumption, and large area overhead in manycore architectures, we use a simpler circuit [10] with similar results. Simulations in [10] over PVT corners with non-ideal clock signals showed that the circuit provides non-overlapping phases with low root mean square (RMS) jitter.
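The non-overlap guarantee described above can be illustrated with a small behavioral sketch. This is our own Python model, not the circuit of [10]: the function name, the unit-step timing, and the cross-coupled hand-off (each phase may only assert once the opposite phase is low) are assumptions for illustration only.

```python
def two_phase(clk_trace):
    """Derive non-overlapping CLKN/CLKP samples from a list of CLK samples.

    clk_trace: list of 0/1 CLK samples, one per time step.
    Returns (clkn, clkp) sample lists of the same length.
    """
    clkn, clkp = [0], [0]
    for clk in clk_trace[1:]:
        prev_n, prev_p = clkn[-1], clkp[-1]
        # A phase may only assert while the opposite phase is low,
        # mimicking the hand-off of a cross-coupled two-phase generator:
        # CLKN tracks the negative version of CLK, CLKP the positive one.
        clkn.append(1 if clk == 0 and prev_p == 0 else 0)
        clkp.append(1 if clk == 1 and prev_n == 0 else 0)
    return clkn, clkp
```

Because CLKN can only be high while CLK is low and CLKP only while CLK is high, and each phase additionally waits for the other to deassert, the two phases never overlap on any sample, which is the property the text relies on to avoid race conditions.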
There are two parallel counters, one for each of the read and write address-pointers. Each counter is triggered by CLKN gated with the corresponding read/write operation enable signal, REN/WEN, such that a counter is incremented whenever its read/write operation occurs. Alternatively, the read/write address-decoders are enabled when CLKP is asserted high, gated with the REN/WEN signals, respectively.
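The gating just described can be summarized behaviorally: pointers advance on the gated CLKN phase, and the decoders sample the (then stable) pointers on the gated CLKP phase. The sketch below is a hypothetical Python model of ours, not the authors' netlist; the class and method names are illustrative.

```python
class PointerStage:
    """Behavioral model of the gated pointer counters and decoders."""

    def __init__(self, depth=64):
        self.depth = depth
        self.wptr = 0          # write address-pointer
        self.rptr = 0          # read address-pointer
        self.word_line = None  # word line last driven by a decoder

    def clkn_edge(self, wen, ren):
        # Counters are clocked by CLKN gated with WEN/REN, so a pointer
        # increments only when its operation is enabled.
        if wen:
            self.wptr = (self.wptr + 1) % self.depth
        if ren:
            self.rptr = (self.rptr + 1) % self.depth

    def clkp_edge(self, wen, ren):
        # The address-decoders are enabled on CLKP; because CLKP and CLKN
        # never overlap, each pointer is stable while it is decoded.
        if wen:
            self.word_line = ('write', self.wptr)
        elif ren:
            self.word_line = ('read', self.rptr)
```

Driving the two methods on alternating phases reproduces the paper's ordering: a decode on CLKP never observes a pointer mid-increment.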
Fig. 4. Empty/Full flag circuitry.
Fig. 4 depicts our empty-full flag circuitry solution. The empty-full flag detects the roll-over of the read and write address-pointers using two serially-connected DFFs for each operation (i.e., one pair of DFFs for the read operation and another pair for the write operation). Both pairs of serially-connected DFFs are initialized to '10.' The least significant bits (LSBs) are XNORed for evaluation of the flags' statuses in combination with the read and write address-pointers. If the address-pointers' values are equal and the serially-connected DFFs' LSBs are equal, the empty flag is asserted high; alternatively, if the address-pointers' values are equal and the serially-connected DFFs' LSBs are different, full is asserted high. When the associated read/write address-pointer reaches the full state, the associated pair of serially-connected DFFs changes state to '01' on the rising CLKP edge, as shown on the DFFs' clock signal in Fig. 4.

III. TIMING CONSTRAINTS

In order to evaluate the scalability of our approach, the timing of all critical paths, including the setup/hold times gated with the two-phase clocks CLKN and CLKP, must be independent of the technology. To the best of our knowledge, our design is the first implementation of a FIFO that uses a non-overlapping two-phase clocking system. Our design uses only a basic CMOS logic gate structure with basic width/length sizes for design layout clarity and cost-effectiveness for continued technology scaling. We evaluate our proposed FIFO circuit based on an 8T-Cell SRAM structure of size 64-entry x 64-bit as a common size for many applications.

Fig. 5 enumerates all of the possible path delays generated in the data path by the non-overlapping two-phase clocking system. The circuit generation of the clocking system approximates the non-overlapping time as three gate delays (GDs) for a worst-case timing scenario:

Non-Overlap = 3 GDs (1)

such that, on the falling CLK edge, CLKN is rising after three gate delays and CLKP is falling after one gate delay. Alternatively, on the rising CLK edge, CLKN is falling after three gate delays and CLKP is rising after one gate delay.

The address-pointer stage is triggered by the rising CLKN edge (Fig. 5, Path2). The time to increment the counter that generates the address-pointer is only the delay of a single DFF. A single DFF delay is composed of two gate delays as a precaution for a worst-case timing scenario:

CT = 2 GDs (2)

A portion of the address-decoder logic, except the last stage, is integrated into the address-pointer stage, giving an additional three gate delays. Thus, the total gate delay count for generating the memory word line address prior to the last stage of the address-decoder is:

Path2-End1 = 2 GDs (CT) + 3 GDs = 5 GDs (3)

This delay is shown as Path2-End1 in Fig. 5.

The empty and full flags are evaluated at the rising edge of the CLKN signal that follows the incrementing of the address-pointer. The flag evaluation is implemented using XNOR and AND gate logic that requires two gate delays:

Path2-End2 = 2 GDs (CT) + 2 GDs (XNORs & AND) = 4 GDs (4)

This delay is shown as Path2-End2 in Fig. 5.

The roll-over status is pre-evaluated in the empty and full flag circuitry on the rising CLKP edge after incrementing the address-pointer. This operation enables a large fan-in AND gate, which requires two gate delays:

Path2-End3 = 2 GDs (CT) + 2 GDs (large fan-in AND) = 4 GDs (5)

This delay is shown as Path2-End3 in Fig. 5.

The evaluation of the roll-over status is stored in the 2-bit serial shifter in the flag circuitry at the rising CLKP edge. This evaluation ensures that Path2-End3 ends before the 2-bit serial shifter changes its state. Therefore, one of the possible worst-case timing scenarios for CLK includes the worst-case timing of CLKN, with the addition of three gate delays for the non-overlapping operation:

CLK = 5 GDs (CLKN high) + 5 GDs (CLKN low) + 3 GDs (Non-Overlap) = 13 GDs (6)

The SRAM read-write word lines accessed in the last stage of the address-decoder are gated with CLKP and the REN or
TABLE 2
COMPARISON BETWEEN PRIOR WORKS AND OUR PROPOSED FIFO DESIGN

[6]  Address-Pointer: event-driven self-timed pointers using replica write and read pulses. Control Unit: self-timed management units and power-switched control circuit. Storage Unit: 10T-Cell SRAM, 256-word x 16-bit. Flag Unit: up/down indicators. Freq/Voltage: 22.7 MHz / 0.4 V.

[11] Address-Pointer: adder and binary registers. Control Unit: counter control unit. Storage Unit: dual-port RAM, 128-word x 32-bit. Flag Unit: Johnson counter. Freq/Voltage: 200 MHz / 1 V.

[7]  Address-Pointer: cyclic address with one-hot bubble-encoding and shift register. Control Unit: latches connected in parallel forming lanes activated every other cycle. Storage Unit: latch-based, requiring N*(N-1) latches for N lanes. Flag Unit: no flags are used; all pipeline lanes fill, then the output is released. Freq/Voltage: 27 MHz / 0.28 V.

Our work: Address-Pointer: carry look-ahead state counter. Control Unit: non-overlapping clock using 2 opposing phase-clocks. Storage Unit: 8T-Cell SRAM, 64-word x 64-bit. Flag Unit: roll-over detection and comparison circuit. Freq/Voltage: 1 GHz / 1 V.
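The roll-over detection and comparison scheme in the last row (described with Fig. 4) is behaviorally equivalent to carrying one extra "lap" bit per pointer that toggles on each roll-over, the role played by each pair of serially-connected DFFs. The sketch below is our illustrative Python model of that flag logic, not the authors' circuit; names are ours.

```python
class FlagFifo:
    """Behavioral model of the empty/full flag scheme of Fig. 4."""

    def __init__(self, depth=64):
        self.depth = depth
        self.wptr = self.rptr = 0
        # One lap bit per pointer, toggled on each roll-over; this
        # stands in for the XNORed LSBs of the serially-connected DFFs.
        self.wlap = self.rlap = 0

    def push(self):
        assert not self.full
        self.wptr = (self.wptr + 1) % self.depth
        if self.wptr == 0:      # write pointer rolled over
            self.wlap ^= 1

    def pop(self):
        assert not self.empty
        self.rptr = (self.rptr + 1) % self.depth
        if self.rptr == 0:      # read pointer rolled over
            self.rlap ^= 1

    @property
    def empty(self):
        # Equal pointers with equal lap bits: reader has caught up.
        return self.wptr == self.rptr and self.wlap == self.rlap

    @property
    def full(self):
        # Equal pointers with different lap bits: writer is a lap ahead.
        return self.wptr == self.rptr and self.wlap != self.rlap
```

The lap bit disambiguates the two cases in which the pointers compare equal, exactly as the DFF pairs do when they flip from '10' to '01' on a roll-over.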