A One Cycle FIFO Buffer For Memory Management Units in Manycore Systems


IEEE Computer Society Annual Symposium on VLSI (ISVLSI)

A One-Cycle FIFO Buffer for Memory Management Units in Manycore Systems

Ann Gordon-Ross1*, Saleh Abdel-Hafeez2, and Mohamad Hammam Alsafrjalni3


1 Department of Electrical and Computer Engineering, University of Florida (UF), Gainesville, FL, USA
2 Jordan University of Science and Technology, Irbid 22110, Jordan; sabbatical at Qassim University, Saudi Arabia
3 Department of Electrical and Computer Engineering, University of Miami (UM), Coral Gables, FL, USA
* Also with the NSF Center for High Performance Reconfigurable Computing (CHREC) at UF
E-mail: 1 ann@ece.ufl.edu, 2 sabdel@just.edu.jo, 3 alsafrjalani@miami.edu

Abstract—We present an efficient synchronous first-in first-out (FIFO) buffer for enhanced memory management units and inter-core data communication in manycore systems. Our design significantly reduces hardware overhead and eliminates latency delays by using both the rising and falling clock edges during read and write, which makes our design suitable for increased processing element (PE) utilization by increasing the memory bandwidth in complex network and system-on-chip solutions. Compared to prior work, our design can operate 5X faster at the same supply voltage, or up to 44X faster with a 2.5X increase in supply voltage. Our design’s total power consumption is 7.8 mW with a total transistor count of 34,470.

I. INTRODUCTION
Traditional microprocessors have increasing numbers of processing elements (PEs) on a single die in order to afford continued performance improvements. To leverage these performance improvements, applications must be written such that the work can be efficiently divided across the PEs, often resulting in different cores operating on the same data or coordinating data passing between PEs. This presents a significant challenge in keeping up with memory bandwidth requirements in order to eliminate/reduce PE idleness while waiting for data transfers. The memory management unit that orchestrates this data movement can introduce significant performance degradation, which worsens as the number of PEs and data sharing/passing increases. One way to alleviate this bottleneck is to migrate a portion of the most frequently used memory management software units to hardware components [1][2].
A frequently used memory management unit is the buffer management software defining the first-in first-out (FIFO) buffer rules between PEs. Fig. 1 depicts a generic circuit architecture for a conventional FIFO buffer design comprised of three stages. The address-pointer stage generates the address pointers, the address-decoder stage generates the memory address word line, and the isolate stage isolates these two operations. We note that this architecture can have different circuit components depending on the power consumption and speed optimization requirements [6][11][7][8][13].
One of the drawbacks of this generic architecture is the N registers required to isolate the memory address line updates and increment operations to ensure that the correct memory address lines are active at the correct times. These registers impose a large power consumption and area overhead [4]. To reduce switching activity, prior works (e.g., [11][8]) introduced a gray counter in the address-pointer stage while other works [7][13] introduced a shift register. Although these methods are efficient, they still require a large number of registers.
Another major drawback is the delay control circuit needed to compensate for the address-decoder stage’s setup/hold time by delaying the clock signal. To address these drawbacks, some works used a counter-based pointer [11], or replicated the write and read detector pointers [6]. However, both approaches had complex layout designs to align with and meet the setup/hold timing criteria between the address-decoder stage and input/output for the data memory, requiring complex control units and self-timing, which imposed additional design
complexity and degraded the operating speed.

Fig. 1. Conventional FIFO design

To address these drawbacks, we propose a new architecture that isolates the address-pointer and address-decoder stages using a non-overlapping circuit [3], which eliminates the need for the N isolation registers using a novel separated two-phase non-overlapping clocking system structure. Our circuit uses the main system clock CLK to generate two phase-clock signals CLKN and CLKP that are the exact complement of each other and whose edges are isolated, thus the two are mutually exclusive events occurring on the rising and falling edges of the system clock [10].
Prior designs used the isolate stage’s N registers to synchronize with the delay control circuit, which was needed to insert a large enough delay to adhere to the setup time criteria for reading and writing to the SRAM. Our approach obviates the use of these N registers, and eliminates the associated design challenges, additional timing, and overhead circuitry. Race conditions are safely avoided between the address-pointer and address-decoder stages (i.e., maintaining isolation between the stages without the N registers) and the setup/hold time between the memory word line increments and data accesses is sufficient for at least half of a clock cycle, eliminating the need for the delay control circuit. By restricting the address-pointer and address-decoder stages’ activities to separate halves of the clock cycle, the isolate stage and delay control circuits are eliminated, enabling high-speed read/write operations and power consumption reduction by eliminating unnecessary components and restricting switching activity to only the needed components during the required clock phase.
Another novel feature of our proposed design is the method for generating the empty and full flags using circuitry that checks for an address roll-over instead of tracking every storage address or data register in the FIFO. This structure eliminates any arithmetic operation circuitry, such as a subtractor and gray code mapping, a long right-left shift register, or a logic-expensive up/down counter [2][6][11][7][8][13]. Furthermore, the flags require only four DFFs, regardless of the depth of the FIFO, and thus limit the cost and hardware overhead associated with deeper FIFOs.
Compared to prior work, our design can operate 5X faster at the same supply voltage, or up to 44X faster with a 2.5X increase in supply voltage. Our design’s total power consumption is 7.8 mW with a total transistor count of 34,470.

II. FIFO QUEUE HARDWARE
Fig. 2 (a) depicts the architectural block structure for our FIFO design. The finite state machine (FSM) flow charts in Fig. 2 (b) and (c) depict the structure’s write and read functionalities, respectively, along with the associated input/output signals. The rectangular boxes in the FSM are states that occur on non-overlapping clock cycles, where the clock edges are shown next to the boxes. TABLE 1 depicts the input/output signals’ abbreviations and descriptions.

Fig. 2. (a) FIFO block diagram; (b) Write operation; (c) Read operation

TABLE 1
SIGNAL DEFINITIONS
| Input-Output Signals | Representation |
|---|---|
| WE | Write enable for the write operation |
| RE | Read enable for the read operation |
| CLK | System clock |
| RES | Reset for state initialization |
| EMPTY | Flag for the empty buffer |
| FULL | Flag for the full buffer |
| DIN [M-1:0] | Input data bus of size M |
| DOUT [M-1:0] | Output data bus of size M |
| RAP [N-1:0] | Read address-pointer of size N |
| WAP [N-1:0] | Write address-pointer of size N |
| WWL [2^N-1:0] | Memory write word line of size 2^N |
| RWL [2^N-1:0] | Memory read word line of size 2^N |

Before operation, the write address-pointer (WAP), read address-pointer (RAP), and up-down counter (CT) are initialized to zero. During the write operation, the M-bit input data is stored at the memory location pointed to by the memory write word line (WWL) of size 2^N-row for N-bit WAP on the rising system clock (CLK) edge (i.e., the WAP is an up counter of size N-bit). During the falling CLK edge, WAP and CT are incremented, and WWL is disabled. Once CT reaches the full state, the write operation ends even if there are outstanding write requests. The read operation is similar, wherein the M-bit output data (DOUT) is released during the rising CLK edge. During the falling CLK edge, the RAP is incremented using a simple N-bit up counter, while the CT is decremented and RWL is disabled. Once CT reaches zero, the empty state, the
read operation ends even if there are outstanding read requests.
Fig. 3 depicts the read and write operations’ hardware with respect to CLK. The FIFO’s core storage unit is constructed using an 8T-Cell SRAM circuit. The write and read decoders are divided into two separate components: the address-pointer and address-decoder. The circuit structure of the address-pointer is the up counter and is based on the state look-ahead parallel counter logic presented in [9].

Fig. 3. Detailed hardware diagram for our proposed FIFO design

The two-phase clock signals, CLKN and CLKP, are derived from CLK and represent CLK’s negative and positive versions, respectively. CLKP and CLKN are at the opposite phase of each other and their transition edges are non-overlapping, which ensures that there are no race conditions between the address-pointer and address-decoder, such that while the address-pointer is disabled/enabled, the address-decoder is enabled/disabled, respectively. The clock signal source and traditional back-to-back NAND two-phase clocking systems [9] are usually susceptible to fabrication process, voltage, and temperature (PVT) variation, as well as noise jitter variation. All these variations can have a negative impact on the duty cycle of the two phases and the non-overlapping margin, thus negatively impacting overall circuit timing performance. Usually, a delay-locked-loop circuit (DLL) [14] is used to control the PVT variations and preserve clock skew. However, due to the DLL’s complexity, power consumption, area cost, and large overhead in manycore architectures, we use a simpler circuit [10] with similar achievements. Simulations in [10] over PVT corners with non-ideal clock signals showed that the circuit provides non-overlapping phases with low values of root mean square (RMS) jitter.
There are two parallel counters, one for each read and write address-pointer. Each counter is triggered by CLKN, which is gated with the read/write operation enable signals, REN/WEN, respectively, such that the counter is incremented whenever a write/read operation occurs. Alternatively, the write/read address-decoders are enabled when CLKP is asserted high and is gated with the WEN/REN signals, respectively.
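The read and write sequences described in Section II can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the authors' Verilog: the class name `TwoPhaseFifo` and its methods are hypothetical, the rising CLK edge is treated as the CLKP phase (data access, pointers hold), the falling CLK edge is treated as the CLKN phase (pointers and CT update), and requests arriving while full or empty are simply ignored.

```python
class TwoPhaseFifo:
    """Behavioral sketch of a FIFO with one access per CLK cycle.

    Assumptions (not from the paper): Python lists stand in for the SRAM,
    and blocked requests are silently dropped.
    """

    def __init__(self, n_bits=6):
        self.depth = 1 << n_bits          # 2^N rows
        self.mem = [None] * self.depth
        self.wap = 0                      # write address-pointer
        self.rap = 0                      # read address-pointer
        self.ct = 0                       # up-down occupancy counter
        self._pending = (False, False)    # (WEN, REN) latched for this cycle

    @property
    def empty(self):
        return self.ct == 0

    @property
    def full(self):
        return self.ct == self.depth

    def rising_edge(self, we=False, re=False, din=None):
        """CLKP phase: word lines are active, pointers hold their values."""
        wen = we and not self.full
        ren = re and not self.empty
        dout = None
        if wen:
            self.mem[self.wap] = din      # WWL selects the row addressed by WAP
        if ren:
            dout = self.mem[self.rap]     # RWL selects the row addressed by RAP
        self._pending = (wen, ren)
        return dout

    def falling_edge(self):
        """CLKN phase: word lines are disabled, pointers and CT update."""
        wen, ren = self._pending
        if wen:
            self.wap = (self.wap + 1) % self.depth
        if ren:
            self.rap = (self.rap + 1) % self.depth
        self.ct += int(wen) - int(ren)
        self._pending = (False, False)

    def cycle(self, we=False, re=False, din=None):
        """One full CLK cycle: access on the rising edge, update on the falling edge."""
        dout = self.rising_edge(we, re, din)
        self.falling_edge()
        return dout
```

Because the access and the pointer update fall in opposite halves of the same cycle, each accepted request completes in exactly one call to `cycle`, which mirrors the one-cycle behavior the text describes.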
Fig. 4. Empty/Full flag circuitry.
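The roll-over based flag scheme of Fig. 4 can be approximated in software as follows. This is a hedged sketch rather than the circuit itself: the class `RolloverFlags` is hypothetical, each pair of serially-connected DFFs is modeled as a two-element list initialized to '10', and the pair is assumed to swap its bits whenever the associated pointer wraps from its maximum address back to zero.

```python
class RolloverFlags:
    """Sketch of empty/full detection from pointer equality plus roll-over bits."""

    def __init__(self, n_bits):
        self.depth = 1 << n_bits
        self.wap = 0
        self.rap = 0
        self.w_pair = [1, 0]   # [MSB, LSB] of the write-side DFF pair, reset to '10'
        self.r_pair = [1, 0]   # [MSB, LSB] of the read-side DFF pair, reset to '10'

    @staticmethod
    def _swap(pair):
        # '10' -> '01' -> '10' ...: the LSB toggles once per roll-over (assumption)
        return [pair[1], pair[0]]

    def advance_write(self):
        self.wap = (self.wap + 1) % self.depth
        if self.wap == 0:                 # write pointer rolled over
            self.w_pair = self._swap(self.w_pair)

    def advance_read(self):
        self.rap = (self.rap + 1) % self.depth
        if self.rap == 0:                 # read pointer rolled over
            self.r_pair = self._swap(self.r_pair)

    @property
    def empty(self):
        # pointers equal and LSBs equal (XNOR of the LSBs is 1)
        return self.wap == self.rap and self.w_pair[1] == self.r_pair[1]

    @property
    def full(self):
        # pointers equal and LSBs different (XNOR of the LSBs is 0)
        return self.wap == self.rap and self.w_pair[1] != self.r_pair[1]
```

The state needed for the flags is four bits in total regardless of `n_bits`, which is the property the paper highlights for deeper FIFOs.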




Fig. 4 depicts our empty-full flag circuitry solution. The empty-full flag detects the roll-over of the read and write address-pointers using two serially-connected DFFs for each operation (i.e., one pair of DFFs for the read operation and another pair for the write operation). Both pairs of serially-connected DFFs are initialized to ‘10.’ The least significant bits (LSBs) are XNORed for evaluation of the flags’ statuses in combination with the read and write address-pointers. If the address-pointers’ values are equal and the serially-connected DFFs’ LSBs are equal, the empty flag is asserted high, and alternatively, if the address-pointers’ values are equal and the serially-connected DFFs’ LSBs are different, full is asserted high. When the associated read/write address-pointer reaches the full state, then the associated pair of serially-connected DFFs change state to ‘01’ on the rising CLKP edge, as shown on the DFFs’ clock signal in Fig. 4.

III. TIMING CONSTRAINTS
In order to evaluate the scalability of our approach, the timing of all critical paths, including the setup/hold-gated with the two-phase clocks CLKN and CLKP, must be independent of the technology. To the best of our knowledge, our design is the first implementation of a FIFO that uses a non-overlapping two-phase clocking system. Our design uses only a basic CMOS logic gate structure with basic width/length sizes for design layout clarity and cost-effectiveness for continued technology scaling. We evaluate our proposed FIFO circuit based on an 8T-Cell SRAM structure of size 64-entry x 64-bit as a common size for many applications.
Fig. 5 enumerates all of the possible path delays generated in the data path by the non-overlapping two-phase clocking system. The circuit generation of the clocking system approximates the non-overlapping time as three gate delays (GDs) for a worst-case timing scenario:
Non-Overlap = 3 GDs    (1)
such that, on the falling CLK edge, CLKN is rising after three gate delays and CLKP is falling after one gate delay. Alternatively, on the rising CLK edge, CLKN is falling after three gate delays and CLKP is rising after one gate delay.
The address-pointer stage is triggered by the rising CLKN edge (Fig. 5, Path2). The time to increment the counter that generates the address-pointer is only the delay of a single DFF. A single DFF delay is composed of two gate delays as a precaution for a worst-case timing scenario:
CT = 2 GDs    (2)
A portion of the address-decoder logic, except the last stage, is integrated into the address-pointer stage, giving an additional three gate delays. Thus, the total gate delay count for generating the memory word line address prior to the last stage of the address-decoder is:
Path2-End1 = 5 GDs    (3)
This delay is shown as Path2-End1 in Fig. 5.
The empty and full flags are evaluated at the rising edge of the CLKN signal that follows the incrementing of the address-pointer. CLKN is implemented using XNOR and AND gate logic that requires two gate delays:
Path2-End2 = 2 GDs (CT) + 2 GDs (XNORs & AND) = 4 GDs    (4)
This delay is shown as Path2-End2 in Fig. 5.
The roll-over status is pre-evaluated in the empty and full flag circuitry on the rising CLKP edge after incrementing the address-pointer. This operation enables a large fan-in AND gate, which requires two gate delays:
Path2-End3 = 2 GDs (CT) + 2 GDs (large fan-in AND) = 4 GDs    (5)
This delay is shown as Path2-End3 in Fig. 5.
The evaluation of the roll-over status is stored in the 2-bit serial shifter in the flag circuitry at the rising CLKP edge. This evaluation ensures that Path2-End3 ends before the 2-bit serial shifter changes its state. Therefore, one of the possible worst-case timing scenarios for CLK includes the worst-case timing of CLKN, with the addition of three gate delays for the non-overlapping operation:
CLK = 5 GDs (CLKN high) + 5 GDs (CLKN low) + 3 GDs (Non-Overlap) = 13 GDs    (6)
The SRAM read-write word lines accessed in the last stage of the address-decoder are gated with CLKP and the REN or
WEN signals, as depicted in Fig. 5. The last stage of the address-decoder requires three gate delays, even though it is driven by only one gate, due to the loading parasitic effect. As a result, considering the worst-case word line access timing scenario, the read operation starts from Path1 and ends at the input/output to the SRAM memory:
Path1-End1 = 6 GDs (SRAM read) + 3 GDs (last stage decoder) + 1 GD (gated CLKP) = 10 GDs    (7)
This delay is shown as Path1-End1 in Fig. 5. This path creates a CLK timing scenario that can be approximated using the worst-case timing of CLKP plus the three gate delays for the non-overlapping operation:
CLK = 20 GDs (2*Path1-End1) + 3 GDs (Non-Overlap) = 23 GDs    (8)

Fig. 5. Critical paths in our proposed design

The slew rate is approximated as three gate delays for each of the rising and falling edges and is added to the overall CLK cycle time, giving the final CLK cycle time of:
CLK = 23 GDs (CLK) + 6 GDs (slew rate) = 29 GDs    (9)
The control unit structure for the FIFO (Fig. 3) generates the write enable WEN and read enable REN signals within one gate delay. A single gate delay is all that is required to guarantee that the WEN or REN signals are active high for one complete CLK cycle, even if the signals RE and WE are not originally aligned with the CLK. Therefore, the total worst-case CLK cycle time is:
CLK = 30 GDs    (10)
Fig. 6 summarizes a FIFO access during one CLK cycle, where the events labeled as Path1-End1, Path1-End2, Path2-End1, Path2-End2, Path2-End3, and Pre-charge occur within the two non-overlapping two-phase signals, CLKN and CLKP, with a safe non-overlapping margin.

Fig. 6. Timing diagram with associated events during one clock cycle

IV. RESULTS AND ANALYSIS
Without loss of generality and for comparison purposes, we implemented, tested, and verified our FIFO queue buffer hardware architecture of size 64-entry x 64-bits using SRAM 8T-Cells, which is similar to many prior hardware queue integrated circuits [2][6][11][7][8][13]. We architected our proposed queue buffer hardware at the CMOS transistor level using 65-nm Taiwan Semiconductor Manufacturing Company (TSMC) technology with a 1 V power supply. Additionally, it is well known that two-phase clocking systems are becoming dominant in emerging technology markets, which is evident when studying processor evolution, such as the ARM-1 to ARM-9, with recent technologies.
We verify the functionality of our proposed FIFO using a Verilog HDL implementation. For brevity, we omit the behavioral simulation waveforms without SRAM. The waveforms show the changes in the write and read address-pointers, word line activities based on these pointers, and the full and empty flag operations, using the signals defined in Table 1 for write operations followed by read operations. The FIFO is initially empty and the following sequential operations occur: (1) a write operation for 10 clock cycles; (2) a read operation for 12 clock cycles; (3) a write operation for 66 clock cycles; and (4) a read operation for 68 consecutive cycles. These operations and cycle durations verify correct operation for all possible scenarios for the full and empty flag conditions, the read and write address-pointers as well as the address-pointers’ incrementing on the positive CLKN edge and holding on the positive CLKP edge, and the SRAM word address lines showing activity only on the rising CLKP edge based on the empty and full flag assertions.
We gathered timing delay values, total power consumption, and total transistor counts using HSPICE [12] simulations, which show that our proposed FIFO buffer’s read/write operations can execute at a system clock frequency CLK = 1.15 GHz with a slew rate rise-fall of 0.15 V/ns. This design frequency is very close to the timing analysis derived in Section III, with a system clock CLK cycle timing of approximately 30 gate delays as derived in equation (10).
To reduce power consumption, we used a non-overlapping clocking system that is commonly used in the ARM architecture. This design approach has no latency delay (i.e., the write or read cycle happens in exactly one clock cycle).
The gate delay for the 65-nm technology ranges from 0.015 ns to 0.035 ns depending on the parasitic components and the components’ sizes, wherein the minimum technology size for the length is used. For all transistors, we use a channel length of 65 nm where the widths range from 3 μm to 5 μm except the inverter driver widths that are Wp = 15 μm and Wn = 10 μm. The 8T-Cell SRAM is 65 nm, which is a standard geometry size from Intel [5]. The overall design’s total power consumption is approximately 7.8 mW.
TABLE 2 shows the characteristics and component architecture details for our FIFO compared to related work [6][11][7] for continued technology scaling. We detail the address-pointer, control, data storage, and the empty/full flag circuitry. The related works’ structures follow recent advancements in technology implementation trends and have circuits that are targeted for high-speed operation with a low power budget. This comparison evaluates the designs’ complexities and scalabilities independent of the underlying technology since it is challenging to find comparable designs with the same technology parameters. However, the comparison still provides insights about the power consumption, speed, scalability, and design layout complexity. Compared to prior work, our design can operate 5X faster at the same supply voltage, or up to 44X faster with a 2.5X increase in supply voltage.
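The gate-delay budget of equations (1) through (10) and the quoted 65-nm gate-delay range can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the numbers stated in the text; the variable and function names are illustrative, and converting a gate-delay count into a frequency with a single uniform gate delay is a simplifying assumption.

```python
# Gate-delay (GD) counts taken from equations (1)-(10) in the text.
NON_OVERLAP = 3                      # eq. (1)
COUNTER = 2                          # eq. (2): one DFF, two gate delays
PATH2_END1 = COUNTER + 3             # eq. (3): counter plus partial decoder
PATH2_END2 = COUNTER + 2             # eq. (4): counter plus XNORs and AND
PATH2_END3 = COUNTER + 2             # eq. (5): counter plus large fan-in AND
PATH1_END1 = 6 + 3 + 1               # eq. (7): SRAM read, last decoder stage, gated CLKP

clk_from_clkn = 2 * PATH2_END1 + NON_OVERLAP   # eq. (6)
clk_from_clkp = 2 * PATH1_END1 + NON_OVERLAP   # eq. (8)
clk_with_slew = clk_from_clkp + 2 * 3          # eq. (9): three GDs per edge
clk_total = clk_with_slew + 1                  # eq. (10): one GD for WEN/REN


def clock_frequency_ghz(gate_delay_ns, gate_delays=clk_total):
    """Clock frequency in GHz if every gate delay took gate_delay_ns (assumption)."""
    return 1.0 / (gate_delays * gate_delay_ns)


def implied_gate_delay_ns(freq_ghz, gate_delays=clk_total):
    """Average gate delay in ns implied by a given clock frequency."""
    return 1.0 / (freq_ghz * gate_delays)


if __name__ == "__main__":
    print("CLK budget (GDs):", clk_total)
    print("f at 0.035 ns/GD (GHz):", clock_frequency_ghz(0.035))
    print("f at 0.015 ns/GD (GHz):", clock_frequency_ghz(0.015))
    print("GD implied by 1.15 GHz (ns):", implied_gate_delay_ns(1.15))
```

Under these assumptions the reported 1.15 GHz corresponds to an average gate delay that falls inside the quoted 0.015 ns to 0.035 ns range, which is consistent with the text's statement that simulation and analysis agree closely.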




TABLE 2
COMPARISON BETWEEN PRIOR WORKS AND OUR PROPOSED FIFO DESIGN
| Design | Address-Pointer | Control Unit | Storage Unit | Flag Unit | Freq/Voltage |
|---|---|---|---|---|---|
| [6] | Event-driven self-timed pointers using replica write and read pulses | Self-timed management units and power switched control circuit | 10T-Cell SRAM, 256-word X 16-bit | Up/down indicators | 22.7 MHz / 0.4 V |
| [11] | Adder and binary registers | Counter control unit | Dual-port RAM, 128-word X 32-bit | Johnson counter | 200 MHz / 1 V |
| [7] | Cyclic address with one-hot bubble-encoding and shift register | Latches are connected in parallel forming lanes activated every other cycle | Latched-based, requiring N*(N-1) latches for N lanes | No flags are used; all pipeline lanes are full, then output is released | 27 MHz / 0.28 V |
| Our Work | Carry look-ahead state counter | Non-overlapping clock using 2 opposing phase-clocks | 8T-Cell SRAM, 64-word X 64-bit | Rollover detection and comparison circuit | 1 GHz / 1 V |
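The 5X and 44X speed-up figures quoted in the text follow directly from the Freq/Voltage column of TABLE 2. The short sketch below recomputes them; the dictionary layout and the helper name `compare` are illustrative only, and the frequency and voltage values are copied from the table.

```python
# (frequency in MHz, supply voltage in V) per design, from TABLE 2.
DESIGNS = {
    "[6]": (22.7, 0.4),
    "[11]": (200.0, 1.0),
    "[7]": (27.0, 0.28),
    "ours": (1000.0, 1.0),
}


def compare(reference, target="ours"):
    """Return (speed-up, supply-voltage ratio) of target relative to reference."""
    ref_freq, ref_volt = DESIGNS[reference]
    tgt_freq, tgt_volt = DESIGNS[target]
    return tgt_freq / ref_freq, tgt_volt / ref_volt


if __name__ == "__main__":
    for ref in ("[11]", "[6]", "[7]"):
        speedup, volt_ratio = compare(ref)
        print(f"vs {ref}: {speedup:.1f}X faster at {volt_ratio:.2f}X the supply voltage")
```

The comparison against [11] is at equal supply voltage, while the comparison against [6] is where the 2.5X voltage increase applies.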


V. CONCLUSIONS
Our proposed FIFO design provides high-speed operation and large SRAM-based storage capacity with a simple scalable design. The design uses two phase-clocks, which eliminates the race conditions between incrementing the addresses and accessing the data for write/read operations. Our design leverages the key advantages of the two-ported SRAM memory 8T-Cell for high-speed operation. The empty/full flag circuitry is simplified to record the roll-over of the FIFO’s maximum address values, which precludes the use of large circuits with continuous switching activity. In future work, we will adapt the design for asynchronous behavior using two independent separate clock signals for the read and write operations.

VI. ACKNOWLEDGEMENTS
The authors would like to acknowledge the support of National Science Foundation (CNS-1718033) and Jordan University of Science and Technology for both providing the financial support to complete this work. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

VII. REFERENCES
[1] IEEE Spectrum. Breaking the Multicore Bottleneck. Nov. 2016.
[2] Y. Bae, S. Park, and I. Park. A single-chip programmable platform based on a Multithreaded Processor and Configurable logic Clusters. IEEE Journal of Solid-State Circuits, Vol. 38, No. 10, Oct. 2003.
[3] S. Furber. ARM: system-on-chip architecture, 2nd edition. Addison-Wesley, 2000.
[4] S. Abdel-Hafeez and A. Gordon-Ross. A Digital CMOS Parallel Counter Architecture Based on State Look-Ahead logic. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 19, Issue 6, pp. 1023-1034, May 23, 2011.
[5] S. Abdel-Hafeez, M. Shatnawi, and A. Gordon-Ross. A Double Data Rate 8T-Cell SRAM Architecture for Systems-on-Chip. IEEE 14th International Symposium on System-on-Chip 2012, Tampere, Finland, October 11-12, 2012.
[6] W. Hsu, P. Huang, S. Wu, C. Chuang, W. Hwang, M. Tu, and M. Yin. 8nm Ultra-Low Power near-/Sub-threshold First-In-First-Out (FIFO) Memory for Multi-Bio-signal Sensing Platforms. International Symposium on Automation and Test VLSI Design (VLSI-DAT), pp. 1-4, Hsinchu, Taiwan, April 2016.
[7] D. Jeon, M. B. Henry, Y. Kim, I. Lee, Z. Zhang, D. Blaauw, and D. Sylvester. An Energy Efficient Full-Frame Feature Extraction Accelerator With Shift-Latch FIFO in 28 nm CMOS. IEEE J. SSC, Vol. 49, No. 5, pp. 1271-1283, May 2014.
[8] B. Keller, M. Fojtik, and B. Khailany. A Pausible Bisynchronous FIFO for GALS Systems. 21st IEEE International Symposium on Asynchronous Circuits and Systems, pp. 1-8, California, May 2015.
[9] W. Lin and W. C. Black, Jr. A low-jitter skew-calibrated multi-phase clock generator for time-interleaved applications. Solid-State Circuits Conference, 2001. Digest of Technical Papers. ISSCC. 2001, IEEE International, pp. 396-397, 2001.
[10] B. Nowacki, N. Paulino, and J. Goes. A Simple 1 GHz Non-Overlapping Two-Phase Clock Generators for SC Circuits. MIXDES 2013, 20th International Conference "Mixed Design of Integrated Circuits and Systems", June 20-22, Gdynia, Poland, pp. 174-178, 2013.
[11] A. Rahmani, P. Liljeberg, J. Plosila, and H. Tenhunen. Design and implementation of reconfigurable FIFOs for Voltage/Frequency Island-based Networks-on-Chip. Microprocessors and Microsystems, Vol. 37, pp. 432-445, June–July 2013.
[12] Synopsys. (2010). HSPICE. [Online]. Available: http://www.synopsys.com
[13] N. Shibata, M. Watanabe, and Y. Tanabe. A Current-Sensed High-Speed and Low-Power First-In-First-Out Memory Using a Wordline/Bitline-Swapped Dual-Port SRAM Cell. IEEE J. of SSC, Vol. 37, No. 6, pp. 735-750, June 2002.
[14] D. Zhang, H. G. Yang, W. Zhu, W. Li, Z. Huang, L. Li, and T. Li. A Multiphase DLL With a Novel Fast-Locking Fine-Code Time-to-Digital Converter. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 23, Issue 11, pp. 2680-2684, 2015.


