Slides

4/4/2011
EE 811 Advanced Digital System Design

Dr. Arshad Aziz
Basic FPGA Architecture
Technology Timeline
[Figure: timeline from 1945 to 2000 showing the introduction of transistors, ICs (general), SRAMs & DRAMs, microprocessors, SPLDs, CPLDs, ASICs, and FPGAs]
The Design Warrior's Guide to FPGAs: Devices, Tools, and Flows. ISBN 0750676043. Copyright 2004 Mentor Graphics Corp. (www.mentor.com)
Major FPGA vendors

SRAM-based FPGAs
Xilinx Inc. www.xilinx.com
Altera Corp. www.altera.com
Atmel Corp. www.atmel.com
Lattice Semiconductor Corp. www.latticesemi.com
Antifuse and flash-based FPGAs
Actel Corp. www.actel.com
QuickLogic Corp. www.quicklogic.com
Feature | SRAM | Antifuse | E2PROM / FLASH
Technology node | State-of-the-art | One or more generations behind | One or more generations behind
Reprogrammable | Yes (in system) | No | Yes (in-system or offline)
Reprogramming speed (inc. erasing) | Fast | ---- | 3x slower than SRAM
Volatile (must be programmed on power-up) | Yes | No | No (but can be if required)
Requires external configuration file | Yes | No | No
Good for prototyping | Yes (very good) | No | Yes (reasonable)
Instant-on | No | Yes | Yes
IP Security | Acceptable (especially when using bitstream encryption) | Very Good | Very Good
Size of configuration cell | Large (six transistors) | Very small | Medium-small (two transistors)
Power consumption | Medium | Low | Medium
Rad Hard | No | Yes | Not really
The Programmable Marketplace

Q1 Calendar Year 2005
PLD Segment: Xilinx 51%, Altera 33%, Lattice 7%, Actel 5%, QuickLogic 2%, Other 2%
FPGA Sub-Segment: Xilinx 58%, Altera 31%, All Others 11%
Source: Company reports Latest information available; computed on a 4-quarter rolling basis
FPGA Families
Xilinx
Low-cost: Spartan 3, Spartan 3E, Spartan 3L
High-performance: Virtex 4 LX / SX / FX, Virtex 5 LX
Altera
Low-cost: Cyclone II
High-performance: Stratix II, Stratix II GX
Xilinx
Primary products: FPGAs and the associated CAD software
Programmable Logic Devices
ISE Alliance and Foundation Series Design Software
Main headquarters in San Jose, CA
Fabless* Semiconductor and Software Company
Foundries: UMC (Taiwan), Seiko Epson (Japan), TSMC (Taiwan)
{*Xilinx acquired an equity stake in UMC in 1996}
Source: [Xilinx Inc.]
Xilinx FPGA Families

Old families
XC3000, XC4000, XC5200
Old 0.5 µm, 0.35 µm and 0.25 µm technology. Not recommended for modern designs.
Low-cost family
Spartan/XL derived from XC4000
Spartan-II derived from Virtex
Spartan-IIE derived from Virtex-E
Spartan-3 (90 nm)
Spartan-3E (90 nm)
Spartan-3A (90 nm)
High-performance families
Virtex (220 nm)
Virtex-E, Virtex-EM (180 nm)
Virtex-II, Virtex-II PRO (130 nm)
Virtex-4 (90 nm)
Virtex-5 (65 nm)
General structure of an FPGA
Xilinx FPGA
[Figure: Xilinx FPGA floorplan showing configurable logic blocks, block RAMs, and I/O blocks]
Generic FPGA architecture:

[Figure: array of configurable logic blocks (CLBs) joined by connection blocks, switch blocks, and wire segments in routing channels, with I/O pads]
Xilinx CLB
[Figure: each configurable logic block (CLB) contains slices, and each slice contains two logic cells]
Xilinx Point of Reference

A Xilinx CLB has FOUR slices
Each slice has TWO logic cells
Each logic cell has a 4-input LUT plus other logic (carry and control) plus a flip-flop/latch
For SLICEL slices, these LUTs can be configured as:
1. LUT
For SLICEM slices, these LUTs can be configured as:
1. LUT
2. 16 x 1 Distributed RAM (16 words x 1 bit/word)
3. 16-bit Shift Register
CLB Structure of Spartan 3

Each Virtex-II CLB contains four slices
Local routing provides feedback between slices in the same CLB, and it provides routing to neighboring CLBs
A switch matrix provides access to general routing resources
[Figure: CLB with a switch matrix, two BUFTs, slices S0 to S3, local routing, the SHIFT path, and CIN/COUT carry connections]
Simplified view of a Xilinx Logic Cell

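[Figure: simplified logic cell with a 4-input LUT (also usable as a 16x1 RAM or a 16-bit shift register), a multiplexer, and a flip-flop; inputs a, b, c, d, e, clock, clock enable, and set/reset; outputs y and q]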
Simplified Slice Structure

Each slice has four outputs
Two registered outputs, two non-registered outputs
Two BUFTs associated with each CLB, accessible by all 16 CLB outputs
Carry logic runs vertically, up only
Two independent carry chains per CLB
[Figure: slice 0 with two LUT and carry stages, each feeding a flip-flop with D, CE, PRE, CLR, and Q pins]
Detailed Slice Structure

The next few slides discuss the slice features
LUTs
MUXF5, MUXF6, MUXF7, MUXF8 (only the F5 and F6 MUX are shown in this diagram)
Carry Logic
MULT_ANDs
Sequential Elements
SRAM Cell (Pass Transistor)

An SRAM cell can drive the gate (G) terminal of an NMOS transistor.
If SRAM (M) = 1, then the signal passes from the source (S) to the drain (D).
An SRAM cell can be attached to the select line of a MUX to control it.
Look-Up Tables
Combinatorial logic is stored in Look-Up Tables (LUTs)
Also called Function Generators (FGs)
Capacity is limited by the number of inputs, not by the complexity
Delay through the LUT is constant
A B C D | Z
0 0 0 0 | 0
0 0 0 1 | 0
0 0 1 0 | 0
0 0 1 1 | 1
0 1 0 0 | 1
0 1 0 1 | 1
. . .
1 1 0 0 | 0
1 1 0 1 | 0
1 1 1 0 | 0
1 1 1 1 | 1
[Figure: combinatorial logic with inputs A, B, C, D and output Z]
Look Up Table (LUT)

The LUT is used to realize any Boolean function.
Assume the function to be realized is y = (a&b) | !c
This could be achieved by loading the LUT with the appropriate output values
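As a minimal sketch of what "loading the LUT" means for the function above, the following Python enumerates every input combination and stores the resulting output bit. The helper names lut_contents and lut_read are illustrative only and do not come from any vendor tool; the input order (a, b, c) with a as the most significant address bit is an assumption.

```python
from itertools import product

def lut_contents(func, n_inputs):
    """Return the 2**n_inputs output bits a LUT would store for func."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=n_inputs)]

def lut_read(contents, *inputs):
    """Use the inputs as an address (first input = MSB) into the stored bits."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return contents[address]

# y = (a & b) | !c, with !c written as (1 - c) for 0/1 integers
table = lut_contents(lambda a, b, c: (a & b) | (1 - c), 3)
print(table)
```

Once the table is loaded, evaluating the function is only a memory read, which is why the delay through a LUT does not depend on how complex the function is.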
LUT (Look-Up Table) Functionality

x1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
x2 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
x3 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
x4 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
y 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0
[Figure: LUT with inputs x1, x2, x3, x4 and output y]
x1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
x2 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
x3 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
x4 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
y 0 1 0 0 0 1 0 1 0 1 0 0 1 1 0 0
Look-Up tables are primary elements for logic implementation
Each LUT can implement any function of 4 inputs
5-Input Functions implemented using two LUTs

One CLB Slice can implement any function of 5 inputs
Logic function is partitioned between two LUTs
F5 multiplexer selects LUT
[Figure: two LUTs (each usable as LUT, ROM, or RAM) with address inputs A1 to A4, feeding the F5 multiplexer, which is selected by BX]
5-Input Functions implemented using two LUTs

X5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
X4 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
X3 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
X2 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
X1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
Y 0 1 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 0
[Figure: two LUTs feeding a multiplexer that drives OUT]
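The partitioning can be sketched in Python using the 32 Y values from the table: the first 16 entries (X5 = 0) go into one 4-input LUT, the last 16 (X5 = 1) into the other, and X5 plays the role of the F5 multiplexer select. The names lut_lo, lut_hi, lut4, and f5 are hypothetical, and which physical LUT holds which half is an assumption.

```python
# Y column of the 5-input truth table, rows ordered X5 X4 X3 X2 X1 = 00000 .. 11111
Y = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
     0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]

lut_lo = Y[:16]   # contents of the LUT used when X5 = 0
lut_hi = Y[16:]   # contents of the LUT used when X5 = 1

def lut4(contents, x4, x3, x2, x1):
    """A 4-input LUT is a 16-entry memory addressed by its inputs."""
    return contents[(x4 << 3) | (x3 << 2) | (x2 << 1) | x1]

def f5(x5, x4, x3, x2, x1):
    """Both LUTs see the same four inputs; the mux picks one output using X5."""
    out_lo = lut4(lut_lo, x4, x3, x2, x1)
    out_hi = lut4(lut_hi, x4, x3, x2, x1)
    return out_hi if x5 else out_lo

print(f5(0, 0, 0, 0, 1), f5(1, 1, 1, 1, 1))
```

This is the Shannon expansion of the function about X5, so the same approach extends to 6 inputs by muxing two such 5-input blocks.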
Dedicated Expansion Multiplexers

MUXF5 combines 2 LUTs to create
Any 5-input function (LUT5)
Or selected functions up to 9 inputs
Or 4x1 multiplexer
MUXF6 combines 2 slices to form
Any 6-input function (LUT6)
Or selected functions up to 19 inputs
Or 8x1 multiplexer
Dedicated muxes are faster and more space efficient
[Figure: CLB with two slices; each slice has two LUTs feeding a MUXF5, and MUXF6 combines the two slices]
Connecting Look-Up Tables

MUXF5 combines LUTs in each slice
MUXF6 combines slices S0 and S1
MUXF6 combines slices S2 and S3
MUXF7 combines the two MUXF6 outputs
MUXF8 combines the two MUXF7 outputs (from the CLB above or below)
[Figure: CLB with slices S0 to S3 and their F5, F6, F7, and F8 multiplexers]
Programmable Logic Block
Early devices were based on the concept of a programmable logic block, which comprised a 3-input lookup table (LUT), a register that could act as a flip-flop or a latch, and a multiplexer, along with a few other elements.
3-, 4-, 5-, or 6-input LUTs?

The key feature of an n-input LUT is that it can implement any possible n-input combinational logic function. Adding more inputs allows you to represent more complex functions, but every time you add an input, you double the number of SRAM cells!
The first FPGAs were based on 3-input LUTs.
FPGA vendors and researchers studied the relative merits of 3-, 4-, 5- and even 6-input LUTs.
The current consensus is that 4-input LUTs offer the optimal balance of pros and cons.
In the past, some devices were created using a mixture of different LUT sizes because this offered the promise of optimal device utilization. However, current logic synthesis tools prefer uniformity and regularity.
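The doubling is simply the size of a truth table: an n-input LUT needs one storage cell per input combination, that is 2^n cells. A small sketch (the function name lut_sram_cells is illustrative):

```python
def lut_sram_cells(n_inputs):
    """One SRAM cell per row of the truth table: 2**n_inputs cells."""
    return 2 ** n_inputs

for n in (3, 4, 5, 6):
    print(f"{n}-input LUT: {lut_sram_cells(n)} SRAM cells")
```

The count covers only the truth-table storage of a single LUT; it ignores the cells that configure routing and other logic.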
FPGA Function generators

LUT Example: Implement the function using:
2-input LUTs
3-input LUTs
4-input LUTs
F = ABD + BC D + A B C
[Figure: implementations of F using 2-input, 3-input, and 4-input LUTs, with inputs A, B, C, D and output F]
Fast Carry Logic
Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
Carry logic is independent of normal logic and routing resources
Each CLB contains separate logic and routing for the fast generation of sum & carry signals
[Figure: carry logic routing running from LSB to MSB]
Fast Carry Logic

Simple, fast, and complete arithmetic logic
Dedicated XOR gate for single-level sum completion
Uses dedicated routing resources
All synthesis tools can infer carry logic
[Figure: first and second carry chains through slices S0 to S3 of a CLB; the COUT of each chain continues to the CIN of slice S0 or S2 of the next CLB]
Accessing Carry Logic
All major synthesis tools can infer carry logic for arithmetic functions
Addition (SUM <= A + B)
Subtraction (DIFF <= A - B)
Comparators (if A < B then)
Counters (count <= count + 1)
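To illustrate what the inferred carry logic computes for SUM <= A + B, here is a bit-level sketch in Python. It assumes the usual arrangement in which the LUT produces a propagate signal (a XOR b), a dedicated carry multiplexer passes the carry along, and the dedicated XOR gate completes the sum. The function name carry_chain_add is hypothetical, and this is a behavioral model only, not a description of any specific device primitive.

```python
def carry_chain_add(a, b, width):
    """Add two unsigned values bit by bit, LSB first, as a carry chain would."""
    carry = 0
    result = 0
    for i in range(width):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        propagate = a_bit ^ b_bit            # computed in the LUT
        sum_bit = propagate ^ carry          # dedicated XOR gate
        # carry mux: pass the incoming carry if propagating, else generate from a_bit
        carry = carry if propagate else a_bit
        result |= sum_bit << i
    return result, carry                     # carry is the final COUT

print(carry_chain_add(0b1011, 0b0110, 4))
```

Only the carry signal ripples from bit to bit, which is why a dedicated vertical carry path makes adders, comparators, and counters fast.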
Flexible Sequential Elements

Either flip-flops or latches
Two in each slice; eight in each CLB
Inputs come from LUTs or from an independent CLB input
Separate set and reset controls
Can be synchronous or asynchronous
All controls are shared within a slice
Control signals can be inverted locally within a slice
[Figure: FDRSE_1 (D, CE, R, S, Q), FDCPE (D, CE, PRE, CLR, Q), and LDCPE (D, G, CE, PRE, CLR, Q) primitives]
Shift Register
Each LUT can be configured as shift register
Serial in, serial out
Dynamically addressable delay up to 16 cycles
For programmable pipeline
Cascade for greater cycle delays
Use CLB flip-flops to add depth
[Figure: LUT drawn as a chain of D flip-flops with CE; inputs IN, CE, CLK; output OUT selected by DEPTH[3:0]]
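A behavioral sketch of such a LUT-based shift register, assuming that DEPTH[3:0] = d selects the tap that is d + 1 clock edges behind the input. The class name SRL16 and its methods are illustrative, not a vendor simulation model. The stages hold sample numbers here to make the delay visible, whereas real hardware stores one bit per stage.

```python
class SRL16:
    """16-stage serial-in, serial-out shift register with an addressable tap."""

    def __init__(self):
        self.stages = [0] * 16

    def clock(self, data_in, ce=1):
        """On a clock edge with CE high, shift everything by one stage."""
        if ce:
            self.stages = [data_in] + self.stages[:-1]

    def read(self, depth):
        """Asynchronous tap select: depth 0..15 gives a delay of depth + 1 edges."""
        return self.stages[depth & 0xF]

srl = SRL16()
outputs = []
for sample in range(1, 21):
    srl.clock(sample)
    outputs.append(srl.read(8))   # depth 8 = 9-cycle delay
print(outputs)
```

Changing the depth value at run time changes the delay without reloading the register, which is what makes the delay dynamically addressable.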
Shift Register
Register-rich FPGA
Allows for addition of pipeline stages to increase throughput
Data paths must be balanced to keep desired functionality
[Figure: two 64-bit paths. Operation A (4 cycles) followed by Operation B (8 cycles) takes 12 cycles; Operation C takes 3 cycles, leaving a 9-cycle imbalance]
Shift Register LUT Example
[Figure: Operation A (4 cycles) followed by Operation B (8 cycles) takes 12 cycles; Operation C (3 cycles) followed by Operation D, a 9-cycle NOP, also takes 12 cycles]
Paths are Statically Balanced
Distributed RAM
CLB LUT configurable as Distributed RAM
An LUT equals 16x1 RAM
Cascade LUTs to increase RAM size
Synchronous write
Asynchronous read
Can create a synchronous read by using extra flip-flops
Naturally, distributed RAM read is asynchronous
Two LUTs can make
32 x 1 single-port RAM
16 x 2 single-port RAM
16 x 1 dual-port RAM
[Figure: primitives and pins: RAM16X1S (D, WE, WCLK, A0 to A3, O); RAM32X1S (D, WE, WCLK, A0 to A4, O); RAM16X2S (D0, D1, WE, WCLK, A0 to A3, O0, O1); RAM16X1D (D, WE, WCLK, A0 to A3, DPRA0 to DPRA3, SPO, DPO)]
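The write and read behavior can be sketched as follows: writes take effect only on a clock edge with WE high, while reads simply return whatever the address currently selects. The classes Ram16x1 and Ram32x1 are hypothetical behavioral models named after the idea of the primitives, and using A4 to choose between the two LUTs is an assumption about how the 32 x 1 cascade is formed.

```python
class Ram16x1:
    """One LUT used as a 16 x 1 RAM: synchronous write, asynchronous read."""

    def __init__(self):
        self.mem = [0] * 16

    def read(self, addr):
        return self.mem[addr & 0xF]          # no clock involved

    def clock(self, addr, d, we):
        if we:                               # write only on a clock edge with WE = 1
            self.mem[addr & 0xF] = d & 1

class Ram32x1:
    """Two LUTs cascaded into a 32 x 1 RAM; address bit A4 selects the LUT."""

    def __init__(self):
        self.luts = [Ram16x1(), Ram16x1()]

    def read(self, addr):
        return self.luts[(addr >> 4) & 1].read(addr)

    def clock(self, addr, d, we):
        self.luts[(addr >> 4) & 1].clock(addr, d, we)

ram = Ram32x1()
ram.clock(addr=19, d=1, we=1)
print(ram.read(19), ram.read(3))
```

Adding a flip-flop on the read output would turn the asynchronous read into a synchronous one, at the cost of one cycle of latency.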
Xilinx Multipurpose LUT
Simplified view of a Xilinx Logic Cell

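[Figure: simplified logic cell with a 4-input LUT (also usable as a 16x1 RAM or a 16-bit shift register), a multiplexer, and a flip-flop; inputs a, b, c, d, e, clock, clock enable, and set/reset; outputs y and q]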
RAM Blocks and Multipliers in Xilinx FPGAs
Embedded Ram Blocks

A lot of applications require the use of memory, so FPGAs now include relatively large chunks of embedded RAM called e-RAM or Block RAM (BRAM).
Depending on the architecture of the component, these blocks might be positioned around the periphery of the device or organized as columns.
These blocks can be used for a variety of purposes, such as implementing standard single- or dual-port RAMs, FIFOs, etc.
Block RAM
[Figure: Spartan-3 dual-port block RAM with Port A and Port B]
Block RAM
Most efficient memory implementation

Dedicated blocks of memory
Ideal for most memory requirements

4 to 104 memory blocks
18 kbits = 18,432 bits per block (16 k without parity bits)
Use multiple blocks for larger memories
Builds both single and true dual-port RAMs
Synchronous write and read (different from distributed RAM)
Spartan-3 Block RAM Amounts
Block RAM can have various configurations (port aspect ratios)

16k x 1
8k x 2
4k x 4
2k x (8+1)
1024 x (16+2)
[Figure: address ranges for each aspect ratio: 0 to 16,383; 0 to 8,191; 0 to 4,095; 0 to 2047; 0 to 1023]
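The aspect ratios listed above all describe the same block of storage. A short check of the arithmetic: depth times data width is always 16,384 bits, and the configurations with parity bits use the full 18,432 bits. The list CONFIGS and the function describe are illustrative names.

```python
# (depth, data bits per word, parity bits per word) for the listed aspect ratios
CONFIGS = [
    (16 * 1024, 1, 0),
    (8 * 1024, 2, 0),
    (4 * 1024, 4, 0),
    (2 * 1024, 8, 1),
    (1024, 16, 2),
]

def describe(depth, data_bits, parity_bits):
    """Return address width and bit totals for one aspect ratio."""
    return {
        "address_bits": depth.bit_length() - 1,      # log2 of a power of two
        "data_total": depth * data_bits,
        "total": depth * (data_bits + parity_bits),
    }

for cfg in CONFIGS:
    print(cfg, describe(*cfg))
```

Each halving of the depth removes one address bit and doubles the data width, so the data capacity stays constant.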
Block RAM Port Aspect Ratios
Single-Port Block RAM
Dual-Port Block RAM
Dual-Port Bus Flexibility

RAMB4_S16_S8
Port A In: 1K deep (WEA, ENA, RSTA, CLKA, ADDRA[9:0], DIA[17:0])
Port A Out: 18-bit width (DOA[17:0])
Port B In: 2K deep (WEB, ENB, RSTB, CLKB, ADDRB[10:0], DIB[8:0])
Port B Out: 9-bit width (DOB[8:0])
Each port can be configured with a different data bus width
Provides easy data width conversion without any additional logic
Two Independent Single-Port RAMs

Added advantage of True Dual-Port
No wasted RAM bits
Can split a Dual-Port 16K RAM into two Single-Port 8K RAMs
Simultaneous independent access to each RAM
To access the lower RAM
Tie the MSB address bit to Logic Low
To access the upper RAM
Tie the MSB address bit to Logic High
RAMB4_S1_S1
Port A In: 8K deep, address = 0, ADDR[12:0] (WEA, ENA, RSTA, CLKA, ADDRA[12:0], DIA[0])
Port A Out: 1-bit width (DOA[0])
Port B In: 8K deep, address = 1, ADDR[12:0] (WEB, ENB, RSTB, CLKB, ADDRB[12:0], DIB[0])
Port B Out: 1-bit width (DOB[0])
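The split can be modeled by giving each port a fixed most significant address bit, so that the two ports can never touch the same cell. This is a behavioral sketch only; the class names DualPortRam16Kx1 and SinglePortView are hypothetical.

```python
class DualPortRam16Kx1:
    """One 16K x 1 memory array shared by two ports."""
    DEPTH = 16 * 1024

    def __init__(self):
        self.cells = [0] * self.DEPTH

class SinglePortView:
    """One port of the dual-port RAM with its MSB address bit tied off."""

    def __init__(self, ram, msb):
        self.ram = ram
        self.msb = msb                      # 0 = lower 8K, 1 = upper 8K

    def _full_address(self, addr13):
        # Prepend the tied-off MSB to the 13-bit user address ADDR[12:0]
        return (self.msb << 13) | (addr13 & 0x1FFF)

    def write(self, addr13, bit):
        self.ram.cells[self._full_address(addr13)] = bit & 1

    def read(self, addr13):
        return self.ram.cells[self._full_address(addr13)]

ram = DualPortRam16Kx1()
lower = SinglePortView(ram, msb=0)          # port A, MSB tied low
upper = SinglePortView(ram, msb=1)          # port B, MSB tied high
lower.write(5, 1)
upper.write(7, 1)
print(lower.read(5), upper.read(5), upper.read(7))
```

Because the two halves never overlap, each port behaves as an independent 8K single-port RAM and every bit of the block is used.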
Embedded Multipliers

Some functions, like multipliers, are inherently slow if they are implemented by connecting a large number of programmable logic blocks together.
Current FPGAs incorporate special hard-wired multiplier blocks, which are typically located in close proximity to the embedded RAM blocks (Arithmetic Based Applications).
18 x 18 Embedded Multiplier
Fast arithmetic functions
Optimized to implement multiply / accumulate modules
18 x 18 signed multiplier
Fully combinational
Optional registers with CE & RST (pipeline)
Independent from adjacent block RAM
18 x 18 Multiplier
Embedded 18-bit x 18-bit multiplier
2's complement signed operation
Multipliers are organized in columns
[Figure: 18 x 18 multiplier with inputs Data_A (18 bits) and Data_B (18 bits) and a 36-bit output]
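A sketch of the arithmetic such a block performs: both 18-bit operands are interpreted as 2's complement values and the product is returned as a 36-bit pattern. The function names to_signed and multiply_18x18 are illustrative.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def multiply_18x18(data_a, data_b):
    """Signed 18 x 18 multiply; the result is the 36-bit output pattern."""
    product = to_signed(data_a, 18) * to_signed(data_b, 18)
    return product & ((1 << 36) - 1)

out = multiply_18x18(0x3FFFF, 2)            # 0x3FFFF is -1 in 18-bit 2's complement
print(hex(out), to_signed(out, 36))
```

The largest magnitude product, (-2^17) x (-2^17) = 2^34, still fits in 36 signed bits, which is why the output is exactly twice the input width.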
Positions of Multipliers
Asynchronous 18-bit Multiplier
18-bit Multiplier with Register
A simple clock tree

Clock tree Flip-flops
Special clock pin and pad Clock signal from outside world
Digital Clock Manager (DCM)
[Figure: the clock signal from the outside world enters through a special clock pin and pad and feeds the clock manager, whose daughter clocks drive internal clock trees or output pins]
Digital Clock Managers (DCM)

The clock pin is usually connected to a special hard-wired function called a clock manager that generates daughter clocks.
The daughter clocks may be used to drive internal clock trees or external output pins that can be used to provide clocking services to other devices on the host circuit board.
There might be multiple clock managers, each supporting only a subset of features (jitter removal, frequency synthesis, ...)
DCM: Jitter Removal

In the real world clock edges may arrive a little early or a little late. A fuzzy clock would result (jitter) due to the delay encountered. The FPGA clock manager can be used to detect and correct for this jitter and provide a clean daughter clock signal for use inside the device.
DCM: Frequency Synthesis

The frequency of the clock signal being presented to the FPGA from the outside world might not be exactly what the design engineer wishes for. The clock manager can be used to generate daughter clocks with frequencies that are derived by multiplying or dividing the original signal.
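As a worked example of frequency synthesis, a daughter clock can be expressed as the input frequency scaled by an integer multiplier M and an integer divider D. The function name daughter_frequency and the 50 MHz input are assumptions for illustration; the legal ranges of M and D depend on the device and are not modeled here.

```python
from fractions import Fraction

def daughter_frequency(f_in_hz, multiply, divide):
    """f_out = f_in * M / D, kept exact with rational arithmetic."""
    return Fraction(f_in_hz * multiply, divide)

f_in = 50_000_000                            # assumed 50 MHz board oscillator
print(daughter_frequency(f_in, 3, 2))        # 75 MHz
print(daughter_frequency(f_in, 1, 4))        # 12.5 MHz
```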
DCM: Phase Shifting

Certain designs require the use of clocks that are phase shifted (delayed) with respect to each other. Some clock managers allow you to select from fixed phase shifts of common values such as 120° and 240° (for a three-phase clocking scheme).
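A phase shift is a fraction of one clock period, so it converts to a time delay as delay = period x phase / 360. The sketch below assumes a 100 MHz clock purely for illustration; the function name phase_delay_ns is hypothetical.

```python
def phase_delay_ns(freq_hz, phase_deg):
    """Convert a phase shift in degrees to a delay in nanoseconds."""
    period_ns = 1e9 / freq_hz
    return period_ns * phase_deg / 360.0

for phase in (0, 120, 240):
    print(f"{phase:3d} degrees at 100 MHz = {phase_delay_ns(100e6, phase):.3f} ns")
```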
Basic I/O Block Structure

[Figure: I/O block with a three-state path (three-state control, three-state FF enable), an output path (output FF enable), and an input path (direct input, registered input, input FF enable); the flip-flops share clock and set/reset]
IOB Functionality
IOB provides interface between the package pins and CLBs
Each IOB can work as uni- or bi-directional I/O
Outputs can be forced into High Impedance
Inputs and outputs can be registered
advised for high-performance I/O
Inputs can be delayed
Configurable I/O Impedances

The signals used to connect devices on today's circuit boards often have fast edge rates. In order to prevent signals from reflecting back, it is necessary to apply appropriate terminating resistors to the FPGA input and output pins.
In the past, resistors were applied as discrete components (outside the FPGA). Today's FPGAs allow the use of internal terminating resistors whose value can be configured by the user.
Spartan 3 Family Attributes
FPGA Nomenclature
Spartan-3 FPGA Family Members
2001 Virtex-II FPGA Family

Virtex-II FPGA introduced, followed by Virtex-II Pro in 2003
444 18x18 Multipliers & 18 kbit block RAMs introduced
Gbit Serial I/O Communications & PowerPC Processors introduced
Complex Floating Point Algorithm Implementation now possible
Virtex-II / Pro
44,000 Logic Slices
444 18 Kbit BRAMs
444 18x18 Multipliers
2 PowerPC Processors
20 Gbit I/O
1164 Max User I/O
Virtex II Pro Floorplan

Up to 16 serial transceivers
622 Mbps to 3.125 Gbps
1 to 4 PowerPCs
4 to 16 multi-gigabit transceivers
12 to 216 multipliers
3,000 to 50,000 logic cells
200k to 4M bits RAM
204 to 852 I/Os
[Figure: floorplan showing the PowerPCs and the logic cells]
Virtex-II Pro (Selection)
Embedded Processor Cores (Hard and Soft)

The majority of designs make use of microprocessors.
These appeared as discrete devices on the circuit board.
Lately, high-end FPGAs have become available that contain one or more embedded microprocessors (referred to as microprocessor cores).
There are two types of cores:
A hard microprocessor core is implemented as a dedicated predefined block (two approaches)
A soft microprocessor core is implemented by configuring a group of programmable logic blocks to act as a microprocessor.
Embedded Core (Inside)

Xilinx and Altera tend to embed one or more microprocessor cores directly into the main FPGA fabric (PowerPC).
In this case the design tools have to be able to take account of the presence of these blocks in the fabric (any memory used by the core is formed from the embedded RAM blocks).
The main advantage of this scheme is the inherent speed advantage to be gained from having the processor core in intimate proximity to the FPGA fabric.
Soft Core

As opposed to embedding a microprocessor physically into the fabric of the chip, it is possible to configure a group of programmable logic blocks to act as a microprocessor.
Soft cores are simpler (more primitive) and slower than their hard-core counterparts.
ADVANTAGE?
1.
2.
The main advantage of this scheme is that the user need only implement a core if he/she needs it. Also, the user can instantiate as many cores as they require until they run out of resources!
Virtex Architectures
Built for high-performance applications
Other families include
Virtex-II Pro
Virtex-4
Virtex-5
Latest family is Virtex-6
Virtex-II Pro Architecture

Contains embedded Processors and Multi-Gigabit Transceivers
High performance True Dual-port RAM - 8 Mb
SelectIO-Ultra Technology - 1164 I/O
Advanced FPGA Logic - 99k logic cells
XtremeDSP Functionality - Embedded multipliers
RocketIO and RocketIO X High-speed Serial Transceivers - 622 Mbps to 3.125 Gbps
PowerPC Processors, 400+ MHz Clock Rate - 2
XCITE Digitally Controlled Impedance - Any I/O
DCM Digital Clock Management - 12
130 nm, 9 layer copper in 300 mm wafer technology
Virtex-4 Family
Advanced Silicon Modular BLock (ASMBL) Architecture
Optimized for logic, Embedded, and Signal Processing
Resource | LX | FX | SX
Logic | 14K to 200K LCs | 12K to 140K LCs | 23K to 55K LCs
Memory | 0.9 to 6 Mb | 0.6 to 10 Mb | 2.3 to 5.7 Mb
DCMs | 4 to 12 | 4 to 20 | 4 to 8
DSP Slices | 32 to 96 | 32 to 192 | 128 to 512
SelectIO | 240 to 960 | 240 to 896 | 320 to 640
RocketIO | N/A | 0 to 24 Channels | N/A
PowerPC | N/A | 1 or 2 Cores | N/A
Ethernet MAC | N/A | 2 or 4 Cores | N/A
Virtex-4 Architecture
RocketIO Multi-Gigabit Transceivers: 622 Mbps to 10.3 Gbps
Smart RAM: New block RAM/FIFO
Advanced CLBs: 200K Logic Cells
Xesium Clocking Technology: 500 MHz
Tri-Mode Ethernet MAC: 10/100/1000 Mbps
XtremeDSP Technology Slices: 256 18x18 GMACs
PowerPC 405 with APU Interface: 450 MHz, 680 DMIPS
1 Gbps SelectIO: ChipSync Source synch, XCITE Active Termination
Virtex-5 Family
Optimized for logic, Embedded, Signal Processing, and High-Speed Connectivity
Virtex-5 Platforms
Platform | Optimized for
LX | Logic
LXT | Logic/Serial
SXT | DSP/Serial
FXT | Emb./Serial
Resources compared across platforms: Logic, On-chip RAM, DSP Capabilities, Parallel I/Os, Serial I/Os, PowerPC Processors
Virtex-5 Architecture
Enhanced
36Kbit Dual-Port Block RAM / FIFO with Integrated ECC
550 MHz Clock Management Tile with DCM and PLL
SelectIO with ChipSync Technology and XCITE DCI
Advanced Configuration Options
25x18 DSP Slice with Integrated ALU
RocketIO Transceiver Options
Low-Power GTP: Up to 3.75 Gbps
High-Performance GTX: Up to 6.5 Gbps
Tri-Mode 10/100/1000 Mbps Ethernet MACs
New
Most Advanced High-Performance Real 6LUT Logic Fabric
PCI Express Endpoint Block
System Monitor Function with Built-in ADC
Next Generation PowerPC Embedded Processor
The Spartan-3 Family
Built for high-volume, low-cost applications
18x18 bit Embedded Pipelined Multipliers for efficient DSP
Configurable 18K Block RAMs + Distributed RAM
Up to eight on-chip Digital Clock Managers to support multiple system clocks
4 I/O Banks, Support for all I/O Standards including PCI, DDR333, RSDS, mini-LVDS
[Figure: Spartan-3 device with I/O Banks 0 to 3]
Spartan-3 Family
Based upon Virtex-II Architecture
Optimized for Lower Cost
Smaller process = lower core voltage
0.09 micron versus 0.15 micron
Vccint = 1.2V versus 1.5V
Logic resources
Only one-half of the slices support RAM or SRL16s (SLICEM)
Fewer block RAMs and multiplier blocks
Clock Resources
Fewer global clock multiplexers and DCM blocks
I/O Resources
Fewer pins per package
No internal 3-state buffers
Support for different standards
New standards: 1.2V LVCMOS, 1.8V HSTL, and SSTL
Default is LVCMOS, versus LVTTL
SLICEM and SLICEL

Each Spartan-3 CLB contains four slices
Similar to the Virtex-II
Slices are grouped in pairs
Left-hand SLICEM (Memory)
LUTs can be configured as memory or SRL16
Right-hand SLICEL (Logic)
LUT can be used as logic only
[Figure: CLB with a switch matrix; left-hand SLICEM slices X0Y0 and X0Y1, right-hand SLICEL slices X1Y0 and X1Y1; CIN/COUT carry chains, SHIFTIN/SHIFTOUT, and fast connects]
Multiple Domain-optimized Platforms
Spartan-3E Features
More gates per I/O than Spartan-3
Removed some I/O standards
Higher-drive LVCMOS
GTL, GTLP
SSTL2_II
HSTL_II_18, HSTL_I, HSTL_III
LVDS_EXT, ULVDS
16 BUFGMUXes on left and right sides
Drive half the chip only
In addition to eight global clocks
Pipelined multipliers
Additional configuration modes
SPI, BPI
Multi-Boot mode
DDR Cascade
Internal data is presented on a single clock edge
Spartan-3A DSP Features

Increased amount of block memory (BRAM)
1512K for the S3A1800 vs 648K for the S3E1600
More XtremeDSP DSP48A slices
Replaces Embedded multiplier of Spartan-3E
3400A: 126 DSP48As
1800A: 84 DSP48As
Spartan-3A DSP
Tuning DSP Performance
Integrated XtremeDSP Slice
Application optimized capacity
Integrated pre-adder optimized for filters
250 MHz operation, standard speed grade
Compatible with Virtex-DSP
XtremeDSP DSP48A Slice
Increased memory capacity and performance
Also important for embedded processing, complex IP, etc.
DSP48 Comparison
Function | DSP48 | DSP48E | DSP48A | Benefit
Multiplier | 18 x 18 | 25 x 18 | 18 x 18 | Reduces FPGA resource needs for DSP algorithms.
Pre-Adder | No | No | Yes | Reduces the critical path timing in FIR filter applications for better performance. Important in FIR filter construction.
Cascade Inputs | One | Two | One | Enables fast data path chaining of DSP48 blocks for larger filters.
Cascade Output | Yes | Yes | Yes | Enables fast data path chaining of DSP48 blocks for larger filters.
Dedicated C input | No | Yes | Yes | The C input supports many 3-input mathematical functions, such as 3-input addition and 2-input multiplication with a single addition, and the very valuable rounding of multiplication away from zero.
Adder | 3 input 48 bit | 3 input 48 bit | 2 input 48 bit | Supports simple add and accumulate functions.
Dynamic Opmodes | Yes | Yes | Yes | One DSP48 can provide more than one function: multiply, multiply-add, multiply-accumulate, etc.
ALU Logic Functions | No | Yes | No | Similar to the ALU of a microprocessor. Enables the selection of ALU function on a clock cycle basis. Enables multiple functions to be selected (Add, Subtract, or Compare).
Pattern Detect | No | Yes | No | This feature supports convergent rounding, underflow/overflow detection for saturation arithmetic, and auto-resetting counters/accumulators.
SIMD ALU Support | No | Yes | No | Enables parallel ALU operations on multiple data sets.
Carry Signals | Carry In | Carry In & Out | Carry In & Out | Supports fast carry functions between DSP blocks. Often a speed limiting path.
Spartan-3A Device Table

Feature | Spartan-3A XC3S1400A | Spartan-3A DSP XC3SD1800A | Spartan-3A DSP XC3SD3400A
XtremeDSP DSP48A Slices / Dedicated Multipliers | 32 | 84 DSP48As | 126 DSP48As
Block RAM Blocks | 32 | 84 | 126
Block RAM (Kb) | 576 | 1,512 | 2,268
Distributed RAM (Kb) | 176 | 260 | 373
FFs/LUTs | 22,528 | 33,280 | 47,744
Logic Cells | 25,344 | 37,440 | 53,712
DCMs | 8 | 8 | 8
Max Diff I/O Pairs | 227 | 227 | 213
CS484 19x19mm (0.8mm pitch) | - | 309 | 309
*FG676 27x27mm (1.0mm pitch) | 502 | 519 | 469
Latest Families
Architecture Alignment
Virtex-6 FPGAs (760K Logic Cell Device): FIFO Logic, Tri-mode EMAC, System Monitor
Spartan-6 FPGAs (150K Logic Cell Device): Hardened Memory Controllers, 3.3 Volt compatible I/O
Common Resources: LUT-6 CLB, BlockRAM, DSP Slices, High-performance Clocking, Parallel I/O, HSS Transceivers*, PCIe Interface
*Optimized for target application in each family
Enables IP Portability, Protects Design Investments
Addressing the Broad Range of Technical Requirements

Spartan-6 LX: Lowest cost logic + DSP
Spartan-6 LXT: Lowest cost logic + high-speed serial
Virtex-6 LXT: High logic density + serial connectivity
Virtex-6 SXT: DSP + logic + serial connectivity
Virtex-6 HXT: Ultra high-speed serial connectivity + logic
[Figure: families plotted by market size across application market segments, plus 100s more]
Designers Eccentrics
Higher System Performance
More design margin to simplify designs
Higher integrated functionality
Lower System Cost
Reduce BOM
Implement design in a smaller device & lower speed grade
Lower Power
Help meet power budgets
Eliminate heat sinks & fans
Prevent thermal runaway
Virtex-6 Family
Virtex Product & Process Evolution

Virtex (220-nm)
Virtex-E (180-nm)
Virtex-II (150-nm)
Virtex-II Pro (130-nm)
Virtex-4 (90-nm)
Virtex-5 (65-nm)
Virtex-6 (40-nm)
[Figure: timeline running from the 1st generation to the 6th generation]
Delivering Balanced Performance, Power, and Cost
Strong Focus on Power Reduction

Lower Overall Power
Static Power Reduction: Higher distribution of low leakage transistors
Dynamic Power Reduction: Reduced capacitance through device shrink
Reduced Core Voltage Devices: VCCINT = 0.9V option allows power / performance tradeoff
I/O Power Improvements: Dynamic termination
System Monitor: Allows sophisticated monitoring of temperature and voltage
Up to 50% Power Reduction vs. Previous Generation

Virtex-6 Logic Fabric

Virtex-6 Configurable Logic Block (CLB)
Each CLB contains two slices
Each slice contains four 6-input Lookup Tables (6LUT)
Slices implement logic functions (slice_l)
Slices for memories and shift registers (slice_m)
LUT6 implements
All functions of up to 6 variables
Two functions of up to 5 variables each
Shift registers up to 32 stages long
Memories of 64 bits
Power Consumption Benefits
Shift register mode greatly reduces power consumption over FF implementation
Performance Benefits
Multiple configurations within a slice
Increased ratio of slice_m memories available closer to the source or target logic
Cost Benefits
Can pack logic and memory functions more efficiently
[Figure: CLB containing two slices, each with four LUTs]
Higher DSP Performance

Most advanced DSP architecture
New optional pre-adder for symmetric filters 25x18 multiplier
High resolution filters Efficient floating point support
ALU-like second stage enables mapping of advanced operations

Programmable op-code SIMD support Addition / Subtraction / Logic functions
Pattern detector
Lowest power consumption Highest DSP slice capacity

Up to 2K DSP Slices
Virtex-6 LXT / SXT FPGAs
Spartan-6 Family
Spartan-6
Next Generation 45nm Spartan Family
Increased performance & density
Evolutionary feature enhancements
Dramatic cost & power reductions
Two Silicon Platforms
LX: Cost optimized Logic, Memory
LXT: LX features plus High-Speed Serial Connectivity
More unified & integrated with Virtex
Delivering the Optimal Balance of Cost, Power & Performance
Spartan-6 Logic Evolution

Higher Performance, Increased Utilization
Modified Virtex 6-input LUT
4 additional flip-flops per slice
Higher utilization for register-intensive designs
NEW Efficient Design
Spartan-3A Series & Earlier: LUT / FF Pair, 4LUT
Spartan-6: LUT / Dual FF Pair, 6LUT
Efficient & Capable
Logic
Arithmetic functions
Distributed RAM & shift registers
Interconnect
Up to 25% Higher Performance
Great General-Purpose Logic
6-input LUT & 2nd Flip-flop for Higher Utilization
Spartan-6 CLB Logic Slices

Slice type | Share of slices | Resources
SliceM | 25% | LUT6, 8 Registers, Carry Logic, Wide Function Muxes, Distributed RAM / SRL logic
SliceL | 25% | LUT6, 8 Registers, Carry Logic, Wide Function Muxes
SliceX | 50% | LUT6 (Optimized for Logic), 8 Registers
Slice mix chosen for the optimal balance of Cost, Power & Performance
Spartan-6 Lowest Total Power

Static power reductions
Process & architectural innovations
Dynamic power reduction

Lower node capacitance & architectural innovations
More hard IP functionality

Integrated transceivers & other logic reduces power
Hard IP uses less current & power than soft IP
Lower IO power
Low power option -1L reduces power even further
Fewer supply rails reduces power
Spartan-6 Hard Memory Controller

New Hard Block Memory Controller
Up to 4 controllers per device
Why a Hard Memory Block?

Very common design component
Multiple customer benefits
Customer Request | Spartan-6 Hard Block Memory Controller Benefit
Higher performance | Up to 800 Mbps
Lower cost | Saves soft logic, smaller die
Lower power | Dedicated logic
Easier designs | Timing closure no longer an issue; Configurable MultiPort user interface; CoreGen/MIG wizard & EDK support
Memory Controller
Only low cost FPGA with a hard memory controller G Guaranteed memory interface performance providing t d i t f f idi
Reduced engineering & board design time DDR, DDR2, DDR3 & LP DDR support Up to 12.8Mbps bandwidth for each memory controller
Automatic calibration features M lti t structure f user i t f Multiport t t for interface

Six 32-bit programmable ports from fabric
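As a rough check on the bandwidth figure, peak bandwidth is the per-pin data rate multiplied by the interface width. The sketch below assumes the 800 Mb/s figure is per data pin and uses the 4-, 8- and 16-bit device widths the controller supports; the function name is illustrative only.

```python
def peak_bandwidth_gbps(rate_mbps_per_pin: float, width_bits: int) -> float:
    """Peak interface bandwidth in Gb/s: per-pin data rate times bus width."""
    return rate_mbps_per_pin * width_bits / 1000.0


# Assumption: 800 Mb/s is the per-pin rate; widths are the supported
# memory device widths (4, 8 or 16 bits).
for width in (4, 8, 16):
    print(f"x{width}: {peak_bandwidth_gbps(800, width):.1f} Gb/s")
```

With a x16 interface this gives 12.8 Gb/s per controller.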
Figure: Spartan-6 with external memory types (DRAM, SRAM, FLASH, EEPROM); DRAM types shown: DDR, DDR2, DDR3, LP DDR
Controller interface to 4-, 8- or 16-bit memory devices
Integrated DSP Slice

250 MHz implementation
Fast multiplier & 48-bit adder
ASIC-like performance
XtremeDSP DSP48A1 Slice
Input and output registers for higher speed
Optimizes FIR filter applications
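To illustrate why a fast multiplier feeding a 48-bit adder suits FIR filters, here is a minimal behavioral sketch of a multiply-accumulate loop. It is not a model of the actual DSP48A1 datapath: the 18-bit signed operand width is an assumption, and the function names are hypothetical.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer into the signed two's-complement range of `bits` bits."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def fir_mac(samples, coeffs, operand_bits=18, acc_bits=48):
    """One FIR output: sum of sample*coeff products kept in a 48-bit accumulator."""
    acc = 0
    for x, c in zip(samples, coeffs):
        # Assumption: both multiplier operands are 18-bit signed values.
        product = wrap_signed(x, operand_bits) * wrap_signed(c, operand_bits)
        acc = wrap_signed(acc + product, acc_bits)
    return acc


print(fir_mac([1, 2, 3, 4], [4, 3, 2, 1]))  # 4 + 6 + 6 + 4 = 20
```

The wide accumulator is the point: many 36-bit products can be summed before overflow becomes a concern.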
Better, More BRAM

More Block RAMs
2x higher BRAM to Logic Cell ratio than Spartan-3A platform
More port flexibility
  An 18K BRAM can be split into two 9K BRAM blocks that are independently addressed
  Figure: one 18K BRAM OR two 9K BRAMs
Improves buffering, caching & data storage

Excellent for embedded processing, communication protocols
Enables DSP blocks to provide more efficient video and surveillance algorithms
Lower Static Power

Compare to Spartan-3A
Twice the Capabilities, Half the Power, Hard Blocks!
Feature                     | Extended Spartan-3A (90nm)   | Spartan-6 (45nm)
----------------------------------------------------------------------------------------
Logic Cells                 | Up to 55K                    | Up to 150K
LUT Design                  | 4-input LUT + FF             | 6-input LUT + 2FF
Block RAM (Mbit)            | Up to 2 Mbit                 | Up to 5 Mbit
Transceiver Count / Speed   | no                           | Up to 8 / Up to 3.125 Gbps
Voltage Scaling             | No (1.2V only)               | Yes (1.2V, 1.0V)
Static Power (typ mW)       | 11 mW (smallest density)     | Up to 60% less!
Memory Interface            | 400 Mbps                     | DDR3 800 Mbps
Max Differential IO         | 640 Mbps                     | 1050 Mbps
Multipliers/DSP             | Up to 126 Multipliers / DSP  | Up to 184 DSP48 Blocks
Memory Controllers          | no                           | Up to 4 Hard Blocks
Clock Management            | DCM Only                     | DCM & PLL
PCI Express Endpoint        | no                           | Yes, Gen 1
Security                    | Device DNA Only              | Device DNA & AES
Spartan-6 LX / LXT FPGAs
** All memory controllers support a x16 interface, except in the CS225 package, where only x8 is supported
FPGA Design Flow
Design process (1)

Specification
Design and implement a simple unit that speeds up encryption with an RC5-like cipher, using a fixed key set, on an 8031 microcontroller. Unlike in experiment 5, this time your unit must be able to perform the encryption algorithm by itself, executing 32 rounds.
VHDL description (Your VHDL Source Files)

library IEEE;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity RC5_core is
  port(
    clock, reset, encr_decr: in  std_logic;
    data_input:              in  std_logic_vector(31 downto 0);
    data_output:             out std_logic_vector(31 downto 0);
    out_full:                in  std_logic;
    key_input:               in  std_logic_vector(31 downto 0);
    key_read:                out std_logic
  );
end RC5_core;
Functional simulation
Synthesis
Post-synthesis simulation
Design process (2)

Implementation (Mapping, Placing & Routing)
Timing simulation
Configuration
On-chip testing
Design Process control from Active-HDL
Logic Synthesis
VHDL description
architecture MLU_DATAFLOW of MLU is
  signal A1: STD_LOGIC;
  signal B1: STD_LOGIC;
  signal Y1: STD_LOGIC;
  signal MUX_0, MUX_1, MUX_2, MUX_3: STD_LOGIC;
begin
  A1 <= A when (NEG_A='0') else not A;
  B1 <= B when (NEG_B='0') else not B;
  Y  <= Y1 when (NEG_Y='0') else not Y1;
  MUX_0 <= A1 and B1;
  MUX_1 <= A1 or B1;
  MUX_2 <= A1 xor B1;
  MUX_3 <= A1 xnor B1;
  with (L1 & L0) select
    Y1 <= MUX_0 when "00",
          MUX_1 when "01",
          MUX_2 when "10",
          MUX_3 when others;
end MLU_DATAFLOW;
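For functional simulation it can help to have a software reference model to compare against. The following is a hypothetical Python model of the MLU dataflow, written only from the VHDL architecture above; it assumes every input is a single bit given as 0 or 1 and is not part of any tool flow.

```python
def mlu(a, b, neg_a, neg_b, neg_y, l1, l0):
    """Bit-level reference model of the MLU dataflow (all arguments are 0 or 1)."""
    a1 = a ^ neg_a                      # A1 <= A when NEG_A='0' else not A
    b1 = b ^ neg_b                      # B1 <= B when NEG_B='0' else not B
    mux = [a1 & b1,                     # "00": and
           a1 | b1,                     # "01": or
           a1 ^ b1,                     # "10": xor
           1 - (a1 ^ b1)]               # others: xnor
    y1 = mux[(l1 << 1) | l0]
    return y1 ^ neg_y                   # Y <= Y1 when NEG_Y='0' else not Y1


# Print the output of each operation for A=1, B=0 with no inversions.
for l1, l0, name in [(0, 0, "and"), (0, 1, "or"), (1, 0, "xor"), (1, 1, "xnor")]:
    print(name, mlu(1, 0, 0, 0, 0, l1, l0))
```

Sweeping all 128 input combinations of such a model gives an expected-output table for a testbench.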
Circuit netlist
Synthesis Tools
XST
and others
Features of synthesis tools
Interpret RTL code
Synplify Pro: Produces synthesized circuit netlist in a standard EDIF (.edf) format
  Can optionally produce a .VHM (VHDL code merged into one) file for post-synthesis simulation
XST: Produces synthesized circuit netlist in NGC format
Netlist is composed of gates in the particular Xilinx implementation library
  http://toolbox.xilinx.com/docsan/xilinx9/books/manuals.pdf has information on libraries
Give preliminary performance estimates
Some can display circuit schematics corresponding to the EDIF netlist
Timing report after synthesis

Performance Summary
*******************
Worst slack in design: -0.924

                 Requested   Estimated   Requested   Estimated             Clock      Clock
Starting Clock   Frequency   Frequency   Period      Period      Slack     Type       Group
------------------------------------------------------------------------------------------------------
exam1|clk        85.0 MHz    78.8 MHz    11.765      12.688      -0.924    inferred   Inferred_clkgroup_0
System           85.0 MHz    86.4 MHz    11.765      11.572      0.193     system     default_clkgroup
======================================================================================================
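The columns of this report are related by simple arithmetic: period in ns is 1000 divided by frequency in MHz, and slack is requested period minus estimated period. A minimal sketch using the report's frequencies follows; because the report prints rounded frequencies, the recomputed values agree with the printed ones only to within a few picoseconds.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def slack_ns(requested_mhz: float, estimated_mhz: float) -> float:
    """Slack = requested period - estimated period; negative means timing is missed."""
    return period_ns(requested_mhz) - period_ns(estimated_mhz)


# Frequencies taken from the two rows of the synthesis timing report.
for name, requested, estimated in [("exam1|clk", 85.0, 78.8), ("System", 85.0, 86.4)]:
    print(f"{name}: requested {period_ns(requested):.3f} ns, "
          f"estimated {period_ns(estimated):.3f} ns, "
          f"slack {slack_ns(requested, estimated):+.3f} ns")
```

A negative slack on exam1|clk means the synthesis estimate does not meet the 85 MHz request.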
Implementation
After synthesis the entire implementation process is performed by FPGA vendor tools
Mapping
Figure: netlist logic mapped onto FPGA primitives (LUT0-LUT5, FF1, FF2)
Placing
FPGA
CLB SLICES
Routing
Programmable Connections
FPGA
Map report header

Release 7.1.03i Map H.41
Xilinx Mapping Report File for Design 'exam1'

Design Information
------------------
Command Line   : c:\Xilinx\bin\nt\map.exe -p 2S200FG256-6 -o map.ncd -pr b -k 4 -cm area -c 100 -tx off exam1.ngd exam1.pcf
Target Device  : xc2s200
Target Package : fg256
Target Speed   : -6
Mapper Version : spartan2 -- $Revision: 1.26.6.4 $
Mapped Date    : Wed Nov 02 11:15:15 2005
Map report
Design Summary
--------------
Number of errors:      0
Number of warnings:    0
Logic Utilization:
  Number of Slice Flip Flops:                       144 out of 4,704    3%
  Number of 4 input LUTs:                           173 out of 4,704    3%
Logic Distribution:
  Number of occupied Slices:                        145 out of 2,352    6%
  Number of Slices containing only related logic:   145 out of   145  100%
  Number of Slices containing unrelated logic:        0 out of   145    0%
  *See NOTES below for an explanation of the effects of unrelated logic
Total Number 4 input LUTs:                          210 out of 4,704    4%
  Number used as logic:                             173
  Number used as a route-thru:                        5
  Number used as 16x1 RAMs:                          32
Number of bonded IOBs:                               74 out of   176   42%
Number of GCLKs:                                      1 out of     4   25%
Number of GCLKIOBs:                                   1 out of     4   25%
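The figures in the map report can be cross-checked by hand. The sketch below recomputes the utilization percentages and the LUT total from the report's own numbers. Truncating to a whole percent is an assumption inferred from this report (173 of 4,704 is about 3.68% but is printed as 3%), not documented tool behavior.

```python
def utilization_percent(used: int, total: int) -> int:
    """Whole-number utilization, truncated (assumption inferred from the report)."""
    return 100 * used // total


resources = {
    "Slice Flip Flops": (144, 4704),
    "4 input LUTs (logic)": (173, 4704),
    "Occupied Slices": (145, 2352),
    "Total 4 input LUTs": (210, 4704),
    "Bonded IOBs": (74, 176),
    "GCLKs": (1, 4),
}

for name, (used, total) in resources.items():
    print(f"{name}: {used} out of {total} = {utilization_percent(used, total)}%")

# Total LUTs = LUTs used as logic + route-thrus + 16x1 RAMs.
lut_total = 173 + 5 + 32
print("Total LUTs:", lut_total)
```

The total LUT count (210) exceeds the logic LUT count (173) because route-thrus and distributed RAMs also consume LUTs.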
Place & route report

Timing Score: 0

Asterisk (*) preceding a constraint indicates it was not met.
This may be due to a setup or hold violation.

--------------------------------------------------------------------------------
  Constraint                                | Requested  | Actual     | Logic
                                            |            |            | Levels
--------------------------------------------------------------------------------
  TS_clk = PERIOD TIMEGRP "clk" 11.765 ns   | 11.765ns   | 11.622ns   | 13
  HIGH 50%                                  |            |            |
--------------------------------------------------------------------------------
  OFFSET = OUT 11.765 ns AFTER COMP "clk"   | 11.765ns   | 11.491ns   | 1
--------------------------------------------------------------------------------
  OFFSET = IN 11.765 ns BEFORE COMP "clk"   | 11.765ns   | 11.442ns   | 2
--------------------------------------------------------------------------------
Post layout timing report

Timing summary:
---------------
Timing errors: 0  Score: 0
Constraints cover 42912 paths, 0 nets, and 1038 connections

Design statistics:
  Minimum period: 11.622ns (Maximum frequency: 86.044MHz)
  Minimum input required time before clock: 11.442ns
  Minimum output required time after clock: 11.491ns
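The maximum frequency in this report is the reciprocal of the minimum period, and the design meets timing when the minimum period does not exceed the constraint. A short sketch using the reported values (11.622 ns achieved against the 11.765 ns PERIOD constraint); the function names are illustrative.

```python
def max_frequency_mhz(min_period_ns: float) -> float:
    """Maximum clock frequency in MHz for a given minimum period in ns."""
    return 1000.0 / min_period_ns


def meets_constraint(requested_ns: float, actual_ns: float) -> bool:
    """A PERIOD constraint is met when the achieved period is no longer than requested."""
    return actual_ns <= requested_ns


print(f"Maximum frequency: {max_frequency_mhz(11.622):.3f} MHz")
print("Constraint met:", meets_constraint(11.765, 11.622))
print(f"Margin: {11.765 - 11.622:.3f} ns")
```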
Post-place-and-route simulation
After place-and-route is performed, post-place-and-route simulation can be done
Now have real timing information!
Can also do static timing analysis: shows the worst-case critical path in the circuit
Configuration
Once a design is implemented, you must create a file that the FPGA can understand
This file is called a bitstream: a BIT file (.bit extension)
The BIT file can be downloaded directly to the FPGA, or can be converted into a PROM file which stores the programming information
Configuration of SRAM based FPGAs
System Gates vs. Real Gates

One common metric used to measure the size of a device in the ASIC world is that of equivalent gates (e-gates)
Convention used:
  A 2-input NAND function represents one equivalent gate
  An equivalent gate consists of an arbitrary number of transistors
Different vendors provide different functions in their cell libraries, where each implementation of each function requires a different number of transistors (difficult to compare capacity/complexity)
Solution: Assign each function an equivalent gate value and sum all these values
How can we establish a basis for comparison between FPGAs and ASICs?
Can an ASIC of 500,000 equivalent gates that needs to be migrated into an FPGA fit into a particular FPGA?
FPGAs: System Gates

System Gates: A 4-input LUT can be used to represent anywhere between one and more than twenty 2-input primitive logic gates.
Rule of thumb?
  Divide the system gates value by three, so three million FPGA system gates would equate to roughly one million ASIC equivalent gates
However, to compare two different implementations on an FPGA (e.g., floating-point adder vs. fixed-point adder), designers should use the resources available in an FPGA:
  Number of 4-input LUTs used
  Number of embedded multipliers
  Number of embedded RAM blocks
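The rule of thumb above can be applied in both directions. The sketch below uses the divide-by-three factor from the slide; it is only a rough estimate, and the function names are illustrative.

```python
SYSTEM_GATES_PER_ASIC_GATE = 3  # rule-of-thumb factor from the slide


def asic_equivalent_gates(fpga_system_gates: int) -> int:
    """Estimate ASIC equivalent gates from an FPGA system-gate count."""
    return fpga_system_gates // SYSTEM_GATES_PER_ASIC_GATE


def system_gates_needed(asic_gates: int) -> int:
    """Estimate the FPGA system-gate capacity needed to hold an ASIC design."""
    return asic_gates * SYSTEM_GATES_PER_ASIC_GATE


print(asic_equivalent_gates(3_000_000))  # 1000000
print(system_gates_needed(500_000))      # 1500000
```

By this estimate, the 500,000 equivalent-gate ASIC would need an FPGA rated at roughly 1.5 million system gates.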
State-of-the-Art FPGAs

65-90 nm process on 300 mm wafers
  Lower cost per function (LUT + register)
  Smaller and faster transistors
Higher speed: system speed up to 500 MHz
  Mainly through smart interconnects, clock management, dedicated circuits, flexible I/O
  Integrated transceivers running at 10 Gigabits/sec
More Logic and Better Features:
  >100,000 LUTs & flip-flops
  >200 embedded RAMs, and the same number of 18 x 18 multipliers
  1156 pins (balls) with >800 GP I/O
  50 I/O standards, incl. LVDS with internal termination
  16 low-skew global clock lines
  Multiple clock management circuits
  On-chip microprocessor(s) and multi-Gbps transceivers
Latest Devices: Capacity & Features

Xilinx Virtex-5
65nm process
Up to 960 I/Os
>200000 logic cells
Up to 552 18kb block RAMs (~10Mb RAM)
450 DSP slices (18x18 multiplier-accumulator)
20 digital clock managers (DCM)
24 high-speed serial transceivers (622Mb/s to 11.1Gb/s)
Up to four PowerPC 405 cores
Altera Stratix-II
90nm process
Up to 1170 I/Os
179000 logic elements
9.6Mb embedded RAM
96 DSP blocks: 380 18x18 multipliers
12 PLLs
Serial I/O up to 1Gb/s
No hard processor cores
FPGAs Becoming More Attractive
Figure: capacity, speed and price by year, 1/91 to 1/99: 21x bigger, 5.5x faster, 50x less expensive
Source: Xilinx
FPGA Shortcomings

Circuit Delay: Delay increases due to programmable switches in the FPGA routing architecture
Area: Configuration cells and programmable resources incur a substantial area penalty
Power: Typically not suited for low-power applications
Figure: FPGA vs. ASIC on performance, cost and time to market, marking where FPGAs need to improve
Conclusion
FPGAs are the main enabler of Reconfigurable Computing Systems (RCS)
FPGAs fill the gap between Instruction Set Processors (GPs) and ASICs
  Advantages: flexible, programmable
  Disadvantages: power dissipation, performance w.r.t. ASICs
Applicability of FPGAs relies on CAD tools provided by different vendors such as Xilinx and Altera
RCS can be realized with several technologies:
  FPGAs: fine/medium grain
  Coarse-Grain Reconfigurable Architectures: CGRAs