Unit 3 and 4 PPts of RC2
BY
Prof. K. R. Saraf
Qn. Compare the architectures and capabilities of
ASIC, PDSP, GPP, FPGA, and memory 10
Qn. Write a note on:
1. RALU (6) page 41 R.C. Bobada
2. Limitations of current FPGAs (6)
3. Area consumption by different architectural building blocks in
a reconfigurable device. (6)
4. Weak upper bound and interconnect (6) page no. 104 General purpose
computing by Andrew Dehon
5. Bus-based Communication (6) page no. 203 Bobada
6. Direct Communication (6) page no. 201 Bobada
7. Circuit Switching (6) page no. 203 Bobada
8. The Dynamic Network on Chip (DyNoC) (6) page no. 219 Bobada
9. Comparison of DPGA with FPGA (6)
10. MATRIX (6) page no. 273 General purpose computing by Andrew Dehon
1. RALU (6) page 41 R.C. Bobada
• The Xputer concept was presented in the early 1980s by Reiner Hartenstein, a
researcher at the University of Kaiserslautern in Germany.
• The goal of the Xputer was to provide a very high degree of programmable parallelism
in the hardware, at the lowest possible level, to obtain performance not possible with
von Neumann computers.
• Instead of sequencing instructions, the Xputer sequenced data, thus
exploiting the regularity in the data dependencies of some classes of applications,
such as image processing, where repetitive processing is performed on a large
amount of data.
• An Xputer consists of three main parts: the data sequencer, the data memory
and the reconfigurable ALU (rALU), which permits the run-time configuration of
communication at levels below the instruction set level.
• Within a loop, the data to be processed were accessed via a data structure called a
scan window. Data manipulation was done by the rALU, which had access to many
scan windows.
• The most essential part of the data sequencer was the generic address generator
(GAG), which was able to produce address sequences corresponding to the data of
up to three nested loops.
• An rALU subnet that could be configured to perform all computations on the data
of a scan window was required for each level of a nested loop.
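The data-sequencing idea above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the (count, stride) loop description, the one-dimensional scan window and the use of a plain sum in place of the rALU operation are all hypothetical and are not taken from the Xputer design.

```python
from typing import Iterator, List, Tuple


def generic_address_generator(base: int, loops: List[Tuple[int, int]]) -> Iterator[int]:
    """Yield one address per iteration of up to three nested loops.

    `loops` lists (iteration_count, address_stride) pairs, outermost loop first.
    """
    if not 1 <= len(loops) <= 3:
        raise ValueError("this sketch handles one to three nested loops")

    def walk(level: int, address: int) -> Iterator[int]:
        if level == len(loops):
            yield address
            return
        count, stride = loops[level]
        for i in range(count):
            yield from walk(level + 1, address + i * stride)

    return walk(0, base)


def scan_window(memory: List[int], address: int, width: int) -> List[int]:
    """Return the words covered by a window of `width` words starting at `address`."""
    return memory[address:address + width]


# A 2-row image stored row-major with 8 words per row (assumed layout).
image = list(range(16))
# Outer loop: 2 rows, stride 8. Inner loop: 3 window positions, stride 2.
addresses = list(generic_address_generator(base=0, loops=[(2, 8), (3, 2)]))
# The sum stands in for whatever operation the rALU would be configured to perform.
window_sums = [sum(scan_window(image, a, 3)) for a in addresses]
```

The address stream, not an instruction stream, drives the computation here: the same operation is applied to every window the generator visits.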
4. Weak upper bound and interconnect page
no. 104 General purpose computing by Andrew Dehon
6. Direct Communication (6) page no. 201 Bobada
• The direct communication paradigm allows modules placed on the chip to
communicate using dedicated physical channels, configured at compile-time.
• The configuration of the channels remains until the next full reconfiguration of
the device.
• A configuration defines the set of physical lines to be used, their direction, their
bandwidth and speed, as well as the terminals, i.e. the components that are
connected by the lines.
• Components must be designed and placed on the device in such a way that their
ports can be connected to the predefined terminals.
• Feed-through channels must also be available in each component so that signals
used by modules beside the component can cross it.
• Example: The configuration in the figure provides an example of 1-D
communication using direct lines.
• For this purpose, a set of 10 predefined channels is fixed, and the components must
adapt their position and direction to make use of the channels.
• The physical channels 1, 5, 6 and 7 are not used.
• Line 3 is used by module C2 for connection with the device pins on the left and
right sides.
• It is fed through component C1 that provides the necessary channel for the
signal to cross.
• Additionally, the two components use the same line to access the device pins.
• Lines 8 and 10 are used by component C5 to access the device pins on the left
and right sides.
• The main disadvantage of this approach is the restriction imposed on the design
of components.
• For each component, dedicated channels must be provided so that signals that
are not used by this component at its placement location can cross it.
• Also, the placement algorithm must deal with additional restrictions like the
availability of signals in a given location.
• This increases its complexity and makes the approach only possible for an
offline temporal placement, where all the configurations can be defined and
implemented at compile-time.
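The feed-through requirement can be made concrete with a small Python sketch. The `Component` class, the `reaches_pin` helper and the left-to-right layout with C1 placed before C2 are assumptions made for illustration; only the use of line 3 by C1 and C2 comes from the example above.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Component:
    name: str
    ports: Set[int] = field(default_factory=set)          # lines the component connects to
    feed_throughs: Set[int] = field(default_factory=set)  # lines it lets cross untouched


def reaches_pin(row: List[Component], index: int, line: int, side: str) -> bool:
    """True if row[index] can reach the device pin of `line` on the given side."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    if line not in row[index].ports:
        return False
    # Every component between the source and the device edge must pass the line on.
    between = row[:index] if side == "left" else row[index + 1:]
    return all(line in component.feed_throughs for component in between)


# Assumed 1-D layout: C1 sits to the left of C2, and both use line 3.
c1 = Component("C1", ports={3}, feed_throughs={3})
c2 = Component("C2", ports={3})
row = [c1, c2]

c2_left = reaches_pin(row, 1, 3, "left")    # crosses C1 through its feed-through
c2_right = reaches_pin(row, 1, 3, "right")  # nothing in the way on the right
```

Dropping the feed-through from C1 makes the left-hand pin unreachable for C2, which is the design restriction the section identifies as the main disadvantage.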
5. Bus-based Communication (6) page no. 203 Bobada
• Communication over a third party is used, for example, by Brebner and by Walder et
al.
• Each message is first sent to the central module, which forwards it to the
destination.
• The module inputs and outputs are controlled by registers that are mapped into
the address space of the processor.
• This approach can be used not only to allow the communication between a
reconfigurable module and a user program but also between several
reconfigurable modules connected together through a bus.
• Bus-based Communication
• The communication between the reconfigurable modules on a given device can
also be done using a common bus.
• However, the additional delay introduced by bus arbitration can drastically
affect the performance of the system.
7. Circuit Switching (6) page no. 203 Bobada
• Switches are available at the column and line intersections to allow longer
connections to be built from the vertical and horizontal lines that meet at an
intersection point.
• Circuit routing was used in several systems, such as YUPPIE, in the
1980s.
• Long communication delay: this happens if the connection between two
processing elements has to go through many other processors.
• The number of switches on the communication path then increases,
increasing the communication delay and therefore slowing down the clock.
• The synthesis process constrains a module to a given area of the device, where it
uses all the resources, i.e. the processing elements and the interconnects in that area.
• As a consequence, placing a module at run-time in a region through which a route
runs will destroy the route, because the connections used by the
route will be assigned to the component.
• To avoid this, components can be placed only at locations where they do not destroy
any route, a restriction that, however, increases chip fragmentation.
• Consider the placement in the figure, with three routes connecting PE1 to PE2, PE3
to PE4 and PE5 to PE6.
• For a new component that needs four adjacent processing elements arranged in a
square, the region where the routes are implemented must be prohibited.
• No region that can accommodate the new component is therefore left on the device,
and the component must be rejected, although enough resources are available on the
device.
• In devices that allow only 1-D placement, such as Xilinx Virtex-II
FPGAs, which are reconfigurable only column-wise, circuit switching can be
used to connect a small number of modules, usually 2–8, and allows dynamic
communication to be established between the components running on the
device at run-time.
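The fragmentation effect described for 2-D placement can be reproduced with a short Python sketch. The 4x4 array and the three route shapes below are hypothetical and do not reproduce the figure; they are chosen only so that the routes leave plenty of free processing elements but no free 2x2 square.

```python
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]


def find_square(rows: int, cols: int, blocked: Set[Cell], side: int) -> Optional[Cell]:
    """Return the top-left cell of the first free side x side square, or None."""
    for r in range(rows - side + 1):
        for c in range(cols - side + 1):
            cells = {(r + dr, c + dc) for dr in range(side) for dc in range(side)}
            if not cells & blocked:
                return (r, c)
    return None


ROWS, COLS = 4, 4
# Hypothetical routes, each given as the cells whose interconnect it occupies.
routes: List[List[Cell]] = [
    [(1, 0), (1, 1), (1, 2), (1, 3)],  # horizontal route across row 1
    [(2, 1), (3, 1)],                  # vertical route in column 1
    [(2, 3), (3, 3)],                  # vertical route in column 3
]
blocked = {cell for route in routes for cell in route}

free_cells = ROWS * COLS - len(blocked)
placement = find_square(ROWS, COLS, blocked, side=2)
```

With these routes, `free_cells` is twice the four elements the component needs, yet `placement` is `None`: the component is rejected because of where the routes run, not because resources are missing.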
8. The Dynamic Network on Chip (DyNoC) (6) page no. 219 Bobada
• A Dynamic Network on Chip (DyNoC) is a Network on Chip whose structure can
be dynamically changed at run-time. In a DyNoC, a router is a programmable
element that is configured as a router by default but can be reconfigured at run-time to
implement any function that fits on it.
• The placed module only needs one router to access the network.
• Figure: Implementation of a large reconfigurable module on a Network on Chip
• With this, the routers in the area of a placed component are no longer accessible
to other components in the network.
• Upon completion, the module is removed and the network must be reactivated in
the area where the component was previously placed.
• This can be done quickly, because the routers are programmable components that
can be quickly reset to their basic configuration, i.e. as routers.
• With this, we have a network in which some parts may be deactivated at a given
period of time and reactivated in the future.
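The deactivate-and-reactivate behaviour of the DyNoC can be sketched in a few lines of Python. The class and method names are hypothetical, and the model only tracks which elements currently act as routers; message routing around placed modules is not modelled.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]


class DyNoC:
    """Grid of programmable elements that act as routers unless a module covers them."""

    def __init__(self, rows: int, cols: int) -> None:
        self.rows, self.cols = rows, cols
        self.covered: Dict[Cell, str] = {}  # cell -> name of the module placed over it

    def active_routers(self) -> Set[Cell]:
        return {(r, c) for r in range(self.rows) for c in range(self.cols)
                if (r, c) not in self.covered}

    def place(self, name: str, top: int, left: int, height: int, width: int) -> bool:
        """Cover a rectangle with a module; its routers leave the network."""
        cells = [(top + dr, left + dc) for dr in range(height) for dc in range(width)]
        in_bounds = all(0 <= r < self.rows and 0 <= c < self.cols for r, c in cells)
        if not in_bounds or any(cell in self.covered for cell in cells):
            return False
        for cell in cells:
            self.covered[cell] = name
        return True

    def remove(self, name: str) -> None:
        """Reset the covered elements to their basic configuration as routers."""
        self.covered = {cell: owner for cell, owner in self.covered.items()
                        if owner != name}


noc = DyNoC(4, 4)
noc.place("M1", top=1, left=1, height=2, width=2)
routers_while_placed = len(noc.active_routers())
noc.remove("M1")
routers_after_removal = len(noc.active_routers())
```

The router count drops while the module is resident and returns to the full grid once it is removed, matching the description of a network whose parts are deactivated for a period and reactivated later.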
• Datapaths can be composed efficiently from primitives since instructions are not
prededicated to datapath elements, but rather delivered through the uniform
interconnection network.
4. Static interconnect (6) page no. 30 General purpose computing by Andrew Dehon
5. Interconnect hierarchy (6) page no. 161 General purpose computing by Andrew Dehon
6. Overhead in network design Or overheads in Design (6)
• Brown and Rose suggest that each 4-LUT in a moderate-sized FPGA with hundreds of
4-LUTs will require 200-400 switches.
• Agarwal and Lewis suggest approximately 100 switches per LUT for hierarchical
FPGAs with some reduction in logic utilization.
• Commercial devices also exhibit on the order of 200 switches per 4-LUT.
• The fact that conventional FPGAs can, with difficulty, route almost all designs
using less than 80-90% of the device LUTs suggests that their designers chose a number
of switches that provides reasonably “adequate” interconnect for current
device sizes: hundreds to a couple of thousand 4-LUTs.
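To get a feel for the scale these figures imply, the following Python sketch multiplies the quoted switches-per-LUT numbers by a device size. The 1000-LUT device and the helper name are assumptions chosen for illustration; the per-LUT figures are the ones quoted above.

```python
# Switches per 4-LUT as quoted in the notes above.
SWITCHES_PER_LUT = {
    "Brown and Rose (low)": 200,
    "Brown and Rose (high)": 400,
    "Agarwal and Lewis (hierarchical)": 100,
    "Commercial devices": 200,
}


def total_switches(lut_count: int, switches_per_lut: int) -> int:
    """Total interconnect switches for a device with `lut_count` 4-LUTs."""
    return lut_count * switches_per_lut


# Hypothetical device size, inside the "hundreds to a couple of thousand" range.
LUT_COUNT = 1000
estimates = {source: total_switches(LUT_COUNT, n) for source, n in SWITCHES_PER_LUT.items()}
```

Even at the lowest quoted figure, a device of this size carries on the order of a hundred thousand switches, which is why interconnect rather than logic dominates the overhead.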
4. Static interconnect (6) page no. 30 General purpose computing by Andrew
Dehon
5. Interconnect hierarchy (6) page no. 161 General purpose computing by
Andrew Dehon
7. Bisection BW (6) page no. 81 General purpose computing by Andrew
Dehon
8. Crossbars (6) page no. 81 General purpose computing by Andrew Dehon
OR
• Unlike FPGAs, multicontext devices store several configurations for the logic
and the interconnect on the chip.
• The additional area for the extra contexts decreases functional density, but it
increases functional diversity by allowing each LUT element to perform several
different functions.
• Like FPGAs, these devices may suffer from limited interconnect or application
pipelining limits.
• The result is a set of partitions, each of which is used to configure the complete device.
• While the implementation of single partitions is easy, the amount of wasted
resources in partitions can be very high.
• Recall that the wasted resource of a component is the amount of resources occupied
by that component multiplied by the time during which the component is idle on the
device.
• Wasting resources on the chip can be avoided if each component is placed
on the chip only when its computation is required and remains on the device
only for the time it is active.
• With this, idle components can be replaced by new ones, ready to run at a given
point of time.
• Exchanging a single component on the chip means reconfiguring the chip only
on the location previously occupied by that component.
• This process is called partial reconfiguration in contrast to full reconfiguration
where the full device must be reconfigured, even for the replacement of a single
component.
• While most existing devices support full reconfiguration, only a few can
be partially reconfigured.
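The definition of wasted resources given above (occupied resources multiplied by idle time) can be turned into a small worked example. The component names, areas and times below are invented for illustration, and `Residency` is a hypothetical helper class.

```python
from dataclasses import dataclass


@dataclass
class Residency:
    name: str
    area: int        # resources occupied, e.g. number of LUTs
    resident: float  # time the component stays configured on the device
    active: float    # time it is actually computing

    @property
    def wasted(self) -> float:
        """Occupied resources multiplied by the time the component sits idle."""
        return self.area * (self.resident - self.active)


# Full reconfiguration: both components stay for the whole 10-unit partition.
full = [
    Residency("A", area=100, resident=10, active=2),
    Residency("B", area=50, resident=10, active=10),
]
# Partial reconfiguration: A is removed as soon as it finishes.
partial = [
    Residency("A", area=100, resident=2, active=2),
    Residency("B", area=50, resident=10, active=10),
]

waste_full = sum(r.wasted for r in full)
waste_partial = sum(r.wasted for r in partial)
```

In this made-up case all of the waste comes from component A idling for the rest of the partition; removing it when it finishes frees that area for another component.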
Qn. What is the relationship between interconnect growth and
the area required on chip? Why do interconnects
consume the dominant area on chip? Give an example. 8
page no. 109 General purpose computing by Andrew Dehon
• Figure 1 plots computational density against datapath width and the number
of instructions per function group, c.
• For single-context designs, there is only a factor of 2.5 difference in density
between single-bit granularity and 128-bit granularity.
• At this size, network effects dominate instruction effects, and this factor of
2.5 comes almost entirely from the difference in switching requirements.
• For heavily multicontext devices at the same number of instruction contexts, the
difference between fine and coarse granularity is greater since the instruction
memory area dominates (See also Figure 2).
• At 1024 contexts, the 128-bit datapath is 36 times denser than an array with bit-level
granularity.
• Figure 2: Compute and Instruction Densities Versus Contexts and Datapath Width
• As the number of contexts, c, increases, the device supports more loaded
instructions; that is, a larger on-chip instruction diversity.
• These same density trends hold if we set aside a fixed amount of data memory.
• The area outside of the data memory will follow the same density curves shown
here.
Qn. Give the Rent's Rule based hierarchical model for
interconnect. 8 page no. 87 General purpose computing by Andrew Dehon
• Figure 1 below depicts the basic architecture for this DPGA. Each array element is a
conventional 4-input lookup table (4-LUT).
• Small collections of array elements, in this case 4×4 arrays, are grouped together into
sub-arrays.
• A single, 2-bit, global context identifier is distributed throughout the array to select the
configuration for use.
• Figure 1: Architecture and Composition of DPGA
• Additionally, programming lines are distributed to read and write configuration
memories.
• DRAM Memory: The basic memory primitive is a 4×32-bit DRAM array which
provides four context configurations for both the LUT and interconnection
network (See Figure 2).
• Notably, the context memory cells are built entirely out of N-well devices,
allowing the memory array to be packed densely, avoiding the large cost for
N-well to P-well separation.
• The active context data is read onto a row of standard, complementary CMOS
inverters which drive LUT programming and selection logic.
• Figure 2: DRAM Memory Primitive
• Array Element: The array element is a 4-LUT which includes an optional flip-flop
on its output (Figure 3).
• Each array element contains a context memory array. For our prototype, this is
the 4×32-bit memory described above.
• 16 bits provide the LUT programming, 12 configure the four 8-input multiplexors
which select each input to the 4-LUT, and one selects the optional flip-flop.
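The bit budget above (16 + 12 + 1 = 29 of the 32 bits in a context word) can be illustrated with a Python sketch that packs and unpacks one word and evaluates the LUT. The field order within the word, the helper names and the example truth tables are assumptions; the notes give only the field widths.

```python
from typing import List, NamedTuple


class ElementConfig(NamedTuple):
    lut_table: int          # 16-bit truth table of the 4-LUT
    mux_selects: List[int]  # four 3-bit selects, one per 8-input multiplexer
    use_flip_flop: bool     # whether the optional output flip-flop is selected


def encode(lut_table: int, mux_selects: List[int], use_flip_flop: bool) -> int:
    """Pack one context word. Assumed layout: bits 0-15 LUT, 16-27 selects, 28 flip-flop."""
    word = lut_table & 0xFFFF
    for i, select in enumerate(mux_selects):
        word |= (select & 0b111) << (16 + 3 * i)
    return word | (int(use_flip_flop) << 28)


def decode(word: int) -> ElementConfig:
    selects = [(word >> (16 + 3 * i)) & 0b111 for i in range(4)]
    return ElementConfig(word & 0xFFFF, selects, bool((word >> 28) & 1))


def lut_output(config: ElementConfig, mux_inputs: List[List[int]]) -> int:
    """mux_inputs[i] holds the 8 signal values visible to multiplexer i."""
    bits = [mux_inputs[i][config.mux_selects[i]] for i in range(4)]
    index = sum(bit << i for i, bit in enumerate(bits))
    return (config.lut_table >> index) & 1


# Four context words for one array element; the context identifier picks one.
context_memory = [
    encode(0x8000, [0, 1, 2, 3], False),  # context 0: 4-input AND
    encode(0xFFFE, [0, 1, 2, 3], False),  # context 1: 4-input OR
    0,                                    # contexts 2 and 3 left unprogrammed
    0,
]
signals = [[1, 1, 1, 0, 0, 0, 0, 0]] * 4  # same 8 values offered to every multiplexer
and_out = lut_output(decode(context_memory[0]), signals)
or_out = lut_output(decode(context_memory[1]), signals)
```

Switching the context identifier from 0 to 1 changes the function computed on the same inputs without rewriting any configuration memory, which is the point of keeping several contexts on chip.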
• Each array element output is run vertically and horizontally across the entire
span of the subarray (Figure 4).
• Each array element can, in turn, select as an input the output of any array
element in its subarray which shares the same row or column.
• The row and column widths remain fixed regardless of array size, so the
horizontal and vertical interconnect would eventually saturate the row and
column channel capacity if the topology were scaled up.
• Figure 4: Subarray Local Interconnect
• Additionally, the delay on the local interconnect increases with each additional
element in a row or column.
• For small subarrays, there is adequate channel capacity to route all outputs
across a row and column without increasing array element size, so the topology
is feasible and desirable.
• Further, the additional delay for the few elements in the row or column of a small
subarray is moderately small compared to the fixed delays in the array element
and routing network.
• In general, the subarray size should be carefully chosen with these properties in
mind.
• Non-Local Interconnect: In addition to the local outputs which run across each row
and column, a number of non-local lines are also allocated to each row and column.
• The non-local lines are driven by the global interconnect (Figure 4).
• Each LUT can then pick inputs from among the lines which cross its array element.
• In the prototype, each row and column supports four non-local lines.
• Each array element could thus pick its inputs from eight global lines, six row and
column neighbor outputs, and its own output.
• Each input is configured with an 8:1 selector as noted above (Figure 3).
• Of course, not all combinations of 15 inputs taken 4 at a time are available with this
scheme.
• The inputs are arranged so any combination of local signals can be selected along with
many subsets of global signals.
• Freedom available at the crossbar in assigning global lines to tracks reduces the
impact of this restriction, but complicates placement.
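The remark that not every combination of 15 inputs taken 4 at a time is available can be checked by counting. In the Python sketch below, the total uses the source's figures (8 global lines, 6 neighbor outputs, 1 own output), while the `wiring` of sources to multiplexers is purely hypothetical, since the notes do not give the prototype's actual assignment.

```python
from itertools import product
from math import comb

SOURCES = 15     # 8 global lines + 6 neighbor outputs + the element's own output
LUT_INPUTS = 4   # one 8:1 selector per LUT input
MUX_WIDTH = 8

all_combinations = comb(SOURCES, LUT_INPUTS)

# Hypothetical wiring: multiplexer i sees eight consecutive sources starting at 4 * i.
wiring = [[(4 * i + k) % SOURCES for k in range(MUX_WIDTH)] for i in range(LUT_INPUTS)]

# Every set of four distinct sources that some selector setting can deliver.
reachable = {
    frozenset(choice)
    for choice in product(*wiring)
    if len(set(choice)) == LUT_INPUTS
}
coverage = len(reachable) / all_combinations
```

Under this assumed wiring the second multiplexer only sees sources 4 to 11, so a set such as {0, 1, 2, 3} can never be selected. Coverage therefore depends on how sources are assigned to selectors, and placement has to work around the gaps.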
• Local Decode: Row select lines for the context memories are decoded and buffered
locally from the 2-bit context identifier.
• One decoder also services the crossbar memories for four of the adjacent crossbars.
• In our prototype, this placed five decoders in each subarray, each servicing four array
element or crossbar memory blocks for a total of 128 memory columns.
• Each local decoder also contains circuitry to refresh the DRAM memory on contexts
which are not being actively read or written.
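The local decode step and the block counts quoted above can be summarized in a short Python sketch. The function name is hypothetical, the count of four crossbar memory blocks per subarray is an inference from the five-decoder figure rather than something stated directly, and DRAM refresh is not modelled.

```python
from typing import List

CONTEXTS = 4             # selected by the 2-bit global context identifier
COLUMNS_PER_BLOCK = 32   # one 4x32-bit context memory per block
BLOCKS_PER_DECODER = 4


def row_select(context_id: int) -> List[int]:
    """Decode the 2-bit context identifier into one-hot row select lines."""
    if not 0 <= context_id < CONTEXTS:
        raise ValueError("context identifier must fit in 2 bits")
    return [int(row == context_id) for row in range(CONTEXTS)]


columns_per_decoder = BLOCKS_PER_DECODER * COLUMNS_PER_BLOCK

array_element_blocks = 4 * 4  # one context memory per element of the 4x4 subarray
crossbar_blocks = 4           # inferred: one decoder serves four adjacent crossbars
decoders_per_subarray = (array_element_blocks + crossbar_blocks) // BLOCKS_PER_DECODER
```

The arithmetic reproduces the figures in the notes: 128 memory columns per decoder and five decoders per subarray.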
Qn. Explore the architectural building blocks of iDPGA and DPGA. What is difference
between them? 16
OR
Qn. With the help of a detailed block diagram, explain the architecture of iDPGA. List its
features, merits and limitations. 16
page no. 230 General purpose computing by Andrew Dehon
• With the help of a detailed block diagram, explain the architecture of iDPGA. List its
features, merits and limitations. 16
• RP space area model. 6
• MATRIX 6
• Draw the detailed block diagram of TSFPGA. Explain each block in detail. Discuss
merits and demerits. 16
• What are the concepts behind the time-switched FPGA and the dynamically
programmable gate array? Explore the architecture of any one of them in detail.
16
• How is task switching innovative in TSFPGA? Explain the architectural blocks of
TSFPGA in detail. What are its limitations? 16