Professional Documents
Culture Documents
Multiplication Is Basically A Shift Add Operation
Multiplication Is Basically A Shift Add Operation
Some are more suitable for FPGA use than others. This page is a brief tutorial on multiplication hardware. The hyperlinked items in this list are currently in the text. The remaining items will be added in a future release of this page. Scaling Accumulator Multipliers Serial by Parallel Booth Multipliers Ripple Carry Array Multipliers Row Adder Tree Multipliers Carry Save Array Multipliers Look-Up Table Multipliers Partial Product LUT Multipliers Computed Partial Product Multipliers Constant Multipliers from Adders KCM multipliers Limited Set LUT Multipliers Wallace Trees Booth Recoding Negative inputs
1 0 1 1
Row ripple form Unrolled shift-add algorithm Delay is proportional to N A ripple carry array multiplier (also called row ripple form) is an unrolled embodiment of the classic shift-add multiplication algorithm. The illustration shows the adder structure used to combine all the bit products in a 4x4 multiplier. The bit products are the logical and of the bits from each input. They are shown in the form x,y in the drawing. The maximum delay is the path from either LSB input to the MSB of the product, and is the same (ignoring routing delays) regardless of the path taken. The delay is approximately 2*n.
This basic structure is simple to implement in FPGAs, but does not make efficient use of the logic in many FPGAs, and is therefore larger and slower than other implementations.
000 000000 000000 000000 000000 000000 000000 000000 000000 001 000000 000001 000010 000011 000100 000101 000110 000111 010 000000 000010 000100 000110 001000 001010 001100 001110 011 000000 000011 000110 001001 001100 001111 010010 010101 100 000000 000100 001000 001100 010000 010100 011000 011100 101 000000 000101 001010 001111 010100 011001 011110 100011
110 000000 000110 001100 010010 011000 011110 100100 101010 111 000000 000111 001110 010101 011100 100011 101010 110001
By performing the multiplication one digit at a time and then shifting and summing the individual partial products, the size of the memorized times table is greatly reduced. While this example is decimal, the technique works for any radix. The order in which the partial products are obtained or summed is not important. The proper weighting by shifting must be maintained however. The example below shows how this technique is applied in hardware to obtain a 6x6 multiplier using the 3x3 LUT multiplier shown above. The LUT (which performs multiplication of a pair of octal digits) is duplicated so that all of the partial products are obtained simultaneously. The partial products are then shifted as needed and summed together. An adder tree is used to obtain the sum with minimum delay.
The LUT could be replaced by any other multiplier implementation, since LUT is being used as a multiplier. This gives the insight into how to combine multipliers of an arbitrary size to obtain a larger multiplier. The LUT multipliers shown have matched radices (both inputs are octal). The partial products can also have mixed radices on the inputs provided care is taken to make sure the partial products are
shifted properly before summing. Where the partial products are obtained with small LUTs, the most efficient implementation occurs when LUT is square (ie the input radices are the same). For 8 bit LUTs, such as might be found in an Altera 10K FPGA, this means the LUT radix is hexadecimal. For 4 bit LUTs, found in many FPGA logic cells, the ideal radix is 2 bits (This is really the only option for a 4 LUT: a 1 bit input reduces the LUT to an AND gate, and since each LUT cell has 1 output, it can only use one bit on the other input). A more compact but slower version is possible by computing the partial products sequentially using one LUT and accumulating the results in a scaling accumulator. Note that in this case, the shifter would need a special control to obtain the proper shift on all the partials
The Xilinx Virtex device includes an extra gate in the carry chain logic that allows a 4 input LUT plus the carry chain to perform a 2xN partial product, thereby achieving twice the density otherwise attainable. In this case, the adder (consisting of the XOR gates and muxes in the carry chain) adds a pair of 1xN partial products obtained with AND gates. The extra AND gate in the carry logic allows you to put AND gates on both inputs of the adder while maintaining a 4 input function.
2 x n bit computed partial product implemented in Xilinx Virtex using special MULTAND gate in carry chain logic
010 134 670 1206 1742 011 201 737 1273 1809 100 268 804 1340 1876 101 335 871 1407 1943
110 402 938 1474 2010 111 469 1005 1541 2077 When the LUT does not offer enough inputs to accommodate the desired variable width, several identical LUTs may be combined using the partial products techniques discussed above. In this case, the constant multiplicand is full width, so the partial products will be m x n where m is the number of LUT inputs and n is the width of the constant.
000 001 010 011 100 101 110 603 1139 1675 85
010 134 670 1206 1742 170 850 1530 2210 011 201 737 1273 1809 255 935 1615 2295 100 268 804 1340 1876 340 1020 1700 2380 101 335 871 1407 1943 425 1105 1785 2465 110 402 938 1474 2010 510 1190 1870 2550 111 469 1005 1541 2077 595 1275 1955 2635
therefore do not have to be included in the sum. If the number of '1' bits in a constant coefficient multiplier is small, then a constant multiplier may be realized with wired shifts and a few adders as shown in the 'times 10' example below.
0 1 0 1 0000000 1011001 0000000 +1011001 1101111010
In cases where there are strings of '1' bits in the constant, adders can be eliminated by using Booth recoding methods with subtractors. The 'times 14 example below illustrates this technique. Note that 14 = 8+4+2 can be expressed as 14=16-2, which reduces the number of partial products.
0 1 1 1 0000000 1011001 1011001 +1011001 10011011110 0 0000000 -1 1110100111 0 0000000 0 0000000 1 +1011001 10011011110
Combinations of partial products can sometimes also be shifted and added in order to reduce the number of partials, although this may not necessarily reduce the depth of a tree. For example, the 'times 1/3' approximation (85/256=0.332) below uses less adders than would be necessary if all the partial products were summed directly. Note that the shifts are in the opposite direction to obtain the fractional partial products.
Clearly, the complexity of a constant multiplier constructed from adders is dependent upon the constant. For an arbitrary constant, the KCM multiplier discussed above is a better choice. For certain 'quick and dirty' scaling applications, this multiplier works nicely.
Wallace Trees
Optimized column adder tree Combines all partial products into 2 vectors (carry and sum) Carry and sum outputs combined using a conventional adder Delay is log(n) Wallace tree multiplier uses Wallace tree to combine 1 x n partial products Irregular routing Not optimum in many FPGAs A Wallace tree is an implementation of an adder tree designed for minimum propagation delay. Rather than completely adding the partial products in pairs like the ripple adder tree does, the Wallace tree sums up all the bits of the same weights in a merged tree. Usually full adders are used, so that 3 equally weighted bits are combined to produce two bits: one (the carry) with weight of n+1 and the other (the sum) with weight n. Each layer of the tree therefore reduces the number of vectors by a
factor of 3:2 (Another popular scheme obtains a 4:2 reduction using a different adder style that adds little delay in an ASIC implementation). The tree has as many layers as is necessary to reduce the number of vectors to two (a carry and a sum). A conventional adder is used to combine these to obtain the final product. The structure of the tree is shown below. The red numbers after each full adder in the illustration indicate the bit weights of each signal. For a multiplier, this tree is pruned because the input partial products are shifted by varying amounts.
A section of an 8 input wallace tree. The wallace tree combines the 8 partial product inputs to two output vectors corresponding to a sum and a carry. A conventional adder is used to combine these outputs to obtain the complete product..
If you trace the bits in the tree (the tree in the illustration is color coded to help in this regard), you will find that the Wallace tree is a tree of carry-save adders arranged as shown to the left. A carry save adder consists of full adders like the more familiar ripple adders, but the carry output from each bit is brought out to form second result vector rather being than wired to the next most significant bit. The carry vector is 'saved' to be combined with the sum later, hence the carrysave moniker. A Wallace tree multiplier is one that uses a Wallace tree to combine the partial products from a field of 1x n multipliers (made of AND gates). It turns out that the number of Carry Save Adders in a Wallace tree multiplier is exactly the same asWallace tree rearranges the wiring however, so that the partial product bits with the longest delays are wired closer to the root of the tree. This changes the delay characteristic from o(n+n) to o(n+log(n)) at no gate cost. Unfortunately the nice regular routing of the array multiplier is also replaced with a rats-nest.