Computer Arithmetics
ARCHITECTURE
COMPUTER ARITHMETIC
Dr. Valon Raca
Notes
- Slides adapted and based on sources from University of Pennsylvania (Joseph
Devietti, Benedict Brown, C.J. Taylor, Milo Martin & Amir Roth)
This Unit: Arithmetic
⬗ A little review
⬗ Binary + 2s complement
⬗ Ripple-carry addition (RCA)
⬗ P&H
⬗ Chapter 3
⬗ Appendix B.6
[Figure: system layers: applications (App), system software, and hardware (Mem, CPU, I/O)]
The Importance of Fast Arithmetic
[Figure: datapath with PC, +4 adder, instruction memory (Insn Mem), register file (s1, s2, d), and data memory (Data Mem)]
1st Grade: Decimal Addition
   1       // carry
   43
 + 29
   72
⬗ Repeat N times
⬗ Add least significant digits and any overflow from previous add
⬗ Carry “overflow” to next addition
⬗ Overflow: any digit other than least significant of sum
⬗ Shift two addends and sum one digit to the right
Binary Addition: Works the Same Way
  1       111111     // carries
  43 =   00101011
+ 29 = + 00011101
  72 =   01001000
⬗ Repeat N times
⬗ Add least significant bits and any overflow from previous add
⬗ Carry the overflow to next addition
⬗ Shift two addends and sum one bit to the right
⬗ Sum of two N-bit numbers can yield an N+1 bit number
⬗ S (sum) = A ^ B
⬗ CO (carry out) = AB
⬗ This is called a half adder
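The half adder can be sketched in C as follows. This is a minimal illustration, assuming each input is a single bit held in an unsigned value; the names `half_sum` and `half_adder` are hypothetical and do not come from the slides. The sum bit uses the standard half-adder relation S = A ^ B.

```c
#include <assert.h>

/* Result of adding two bits with no carry-in (illustrative type). */
typedef struct {
    unsigned s;   /* sum bit */
    unsigned co;  /* carry-out bit */
} half_sum;

/* One-bit half adder. */
half_sum half_adder(unsigned a, unsigned b) {
    assert(a <= 1 && b <= 1);   /* inputs must be single bits */
    half_sum r;
    r.s  = a ^ b;   /* S  = A ^ B */
    r.co = a & b;   /* CO = AB    */
    return r;
}
```

Calling `half_adder(1, 1)` gives a sum bit of 0 and a carry-out of 1, which is why a second stage is needed to absorb the carry.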
The Other Half
⬗ We could chain half adders together, but to do that…
⬗ Need to incorporate a carry out from previous adder
C A B = CO S
0 0 0 = 0  0
0 0 1 = 0  1
0 1 0 = 0  1
0 1 1 = 1  0
1 0 0 = 0  1
1 0 1 = 1  0
1 1 0 = 1  0
1 1 1 = 1  1
[Figure: full adder (FA) block with inputs A, B, CI and outputs S, CO]
⬗ S = C’A’B + C’AB’ + CA’B’ + CAB = C ^ A ^ B
⬗ CO = C’AB + CA’B + CAB’ + CAB = CA + CB + AB
⬗ This is called a full adder
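The two full-adder equations can be checked with a short C sketch. The names `full_sum` and `full_adder` are illustrative assumptions, and each input is assumed to be a single bit.

```c
#include <assert.h>

/* Result of adding three bits (illustrative type). */
typedef struct {
    unsigned s;   /* sum bit */
    unsigned co;  /* carry-out bit */
} full_sum;

/* One-bit full adder: c is the carry-in, a and b are the addend bits. */
full_sum full_adder(unsigned c, unsigned a, unsigned b) {
    assert(c <= 1 && a <= 1 && b <= 1);
    full_sum r;
    r.s  = c ^ a ^ b;                    /* S  = C ^ A ^ B    */
    r.co = (c & a) | (c & b) | (a & b);  /* CO = CA + CB + AB */
    return r;
}
```

For every input combination, `2 * co + s` equals `a + b + c`, which is the property the truth table encodes.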
Ripple-Carry Adder
⬗ N-bit ripple-carry adder
⬗ Example: 16-bit ripple-carry adder
⬗ How fast is this?
[Figure: chain of full adders (FA); the first carry-in is 0, and the last stage takes A15, B15 and produces S15]
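A software model of the 16-bit ripple-carry adder might look like the sketch below. It is an assumption-laden illustration rather than a hardware description: `rca16` and its signature are hypothetical, and the loop stands in for sixteen chained full adders.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit ripple-carry adder modeled as a chain of 1-bit full adders.
 * The carry-out of the last stage is returned through *carry_out. */
uint16_t rca16(uint16_t a, uint16_t b, unsigned carry_in, unsigned *carry_out) {
    assert(carry_in <= 1);
    uint16_t sum = 0;
    unsigned c = carry_in;
    for (int i = 0; i < 16; i++) {
        unsigned ai = (a >> i) & 1u;
        unsigned bi = (b >> i) & 1u;
        unsigned s  = c ^ ai ^ bi;                      /* full-adder sum   */
        unsigned co = (c & ai) | (c & bi) | (ai & bi);  /* full-adder carry */
        sum |= (uint16_t)(s << i);
        c = co;   /* the carry ripples into the next stage */
    }
    *carry_out = c;
    return sum;
}
```

Each iteration depends on the carry produced by the previous one, which is the serial dependence that makes the hardware version slow.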
Quantifying Adder Delay
⬗ Combinational logic dominated by gate delays
⬗ How many gates between input and output?
⬗ Array storage dominated by wire delays
⬗ Longest delay or critical path is what matters
⬗ delay(CO15) = 2 + MAX(d(A15),d(B15),d(CI15))
⬗ d(A15) = d(B15) = 0, d(CI15) = d(CO14)
⬗ d(CO15) = 2 + d(CO14) = 2 + 2 + d(CO13) …
⬗ d(CO15) = 32
⬗ d(CO_{N-1}) = 2N
– Too slow!
– Linear in number of bits
⬗ Number of gates is also linear
[Figure: 16-bit ripple-carry adder: full adders (FA) for bits 0, 1, 2, … 15, ending in CO]
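The delay recurrence can be unrolled mechanically. The sketch below assumes, as the slide does, two gate delays per full adder and inputs available at delay 0; the function name `rca_carry_delay` is hypothetical.

```c
#include <assert.h>

/* Gate delay of the carry-out of bit `bit` in a ripple-carry adder,
 * using the recurrence d(CO_i) = 2 + d(CO_{i-1}) with d(CI_0) = 0. */
unsigned rca_carry_delay(unsigned bit) {
    unsigned d = 0;                       /* initial carry-in is ready at d0 */
    for (unsigned i = 0; i <= bit; i++) {
        d = 2 + d;                        /* two gate levels per full adder  */
    }
    assert(d == 2 * (bit + 1));           /* closed form: d(CO_{N-1}) = 2N   */
    return d;
}
```

`rca_carry_delay(15)` evaluates to 32, matching the 16-bit figure quoted on the slide.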
Fast Addition
Bad idea: a PLA-based Adder?
⬗ If any function can be expressed as two-level logic…
⬗ …why not use a PLA for an entire 8-bit adder?
⬗ Not small
⬗ Approx. 2^16 AND gates, each with 16 inputs
⬗ Then, 8 OR gates, each with 2^16 inputs
⬗ Number of gates exponential in bit width!
⬗ Not that fast, either
⬗ An AND gate with 65K inputs != 2-input AND gate
⬗ Many-input gates made from tree of, say, 4-input gates
⬗ 2^16-input gates would have at least 8 logic levels
⬗ So, at least 10 levels of logic for a 16-bit PLA
⬗ Even so, delay is still logarithmic in number of bits
⬗ There are better (faster, smaller) ways
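The level counts quoted above can be reproduced with a small calculation. This sketch assumes wide gates are built as balanced trees of `fan_in`-input gates; `tree_levels` is a hypothetical helper, not something from the slides.

```c
#include <assert.h>

/* Number of logic levels needed to build one gate with `inputs` inputs
 * from a balanced tree of gates that each take `fan_in` inputs. */
unsigned tree_levels(unsigned long inputs, unsigned fan_in) {
    assert(inputs >= 1 && fan_in >= 2);
    unsigned levels = 0;
    unsigned long covered = 1;   /* inputs reachable with `levels` levels */
    while (covered < inputs) {
        covered *= fan_in;
        levels++;
    }
    return levels;
}
```

With 4-input gates, a 2^16-input OR needs `tree_levels(65536, 4)` = 8 levels and a 16-input AND needs `tree_levels(16, 4)` = 2, giving the 10 levels mentioned above.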
Theme: Hardware != Software
⬗ Hardware can do things that software fundamentally can’t
⬗ And vice versa (of course)
Multi-Segment Carry-Select Adder
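⬗ Multiple segments
⬗ Example: 5, 5, 6 bit = 16 bit
⬗ Hardware cost
[Figure: multi-segment carry-select adder built from 5-bit adder blocks (5+): A4-0, B4-0 and carry-in 0 produce S4-0; A9-5, B9-5 produce candidate S9-5 values]
[Plot: gate delay (y-axis, 10 to 30) against M (x-axis, 1 to 16)]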
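A behavioral sketch of the 5, 5, 6 bit carry-select arrangement is given below. It is only a software model under stated assumptions: `seg_add` and `carry_select16` are hypothetical names, and for uniformity the lowest segment is also computed for both carry-in values even though its carry-in is known to be 0.

```c
#include <assert.h>
#include <stdint.h>

/* Add two `width`-bit segments with a given carry-in. Returns the
 * segment sum and writes the segment carry-out to *cout. */
static unsigned seg_add(unsigned a, unsigned b, unsigned cin,
                        unsigned width, unsigned *cout) {
    assert(width < 32 && cin <= 1);
    unsigned total = a + b + cin;
    *cout = (total >> width) & 1u;
    return total & ((1u << width) - 1u);
}

/* 16-bit carry-select adder with 5-, 5- and 6-bit segments. Each
 * segment is summed for carry-in 0 and for carry-in 1; the real
 * carry then only selects one of the two results (a mux). */
uint16_t carry_select16(uint16_t a, uint16_t b, unsigned *carry_out) {
    static const unsigned widths[3] = {5, 5, 6};
    unsigned carry = 0, shift = 0;
    uint16_t sum = 0;
    for (int s = 0; s < 3; s++) {
        unsigned w = widths[s];
        unsigned mask = (1u << w) - 1u;
        unsigned as = (a >> shift) & mask;
        unsigned bs = (b >> shift) & mask;
        unsigned co0, co1;
        unsigned s0 = seg_add(as, bs, 0, w, &co0);  /* speculate carry-in 0 */
        unsigned s1 = seg_add(as, bs, 1, w, &co1);  /* speculate carry-in 1 */
        unsigned chosen = carry ? s1 : s0;          /* mux on the real carry */
        carry = carry ? co1 : co0;
        sum |= (uint16_t)(chosen << shift);
        shift += w;
    }
    *carry_out = carry;
    return sum;
}
```

In hardware the two speculative sums of every segment are computed at the same time, so only the mux chain sits on the carry path.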
Another Option: Carry Lookahead
⬗ Is carry-select adder as fast as we can go?
⬗ Nope!
A & B → Generate & Propagate
⬗ Let’s look at the single-bit carry-out function
⬗ CO = AB+AC+BC
⬗ Factor out terms that use only A and B (available immediately)
⬗ CO =(AB)+(A+B)C
⬗ (AB): generates carry-out regardless of incoming C → rename to G
⬗ (A+B): propagates incoming C → rename to P
⬗ CO = G+PC
[Figure: carry-out circuit drawn from A, B, C directly, and redrawn with intermediate G and P signals]
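The factored form can be checked against the original carry-out equation with a few lines of C. The function name `carry_out_gp` is an illustrative assumption; inputs are assumed to be single bits.

```c
#include <assert.h>

/* Carry-out written with generate/propagate terms: CO = G + PC,
 * where G = AB and P = A + B. */
unsigned carry_out_gp(unsigned a, unsigned b, unsigned c) {
    assert(a <= 1 && b <= 1 && c <= 1);
    unsigned g = a & b;   /* generate: carry-out regardless of C  */
    unsigned p = a | b;   /* propagate: pass an incoming C along  */
    return g | (p & c);
}
```

For all eight input combinations the result equals AB + AC + BC, so the rewrite changes only the structure, not the function.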
Infinite Hardware CLA
⬗ Can expand C1…N in terms of G’s, P’s, and C0
⬗ Example: C16
⬗ C16 = G15+P15C15
⬗ C16 = G15+P15(G14+P14C14)
⬗ C16 = G15+P15G14+ … + P15P14…P2P1G0 + P15P14…P2P1P0C0
⬗ Similar expansions for C15, C14, etc.
⬗ Is there a compromise?
⬗ Reasonable number of small gates?
⬗ Sublinear (doesn’t have to be constant) latency?
⬗ Yes, multi-level CLA: exploits hierarchy to achieve this
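The fully expanded carry equation can be evaluated term by term, as in the sketch below. It is a software illustration only: `cla_carry_flat` is a hypothetical name, and the loop visits the product terms that the hardware would compute in parallel.

```c
#include <assert.h>
#include <stdint.h>

/* Carry into bit n (1 <= n <= 16) from the fully expanded equation.
 * Every term depends only on G's, P's and C0, never on another carry. */
unsigned cla_carry_flat(uint16_t a, uint16_t b, unsigned c0, unsigned n) {
    assert(n >= 1 && n <= 16 && c0 <= 1);
    uint16_t g = (uint16_t)(a & b);   /* G_i = A_i B_i   */
    uint16_t p = (uint16_t)(a | b);   /* P_i = A_i + B_i */
    unsigned carry = 0;
    unsigned run = 1;                 /* product P_{n-1} ... P_{i+1} so far */
    for (int i = (int)n - 1; i >= 0; i--) {
        carry |= run & ((unsigned)(g >> i) & 1u);  /* term P_{n-1}...P_{i+1} G_i */
        run   &= (unsigned)(p >> i) & 1u;          /* extend the product with P_i */
    }
    carry |= run & c0;                /* final term P_{n-1}...P_0 C0 */
    return carry;
}
```

For n = 16 the last term is a 17-input AND, which illustrates why the flat form needs impractically wide gates.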
Carry Lookahead Adder (CLA)
⬗ Calculate “propagate” and “generate” based on A, B
⬗ Not based on carry in
⬗ Combine with tree structure
⬗ Take aways
[Figure: 4-bit CLA tree: per-bit Gi/Pi from Ai, Bi; first-level blocks produce G1-0/P1-0 and G3-2/P3-2; second level produces G3-0/P3-0; carries C1, C2, C3]
Two-Level CLA for 4-bit Adder
⬗ Individual carry equations
⬗ C1 = G0+P0C0, C2 = G1+P1C1, C3 = G2+P2C2, C4 = G3+P3C3
⬗ Infinite hardware CLA equations
⬗ C1 = G0+P0C0
⬗ C2 = G1+P1G0+P1P0C0
⬗ C3 = G2+P2G1+P2P1G0+P2P1P0C0
⬗ C4 = G3+P3G2+P3P2G1+P3P2P1G0+P3P2P1P0C0
⬗ Hierarchical CLA equations
⬗ First level: expand C2 using C1, C4 using C3
⬗ C2 = G1+P1(G0+P0C0) = (G1+P1G0)+(P1P0)C0 = G1-0+P1-0C0
⬗ C4 = G3+P3(G2+P2C2) = (G3+P3G2)+(P3P2)C2 = G3-2+P3-2C2
⬗ Second level: expand C4 using expanded C2
⬗ C4 = G3-2+P3-2(G1-0+P1-0C0) = (G3-2+P3-2G1-0)+(P3-2P1-0)C0
⬗ C4 = G3-0+P3-0C0
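The hierarchical equations translate almost line for line into C. The sketch below is an illustration under assumptions: `cla4_carries` and its array layout are hypothetical, and `a` and `b` are assumed to hold 4-bit values.

```c
#include <assert.h>

/* Two-level CLA carries for a 4-bit adder. c[0..4] receives C0..C4. */
void cla4_carries(unsigned a, unsigned b, unsigned c0, unsigned c[5]) {
    assert(a < 16 && b < 16 && c0 <= 1);
    unsigned g[4], p[4];
    for (int i = 0; i < 4; i++) {
        g[i] = (a >> i) & (b >> i) & 1u;     /* Gi = AiBi  */
        p[i] = ((a >> i) | (b >> i)) & 1u;   /* Pi = Ai+Bi */
    }
    /* First level: group generate/propagate for each bit pair */
    unsigned g10 = g[1] | (p[1] & g[0]), p10 = p[1] & p[0];
    unsigned g32 = g[3] | (p[3] & g[2]), p32 = p[3] & p[2];
    /* Second level: group generate/propagate for all four bits */
    unsigned g30 = g32 | (p32 & g10), p30 = p32 & p10;

    c[0] = c0;
    c[1] = g[0] | (p[0] & c[0]);   /* C1 = G0+P0C0     */
    c[2] = g10  | (p10  & c[0]);   /* C2 = G1-0+P1-0C0 */
    c[3] = g[2] | (p[2] & c[2]);   /* C3 = G2+P2C2     */
    c[4] = g30  | (p30  & c[0]);   /* C4 = G3-0+P3-0C0 */
}
```

Note that C4 is computed from C0 alone, without waiting for C2 or C3; that is the benefit of the second level.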
Two-Level CLA for 4-bit Adder
⬗ Top first-level CLA block
⬗ Input: C0, G0, P0, G1, P1
⬗ C4 block
[Figure: two-level CLA for a 4-bit adder]
4-bit CLA Timing: d0
⬗ Which signals are ready at 0 gate delays (d0)?
⬗ C0
⬗ A0, B0
⬗ A1, B1
⬗ A2, B2
⬗ A3, B3
[Figure: two-level 4-bit CLA block diagram]
4-bit CLA Timing: d1
⬗ Signals ready from before
⬗ d0: C0, Ai, Bi
⬗ New signals ready at d1
⬗ P0 = A0+B0, G0 = A0B0
⬗ P1 = A1+B1, G1 = A1B1
⬗ P2 = A2+B2, G2 = A2B2
⬗ P3 = A3+B3, G3 = A3B3
[Figure: two-level 4-bit CLA block diagram]
4-bit CLA Timing: d3
⬗ Signals ready from before
⬗ d0: C0, Ai, Bi
⬗ d1: Pi, Gi
⬗ New signals ready at d3
⬗ P1-0 = P1P0
⬗ G1-0 = G1+P1G0
⬗ C1 = G0+P0C0
⬗ P3-2 = P3P2
⬗ G3-2 = G3+P3G2
⬗ C3 is not ready
[Figure: two-level 4-bit CLA block diagram]
4-bit CLA Timing: d5
⬗ Signals ready from before
⬗ New signals ready at d5
⬗ P3-0 = P3-2P1-0
⬗ G3-0 = G3-2+P3-2G1-0
⬗ C2 = G1-0+P1-0C0
[Figure: two-level 4-bit CLA block diagram]
4-bit CLA Timing: d7
⬗ Signals ready from before
⬗ d1: Pi, Gi
⬗ New signals ready at d7
⬗ C3 = G2+P2C2
⬗ C4 = G3-0+P3-0C0
⬗ Si ready 2 gate delays after Ci
[Figure: two-level 4-bit CLA block diagram]
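The d0 to d7 progression can be reproduced by propagating ready times through the equations. The sketch assumes one gate delay for a single AND or OR level and two for an AND-OR pair, as the timing slides do; the struct and function names are hypothetical.

```c
#include <assert.h>

static unsigned max2(unsigned x, unsigned y) { return x > y ? x : y; }

/* Ready times, in gate delays, for the two-level 4-bit CLA. */
typedef struct {
    unsigned pg;       /* Pi, Gi                  */
    unsigned c1;       /* C1                      */
    unsigned group1;   /* G1-0, P1-0, G3-2, P3-2  */
    unsigned c2;       /* C2                      */
    unsigned group2;   /* G3-0, P3-0              */
    unsigned c3, c4;   /* C3, C4                  */
    unsigned s3;       /* last sum bit, S3        */
} cla4_timing;

cla4_timing cla4_ready_times(void) {
    cla4_timing t;
    unsigned inputs = 0;                    /* C0, Ai, Bi are ready at d0 */
    t.pg     = inputs + 1;                  /* one gate level             */
    t.c1     = max2(t.pg, inputs) + 2;      /* C1 = G0+P0C0               */
    t.group1 = t.pg + 2;                    /* e.g. G1-0 = G1+P1G0        */
    t.c2     = max2(t.group1, inputs) + 2;  /* C2 = G1-0+P1-0C0           */
    t.group2 = t.group1 + 2;                /* G3-0 = G3-2+P3-2G1-0       */
    t.c3     = max2(t.pg, t.c2) + 2;        /* C3 = G2+P2C2               */
    t.c4     = max2(t.group2, inputs) + 2;  /* C4 = G3-0+P3-0C0           */
    t.s3     = t.c3 + 2;                    /* Si is ready 2 delays after Ci */
    assert(t.c3 == 7 && t.c4 == 7);         /* agrees with the d7 slide   */
    return t;
}
```

Under these assumptions the last sum bit, S3, is ready at 9 gate delays because it waits for C3.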
Two-Level CLA for 4-bit Adder
⬗ Hardware?
⬗ CLA blocks: 5 gates, max 3 inputs (infinite CLA with N=2)
⬗ 3 of these
⬗ C4 block: 2 gates, max 2 inputs
⬗ Total: 17 gates, max 3 inputs
⬗ Infinite: 14 gates, max 5 inputs
⬗ Latency?
⬗ Infinite: 5
[Figure: two-level 4-bit CLA block diagram]
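[Figure: two-level 16-bit CLA: four first-level 4-bit blocks (G3-0/P3-0, G7-4/P7-4, G11-8/P11-8, G15-12/P15-12; interior carries C1,2,3, C5,6,7, C9,10,11, C13,14,15) feeding one second-level block (G15-0/P15-0; carries C4,8,12)]
⬗ Hardware?
⬗ First level: 14 gates * 4 blocks
⬗ Second level: 14 gates * 1 block
⬗ Total: 70 gates
⬗ Largest gate: 5 inputs
⬗ Infinite: 152 gates, 17 inputs
⬗ Latency?
⬗ Total: 9 (1 + 2 + 2 + 2 + 2)
⬗ Infinite: 5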
16-bit CLA Timing: d0
⬗ What is ready after 0 gate delays?
⬗ C0
[Figure: 16-bit CLA block diagram: first-level blocks G3-0/P3-0, G7-4/P7-4, G11-8/P11-8, G15-12/P15-12 with interior carries C1,2,3 … C13,14,15; second-level block G15-0/P15-0 with carries C4,8,12; output C16]
16-bit CLA Timing: d1
⬗ Individual G/P
[Figure: 16-bit CLA block diagram]
16-bit CLA Timing: d3
[Figure: 16-bit CLA block diagram]
16-bit CLA Timing: d5
⬗ And after 5 gate delays?
⬗ Outer level group G/P
⬗ Outer level “interior” carries
⬗ C4, C8, C12
[Figure: 16-bit CLA block diagram]
16-bit CLA Timing: d7
⬗ And after 7 gate delays?
⬗ C16
⬗ First level “interior” carries
⬗ C5, C6, C7
[Figure: 16-bit CLA block diagram]
Adders In Real Processors
⬗ Real processors super-optimize their adders
⬗ Ten or so different versions of CLA
⬗ Highly optimized versions of carry-select
⬗ Other gate techniques: carry-skip, conditional-sum
⬗ Sub-gate (transistor) techniques: Manchester carry chain
⬗ Combinations of different techniques
⬗ Alpha 21264 used CLA+CSeA+RippleCA
⬗ Used at different levels
Subtraction: Addition’s Tricky Pal
⬗ Sign/magnitude subtraction is mental reverse addition
⬗ 2C subtraction is addition
⬗ How to subtract using an adder?
⬗ sub A B = add A -B
⬗ Negate B before adding (fast negation trick: –B = B’ + 1)
⬗ Isn’t a subtraction then a negation and two additions?
+ No, an adder can implement A+B+1 by setting the carry-in to 1
[Figure: adder with inputs A and B (optionally inverted, ~) and a carry-in of 0 or 1]
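The carry-in trick can be sketched in C as follows. This is a minimal model assuming 16-bit two's-complement values stored in `uint16_t`; `add16` and `sub16` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit adder with a carry-in; the result wraps modulo 2^16. */
static uint16_t add16(uint16_t a, uint16_t b, unsigned carry_in) {
    assert(carry_in <= 1);
    return (uint16_t)(a + b + carry_in);
}

/* A - B computed as A + ~B + 1: invert B and set the carry-in to 1,
 * so a single pass through the adder is enough. */
uint16_t sub16(uint16_t a, uint16_t b) {
    return add16(a, (uint16_t)~b, 1);
}
```

`sub16(72, 29)` yields 43, and `sub16(0, 1)` yields 0xFFFF, the 16-bit two's-complement encoding of -1.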
Shifts & Rotates
Shift and Rotation Instructions
⬗ Left/right shifts are useful…
⬗ Fast multiplication/division by small constants (next)
⬗ Bit manipulation: extracting and setting individual bits in words
⬗ Right shifts
⬗ Can be logical (shift in 0s) or arithmetic (shift in copies of MSB)
srl 110011, 2 = 001100
sra 110011, 2 = 111100
⬗ Caveat: for negative numbers, sra by k is not the same as division by 2^k
⬗ Consider: -1 / 16 = 0 (division rounds toward zero), but sra -1, 4 = -1
⬗ Useful for address calculation: all basic data types are 2^M in size
int A[100];
&A[N] = A+(N*sizeof(int)) = A+N*4 = A+(N<<2)
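The -1 / 16 case can be tried directly. Because C leaves right shifts of negative signed values implementation-defined, the sketch below builds the arithmetic shift from shifts of non-negative values only; `sra32` is a hypothetical helper and assumes the two's-complement `int32_t` type.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic right shift of a 32-bit value by n bits (0 <= n < 32). */
int32_t sra32(int32_t x, unsigned n) {
    assert(n < 32);
    if (x >= 0) {
        return x >> n;      /* non-negative: zeros enter from the left */
    }
    /* Negative: ~x is non-negative, so shifting it is well defined;
     * complementing again makes ones enter from the left. */
    return ~(~x >> n);
}
```

`sra32(-1, 4)` is -1 while `-1 / 16` is 0 in C: the shift rounds toward negative infinity and the division rounds toward zero.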
A Simple Shifter
⬗ The simplest 16-bit shifter: can only shift left by 1
⬗ Implement using wires (no logic!)
⬗ Slightly more complicated: can shift left by 1 or 0
⬗ Implement using wires and a multiplexor (mux16_2to1)
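[Figure: left shift by 1 as pure wiring from A to O (0 shifted in at bit 0, A15 dropped), and a <<1 block with a mux selecting A or A<<1]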
Barrel Shifter
⬗ What about shifting left by any amount 0–15?
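One common way to shift by any amount from 0 to 15 is to cascade fixed shifts of 8, 4, 2 and 1, each followed by a 2-to-1 mux controlled by one bit of the shift amount. The sketch below assumes that construction; `barrel_shl16` is a hypothetical name, and the slides may present a different design.

```c
#include <assert.h>
#include <stdint.h>

/* Left shift by 0-15 built from four shift-or-pass stages. */
uint16_t barrel_shl16(uint16_t a, unsigned amount) {
    assert(amount < 16);
    uint16_t v = a;
    v = (amount & 8u) ? (uint16_t)(v << 8) : v;   /* stage 1: shift by 8 or 0 */
    v = (amount & 4u) ? (uint16_t)(v << 4) : v;   /* stage 2: shift by 4 or 0 */
    v = (amount & 2u) ? (uint16_t)(v << 2) : v;   /* stage 3: shift by 2 or 0 */
    v = (amount & 1u) ? (uint16_t)(v << 1) : v;   /* stage 4: shift by 1 or 0 */
    return v;
}
```

Four mux stages cover all 16 shift amounts, so the depth grows with the logarithm of the word width rather than with the width itself.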
Shifter in Verilog
⬗ Logical shift operators << >>
⬗ performs zero-extension for >>
wire [15:0] a = b << c[3:0];
⬗ Arithmetic shift operator >>>
⬗ performs sign-extension
⬗ requires a signed wire input
wire signed [15:0] b;
wire [15:0] a = b >>> c[3:0];
Multiplication
3rd Grade: Decimal Multiplication
   19   // multiplicand
 * 12   // multiplier
   38
+ 190
  228   // product
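// Shift-and-add multiplication of 16-bit operands
// md = multiplicand, mr = multiplier
unsigned multiply(unsigned md, unsigned mr) {
  unsigned pd = 0; // product
  for (int i = 0; i < 16 && mr != 0; i++) {
    if (mr & 1) {
      pd = pd + md; // add the shifted multiplicand
    }
    md = md << 1; // shift left
    mr = mr >> 1; // shift right
  }
  return pd;
}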
Hardware Multiply: Sequential
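[Figure: sequential multiplier datapath: 32-bit multiplicand register (<< 1), 16-bit multiplier register (>> 1) with lsb==1? test, 32-bit adder, and 32-bit product register with write enable (we)]
[Figure: combinational multiplier built from 16-bit adders that combine 0 or shifted copies C, C<<1, C<<2, C<<3 into the product P]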
Division
4th Grade: Decimal Division
    9    // quotient
3 |29    // divisor | dividend
  -27
    2    // remainder
⬗ Shift divisor left (multiply by 10) until MSB lines up with dividend’s
⬗ Repeat until remaining dividend (remainder) < divisor
⬗ Find largest single digit q such that (q*divisor) <= dividend
⬗ Set LSB of quotient to q
⬗ Subtract (q*divisor) from dividend
⬗ Shift quotient left by one digit (multiply by 10)
⬗ Shift divisor right by one digit (divide by 10)
Binary Division
1001 = 9
3 |29 = 0011 |011101
-24 = - 011000
5 = 000101
- 3 = - 000011
2 = 000010
Binary Division Hardware
⬗ Same as decimal division, except (again)
– More individual steps (base is smaller)
+ Each step is simpler
⬗ Find largest bit q such that (q*divisor) <= dividend
⬗ q = 0 or 1
⬗ Subtract (q*divisor) from dividend
⬗ q = 0 or 1 → no actual multiplication, subtract divisor or not
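#include <stdint.h>

// Shift-and-subtract division of 32-bit unsigned values.
// Assumes divisor != 0; the remainder is returned through *rem.
uint32_t divide(uint32_t dividend, uint32_t divisor, uint32_t *rem) {
  uint64_t remainder = 0; // 64 bits so the left shift cannot overflow
  uint32_t quotient = 0;
  for (int i = 0; i < 32; i++) {
    // bring down the next dividend bit, MSB first
    remainder = (remainder << 1) | (dividend >> 31);
    if (remainder >= divisor) {
      quotient = (quotient << 1) | 1;
      remainder = remainder - divisor;
    } else {
      quotient = (quotient << 1) | 0;
    }
    dividend = dividend << 1;
  }
  *rem = (uint32_t)remainder;
  return quotient;
}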
Divide Example
⬗ Input: Divisor = 00011 , Dividend = 11101
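The example inputs can be run through a 5-bit version of the shift-and-subtract loop. This sketch is sized to match the example and is not taken from the slides; `divide5` and its signature are hypothetical.

```c
#include <assert.h>

/* 5-bit shift-and-subtract division. Returns the quotient and writes
 * the remainder to *remainder_out. */
unsigned divide5(unsigned dividend, unsigned divisor, unsigned *remainder_out) {
    assert(divisor != 0 && dividend < 32 && divisor < 32);
    unsigned remainder = 0, quotient = 0;
    for (int i = 0; i < 5; i++) {
        unsigned msb = (dividend >> 4) & 1u;     /* next dividend bit, MSB first */
        remainder = (remainder << 1) | msb;
        if (remainder >= divisor) {
            quotient = (quotient << 1) | 1u;     /* q = 1: subtract the divisor  */
            remainder -= divisor;
        } else {
            quotient = quotient << 1;            /* q = 0: leave remainder as is */
        }
        dividend = (dividend << 1) & 0x1Fu;      /* keep only 5 bits             */
    }
    *remainder_out = remainder;
    return quotient;
}
```

With dividend 11101 (29) and divisor 00011 (3), the quotient bits come out as 0, 1, 0, 0, 1, giving quotient 01001 (9) and remainder 00010 (2).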
Divider Circuit
[Figure: divider circuit: Divisor register, subtractor (Sub) with >=0 test, Quotient register (shift in 0 or 1), Remainder register (shift in the Dividend msb), Dividend register (shift in 0)]
Floating Point (FP) Numbers
⬗ Floating point numbers: numbers in scientific notation
⬗ Two uses
              Int 32   Int 64   Fp 32   Fp 64
Add/Subtract  1        1        5       5
Multiply      3        3        5       5
Divide        14-30    14-46    8-15    8-15
⬗ Fast addition
⬗ Carry-select: parallelism in sum
⬗ Multiplication
⬗ Chains and trees of additions
⬗ Division
⬗ Floating point