Lecture 1 Numbers and Arithmetic
Front Page Computer Arithmetic • March 1994 – Thomas Nicely, a mathematician at Lynchburg College, Virginia – found that the Pentium processor did not match other processors in computing 1/p + 1/(p+2)
• October 1994 – Convinced that the Pentium was at fault – Exchanged results with other researchers – Posted results on the Internet
1
The Diagnosis • Tim Coe, engineer at Vitesse Semiconductor – built a model of the Pentium’s floating-point division hardware based on the radix-4 SRT algorithm – diagnosed the problem: – 4,195,835 / 3,145,727 = 1.333 820 44 – but on the Pentium it came out as 1.333 739 06 – (accurate to only 14 bits)
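The flawed quotient is easy to reproduce as a sanity check. This Python sketch (an illustration, not the original test) compares the correct quotient with the value the defective Pentium returned:

```python
# Coe's test case: 4,195,835 / 3,145,727
correct = 4195835 / 3145727      # 1.33382044... on a correct divider
pentium = 1.33373906             # what the flawed Pentium FDIV returned

print(f"correct: {correct:.8f}")
print(f"flawed:  {pentium:.8f}")
print(f"error:   {correct - pentium:.2e}")   # wrong beyond about 14 bits

# Equivalent integer check: x - (x / y) * y should round to exactly 0
assert round(4195835 - correct * 3145727) == 0
```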
Intel’s Response • Dismissed the severity of the problem • Admitted to a “subtle flaw” • Claimed a probability of 1 in 9 billion (once every 27,000 years) for the average user • Published a white paper describing the problem • Announced a replacement policy: – replacement of the defective part based on customer need; customers had to show that their applications required correct arithmetic
2
Customer Response • Heavy criticism from customers – Lots of bad press – On-line criticism
• Intel revised its policy: no-questions-asked replacement • First instance of arithmetic becoming front-page news
Moral • Glaring software faults have become routine (ref. Microsoft), but … • Hardware bugs are rare, not tolerated, and newsworthy • Computer arithmetic is important
3
What is computer arithmetic? • Major field in computer architecture • Implementation of arithmetic functions – Arithmetic algorithms for software or firmware. – Hardware algorithms – High-speed circuits for computation
Applications of Computer Arithmetic • Design of top-of-the-line CPUs • High-performance arithmetic circuits • Designs for embedded application-specific circuits • Arithmetic algorithms for software • Understanding what went wrong in the Pentium …
4
Numbers and their Encodings • Number representations have advanced in parallel with the evolution of language – Use of sticks and stones – Grouping of sticks into groups of 5 or 10. – Symbolic forms
Roman Numeral System • 1, 5, 10, 50, 100, 500, 1000 = I, V, X, L, C, D, M • Problems – not suitable for representing large numbers – difficult to do arithmetic with
5
Positional Number Systems • First used by Chinese • Value of a symbol depends on where it is. • Ex: 222 = 200 + 20 + 2 – Each symbol “2” has a different value
Fixed-Radix System • Each position is worth a constant multiple R of the position to its right: … Δ Δ Δ, where each Δ position is R times larger than the Δ to its right
• binary = Positional, Fixed Radix R=2 • decimal = Positional, Fixed Radix R=10
6
Mixed-Radix System • A radix vector gives the weights • Ex: time intervals: days : hours : minutes : seconds – a day is 24 hours, an hour is 60 minutes, a minute is 60 seconds
R = [0 24 60 60]
Digital Systems • Numbers are encoded using 0’s and 1’s • Suppose a system has 4 bits ⇒ 16 codes • You are free to assign the 16 codes to numbers as you please. Examples:
– Binary ⇒ [0, 15]
– Signed-magnitude ⇒ [-7, 7], 0 encoded twice
– 2’s complement ⇒ [-8, 7]
– 3.1 fixed point ⇒ [0, 7.5]
7
Fixed-Radix Positional Number Systems

$(x_{k-1} x_{k-2} \cdots x_0 \,.\, x_{-1} x_{-2} \cdots x_{-l})_r = \sum_{i=-l}^{k-1} x_i r^i$

(Error on page 8)

r is the radix; each $x_i$ is a digit; $\{0, 1, \ldots, r-1\}$ is the implicit digit set.
k.l digits: k digits for the whole part, l digits for the fractional part; “.” is the radix point.
Example: Balanced Ternary System

r = 3, digit set = $\{\bar1, 0, 1\}$, where $\bar1 = -1$

$\ldots,\ \bar1 1,\ \bar1,\ 0,\ 1,\ 1\bar1,\ 10,\ 11,\ 1\bar1\bar1,\ 1\bar1 0,\ \ldots$  (the representations of −2 through 6)
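These digit strings can be checked mechanically. A minimal Python sketch (function names are my own):

```python
def bt_value(digits):
    """Value of a balanced-ternary digit list (MSD first), digits in {-1, 0, 1}."""
    v = 0
    for d in digits:
        assert d in (-1, 0, 1)
        v = 3 * v + d
    return v

def to_bt(n):
    """Convert an integer to balanced ternary (MSD-first digit list)."""
    if n == 0:
        return [0]
    digits = []
    while n != 0:
        r = n % 3            # 0, 1, or 2
        if r == 2:           # 2 = 3 - 1: emit digit -1 and carry 1
            digits.append(-1)
            n = n // 3 + 1
        else:
            digits.append(r)
            n //= 3
    return digits[::-1]

print(to_bt(5))              # [1, -1, -1], since 9 - 3 - 1 = 5
print(bt_value([1, -1, 0]))  # 6, since 9 - 3 = 6
```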
8
Example: Redundant Signed-Digit Radix-4

r = 4, digit set = $\{\bar2, \bar1, 0, 1, 2\}$

$5_{decimal} = 11$; $6_{decimal} = 12 = 2\bar2$  (redundant: two representations)
Other Fancy Radices
• Negative radix systems
• Fractional radix systems
• Irrational radix systems
• Complex radix systems
– see examples in the book
9
How many digits are needed? To represent the natural numbers in [0, max] in radix r with digit set [0, r−1] requires k digits $x_{k-1} x_{k-2} \cdots x_0$, where

$max = r^k - 1$, so $k = \lfloor \log_r max \rfloor + 1 = \lceil \log_r(max + 1) \rceil$
Fixed-point Numbers Radix r and digit set [0, r−1], with k whole and l fractional digits:

$r^{-l} = ulp$, the unit in the least (significant) position
$max = r^k - r^{-l}$

Example (binary): $max = (1111.11)_2 = 2^4 - 2^{-2} = 15.75$
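Both formulas are easy to verify numerically; this small sketch (illustrative names) computes k, ulp, and max:

```python
def digits_needed(max_val, r):
    """Smallest k with r**k - 1 >= max_val (= floor(log_r max_val) + 1)."""
    k = 1
    while r ** k - 1 < max_val:
        k += 1
    return k

def fixed_point_params(r, k, l):
    """(ulp, max) for a k.l-digit radix-r fixed-point number, digit set [0, r-1]."""
    ulp = r ** -l
    return ulp, r ** k - ulp

assert digits_needed(15, 2) == 4       # 1111 covers [0, 15]
assert digits_needed(16, 2) == 5       # one more value needs a fifth bit
assert fixed_point_params(2, 4, 2) == (0.25, 15.75)
```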
10
Number Radix Conversion Assume that the unsigned value u has exact representations in both radices r and R:

$u = w.v$  (a fixed-point number)
$= (x_{k-1} x_{k-2} \cdots x_0 \,.\, x_{-1} x_{-2} \cdots x_{-l})_r$
$= (X_{K-1} X_{K-2} \cdots X_0 \,.\, X_{-1} X_{-2} \cdots X_{-L})_R$

Conversion Problem: given r (the old radix), R (the new radix), and the $x_i$’s (digits in radix r that represent u), find the $X_i$’s (digits in radix R that represent u).
11
Method #1 • Use when radix-r arithmetic is easier than radix-R arithmetic • Convert the whole part using repeated division (in radix r) by R; the remainders are the digits X0, X1, … • Convert the fractional part using repeated multiplication (in radix r) by R; the whole parts at each step are the digits X-1, X-2, …
Method #2 • Use when radix-R arithmetic is easier than radix-r arithmetic • Convert the whole part using Horner’s method: $u_k = r\,u_{k+1} + x_k$ • Convert the fractional part by converting $r^l v$ using the above method, and then dividing by $r^l$.
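Both methods can be sketched in a few lines of Python (ordinary integers stand in for "arithmetic in the easier radix"; function names are illustrative):

```python
def convert_whole(u, R):
    """Method #1 for the whole part: repeated division by the new radix R;
    the remainders come out as X0, X1, ... (least significant first)."""
    if u == 0:
        return [0]
    X = []
    while u > 0:
        u, rem = divmod(u, R)
        X.append(rem)
    return X[::-1]                 # most significant digit first

def horner(digits, r):
    """Method #2 for the whole part: Horner's rule u = r*u + x_k, evaluating
    the radix-r digits (MSD first) with arithmetic in the new radix."""
    u = 0
    for x in digits:
        u = r * u + x
    return u

assert convert_whole(0xB6, 8) == [2, 6, 6]    # 0xB6 = 182 = octal 266
assert horner([2, 6, 6], 8) == 182
```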
12
Shortcuts • When the old and new radices are integral powers of a common base b, that is, $r = b^g$ and $R = b^G$, conversion can be done with no computation, using table lookup. • Example: hex to octal
13
Lecture 2 Representing Signed Numbers
Lecture 2 • In lecture 1, we talked about natural numbers [0…max] – often referred to as unsigned numbers
• In this lecture, we will talk about signed numbers – include both positive and negative values
1
Signed-Magnitude Representation • One bit (MSB) is devoted to the sign. – By convention, 1 = negative, 0 = positive
• k−1 bits are available to represent the magnitude • Range of a k-bit signed-magnitude number:

$[-(2^{k-1} - 1),\ 2^{k-1} - 1]$
Signed-magnitude Representation
2
Signed-Magnitude Representation • Advantages – intuitive appeal & conceptual simplicity – symmetric range – simple negation
• Disadvantages – fewer numbers encoded (two encodings for 0) – subtraction is more complicated than in 2’s-comp
Circuit for Sign-Magnitude Arithmetic
3
Biased Representations • Signed numbers are converted into unsigned numbers by adding a constant bias:

$\underbrace{[-bias,\ max - bias]}_{\text{signed number}} + bias = \underbrace{[0,\ max]}_{\text{unsigned number}}$

• Such an encoding is sometimes referred to as excess-bias coding. • Excess-127 and excess-1023 codes are used for exponents in IEEE floating point.
Biased Representations Example: excess-8 coding
4
Biased Representations • Do not lend themselves to simple arithmetic: with $x_{bias} = x + bias$ and $y_{bias} = y + bias$,

$(x + y)_{bias} = x_{bias} + y_{bias} - bias$
$(x - y)_{bias} = x_{bias} - y_{bias} + bias$

• Multiplication and division performed directly on biased numbers is very difficult.
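The two correction rules can be checked with an excess-8 toy example (a sketch; names are my own):

```python
BIAS = 8  # excess-8 coding on 4 bits: signed range [-8, 7] -> codes [0, 15]

def encode(x):
    return x + BIAS

def add_biased(xb, yb):
    return xb + yb - BIAS    # the bias is counted twice; remove one copy

def sub_biased(xb, yb):
    return xb - yb + BIAS    # the biases cancel; add one copy back

assert add_biased(encode(3), encode(-5)) == encode(-2)
assert sub_biased(encode(-5), encode(3)) == encode(-8)
```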
Converting biased numbers

$x_{bias} = x + bias \;\Rightarrow\; x = x_{bias} - bias$

$x = \sum_{i=0}^{k-1} x_i 2^i - \sum_{i=0}^{k-1} b_i 2^i = \sum_{i:\,b_i=0} x_i 2^i + \sum_{i:\,b_i=1} (x_i - 1) 2^i$

That is, position i contributes $x_i$ if $b_i = 0$ and $x_i - 1$ if $b_i = 1$: positions whose bias bit is 0 are “+” positions with digit set {0, 1}; positions whose bias bit is 1 are “−” positions with digit set {−1, 0}.

+ and − digits:
x_i   + value   − value
0     0         −1
1     1         0
5
Converting biased numbers Example

+ and − digits:
x_i   + value   − value
0     0         −1
1     1         0

$(10010010)_{biased}$, with $bias = (11000011)_{base\text{-}2} = 195$

The bias bits mark the positions as − − + + + + − −, so

$1\,0\,0\,1\,0\,0\,1\,0 = 0 \cdot 2^7 - 1 \cdot 2^6 + 0 \cdot 2^5 + 1 \cdot 2^4 + \cdots - 1 \cdot 2^0 = -49$

check: 146 − 195 = −49
Complement Representations • Like biased representation, except: – only negative numbers are biased – the bias M is large enough to bias negative numbers above the positive number range

$x_{comp} = x$ for $x \ge 0$; $x_{comp} = M + x$ for $x < 0$ (i.e., $x_{comp} - M = x$)

• To represent numbers in the range [−N, +P], M ≥ N + P + 1 (equality gives maximum coding efficiency)
6
Complement Representation
Complement Representations • Subtraction is performed by – complementing the subtrahend – performing addition modulo-M
• Addition and subtraction are essentially the same operation. • This is the primary advantage of complement representations.
7
Complement Arithmetic • Two auxiliary operations are required to do complement arithmetic: – complementation (change of sign):

$(-x)_{M\text{-}comp} = M - x$
$(-(-x))_{M\text{-}comp} = M - (M - x) = x$

– computation of residues mod M:

$(x + M) \bmod M = x$
Addition of Complement Signed Numbers
8
Radix-complement • For a k-digit, radix-r number • $M = r^k$ • The auxiliary functions become: – modulo-M reduction: ignore the carry-out from digit position k−1 – complement of x (M−x): replace each digit $x_i$ with $r-1-x_i$ and add ulp (particularly simple if r is a power of 2).
Digit-complement (or Diminished-radix-complement) • For a k-digit, radix-r number (possibly with fractional digits as well) • $M = r^k - ulp$ ($= r^k - r^{-l}$ for a k.l-digit number) • The auxiliary functions become: – modulo-M reduction: add the carry-out from digit position k−1 back into the result – complement of x (M−x): replace each digit $x_i$ with $r-1-x_i$ (particularly simple if r is a power of 2).
9
Two’s-complement • Radix-complement with radix = 2 • Complementation constant $M = 2^k$ for a k-digit binary number • Range = $[-2^{k-1},\ 2^{k-1} - ulp]$ • The name “2’s complement” comes from the case k = 1, where M = 2.
2’s-complement Representation
10
Finding the two’s-complement (negation) of a number

$2^k - x = \bigl((2^k - ulp) - x\bigr) + ulp = (11\cdots1.1\cdots11 - x) + ulp = \bar{x} + ulp$

Negate: complement each bit and add one ulp. Because of the slightly asymmetric range, negation may lead to overflow!
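A sketch of the rule for 8-bit words, including the one asymmetric-range case where negation overflows:

```python
MASK = 0xFF   # 8-bit words (k = 8, ulp = 1)

def negate(x):
    """Two's-complement negation: complement each bit, then add one ulp."""
    return ((x ^ MASK) + 1) & MASK

def to_signed(x):
    return x - 256 if x >= 128 else x

assert to_signed(negate(5)) == -5
assert to_signed(negate(negate(7))) == 7
assert to_signed(negate(128)) == -128   # negating -128 overflows back to -128
```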
Two’s-complement Addition and Subtraction • To add numbers modulo $2^k$, simply drop the carry-out from bit position k−1. This carry is worth $2^k$. • To subtract, complement the subtrahend, then add, and drop the carry-out from bit position k−1.
11
Add/Sub Circuit for 2’s-complement Arithmetic
This method of negation never overflows.
Two’s-complement Sign Extension • To extend a k.l-digit number to k'.l' digits, the complementation constant increases from $M = 2^k$ to $M' = 2^{k'}$ • The difference of the two constants, $M' - M = 2^{k'} - 2^k = 2^k(2^{k'-k} - 1)$, must be added to the representation of any negative number. This is equal to (k'−k) 1’s followed by k 0’s.
12
One’s-complement • Digit-complement with radix = 2 • Complementation constant $M = 2^k - ulp$ for a k-digit binary number • Range = $[-(2^{k-1} - ulp),\ 2^{k-1} - ulp]$
1’s-complement Representation
13
Finding the one’s-complement (negation) of a number

$(2^k - ulp) - x = 11\cdots1.1\cdots11 - x = \bar{x}$

Negate: complement each bit. Because of the symmetric range, negation cannot overflow.
One’s-complement Addition and Subtraction • To add numbers modulo $2^k - ulp$, simply drop the carry-out from bit position k−1 and simultaneously insert a carry into position −l. The net effect is to reduce the result by $2^k - ulp$. This is known as end-around carry.
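The end-around carry rule can be sketched for 8-bit one’s-complement words (helper names are illustrative):

```python
K = 8
MASK = (1 << K) - 1          # all 1's: the representation of -0

def enc(v):
    """Encode a small signed value in 8-bit one's complement."""
    return v if v >= 0 else MASK ^ (-v)

def add_ones(x, y):
    """One's-complement addition with end-around carry."""
    s = x + y
    if s > MASK:             # carry out of position k-1:
        s = (s & MASK) + 1   # drop it, re-insert it at the least position
    return s

assert add_ones(enc(5), enc(-3)) == enc(2)
assert add_ones(enc(-5), enc(3)) == enc(-2)
```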
14
Oops • End-around carry does not reduce (mod $2^k - ulp$) a result that equals $2^k - ulp$ – no carry is generated
• However, $2^k - ulp$ = all 1’s = $(-0)_{1's\text{-}comp}$ • If it were reduced (mod $2^k - ulp$), it would be reduced to 0. • But −0 = 0, so does it matter?
One’s-complement Sign Extension • To extend a k.l-digit number to k'.l' digits, the complementation constant increases from $M = 2^k - ulp$ to $M' = 2^{k'} - ulp'$ • This leads to the rule that a one’s-complement number must be sign-extended on both ends.
15
Comparing radix- and digitcomplement Systems
Indirect Signed Arithmetic • If you only have hardware that does unsigned arithmetic and you want to do signed arithmetic, you can: – convert the signed operands to unsigned operands – obtain a result based on the unsigned operands – convert the result back to the signed representation
• This is called indirect arithmetic
16
Direct vs. Indirect Operations on Signed Numbers
Using Signed Positions or Signed Digits • The value of a two’s-complement number can be found using the standard binary-to-decimal conversion, except that the weight of the MSB (sign bit) is taken to be negative ($-2^{k-1}$)
17
Another way to Interpret 2’s-complement Numbers

$(10100110)_{2's\text{-}comp} = (\bar1 0100110)_{radix\text{-}2}$

Interpretation of Two’s-complement Numbers

$(x_{k-1} x_{k-2} \cdots x_0 \,.\, x_{-1} x_{-2} \cdots x_{-l})_{base\text{-}2} = \sum_{i=-l}^{k-1} x_i 2^i$

$(x_{k-1} x_{k-2} \cdots x_0 \,.\, x_{-1} x_{-2} \cdots x_{-l})_{2's\text{-}comp} = -x_{k-1} 2^{k-1} + \sum_{i=-l}^{k-2} x_i 2^i$
18
Generalization • Assign negative weights to an arbitrary subset of the k+l digit positions in a radix-r number, and positive weights to the other positions. • A negative-weight digit is in the set {−1, 0} • A positive-weight digit is in the set {0, 1}
More Generalization • Any set [−α, β] of – r or more consecutive integers (α+β+1 ≥ r) – that includes 0
can be used as the digit set for radix r. • If α+β+1 > r, then the number system is redundant, and ρ = α+β+1−r is the redundancy index.
19
Converting to another Digit Set

$\{0,1,2,3\}_{radix\text{-}4} \to \{\bar1,0,1,2\}_{radix\text{-}4}$:
0 → 0 0
1 → 0 1
2 → 0 2
3 → 1 $\bar1$
Converting to another Digit Set

$\{0,1,2,3\}_{radix\text{-}4} \to \{\bar2,\bar1,0,1,2\}_{radix\text{-}4}$:
0 → 0 0
1 → 0 1
2 → 0 2 or 1 $\bar2$
3 → 1 $\bar1$

Transfers do not propagate, and thus this conversion is always carry-free.
20
Lecture 3 Redundant Number Systems
Addition • Addition is the basic building block of all arithmetic operations • If addition is slow, then all other operations suffer in speed or cost. • Carry propagation is either – slow ( O(n) for carry ripple ), or – expensive ( carry lookahead, etc. )
1
Coping with the Carry Problem • Limit carry propagation to within a small number of bits • Detect the end of propagation rather than wait for the worst-case time • Speed up propagation using lookahead or other techniques • Ideal: eliminate carry propagation altogether!
Eliminating Carry Propagation • Can numbers be represented in such a way that addition does not require carry propagation ?? • Decimal [0,18]
But, a second addition could cause problems . . .
2
Limiting Carry Propagation • Consider adding 2 numbers with digit set [0,18] • Interval arithmetic: [0,18] + [0,18] = [0,36] = 10·[0,2] + [0,16] • Carry: [0,16] + [0,2] = [0,18] – No additional carries are generated – A carry propagates only one position – This is referred to as carry-free addition
Carry-Free Addition
How much redundancy in the digit set is needed to enable carry-free addition ?
3
[0,18] is more than you need: Carry-Free using [0,11]
The Key to Carry-Free Addition • Redundant representations provide multiple encodings for some numbers • Interim sums + transfer carries fit into the digit set and do not propagate carries • Single-stage propagation is eliminated with a simple lookahead scheme: – $s_i = f(x_i, y_i, x_{i-1}, y_{i-1})$ – no carry
4
Carry-Free Addition
Redundancy in Computer Arithmetic • Redundancy is used extensively for speeding up arithmetic operations. • First example introduced in 1959: Carry-save addition
5
Carry-Ripple Addition

[Figure: multi-operand addition with ripple-carry rows of full-adder (FA) cells; operands A3..A0, B3..B0, C3..C0, D3..D0 feed successive rows, carries ripple through each row, producing sum bits S5..S0]
Carry-Save Addition

[Figure: the same operands added with carry-save rows of FA cells used as (3,2) counters; each row reduces three numbers to a sum word and a carry word with no carry propagation, and only the final row ripples carries to produce S5..S0]
6
Carry-Save Numbers Digit : Representation 0 : (0,0) 1 : (0,1) or (1,0) 2 : (1,1)
Carry-save Addition
Digit Set Conversion
[0,2] + [0,1] + [0,1] = 2*[0,1] + [0,2]
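A carry-save stage is just a bitwise full adder applied to whole words; one ordinary addition at the end finishes the job. A minimal sketch:

```python
def carry_save_add(x, y, z):
    """Reduce three binary numbers to two (sum, carry) with no carry
    propagation: each bit position is an independent (3,2) counter."""
    s = x ^ y ^ z                              # bitwise sum
    c = ((x & y) | (x & z) | (y & z)) << 1     # bitwise majority, shifted left
    return s, c

s, c = carry_save_add(0b1011, 0b0110, 0b1101)
assert s + c == 0b1011 + 0b0110 + 0b1101       # a single add completes the sum
```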
7
Digit Sets and Digit-set Conversion: converting a radix-r number with digit set [−λ, µ] (λ+µ+1 ≥ r) to the same radix with digit set [−α, β] (α+β+1 ≥ r) is essentially a digit-serial process, like carry propagation, that begins at the right and ripples left.
[0, 18] → [0, 9]
8
[0, 2] → [0, 1]
[0, 18] → [-6, 5]
9
[0, 2] → [-1, 1]
Generalized Signed-digit Numbers • A digit set need not be the standard [0, r−1] • Radix-2 [−1, 1] was first proposed in 1897. • Avizienis (1961) defined the class of signed-digit number systems with symmetric digit sets [−α, α] in radix r > 2, with

$r/2 + 1 \le \alpha \le r - 1$
10
Categorizing Digit Sets • GSD [−α, β]: Generalized Signed-digit – symmetric: α = β; asymmetric: α ≠ β – minimal: ρ = α+β+1−r = 1 (minimal redundancy)
• OSD [−α, α]: Ordinary Signed-digit (Avizienis) • BSC [0, 2], radix 2: Binary Stored-Carry • BSD [−1, 1], radix 2: Binary Signed-Digit
A Taxonomy of Positional Number Systems
11
Encodings for [-1, 1] • To represent a number in binary, each of the α+β+1 digit values must be encoded in bits.
– There are many possible encodings
– Encoding efficiency = (total number of different numbers represented) / 2^bits
Hybrid Signed-digit • Redundancy in select positions only. BSD = [-1,1] B = [0,1]
Worst case propagation is 3 stages
12
GSD Carry-free Addition Algorithm • Compute the position sums: $p_i = x_i + y_i$ • Separate each $p_i$ into a transfer $t_{i+1}$ and an interim sum $w_i$ such that $p_i = r\,t_{i+1} + w_i$ • Add the incoming transfers to obtain the sum digits $s_i = w_i + t_i$, with no new transfer generated.
Conditions for Carry-free Addition • $t_i$ is from digit set [−λ, µ] • $s_i$ is from digit set [−α, β] • To ensure $s_i = w_i + t_i$ generates no new transfer, we need $-\alpha + \lambda \le w_i \le \beta - \mu$, where the interim sum is $w_i = p_i - r\,t_{i+1}$ • This can be shown to hold if λ ≥ α/(r−1) and µ ≥ β/(r−1)
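The three steps above can be sketched for radix 10 with the redundant digit set [0, 18] used earlier; the particular transfer-selection rule below is one valid choice satisfying the conditions, not the only one:

```python
def carry_free_add(x, y, r=10):
    """Carry-free addition of radix-10 numbers with redundant digit set [0, 18].
    Digit lists are least-significant first.  Each position sum p = x_i + y_i
    (in [0, 36]) is split as p = r*t + w with transfer t in [0, 2] and interim
    sum w in [0, 16], so the final digit w + t_in stays in [0, 18]."""
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))
    t_in = 0
    s = []
    for i in range(n):
        p = x[i] + y[i]
        t = max(0, (p - 7) // r)   # transfer selection (one valid choice)
        w = p - r * t              # interim sum, in [0, 16]
        s.append(w + t_in)         # no new transfer can be generated
        t_in = t
    s.append(t_in)
    return s

def value(digits, r=10):
    return sum(d * r**i for i, d in enumerate(digits))

a, b = [8, 15, 3], [9, 12, 11]     # digits in [0, 18], LSD first
total = carry_free_add(a, b)
assert value(total) == value(a) + value(b)
assert all(0 <= d <= 18 for d in total)
```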
13
Selection of Transfer Digit • The value $p_i$ is compared to a set of selection constants $C_j$, $-\lambda \le j \le \mu + 1$ • If $C_j \le p_i < C_{j+1}$ then $t_{i+1} = j$
Selecting the Transfer Value

Example: r = 10, digit set [−5, 9], so $p_i \in [-10, 18]$. The conditions λ ≥ 5/9 and µ ≥ 1 give $t_i \in [-\lambda, \mu] = [-1, 1]$, interim sums $w_i \in [-\alpha + \lambda, \beta - \mu] = [-4, 8]$, and $s_i \in [-\alpha, \beta] = [-5, 9]$.

[Figure: the range $p_i \in [-10, 18]$ divided by selection constants $C_0$ and $C_1$ into regions with $t_{i+1} = -1$, 0, and 1; $w_i = p_i - 10\,t_{i+1}$]
14
How are Selection Constants chosen ?
Adding with radix-10, [-5, 9]
15
How much redundancy is needed? • Carry-free addition is possible iff one of the following sets of conditions is satisfied:
– r > 2, ρ ≥ 3
– r > 2, ρ = 2, α ≠ 1, β ≠ 1
• Does not work for:
– r = 2
– ρ = 1
– ρ = 2, α = 1 or β = 1
Limited-carry algorithm for GSD numbers • Use when carry-free algorithms do not exist
16
Implementations of Limited-carry Addition
Example (BSD, radix 2): position sums $p_i \in [-2, 2]$, transfers $t_{i+1} \in [-1, 1]$.

[Table: choice of $(t_{i+1}, w_i)$ for each $p_i$, depending on whether the estimate $e_i$ is “low” [−1, 0] or “high” [0, 1]; entries include (−1,0), (−1,1), (0,−1), (0,0), (0,1), (1,−1), (1,0)]
17
[Table: a second example with position sums $p_i \in [0, 6]$ and transfers $t_{i+1} \in [0, 3]$; $(t_{i+1}, w_i)$ is chosen according to whether the estimate $e_i$ is “low” [0, 2] or “high” [1, 3]; entries include (0,0), (0,1), (1,−1), (1,0), (1,1), (2,−1), (2,0), (2,1), (3,−1), (3,0)]
18
Conversions • Outside world is binary or decimal • To do GSD arithmetic internally, conversions are required. • Conversion is done at input and output.
19
Example: Conversion of BSD to Signed Binary
Support Functions • Zero detection – used for equality testing
• Sign test – used for relational comparison (< , ≤ , ≥ , >)
• Overflow handling
20
Zero Test • Zero may have multiple representations • If α
Sign Test • The sign of a GSD number generally depends on all of its digits. • In general, sign test is: – slow if done by carry propagation, or – expensive if done by fast lookahead
• If α
21
Overflow • Detection is difficult in GSD arithmetic. Even when the outgoing transfer $t_k \ne 0$, the result might still be representable in k digits.
True overflow detection is possible, but slow.
Difficulties with GSD • Difficulties with sign test and overflow detection can nullify some, or all of the speed advantages of GSD number representations. • Applications of GSD are presently limited to special-purpose systems, or to internal number representations.
22
Lecture 4 Residue Number Systems
RNS Representation and Arithmetic • Given:
– x mod 7 = 2
– x mod 5 = 3
– x mod 3 = 2
• What is x?
• $(2\,|\,3\,|\,2)_{RNS(7|5|3)}$ = ?
1
Residue Number Systems (RNS) • $X = (x_{k-1}\,|\,\ldots\,|\,x_1\,|\,x_0)$ • A positional number system, with different weights for each position. • Position weights are determined by mutually prime moduli $m_{k-1}, \ldots, m_1, m_0$ • $m_{k-1} > \ldots > m_1 > m_0$ • $x_i = X \bmod m_i = |X|_{m_i}$, with $x_i \in [0, m_i - 1]$ • Dynamic range: $M = m_{k-1} \times \cdots \times m_1 \times m_0$ – the number of distinct values that can be represented.
Default RNS • RNS(8 | 7 | 5 | 3) • M = 8 × 7 × 5 × 3 = 840 – 840 is the total number of distinct values that can be represented with RNS(8|7|5|3): – [0, 839] unsigned, or – [−420, 419] signed, or – any interval of 840 consecutive integers
2
Some examples
– ( 0 | 0 | 0 | 0 ) = 0 or 840 or …
– ( 1 | 1 | 1 | 1 ) = 1 or 841 or …
– ( 2 | 2 | 2 | 2 ) = 2 or 842 or …
– ( 0 | 1 | 3 | 2 ) = 8 or 848 or …
– ( 5 | 0 | 1 | 0 ) = 21 or 861 or …
– ( 0 | 1 | 4 | 1 ) = 64 or 904 or …
– ( 2 | 0 | 0 | 2 ) = −70 or 770 or …
– ( 7 | 6 | 4 | 2 ) = −1 or 839 or …
Negative RNS Numbers • $|M|_{m_i} = 0$ • $|-x|_{m_i} = |M - x|_{m_i}$ • Given the RNS representation of x, the representation of −x can be found by complementing each of the digits $x_i$ with respect to its modulus $m_i$ (0 digits remain unchanged).
– 21 = $(5\,|\,0\,|\,1\,|\,0)_{RNS(8|7|5|3)}$
– −21 = ( 8−5 | 0 | 5−1 | 0 ) = ( 3 | 0 | 4 | 0 )
3
Converting RNS to Decimal • Any RNS can be viewed as a weighted positional representation. • For RNS(8|7|5|3) the weights associated with the four positions are: 105, 120, 336, 280 • Example:

$(1|2|4|0)_{RNS} = (105 \times 1 + 120 \times 2 + 336 \times 4 + 280 \times 0) \bmod 840 = 1689 \bmod 840 = 9$
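The position weights can be derived and used directly; a sketch using Python’s modular inverse (`pow(x, -1, m)`, available in Python 3.8+):

```python
from math import prod

MODULI = (8, 7, 5, 3)
M = prod(MODULI)                      # dynamic range, 840

# Position weights: w_i = (M/m_i) * |(M/m_i)^-1|_{m_i}
WEIGHTS = [(M // m) * pow(M // m, -1, m) for m in MODULI]

def rns_to_int(digits):
    """Decode an RNS number as a weighted sum of its residues, mod M."""
    return sum(w * d for w, d in zip(WEIGHTS, digits)) % M

assert WEIGHTS == [105, 120, 336, 280]
assert rns_to_int((1, 2, 4, 0)) == 9
```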
Representational Efficiency • Each digit must be encoded in binary • RNS ( 8 | 7 | 5 | 3 ) requires 3 + 3 + 3 + 2 = 11 bits • Representational efficiency = 840/2048 = 41%
4
RNS Arithmetic • Negation, Addition, Subtraction, and Multiplication can be performed independently by operating on each digit individually.
Advantages of RNS Arithmetic • No carry problem • Digits are small – digit operations can easily be done with lookup tables – with 6-bit residue digits, each operation requires a 4K × 6 table • Fast and simple
5
Disadvantages of RNS Arithmetic • Division, sign test, magnitude comparison, and overflow detection are difficult and complex. • These difficulties have thus far limited the application of RNS to certain signal processing problems where – addition and multiplication are the predominant operations, and – results are within known ranges
A distant light ? • Developments in recent years by Hung and Parhami (1994) have greatly reduced the cost for division and sign detection. • May lead to more widespread application of RNS in the future.
6
Choosing the RNS Moduli • The set of moduli chosen affects: – representational efficiency – complexity of the arithmetic algorithms • The magnitude of the largest modulus dictates the speed of arithmetic operations – so make all the moduli comparable in size to the largest one; this doesn’t change the speed of arithmetic • Moduli must be mutually prime
Example: $[0, 100000]_{10}$ • Normally requires 17 bits to represent • Choose mutually prime moduli until the dynamic range M > 100000:
– RNS(13|11|7|5|3|2), M = 30030 • too small
– RNS(17|13|11|7|5|3|2), M = 510510 • 5.1 times too big, in fact • so remove the 5
– RNS(17|13|11|7|3|2), M = 102102
7
Example, continued
– RNS(17|13|11|7|3|2), M = 102102 • Bits = 5+4+4+3+2+1 = 19 • Speed dictated by the 5-bit residues • Combine moduli 2 & 13 and 3 & 7 with no speed penalty:
– RNS(26|21|17|11) • still needs 5+5+5+4 = 19 bits • but two fewer modules
Another Approach • Better results can be obtained if we proceed as before, but include powers of smaller primes:
– RNS($2^2$|3), M = 12
– RNS($3^2$|$2^3$|7|5), M = 2520
– RNS(11|$3^2$|$2^3$|7|5), M = 27720
– RNS(13|11|$3^2$|$2^3$|7|5), M = 360360 • 3.6 times too large; replace 9 with 3, combine 3 & 5
– RNS(15|13|11|$2^3$|7), M = 120120
• 4+4+4+3+3 = 18 bits – fewer bits than before • faster because the largest residue is 4 bits instead of 5
8
Low-cost Moduli • $2^k$ moduli simplify the required arithmetic operations (particularly the mod operation) – mod-16 is easier than mod-13 • $2^k - 1$ moduli are also easy – a k-bit adder with end-around carry • $2^a - 1$ and $2^b - 1$ are relatively prime iff a and b are relatively prime • k-modulus system: $RNS(2^{a_{k-2}} \,|\, 2^{a_{k-2}} - 1 \,|\, \cdots \,|\, 2^{a_1} - 1 \,|\, 2^{a_0} - 1)$ with $a_{k-2} > \cdots > a_1 > a_0$ mutually prime
Try it on [0, 100000] • $RNS(2^5 \,|\, 2^5 - 1 \,|\, 2^4 - 1 \,|\, 2^3 - 1)$, basis: 5, 4, 3
– RNS(32|31|15|7), M = 104160 • 5+5+4+3 = 17 bits, efficiency ≈ 100% • provably > 50% efficiency in the worst case (no more than 1 extra bit)
– largest residue = 5 bits • but the power of 2 makes it simple
– best choice yet for [0, 100000]
9
Choosing the Moduli Summary • In general, restricting moduli to low-cost moduli tends to increase the width of the largest residues. • The optimal choice depends on both: – the application, and – the target implementation technology
Encoding and Decoding Numbers • Binary to RNS: for $y = (y_{k-1} \cdots y_1 y_0)_{two}$,

$x_i = |y|_{m_i} = \big|\, |y_{k-1} 2^{k-1}|_{m_i} + \cdots + |y_1 2^1|_{m_i} + |y_0|_{m_i} \,\big|_{m_i}$

– precompute and store $|2^j|_{m_i}$ for each $m_i$
– the residue $x_i = |y|_{m_i} = (y \bmod m_i)$ is then computed by modulo-$m_i$ addition of the selected stored constants
10
Precomputed Residues of $2^0, 2^1, \ldots, 2^9$ for RNS(8|7|5|3)
Why are the residues $|2^j|_8$ not shown?
Convert $164_{10}$ to RNS(8|7|5|3)
11
Conversion from RNS to Mixed-radix Form • Associated with any RNS($m_{k-1}|\ldots|m_2|m_1|m_0$) is a mixed-radix number system MRS($m_{k-1}|\ldots|m_2|m_1|m_0$), which is essentially – a k-digit positional number system with weights: $(m_{k-2} \cdots m_2 m_1 m_0)$, …, $(m_2 m_1 m_0)$, $(m_1 m_0)$, $(m_0)$, $(1)$ – and digit sets: $[0, m_{k-1}-1]$, …, $[0, m_2-1]$, $[0, m_1-1]$, $[0, m_0-1]$ (the digit sets span the same ranges as the RNS digits, but the digits themselves are different)
Example • MRS(8|7|5|3) has position weights: 7×5×3 = 105, 5×3 = 15, 3, 1 • $(2|3|4|1)_{MRS} = 2 \times 105 + 3 \times 15 + 4 \times 3 + 1 = 210 + 45 + 12 + 1 = 268$
12
Conversion from RNS to Mixed-radix Form • The RNS-to-MRS conversion is to find the digits $z_i$ of the MRS, given the digits $x_i$ of the RNS:

$y = (x_{k-1} | \cdots | x_2 | x_1 | x_0)_{RNS} = (z_{k-1} | \cdots | z_2 | z_1 | z_0)_{MRS}$
$= z_{k-1}(m_{k-2} \cdots m_2 m_1 m_0) + \cdots + z_2(m_1 m_0) + z_1(m_0) + z_0$

• To find each digit:
– $x_0 = |y|_{m_0} = z_0$, since every other term is a multiple of $m_0$
– subtract $z_0$ from both sides: $y' = y - z_0$
– divide both sides by $m_0$: $y'' = y'/m_0$
– repeat the process on $y''$ to get the next digit
RNS Arithmetic • To compute $y' = (x'_{k-1} | \cdots | x'_1 | x'_0)_{RNS} = y - z_0$:
– $z_0 = (z_0 | \cdots | z_0 | z_0)_{RNS}$, so $x'_j = |x_j - z_0|_{m_j}$
• To compute $y'' = (x''_{k-1} | \cdots | x''_1 | x''_0)_{RNS} = y'/m_0$:
– much easier than general division; called scaling
– for each position find the multiplicative inverse $i_j$ of $m_0$ with respect to $m_j$, such that $|i_j \times m_0|_{m_j} = 1$
– with $i = (i_{k-1} | \cdots | i_1 | i_0)_{RNS}$: $x''_j = |i_j \times x'_j|_{m_j}$, i.e., $y'' = y'/m_0 = i \times y'$
13
Example: $y = (0|6|3|0)_{RNS}$

After Conversion • The mixed-radix representation allows us to: – compare the magnitudes of two RNS numbers – detect the sign of a number
• $(0|6|3|0)_{RNS}$ ? $(5|3|0|0)_{RNS}$ — convert to MRS — $(0|3|1|0)_{MRS}$ ? $(0|3|0|0)_{MRS}$; using ordinary comparison, $(0|3|1|0)_{MRS} > (0|3|0|0)_{MRS}$
14
Conversion from RNS to Binary/Decimal • Method #1: RNS → MRS → decimal/binary • Method #2 (direct): RNS → decimal/binary using RNS position weights computed via the Chinese remainder theorem (CRT)
Example: (3|2|4|2)RNS • Consider conversion of y=(3|2|4|2)RNS to decimal. Based on RNS properties:
15
Example: (3|2|4|2)RNS • Knowing the values of the following four constants (the RNS position weights) would allow us to convert any number from RNS(8|7|5|3) to decimal using four multiplications and three additions.
• Thus,
How are the weights derived? • $w_3 = (1|0|0|0)_{RNS} = 105$? • Since the last three residues are 0’s, $w_3$ is divisible by 3, 5, and 7. Hence it is a multiple of 105. • We must pick the right multiple of 105 such that its residue with respect to 8 is 1: $|n \times 105|_8 = 1$; for $w_3$, n = 1
16
Chinese Remainder Theorem

$x_3 = 3$, $m_3 = 8$
$M_3 = M/m_3 = 840/8 = 105$
$\alpha_3 = |M_3^{-1}|_{m_3} = 1$, since $|105 \times 1|_8 = 1$
contribution: $|M_3\,\alpha_3\,x_3|_M = |105 \times 1 \times 3|_{840} = 315$

To avoid multiplication in the conversion process, we can store premultiplied constants. Conversion is then performed by doing only table lookups and modulo-M additions.
17
Difficult Arithmetic Operations • Sign test • Magnitude comparison • Overflow detection – the above three are essentially the same problem – two methods: • convert to MRS or binary and compare • do approximate CRT decoding and compare
• General division – discussed in chapters 14 and 15
Redundant RNS representation • Example: modulus m = 13 • Normal digit set [0,12] (4 bits) • Redundant digit set [0,15] (still 4 bits) – residues 0, 1, 2 have the redundant representations 13, 14, 15 respectively, since • 0 mod 13 = 13 mod 13 • 1 mod 13 = 14 mod 13 • 2 mod 13 = 15 mod 13
– modulo addition is done by a 4-bit adder • a carry-out causes 3 to be added to the result as an adjustment (since 16 mod 13 = 3)
18
Limits of Fast Arithmetic in RNS • Addition of binary numbers in the range [0, M−1] can be done in: – O(log log M) time – O(log M) cost (using carry lookahead, etc.) • Addition of low-cost residue numbers: – O(log log M) time – O(log M) cost (using carry lookahead, etc.) • Asymptotically, RNS offers little advantage over standard binary
19
Lecture 5 Basic Addition and Counting
Half Adders and Full Adders Basic building blocks for arithmetic circuits

Half Adder — Inputs: x, y. Outputs: $s = x \oplus y$, $c = x \cdot y$

Full Adder — Inputs: x, y, $c_{in}$. Outputs: $s = x \oplus y \oplus c_{in}$, $c_{out} = x \cdot y + (x + y) \cdot c_{in}$

A full adder is also called a (3,2) counter.

[Figures: HA and FA block symbols]
1
Full Adder

[Figure: a full adder built from two half adders — the first HA adds x and y; the second HA adds that sum to $c_{in}$; $c_{out}$ is the OR of the two HA carry outputs]
Mixed Positive and Negative Binary Full Adders

+ digit set = {0, 1}; − digit set = {−1, 0}

Digit encodings (excess-1 for − digits): a + digit encodes its value directly (0 → 0, 1 → 1); a − digit is encoded as value + 1 (−1 → 0, 0 → 1)

[Figure: full adders with various +/− labels on the x, y, carry, and sum lines]
2
More Mixed Binary Full Adders

[Figure: the remaining input/output sign combinations of +/− digit sets on the same FA cell]

Amazingly, all are done using the same full adder.
Mixed Binary Additions

[Figure: columns of FA cells adding numbers whose positions carry various +/− digit-set labels]

Propagate the digit sets. You can add any combination of +/− to any other combination of +/−.
3
Converting Two’s Comp to +/−

A two’s-complement number has position weights $-2^{k-1}, 2^{k-2}, \ldots, 2^0$ — that is, it is already a mixed +/− digit-set number with a single − position at the MSB: − + … +
Half Adder — 18 transistors (inputs x, y; outputs s, c)
4
Full Adder — 36 transistors (inputs x, y, $c_{in}$; outputs s, $c_{out}$)

[Figure: CMOS full adder circuit]
5
Ripple Carry Adder

[Figure: a 4-bit ripple-carry adder; FA cells take $x_i, y_i$ and a chained carry ($c_{in}$ at the right, $c_{out}$ at the left) and produce $s_3 \ldots s_0$; the worst-case delay path runs through the entire carry chain]
Serial Addition
6
Conditions, Flags, and Exceptions
• Overflow – the output cannot be represented in the format of the result
• Sign – 1 if the result is negative, 0 if the result is positive
• Zero – the result is zero
Signed Overflow • $\text{Overflow}_{two's\text{-}comp}$ = sign of result is wrong

$= x_{k-1}\,y_{k-1}\,\bar{s}_{k-1} + \bar{x}_{k-1}\,\bar{y}_{k-1}\,s_{k-1}$

when $c_{k-1} = 1$: the first term is 0 and the second equals $\bar{c}_k$
when $c_{k-1} = 0$: the first term equals $c_k$ and the second is 0

$= c_k \cdot \bar{c}_{k-1} + \bar{c}_k \cdot c_{k-1} = c_k \oplus c_{k-1}$
7
Unsigned Overflow • $\text{Overflow}_{unsigned}$ = carry out of the last stage = $c_k$
Sign

$\text{Sign}_{signed}$ = 0 when positive, 1 when negative
$= s_{k-1}$ when Overflow = 0
$= \bar{s}_{k-1}$ when Overflow = 1
$= s_{k-1} \oplus \text{Overflow} = s_{k-1} \oplus c_k \oplus c_{k-1}$

$\text{Sign}_{unsigned} = 0$ (always positive!)
8
Zero

$\text{Zero} = \bar{s}_{k-1} \cdot \bar{s}_{k-2} \cdots \bar{s}_0$ (both signed and unsigned)

Implemented with a k-input NOR gate.
What’s wrong with the diagram from the book ?
9
Flag Summary

Flag       Signed                                          Unsigned
Overflow   $c_k \oplus c_{k-1}$                            $c_k$
Sign       $s_{k-1} \oplus c_k \oplus c_{k-1}$             0
Zero       $\bar{s}_{k-1} \bar{s}_{k-2} \cdots \bar{s}_0$  $\bar{s}_{k-1} \bar{s}_{k-2} \cdots \bar{s}_0$
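The flag definitions can be exercised with plain integers; here is an illustrative 8-bit sketch (function name and packing are my own, not a particular adder design):

```python
K = 8  # word width

def add_with_flags(x, y):
    """Add two k-bit operands; derive the flags from c_k (carry out of
    position k-1) and c_{k-1} (carry into position k-1)."""
    s = (x + y) & 0xFF
    c_k = (x + y) >> K & 1
    c_k1 = ((x & 0x7F) + (y & 0x7F)) >> (K - 1) & 1
    overflow_signed = c_k ^ c_k1
    overflow_unsigned = c_k
    sign_signed = (s >> (K - 1)) ^ overflow_signed
    zero = int(s == 0)
    return s, overflow_signed, overflow_unsigned, sign_signed, zero

assert add_with_flags(100, 100)[1] == 1   # +100 + +100 overflows signed 8-bit
assert add_with_flags(255, 1)[2] == 1     # unsigned overflow (carry out)
assert add_with_flags(255, 1)[4] == 1     # the result wraps to zero
```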
Analysis of Carry Propagation

[Figure: probability distribution of carry-chain length, from 0 to k; the average carry-propagation length is short, the average worst-case length is about log₂k, and the absolute worst case is k]

Asynchronous circuits must wait for the average worst case; synchronous circuits must wait for the absolute worst case.
10
Carry Completion Detection • For asynchronous arithmetic (not useful for synchronous arithmetic) • Carry completion detection gives a done signal when the carry chain has settled • Average time ∝ log₂k • Two-rail encoding $(b_i, c_i)$:
– 00: carry not known yet
– 01: carry known to be 1
– 10: carry known to be 0
Carry Completion Detection
All $b_i$ and $c_i$ start at 0.
11
Speeding Up Addition: Making Low-Latency Carry Chains • From the point of view of carry propagation, computation of the sum is not important. At each position a carry is either:
– generated: $x_i + y_i \ge r$
– propagated: $x_i + y_i = r - 1$, or
– annihilated: $x_i + y_i < r - 1$

For binary: $g_i = x_i y_i$, $p_i = x_i \oplus y_i$

$c_{i+1} = g_i + p_i \cdot c_i$  (the “carry recurrence”)
$s_i = p_i \oplus c_i$

Propagation of Carry:

$c_{i+1} = g_i + p_i c_i = g_i(1 + c_i) + p_i c_i = g_i + (g_i + p_i) c_i = g_i + t_i c_i$, where $t_i = x_i + y_i$ (logical OR)
12
Propagation of Inverse Carry ci +1 = g i + pi ci = g i ⋅ ( pi + ci ) = g i pi + g i ci = ai + (ai + pi ) ⋅ ci = ai + pi ci
Manchester Carry Chains

time(i) ∝ i²  (the RC delay of a pass-transistor chain grows quadratically with its length, so long chains are broken into short segments)
Lecture 6 Carry-Lookahead Adders
Unrolling the Carry Recurrence

c_1 = g_0 + p_0 c_0
c_2 = g_1 + p_1 c_1 = g_1 + p_1 (g_0 + p_0 c_0) = g_1 + p_1 g_0 + p_1 p_0 c_0
c_3 = g_2 + p_2 c_2 = g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c_0
c_4 = g_3 + p_3 c_3 = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0 + p_3 p_2 p_1 p_0 c_0
⋮
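The unrolled form of c_4 is just the recurrence flattened into two-level logic. A small Python check that the flattened sum-of-products agrees with the bit-by-bit recurrence (function names are made up for illustration):

```python
def carry_recurrence(g, p, c0):
    """Carries from the recurrence c_{i+1} = g_i + p_i * c_i."""
    c = [c0]
    for gi, pi in zip(g, p):
        c.append(gi | (pi & c[-1]))
    return c

def c4_flat(g, p, c0):
    """c_4 as a single two-level sum-of-products (the fully unrolled form)."""
    return (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
            | (p[3] & p[2] & p[1] & g[0])
            | (p[3] & p[2] & p[1] & p[0] & c0))
```

An exhaustive check over all 512 input combinations confirms the two forms are identical.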
4-bit Full Carry Lookahead
HP Carry Lookahead Circuit
Alternatives to Full Carry Lookahead • Full carry lookahead is impractical for wide addition • Tree Networks – less circuitry than full lookahead at the expense of increased latency
• Kinds of Tree Networks – High-radix addition (radix must be power of 2) – Multi-level carry lookahead (technique most used in practice)
4-bit Propagate & Generate

g_[i,i+3] = g_{i+3} + g_{i+2} p_{i+3} + g_{i+1} p_{i+2} p_{i+3} + g_i p_{i+1} p_{i+2} p_{i+3}
p_[i,i+3] = p_i p_{i+1} p_{i+2} p_{i+3}
4-bit Lookahead Carry Generator
c_{i+4} = g_[i,i+3] + p_[i,i+3] · c_i
4-bit Lookahead Carry Generator
General Propagate & Generate

For i < j < k, the blocks [i, j−1] and [j, k−1] combine into the block [i, k−1]:

g_[i,k−1] = g_[j,k−1] + p_[j,k−1] · g_[i,j−1]
p_[i,k−1] = p_[i,j−1] · p_[j,k−1]

Lookahead Carry Generator
16-bit Carry Chain with 2-level Carry Lookahead
What is the worst case delay path ?
Worst Case Latency
• Producing the g and p for individual bit positions (1 gate delay)
• Producing the g and p signals for 4-bit blocks (2 gate delays)
• Predicting the carry-in signals c4, c8, c12 for the blocks (2 gate delays)
• Predicting the internal carries within each 4-bit block (2 gate delays)
• Computing the sum bits (2 gate delays)
Worst Case Latency
• The delay of a k-bit carry-lookahead adder based on 4-bit lookahead blocks is:

    T = 4 log4 k + 1  gate delays
Final cout = c_k
• The last carry is not used to compute any sum bits.
• It is needed in many situations - overflow computation, for example.
• Three ways to compute it:
    c_k = g_[0,k−1] + p_[0,k−1] c_0
    c_k = g_{k−1} + p_{k−1} c_{k−1}
    c_k = x_{k−1} y_{k−1} + s'_{k−1} (x_{k−1} + y_{k−1})
64-bit Carry Lookahead Adder
Ling Adder [1981]

c_i = g_{i−1} + c_{i−1} p_{i−1} = g_{i−1} + c_{i−1} t_{i−1}
    = g_{i−1} + g_{i−2} t_{i−1} + g_{i−3} t_{i−2} t_{i−1} + g_{i−4} t_{i−3} t_{i−2} t_{i−1} + c_{i−4} t_{i−4} t_{i−3} t_{i−2} t_{i−1}

Ling's idea was to propagate h_i = c_i + c_{i−1} instead of c_i:

h_i = g_{i−1} + g_{i−2} + g_{i−3} t_{i−2} + g_{i−4} t_{i−3} t_{i−2} + h_{i−4} t_{i−4} t_{i−3} t_{i−2}

The carry chain is somewhat simpler; however, the sum equation is slightly more complex:

s_i = (t_i ⊕ h_{i+1}) + h_i g_i t_{i−1}
Parallel Prefix Computations

The "parallel prefix problem" is:
Given:
  1. inputs x_0, x_1, x_2, …, x_{k−1}, and
  2. an associative (but not necessarily commutative) operator +
Compute all prefixes:
  x_0,  x_0 + x_1,  x_0 + x_1 + x_2,  …,  x_0 + x_1 + ⋯ + x_{k−1}

Carry Computation is a Parallel Prefix Computation

Inputs:   (g_0, p_0), (g_1, p_1), (g_2, p_2), …, (g_{k−1}, p_{k−1})
Operator: ¢, with the more significant block on the left:
          (g′, p′) ¢ (g″, p″) = (g′ + g″ · p′, p′ · p″)

Compute:
  (g_[0,0], p_[0,0]) = (g_0, p_0)
  (g_[0,1], p_[0,1]) = (g_1, p_1) ¢ (g_0, p_0)
  ⋮
  (g_[0,k−1], p_[0,k−1]) = (g_{k−1}, p_{k−1}) ¢ ⋯ ¢ (g_1, p_1) ¢ (g_0, p_0)
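The ¢ operator can be exercised directly in software. A minimal Python sketch, assuming binary generate g_i = x_i y_i and propagate p_i = x_i ⊕ y_i; here `combine` (a made-up name) takes the less-significant block first:

```python
from functools import reduce

def combine(low, high):
    """The carry operator: merge a block's (g, p) with the adjacent higher block."""
    gl, pl = low
    gh, ph = high
    return (gh | (ph & gl), ph & pl)

def all_carries(x, y, k, c0=0):
    """Prefix-scan the per-bit (g, p) pairs to get every carry c_1 .. c_k."""
    gp = [(((x >> i) & (y >> i)) & 1, ((x >> i) ^ (y >> i)) & 1) for i in range(k)]
    carries = [c0]
    for i in range(k):
        g, p = reduce(combine, gp[:i + 1])   # (g_[0,i], p_[0,i])
        carries.append(g | (p & c0))
    return carries
```

A parallel prefix network computes the same reductions, but shares intermediate blocks instead of redoing each prefix from scratch.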
Combining (g, p) of Overlapping Blocks
(g, p) Networks • Any design for a parallel prefix problem can be adapted to a carry computation network. • Pairs of inputs can be combined in any way (re-associated according to the associative property) to compute block (g, p) signals. • (g, p) signals have additional flexibility: overlapping blocks can be combined.
Recursive Prefix Sum Network
Divide and Conquer I
Brent-Kung Parallel Prefix Network
Divide and Conquer II
Brent-Kung Parallel Prefix
Brent-Kung Parallel Prefix Network
Kogge-Stone Parallel Prefix Network
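A Kogge-Stone scan is short to express in code: at span d, every position i combines its running (g, p) with that of position i − d, so all prefixes finish in ⌈log2 k⌉ levels. Illustrative Python, not a hardware description:

```python
def kogge_stone_carries(g, p, c0=0):
    """All carries via a Kogge-Stone parallel prefix scan over (g, p) pairs."""
    k = len(g)
    G, P = list(g), list(p)
    d = 1
    while d < k:
        G2, P2 = G[:], P[:]
        for i in range(d, k):
            # combine position i with the block ending d positions lower
            G2[i] = G[i] | (P[i] & G[i - d])
            P2[i] = P[i] & P[i - d]
        G, P = G2, P2
        d *= 2
    # (G[i], P[i]) now spans the block [0, i]
    return [c0] + [G[i] | (P[i] & c0) for i in range(k)]
```

The minimum-depth property comes at the cost of k log2 k − k + 1 cells and dense wiring, as the comparison table below shows.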
Hybrid Brent-Kung / Kogge-Stone
Network Comparisons

Network              Max Delay         Cost                 Fan-Out
Divide & Conquer I   log2 k            (k/2) log2 k         High
Brent-Kung           2 log2 k − 1      2k − 2 − log2 k      Low
Kogge-Stone          log2 k            k log2 k − k + 1     Low
Hybrid B-K / K-S     log2 k + 1        (k/2) log2 k         Low

Cost is not a good estimate of silicon area for these networks; regularity and interconnect are large factors.
MCC on Am29050 Lookahead for 64-bit, radix-256 addition
Level 1 MCC (not shown on block diagram)
Level 2,3 MCC
Lecture 7 Variations in Fast Adders
Simple Carry-Skip Adders
Simplifying Assumptions
• One skip delay (cin to cout) equals one ripple delay (cin to cout).
• Total k-bit ripple delay is k × the single-bit delay.

These assumptions may not hold in real life (in a CMOS implementation, for example).

Worst Case Delay

b = fixed block width (ex: 4);  k = number of bits (ex: 16)

T_delay = (b − 1)  +   0.5    +  (k/b − 2)  +  (b − 1)
          [block 0]  [OR gate]    [skips]      [last block]

        ≈ 2b + k/b − 3.5 stages   (ex: 8.5)
What is the optimal block size?

1. Set dT/db = 0
2. Solve for b = b_opt:

b_opt = sqrt(k/2)
t_opt = number of blocks = k / b_opt = sqrt(2k)
T_opt = 2·sqrt(2k) − 3.5

Can we do better?
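The optimum formulas are easy to tabulate. A small Python sketch (delays in unit stages, as assumed above; the function name is illustrative):

```python
import math

def fixed_skip_params(k):
    """Optimal fixed-block carry-skip parameters for a k-bit adder:
    b_opt = sqrt(k/2), t_opt = sqrt(2k), T_opt = 2*sqrt(2k) - 3.5."""
    b_opt = math.sqrt(k / 2)
    t_opt = math.sqrt(2 * k)
    T_opt = 2 * math.sqrt(2 * k) - 3.5
    return b_opt, t_opt, T_opt
```

For k = 32 this gives b_opt = 4 bits per block, t_opt = 8 blocks, and T_opt = 12.5 stages.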
Path (1) is one delay longer than Path (2) → block t−2 can be one bit wider than block t−1.
Path (1) is one delay longer than Path (3) → block 1 can be one bit wider than block 0.
Variable Block-Width Carry-Skip Adders

Optimal block widths:  b, b+1, …, b + t/2 − 1, b + t/2 − 1, …, b+1, b

b + (b+1) + ⋯ + (b + t/2 − 1) + (b + t/2 − 1) + ⋯ + (b+1) + b = k
→ b = (k/t) − (t/4) + 1/2

Optimal number of blocks:

T_delay = 2(b − 1)      +   0.5    +   (t − 2)     =  2k/t + t/2 − 2.5
          [first + last]  [OR gate]   [skip stages]

1. Set dT/dt = 0
2. Solve for t = t_opt:

t_opt = 2·sqrt(k)
b_opt = 1/2 ≈ 1 at the end blocks (0 and t−1), growing to t_opt/2 = sqrt(k) in the middle
T_opt = 2·sqrt(k) − 2.5
Comparison

         Fixed-width Carry-Skip    Variable-width Carry-Skip
t_opt    sqrt(2k)                  2·sqrt(k)
b_opt    sqrt(k/2)                 1 … sqrt(k) … 1
T_opt    2·sqrt(2k) − 3.5          2·sqrt(k) − 2.5

Conclusion: the variable-width design is faster by a factor of about sqrt(2) ≈ 1.4.
Multilevel Carry-Skip Adders One-level carry-skip adder
Two-level carry-skip adder
Notice simplifications in diagramming conventions
Multilevel Carry-Skip Adders • Allow carry to skip over several level-1 skip blocks at once. • Level-2 propagate is AND of level-1 propagates. • Assumptions: – OR gate is no delay (insignificant delay) – Basic delay = Skip delay = Ripple delay = Propagate Computation = Sum Computation
Simplifying the Circuit It doesn’t save any time to skip short carry-chains (1-2 cells long)
optimized
Build the Widest Single-Level Carry-Skip Adder with 8 Delays

Block widths, limited by input timing on the low end and by output timing on the high end: 1, 3, 4, 4, 3, 2, 1

Width = 1 + 3 + 4 + 4 + 3 + 2 + 1 = 18 bits
Build the Widest Two-Level Carry-Skip Adder with 8 Delays

First, we need a new notation for a block:  {β, α}
  T_produce    ≤ β   (the block's carry-out is ready by time β)
  T_assimilate ≤ α   (the block can absorb its carry-in up to time α)
  block width γ = min(β − 1, α)
8-delay, 2-level, continued 1. Find {β,α} for level two
Initial Timing Constraint, Level 2
8-delay, 2-level, continued 2. Given {β,α} for level two, derive level one
Generalization • Chan et al. [1992] relax assumptions to include general worst-case delays: • I(b) Internal carry-propagate delay for the block • G(b) Carry-generate delay for the block • A(b) Carry-assimilate delay for the block
• Used dynamic programming to obtain optimal configuration
Carry-Select Adders
Carry-select: Carried one step further
Two-Level Carry-Select Adder
Can be pipelined
Compare to Two-Level G-P Adder
Cannot be pipelined as drawn
Conditional Sum Adder
• The process that led to the two-level carry-select adder can be continued . . .
• A logarithmic-time conditional-sum adder results if we proceed to the extreme:
  - single-bit adders at the top
• A conditional-sum adder is actually a (log2 k)-level carry-select adder
Cost and Delay of a Conditional-Sum Adder
More exact analysis gives actual cost =
Top-level block for one bit position of a conditional-sum adder C(1) and T(1) are the cost and time delay of this circuit.
Conditional-Sum Example
Hybrid Adder Designs
• Hybrids are obtained by combining elements of:
  - Ripple-carry adders
  - Carry-lookahead (generate-propagate) adders
  - Carry-skip adders
  - Carry-select adders
  - Conditional-sum adders
• You can obtain adders with – higher performance – greater cost-effectiveness – lower power consumption
Example 1: Carry-Select / Carry-Lookahead
• One- and two-level carry-select adders are essentially hybrids, since the top-level k/2- or k/4-bit adders can be of any type.
• Often combined with carry-lookahead adders.
Example 2 Carry-Lookahead/Carry-Select
Example 3 Multilevel Carry-Lookahead/Carry-Select
to Carry - Select Adders Can be pipelined
Example 4 Ripple-Carry/Carry-Lookahead
Simple and modular
Example 5: Carry-Lookahead / Conditional-Sum
• Reduces the fan-out required to control the muxes at the lower level (a drawback of wide conditional-sum adders).
• Use conditional-sum addition in smaller blocks, but form the inter-block carries using carry-lookahead.
Open Questions • Application requirements may shift the balance in favor of a particular hybrid design. • What combinations are useful for: – low power addition – addition on an FPGA
Optimizations in Fast Adders
• It is often possible to reduce the delay of adders (including hybrids) by optimizing block widths.
• The exact optimal configuration is highly technology dependent.
• Designs that minimize or regularize the interconnect may actually be more cost-effective than a design with low gate count.
Other Optimizations • Assumption: all inputs are available at time zero. • But, sometimes that is not true: – I/O arrive/depart serially, or – Different arrival times are associated with input digits, or – Different production times are associated with output digits.
• Example: Addition of partial products in a multiplier.
Lecture 8 Multioperand Addition
Uses of Multioperand Addition • Multiplication – partial products are formed and must be added
• Inner-product computation (Dot Product, Convolution, FIR filter, IIR filter, etc.) – terms must be added
“Dot Notation”
• Useful when the positioning or alignment of the bits, rather than their values, is important.
  - Each dot represents a digit in a positional number system.
  - Dots in the same column have the same positional weight.
  - The rightmost column is the least significant position.
Serial Multioperand Addition
Operands x(0), x(1), …, x(n−1) are shifted in, one per clock cycle.
The final sum can be as large as n(2^k − 1).
The partial-sum register must be ⌈log2(n·2^k − n + 1)⌉ ≈ k + log2 n bits wide.
Pipelined Serial Addition
Binary Adder Tree
Ripple-carry might deliver better times than carry-lookahead !?
Analysis of Ripple-Carry Tree Adder
Whereas, for carry-lookahead adders ….
Can we do better?
where kn is the total number of input bits.
The minimum is achievable with …. (next slide please)
Carry-Save Adders

Ripple-carry: reduces 2 numbers to their sum.
Carry-save:   reduces 3 numbers to 2 numbers with the same sum.
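On binary words the (3; 2) reduction is one level of XORs plus one level of majority gates. A minimal Python sketch:

```python
def csa(a, b, c):
    """(3; 2) carry-save step: three numbers -> (sum word, carry word)."""
    s = a ^ b ^ c                                # per-bit sum, no carry propagation
    carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, shifted one column left
    return s, carry
```

The two outputs can feed further CSA levels or a final carry-propagate adder; their sum always equals the sum of the three inputs.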
More “Dot Notation”
Carry-Save Adder Tree

A carry-save adder tree can reduce n binary numbers to two numbers having the same sum in O(log n) levels.

(The final conversion to a single result assumes a fast, logarithmic-time carry-propagate adder.)
Tabular Form Dot Notation Form
Adding seven 6-bit numbers
Seven Input Wallace Tree In general, an n-input Wallace tree reduces its k-bit inputs to two outputs.
Analysis of Wallace Trees
• The smallest height h(n) of an n-input Wallace tree satisfies the recurrence
    h(n) = 1 + h(⌈2n/3⌉),
  with solution h(n) ≈ log1.5(n/2).
• The number of inputs n(h) that can be reduced to two outputs by an h-level tree satisfies
    n(h) = ⌊3·n(h−1)/2⌋,  n(0) = 2,
  which is bounded by 2·(3/2)^(h−1) < n(h) ≤ 2·(3/2)^h.
Max number of inputs n(h) for an h-level tree
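The table of n(h) follows directly from the recurrence. Illustrative Python:

```python
def wallace_capacity(hmax):
    """n(h) for h = 0 .. hmax, from n(h) = floor(3*n(h-1)/2) with n(0) = 2."""
    n = [2]
    for _ in range(hmax):
        n.append(3 * n[-1] // 2)
    return n
```

The first entries are 2, 3, 4, 6, 9, 13, 19, …; for example, seven operands need h = 4 levels, since n(3) = 6 < 7 ≤ n(4) = 9.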
Wallace Tree
• Reduce the number of operands at the earliest opportunity.
• If there are m dots in a column, apply ⌊m/3⌋ full adders to that column.
• Tends to minimize overall delay by making the final CPA as short as possible.
Dadda Trees • Reduce the number of operands in the tree to the next lower n(h) number in the table using the fewest FA’s and HA’s possible. • Reduces the hardware cost without increasing the number of levels in the tree.
Dadda Tree for 7-input 6-bit addition
Taking advantage of the carry-in of the final CPA stage
Parallel Counters
• Receives n inputs.
• Counts the number of 1s among the n inputs.
• Outputs a ⌈log2(n + 1)⌉-bit number.
• Reduces n dots in the same bit position to ⌈log2(n + 1)⌉ dots in different positions.
Parallel Counters

  • •           • • •
  ---            ---
  • •            • •
(2, 2) counter  (3, 2) counter
    = HA            = FA

  • • • • • • •       • • • • • • • • • •
  -------------       -------------------
      • • •                • • • •
 (7, 3) counter        (10, 4) counter
Generalized Parallel Counters
• Reduce "dot patterns" (not necessarily all in the same column) to other dot patterns (not necessarily only one dot in each column).
• The book speaks less generally, restricting the output to one dot per column.

4 Examples
  (4, 4; 4) counter - two columns of 4 dots each reduce to a 4-bit output
  (5, 5; 4) counter
  (4, 6; 4) counter
  A 4-bit binary full adder, with carry-in, is a (2, 2, 2, 3; 5) counter.
Reducing 5 Numbers with (5, 5 ; 4) Counters
(n; 2) Counters • Difference in notation from other counters. • Reduce n (larger than 3) numbers to two numbers. • Each slice i of an (n; 2) counter: – receives carry bits from one or more positions to the right (i-1, i-2, ….) – produces outputs to positions i and i+1 – produces carries to one or more positions to the left (i+1, i+2, ….)
(n; 2) Counters, Slice by Slice

One slice receives n new dots plus ψ1, ψ2, ψ3 carries from the slices 1, 2, and 3 positions to the right, and produces 2 output dots plus ψ1, ψ2, ψ3 carries to the left. For the slice to absorb all its inputs:

    n + ψ1 + ψ2 + ψ3 ≤ 3 + 2ψ1 + 4ψ2 + 8ψ3
Adding Multiple Signed Numbers • By means of sign extension
• By method of negative weighted sign bits
Lecture 9 Basic Multiplication Schemes
Note on Notation
Right Shift Method
Left Shift Method
Right Shift Algorithm
After k iterations:
Left Shift Algorithm
After k iterations:
Right Shift Example
After k iterations:
Left Shift Example
After k iterations:
Programmed Multiplication • 6 to 7 instructions executed per loop plus overhead • >200 instructions for a 32-bit multiply • Specialized microcode would be smaller
Basic Hardware Multipliers
Right-Shift Multiplier
Combined multiplier / product register
Basic Hardware Multipliers
Left-Shift Multiplier
Disadvantages:
• Multiplier and product cannot share a register
• Adder is twice as wide as in the right-shift multiplier
• Sign extension of the partial product is more difficult
Multiplication of Signed Numbers • Sign extend terms xja and pj when doing additions. • xk-1a term is subtracted instead of added (weight of xk-1 is negative) • In right-shift adders, sign-extension happens incrementally
Signed Multiplication Using Booth's Recoding
• Use Booth's recoding to represent the multiplier x in signed-digit format.
• Booth's recoding was first proposed for speeding up radix-2 multiplication in early digital computers.
• Used when shifting alone is faster than addition followed by shifting.
Booth's Recoding
• Booth observed that whenever there is a long run of consecutive ones, the corresponding additions can be replaced by a single addition and a subtraction.
• The longer the sequence of ones, the greater the savings
Booth’s Recoding A Digit Set Conversion • The effect of this translation is to change a binary number with digit set [0,1] to a binary signed-digit number with digit set [-1,1].
Ignore the extra bit if x is a two’s complement negative number
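The recoding itself is a pairwise scan of the multiplier. A Python sketch (helper names are made up; `sd_value` just evaluates a signed-digit string):

```python
def booth_recode(x, k):
    """Radix-2 Booth recoding of a k-bit two's-complement x into digits {-1, 0, 1}.
    Digit i is x_{i-1} - x_i with x_{-1} = 0, so a run of 1s becomes +1 ... 0 ... -1."""
    digits = []
    prev = 0
    for i in range(k):
        xi = (x >> i) & 1
        digits.append(prev - xi)
        prev = xi
    return digits          # the extra (k+1)st digit is dropped, per the slide

def sd_value(digits):
    """Value of a signed-digit number, LSB first."""
    return sum(d << i for i, d in enumerate(digits))
```

For example, 0111 (= 7) recodes to the digits (−1, 0, 0, 1), i.e. 8 − 1, replacing three additions by one addition and one subtraction.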
Multiplication by Constants • In programs, a large percentage of multiplications are by a constant known at compile time. – Like in address computation
• Custom instruction sequences are often faster than calling a general purpose multiply routine.
Multiply R1 by 113 = (111001)two
Using Subtraction
Using Factoring Multiply R1 by 119
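For illustration, the three recipes (shift-and-add from the binary expansion, subtraction to collapse a run of 1s, and factoring) can be written out. The function names are hypothetical and `r1` is modeled as a plain integer, not a machine register:

```python
def times_113(r1):
    """113 = (1110001)two: plain shift-and-add over the 1 bits."""
    r2 = (r1 << 1) + r1        # 3 * r1
    r2 = (r2 << 1) + r1        # 7 * r1   (the run 111)
    return (r2 << 4) + r1      # 112 * r1 + r1 = 113 * r1

def times_113_sub(r1):
    """The run 111 0001 collapses: 113 = 2^7 - 2^4 + 1, one subtract instead."""
    return (r1 << 7) - (r1 << 4) + r1

def times_119(r1):
    """Factoring: 119 = 7 * 17 = (8 - 1)(16 + 1)."""
    r2 = (r1 << 3) - r1        # 7 * r1
    return (r2 << 4) + r2      # 17 * (7 * r1) = 119 * r1
```

Each variant trades instruction count against the latency of the dependent shift/add chain.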
Speeding Up Multipliers • Reduce the number of operands being added. – Leads to high radix multipliers – Several bits of the multiplier are multiplied by the multiplicand in each cycle
• Speed up addition – Multioperand addition – Leads to tree and array multipliers
Lecture 10 High-Radix Multipliers
Radix-r Algorithms
r = 2^s means shifting s bits per iteration
Radix-4 (2-bits at a time) Multiplication
Need multiples 0a, 1a, 2a, 3a. 3a can be precomputed and stored.
Radix-4 (2-bits at a time) Multiplication
3a may also be computed by subtracting a and forcing a carry (4a). 4a may be computed by adding 0 and forcing a carry. An extra cycle may be required at the end.
Radix-4 Booth’s Recoding • Converts from digit set [0,3] to [-2,2]
Radix-4 Booth’s Recoding • Again, ignoring the upper bit gives the correct signed interpretation of the number. • Radix-4 conversion entails no carry propagation. • Each digit is obtained independently by examining three bits of the multiplier. • Overlapped 3-bit scanning.
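The overlapped 3-bit scan can be written down directly: digit j is x_{2j−1} + x_{2j} − 2·x_{2j+1}, with x_{−1} = 0. Illustrative Python for k-bit two's-complement multipliers (k even):

```python
def booth4_recode(x, k):
    """Radix-4 Booth recoding: each digit in [-2, 2] comes independently
    from the overlapped 3-bit group (x_{2j+1}, x_{2j}, x_{2j-1})."""
    assert k % 2 == 0
    digits = []
    prev = 0                            # x_{2j-1} from the previous group
    for j in range(k // 2):
        b0 = (x >> (2 * j)) & 1
        b1 = (x >> (2 * j + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits                       # value = sum of digit_j * 4^j
```

Because no carries propagate between groups, all k/2 digits can be produced in parallel, which is what makes this recoding attractive for parallel multipliers.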
Addend Generation for Radix-4
Using Carry-Save Adders
Use CSA to compute the 3a term.
Using CSA to Reduce Addition Time
Using CSA to Do Both
Using CSA with Radix-4 Booth
Booth Recoding for Parallel Multiplication
Higher Radix Multipliers Radix -16
How Far Should You Go? Adventures in Space-Time Mapping
Twin-beat Multiplier to Multiply the Speed
Lecture 11 Tree and Array Multipliers
Full-Tree Multipliers
Full-Tree Multipliers
1. All multiples of the multiplicand are produced in parallel.
2. A k-input CSA tree is used to reduce them to two operands.
3. A CPA is used to reduce those two to the product.
• No feedback → pipelining is feasible.
• Different tree multipliers are distinguished by the designs of the above three elements.
General Structure of a Full-tree Multiplier Binary, High-radix, or Recoded
Radix Tradeoffs • The higher the radix …. – The more complex the multiple-forming circuits, and – The less complex the reduction tree
• Where is the optimal cost-effectiveness? – Depends on design – Depends on technology
Tradeoffs in CSA Trees • Wallace Tree Multiplier – combine partial product bits at the earliest opportunity – leads to fastest possible design
• Dadda Tree Multiplier – combine as late as possible, while keeping the critical path length (# levels) of the tree minimal – leads to simpler CSA tree structure, but wider CPA at the end
• Hybrids - somewhere in between
Two Binary 4 × 4 Tree Multipliers
Reduction Tree • Results from chapter 8 apply to the design of partial product reduction trees. – General CSA Trees – Generalized Parallel Counters
CSA for 7 × 7 Tree Multiplier
[Dot-notation diagram: the 49 partial-product bits of the 7 × 7 multiplication are reduced by successive CSA levels down to two operands.]
Structure of a CSA Tree
• Logarithmic-depth reduction trees based on CSAs (e.g., Wallace, Dadda, etc.)
  - have an irregular structure
  - make design and layout difficult
• Connections and signal paths of various lengths – lead to logic hazards and signal skew – implication for both performance and power consumption
Alternative Reduction Trees

(n; 2) counters are more suitable to VLSI: slices of a (7; 2) counter can reduce a 7 × 7 multiplication.

This is a regular circuit; however, many of its inputs are zeros.
A Balanced (11;2) Counter Balanced: All outputs are produced after the same number of delays. All carries produced at level i enter FAs at level i+1 Can be laid out to occupy a narrow vertical slice. Can be easily expanded to a (18;2) counter
Tree Multiplier Based on (4,2) Counters
(4,2) Counters make Binary Reduction Trees
Layout of Binary Reduction Tree
Sign Extension
Signed Addition
Baugh-Wooley Arrays
Partial Tree Multipliers
Basic Array Multiplier Using a One-Sided CSA Tree
Unsigned Array Multiplier
Unsigned Array Multiplier
Baugh-Wooley Signed Multiply
Signed Multiplier Array
Mixed Positive and Negative Binary

+ digit set = {0, 1};  − digit set = {−1, 0}

Encoding: a + digit uses bit b to represent the value b; a − digit uses bit b to represent b − 1 (excess-1 encoding), so bit 0 encodes −1 and bit 1 encodes 0.

Digit Encodings

A two's-complement number is the special case with position weights −2^(k−1), +2^(k−2), …, +2, +1 (a single − digit at the top); a mixed +/− digit-set number allows + and − digits in any positions.
Mixed Binary Full Adders

[Figure: full-adder cell variants, one for each combination of +/− types on the x, y, and carry-in inputs; each variant produces a sum and a carry-out of the appropriate +/− types.]
5 × 5 Signed Multiplier Using Mixed +/− Numbers

[Dot diagram: the partial-product array with positively and negatively weighted positions marked + and −; the top row and left column carry the − digits arising from the sign bits.]
Include AND Gates in Cells
Change the Terms of the Problem
Multiplier Without Final CPA
Conditional Carry-Save
Pipelined Partial-Tree Multiplier
Pipelined Array Multiplier
Lecture 12 Variations in Multipliers
Divide and Conquer You have a b × b multiplier. (Could be a lookup table.) You want a 2b × 2b multiplier.
Divide & Divide & Conquer
You want a 2b × 2b multiplier ( 4 b × b multiplies, 3 addends).
You want a 3b × 3b multiplier ( 9 b × b multiplies, 5 addends).
You want a 4b × 4b multiplier (16 b × b multiplies, 7 addends).
An 8 × 8 Multiplier using 4× 4 Multipliers
Additive Multiply Modules To synthesize large multipliers from smaller ones requires both multiplier and adder units. If we combine multiplication and addition into one unit, then we can use it to implement large multiplies.
An 8 × 8 Multiplier using 8 (4 × 2)-AMMs
A slower, but more regular 8 × 8 Multiplier
Bit-Serial Multipliers
• Smaller pin count
• Lower area in VLSI
• Run at high clock rates
• Can be pipelined
• Can be run in systolic arrays
Semi-Systolic #1: A Parallel, X Serial (LSB First)

[Array diagram: multiplicand bits a3, a2, a1, a0 held in the cells; multiplier bits x0, x1, x2, x3 followed by four 0s shifted in serially.]

Semi-systolic Design #1 requires 4 zeros to be shifted in after the x's.
Modified Design #1 Allows a new problem to be started 4 cycles earlier.
One Way to Look at Retiming
Semi-Systolic #2: A Parallel, X Serial (LSB First)

[Array diagram: same interface as Design #1 (a3 … a0 held in place; x0 … x3 then 0s shifted in), with the retimed cell arrangement.]

Semi-systolic Design #2
Systolic Design: A Parallel, X Serial (LSB First)

[Array diagram: the fully systolic version of the same multiplier; no signal is broadcast, every cell-to-cell connection is registered.]

Systolic Design
Both Inputs Serial

Let  a^(i) = 2^i a_i + a^(i−1),  a^(0) = a_0
     x^(i) = 2^i x_i + x^(i−1),  x^(0) = x_0

m^(i) = a^(i) x^(i) = (2^i a_i + a^(i−1)) (2^i x_i + x^(i−1))
      = 2^(2i) a_i x_i + 2^i (a_i x^(i−1) + x_i a^(i−1)) + a^(i−1) x^(i−1)
      = 2^(2i) a_i x_i + 2^i (a_i x^(i−1) + x_i a^(i−1)) + m^(i−1)

With p^(i) = 2^−(i+1) m^(i), this becomes the recurrence

    2 p^(i) = 2^i a_i x_i + a_i x^(i−1) + x_i a^(i−1) + p^(i−1)
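The recurrence can be checked by accumulating m^(i) with plain integers. A Python sketch for unsigned operands (names illustrative):

```python
def serial_multiply(a, x, k):
    """Bit-serial multiply, both operands LSB first (unsigned), via
    m(i) = 2^(2i) a_i x_i + 2^i (a_i x(i-1) + x_i a(i-1)) + m(i-1)."""
    a_lo = x_lo = m = 0                # a(i-1), x(i-1), m(i-1)
    for i in range(k):
        ai = (a >> i) & 1
        xi = (x >> i) & 1
        m += (ai & xi) << (2 * i)                  # new square term
        m += (ai * x_lo + xi * a_lo) << i          # cross terms with the prefixes
        a_lo |= ai << i                            # a(i) = 2^i a_i + a(i-1)
        x_lo |= xi << i
    return m
```

After k cycles, m equals a × x; hardware keeps the same state, but only exchanges low-order bits each cycle.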
Bit-Serial Multiplier in Dot Notation
Bit-Serial Multiplier
(Error in book)
Modular Multipliers
• Produce (a × b) mod m
• Good for multiplication in residue number systems
• Two special cases are easier to handle:
  - m = 2^b
  - m = 2^b − 1
• Modulo adders give rise to modulo multipliers.
• Modulo adders give rise to modulo multipliers.
Modulo-(2^b − 1) Carry-Save Adder
Design of a modulo-15 adder
16 mod 15 = 1 32 mod 15 = 2 64 mod 15 = 4
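End-around carry implements addition modulo 2^b − 1, as the mod-15 powers above suggest (16 ≡ 1, 32 ≡ 2, …). A minimal Python sketch, assuming both inputs are b-bit words and adopting the common convention that the all-ones word stands for zero:

```python
def add_mod_2b_minus_1(x, y, b):
    """Modulo-(2^b - 1) addition: the carry-out is fed back into the LSB."""
    mask = (1 << b) - 1
    s = x + y
    s = (s & mask) + (s >> b)       # end-around carry
    return 0 if s == mask else s    # all-ones word represents zero
```

Since 2^b ≡ 1 (mod 2^b − 1), wrapping the carry around preserves the residue without a true modulo operation.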
Design of a modulo-13 adder
16 mod 13 = 3 32 mod 13 = 6 64 mod 13 = 12
The number of dots is increased.
Remove one dot from col. 1 and replace it with two dots in col. 0 to balance the load.
General Method for Modular Addition
Bits emerging from the left (with weight exceeding m's range) are reduced modulo m and added back into the result.
The Special Case of Squaring • Any standard k × k multiplier may be used for computing p = x2 • However, a special purpose k-bit squarer requires significantly less hardware and is faster.
Design of a 5-bit Squarer
Reducing the Final CPA

Using the identity:

    x1 x0 + x1 = 2 x1 x0 + x1 − x1 x0 = 2 x1 x0 + x1 (1 − x0) = 2 x1 x0 + x1 x0'

two dots in one column become one dot there and one dot in the next column. This postpones an addition, shortening the final CPA.
A Multiplier Based on Squares

    a · x = [ (a + x)² − (a − x)² ] / 4

• The square function can be done with a lookup table of size 2^k × (2k − 2)
  - small compared to a multiply lookup table
Incorporating Wide Additive Inputs to a Multiplier
Lecture 13 Basic Division Schemes
Note on Notation
Sequential Division Algorithm

s^(j) = 2 s^(j−1) − q_{k−j} (2^k d),   with s^(0) = z and s^(k) = 2^k s

Quotient-bit selection:
  if 2 s^(j−1) − 2^k d < 0 then q_{k−j} = 0
  else                          q_{k−j} = 1

After k iterations:  z = (d × q) + s
Overflow
• The quotient of a 2k-bit number divided by a k-bit number may have more than k bits.
• An overflow check is needed:
  - the high-order k bits of z must be strictly less than d
  - this also checks for the divide-by-zero condition
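The algorithm and its precondition can be sketched as an integer model (not hardware; names illustrative):

```python
def restoring_divide(z, d, k):
    """Restoring division of a 2k-bit z by a k-bit d.
    Precondition (no overflow): z >> k < d, which also rules out d == 0."""
    assert (z >> k) < d
    s, q = z, 0                        # partial remainder starts as the dividend
    for _ in range(k):
        s = 2 * s                      # shift the partial remainder left
        if s - (d << k) >= 0:          # trial subtraction of 2^k * d
            s -= d << k
            q = (q << 1) | 1           # quotient bit 1
        else:
            q = q << 1                 # quotient bit 0 ("restore" = keep old s)
    return q, s >> k                   # quotient and remainder
```

In hardware the "restore" is either an extra add-back cycle or, as in the diagram, a multiplexer that simply declines to commit the trial difference.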
Radix-r division (r > 2) • Basically the same division algorithm, however: • Selection of quotient digit qk-j more difficult • Computation of term qk-jd more difficult
Restoring Division Hardware
Signed Division
Magnitudes of q and s are unaffected by the input signs. Output signs are only a function of input signs.
Non-Restoring Division

                     Restoring                Non-restoring
Partial remainder:   u                        u
Subtract 2^k d:      u − 2^k d < 0            u − 2^k d < 0
Restore?             yes → u                  no → keep u − 2^k d
Double:              2u                       2(u − 2^k d)
Next step:           subtract 2^k d:          add 2^k d instead:
                     2u − 2^k d               2(u − 2^k d) + 2^k d = 2u − 2^k d
Non-Restoring Signed Division
• Each cycle, compute:  s^(j) = 2 s^(j−1) − q_{k−j} (2^k d)
• Each cycle, either:
  - if sign(s^(j−1)) = sign(d):  q_{k−j} = 1, subtract 2^k d
  - if sign(s^(j−1)) ≠ sign(d):  q_{k−j} = −1, add 2^k d
• The digits q_{k−j} are either −1 or 1.
Two Problems at the End • The quotient with digits 1 and −1 must be converted to standard binary • A final correction step: If sign(s) != sign(z) – add ±d to the remainder, and – subtract ±1 from the quotient
Conversion of {−1, 1} to 2's-Complement
• Replace all −1 digits with 0s, leaving the 1s (giving bits p_i; note q_i = 2p_i − 1).
• Complement the most significant bit.
• Shift left one position, inserting 1 as the LSB.

q1 q0     p1 p0     p1' p0    result
−1 −1     0  0      1  0      101
 1  1     1  1      0  1      011
−1  1     0  1      1  1      111
 1 −1     1  0      0  0      001
Proof of Method
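A sketch of the whole pipeline: the non-restoring iteration, the digit conversion via the identity q_i = 2p_i − 1, and the final correction step. Unsigned integer model, names illustrative:

```python
def nonrestoring_divide(z, d, k):
    """Non-restoring division (requires z >> k < d). Quotient digits are in
    {-1, 1}; p records the positions of the 1 digits."""
    assert (z >> k) < d
    s, p = z, 0
    for _ in range(k):
        if s < 0:                      # sign decides add vs. subtract
            s = 2 * s + (d << k)       # digit -1
            p = p << 1
        else:
            s = 2 * s - (d << k)       # digit +1
            p = (p << 1) | 1
    q = 2 * p + 1 - (1 << k)           # convert {-1,1} digits via q_i = 2 p_i - 1
    if s < 0:                          # final correction step
        s += d << k
        q -= 1
    return q, s >> k
```

The result matches restoring division; only the handling of negative partial remainders differs.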
Partial Remainders
Non-Restoring Division Hardware
Lecture 14 High-Radix Dividers
Basics of High-Radix Division

z = (d × q) + s,   sign(s) = sign(z)

s^(j) = r s^(j−1) − q_{k−j} (r^k d),   with s^(0) = z and s^(k) = r^k s
Radix-4 Division in Dot Notation
Interesting dividers have radix r = 2^b, which reduces the number of cycles by a factor of b.
Difficulty of High-Radix Division
• Guessing the correct quotient digit is more difficult.
• Division is naturally a sequential process:
  - guess a quotient digit q_{k−j}
  - compute the term q_{k−j} (r^k d)
  - compute the partial remainder s^(j) = r s^(j−1) − q_{k−j} (r^k d)
Carry-Save Remainders
• More important for speed than high radix.
• Lead to large performance increases by replacing the carry-propagate adder with a carry-save adder.
• The key to keeping the remainder in carry-save form is redundancy in the representation of q:
  - allows less precise guessing of the quotient digit, based on the approximate magnitude of the partial remainder
  - more redundancy → less precision required
Review of Non-Restoring Division (fractional operands)

s^(j) = 2 s^(j−1) − d   when s^(j−1) ≥ 0   (q_{−j} = 1)
s^(j) = 2 s^(j−1) + d   when s^(j−1) < 0   (q_{−j} = −1)
Using q_{−j} in {−1, 0, 1}

s^(j) = 2 s^(j−1) + d   (q_{−j} = −1)
s^(j) = 2 s^(j−1)       (q_{−j} = 0: just shift)
s^(j) = 2 s^(j−1) − d   (q_{−j} = 1)
A Big Problem
Q: How can you tell if the shifted partial remainder is in [−d, d)?
A: You have to perform trial subtractions.

Q: Can you avoid trial subtractions?
A: Sweeney, Robertson, and Tocher: SRT division.
Radix-2 SRT Division
• Assume d ≥ 1/2 (normalized).
• Restrict the partial remainder to the constant range [−1/2, 1/2) instead of [−d, d):
  - may require shifting the dividend (= initial partial remainder) so that −1/2 ≤ s^(0) < 1/2.
• Once in the proper range, subsequent partial remainders will stay in the range….
Radix-2 SRT Division
Comparison with 1/2 and -1/2 is easy.
Simplified Digit Selection

Only the two bits 2s^(j−1) = u0 . u−1 need to be examined:

u0  u−1    range of 2s^(j−1)    q_{−j}
1   1      [−1/2, 0)            0
1   0      [−1, −1/2)           −1
0   1      [1/2, 1)             1
0   0      [0, 1/2)             0
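The radix-2 SRT loop can be modeled exactly with rationals; note that digit selection needs only comparisons with ±1/2, never a trial subtraction against d. Illustrative Python (operands passed as `Fraction`s):

```python
from fractions import Fraction

def srt2_divide(z, d, k):
    """Radix-2 SRT with digit set {-1, 0, 1}: d in [1/2, 1), z in [-1/2, 1/2).
    Invariant: z == q * d + s * 2^-j after iteration j."""
    s, q = Fraction(z), Fraction(0)
    for j in range(1, k + 1):
        s = 2 * s
        if s >= Fraction(1, 2):
            s -= d                      # digit 1
            q += Fraction(1, 2 ** j)
        elif s < Fraction(-1, 2):
            s += d                      # digit -1
            q -= Fraction(1, 2 ** j)
        # otherwise digit 0: just shift
    return q, s / 2 ** k                # z == q*d + remainder exactly
```

Because d ≥ 1/2, each step keeps the remainder in [−1/2, 1/2), which is what makes the constant comparison thresholds valid.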
Final Steps
• The {−1, 1} quotient-conversion algorithm will not work to convert a {−1, 0, 1} quotient to two's complement. Instead:
  - use the on-the-fly algorithm of Ercegovac [1987], or
  - subtract the negative digits from the positive digits.
• Still requires a final correction step to make remainder positive.
Using Carry-Save Adders
Carry-Save Partial Remainders • Two numbers sum to the actual partial remainder • To perform exact comparison, a full CPA would be required • Overlaps in the selection regions allow us to perform approximate comparisons without risk of choosing a wrong digit.
Carry-Save Partial Remainders

2 s^(j−1) = u + v
u = (u1 u0 . u−1 u−2 …) in two's complement
v = (v1 v0 . v−1 v−2 …) in two's complement

Let t = t1 t0 . t−1 t−2 = (u1 u0 . u−1 u−2) + (v1 v0 . v−1 v−2),
the sum of u and v truncated to two fractional bits each.

t is an approximation of u + v; the truncation error is less than 1/4 + 1/4 = 1/2:

    0 ≤ (u + v) − t < 1/2
Tolerating Truncation Error
Truncation Error
Digit Selection

t1 t0 . t−1 t−2    interval           q_{−j}
01.11              [1.75, 2.0)        1
01.10              [1.5, 1.75)        1
01.01              [1.25, 1.5)        1
01.00              [1.0, 1.25)        1
00.11              [0.75, 1.0)        1
00.10              [0.5, 0.75)        1
00.01              [0.25, 0.5)        1
00.00              [0.0, 0.25)        1
11.11              [−0.25, 0.0)       0
11.10              [−0.5, −0.25)      0
11.01              [−0.75, −0.5)      −1
11.00              [−1.0, −0.75)      −1
10.11              [−1.25, −1.0)      −1
10.10              [−1.5, −1.25)      −1
10.01              [−1.75, −1.5)      −1
10.00              [−2.0, −1.75)      −1
Radix-2 Divider with CSA
Select Logic
• Fast 4-bit CPA plus decode logic, or
• 256 × 2 lookup table, or
• 8-input, 2-output PLA (inputs: 4 bits each of u and v)
CLA with SRT Division?
What happens to overlap regions as d → 1 ?
Choosing Quotient Digits Using a p-d Plot

s^(j) = 2 s^(j−1) − q_{−j} d,   plotted with p = 2 s^(j−1) on the vertical axis and d on the horizontal axis.

Horizontal decision lines: the value of d does not affect the choice.
Putting Both Charts Together

s^(j) = 2 s^(j−1) − q_{−j} d
Radix-4 SRT Division
• Radix r = 2^b, b > 1.
• Partial remainder kept in stored-carry form.
• Requires a redundant digit set.
• Example: radix 4, digit set [−3, 3].

New vs. Shifted Old Partial Remainder (radix = 4, digit set [−3, 3])
p-d Plot for Radix-4, [−3, 3 ] SRT Division
Only one quadrant shown
Radix-4 Digit Set [ −2, 2 ] • Avoids having to compute 3d as in digit set [−3, 3] • Fewer comparisons (fewer selection regions) • Less redundancy means less overlap in selection regions • Partial remainder must be restricted to ensure convergence
Restricting the Range of s

−h d ≤ s^(j−1) < h d,  for some h < 1
−4h d ≤ 4 s^(j−1) < 4h d
−4h d + 2d ≤ 4 s^(j−1) − q_{−j} d < 4h d − 2d       (extreme digits q_{−j} = ±2)
             └──────────── s^(j) ───────────┘

Requiring −h d ≤ s^(j) < h d gives  h d = 4h d − 2d,  so  h = 2/3.
p-d Plot for Radix-4, [−2, 2 ] SRT Division
Observations • Restricting digit set to [−2, 2 ] results in less overlap in selection regions • Must examine p and d in greater detail to correctly choose the quotient digit. • Staircase boundaries: 4 bits of p and 4 bits of d are required to make the selection.
Block Diagram 4 bits
2d, d, 0, −d, −2d
Intel's Pentium Division Bug
• Intel used the radix-4 SRT division algorithm.
• Quotient selection was implemented as a PLA.
• The p-d plot was numerically generated.
• The script that downloaded the entries into the PLA inadvertently removed a few table entries.
• When hit, these missing entries produced digit 0 instead of the intended ±2.
• These entries are consulted very rarely, so the bug was very subtle and difficult to detect.
General High-Radix Dividers • Radix-8 is possible. – Minimal quotient digit set [−4, 4] – Partial remainder restricted to [-4d/7, 4d/7) – Requires a 3d multiple
• Digit sets with greater redundancy (such as [−7, 7] ) lead to: – wider overlap regions – more comparisons but simpler digit selection – more difficult multiples (±5, ±7)
Lecture 15 Variations in Dividers
Robertson Diagram with Radix r, Digit Set [−α, α]

Digit set           Partial remainder    Shifted partial remainder
[−(r−1), r−1]       [−d, d)              [−rd, rd)
[−α, α]             [−hd, hd)            [−rhd, rhd)

where h = α / (r − 1).
Range of h

                        α        h
Maximal redundancy    r − 1      1
Minimal redundancy   ⌈r/2⌉    (1/2)+

with h = α / (r − 1)
Derivation of h

Bound on s^(j−1):   −hd ≤ s^(j−1) < hd, for some h < 1
                    −rhd ≤ r s^(j−1) < rhd

With s^(j) = r s^(j−1) − q_{−j} d and digits q_{−j} in [−α, α], requiring
−hd ≤ s^(j) < hd at the extreme digits q_{−j} = ±α gives

hd = rhd − αd   ⇒   h = α / (r − 1)
2
p-d Plot with Overlap Region Uncertainty Rectangle (because of truncation)
A: 4 bits of p, 3 bits of d OK
B: 3 bits of p, 4 bits of d Ambiguous
Choosing the Selection Boundaries 1. Tile with the largest admissible rectangles 2. Verify that no tile intersects both boundaries.
3. Associate a quotient digit with each tile.
3
Tiles = Uncertainty Rectangles The larger the tiles, the fewer bits need to be inspected. If p is in carry-save form (u+v), then to get j bits of accuracy for p, we need to inspect j+1 bits of u and v.
[Figure annotations: truncation error of p and truncation error of d define the uncertainty rectangle.]
Determining Tile Sizes • Goal: Find the coarsest possible grid such that the staircase boundaries are entirely contained in the overlap areas. • There is no closed form for the number of bits required, given the parameters r and α. • However, we can derive lower bounds on the number of bits required.
4
Finding Lower Bounds on Number of Bits • An upper bound on the tile dimensions determines a lower bound on the number of bits needed. • The narrowest overlap area lies between the two largest digits, α and α−1, at d_min • Find the minimum horizontal and vertical dimensions of the overlap area in that narrowest region
Establishing Upper Bound on Uncertainty

Δd = d_min (2h − 1) / (α − h)
Δp = d_min (2h − 1)

(missing symbols are in Fig. 15.4)

bits of p = ⌈−log₂ Δp⌉
bits of d = ⌈−log₂ Δd⌉
5
Automating the Process • Determining the bound on the number of bits required and generating the contents of the digit selection PLA can be easily automated. • However, the Intel Pentium bug teaches us an important lesson.
The Asymmetry of the Quotient Digit Selection Process p can also go negative. The second quadrant is not a simple negation of the first, because truncation affects negative values asymmetrically. Separate table entries must be derived for the other quadrants.
6
Large Uncertainty Rectangles Only one of the large uncertainty rectangles is not totally in the overlap region. Break it into smaller rectangles. One extra bit of both p and d is needed for this case.
Division With Prescaling • The overlap regions are widest toward the high end of the divisor d range. • If we can restrict d to be large, then the selection of the quotient digits may become simpler (require fewer bits of p and d, possibly made independent of d altogether) • Instead of computing z/d, compute zm/dm • This is called prescaling.
7
Prescaling • Multiply both dividend and divisor by a constant m before beginning division. • For multiplier, use existing hardware in divider for multiplying divisor and quotient digits. • Speedup in selection logic must be weighed against extra multiplication steps at the beginning. • Table lookup to determine scaling factor.
Modular Dividers and Reducers • Remainder in modular division is always positive. (Requires a different correction step.) • Modular reduction (computing positive remainder) is faster and needs less work than a full blown division.
8
Restoring Array Divider What is the worst case path?
Nonrestoring Array Divider What is the worst case path?
9
Comparison to Array Multipliers • Similarity between array dividers and array multipliers is deceiving. • Array multipliers have O(k) delay. • Array dividers have O(k2) delay. • Both can be pipelined to increase throughput.
Multiplier and Divider Comparison
10
Combined Multiply/Divide Unit
Array Multiplier and Divider Comparison
11
I/O of a Universal Array Multiplier/Divider
12
Chapter 16 Division by Convergence
General Convergence Methods • Mutually recursive equations – one sequence converges to a constant (residual) – one sequence converges to the desired function
• The complexity depends on – ease of evaluating f and g – the number of iterations required to converge
1
Recurrence Equations for Division
An invariant is a predicate that is true at each step of the iteration. It is useful in debugging and leads to a formula for the computation.
Invariant:
Another Recurrence for Division Scaled Residual:
Invariant : s ( j ) = r j (z − q ( j ) r − j d ) s( j)r − j = z − q( j)r − j d
2
Variations • The many division schemes of Chapters 13–15 correspond to: – variations in the radix r – variations in the bound on the scaled residual – variations in the quotient selection rule
• This chapter explores schemes that require far fewer iterations ( O(log k) instead of O(k) )
Division by Repeated Multiplications
3
Three Questions
Recurrence Equations
x ( i ) is an approximation to 1/d ( i )
4
Substitute 2−d(i) for x(i)
How fast does it converge?
⇒ Called Quadratic convergence
Quadratic Convergence

1 − d^(0) ≤ 2^−1
1 − d^(1) ≤ 2^−2
1 − d^(2) ≤ 2^−4
1 − d^(3) ≤ 2^−8
⋮
1 − d^(m) ≤ 2^(−2^m) = 2^−k,   where m = ⌈log₂ k⌉
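The doubling of matched bits can be observed directly. This sketch runs the repeated-multiplication (scale-by-(2 − d)) recurrence on a normalized divisor; after a handful of iterations d is indistinguishable from 1 in double precision and z holds the quotient:

```python
# Sketch of division by repeated multiplication: scale both z and d by
# x(i) = 2 - d(i) each step; d converges quadratically to 1, z to z/d.

def repeated_mult_divide(z, d, iters=6):
    assert 0.5 <= d < 1.0        # normalized divisor
    for _ in range(iters):
        x = 2.0 - d              # x(i) approximates 1/d(i)
        z, d = z * x, d * x
    return z

q = repeated_mult_divide(0.6, 0.75)
print(q)  # close to 0.8
```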
5
Analysis of Errors
q^(m) = z^(m) can be off from q by up to 1 ulp; for example, when z = d, both d^(i) and q^(i) converge to 1 − ulp rather than 1. To reduce the error to ulp/2, add ulp to q^(m) whenever q_{−1} = 1.
Complexity

A k-bit division requires:
– 2⌈log₂ k⌉ − 1 multiplications
– ⌈log₂ k⌉ complementations

Intermediate computations need to be done with a minimum of k + log₂ m bits.
6
Division By Reciprocation • To compute q = z / d – compute 1/d – multiply z times 1/d
• Particularly efficient if several divisions by the same divisor d need to be performed
Newton-Raphson Iteration x
( i +1)
f ( x (i ) ) =x − f ′( x (i ) ) (i )
Finds the root of f
7
Newton-Raphson for Division
Quadratic Convergence:
Initial Condition For Good:
Better:
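The Newton-Raphson division recurrence referenced above can be sketched as follows; the iteration x ← x(2 − dx) converges to 1/d from any starting estimate 0 < x^(0) < 2/d (the starting value 1.0 here is an arbitrary choice for illustration, not the lookup-based initial condition from the slide):

```python
# Newton-Raphson reciprocal sketch: f(x) = 1/x - d has root 1/d, giving
# x(i+1) = x(i) * (2 - d * x(i)) -- two multiplications per iteration.

def nr_reciprocal(d, x0, iters=5):
    x = x0
    for _ in range(iters):
        x = x * (2.0 - d * x)    # error squares each step
    return x

d = 0.75
print(nr_reciprocal(d, 1.0) * 0.6)  # ~ 0.6 / 0.75 = 0.8
```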
8
Speedup of Convergence Division • Three types of speedup are possible: – reducing the number of multiplications – using narrower multiplications – performing the multiplications faster
• Convergence is slow in the beginning, fast at the end (the number of bits doubles each iteration)
Lookup Table • 8-bit estimate of 1/d replaces 3 iterations • Q: How many bits of d must be inspected to estimate w bits of 1/d ? – A: w – Table size will be 2w×w (proved in section 16.6) – Estimate for 1/d may have a positive or negative error -- the important thing is to reduce the magnitude of the error.
9
Lookup Table Size

To get w (w ≥ 5) bits of convergence in the first iteration of division by repeated multiplication, w bits of d (beyond the leading 0.1) must be inspected. The needed approximation x^(0+) has w bits beyond the leading 1:

0.1 xxx…x (w digits)  —table lookup→  1. xxx…x (w digits)

Table size is 2^w × w. The first pair of multiplications uses the (w+1)-bit multiplier x^(0+).
Convergence from Above and Below
10
Reducing the Width of Multiplications • The first pair of multiplications following the table-lookup involve a narrow multiplier. • If the results of multiplications are suitably truncated, then narrow multipliers can continue to be used.
The Effect of Truncation
Truncation Error:
Approximation Error:
11
Use of Truncated Multiplication
Truncation at Each Step
12
Example: 64-bit Divide
• 256×8 = 2K-bit lookup table (8-bit result)
• Two multiplications (9-bit multiplier) (16-bit result)
• Two multiplications (17-bit multiplier) (32-bit result)
• Two multiplications (33-bit multiplier) (64-bit result)
• One full 64-bit multiplication
Hardware Implementation • Convergence division methods are more likely to be used when a fast parallel tree multiplier is available. • The iterated multiply algorithm can also be pipelined.
13
Using a Two-Stage Pipelined Multiplier
Lookup Tables • The better the approximation – the fewer multiplies are required, but – the larger the lookup table
• Store reciprocal values for fewer points and use linear (one multiply-add operation) or higher-order interpolation to obtain the starting approximation at a specified initial value. • Alternatively, formulate the starting approximation as a multioperand addition problem and use one pass through the multiplier’s CSA tree to compute it.
14
Lecture 17 Floating-Point Representations
Number Representations • No representation method is capable of representing all real numbers. • Most real values must be represented by an approximation. • Various methods can be used:
– Fixed-point number systems (0.xxxxxxxx)
– Rational number systems (xxxx/yyyy)
– Floating-point number systems (1.xxxx × 2^yyyy)
– Logarithmic number systems (2^yyyy.yyyy)
1
Fixed-Point • Maximum absolute error is same for all numbers – ±ulp with truncation – ±ulp/2 with rounding
• Maximum relative error is much worse for small numbers than for large numbers – x = (0000 0000. 0000 1001)two – y = (1001 0000. 0000 0000)two
• Small dynamic range: x2 and y2 cannot be represented
Floating-Point • Floating-point trades off precision for dynamic range – you can represent a wide range, from the very small to the extremely large – precision is acceptable at all points within the range
2
Floating-Point
• A floating-point number has 4 components:
– the sign, ±
– the significand, s
– the exponent base, b (usually 2)
– the exponent, e (which allows the point to float)
• x = ±s × b^e
• Previous example:
– +1.001two × 2^−5
– +1.001two × 2^+7
• More dynamic range
Typical Floating-Point Format
3
Two Signs • The sign of the significand is the sign of the number • The exponent sign is positive for large numbers and negative for small numbers
Representation of Exponent • Signed integer represented in biased number system, and placed to the left of the significand – does not affect speed or cost of exponent arithmetic (addition/subtraction) – Smallest exponent = 0 • facilitates zero detection, zero = all 0’s
– facilitates magnitude comparison • comparing normalized F.P. numbers as if they were integers
4
Range • Intervals [ −max, −min ] and [ min, max ] • max = largest significand × b largest exponent • min = smallest significand × b smallest exponent −∞
+∞
Normal Numbers • Significand is in a set range such as: – –
[1/2, 1) 0.1xxxxxxxx, or [1, 2) 1.xxxxxxxx
• Non-normal numbers may be normalized by shifting and adjusting the exponent. • Zero is never normal.
5
Unrepresentable Numbers • Unrepresentable means not representable as a normalized number. • Underflow – interval from −min to 0, and from 0 to min
• Overflow – interval from −∞ to −max, and from max to ∞
• Three special, singular values : −∞, 0, ∞ Represented by special encodings.
Floating-Point Format • The more bits allocated to the exponent, the larger the dynamic range • The more bits allocated to the significand, the greater the precision • Decisions:
– exponent base, b
– number of exponent bits
– number of significand bits
– representation of the exponent, e
– representation of the significand, s
– placement of the binary point in the significand
6
IEEE Single (Double) Precision
• Exponent base b = 2
• Number of exponent bits = 8 (11)
• Number of significand bits = 23 (52)
• Exponent representation: biased, with bias = 127 (1023)
• Significand representation: signed-magnitude
• Binary point placed just after the leading bit; that leading bit is implicit (hidden), not stored, because it is always 1 for normalized numbers.
Before Standardization • Every vendor had a different floating-point format. • Even after word widths were standardized to 32 and 64 bits, different floating-point standards persisted. • Programs were not predictable. • Data was not portable.
7
ANSI/IEEE Std 754-1985
• Single (32 bit) and double (64 bit) precision formats • Special codes for +0, −0, +∞, −∞ – Number ÷ +∞ = ±0 – Number × ∞ = ±∞ – Number ÷ 0 = ±∞
• NaN (Not a Number) – Ex: 0/0, ∞/∞, 0×∞, Sqrt(negative number) – Any operation involving another NaN
5 Formats (Single Precision)

(1) NaN:          e = 255, f ≠ 0:   v = NaN regardless of s
(2) Infinity:     e = 255, f = 0:   v = (−1)^s ∞
(3) Normalized:   0 < e < 255:      v = (−1)^s 2^(e−127) (1.f)
(4) Denormalized: e = 0, f ≠ 0:     v = (−1)^s 2^−126 (0.f)
(5) Zero:         e = 0, f = 0:     v = (−1)^s 0
8
5 Formats (Double Precision)

(1) NaN:          e = 2047, f ≠ 0:  v = NaN regardless of s
(2) Infinity:     e = 2047, f = 0:  v = (−1)^s ∞
(3) Normalized:   0 < e < 2047:     v = (−1)^s 2^(e−1023) (1.f)
(4) Denormalized: e = 0, f ≠ 0:     v = (−1)^s 2^−1022 (0.f)
(5) Zero:         e = 0, f = 0:     v = (−1)^s 0
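The single-precision cases above can be sketched as a decoder. This example is an illustration, not a hardware design: it pulls out the s, e, f fields of a 32-bit pattern and applies the five rules directly.

```python
# Sketch: decode a 32-bit IEEE single per the five cases, from the raw
# bit fields s (sign), e (biased exponent), f (fraction).
import struct

def classify(bits):
    s, e, f = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
    sign = -1.0 if s else 1.0
    if e == 255:                      # cases (1) and (2)
        return "NaN" if f else sign * float("inf")
    if e == 0:                        # cases (4) and (5)
        return sign * 0.0 if f == 0 else sign * 2.0 ** -126 * (f / 2.0 ** 23)
    return sign * 2.0 ** (e - 127) * (1 + f / 2.0 ** 23)  # case (3)

bits = struct.unpack(">I", struct.pack(">f", -6.25))[0]
print(classify(bits))  # -6.25
```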
Denormalized Numbers • Numbers without a hidden 1 and with the smallest possible exponent • Provided to make underflow less abrupt. • “Graceful underflow” • Ex: 0.0001two × 2^−126 • Hardware support is optional and is often omitted.
9
Operations Defined
• Add
• Subtract
• Multiply
• Divide
• Square Root
• Remainder
• Comparison
• Conversions

Results must be the same as if intermediate computations were done with infinite precision. Care must be taken in hardware to ensure correctness without undue loss of precision.
Addition • Align the exponents of the two operands by right-shifting the significand of the number with the smaller exponent. • Add or subtract the significands depending on the sign bits. – Add if signs are the same – Subtract if signs are different
• In the case of subtract, cancellation may have occurred, and post-normalization is necessary. • Both overflow and underflow are possible
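The alignment-and-add step can be sketched on toy operands. This is a same-sign illustration only (no sign logic, rounding, or sticky-bit tracking), using p-bit integer significands in [2^(p−1), 2^p):

```python
# Toy sketch of FP addition alignment: operands are (exponent, integer
# significand) pairs; the smaller operand is right-shifted to align.

def fp_add(e1, s1, e2, s2, p=8):
    if e1 < e2:
        e1, s1, e2, s2 = e2, s2, e1, s1   # swap so operand 1 is larger
    s2 >>= (e1 - e2)                      # align: shift smaller right
    s = s1 + s2                           # same-sign case: add significands
    if s >= 1 << p:                       # carry-out: post-normalize
        s, e1 = s >> 1, e1 + 1
    return e1, s

print(fp_add(3, 0b10000000, 1, 0b11000000))  # (3, 176), i.e. 0b10110000
```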
10
Multiplication • Add the exponents and multiply the significands. • s1 in [1, 2) and s2 in [1,2) imply s1×s2 in [1, 4) – possible need for a single-bit right shift postnormalization
• Overflow and underflow are possible. – Post-normalization may also cause overflow.
Division • Subtract the exponents and divide the significands. • s1 in [1, 2) and s2 in [1,2) imply s1÷ s2 in (1/2, 2) – possible need for a single-bit left shift postnormalization
• Overflow and underflow are possible. – Post-normalization may also cause underflow.
• Division by zero fault must be detected.
11
Square Root • Make the exponent even by subtracting 1 from odd exponents and shifting the significand left by one bit. s in [1, 4) • Halve the exponent and compute the square root of the significand in [1,2) • Square root never Overflows, Underflows, or needs post-normalization. • Square root of negative non-zero number produces a NaN • − 0 = −0
Conversions • Integer to Floating Point – may require rounding – may be inexact
• Single Precision to Double Precision – always fits, never any problems
• Double Precision to Single Precision – may require rounding – may overflow or underflow
12
Exceptions
• Divide by Zero
• Underflow
• Overflow
• Inexact – result needed rounding
• Invalid – the result is NaN
Rounding Schemes • Round toward zero (inward) – truncation or chopping of significand
• Round toward −∞ – truncation of a 2’s-complement number
• Round toward +∞ • Round toward nearest even – round to nearest value – if it’s a tie, round so that the ulp=0
13
Round toward 0
Round toward −∞
14
Round toward +∞
Round to nearest even
15
Rounding Algorithm
• Inputs needed:
– rounding mode (2 bits)
– sign
– LSB of significand
– guard (bit to the right of LSB)
– round (bit to the right of guard)
– sticky (OR of all bits to the right of round)
• Conditionally add 1 ulp • Possibly post-normalize • Can overflow
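The "conditionally add 1 ulp" decision can be sketched from those inputs. This is an illustrative sketch of the standard decision rules, with the guard bit taken as the first discarded bit:

```python
# Sketch of the rounding decision: given mode, sign, LSB, guard (g),
# round (r), and sticky (s) bits, decide whether to add 1 ulp.

def round_up(mode, sign, lsb, g, r, s):
    if mode == "chop":                   # toward zero: never add
        return 0
    if mode == "up":                     # toward +inf
        return 1 if sign == "+" and (g | r | s) else 0
    if mode == "down":                   # toward -inf
        return 1 if sign == "-" and (g | r | s) else 0
    # nearest even: guard set, and either a nonzero bit below it or odd LSB
    return 1 if g and (r | s | lsb) else 0

print(round_up("even", "+", 1, 1, 0, 0))  # 1: tie rounds up to even LSB
```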
16
Lecture 18 Floating-Point Operations
Unpacking • Separate the sign, exponent, and significand • Put in the implicit “1.” • Convert to internal format (perhaps an extended number) • Testing for special operands:
NaN , ± 0, ± ∞
1
[Figure: floating-point adder block diagram — unpack; operand swapper driven by exponent comparison; right shifter for alignment; adder/subtractor with sign logic; right/left shift normalize; round add; right-shift normalize; pack.]
Right Shifter • Barrel Shifter • Logarithmic Shifter
2
Barrel Shifter
[Figure: single-stage barrel shifter with shift selections Right0–Right3 and Left1.]

Logarithmic Shifter
[Figure: logarithmic shifter built from log₂ stages, each shifting by a fixed power of two.]
Comp Logic

Δe   Δs     Conclusion
+    X      s1 ≥ s2
0    0, +   s1 ≥ s2
0    −      s1 < s2, swap
−    X      s1 < s2, swap
Sign Logic

Sign1  Sign′2  s1 ≥ s2   Signout  Add/Sub
+      +       X         +        Add
−      −       X         −        Add
+      −       1         +        Sub
+      −       0         −        Sub
−      +       1         −        Sub
−      +       0         +        Sub

Sign′2 = Sign2 ⊕ Sub
4
Rounding Logic

Before post-normalization the significand is … z_{−l+1} z_{−l} | G R S. The round (R) and sticky (S) inputs to the rounding logic depend on the normalization shift:

Case                          LSB        R        S
Already normal                z_{−l}     G        R ∨ S
1-bit right-shift normalize   z_{−l+1}   z_{−l}   G ∨ R ∨ S
1-bit left-shift normalize    G          R        S

[Figure: round mode and the right/left normalization shift feed the round logic, which drives the round add.]
Round Logic
R = Guard S = Round ∨ Sticky
Mode   Sign  LSB  R  S   Round
Up     +     X    1  X   1
Up     +     X    X  1   1
Up     +     X    0  0   0
Up     −     X    X  X   0
Down   +     X    X  X   0
Down   −     X    1  X   1
Down   −     X    X  1   1
Down   −     X    0  0   0
Chop   X     X    X  X   0
Even   X     X    1  1   1
Even   X     0    1  0   0
Even   X     1    1  0   1
Even   X     X    0  X   0
5
[Figure: floating-point multiplier — unpack; add exponents and subtract bias; integer multiply of significands; XOR of signs; right-shift normalize; round add; right-shift normalize; pack.]
[Figure: floating-point divider — unpack; subtract exponents and add bias; integer divide of significands (quotient and remainder); XOR of signs; left-shift normalize; round add; right-shift normalize; pack.]
6
[Figure: exponent logic — for multiply, e = e1 + e2 − bias; for divide, e = e1 − e2 + bias; bias = 127.]
Addition of Special Operands

 +    | −∞   −0    +0    +∞
 −∞   | −∞   −∞    −∞    NaN
 −0   | −∞   −0    ±0*   +∞
 +0   | −∞   ±0*   +0    +∞
 +∞   | NaN  +∞    +∞    +∞

* −0 if rounding mode is Down, +0 otherwise
7
Subtraction of Special Operands

 −    | −∞   −0    +0    +∞
 −∞   | NaN  −∞    −∞    −∞
 −0   | +∞   ±0*   −0    −∞
 +0   | +∞   +0    ±0*   −∞
 +∞   | +∞   +∞    +∞    NaN

* −0 if rounding mode is Down, +0 otherwise
Multiplication of Special Operands

 ×    | −∞   −0    +0    +∞
 −∞   | +∞   NaN   NaN   −∞
 −0   | NaN  +0    −0    NaN
 +0   | NaN  −0    +0    NaN
 +∞   | −∞   NaN   NaN   +∞
8
Division of Special Operands

 ÷    | −∞   −0     +0     +∞
 −∞   | NaN  +∞*    −∞*    NaN
 −0   | +0   NaN    NaN    −0
 +0   | −0   NaN    NaN    +0
 +∞   | NaN  −∞*    +∞*    NaN

* Divide-by-zero exception
Addition/Subtraction in LNS

Lz = log z = log(x ± y) = log(x (1 ± y/x))
   = log x + log(1 ± y/x)
   = log x + log(1 ± log⁻¹(log y − log x))
   = Lx + φ(Ly − Lx)

Evaluate φ with a lookup table.
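The derivation can be sketched directly. In this illustration φ is computed with `log2` rather than read from a table, and the operands are swapped so the table argument Ly − Lx stays non-positive (a common convention, assumed here):

```python
# Sketch of LNS addition: Lz = Lx + phi(Ly - Lx), with
# phi(t) = log2(1 + 2**t); a real unit would read phi from a table.
import math

def lns_add(Lx, Ly):
    if Ly > Lx:
        Lx, Ly = Ly, Lx                      # keep table argument <= 0
    return Lx + math.log2(1.0 + 2.0 ** (Ly - Lx))

Lx, Ly = math.log2(6.0), math.log2(2.0)
print(2.0 ** lns_add(Lx, Ly))  # ~ 8.0, i.e. 6 + 2
```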
9
Arithmetic Unit for LNS
10
Lecture 19 Error and Error Control
Sources of Computational Errors • Representational Errors – Limited number of digits – Truncation Error
• Computational Errors
1
Example
Representational Errors FLP(r, p, A) • Radix r • Precision p = Number of radix-r digits • Approximation scheme A – – – –
chop round rtne (round to nearest even) chop(g) (chop with g guard digits kept in intermediate steps)
2
Relative Error
Error in Multiplication
3
Error in Division
Error in Addition
4
Error in Subtraction
• If x−y is small, the relative error can be very large – this is called cancellation, or loss of significance
• Arithmetic error η is also unbounded for subtraction without guard digits
The Need for Guard Digits
5
Example
Invalidated Associative Law
6
Using Unnormalized Arithmetic
Tell the truth about how much significance you are carrying.
Normalized Arithmetic with 2 Guard Digits
7
Other Laws that Do Not Hold True
Which Algebraic Equivalent Computation is Best? • No general procedure exists • Numerous empirical and theoretical results have been developed. • Two examples….
8
Example One

The roots x = −b ± √(b² − c) (of x² + 2bx + c = 0) can be rewritten by rationalizing:

−b ± √(b² − c) = (b² − (b² − c)) / (−b ∓ √(b² − c)) = −c / (b ∓ √(b² − c))

Choosing, for each root, the form that avoids subtracting nearly equal quantities:

x₁ = −b − √(b² − c)
x₂ = −c / (b + √(b² − c))
Example Two
9
Worst-Case Error Accumulation • In a sequence of computations, errors may accumulate
1024 ulp = 10 bits of precision lost
• An absolute error of m ulp has the effect of losing log2m bits of precision
Solution: Use multiple Guard Digits • Do computations in double precision for an accurate single precision result. • Have hardware keep several guard digits. • Reduce the number of cascade operations.
10
Kahan’s Summation Algorithm
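The algorithm referenced by this slide can be sketched in a few lines: carry a running compensation c that recovers the low-order bits lost in each addition.

```python
# Kahan (compensated) summation sketch: c accumulates the error of each
# floating-point addition and feeds it back into the next term.

def kahan_sum(xs):
    s, c = 0.0, 0.0
    for x in xs:
        y = x - c            # apply the correction
        t = s + y            # big + small: low-order bits of y may be lost
        c = (t - s) - y      # recover exactly what was lost
        s = t
    return s

vals = [1.0, 1e-16, 1e-16, 1e-16, 1e-16]
print(kahan_sum(vals), sum(vals))  # compensated sum keeps the tiny terms
```

With naive left-to-right summation each 1e-16 term is lost entirely (it is below half an ulp of 1.0), while the compensated sum retains them.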
Error Distribution and Expected Errors

         Maximum (worst case)   Average (expected)
Chop     r^(−p+1)               (r − 1) r^(−p) / (2 ln r)
Round    r^(−p+1) / 2           (1 + 1/r) (r − 1) r^(−p) / (4 ln r)
Expected error of rounding is 3/4 (not 1/2) that of chopping.
11
Forward Error Analysis • Estimating, or bounding the relative error in a computation • Requires specific constraints on the input operand ranges • Dependent on the specific computation
Automatic Error Analysis • Run selected (worst case) test cases with higher precision and observe the differences. • If differences are insignificant, then the computation is probably safe. • Only as good as test cases.
12
Significance Arithmetic • Roughly same as unnormalized arithmetic. • Information about the precision is carried in the result
Would have been misleading.
Noisy-mode Computation • Pseudo random digits, (rather than zeros) are inserted during left shifts performed for normalization. • Needs hardware support, or significant software overhead. • Several runs are compared. If they are comparable, then computation is good.
13
Interval Arithmetic • A value x is represented by an interval • Upper and lower bounds are found for each computation. Example:
• Unfortunately, intervals tend to widen until, after many steps, they become so wide as to be virtually worthless. • Helpful in choosing best reformulation of a computation, and for certifiable accuracy claims.
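A minimal sketch of the idea: each value is a [lo, hi] pair, and every operation returns bounds that are guaranteed to contain the true result. (For multiplication of intervals that may span zero, taking the min and max of the four endpoint products suffices; a careful implementation would also round the endpoints outward.)

```python
# Minimal interval-arithmetic sketch: values are (lo, hi) pairs.

def iadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

def imul(a, b):
    ps = [x * y for x in a for y in b]   # four endpoint products
    return (min(ps), max(ps))

x = (2.0, 3.0)
y = (4.0, 5.0)
print(iadd(x, y), imul(x, y))  # (6.0, 8.0) (8.0, 15.0)
```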
Backward Error Analysis • Computation error is analyzed in terms of equivalent errors in input values. • If inputs are not precise to this level anyway, then arithmetic errors should not be a concern.
14
Lecture 20 Precise and Certifiable Arithmetic
High Precision and Certifiability • Floating point formats – remarkably accurate in most cases – errors are now reasonably well understood – errors can be controlled with algorithmic methods
• Sometimes, however, this is inadequate – not precise enough – cannot guarantee bounds on the errors – “credibility-gap problem …. We don’t know how much of the computer’s answer to believe.” -- Knuth
1
Approaches for Coping with the Credibility Gap • Perform arithmetic calculations exactly – not always cost-effective
• Make arithmetic highly precise by raising the precision – Multiprecision arithmetic – Variable-precision arithmetic – Both methods make bad results less likely, but provide no guarantee
• Keep track of error accumulation – Certify the result or produce a warning
Other Issues • Algorithm and hardware verification – remember the Pentium
• Fault detection – Detect that a hardware failure has occurred
• Fault tolerance – Continued operation in the presence of hardware failures
2
Exact Arithmetic Proposals have included: • Continued fractions • Rational numbers • p-adic representations
Continued Fractions

Any unsigned rational number x = p/q has a unique continued-fraction expansion:

x = p/q = a₀ + 1/(a₁ + 1/(a₂ + ⋯ + 1/(a_{m−1} + 1/a_m)))

with a₀ ≥ 0;  a₁, …, a_{m−1} ≥ 1;  a_m ≥ 2.
3
Procedure to Convert to CF

s^(0) = x
a^(i) = ⌊s^(i)⌋
s^(i+1) = 1 / (s^(i) − a^(i)),  stopping when s^(i) = a^(i)

Invariant:  s^(i) = a^(i) + 1/s^(i+1)
Example: 277/642

s^(0) = 277/642    a^(0) = ⌊277/642⌋ = 0
s^(1) = 642/277    a^(1) = 2
s^(2) = 277/88     a^(2) = 3
s^(3) = 88/13      a^(3) = 6
s^(4) = 13/10      a^(4) = 1
s^(5) = 10/3       a^(5) = 3
s^(6) = 3/1        a^(6) = 3

So 277/642 = [0; 2, 3, 6, 1, 3, 3]. Truncating the expansion gives successively better rational approximations.
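The conversion procedure runs directly in exact rational arithmetic; applied to 277/642 it reproduces the expansion above:

```python
# Sketch of the CF conversion: a(i) = floor(s(i)),
# s(i+1) = 1/(s(i) - a(i)), done with exact rationals.
from fractions import Fraction

def to_cf(x):
    digits = []
    while True:
        a = x.numerator // x.denominator   # a(i) = floor(s(i))
        digits.append(a)
        if x == a:                          # expansion terminates
            break
        x = 1 / (x - a)                     # s(i+1)
    return digits

print(to_cf(Fraction(277, 642)))  # [0, 2, 3, 6, 1, 3, 3]
```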
4
Continuations • Represent a number by a finite number of digits, plus a continuation [Vuillemin, 1990] • A continuation is a procedure for obtaining: – the next digit – a new continuation • Notation: [digit, digit, …. , digit ; continuation]
• Notation: [digit, digit, …. , digit ; continuation]
Periodic CF Numbers
5
Arithmetic • Unfortunately, arithmetic with continued fractions is quite complicated.
Fixed-Slash Number Systems
• Rational number p/q
– p and q are fixed-width integers
– sign bit
– inexact bit: result has been “rounded” to fit the format
• Normalized if gcd(p, q) = 1
• Special values are representable:
– rational number: p ≠ 0, q ≠ 0
– ±0: p = 0, q odd
– ±∞: p odd, q = 0
– NaN: otherwise
6
Multiple representations • 1/1 = 2/2 = 3/3 = 4/4 = …. • How many bits are wasted due to multiple representations? • Two randomly selected numbers in [1, n] are relatively prime with a probability of 0.608 [Dirichlet] • 61% of the codes represent unique numbers • Waste is < 1 bit.
Rational Arithmetic • Reciprocal – exchange p and q
• Negation – change the sign
• Multiplication/Division – 2 multiplies, normalize
• Addition/Subtraction – 3 multiplies and 1 addition, normalize
• Normalization – compute gcd(p,q), the most costly step – rounding complex
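The operation counts above can be seen in a sketch of rational addition; the gcd normalization at the end is the expensive part:

```python
# Sketch of rational (slash-number) addition: three multiplications and
# one addition, then gcd normalization.
from math import gcd

def rat_add(p1, q1, p2, q2):
    p, q = p1 * q2 + p2 * q1, q1 * q2   # 3 multiplies + 1 add
    g = gcd(p, q)                        # normalization: the costly step
    return p // g, q // g

print(rat_add(1, 6, 1, 10))  # (4, 15): 1/6 + 1/10 = 4/15
```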
7
Floating Slash Number Systems
• Allows bits to be allocated where they are needed. • Ex: Integers q=1 (only needs one bit for q)
Multiprecision Arithmetic • Representing numbers using multi-word structures • Perform arithmetic by means of software routines that manipulate these structures. • Example applications: – Cryptography – Large prime research
8
Fixed Multi-precision Formats
Integer
Floating-Point
Multi-precision Arithmetic • Computation can be very slow • Depends heavily on available hardware arithmetic capabilities. • Research has been done on using parallel computers on distributed multi-precision data.
9
Variable-precision Arithmetic • Like multi-precision, except number of words can vary dynamically. • Helpful for both high precision and low precision needs. • “Little Endian” is slightly more efficient for addition.
Variable-precision Formats
Integer
Floating-Point
10
Variable-precision FP Addition X=u-word, Y=v-word with h shift
Using an exponent base of 2^k instead of 2 allows shifting to be done by indexing, rather than actual data movement.
FP Addition Algorithm
11
Digit Serial • One digit per clock cycle • Allows dispensing precision on demand.
Error Bounding Using Interval Arithmetic • Using interval arithmetic, a result range is computed. • Midpoint of interval is used for approximate value of result. • Interval size, w, is used as the extent of uncertainty, with worst case error w/2
12
Combining and Comparing Intervals
Interval Arithmetic
13
Interval Arithmetic
Using Interval Arithmetic to Choose Precision • Theorem 20.1 implies that when you narrow the intervals of the inputs, you narrow the intervals of the outputs. • Theorem 20.2 states that if you reduce the relative error of the inputs, you reduce the relative error of the outputs by the same amount. • You can devise a practical strategy for obtaining results with a desired bound on error.
14
Choosing a Precision • Run a trial calculation with – – – –
p radix-r digits of precision wmax = maximum interval width of result ε is desired bound on absolute error if wmax ≤ ε then trial result is good, otherwise ….
• Rerun the calculation with – q radix-r digits of precision, where – q = p + ⌈log_r wmax − log_r ε⌉
Adaptive and Lazy Arithmetic • Not all computations require the same precision. • Adaptive arithmetic systems can deliver varying amounts of precision • Lazy evaluation = postpone all computations until they become irrelevant, or unavoidable, produce digits on demand.
15
Lecture 21 Square Root
Pencil & Paper Algorithm

z   Radicand              z_{2k−1} z_{2k−2} … z₁ z₀    (2k digits)
q   Square root           q_{k−1} … q₀                 (k digits)
s   Remainder (z − q²)    s_k s_{k−1} … s₀             (k+1 digits)

z^(0) = q^(0) = s^(0) = 0                              (initialization)

z^(i) = r² z^(i−1) + (z_{2(k−i)+1} z_{2(k−i)})_r
q^(i) = r q^(i−1) + q_{k−i}
s^(i) = z^(i) − (q^(i))²                               (invariant)
      = r² z^(i−1) + (z_{2(k−i)+1} z_{2(k−i)})_r − (r q^(i−1) + q_{k−i})²
      = r² z^(i−1) + (z_{2(k−i)+1} z_{2(k−i)})_r − r² (q^(i−1))² − 2r q^(i−1) q_{k−i} − q_{k−i}²
      = r² s^(i−1) + (z_{2(k−i)+1} z_{2(k−i)})_r − (2r q^(i−1) + q_{k−i}) q_{k−i}
1
Decimal Interpretation

s^(i) = r² s^(i−1) + (z_{2(k−i)+1} z_{2(k−i)})_r − (2r q^(i−1) + q_{k−i}) q_{k−i}

The first two terms shift the remainder and bring down two digits; the last term doubles the partial root, shifts it left one position, appends the new digit, and multiplies by the new digit.

0 ≤ s^(i) < 2q^(i)
If s^(i) ≥ 2q^(i), then q_{k−i} is too small; if s^(i) < 0, then q_{k−i} is too big.
Decimal Example
2
Binary Interpretation

s^(i) = r² s^(i−1) + (z_{2(k−i)+1} z_{2(k−i)})_r − (2r q^(i−1) + q_{k−i}) q_{k−i}

The first two terms shift the remainder two places left and bring down two bits; the subtracted term is 0 when the new digit is 0 and (q^(i−1) 01)₂ when it is 1.

0 ≤ s^(i) < 2q^(i)
If s^(i) ≥ 2q^(i), then the digit is too small; if s^(i) < 0, then it is too big.
Binary Example
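As a worked illustration of the binary restoring recurrence (an integer sketch, not a hardware design): try appending digit 1; if the remainder would go negative, restore it and take digit 0.

```python
# Restoring binary square root of an integer radicand, digit by digit.

def isqrt_restoring(z, k):
    """k-bit square root of z (z < 4**k) and the remainder z - q*q."""
    q, s = 0, 0
    for i in range(k - 1, -1, -1):
        s = (s << 2) | ((z >> (2 * i)) & 3)   # bring down two bits
        trial = (q << 2) | 1                  # (2q shifted left) with 01 appended
        if s >= trial:                        # trial digit 1 succeeds
            s -= trial
            q = (q << 1) | 1
        else:                                 # restore: digit is 0
            q = q << 1
    return q, s

print(isqrt_restoring(118, 4))  # (10, 18): 10**2 + 18 == 118
```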
3
Binary in Dot Notation
Restoring Shift/Subtract Algorithm

z   IEEE radicand          z₁ z₀ . z₋₁ z₋₂ … z₋ₗ
q   Square root            1 . q₋₁ … q₋ₗ
s   Remainder (z − q²)     s₁ s₀ . s₋₁ s₋₂ … s₋ₗ

q^(0) = 1,  s^(0) = z − 1                          (initialization)

q^(i) = q^(i−1) + 2^{−i} q₋ᵢ

s^(i) = z − (q^(i))²                                (invariant)
      = z − (q^(i−1) + 2^{−i} q₋ᵢ)²
      = z − (q^(i−1))² − 2·2^{−i} q^(i−1) q₋ᵢ − 2^{−2i} q₋ᵢ²
      = s^(i−1) − 2^{−i} (2q^(i−1) + 2^{−i} q₋ᵢ) q₋ᵢ          (recurrence)

2^i s^(i) = 2·2^{i−1} s^(i−1) − (2q^(i−1) + 2^{−i} q₋ᵢ) q₋ᵢ   (scaled recurrence)
4
Restoring Square-Root Interpretation

z   Radicand               z₁ z₀ . z₋₁ z₋₂ … z₋ₗ
q   Square root            1 . q₋₁ … q₋ₗ
s   Remainder (z − q²)     s₁ s₀ . s₋₁ s₋₂ … s₋ₗ

q^(0) = 1,  s^(0) = z − 1                          (initialization)
q^(i) = q^(i−1) + 2^{−i} q₋ᵢ
s^(i) = z − (q^(i))²                                (invariant)

2^i s^(i) = 2·2^{i−1} s^(i−1) − (2q^(i−1) + 2^{−i} q₋ᵢ) q₋ᵢ   (recurrence)

The subtracted term is 0 when q₋ᵢ = 0 and (1 q₋₁ . q₋₂ … q₋ᵢ₊₁ 0 1)two when q₋ᵢ = 1.
Example of Restoring Algorithm
5
Sequential Shift/Subtract Restoring Square-Rooter
IEEE Square Root
• The square root of a representable significand can never fall exactly midway between two machine numbers (the exact-midway case of round-to-nearest does not arise).
• Rounding can therefore be decided from the last digit q₋ₗ together with the sign and zero-ness of the final remainder.
6
Binary Non-Restoring Algorithm Interpretation

q^(0) = 1,  s^(0) = z − 1                          (initialization)
q^(i) = q^(i−1) + 2^{−i} q₋ᵢ,   q₋ᵢ ∈ {−1, 1}
s^(i) = z − (q^(i))²                                (invariant)
2^i s^(i) = 2·2^{i−1} s^(i−1) − (2q^(i−1) + 2^{−i} q₋ᵢ) q₋ᵢ   (recurrence)

Keep two registers: Q = q^(i−1) (partial root) and Q* = q^(i−1) − 2^{−(i−1)} (diminished partial root). Then:

q₋ᵢ = 1:   2q^(i−1) + 2^{−i} = (Q 01), so subtract (Q 01) shifted left 1
q₋ᵢ = −1:  −(2q^(i−1) − 2^{−i}) = −(Q* 11), so add (Q* 11) shifted left 1
High-Radix Square Root

z   IEEE radicand          z₁ z₀ . z₋₁ z₋₂ … z₋ₗ
q   Square root            1 . q₋₁ … q₋ₗ
s   Scaled remainder r^i (z − q²)   s₁ s₀ . s₋₁ s₋₂ … s₋ₗ

q^(0) = 1,  s^(0) = z − 1                          (initialization)
q^(i) = q^(i−1) + r^{−i} q₋ᵢ
s^(i) = r^i (z − (q^(i))²)                          (invariant)
s^(i) = r s^(i−1) − (2q^(i−1) + r^{−i} q₋ᵢ) q₋ᵢ     (recurrence)
7
High-Radix Square Root: Radix-4 Digit Cases

For radix 4:  s^(i) = 4 s^(i−1) − (2q^(i−1) + 4^{−i} q₋ᵢ) q₋ᵢ    (recurrence)

Keep registers Q = q^(i−1) and Q* = q^(i−1) − 4^{−(i−1)}:

q₋ᵢ = 2:   s^(i) = 4s^(i−1) − (4q^(i−1) + 4^{−i+1}) = 4s^(i−1) − (Q 010) shifted left 2;   Q ← Q 10,  Q* ← Q 01
q₋ᵢ = 1:   s^(i) = 4s^(i−1) − (2q^(i−1) + 4^{−i})   = 4s^(i−1) − (Q 001) shifted left 1;   Q ← Q 01,  Q* ← Q 00
q₋ᵢ = 0:   s^(i) = 4s^(i−1);                                                               Q ← Q 00,  Q* ← Q* 11
q₋ᵢ = −1:  s^(i) = 4s^(i−1) + (2q^(i−1) − 4^{−i})   = 4s^(i−1) + (Q* 111) shifted left 1;  Q ← Q* 11, Q* ← Q* 10
q₋ᵢ = −2:  s^(i) = 4s^(i−1) + (4q^(i−1) − 4^{−i+1}) = 4s^(i−1) + (Q* 110) shifted left 2;  Q ← Q* 10, Q* ← Q* 01
Digit Selection • As in division, digit selection can be based on examining just a few bits of the partial remainder s(i). • s(i) can be kept in carry-save form • The exact same lookup table can be used for square-root as is used for division if the digit set {−1, −1/2, 0, 1/2, 1} is used.
8
Square-Root by Convergence

Using Newton–Raphson:  x^(i+1) = x^(i) − f(x^(i)) / f′(x^(i))

With f(x) = x² − z:

x^(i+1) = (x^(i) + z / x^(i)) / 2

Error:  δ^(i+1) = √z − x^(i+1) = −(√z − x^(i))² / (2x^(i)) = −(δ^(i))² / (2x^(i))
Square-Root by Convergence • Convergence is quadratic. (Number of bits accuracy doubles each iteration.) • Since δ is negative, the recurrence approaches the answer from above. • An initial table-lookup step can be used to obtain a better initial estimate, and reduce the number of iterations.
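The recurrence is a one-liner; the initial estimate 1.5 below is an arbitrary illustration (a real unit would take it from a lookup table, as noted above):

```python
# Newton-Raphson square root: x(i+1) = (x(i) + z/x(i)) / 2; the number of
# correct bits roughly doubles each iteration.

def newton_sqrt(z, x0, iters=6):
    x = x0
    for _ in range(iters):
        x = 0.5 * (x + z / x)
    return x

print(newton_sqrt(2.0, 1.5))  # ~ 1.41421356...
```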
9
Example
Approximation Functions

For fractional z, 1/2 ≤ z < 1:
    x^(0) = (1 + z) / 2                                              (error < 6.07%)

For integer z = 2^{2m−1} + z_rest:
    x^(0) = 2^{m−1} + 2^{−(m+1)} z = (3 × 2^{m−2}) + 2^{−(m+1)} z_rest   (error < 6.07%)
10
Division-Free Variants
• With a reciprocal circuit or lookup table:
– each iteration requires a table lookup, a 1-bit shift, 2 multiplications, and 2 additions
– multiplication must be twice as fast as division to make this cost-effective
– convergence is less than quadratic because of error in the reciprocal

x^(i+1) = x^(i) + 0.5 (1/x^(i)) (z − (x^(i))²)
Division-Free Variants • With Newton-Raphson approximation of reciprocal – each iteration a 1-bit shift, 3 multiplications, and 2 additions – convergence rate will be less than quadratic because of error in reciprocal – two equations can be computed in parallel
x (i +1) = x (i ) + 0.5 (x ( i ) + zy (i ) ) y (i +1) = y ( i ) (2 − x ( i ) y ( i ) )
11
Example
Division-Free Variants
Using Newton–Raphson with f(x) = 1/x² − z (root at x = 1/√z):
x(i+1) = x(i) − f(x(i)) / f′(x(i)) = 0.5 x(i) ( 3 − z ( x(i) )² )
• Solve for the inverse of the square-root instead:
– Requires 3 multiplications and 1 addition per iteration
– Quadratic convergence
– Final answer: √z = z × x(k)
– Used in the Cray-2 supercomputer, 1989
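A minimal sketch of this inverse-square-root iteration (floats stand in for the machine's internal arithmetic; the constant starting estimate is an assumption — a real design would use a table lookup):

```python
# Inverse square root by Newton-Raphson: x <- 0.5 * x * (3 - z * x * x).
# Each step is 3 multiplications and 1 addition; no division anywhere.
# The final multiply by z recovers sqrt(z) = z * (1/sqrt(z)).
def rsqrt_sqrt(z, iters=6):
    x = 1.0                          # crude initial estimate (assumed)
    for _ in range(iters):
        x = 0.5 * x * (3 - z * x * x)
    return z * x                     # final answer: sqrt(z)
```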
Example
Parallel Hardware Square-Root • Square-Root is very similar to Division • Usually possible to modify divide units to do square root. • A non-restoring square-root array can be derived directly from the dot notation, similar to the way the non-restoring divide array was derived.
Non-Restoring Square-Root Array
Lecture 22 The CORDIC Algorithms
CORDIC • Coordinate Rotation Digital Computer • Invented in the late 1950s • Based on the observation that: – if you rotate a unit-length vector (1, 0) – by an angle z – its new end-point will be at (cos z, sin z)
• Can evaluate virtually all functions of interest • k iterations are required for k bits of accuracy
[Photos: CORDIC hardware implementations from 1959, 1971, and 1977]
Rotations and Pseudo-Rotations
True Rotations
Pseudo-Rotations
After m Real Rotations: rotate by angles α1, α2, …, αm
After m Pseudo-Rotations: rotate by angles α1, α2, …, αm
Expansion Factor K
• Byproduct of the pseudo-rotations: each pseudo-rotation expands the vector length • Depends on the rotation angles • However, if we always use the same rotation angles (with positive or negative signs), then – K is a constant – Can be precomputed and stored – Its reciprocal can also be precomputed and stored
Basic CORDIC Iterations
Pick fixed rotation angles ±α(i) such that: α(i) = tan⁻¹ 2⁻ⁱ, i.e., tan α(i) = 2⁻ⁱ

i  α(i)   tan α(i)
0  45.0°  1.000
1  26.6°  0.500
2  14.0°  0.250
3   7.1°  0.125
CORDIC Pseudo-Rotations
α(i) = tan⁻¹ 2⁻ⁱ, tan α(i) = 2⁻ⁱ
Approximate Angle Table
The angle table can be stored in degrees or in radians.
CORDIC Iterations with Table
Basic CORDIC Iterations • Each CORDIC rotation requires: – 2 shifts – 1 table lookup – 3 additions
• By rotating by the same set of angles (with + or − signs), the expansion factor K can be precomputed
Rotation Mode Rules • Initialize: – z(0) = z – x(0) = x – y(0) = y
• Iterate with di = sign(z(i)) • Finally (after m steps): – x(m) = K ( x cos z − y sin z ) – y(m) = K ( y cos z + x sin z ) – z(m) = 0
Example
Example (First 3 Rotations)
Trig Function Computation • Initialize: – z(0) = z – x(0) = 1/K = 0.607 252 935 … – y(0) = 0
• Iterate with di = sign(z(i)) • Finally (after m steps): – z ≈ 0 – x ≈ cos z – y ≈ sin z – y/x ≈ tan z
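The rotation-mode rules above can be sketched in Python (floats stand in for fixed-point registers; the table of angles α(i) = tan⁻¹ 2⁻ⁱ and the scale factor K are precomputed):

```python
import math

N = 32                                         # iterations ~ bits of accuracy
ALPHAS = [math.atan(2.0 ** -i) for i in range(N)]
K = 1.0
for i in range(N):
    K *= math.sqrt(1 + 2.0 ** (-2 * i))        # K = 1.64676...

def cordic_sin_cos(z):
    """Rotation mode with x(0) = 1/K, y(0) = 0: returns (cos z, sin z)."""
    x, y = 1.0 / K, 0.0
    for i in range(N):
        d = 1.0 if z >= 0 else -1.0            # d_i = sign(z(i))
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ALPHAS[i]                     # drive z toward 0
    return x, y
```

Each iteration is exactly the two shifts, three additions, and one table lookup listed above; only the final scaling has been folded into the x(0) = 1/K initialization.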
Precision in CORDIC • For k bits of precision in trig functions, k iterations are needed. • For large i, tan⁻¹ 2⁻ⁱ ≈ 2⁻ⁱ • For i > k, the change in z is < ulp • Convergence is guaranteed for angles in the range −99.7° ≤ z ≤ 99.7° – (99.7° is the sum of all angles in the table)
• For angles outside this range, use standard trig identities to convert angle to one in the range
Vectoring Mode Rules • Initialize: z(0) = z, x(0) = x, y(0) = y • Iterate with di = −sign( x(i) y(i) )
– this forces y(m) to 0
• Finally (after m steps): – x(m) = K √( x² + y² ) – y(m) = 0 – z(m) = z + tan⁻¹( y / x )
Trig Function Computation • Initialize: – z(0) = 0 – x(0) = 1 – y(0) = y
• Iterate with di = −sign(x(i) y(i)) = −sign(y(i)) • Finally (after m steps): – z ≈ tan⁻¹ y – Use an identity (e.g., tan⁻¹ y = π/2 − tan⁻¹(1/y) for large y) to limit the range of the fixed-point numbers
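A corresponding vectoring-mode sketch (same float-for-fixed-point assumptions as before; valid for x > 0):

```python
import math

N = 40
ALPHAS = [math.atan(2.0 ** -i) for i in range(N)]

def cordic_atan(y, x=1.0):
    """Vectoring mode with z(0) = 0: drives y to 0, returns z ~ atan(y/x)."""
    z = 0.0
    for i in range(N):
        d = -1.0 if y >= 0 else 1.0        # d_i = -sign(y(i)), for x > 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ALPHAS[i]
    return z    # x has grown to K * sqrt(x0^2 + y0^2) along the way
```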
CORDIC Hardware
Bit-Serial CORDIC • For low-cost, low-speed applications (e.g., handheld calculators), bit-serial implementations are possible.
Generalized CORDIC
Circular Rotation Mode
Circular Vectoring Mode
Linear Rotation Mode
Linear Vectoring Mode
Hyperbolic Rotation Mode
Hyperbolic Vectoring Mode
Vector Length and Rotation Angle Circular
Linear
Hyperbolic
Convergence • Circular and linear CORDIC converge for suitably restricted values of x, y, and z. • Hyperbolic CORDIC will not converge in all cases. • A simple solution to this problem is to repeat the steps i = 4, 13, 40, 121, …, j, 3j+1, … • With this precaution, hyperbolic CORDIC also converges for suitably restricted values of x, y, and z
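The repeated-iteration fix can be sketched for hyperbolic rotation mode (floats stand in for fixed-point; here e(i) = tanh⁻¹ 2⁻ⁱ, and the scale factor is the product of √(1 − 2⁻²ⁱ) over the same index sequence):

```python
import math

def hyp_indices(n):
    """Indices 1..n with i = 4, 13, 40, ... repeated once for convergence."""
    out, repeat = [], 4
    for i in range(1, n + 1):
        out.append(i)
        if i == repeat:
            out.append(i)
            repeat = 3 * repeat + 1
    return out

IDX = hyp_indices(40)
EH = [0.0] + [math.atanh(2.0 ** -i) for i in range(1, 41)]  # e(i), i >= 1
KH = 1.0
for i in IDX:
    KH *= math.sqrt(1 - 2.0 ** (-2 * i))       # hyperbolic scale factor < 1

def cordic_sinh_cosh(z):
    """Hyperbolic rotation mode: returns (cosh z, sinh z) for |z| < ~1.11."""
    x, y = 1.0 / KH, 0.0
    for i in IDX:
        d = 1.0 if z >= 0 else -1.0
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * EH[i]
    return x, y
```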
Using CORDIC
Directly computes:
sin, cos, tan⁻¹( y / x ), y + x z, sinh, cosh, tanh⁻¹, ×, ÷
Also directly computes:
√( x² + y² ), √( x² − y² ), e^z = sinh z + cosh z
Using CORDIC Indirectly Computes:
tan z = sin z / cos z
tanh z = sinh z / cosh z
ln w = 2 tanh⁻¹( (w − 1) / (w + 1) )
log_b w = K × ln w  (K = 1 / ln b)
w^t = e^( t ln w )
cos⁻¹ w = tan⁻¹( √(1 − w²) / w )
sin⁻¹ w = tan⁻¹( w / √(1 − w²) )
cosh⁻¹ w = ln( w + √(w² − 1) )
sinh⁻¹ w = ln( w + √(w² + 1) )
√w = √( (w + 1/4)² − (w − 1/4)² )
Summary Table of CORDIC Algorithms
Variable Factor CORDIC • Allows termination of CORDIC before m iterations • Allows the digit set {−1, 0, 1} • Must compute the expansion factor via a recurrence • At the end, results must be divided by the square root of (K(m))² • Constant-factor CORDIC is almost always preferred.
High Speed CORDIC • Do the first k/2 iterations as normal. – z becomes very small
• Combine the remaining k/2 iterations into one step.
• K doesn't change, because the late expansion factors √(1 + 2⁻²ⁱ) are 1 to working precision.
Lecture 23 Variations in Function Evaluation
Choices are Good • CORDIC can compute virtually all elementary functions of interest. • However, alternatives may have advantages: – implementation – adaptability to a specific technology – performance
Recurrence Methods
• Normalization: make u converge to a constant • Additive normalization: make u converge to a constant by adding a term to u each iteration • Multiplicative normalization: make u converge to a constant by multiplying u by a chosen factor each iteration
Which is better ? • Additive is easier to evaluate. – Addition is cheap – CORDIC is additive
• Multiplicative is slower, but usually converges faster – often quadratic convergence – steps are more expensive, but there are fewer of them – sometimes multiplicative reduces to shift & add
Logarithm Multiplicative Normalization:
x(0) = x
y(0) = y
After m steps:
Read from a table
Logarithm, continued Domain of convergence:
Rate of convergence: one bit per iteration for large k.
Invariants: y(i) = y + ln( x / x(i) ), x(i) = x e^( y − y(i) )
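The multiplicative-normalization scheme can be sketched as follows (Python floats; the greedy digit selection below is an illustrative assumption, not the slide's exact selection rule):

```python
import math

# Multiplicative normalization for ln x, x in [0.5, 2):
# drive x toward 1 with factors (1 + d * 2^-i), d in {-1, 0, 1},
# so that y = ln(x0) - ln(x_current) holds at every step.
N = 40
LN = {(i, d): math.log(1 + d * 2.0 ** -i)
      for i in range(1, N) for d in (-1, 1)}

def ln_mult_norm(x):
    y = 0.0
    for i in range(1, N):
        for d in (-1, 1):                  # greedy: apply if it helps
            f = 1 + d * 2.0 ** -i
            if abs(x * f - 1) < abs(x - 1):
                x *= f
                y -= LN[(i, d)]            # table value ln(1 + d*2^-i)
    return y                               # ~ ln(x0) once x ~ 1
```

In hardware the multiply by (1 + d 2⁻ⁱ) is a shift-and-add, and the ln(1 ± 2⁻ⁱ) terms come from a small table, matching the "Read from a table" note above.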
A Radix-4 Logarithm
Scale so digit selection is the same each iteration.
Read from a table
Radix-4 Logarithm Digit Selection
Initialization: u = 4(δx − 1), y = −ln δ
Range of convergence: δ = 2 → 1/2 ≤ x ≤ 5/8; δ = 1 → 5/8 ≤ x ≤ 1
A Clever Base-2 Log Method
For 1 ≤ x < 2: y = log₂ x = 0.y₋₁y₋₂…, i.e., x = 2^(0.y₋₁y₋₂…)
Step 1: Square x: x := x² = 2^(2y) = 2^(y₋₁.y₋₂…)
Step 2: If x ≥ 2, then x = 2^(1.y₋₂…), so y₋₁ = 1;
Step 3: divide x by 2: x := x/2 = 2^(0.y₋₂…). Else x = 2^(0.y₋₂…) < 2, so y₋₁ = 0.
Step 4: Go to step 1 and repeat to get the next digit.
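The four steps above translate almost literally into code (a sketch; in floating point the repeated squaring limits how many bits are trustworthy):

```python
# Base-2 log by repeated squaring, for x in [1, 2):
# squaring doubles y = log2(x); the integer bit that appears (x >= 2)
# is the next fraction bit of y, after which x is renormalized.
def log2_bits(x, k):
    bits = []
    for _ in range(k):
        x = x * x                  # step 1: square
        if x >= 2.0:               # step 2: next bit is 1
            bits.append(1)
            x /= 2.0               # step 3: renormalize into [1, 2)
        else:
            bits.append(0)         # else: next bit is 0
    return bits                    # step 4 is the loop itself
```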
Hardware for Clever Log2 Method
Generalization to Base-b Logarithms
Exponentiation
Invariants: x(i) = x + ln( y / y(i) ), y(i) = y e^( x − x(i) )
The terms ln(1 ± 2⁻ⁱ) are read from a table.
Same recurrence we used for the logarithm, with x and y switched.
Initialization: x(0) = x, y(0) = 1. As x(k) goes to 0, y(k) goes to e^x. Rate of convergence: one bit per iteration for large k.
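A sketch of this additive-normalization exponential with digits {0, 1} (the digit rule and input range here are illustrative assumptions):

```python
import math

# Additive normalization for e^x, x in [0, 0.86]: drive x to 0 by
# subtracting table entries ln(1 + 2^-i) while multiplying y by the
# matching factor (1 + 2^-i) -- a shift-and-add per iteration.
N = 40
LN1P = [math.log(1 + 2.0 ** -i) for i in range(1, N)]

def exp_add_norm(x):
    y = 1.0
    for i in range(1, N):
        if x >= LN1P[i - 1]:       # d_i = 1: the term still fits
            x -= LN1P[i - 1]
            y *= 1 + 2.0 ** -i     # shift-and-add update of y
    return y                       # ~ e^(x0) once x ~ 0
```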
Elimination of k/2 iterations • After k/2 iterations, ln(1 ± 2⁻ⁱ) ≈ ± 2⁻ⁱ • When x(j) = 0.00…0xx…x, ln(1 + x(j)) ≈ x(j), allowing us to perform the final computation step y(k) = y(j) × (1 + x(j))
• This last step combines k/2 iterations, but contains a "true" multiplication
Radix-4 ex
Read from a table
Scale so digit selection is the same each iteration.
Radix-4 Exponentiation Digit Selection
Initialization: u = 4(x − δ), y = e^δ
Range of convergence:
δ = −1/2 → x ≤ −1/4
δ = 0 → −1/4 ≤ x ≤ 1/4
δ = 1/2 → x ≥ 1/4
General Exponentiation: x^y = e^( y ln x )
1. Compute ln x
2. Multiply: y × ln x
3. Compute exp( y ln x )
General Approach to Division Additive Normalization Method
Scale so digit selection is the same each iteration.
Invariants:
Unscaled: s(i) = z − q(i) × d
Scaled: s(i) = z rⁱ − q(i) × d
Digit selection: γ(i) ≈ r ( rⁱ q* − q(i) ), where q* is an estimate of the quotient
General Approach to Square Root Multiplicative Normalization Method
Unscaled: Initialization: x(0) = y(0) = z. Invariant: z x(i) = y(i)².
Convergence: as x(i) goes to 1, y(i) goes to √z.
Scaled: Initialization: u(0) = z − 1, y(0) = z. Invariant: z ( 2⁻ⁱ u(i) + 1 ) = y(i)². Convergence: as 2⁻ⁱ u(i) goes to 0, y(i) goes to √z.
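A sketch of the multiplicative-normalization square root, maintaining the invariant z·x(i) = y(i)² (the greedy digit selection is an illustrative assumption):

```python
# Multiplicative normalization for sqrt(z), z in [0.5, 2):
# multiply x by c^2 and y by c, with c = 1 + d * 2^-i, so the invariant
# z * x = y * y always holds; as x -> 1, y -> sqrt(z).
def sqrt_mult_norm(z, n=40):
    x, y = z, z                       # x(0) = y(0) = z
    for i in range(1, n):
        for d in (-1, 1):             # greedy: apply if it moves x toward 1
            c = 1 + d * 2.0 ** -i
            if abs(x * c * c - 1) < abs(x - 1):
                x *= c * c
                y *= c
    return y
```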
General Approach to Square Root Additive Normalization Method
Unscaled: Initialization: x(0) = z, y(0) = 0. Invariant: x(i) = z − y(i)².
Convergence: as x(i) goes to 0, y(i) goes to √z.
Scaled: Initialization: u(0) = z, y(0) = 0. Invariant: 2⁻ⁱ u(i) = z − y(i)².
Convergence: as 2⁻ⁱ u(i) goes to 0, y(i) goes to √z.
General Computation Steps • Preprocessing steps: – Use identities to bring operand within appropriate ranges – Initialize recurrences • sometimes using approximation functions
• Processing steps: – Iterations
• Postprocessing steps – Compute final result – Normalization
Approximating Functions Taylor Series
Maclaurin Series (a = 0 )
Error Bound
Error Bound
Horner’s Method
Coefficients c(i) can be stored in a table, or computed on the fly from c(i-1). Ex: For Sin(x)
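Horner's rule and a table of Maclaurin coefficients for sin(x) can be sketched as:

```python
import math

def horner(coeffs, x):
    """Evaluate c[0] + c[1] x + ... + c[n] x^n with n multiplies, n adds."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

# Maclaurin coefficients of sin(x): x - x^3/3! + x^5/5! - x^7/7! + x^9/9!
SIN_C = [0.0, 1.0, 0.0, -1 / 6, 0.0, 1 / 120, 0.0, -1 / 5040, 0.0, 1 / 362880]
```

Truncating at x⁹ leaves an error below the x¹¹/11! term, so the approximation is good to roughly 10⁻¹¹ for |x| ≤ 0.5.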
Divide and Conquer Evaluation • Divide the input x into xH (the integer and high-order fraction bits) and xL (the low-order fraction bits)
• Write a Taylor series expansion about x = xH: f(x) = f(xH) + xL f′(xH) + ( xL² / 2 ) f″(xH) + …
• Approximate with the first two terms: f(x) ≈ f(xH) + xL f′(xH)
f(xH) and f′(xH) are obtained by table lookup
Rational Approximation
Merged Arithmetic • When very high performance is required, you can build hardware to evaluate nonelementary functions. – Higher speeds – Lower component count – Lower power
Example
Lecture 24 Arithmetic by Table Lookup
Uses of Lookup Tables • Digit Selection in high-radix arithmetic • Initial result approximation to speed-up iterative methods • Store CORDIC constants • Store polynomial coefficients • Can be mixed with logic in hybrid schemes
Advantages of Table Lookup
• Memory is much denser than logic
• Easy to lay out in VLSI
• Large memories are cheap and practical
• Design and testing are easy
• Flexibility for last-minute changes
• Any function can be implemented
• Can be error-encoded to make it more robust
Direct Table Lookup • Size – input operands = u bits total – output results = v bits total – table size = 2^u × v bits
• Flexible • Size is impractical in most cases – exponential growth in u dominates
• Best for unary (single-operand) functions
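For instance, a direct table for a unary function with u = v = 8 bits (a reciprocal table; the input/output scaling convention here is an assumption for illustration):

```python
# Direct lookup: 2^8 entries, one per input pattern. Entry i holds
# round(256 / x) for x = 1 + i/256, i.e. 1/x in fixed point scaled by 256.
RECIP = [round(256 / (1 + i / 256)) for i in range(256)]

def recip_lookup(i):
    """i encodes x = 1 + i/256 in [1, 2); returns ~256/x."""
    return RECIP[i]
```

Doubling u to 16 bits would already need 2¹⁶ entries, and a binary operation on two 16-bit inputs would need 2³², which is the exponential growth the slide warns about.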
Indirect Table Lookup • Hybrid scheme that uses preprocessing and postprocessing blocks to reduce the size of the table.
Binary to Unary Reduction • Idea: Evaluate a binary function using an auxiliary unary function • Indirect method – Preprocess: Convert binary operands to unary operand(s). – Table lookup – Postprocess: Convert unary results to binary result
Example 1: Log(x ± y)
Lz = log z = log( x ± y ) = log( x (1 ± y/x) ) = log x + log(1 ± y/x)
= log x + log(1 ± log⁻¹( log y − log x ))
= Lx + log(1 ± log⁻¹( Ly − Lx )) = Lx + φ( Ly − Lx )
Implement φ with a lookup table.
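A sketch of this binary-to-unary reduction for log-domain addition (base-2 logs; the table step size is an illustrative assumption):

```python
import math

# LNS-style addition: given Lx = log2 x and Ly = log2 y, compute
# log2(x + y) = Lx + phi(Ly - Lx), with phi(d) = log2(1 + 2^d)
# read from a table indexed by the quantized difference d <= 0.
STEP = 1 / 256
PHI = {k: math.log2(1 + 2.0 ** (k * STEP)) for k in range(-16 * 256, 1)}

def lns_add(lx, ly):
    lx, ly = max(lx, ly), min(lx, ly)   # ensure the difference is <= 0
    k = round((ly - lx) / STEP)
    if k < -16 * 256:                   # y negligible next to x
        return lx
    return lx + PHI[k]
```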
Example 2: Multiplication
xy = ¼ [ (x + y)² − (x − y)² ]
x, y are k-bit values → xy requires two (k+1)-bit squaring lookups
Observation: the least-significant bits of (x + y) and (x − y) are either both even or both odd.
¼ [ (x + y)² − (x − y)² ] = ⌊(x + y)/2⌋² − ⌊(x − y)/2⌋² + εy
where ε = 0 if x + y is even, ε = 1 if x + y is odd.
xy then requires two k-bit lookups and a three-operand addition.
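The squares-table identity, including the εy correction, can be checked directly in code (table width k = 8 is an illustrative choice):

```python
# Multiplication from a table of squares:
#   xy = floor((x+y)/2)^2 - floor((x-y)/2)^2 + (y if x+y is odd else 0)
K = 8
SQUARES = [v * v for v in range(2 ** K)]       # k-bit squaring table

def mul_by_squares(x, y):
    if x < y:
        x, y = y, x                            # multiplication commutes
    s, t = x + y, x - y                        # s and t have the same parity
    p = SQUARES[s // 2] - SQUARES[t // 2]
    if s % 2 == 1:                             # both odd: add correction
        p += y
    return p
```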
Further Reductions for Squaring • The two low-order bits of x² are 00 (x even) or 01 (x odd), so the next-to-lowest bit is always 0 and need not be stored in the table. • Split table approach – the operand is split and two smaller tables are used – the combined size of the two split tables is less than that of a single table – the results of the two tables are combined to form the final result.
Tables in Bit-Serial Arithmetic Example: Autoregressive Filter
Define the Lookup Table
LSB-First Bit Serial Filter
Interpolating Memory • Instead of programming a table f(x) for all possible values of an operand x … • only program the table for a small set of values of x at regular intervals (power-of-2 interval sizes are best) • To evaluate the function f(x): – read the table at the end-points of the interval that contains x, and interpolate between them to approximate f(x)
Interpolation: for xlo ≤ x ≤ xhi, if f(xlo) and f(xhi) are known, then f(x) may be approximated:
f(x) ≈ f(xlo) + ( x − xlo ) · ( f(xhi) − f(xlo) ) / ( xhi − xlo )
= f(xlo) + Δx · ( f(xhi) − f(xlo) ) / ( xhi − xlo )
= a + b · Δx
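A sketch of an interpolating memory for log₂, with 16 power-of-2 intervals over [1, 2):

```python
import math

# Interpolating memory: store f at 2^H + 1 breakpoints, then evaluate
# f(x) ~ a + b * dx inside the interval selected by the high bits of x.
H = 4
F = [math.log2(1 + k / 2 ** H) for k in range(2 ** H + 1)]

def log2_interp(x):                     # 1 <= x < 2
    t = (x - 1) * 2 ** H
    k = int(t)                          # interval index (high-order bits)
    dx = t - k                          # offset within interval, in [0, 1)
    return F[k] + dx * (F[k + 1] - F[k])
```

With power-of-2 intervals, the interval index and the offset Δx are just bit fields of x, so no division is needed to locate the breakpoints.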
Example: Log₂ over 1 ≤ x ≤ 2
f(xlo) = log₂ 1 = 0, f(xhi) = log₂ 2 = 1
log₂ x ≈ x − 1 = the fractional part of x
Maximum absolute error: 0.086071
Maximum relative error: 0.061476
Improved Linear Approximation • Choose a line that minimizes worst case absolute or relative error. • May not be exact at endpoints
Improved Linear Approximation
log₂ x ≈ 0.043036 + Δx, where Δx = x − 1
Maximum absolute error: 0.043036
Half the error of the previous approximation, but still not good enough …
More Improvement • Two choices: – Go to two or more linear intervals – Use one interval, but with 2nd degree polynomial interpolation: f (xk + ∆x) ≈ ak + bk ∆x + ck ∆x2 Program xk → ak, bk, ck in lookup table.
• Both result in larger table, but which is best?
Log2 with 4 Linear Intervals
Trade-Offs in Cost, Speed, and Accuracy • Interpolation using – h bits of x – a degree-m polynomial – requires a table of (m + 1) × 2^h entries
• As m increases, complexity increases, speed goes down. • It is seldom cost effective to go beyond second-degree interpolation.
Maximum Absolute Error in Computing Log2
Piecewise Lookup Tables • Evaluation based on table lookups using fragments of the operands. • An indirect table-lookup method.
Example 1: IEEE Floating Point
x: | t (8 bits) | u (6 bits) | v (6 bits) | w (6 bits) |
The IEEE 26-bit (2.24) significand is split into four fields.
Taylor Series
Method of Evaluation
Example 2: Modular Reduction
Compute z mod p, where z is a b-bit number and p is a d-bit modulus (d < b).
Divide and Conquer Modular Reduction
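A sketch of the divide-and-conquer reduction (the modulus, chunk width, and chunk count below are illustrative assumptions):

```python
# Divide-and-conquer z mod p: split b-bit z into g-bit chunks, look up
# (chunk << g*j) mod p for each position j, then reduce the small sum.
P, G, CHUNKS = 13, 8, 4                       # supports z < 2^32
TABLES = [[(c << (G * j)) % P for c in range(2 ** G)]
          for j in range(CHUNKS)]

def mod_dc(z):
    total = 0
    for j in range(CHUNKS):
        total += TABLES[j][(z >> (G * j)) & (2 ** G - 1)]
    return total % P                          # final reduction of partial sums
```

The per-chunk lookups are independent, so they can all be read in parallel and combined with a small adder tree before the final reduction.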
Alternate Method: Successive Refinement