Lecture 1 Numbers and Arithmetic
Front Page Computer Arithmetic • March 1994 – Thomas Nicely, a mathematician at Lynchburg College, Virginia – found that the Pentium processor did not match other processors in computing 1/p + 1/(p+2)
• October 1994 – Convinced that the Pentium was at fault – Exchanged results with other researchers – Posted results on the Internet
1
The Diagnosis • Tim Coe, engineer at Vitesse Semiconductor – built a model of the Pentium’s floating-point division hardware based on the radix-4 SRT algorithm – diagnosed the problem: – 4,195,835 / 3,145,727 = 1.333 820 44 – but on the Pentium it came out as 1.333 739 06 – (accurate to only 14 bits)
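The flawed quotient is easy to reproduce as a sanity check. This Python sketch (an illustration, not the original test) compares the correct quotient with the value the defective Pentium returned:

```python
# Coe's test case: 4,195,835 / 3,145,727
correct = 4195835 / 3145727      # 1.33382044... on a correct divider
pentium = 1.33373906             # what the flawed Pentium FDIV returned

print(f"correct: {correct:.8f}")
print(f"flawed:  {pentium:.8f}")
print(f"error:   {correct - pentium:.2e}")   # wrong beyond about 14 bits

# Equivalent integer check: x - (x / y) * y should round to exactly 0
assert round(4195835 - correct * 3145727) == 0
```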
Intel’s Response • Dismissed the severity of the problem • Admitted to a “subtle flaw” • Claimed a probability of 1 in 9 billion (once every 27,000 years) for the average user • Published a white paper describing the problem • Announced a replacement policy: – replacement of the defective part based on customer need; customers had to show that their applications required correct arithmetic
2
Customer Response • Heavy criticism from customers – Lots of bad press – On-line criticism
• Intel revised its policy: no-questions-asked replacement • First instance of arithmetic becoming front-page news
Moral • Glaring software faults have become routine (ref. Microsoft), but … • Hardware bugs are rare, not tolerated, and newsworthy • Computer arithmetic is important
3
What is computer arithmetic? • Major field in computer architecture • Implementation of arithmetic functions – Arithmetic algorithms for software or firmware. – Hardware algorithms – High-speed circuits for computation
Applications of Computer Arithmetic • Design of top-of-the-line CPUs • High-performance arithmetic circuits • Designs for embedded application-specific circuits • Arithmetic algorithms for software • Understanding what went wrong in the Pentium …
4
Numbers and their Encodings • Number representations have advanced in parallel with the evolution of language – Use of sticks and stones – Grouping of sticks into groups of 5 or 10. – Symbolic forms
Roman Numeral System • 1, 5, 10, 50, 100, 500, 1000 = I, V, X, L, C, D, M • Problems – not suitable for representing large numbers – difficult to do arithmetic with
5
Positional Number Systems • First used by Chinese • Value of a symbol depends on where it is. • Ex: 222 = 200 + 20 + 2 – Each symbol “2” has a different value
Fixed-Radix System • Each position is worth a constant multiple R of the position to its right: … Δ Δ Δ, where each Δ position is R times larger than the Δ to its right
• binary = Positional, Fixed Radix R=2 • decimal = Positional, Fixed Radix R=10
6
Mixed-Radix System • A radix vector gives the weights • Ex: time intervals: days : hours : minutes : seconds – a day is 24 hours, an hour is 60 minutes, a minute is 60 seconds
R = [0 24 60 60]
Digital Systems • Numbers are encoded using 0’s and 1’s • Suppose a system has 4 bits ⇒ 16 codes • You are free to assign the 16 codes to numbers as you please. Examples:
– Binary ⇒ [0, 15]
– Signed-magnitude ⇒ [-7, 7], 0 encoded twice
– 2’s complement ⇒ [-8, 7]
– 3.1 fixed point ⇒ [0, 7.5]
7
Fixed-Radix Positional Number Systems

$(x_{k-1} x_{k-2} \cdots x_0 \,.\, x_{-1} x_{-2} \cdots x_{-l})_r = \sum_{i=-l}^{k-1} x_i r^i$

(Error on page 8)

r is the radix; each $x_i$ is a digit; $\{0, 1, \ldots, r-1\}$ is the implicit digit set.
k.l digits: k digits for the whole part, l digits for the fractional part; “.” is the radix point.
Example: Balanced Ternary System

r = 3, digit set = $\{\bar1, 0, 1\}$, where $\bar1 = -1$

$\ldots,\ \bar1 1,\ \bar1,\ 0,\ 1,\ 1\bar1,\ 10,\ 11,\ 1\bar1\bar1,\ 1\bar1 0,\ \ldots$  (the representations of −2 through 6)
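These digit strings can be checked mechanically. A minimal Python sketch (function names are my own):

```python
def bt_value(digits):
    """Value of a balanced-ternary digit list (MSD first), digits in {-1, 0, 1}."""
    v = 0
    for d in digits:
        assert d in (-1, 0, 1)
        v = 3 * v + d
    return v

def to_bt(n):
    """Convert an integer to balanced ternary (MSD-first digit list)."""
    if n == 0:
        return [0]
    digits = []
    while n != 0:
        r = n % 3            # 0, 1, or 2
        if r == 2:           # 2 = 3 - 1: emit digit -1 and carry 1
            digits.append(-1)
            n = n // 3 + 1
        else:
            digits.append(r)
            n //= 3
    return digits[::-1]

print(to_bt(5))              # [1, -1, -1], since 9 - 3 - 1 = 5
print(bt_value([1, -1, 0]))  # 6, since 9 - 3 = 6
```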
8
Example: Redundant Signed-Digit Radix-4

r = 4, digit set = $\{\bar2, \bar1, 0, 1, 2\}$

$5_{decimal} = 11$; $6_{decimal} = 12 = 2\bar2$  (redundant: two representations)
Other Fancy Radices
• Negative radix systems
• Fractional radix systems
• Irrational radix systems
• Complex radix systems
– see examples in the book
9
How many digits are needed? To represent the natural numbers in [0, max] in radix r with digit set [0, r−1] requires k digits $x_{k-1} x_{k-2} \cdots x_0$, where

$max = r^k - 1$, so $k = \lfloor \log_r max \rfloor + 1 = \lceil \log_r(max + 1) \rceil$
Fixed-point Numbers Radix r and digit set [0, r−1], with k whole and l fractional digits:

$r^{-l} = ulp$, the unit in the least (significant) position
$max = r^k - r^{-l}$

Example (binary): $max = (1111.11)_2 = 2^4 - 2^{-2} = 15.75$
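Both formulas are easy to verify numerically; this small sketch (illustrative names) computes k, ulp, and max:

```python
def digits_needed(max_val, r):
    """Smallest k with r**k - 1 >= max_val (= floor(log_r max_val) + 1)."""
    k = 1
    while r ** k - 1 < max_val:
        k += 1
    return k

def fixed_point_params(r, k, l):
    """(ulp, max) for a k.l-digit radix-r fixed-point number, digit set [0, r-1]."""
    ulp = r ** -l
    return ulp, r ** k - ulp

assert digits_needed(15, 2) == 4       # 1111 covers [0, 15]
assert digits_needed(16, 2) == 5       # one more value needs a fifth bit
assert fixed_point_params(2, 4, 2) == (0.25, 15.75)
```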
10
Number Radix Conversion Assume that the unsigned value u has exact representations in both radices r and R:

$u = w.v$  (a fixed-point number)
$= (x_{k-1} x_{k-2} \cdots x_0 \,.\, x_{-1} x_{-2} \cdots x_{-l})_r$
$= (X_{K-1} X_{K-2} \cdots X_0 \,.\, X_{-1} X_{-2} \cdots X_{-L})_R$

Conversion Problem: given r (the old radix), R (the new radix), and the $x_i$’s (digits in radix r that represent u), find the $X_i$’s (digits in radix R that represent u).
11
Method #1 • Use when radix-r arithmetic is easier than radix-R arithmetic • Convert the whole part using repeated division (in radix r) by R; the remainders are the digits X0, X1, … • Convert the fractional part using repeated multiplication (in radix r) by R; the whole parts at each step are the digits X-1, X-2, …
Method #2 • Use when radix-R arithmetic is easier than radix-r arithmetic • Convert the whole part using Horner’s method: $u_k = r\,u_{k+1} + x_k$ • Convert the fractional part by converting $r^l v$ using the above method, and then dividing by $r^l$.
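Both methods can be sketched in a few lines of Python (ordinary integers stand in for "arithmetic in the easier radix"; function names are illustrative):

```python
def convert_whole(u, R):
    """Method #1 for the whole part: repeated division by the new radix R;
    the remainders come out as X0, X1, ... (least significant first)."""
    if u == 0:
        return [0]
    X = []
    while u > 0:
        u, rem = divmod(u, R)
        X.append(rem)
    return X[::-1]                 # most significant digit first

def horner(digits, r):
    """Method #2 for the whole part: Horner's rule u = r*u + x_k, evaluating
    the radix-r digits (MSD first) with arithmetic in the new radix."""
    u = 0
    for x in digits:
        u = r * u + x
    return u

assert convert_whole(0xB6, 8) == [2, 6, 6]    # 0xB6 = 182 = octal 266
assert horner([2, 6, 6], 8) == 182
```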
12
Shortcuts • When the old and new radices are integral powers of a common base b, that is, $r = b^g$ and $R = b^G$, conversion can be done with no computation, using table lookup. • Example: hex to octal
13
Lecture 2 Representing Signed Numbers
Lecture 2 • In lecture 1, we talked about natural numbers [0…max] – often referred to as unsigned numbers
• In this lecture, we will talk about signed numbers – include both positive and negative values
1
Signed-Magnitude Representation • One bit (MSB) is devoted to the sign. – By convention, 1 = negative, 0 = positive
• k−1 bits are available to represent the magnitude • Range of a k-bit signed-magnitude number:

$[-(2^{k-1} - 1),\ 2^{k-1} - 1]$
Signed-magnitude Representation
2
Signed-Magnitude Representation • Advantages – intuitive appeal & conceptual simplicity – symmetric range – simple negation
• Disadvantages – fewer numbers encoded (two encodings for 0) – subtraction is more complicated than in 2’s-comp
Circuit for Sign-Magnitude Arithmetic
3
Biased Representations • Signed numbers are converted into unsigned numbers by adding a constant bias:

$\underbrace{[-bias,\ max - bias]}_{\text{signed number}} + bias = \underbrace{[0,\ max]}_{\text{unsigned number}}$

• Such an encoding is sometimes referred to as excess-bias coding. • Excess-127 and excess-1023 codes are used for exponents in IEEE floating point.
Biased Representations Example: excess-8 coding
4
Biased Representations • Do not lend themselves to simple arithmetic: with $x_{bias} = x + bias$ and $y_{bias} = y + bias$,

$(x + y)_{bias} = x_{bias} + y_{bias} - bias$
$(x - y)_{bias} = x_{bias} - y_{bias} + bias$

• Multiplication and division performed directly on biased numbers is very difficult.
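The two correction rules can be checked with an excess-8 toy example (a sketch; names are my own):

```python
BIAS = 8  # excess-8 coding on 4 bits: signed range [-8, 7] -> codes [0, 15]

def encode(x):
    return x + BIAS

def add_biased(xb, yb):
    return xb + yb - BIAS    # the bias is counted twice; remove one copy

def sub_biased(xb, yb):
    return xb - yb + BIAS    # the biases cancel; add one copy back

assert add_biased(encode(3), encode(-5)) == encode(-2)
assert sub_biased(encode(-5), encode(3)) == encode(-8)
```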
Converting biased numbers

$x_{bias} = x + bias \;\Rightarrow\; x = x_{bias} - bias$

$x = \sum_{i=0}^{k-1} x_i 2^i - \sum_{i=0}^{k-1} b_i 2^i = \sum_{i:\,b_i=0} x_i 2^i + \sum_{i:\,b_i=1} (x_i - 1) 2^i$

That is, position i contributes $x_i$ if $b_i = 0$ and $x_i - 1$ if $b_i = 1$: positions whose bias bit is 0 are “+” positions with digit set {0, 1}; positions whose bias bit is 1 are “−” positions with digit set {−1, 0}.

+ and − digits:
x_i   + value   − value
0     0         −1
1     1         0
5
Converting biased numbers Example

+ and − digits:
x_i   + value   − value
0     0         −1
1     1         0

$(10010010)_{biased}$, with $bias = (11000011)_{base\text{-}2} = 195$

The bias bits mark the positions as − − + + + + − −, so

$1\,0\,0\,1\,0\,0\,1\,0 = 0 \cdot 2^7 - 1 \cdot 2^6 + 0 \cdot 2^5 + 1 \cdot 2^4 + \cdots - 1 \cdot 2^0 = -49$

check: 146 − 195 = −49
Complement Representations • Like biased representation, except: – only negative numbers are biased – the bias M is large enough to bias negative numbers above the positive number range

$x_{comp} = x$ for $x \ge 0$; $x_{comp} = M + x$ for $x < 0$ (i.e., $x_{comp} - M = x$)

• To represent numbers in the range [−N, +P], M ≥ N + P + 1 (equality gives maximum coding efficiency)
6
Complement Representation
Complement Representations • Subtraction is performed by – complementing the subtrahend – performing addition modulo-M
• Addition and subtraction are essentially the same operation. • This is the primary advantage of complement representations.
7
Complement Arithmetic • Two auxiliary operations are required to do complement arithmetic: – complementation (change of sign):

$(-x)_{M\text{-}comp} = M - x$
$(-(-x))_{M\text{-}comp} = M - (M - x) = x$

– computation of residues mod M:

$(x + M) \bmod M = x$
Addition of Complement Signed Numbers
8
Radix-complement • For a k-digit, radix-r number • $M = r^k$ • The auxiliary functions become: – modulo-M reduction: ignore the carry-out from digit position k−1 – complement of x (M−x): replace each digit $x_i$ with $r-1-x_i$ and add ulp (particularly simple if r is a power of 2).
Digit-complement (or Diminished-radix-complement) • For a k-digit, radix-r number (possibly with fractional digits as well) • $M = r^k - ulp$ ($= r^k - r^{-l}$ for a k.l-digit number) • The auxiliary functions become: – modulo-M reduction: add the carry-out from digit position k−1 back into the result – complement of x (M−x): replace each digit $x_i$ with $r-1-x_i$ (particularly simple if r is a power of 2).
9
Two’s-complement • Radix-complement with radix = 2 • Complementation constant $M = 2^k$ for a k-digit binary number • Range = $[-2^{k-1},\ 2^{k-1} - ulp]$ • The name “2’s complement” comes from the case k = 1, where M = 2.
2’s-complement Representation
10
Finding the two’s-complement (negation) of a number

$2^k - x = \bigl((2^k - ulp) - x\bigr) + ulp = (11\cdots1.1\cdots11 - x) + ulp = \bar{x} + ulp$

Negate: complement each bit and add one ulp. Because of the slightly asymmetric range, negation may lead to overflow!
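A sketch of the rule for 8-bit words, including the one asymmetric-range case where negation overflows:

```python
MASK = 0xFF   # 8-bit words (k = 8, ulp = 1)

def negate(x):
    """Two's-complement negation: complement each bit, then add one ulp."""
    return ((x ^ MASK) + 1) & MASK

def to_signed(x):
    return x - 256 if x >= 128 else x

assert to_signed(negate(5)) == -5
assert to_signed(negate(negate(7))) == 7
assert to_signed(negate(128)) == -128   # negating -128 overflows back to -128
```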
Two’s-complement Addition and Subtraction • To add numbers modulo $2^k$, simply drop the carry-out from bit position k−1. This carry is worth $2^k$. • To subtract, complement the subtrahend, then add, and drop the carry-out from bit position k−1.
11
Add/Sub Circuit for 2’s-complement Arithmetic
This method of negation never overflows.
Two’s-complement Sign Extension • To extend a k.l-digit number to k'.l' digits, the complementation constant increases from $M = 2^k$ to $M' = 2^{k'}$ • The difference of the two constants, $M' - M = 2^{k'} - 2^k = 2^k(2^{k'-k} - 1)$, must be added to the representation of any negative number. This is equal to (k'−k) 1’s followed by k 0’s.
12
One’s-complement • Digit-complement with radix = 2 • Complementation constant $M = 2^k - ulp$ for a k-digit binary number • Range = $[-(2^{k-1} - ulp),\ 2^{k-1} - ulp]$
1’s-complement Representation
13
Finding the one’s-complement (negation) of a number

$(2^k - ulp) - x = 11\cdots1.1\cdots11 - x = \bar{x}$

Negate: complement each bit. Because of the symmetric range, negation cannot overflow.
One’s-complement Addition and Subtraction • To add numbers modulo $2^k - ulp$, simply drop the carry-out from bit position k−1 and simultaneously insert a carry into position −l. The net effect is to reduce the result by $2^k - ulp$. This is known as end-around carry.
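The end-around carry rule can be sketched for 8-bit one’s-complement words (helper names are illustrative):

```python
K = 8
MASK = (1 << K) - 1          # all 1's: the representation of -0

def enc(v):
    """Encode a small signed value in 8-bit one's complement."""
    return v if v >= 0 else MASK ^ (-v)

def add_ones(x, y):
    """One's-complement addition with end-around carry."""
    s = x + y
    if s > MASK:             # carry out of position k-1:
        s = (s & MASK) + 1   # drop it, re-insert it at the least position
    return s

assert add_ones(enc(5), enc(-3)) == enc(2)
assert add_ones(enc(-5), enc(3)) == enc(-2)
```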
14
Oops • End-around carry does not reduce (mod $2^k - ulp$) a result that equals $2^k - ulp$ – no carry is generated
• However, $2^k - ulp$ = all 1’s = $(-0)_{1's\text{-}comp}$ • If it were reduced (mod $2^k - ulp$), it would be reduced to 0. • But −0 = 0, so does it matter?
One’s-complement Sign Extension • To extend a k.l-digit number to k'.l' digits, the complementation constant increases from $M = 2^k - ulp$ to $M' = 2^{k'} - ulp'$ • This leads to the rule that a one’s-complement number must be sign-extended on both ends.
15
Comparing radix- and digitcomplement Systems
Indirect Signed Arithmetic • If you only have hardware that does unsigned arithmetic and you want to do signed arithmetic, you can: – convert the signed operands to unsigned operands – obtain a result based on the unsigned operands – convert the result back to the signed representation
• This is called indirect arithmetic
16
Direct vs. Indirect Operations on Signed Numbers
Using Signed Positions or Signed Digits • The value of a two’s-complement number can be found using the standard binary-to-decimal conversion, except that the weight of the MSB (sign bit) is taken to be negative ($-2^{k-1}$)
17
Another way to Interpret 2’s-complement Numbers

$(10100110)_{2's\text{-}comp} = (\bar1 0100110)_{radix\text{-}2}$

Interpretation of Two’s-complement Numbers

$(x_{k-1} x_{k-2} \cdots x_0 \,.\, x_{-1} x_{-2} \cdots x_{-l})_{base\text{-}2} = \sum_{i=-l}^{k-1} x_i 2^i$

$(x_{k-1} x_{k-2} \cdots x_0 \,.\, x_{-1} x_{-2} \cdots x_{-l})_{2's\text{-}comp} = -x_{k-1} 2^{k-1} + \sum_{i=-l}^{k-2} x_i 2^i$
18
Generalization • Assign negative weights to an arbitrary subset of the k+l digit positions in a radix-r number, and positive weights to the other positions. • A negative-weight digit is in the set {−1, 0} • A positive-weight digit is in the set {0, 1}
More Generalization • Any set [−α, β] of – r or more consecutive integers (α+β+1 ≥ r) – that includes 0
can be used as the digit set for radix r. • If α+β+1 > r, then the number system is redundant, and ρ = α+β+1−r is the redundancy index.
19
Converting to another Digit Set

$\{0,1,2,3\}_{radix\text{-}4} \to \{\bar1,0,1,2\}_{radix\text{-}4}$:
0 → 0 0
1 → 0 1
2 → 0 2
3 → 1 $\bar1$
Converting to another Digit Set

$\{0,1,2,3\}_{radix\text{-}4} \to \{\bar2,\bar1,0,1,2\}_{radix\text{-}4}$:
0 → 0 0
1 → 0 1
2 → 0 2 or 1 $\bar2$
3 → 1 $\bar1$

Transfers do not propagate, and thus this conversion is always carry-free.
20
Lecture 3 Redundant Number Systems
Addition • Addition is the basic building block of all arithmetic operations • If addition is slow, then all other operations suffer in speed or cost. • Carry propagation is either – slow ( O(n) for carry ripple ), or – expensive ( carry lookahead, etc. )
1
Coping with the Carry Problem • Limit carry propagation to within a small number of bits • Detect the end of propagation rather than wait for the worst-case time • Speed up propagation using lookahead or other techniques • Ideal: eliminate carry propagation altogether!
Eliminating Carry Propagation • Can numbers be represented in such a way that addition does not require carry propagation ?? • Decimal [0,18]
But, a second addition could cause problems . . .
2
Limiting Carry Propagation • Consider adding 2 numbers with digit set [0,18] • Interval arithmetic: [0,18] + [0,18] = [0,36] = 10·[0,2] + [0,16] • Carry: [0,16] + [0,2] = [0,18] – No additional carries are generated – A carry propagates only one position – This is referred to as carry-free addition
Carry-Free Addition
How much redundancy in the digit set is needed to enable carry-free addition ?
3
[0,18] is more than you need: Carry-Free using [0,11]
The Key to Carry-Free Addition • Redundant representations provide multiple encodings for some numbers • Interim sums + transfer carries fit into the digit set and do not propagate carries • Single-stage propagation is eliminated with a simple lookahead scheme: – $s_i = f(x_i, y_i, x_{i-1}, y_{i-1})$ – no carry
4
Carry-Free Addition
Redundancy in Computer Arithmetic • Redundancy is used extensively for speeding up arithmetic operations. • First example introduced in 1959: Carry-save addition
5
Carry-Ripple Addition

[Figure: multi-operand addition with ripple-carry rows of full-adder (FA) cells; operands A3..A0, B3..B0, C3..C0, D3..D0 feed successive rows, carries ripple through each row, producing sum bits S5..S0]
Carry-Save Addition

[Figure: the same operands added with carry-save rows of FA cells used as (3,2) counters; each row reduces three numbers to a sum word and a carry word with no carry propagation, and only the final row ripples carries to produce S5..S0]
6
Carry-Save Numbers Digit : Representation 0 : (0,0) 1 : (0,1) or (1,0) 2 : (1,1)
Carry-save Addition
Digit Set Conversion
[0,2] + [0,1] + [0,1] = 2*[0,1] + [0,2]
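A carry-save stage is just a bitwise full adder applied to whole words; one ordinary addition at the end finishes the job. A minimal sketch:

```python
def carry_save_add(x, y, z):
    """Reduce three binary numbers to two (sum, carry) with no carry
    propagation: each bit position is an independent (3,2) counter."""
    s = x ^ y ^ z                              # bitwise sum
    c = ((x & y) | (x & z) | (y & z)) << 1     # bitwise majority, shifted left
    return s, c

s, c = carry_save_add(0b1011, 0b0110, 0b1101)
assert s + c == 0b1011 + 0b0110 + 0b1101       # a single add completes the sum
```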
7
Digit Sets and Digit-set Conversion: converting a radix-r number with digit set [−λ, µ] (λ+µ+1 ≥ r) to the same radix with digit set [−α, β] (α+β+1 ≥ r) is essentially a digit-serial process, like carry propagation, that begins at the right and ripples left.
[0, 18] → [0, 9]
8
[0, 2] → [0, 1]
[0, 18] → [-6, 5]
9
[0, 2] → [-1, 1]
Generalized Signed-digit Numbers • A digit set need not be the standard [0, r−1] • Radix-2 [−1, 1] was first proposed in 1897. • Avizienis (1961) defined the class of signed-digit number systems with symmetric digit sets [−α, α] in radix r > 2, with

$r/2 + 1 \le \alpha \le r - 1$
10
Categorizing Digit Sets • GSD [−α, β]: Generalized Signed-digit – symmetric: α = β; asymmetric: α ≠ β – minimal: ρ = α+β+1−r = 1 (minimal redundancy)
• OSD [−α, α]: Ordinary Signed-digit (Avizienis) • BSC [0, 2], radix 2: Binary Stored-Carry • BSD [−1, 1], radix 2: Binary Signed-Digit
A Taxonomy of Positional Number Systems
11
Encodings for [-1, 1] • To represent a number in binary, each of the α+β+1 digit values must be encoded in bits.
– There are many possible encodings
– Encoding efficiency = (total number of different numbers represented) / 2^bits
Hybrid Signed-digit • Redundancy in select positions only. BSD = [-1,1] B = [0,1]
Worst case propagation is 3 stages
12
GSD Carry-free Addition Algorithm • Compute the position sums: $p_i = x_i + y_i$ • Separate each $p_i$ into a transfer $t_{i+1}$ and an interim sum $w_i$ such that $p_i = r\,t_{i+1} + w_i$ • Add the incoming transfers to obtain the sum digits $s_i = w_i + t_i$, with no new transfer generated.
Conditions for Carry-free Addition • $t_i$ is from digit set [−λ, µ] • $s_i$ is from digit set [−α, β] • To ensure $s_i = w_i + t_i$ generates no new transfer, we need $-\alpha + \lambda \le w_i \le \beta - \mu$, where the interim sum is $w_i = p_i - r\,t_{i+1}$ • This can be shown to hold if λ ≥ α/(r−1) and µ ≥ β/(r−1)
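The three steps above can be sketched for radix 10 with the redundant digit set [0, 18] used earlier; the particular transfer-selection rule below is one valid choice satisfying the conditions, not the only one:

```python
def carry_free_add(x, y, r=10):
    """Carry-free addition of radix-10 numbers with redundant digit set [0, 18].
    Digit lists are least-significant first.  Each position sum p = x_i + y_i
    (in [0, 36]) is split as p = r*t + w with transfer t in [0, 2] and interim
    sum w in [0, 16], so the final digit w + t_in stays in [0, 18]."""
    n = max(len(x), len(y))
    x = x + [0] * (n - len(x))
    y = y + [0] * (n - len(y))
    t_in = 0
    s = []
    for i in range(n):
        p = x[i] + y[i]
        t = max(0, (p - 7) // r)   # transfer selection (one valid choice)
        w = p - r * t              # interim sum, in [0, 16]
        s.append(w + t_in)         # no new transfer can be generated
        t_in = t
    s.append(t_in)
    return s

def value(digits, r=10):
    return sum(d * r**i for i, d in enumerate(digits))

a, b = [8, 15, 3], [9, 12, 11]     # digits in [0, 18], LSD first
total = carry_free_add(a, b)
assert value(total) == value(a) + value(b)
assert all(0 <= d <= 18 for d in total)
```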
13
Selection of Transfer Digit • The value $p_i$ is compared to a set of selection constants $C_j$, $-\lambda \le j \le \mu + 1$ • If $C_j \le p_i < C_{j+1}$ then $t_{i+1} = j$
Selecting the Transfer Value

Example: r = 10, digit set [−5, 9], so $p_i \in [-10, 18]$. The conditions λ ≥ 5/9 and µ ≥ 1 give $t_i \in [-\lambda, \mu] = [-1, 1]$, interim sums $w_i \in [-\alpha + \lambda, \beta - \mu] = [-4, 8]$, and $s_i \in [-\alpha, \beta] = [-5, 9]$.

[Figure: the range $p_i \in [-10, 18]$ divided by selection constants $C_0$ and $C_1$ into regions with $t_{i+1} = -1$, 0, and 1; $w_i = p_i - 10\,t_{i+1}$]
14
How are Selection Constants chosen ?
Adding with radix-10, [-5, 9]
15
How much redundancy is needed? • Carry-free addition is possible iff one of the following sets of conditions is satisfied:
– r > 2, ρ ≥ 3
– r > 2, ρ = 2, α ≠ 1, β ≠ 1
• Does not work for:
– r = 2
– ρ = 1
– ρ = 2, α = 1 or β = 1
Limited-carry algorithm for GSD numbers • Use when carry-free algorithms do not exist
16
Implementations of Limited-carry Addition
Example (BSD, radix 2): position sums $p_i \in [-2, 2]$, transfers $t_{i+1} \in [-1, 1]$.

[Table: choice of $(t_{i+1}, w_i)$ for each $p_i$, depending on whether the estimate $e_i$ is “low” [−1, 0] or “high” [0, 1]; entries include (−1,0), (−1,1), (0,−1), (0,0), (0,1), (1,−1), (1,0)]
17
[Table: a second example with position sums $p_i \in [0, 6]$ and transfers $t_{i+1} \in [0, 3]$; $(t_{i+1}, w_i)$ is chosen according to whether the estimate $e_i$ is “low” [0, 2] or “high” [1, 3]; entries include (0,0), (0,1), (1,−1), (1,0), (1,1), (2,−1), (2,0), (2,1), (3,−1), (3,0)]
18
Conversions • Outside world is binary or decimal • To do GSD arithmetic internally, conversions are required. • Conversion is done at input and output.
19
Example: Conversion of BSD to Signed Binary
Support Functions • Zero detection – used for equality testing
• Sign test – used for relational comparison (< , ≤ , ≥ , >)
• Overflow handling
20
Zero Test • Zero may have multiple representations • If α
Sign Test • The sign of a GSD number generally depends on all of its digits. • In general, sign test is: – slow if done by carry propagation, or – expensive if done by fast lookahead
• If α
21
Overflow • Detection is difficult in GSD arithmetic. Even when the outgoing transfer $t_k \ne 0$, the result might still be representable in k digits.
True overflow detection is possible, but slow.
Difficulties with GSD • Difficulties with sign test and overflow detection can nullify some, or all of the speed advantages of GSD number representations. • Applications of GSD are presently limited to special-purpose systems, or to internal number representations.
22
Lecture 4 Residue Number Systems
RNS Representation and Arithmetic • Given:
– x mod 7 = 2
– x mod 5 = 3
– x mod 3 = 2
• What is x?
• $(2\,|\,3\,|\,2)_{RNS(7|5|3)}$ = ?
1
Residue Number Systems (RNS) • $X = (x_{k-1}\,|\,\ldots\,|\,x_1\,|\,x_0)$ • A positional number system, with different weights for each position. • Position weights are determined by mutually prime moduli $m_{k-1}, \ldots, m_1, m_0$ • $m_{k-1} > \ldots > m_1 > m_0$ • $x_i = X \bmod m_i = |X|_{m_i}$, with $x_i \in [0, m_i - 1]$ • Dynamic range: $M = m_{k-1} \times \cdots \times m_1 \times m_0$ – the number of distinct values that can be represented.
Default RNS • RNS(8 | 7 | 5 | 3) • M = 8 × 7 × 5 × 3 = 840 – 840 is the total number of distinct values that can be represented with RNS(8|7|5|3): – [0, 839] unsigned, or – [−420, 419] signed, or – any interval of 840 consecutive integers
2
Some examples
– ( 0 | 0 | 0 | 0 ) = 0 or 840 or …
– ( 1 | 1 | 1 | 1 ) = 1 or 841 or …
– ( 2 | 2 | 2 | 2 ) = 2 or 842 or …
– ( 0 | 1 | 3 | 2 ) = 8 or 848 or …
– ( 5 | 0 | 1 | 0 ) = 21 or 861 or …
– ( 0 | 1 | 4 | 1 ) = 64 or 904 or …
– ( 2 | 0 | 0 | 2 ) = −70 or 770 or …
– ( 7 | 6 | 4 | 2 ) = −1 or 839 or …
Negative RNS Numbers • $|M|_{m_i} = 0$ • $|-x|_{m_i} = |M - x|_{m_i}$ • Given the RNS representation of x, the representation of −x can be found by complementing each of the digits $x_i$ with respect to its modulus $m_i$ (0 digits remain unchanged).
– 21 = $(5\,|\,0\,|\,1\,|\,0)_{RNS(8|7|5|3)}$
– −21 = ( 8−5 | 0 | 5−1 | 0 ) = ( 3 | 0 | 4 | 0 )
3
Converting RNS to Decimal • Any RNS can be viewed as a weighted positional representation. • For RNS(8|7|5|3) the weights associated with the four positions are: 105, 120, 336, 280 • Example:

$(1|2|4|0)_{RNS} = (105 \times 1 + 120 \times 2 + 336 \times 4 + 280 \times 0) \bmod 840 = 1689 \bmod 840 = 9$
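The position weights can be derived and used directly; a sketch using Python’s modular inverse (`pow(x, -1, m)`, available in Python 3.8+):

```python
from math import prod

MODULI = (8, 7, 5, 3)
M = prod(MODULI)                      # dynamic range, 840

# Position weights: w_i = (M/m_i) * |(M/m_i)^-1|_{m_i}
WEIGHTS = [(M // m) * pow(M // m, -1, m) for m in MODULI]

def rns_to_int(digits):
    """Decode an RNS number as a weighted sum of its residues, mod M."""
    return sum(w * d for w, d in zip(WEIGHTS, digits)) % M

assert WEIGHTS == [105, 120, 336, 280]
assert rns_to_int((1, 2, 4, 0)) == 9
```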
Representational Efficiency • Each digit must be encoded in binary • RNS ( 8 | 7 | 5 | 3 ) requires 3 + 3 + 3 + 2 = 11 bits • Representational efficiency = 840/2048 = 41%
4
RNS Arithmetic • Negation, Addition, Subtraction, and Multiplication can be performed independently by operating on each digit individually.
Advantages of RNS Arithmetic • No carry problem • Digits are small – digit operations can easily be done with lookup tables – with 6-bit residue digits, each operation requires a 4K × 6 table • Fast and simple
5
Disadvantages of RNS Arithmetic • Division, sign test, magnitude comparison, and overflow detection are difficult and complex. • These difficulties have thus far limited the application of RNS to certain signal processing problems where – addition and multiplication are the predominant operations, and – results are within known ranges
A distant light ? • Developments in recent years by Hung and Parhami (1994) have greatly reduced the cost for division and sign detection. • May lead to more widespread application of RNS in the future.
6
Choosing the RNS Moduli • The set of moduli chosen affects: – representational efficiency – complexity of the arithmetic algorithms • The magnitude of the largest modulus dictates the speed of arithmetic operations – so make all the moduli comparable in size to the largest one; this doesn’t change the speed of arithmetic • Moduli must be mutually prime
Example: $[0, 100000]_{10}$ • Normally requires 17 bits to represent • Choose mutually prime moduli until the dynamic range M > 100000:
– RNS(13|11|7|5|3|2), M = 30030 • too small
– RNS(17|13|11|7|5|3|2), M = 510510 • 5.1 times too big, in fact • so remove the 5
– RNS(17|13|11|7|3|2), M = 102102
7
Example, continued
– RNS(17|13|11|7|3|2), M = 102102 • Bits = 5+4+4+3+2+1 = 19 • Speed dictated by the 5-bit residues • Combine moduli 2 & 13 and 3 & 7 with no speed penalty:
– RNS(26|21|17|11) • still needs 5+5+5+4 = 19 bits • but two fewer modules
Another Approach • Better results can be obtained if we proceed as before, but include powers of smaller primes:
– RNS($2^2$|3), M = 12
– RNS($3^2$|$2^3$|7|5), M = 2520
– RNS(11|$3^2$|$2^3$|7|5), M = 27720
– RNS(13|11|$3^2$|$2^3$|7|5), M = 360360 • 3.6 times too large; replace 9 with 3, combine 3 & 5
– RNS(15|13|11|$2^3$|7), M = 120120
• 4+4+4+3+3 = 18 bits – fewer bits than before • faster because the largest residue is 4 bits instead of 5
8
Low-cost Moduli • $2^k$ moduli simplify the required arithmetic operations (particularly the mod operation) – mod-16 is easier than mod-13 • $2^k - 1$ moduli are also easy – a k-bit adder with end-around carry • $2^a - 1$ and $2^b - 1$ are relatively prime iff a and b are relatively prime • k-modulus system: $RNS(2^{a_{k-2}} \,|\, 2^{a_{k-2}} - 1 \,|\, \cdots \,|\, 2^{a_1} - 1 \,|\, 2^{a_0} - 1)$ with $a_{k-2} > \cdots > a_1 > a_0$ mutually prime
Try it on [0, 100000] • $RNS(2^5 \,|\, 2^5 - 1 \,|\, 2^4 - 1 \,|\, 2^3 - 1)$, basis: 5, 4, 3
– RNS(32|31|15|7), M = 104160 • 5+5+4+3 = 17 bits, efficiency ≈ 100% • provably > 50% efficiency in the worst case (no more than 1 extra bit)
– largest residue = 5 bits • but the power of 2 makes it simple
– best choice yet for [0, 100000]
9
Choosing the Moduli Summary • In general, restricting moduli to low-cost moduli tends to increase the width of the largest residues. • The optimal choice depends on both: – the application, and – the target implementation technology
Encoding and Decoding Numbers • Binary to RNS: for $y = (y_{k-1} \cdots y_1 y_0)_{two}$,

$x_i = |y|_{m_i} = \big|\, |y_{k-1} 2^{k-1}|_{m_i} + \cdots + |y_1 2^1|_{m_i} + |y_0|_{m_i} \,\big|_{m_i}$

– precompute and store $|2^j|_{m_i}$ for each $m_i$
– the residue $x_i = |y|_{m_i} = (y \bmod m_i)$ is then computed by modulo-$m_i$ addition of the selected stored constants
10
Precomputed Residues of $2^0, 2^1, \ldots, 2^9$ for RNS(8|7|5|3)
Why are the residues $|2^j|_8$ not shown?
Convert $164_{10}$ to RNS(8|7|5|3)
11
Conversion from RNS to Mixed-radix Form • Associated with any RNS($m_{k-1}|\ldots|m_2|m_1|m_0$) is a mixed-radix number system MRS($m_{k-1}|\ldots|m_2|m_1|m_0$), which is essentially – a k-digit positional number system with weights: $(m_{k-2} \cdots m_2 m_1 m_0)$, …, $(m_2 m_1 m_0)$, $(m_1 m_0)$, $(m_0)$, $(1)$ – and digit sets: $[0, m_{k-1}-1]$, …, $[0, m_2-1]$, $[0, m_1-1]$, $[0, m_0-1]$ (the digit sets span the same ranges as the RNS digits, but the digits themselves are different)
Example • MRS(8|7|5|3) has position weights: 7×5×3 = 105, 5×3 = 15, 3, 1 • $(2|3|4|1)_{MRS} = 2 \times 105 + 3 \times 15 + 4 \times 3 + 1 = 210 + 45 + 12 + 1 = 268$
12
Conversion from RNS to Mixed-radix Form • The RNS-to-MRS conversion is to find the digits $z_i$ of the MRS, given the digits $x_i$ of the RNS:

$y = (x_{k-1} | \cdots | x_2 | x_1 | x_0)_{RNS} = (z_{k-1} | \cdots | z_2 | z_1 | z_0)_{MRS}$
$= z_{k-1}(m_{k-2} \cdots m_2 m_1 m_0) + \cdots + z_2(m_1 m_0) + z_1(m_0) + z_0$

• To find each digit:
– $x_0 = |y|_{m_0} = z_0$, since every other term is a multiple of $m_0$
– subtract $z_0$ from both sides: $y' = y - z_0$
– divide both sides by $m_0$: $y'' = y'/m_0$
– repeat the process on $y''$ to get the next digit
RNS Arithmetic • To compute $y' = (x'_{k-1} | \cdots | x'_1 | x'_0)_{RNS} = y - z_0$:
– $z_0 = (z_0 | \cdots | z_0 | z_0)_{RNS}$, so $x'_j = |x_j - z_0|_{m_j}$
• To compute $y'' = (x''_{k-1} | \cdots | x''_1 | x''_0)_{RNS} = y'/m_0$:
– much easier than general division; called scaling
– for each position find the multiplicative inverse $i_j$ of $m_0$ with respect to $m_j$, such that $|i_j \times m_0|_{m_j} = 1$
– with $i = (i_{k-1} | \cdots | i_1 | i_0)_{RNS}$: $x''_j = |i_j \times x'_j|_{m_j}$, i.e., $y'' = y'/m_0 = i \times y'$
13
Example: $y = (0|6|3|0)_{RNS}$

After Conversion • The mixed-radix representation allows us to: – compare the magnitudes of two RNS numbers – detect the sign of a number
• $(0|6|3|0)_{RNS}$ ? $(5|3|0|0)_{RNS}$ — convert to MRS — $(0|3|1|0)_{MRS}$ ? $(0|3|0|0)_{MRS}$; using ordinary comparison, $(0|3|1|0)_{MRS} > (0|3|0|0)_{MRS}$
14
Conversion from RNS to Binary/Decimal • Method #1: RNS → MRS → decimal/binary • Method #2 (direct): RNS → decimal/binary using RNS position weights computed via the Chinese remainder theorem (CRT)
Example: (3|2|4|2)RNS • Consider conversion of y=(3|2|4|2)RNS to decimal. Based on RNS properties:
15
Example: (3|2|4|2)RNS • Knowing the values of the following four constants (the RNS position weights) would allow us to convert any number from RNS(8|7|5|3) to decimal using four multiplications and three additions.
• Thus,
How are the weights derived? • $w_3 = (1|0|0|0)_{RNS} = 105$? • Since the last three residues are 0’s, $w_3$ is divisible by 3, 5, and 7. Hence it is a multiple of 105. • We must pick the right multiple of 105 such that its residue with respect to 8 is 1: $|n \times 105|_8 = 1$; for $w_3$, n = 1
16
Chinese Remainder Theorem

$x_3 = 3$, $m_3 = 8$
$M_3 = M/m_3 = 840/8 = 105$
$\alpha_3 = |M_3^{-1}|_{m_3} = 1$, since $|105 \times 1|_8 = 1$
contribution: $|M_3\,\alpha_3\,x_3|_M = |105 \times 1 \times 3|_{840} = 315$

To avoid multiplication in the conversion process, we can store premultiplied constants. Conversion is then performed by doing only table lookups and modulo-M additions.
17
Difficult Arithmetic Operations • Sign test • Magnitude comparison • Overflow detection – the above three are essentially the same problem – two methods: • convert to MRS or binary and compare • do approximate CRT decoding and compare
• General division – discussed in chapters 14 and 15
Redundant RNS representation • Example: modulus m = 13 • Normal digit set [0,12] (4 bits) • Redundant digit set [0,15] (still 4 bits) – residues 0, 1, 2 have the redundant representations 13, 14, 15 respectively, since • 0 mod 13 = 13 mod 13 • 1 mod 13 = 14 mod 13 • 2 mod 13 = 15 mod 13
– modulo addition is done by a 4-bit adder • a carry-out causes 3 to be added to the result as an adjustment (since 16 mod 13 = 3)
18
Limits of Fast Arithmetic in RNS • Addition of binary numbers in the range [0, M−1] can be done in: – O(log log M) time – O(log M) cost (using carry lookahead, etc.) • Addition of low-cost residue numbers: – O(log log M) time – O(log M) cost (using carry lookahead, etc.) • Asymptotically, RNS offers little advantage over standard binary
19
Lecture 5 Basic Addition and Counting
Half Adders and Full Adders Basic building blocks for arithmetic circuits

Half Adder — Inputs: x, y. Outputs: $s = x \oplus y$, $c = x \cdot y$

Full Adder — Inputs: x, y, $c_{in}$. Outputs: $s = x \oplus y \oplus c_{in}$, $c_{out} = x \cdot y + (x + y) \cdot c_{in}$

A full adder is also called a (3,2) counter.

[Figures: HA and FA block symbols]
1
Full Adder

[Figure: a full adder built from two half adders — the first HA adds x and y; the second HA adds that sum to $c_{in}$; $c_{out}$ is the OR of the two HA carry outputs]
Mixed Positive and Negative Binary Full Adders

+ digit set = {0, 1}; − digit set = {−1, 0}

Digit encodings (excess-1 for − digits): a + digit encodes its value directly (0 → 0, 1 → 1); a − digit is encoded as value + 1 (−1 → 0, 0 → 1)

[Figure: full adders with various +/− labels on the x, y, carry, and sum lines]
2
More Mixed Binary Full Adders

[Figure: the remaining input/output sign combinations of +/− digit sets on the same FA cell]

Amazingly, all are done using the same full adder.
Mixed Binary Additions

[Figure: columns of FA cells adding numbers whose positions carry various +/− digit-set labels]

Propagate the digit sets. You can add any combination of +/− to any other combination of +/−.
3
Converting Two’s Comp to +/−

A two’s-complement number has position weights $-2^{k-1}, 2^{k-2}, \ldots, 2^0$ — that is, it is already a mixed +/− digit-set number with a single − position at the MSB: − + … +
Half Adder — 18 transistors (inputs x, y; outputs s, c)
4
Full Adder — 36 transistors (inputs x, y, $c_{in}$; outputs s, $c_{out}$)

[Figure: CMOS full adder circuit]
5
Ripple Carry Adder

[Figure: a 4-bit ripple-carry adder; FA cells take $x_i, y_i$ and a chained carry ($c_{in}$ at the right, $c_{out}$ at the left) and produce $s_3 \ldots s_0$; the worst-case delay path runs through the entire carry chain]
Serial Addition
6
Conditions, Flags, and Exceptions
• Overflow – the output cannot be represented in the format of the result
• Sign – 1 if the result is negative, 0 if the result is positive
• Zero – the result is zero
Signed Overflow • $\text{Overflow}_{two's\text{-}comp}$ = sign of result is wrong

$= x_{k-1}\,y_{k-1}\,\bar{s}_{k-1} + \bar{x}_{k-1}\,\bar{y}_{k-1}\,s_{k-1}$

when $c_{k-1} = 1$: the first term is 0 and the second equals $\bar{c}_k$
when $c_{k-1} = 0$: the first term equals $c_k$ and the second is 0

$= c_k \cdot \bar{c}_{k-1} + \bar{c}_k \cdot c_{k-1} = c_k \oplus c_{k-1}$
7
Unsigned Overflow • $\text{Overflow}_{unsigned}$ = carry out of the last stage = $c_k$
Sign

$\text{Sign}_{signed}$ = 0 when positive, 1 when negative
$= s_{k-1}$ when Overflow = 0
$= \bar{s}_{k-1}$ when Overflow = 1
$= s_{k-1} \oplus \text{Overflow} = s_{k-1} \oplus c_k \oplus c_{k-1}$

$\text{Sign}_{unsigned} = 0$ (always positive!)
8
Zero

$\text{Zero} = \bar{s}_{k-1} \cdot \bar{s}_{k-2} \cdots \bar{s}_0$ (both signed and unsigned)

Implemented with a k-input NOR gate.
What’s wrong with the diagram from the book ?
9
Flag Summary

Flag       Signed                                          Unsigned
Overflow   $c_k \oplus c_{k-1}$                            $c_k$
Sign       $s_{k-1} \oplus c_k \oplus c_{k-1}$             0
Zero       $\bar{s}_{k-1} \bar{s}_{k-2} \cdots \bar{s}_0$  $\bar{s}_{k-1} \bar{s}_{k-2} \cdots \bar{s}_0$
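The flag definitions can be exercised with plain integers; here is an illustrative 8-bit sketch (function name and packing are my own, not a particular adder design):

```python
K = 8  # word width

def add_with_flags(x, y):
    """Add two k-bit operands; derive the flags from c_k (carry out of
    position k-1) and c_{k-1} (carry into position k-1)."""
    s = (x + y) & 0xFF
    c_k = (x + y) >> K & 1
    c_k1 = ((x & 0x7F) + (y & 0x7F)) >> (K - 1) & 1
    overflow_signed = c_k ^ c_k1
    overflow_unsigned = c_k
    sign_signed = (s >> (K - 1)) ^ overflow_signed
    zero = int(s == 0)
    return s, overflow_signed, overflow_unsigned, sign_signed, zero

assert add_with_flags(100, 100)[1] == 1   # +100 + +100 overflows signed 8-bit
assert add_with_flags(255, 1)[2] == 1     # unsigned overflow (carry out)
assert add_with_flags(255, 1)[4] == 1     # the result wraps to zero
```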
Analysis of Carry Propagation

[Figure: probability distribution of carry-chain length, from 0 to k; the average carry-propagation length is short, the average worst-case length is about log₂k, and the absolute worst case is k]

Asynchronous circuits must wait for the average worst case; synchronous circuits must wait for the absolute worst case.
10
Carry Completion Detection • For asynchronous arithmetic (not useful for synchronous arithmetic) • Carry completion detection gives a done signal when the carry chain has settled • Average time ∝ log₂k • Two-rail encoding $(b_i, c_i)$:
– 00: carry not known yet
– 01: carry known to be 1
– 10: carry known to be 0
Carry Completion Detection
All $b_i$ and $c_i$ start at 0.
11
Speeding Up Addition: Making Low-Latency Carry Chains • From the point of view of carry propagation, computation of the sum is not important. At each position a carry is either:
– generated: $x_i + y_i \ge r$
– propagated: $x_i + y_i = r - 1$, or
– annihilated: $x_i + y_i < r - 1$

For binary: $g_i = x_i y_i$, $p_i = x_i \oplus y_i$

$c_{i+1} = g_i + p_i \cdot c_i$  (the “carry recurrence”)
$s_i = p_i \oplus c_i$

Propagation of Carry:

$c_{i+1} = g_i + p_i c_i = g_i(1 + c_i) + p_i c_i = g_i + (g_i + p_i) c_i = g_i + t_i c_i$, where $t_i = x_i + y_i$ (logical OR)
12
Propagation of Inverse Carry ci +1 = g i + pi ci = g i ⋅ ( pi + ci ) = g i pi + g i ci = ai + (ai + pi ) ⋅ ci = ai + pi ci
Manchester Carry Chains

time(i) ∝ i²  (the RC delay of a pass-transistor chain grows quadratically with its length, so long chains are broken into short segments)
Lecture 6 Carry-Lookahead Adders
Unrolling the Carry Recurrence

c_1 = g_0 + p_0 c_0
c_2 = g_1 + p_1 c_1 = g_1 + p_1 (g_0 + p_0 c_0) = g_1 + p_1 g_0 + p_1 p_0 c_0
c_3 = g_2 + p_2 c_2 = g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c_0
c_4 = g_3 + p_3 c_3 = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0 + p_3 p_2 p_1 p_0 c_0
⋮
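The unrolled form of c_4 is just the recurrence flattened into two-level logic. A small Python check that the flattened sum-of-products agrees with the bit-by-bit recurrence (function names are made up for illustration):

```python
def carry_recurrence(g, p, c0):
    """Carries from the recurrence c_{i+1} = g_i + p_i * c_i."""
    c = [c0]
    for gi, pi in zip(g, p):
        c.append(gi | (pi & c[-1]))
    return c

def c4_flat(g, p, c0):
    """c_4 as a single two-level sum-of-products (the fully unrolled form)."""
    return (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
            | (p[3] & p[2] & p[1] & g[0])
            | (p[3] & p[2] & p[1] & p[0] & c0))
```

An exhaustive check over all 512 input combinations confirms the two forms are identical.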
4-bit Full Carry Lookahead
HP Carry Lookahead Circuit
Alternatives to Full Carry Lookahead • Full carry lookahead is impractical for wide addition • Tree Networks – less circuitry than full lookahead at the expense of increased latency
• Kinds of Tree Networks – High-radix addition (radix must be power of 2) – Multi-level carry lookahead (technique most used in practice)
4-bit Propagate & Generate

g_[i,i+3] = g_{i+3} + g_{i+2} p_{i+3} + g_{i+1} p_{i+2} p_{i+3} + g_i p_{i+1} p_{i+2} p_{i+3}
p_[i,i+3] = p_i p_{i+1} p_{i+2} p_{i+3}
4-bit Lookahead Carry Generator
c_{i+4} = g_[i,i+3] + p_[i,i+3] · c_i
4-bit Lookahead Carry Generator
General Propagate & Generate

For i < j < k, the blocks [i, j−1] and [j, k−1] combine into the block [i, k−1]:

g_[i,k−1] = g_[j,k−1] + p_[j,k−1] · g_[i,j−1]
p_[i,k−1] = p_[i,j−1] · p_[j,k−1]

Lookahead Carry Generator
16-bit Carry Chain with 2-level Carry Lookahead
What is the worst case delay path ?
Worst Case Latency
• Producing the g and p for individual bit positions (1 gate delay)
• Producing the g and p signals for 4-bit blocks (2 gate delays)
• Predicting the carry-in signals c4, c8, c12 for the blocks (2 gate delays)
• Predicting the internal carries within each 4-bit block (2 gate delays)
• Computing the sum bits (2 gate delays)
Worst Case Latency
• The delay of a k-bit carry-lookahead adder based on 4-bit lookahead blocks is:

    T = 4 log4 k + 1  gate delays
Final cout = c_k
• The last carry is not used to compute any sum bits.
• It is needed in many situations - overflow computation, for example.
• Three ways to compute it:
    c_k = g_[0,k−1] + p_[0,k−1] c_0
    c_k = g_{k−1} + p_{k−1} c_{k−1}
    c_k = x_{k−1} y_{k−1} + s'_{k−1} (x_{k−1} + y_{k−1})
64-bit Carry Lookahead Adder
Ling Adder [1981]

c_i = g_{i−1} + c_{i−1} p_{i−1} = g_{i−1} + c_{i−1} t_{i−1}
    = g_{i−1} + g_{i−2} t_{i−1} + g_{i−3} t_{i−2} t_{i−1} + g_{i−4} t_{i−3} t_{i−2} t_{i−1} + c_{i−4} t_{i−4} t_{i−3} t_{i−2} t_{i−1}

Ling's idea was to propagate h_i = c_i + c_{i−1} instead of c_i:

h_i = g_{i−1} + g_{i−2} + g_{i−3} t_{i−2} + g_{i−4} t_{i−3} t_{i−2} + h_{i−4} t_{i−4} t_{i−3} t_{i−2}

The carry chain is somewhat simpler; however, the sum equation is slightly more complex:

s_i = (t_i ⊕ h_{i+1}) + h_i g_i t_{i−1}
Parallel Prefix Computations

The "parallel prefix problem" is:
Given:
  1. inputs x_0, x_1, x_2, …, x_{k−1}, and
  2. an associative (but not necessarily commutative) operator +
Compute all prefixes:
  x_0,  x_0 + x_1,  x_0 + x_1 + x_2,  …,  x_0 + x_1 + ⋯ + x_{k−1}

Carry Computation is a Parallel Prefix Computation

Inputs:   (g_0, p_0), (g_1, p_1), (g_2, p_2), …, (g_{k−1}, p_{k−1})
Operator: ¢, with the more significant block on the left:
          (g′, p′) ¢ (g″, p″) = (g′ + g″ · p′, p′ · p″)

Compute:
  (g_[0,0], p_[0,0]) = (g_0, p_0)
  (g_[0,1], p_[0,1]) = (g_1, p_1) ¢ (g_0, p_0)
  ⋮
  (g_[0,k−1], p_[0,k−1]) = (g_{k−1}, p_{k−1}) ¢ ⋯ ¢ (g_1, p_1) ¢ (g_0, p_0)
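The ¢ operator can be exercised directly in software. A minimal Python sketch, assuming binary generate g_i = x_i y_i and propagate p_i = x_i ⊕ y_i; here `combine` (a made-up name) takes the less-significant block first:

```python
from functools import reduce

def combine(low, high):
    """The carry operator: merge a block's (g, p) with the adjacent higher block."""
    gl, pl = low
    gh, ph = high
    return (gh | (ph & gl), ph & pl)

def all_carries(x, y, k, c0=0):
    """Prefix-scan the per-bit (g, p) pairs to get every carry c_1 .. c_k."""
    gp = [(((x >> i) & (y >> i)) & 1, ((x >> i) ^ (y >> i)) & 1) for i in range(k)]
    carries = [c0]
    for i in range(k):
        g, p = reduce(combine, gp[:i + 1])   # (g_[0,i], p_[0,i])
        carries.append(g | (p & c0))
    return carries
```

A parallel prefix network computes the same reductions, but shares intermediate blocks instead of redoing each prefix from scratch.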
Combining (g, p) of Overlapping Blocks
(g, p) Networks • Any design for a parallel prefix problem can be adapted to a carry computation network. • Pairs of inputs can be combined in any way (re-associated according to the associative property) to compute block (g, p) signals. • (g, p) signals have additional flexibility: overlapping blocks can be combined.
Recursive Prefix Sum Network
Divide and Conquer I
Brent-Kung Parallel Prefix Network
Divide and Conquer II
Brent-Kung Parallel Prefix
Brent-Kung Parallel Prefix Network
Kogge-Stone Parallel Prefix Network
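A Kogge-Stone scan is short to express in code: at span d, every position i combines its running (g, p) with that of position i − d, so all prefixes finish in ⌈log2 k⌉ levels. Illustrative Python, not a hardware description:

```python
def kogge_stone_carries(g, p, c0=0):
    """All carries via a Kogge-Stone parallel prefix scan over (g, p) pairs."""
    k = len(g)
    G, P = list(g), list(p)
    d = 1
    while d < k:
        G2, P2 = G[:], P[:]
        for i in range(d, k):
            # combine position i with the block ending d positions lower
            G2[i] = G[i] | (P[i] & G[i - d])
            P2[i] = P[i] & P[i - d]
        G, P = G2, P2
        d *= 2
    # (G[i], P[i]) now spans the block [0, i]
    return [c0] + [G[i] | (P[i] & c0) for i in range(k)]
```

The minimum-depth property comes at the cost of k log2 k − k + 1 cells and dense wiring, as the comparison table below shows.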
Hybrid Brent-Kung / Kogge-Stone
Network Comparisons

Network              Max Delay         Cost                 Fan-Out
Divide & Conquer I   log2 k            (k/2) log2 k         High
Brent-Kung           2 log2 k − 1      2k − 2 − log2 k      Low
Kogge-Stone          log2 k            k log2 k − k + 1     Low
Hybrid B-K / K-S     log2 k + 1        (k/2) log2 k         Low

Cost is not a good estimate of silicon area for these networks; regularity and interconnect are large factors.
MCC on Am29050 Lookahead for 64-bit, radix-256 addition
Level 1 MCC (not shown on block diagram)
Level 2,3 MCC
Lecture 7 Variations in Fast Adders
Simple Carry-Skip Adders
Simplifying Assumptions
• One skip delay (cin to cout) equals one ripple delay (cin to cout).
• Total k-bit ripple delay is k × the single-bit delay.

These assumptions may not hold in real life (in a CMOS implementation, for example).

Worst Case Delay

b = fixed block width (ex: 4);  k = number of bits (ex: 16)

T_delay = (b − 1)  +   0.5    +  (k/b − 2)  +  (b − 1)
          [block 0]  [OR gate]    [skips]      [last block]

        ≈ 2b + k/b − 3.5 stages   (ex: 8.5)
What is the optimal block size?

1. Set dT/db = 0
2. Solve for b = b_opt:

b_opt = sqrt(k/2)
t_opt = number of blocks = k / b_opt = sqrt(2k)
T_opt = 2·sqrt(2k) − 3.5

Can we do better?
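The optimum formulas are easy to tabulate. A small Python sketch (delays in unit stages, as assumed above; the function name is illustrative):

```python
import math

def fixed_skip_params(k):
    """Optimal fixed-block carry-skip parameters for a k-bit adder:
    b_opt = sqrt(k/2), t_opt = sqrt(2k), T_opt = 2*sqrt(2k) - 3.5."""
    b_opt = math.sqrt(k / 2)
    t_opt = math.sqrt(2 * k)
    T_opt = 2 * math.sqrt(2 * k) - 3.5
    return b_opt, t_opt, T_opt
```

For k = 32 this gives b_opt = 4 bits per block, t_opt = 8 blocks, and T_opt = 12.5 stages.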
Path (1) is one delay longer than Path (2) → block t−2 can be one bit wider than block t−1.
Path (1) is one delay longer than Path (3) → block 1 can be one bit wider than block 0.
Variable Block-Width Carry-Skip Adders

Optimal block widths:  b, b+1, …, b + t/2 − 1, b + t/2 − 1, …, b+1, b

b + (b+1) + ⋯ + (b + t/2 − 1) + (b + t/2 − 1) + ⋯ + (b+1) + b = k
→ b = (k/t) − (t/4) + 1/2

Optimal number of blocks:

T_delay = 2(b − 1)      +   0.5    +   (t − 2)     =  2k/t + t/2 − 2.5
          [first + last]  [OR gate]   [skip stages]

1. Set dT/dt = 0
2. Solve for t = t_opt:

t_opt = 2·sqrt(k)
b_opt = 1/2 ≈ 1 at the end blocks (0 and t−1), growing to t_opt/2 = sqrt(k) in the middle
T_opt = 2·sqrt(k) − 2.5
Comparison

         Fixed-width Carry-Skip    Variable-width Carry-Skip
t_opt    sqrt(2k)                  2·sqrt(k)
b_opt    sqrt(k/2)                 1 … sqrt(k) … 1
T_opt    2·sqrt(2k) − 3.5          2·sqrt(k) − 2.5

Conclusion: the variable-width design is faster by a factor of about sqrt(2) ≈ 1.4.
Multilevel Carry-Skip Adders One-level carry-skip adder
Two-level carry-skip adder
Notice simplifications in diagramming conventions
Multilevel Carry-Skip Adders • Allow carry to skip over several level-1 skip blocks at once. • Level-2 propagate is AND of level-1 propagates. • Assumptions: – OR gate is no delay (insignificant delay) – Basic delay = Skip delay = Ripple delay = Propagate Computation = Sum Computation
Simplifying the Circuit It doesn’t save any time to skip short carry-chains (1-2 cells long)
optimized
Build the Widest Single-Level Carry-Skip Adder with 8 Delays

Block widths, limited by input timing on the low end and by output timing on the high end: 1, 3, 4, 4, 3, 2, 1

Width = 1 + 3 + 4 + 4 + 3 + 2 + 1 = 18 bits
Build the Widest Two-Level Carry-Skip Adder with 8 Delays

First, we need a new notation for a block:  {β, α}
  T_produce    ≤ β   (the block's carry-out is ready by time β)
  T_assimilate ≤ α   (the block can absorb its carry-in up to time α)
  block width γ = min(β − 1, α)
8-delay, 2-level, continued 1. Find {β,α} for level two
Initial Timing Constraint, Level 2
8-delay, 2-level, continued 2. Given {β,α} for level two, derive level one
Generalization • Chan et al. [1992] relax assumptions to include general worst-case delays: • I(b) Internal carry-propagate delay for the block • G(b) Carry-generate delay for the block • A(b) Carry-assimilate delay for the block
• Used dynamic programming to obtain optimal configuration
Carry-Select Adders
Carry-select: Carried one step further
Two-Level Carry-Select Adder
Can be pipelined
Compare to Two-Level G-P Adder
Cannot be pipelined as drawn
Conditional Sum Adder
• The process that led to the two-level carry-select adder can be continued . . .
• A logarithmic-time conditional-sum adder results if we proceed to the extreme:
  - single-bit adders at the top
• A conditional-sum adder is actually a (log2 k)-level carry-select adder
Cost and Delay of a Conditional-Sum Adder
More exact analysis gives actual cost =
Top-level block for one bit position of a conditional-sum adder C(1) and T(1) are the cost and time delay of this circuit.
Conditional-Sum Example
Hybrid Adder Designs
• Hybrids are obtained by combining elements of:
  - Ripple-carry adders
  - Carry-lookahead (generate-propagate) adders
  - Carry-skip adders
  - Carry-select adders
  - Conditional-sum adders
• You can obtain adders with – higher performance – greater cost-effectiveness – lower power consumption
Example 1: Carry-Select / Carry-Lookahead
• One- and two-level carry-select adders are essentially hybrids, since the top-level k/2- or k/4-bit adders can be of any type.
• Often combined with carry-lookahead adders.
Example 2 Carry-Lookahead/Carry-Select
Example 3 Multilevel Carry-Lookahead/Carry-Select
to Carry - Select Adders Can be pipelined
Example 4 Ripple-Carry/Carry-Lookahead
Simple and modular
Example 5: Carry-Lookahead / Conditional-Sum
• Reduces the fan-out required to control the muxes at the lower level (a drawback of wide conditional-sum adders).
• Use conditional-sum addition in smaller blocks, but form the inter-block carries using carry-lookahead.
Open Questions • Application requirements may shift the balance in favor of a particular hybrid design. • What combinations are useful for: – low power addition – addition on an FPGA
Optimizations in Fast Adders
• It is often possible to reduce the delay of adders (including hybrids) by optimizing block widths.
• The exact optimal configuration is highly technology dependent.
• Designs that minimize or regularize the interconnect may actually be more cost-effective than a design with low gate count.
Other Optimizations • Assumption: all inputs are available at time zero. • But, sometimes that is not true: – I/O arrive/depart serially, or – Different arrival times are associated with input digits, or – Different production times are associated with output digits.
• Example: Addition of partial products in a multiplier.
Lecture 8 Multioperand Addition
Uses of Multioperand Addition • Multiplication – partial products are formed and must be added
• Inner-product computation (Dot Product, Convolution, FIR filter, IIR filter, etc.) – terms must be added
“Dot Notation”
• Useful when the positioning or alignment of the bits, rather than their values, is important.
  - Each dot represents a digit in a positional number system.
  - Dots in the same column have the same positional weight.
  - The rightmost column is the least significant position.
Serial Multioperand Addition
Operands x(0), x(1), …, x(n−1) are shifted in, one per clock cycle.
The final sum can be as large as n(2^k − 1).
The partial-sum register must be ⌈log2(n·2^k − n + 1)⌉ ≈ k + log2 n bits wide.
Pipelined Serial Addition
Binary Adder Tree
Ripple-carry might deliver better times than carry-lookahead !?
Analysis of Ripple-Carry Tree Adder
Whereas, for carry-lookahead adders ….
Can we do better?
where kn is the total number of input bits.
The minimum is achievable with …. (next slide please)
Carry-Save Adders

Ripple-carry: reduces 2 numbers to their sum.
Carry-save:   reduces 3 numbers to 2 numbers with the same sum.
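On binary words the (3; 2) reduction is one level of XORs plus one level of majority gates. A minimal Python sketch:

```python
def csa(a, b, c):
    """(3; 2) carry-save step: three numbers -> (sum word, carry word)."""
    s = a ^ b ^ c                                # per-bit sum, no carry propagation
    carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, shifted one column left
    return s, carry
```

The two outputs can feed further CSA levels or a final carry-propagate adder; their sum always equals the sum of the three inputs.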
More “Dot Notation”
Carry-Save Adder Tree

A carry-save adder tree can reduce n binary numbers to two numbers having the same sum in O(log n) levels.

(The final conversion to a single result assumes a fast, logarithmic-time carry-propagate adder.)
Tabular Form Dot Notation Form
Adding seven 6-bit numbers
Seven Input Wallace Tree In general, an n-input Wallace tree reduces its k-bit inputs to two outputs.
Analysis of Wallace Trees
• The smallest height h(n) of an n-input Wallace tree satisfies the recurrence
    h(n) = 1 + h(⌈2n/3⌉),
  with solution h(n) ≈ log1.5(n/2).
• The number of inputs n(h) that can be reduced to two outputs by an h-level tree satisfies
    n(h) = ⌊3·n(h−1)/2⌋,  n(0) = 2,
  which is bounded by 2·(3/2)^(h−1) < n(h) ≤ 2·(3/2)^h.
Max number of inputs n(h) for an h-level tree
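The table of n(h) follows directly from the recurrence. Illustrative Python:

```python
def wallace_capacity(hmax):
    """n(h) for h = 0 .. hmax, from n(h) = floor(3*n(h-1)/2) with n(0) = 2."""
    n = [2]
    for _ in range(hmax):
        n.append(3 * n[-1] // 2)
    return n
```

The first entries are 2, 3, 4, 6, 9, 13, 19, …; for example, seven operands need h = 4 levels, since n(3) = 6 < 7 ≤ n(4) = 9.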
Wallace Tree
• Reduce the number of operands at the earliest opportunity.
• If there are m dots in a column, apply ⌊m/3⌋ full adders to that column.
• Tends to minimize overall delay by making the final CPA as short as possible.
Dadda Trees • Reduce the number of operands in the tree to the next lower n(h) number in the table using the fewest FA’s and HA’s possible. • Reduces the hardware cost without increasing the number of levels in the tree.
Dadda Tree for 7-input 6-bit addition
Taking advantage of the carry-in of the final CPA stage
Parallel Counters
• Receives n inputs.
• Counts the number of 1s among the n inputs.
• Outputs a ⌈log2(n + 1)⌉-bit number.
• Reduces n dots in the same bit position to ⌈log2(n + 1)⌉ dots in different positions.
Parallel Counters

  • •           • • •
  ---            ---
  • •            • •
(2, 2) counter  (3, 2) counter
    = HA            = FA

  • • • • • • •       • • • • • • • • • •
  -------------       -------------------
      • • •                • • • •
 (7, 3) counter        (10, 4) counter
Generalized Parallel Counters
• Reduce "dot patterns" (not necessarily all in the same column) to other dot patterns (not necessarily only one dot in each column).
• The book speaks less generally, restricting the output to one dot per column.

4 Examples
  (4, 4; 4) counter - two columns of 4 dots each reduce to a 4-bit output
  (5, 5; 4) counter
  (4, 6; 4) counter
  A 4-bit binary full adder, with carry-in, is a (2, 2, 2, 3; 5) counter.
Reducing 5 Numbers with (5, 5 ; 4) Counters
(n; 2) Counters • Difference in notation from other counters. • Reduce n (larger than 3) numbers to two numbers. • Each slice i of an (n; 2) counter: – receives carry bits from one or more positions to the right (i-1, i-2, ….) – produces outputs to positions i and i+1 – produces carries to one or more positions to the left (i+1, i+2, ….)
(n; 2) Counters, Slice by Slice

One slice receives n new dots plus ψ1, ψ2, ψ3 carries from the slices 1, 2, and 3 positions to the right, and produces 2 output dots plus ψ1, ψ2, ψ3 carries to the left. For the slice to absorb all its inputs:

    n + ψ1 + ψ2 + ψ3 ≤ 3 + 2ψ1 + 4ψ2 + 8ψ3
Adding Multiple Signed Numbers • By means of sign extension
• By method of negative weighted sign bits
Lecture 9 Basic Multiplication Schemes
Note on Notation
Right Shift Method
Left Shift Method
Right Shift Algorithm
After k iterations:
Left Shift Algorithm
After k iterations:
Right Shift Example
After k iterations:
Left Shift Example
After k iterations:
Programmed Multiplication • 6 to 7 instructions executed per loop plus overhead • >200 instructions for a 32-bit multiply • Specialized microcode would be smaller
Basic Hardware Multipliers
Right-Shift Multiplier
Combined multiplier / product register
Basic Hardware Multipliers
Left-Shift Multiplier
Disadvantages:
• Multiplier and product cannot share a register
• Adder is twice as wide as in the right-shift multiplier
• Sign extension of the partial product is more difficult
Multiplication of Signed Numbers • Sign extend terms xja and pj when doing additions. • xk-1a term is subtracted instead of added (weight of xk-1 is negative) • In right-shift adders, sign-extension happens incrementally
Signed Multiplication Using Booth's Recoding
• Use Booth's recoding to represent the multiplier x in signed-digit format.
• Booth's recoding was first proposed for speeding up radix-2 multiplication in early digital computers.
• Used when shifting alone is faster than addition followed by shifting.
Booth's Recoding
• Booth observed that whenever there is a long run of consecutive ones, the corresponding additions can be replaced by a single addition and a subtraction.
• The longer the sequence of ones, the greater the savings
Booth’s Recoding A Digit Set Conversion • The effect of this translation is to change a binary number with digit set [0,1] to a binary signed-digit number with digit set [-1,1].
Ignore the extra bit if x is a two’s complement negative number
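The recoding itself is a pairwise scan of the multiplier. A Python sketch (helper names are made up; `sd_value` just evaluates a signed-digit string):

```python
def booth_recode(x, k):
    """Radix-2 Booth recoding of a k-bit two's-complement x into digits {-1, 0, 1}.
    Digit i is x_{i-1} - x_i with x_{-1} = 0, so a run of 1s becomes +1 ... 0 ... -1."""
    digits = []
    prev = 0
    for i in range(k):
        xi = (x >> i) & 1
        digits.append(prev - xi)
        prev = xi
    return digits          # the extra (k+1)st digit is dropped, per the slide

def sd_value(digits):
    """Value of a signed-digit number, LSB first."""
    return sum(d << i for i, d in enumerate(digits))
```

For example, 0111 (= 7) recodes to the digits (−1, 0, 0, 1), i.e. 8 − 1, replacing three additions by one addition and one subtraction.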
Multiplication by Constants • In programs, a large percentage of multiplications are by a constant known at compile time. – Like in address computation
• Custom instruction sequences are often faster than calling a general purpose multiply routine.
Multiply R1 by 113 = (111001)two
Using Subtraction
Using Factoring Multiply R1 by 119
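For illustration, the three recipes (shift-and-add from the binary expansion, subtraction to collapse a run of 1s, and factoring) can be written out. The function names are hypothetical and `r1` is modeled as a plain integer, not a machine register:

```python
def times_113(r1):
    """113 = (1110001)two: plain shift-and-add over the 1 bits."""
    r2 = (r1 << 1) + r1        # 3 * r1
    r2 = (r2 << 1) + r1        # 7 * r1   (the run 111)
    return (r2 << 4) + r1      # 112 * r1 + r1 = 113 * r1

def times_113_sub(r1):
    """The run 111 0001 collapses: 113 = 2^7 - 2^4 + 1, one subtract instead."""
    return (r1 << 7) - (r1 << 4) + r1

def times_119(r1):
    """Factoring: 119 = 7 * 17 = (8 - 1)(16 + 1)."""
    r2 = (r1 << 3) - r1        # 7 * r1
    return (r2 << 4) + r2      # 17 * (7 * r1) = 119 * r1
```

Each variant trades instruction count against the latency of the dependent shift/add chain.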
Speeding Up Multipliers • Reduce the number of operands being added. – Leads to high radix multipliers – Several bits of the multiplier are multiplied by the multiplicand in each cycle
• Speed up addition – Multioperand addition – Leads to tree and array multipliers
Lecture 10 High-Radix Multipliers
Radix-r Algorithms
r = 2^s means shifting s bits per iteration
Radix-4 (2-bits at a time) Multiplication
Need multiples 0a, 1a, 2a, 3a. 3a can be precomputed and stored.
Radix-4 (2-bits at a time) Multiplication
3a may also be computed by subtracting a and forcing a carry (4a). 4a may be computed by adding 0 and forcing a carry. An extra cycle may be required at the end.
Radix-4 Booth’s Recoding • Converts from digit set [0,3] to [-2,2]
Radix-4 Booth’s Recoding • Again, ignoring the upper bit gives the correct signed interpretation of the number. • Radix-4 conversion entails no carry propagation. • Each digit is obtained independently by examining three bits of the multiplier. • Overlapped 3-bit scanning.
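The overlapped 3-bit scan can be written down directly: digit j is x_{2j−1} + x_{2j} − 2·x_{2j+1}, with x_{−1} = 0. Illustrative Python for k-bit two's-complement multipliers (k even):

```python
def booth4_recode(x, k):
    """Radix-4 Booth recoding: each digit in [-2, 2] comes independently
    from the overlapped 3-bit group (x_{2j+1}, x_{2j}, x_{2j-1})."""
    assert k % 2 == 0
    digits = []
    prev = 0                            # x_{2j-1} from the previous group
    for j in range(k // 2):
        b0 = (x >> (2 * j)) & 1
        b1 = (x >> (2 * j + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits                       # value = sum of digit_j * 4^j
```

Because no carries propagate between groups, all k/2 digits can be produced in parallel, which is what makes this recoding attractive for parallel multipliers.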
Addend Generation for Radix-4
Using Carry-Save Adders
Use CSA to compute the 3a term.
Using CSA to Reduce Addition Time
Using CSA to Do Both
Using CSA with Radix-4 Booth
Booth Recoding for Parallel Multiplication
Higher Radix Multipliers Radix -16
How Far Should You Go? Adventures in Space-Time Mapping
Twin-beat Multiplier to Multiply the Speed
Lecture 11 Tree and Array Multipliers
Full-Tree Multipliers
Full-Tree Multipliers
1. All multiples of the multiplicand are produced in parallel.
2. A k-input CSA tree is used to reduce them to two operands.
3. A CPA is used to reduce those two to the product.
• No feedback → pipelining is feasible.
• Different tree multipliers are distinguished by the designs of the above three elements.
General Structure of a Full-tree Multiplier Binary, High-radix, or Recoded
Radix Tradeoffs • The higher the radix …. – The more complex the multiple-forming circuits, and – The less complex the reduction tree
• Where is the optimal cost-effectiveness? – Depends on design – Depends on technology
Tradeoffs in CSA Trees • Wallace Tree Multiplier – combine partial product bits at the earliest opportunity – leads to fastest possible design
• Dadda Tree Multiplier – combine as late as possible, while keeping the critical path length (# levels) of the tree minimal – leads to simpler CSA tree structure, but wider CPA at the end
• Hybrids - somewhere in between
Two Binary 4 × 4 Tree Multipliers
Reduction Tree • Results from chapter 8 apply to the design of partial product reduction trees. – General CSA Trees – Generalized Parallel Counters
CSA for 7 × 7 Tree Multiplier
[Dot-notation diagram: the 49 partial-product bits of the 7 × 7 multiplication are reduced by successive CSA levels down to two operands.]
Structure of a CSA Tree
• Logarithmic-depth reduction trees based on CSAs (e.g., Wallace, Dadda, etc.)
  - have an irregular structure
  - make design and layout difficult
• Connections and signal paths of various lengths – lead to logic hazards and signal skew – implication for both performance and power consumption
Alternative Reduction Trees

(n; 2) counters are more suitable to VLSI: slices of a (7; 2) counter can reduce a 7 × 7 multiplication.

This is a regular circuit; however, many of its inputs are zeros.
A Balanced (11;2) Counter Balanced: All outputs are produced after the same number of delays. All carries produced at level i enter FAs at level i+1 Can be laid out to occupy a narrow vertical slice. Can be easily expanded to a (18;2) counter
Tree Multiplier Based on (4,2) Counters
(4,2) Counters make Binary Reduction Trees
Layout of Binary Reduction Tree
Sign Extension
Signed Addition
Baugh-Wooley Arrays
Partial Tree Multipliers
Basic Array Multiplier Using a One-Sided CSA Tree
Unsigned Array Multiplier
Unsigned Array Multiplier
Baugh-Wooley Signed Multiply
Signed Multiplier Array
Mixed Positive and Negative Binary

+ digit set = {0, 1};  − digit set = {−1, 0}

Encoding: a + digit uses bit b to represent the value b; a − digit uses bit b to represent b − 1 (excess-1 encoding), so bit 0 encodes −1 and bit 1 encodes 0.

Digit Encodings

A two's-complement number is the special case with position weights −2^(k−1), +2^(k−2), …, +2, +1 (a single − digit at the top); a mixed +/− digit-set number allows + and − digits in any positions.
Mixed Binary Full Adders

[Figure: full-adder cell variants, one for each combination of +/− types on the x, y, and carry-in inputs; each variant produces a sum and a carry-out of the appropriate +/− types.]
5 × 5 Signed Multiplier Using Mixed +/− Numbers

[Dot diagram: the partial-product array with positively and negatively weighted positions marked + and −; the top row and left column carry the − digits arising from the sign bits.]
Include AND Gates in Cells
Change the Terms of the Problem
Multiplier Without Final CPA
Conditional Carry-Save
Pipelined Partial-Tree Multiplier
Pipelined Array Multiplier
Lecture 12 Variations in Multipliers
Divide and Conquer You have a b × b multiplier. (Could be a lookup table.) You want a 2b × 2b multiplier.
Divide & Divide & Conquer
You want a 2b × 2b multiplier ( 4 b × b multiplies, 3 addends).
You want a 3b × 3b multiplier ( 9 b × b multiplies, 5 addends).
You want a 4b × 4b multiplier (16 b × b multiplies, 7 addends).
An 8 × 8 Multiplier using 4× 4 Multipliers
Additive Multiply Modules To synthesize large multipliers from smaller ones requires both multiplier and adder units. If we combine multiplication and addition into one unit, then we can use it to implement large multiplies.
An 8 × 8 Multiplier using 8 (4 × 2)-AMMs
A slower, but more regular 8 × 8 Multiplier
Bit-Serial Multipliers
• Smaller pin count
• Lower area in VLSI
• Run at high clock rates
• Can be pipelined
• Can be run in systolic arrays
Semi-Systolic #1: A Parallel, X Serial (LSB First)

[Array diagram: multiplicand bits a3, a2, a1, a0 held in the cells; multiplier bits x0, x1, x2, x3 followed by four 0s shifted in serially.]

Semi-systolic Design #1 requires 4 zeros to be shifted in after the x's.
Modified Design #1 Allows a new problem to be started 4 cycles earlier.
One Way to Look at Retiming
Semi-Systolic #2: A Parallel, X Serial (LSB First)

[Array diagram: same interface as Design #1 (a3 … a0 held in place; x0 … x3 then 0s shifted in), with the retimed cell arrangement.]

Semi-systolic Design #2
Systolic Design: A Parallel, X Serial (LSB First)

[Array diagram: the fully systolic version of the same multiplier; no signal is broadcast, every cell-to-cell connection is registered.]

Systolic Design
Both Inputs Serial

Let  a^(i) = 2^i a_i + a^(i−1),  a^(0) = a_0
     x^(i) = 2^i x_i + x^(i−1),  x^(0) = x_0

m^(i) = a^(i) x^(i) = (2^i a_i + a^(i−1)) (2^i x_i + x^(i−1))
      = 2^(2i) a_i x_i + 2^i (a_i x^(i−1) + x_i a^(i−1)) + a^(i−1) x^(i−1)
      = 2^(2i) a_i x_i + 2^i (a_i x^(i−1) + x_i a^(i−1)) + m^(i−1)

With p^(i) = 2^−(i+1) m^(i), this becomes the recurrence

    2 p^(i) = 2^i a_i x_i + a_i x^(i−1) + x_i a^(i−1) + p^(i−1)
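The recurrence can be checked by accumulating m^(i) with plain integers. A Python sketch for unsigned operands (names illustrative):

```python
def serial_multiply(a, x, k):
    """Bit-serial multiply, both operands LSB first (unsigned), via
    m(i) = 2^(2i) a_i x_i + 2^i (a_i x(i-1) + x_i a(i-1)) + m(i-1)."""
    a_lo = x_lo = m = 0                # a(i-1), x(i-1), m(i-1)
    for i in range(k):
        ai = (a >> i) & 1
        xi = (x >> i) & 1
        m += (ai & xi) << (2 * i)                  # new square term
        m += (ai * x_lo + xi * a_lo) << i          # cross terms with the prefixes
        a_lo |= ai << i                            # a(i) = 2^i a_i + a(i-1)
        x_lo |= xi << i
    return m
```

After k cycles, m equals a × x; hardware keeps the same state, but only exchanges low-order bits each cycle.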
Bit-Serial Multiplier in Dot Notation
Bit-Serial Multiplier
(Error in book)
Modular Multipliers
• Produce (a × b) mod m
• Good for multiplication in residue number systems
• Two special cases are easier to handle:
  - m = 2^b
  - m = 2^b − 1
• Modulo adders give rise to modulo multipliers.
• Modulo adders give rise to modulo multipliers.
Modulo-(2^b − 1) Carry-Save Adder
Design of a modulo-15 adder
16 mod 15 = 1 32 mod 15 = 2 64 mod 15 = 4
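End-around carry implements addition modulo 2^b − 1, as the mod-15 powers above suggest (16 ≡ 1, 32 ≡ 2, …). A minimal Python sketch, assuming both inputs are b-bit words and adopting the common convention that the all-ones word stands for zero:

```python
def add_mod_2b_minus_1(x, y, b):
    """Modulo-(2^b - 1) addition: the carry-out is fed back into the LSB."""
    mask = (1 << b) - 1
    s = x + y
    s = (s & mask) + (s >> b)       # end-around carry
    return 0 if s == mask else s    # all-ones word represents zero
```

Since 2^b ≡ 1 (mod 2^b − 1), wrapping the carry around preserves the residue without a true modulo operation.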
Design of a modulo-13 adder
16 mod 13 = 3 32 mod 13 = 6 64 mod 13 = 12
The number of dots is increased.
Remove one dot from col. 1 and replace it with two dots in col. 0 to balance the load.
General Method for Modular Addition
Bits emerging from the left (with weight exceeding m's range) are reduced modulo m and added back into the result.
The Special Case of Squaring • Any standard k × k multiplier may be used for computing p = x2 • However, a special purpose k-bit squarer requires significantly less hardware and is faster.
Design of a 5-bit Squarer
Reducing the Final CPA

Using the identity:

    x1 x0 + x1 = 2 x1 x0 + x1 − x1 x0 = 2 x1 x0 + x1 (1 − x0) = 2 x1 x0 + x1 x0'

two dots in one column become one dot there and one dot in the next column. This postpones an addition, shortening the final CPA.
A Multiplier Based on Squares

    a · x = [ (a + x)² − (a − x)² ] / 4

• The square function can be done with a lookup table of size 2^k × (2k − 2)
  - small compared to a multiply lookup table
Incorporating Wide Additive Inputs to a Multiplier
Lecture 13 Basic Division Schemes
Note on Notation
Sequential Division Algorithm

s^(j) = 2 s^(j−1) − q_{k−j} (2^k d),   with s^(0) = z and s^(k) = 2^k s

Quotient-bit selection:
  if 2 s^(j−1) − 2^k d < 0 then q_{k−j} = 0
  else                          q_{k−j} = 1

After k iterations:  z = (d × q) + s
Overflow
• The quotient of a 2k-bit number divided by a k-bit number may have more than k bits.
• An overflow check is needed:
  - the high-order k bits of z must be strictly less than d
  - this also checks for the divide-by-zero condition
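The algorithm and its precondition can be sketched as an integer model (not hardware; names illustrative):

```python
def restoring_divide(z, d, k):
    """Restoring division of a 2k-bit z by a k-bit d.
    Precondition (no overflow): z >> k < d, which also rules out d == 0."""
    assert (z >> k) < d
    s, q = z, 0                        # partial remainder starts as the dividend
    for _ in range(k):
        s = 2 * s                      # shift the partial remainder left
        if s - (d << k) >= 0:          # trial subtraction of 2^k * d
            s -= d << k
            q = (q << 1) | 1           # quotient bit 1
        else:
            q = q << 1                 # quotient bit 0 ("restore" = keep old s)
    return q, s >> k                   # quotient and remainder
```

In hardware the "restore" is either an extra add-back cycle or, as in the diagram, a multiplexer that simply declines to commit the trial difference.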
Radix-r division (r > 2) • Basically the same division algorithm, however: • Selection of quotient digit qk-j more difficult • Computation of term qk-jd more difficult
Restoring Division Hardware
Signed Division
Magnitudes of q and s are unaffected by the input signs. Output signs are only a function of input signs.
Non-Restoring Division

                     Restoring                Non-restoring
Partial remainder:   u                        u
Subtract 2^k d:      u − 2^k d < 0            u − 2^k d < 0
Restore?             yes → u                  no → keep u − 2^k d
Double:              2u                       2(u − 2^k d)
Next step:           subtract 2^k d:          add 2^k d instead:
                     2u − 2^k d               2(u − 2^k d) + 2^k d = 2u − 2^k d
Non-Restoring Signed Division
• Each cycle, compute:  s^(j) = 2 s^(j−1) − q_{k−j} (2^k d)
• Each cycle, either:
  - if sign(s^(j−1)) = sign(d):  q_{k−j} = 1, subtract 2^k d
  - if sign(s^(j−1)) ≠ sign(d):  q_{k−j} = −1, add 2^k d
• The digits q_{k−j} are either −1 or 1.
Two Problems at the End • The quotient with digits 1 and −1 must be converted to standard binary • A final correction step: If sign(s) != sign(z) – add ±d to the remainder, and – subtract ±1 from the quotient
Conversion of {−1, 1} to 2's-Complement
• Replace all −1 digits with 0s, leaving the 1s (giving bits p_i; note q_i = 2p_i − 1).
• Complement the most significant bit.
• Shift left one position, inserting 1 as the LSB.

q1 q0     p1 p0     p1' p0    result
−1 −1     0  0      1  0      101
 1  1     1  1      0  1      011
−1  1     0  1      1  1      111
 1 −1     1  0      0  0      001
Proof of Method
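A sketch of the whole pipeline: the non-restoring iteration, the digit conversion via the identity q_i = 2p_i − 1, and the final correction step. Unsigned integer model, names illustrative:

```python
def nonrestoring_divide(z, d, k):
    """Non-restoring division (requires z >> k < d). Quotient digits are in
    {-1, 1}; p records the positions of the 1 digits."""
    assert (z >> k) < d
    s, p = z, 0
    for _ in range(k):
        if s < 0:                      # sign decides add vs. subtract
            s = 2 * s + (d << k)       # digit -1
            p = p << 1
        else:
            s = 2 * s - (d << k)       # digit +1
            p = (p << 1) | 1
    q = 2 * p + 1 - (1 << k)           # convert {-1,1} digits via q_i = 2 p_i - 1
    if s < 0:                          # final correction step
        s += d << k
        q -= 1
    return q, s >> k
```

The result matches restoring division; only the handling of negative partial remainders differs.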
Partial Remainders
Non-Restoring Division Hardware
Lecture 14 High-Radix Dividers
Basics of High-Radix Division

z = (d × q) + s,   sign(s) = sign(z)

s^(j) = r s^(j−1) − q_{k−j} (r^k d),   with s^(0) = z and s^(k) = r^k s
Radix-4 Division in Dot Notation
Interesting dividers have radix r = 2^b, which reduces the number of cycles by a factor of b.
Difficulty of High-Radix Division
• Guessing the correct quotient digit is more difficult.
• Division is naturally a sequential process:
  - guess a quotient digit q_{k−j}
  - compute the term q_{k−j} (r^k d)
  - compute the partial remainder s^(j) = r s^(j−1) − q_{k−j} (r^k d)
Carry-Save Remainders
• More important for speed than high radix.
• Lead to large performance increases by replacing the carry-propagate adder with a carry-save adder.
• The key to keeping the remainder in carry-save form is redundancy in the representation of q:
  - allows less precise guessing of the quotient digit, based on the approximate magnitude of the partial remainder
  - more redundancy → less precision required
Review of Non-Restoring Division (fractional operands)

s^(j) = 2 s^(j−1) − d   when s^(j−1) ≥ 0   (q_{−j} = 1)
s^(j) = 2 s^(j−1) + d   when s^(j−1) < 0   (q_{−j} = −1)
Using q_{−j} in {−1, 0, 1}

s^(j) = 2 s^(j−1) + d   (q_{−j} = −1)
s^(j) = 2 s^(j−1)       (q_{−j} = 0: just shift)
s^(j) = 2 s^(j−1) − d   (q_{−j} = 1)
A Big Problem
Q: How can you tell if the shifted partial remainder is in [−d, d)?
A: You have to perform trial subtractions.

Q: Can you avoid trial subtractions?
A: Sweeney, Robertson, and Tocher: SRT division.
Radix-2 SRT Division
• Assume d ≥ 1/2 (normalized).
• Restrict the partial remainder to the constant range [−1/2, 1/2) instead of [−d, d):
  - may require shifting the dividend (= initial partial remainder) so that −1/2 ≤ s^(0) < 1/2.
• Once in the proper range, subsequent partial remainders will stay in the range….
Radix-2 SRT Division
Comparison with 1/2 and -1/2 is easy.
Simplified Digit Selection

Only the two bits 2s^(j−1) = u0 . u−1 need to be examined:

u0  u−1    range of 2s^(j−1)    q_{−j}
1   1      [−1/2, 0)            0
1   0      [−1, −1/2)           −1
0   1      [1/2, 1)             1
0   0      [0, 1/2)             0
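The radix-2 SRT loop can be modeled exactly with rationals; note that digit selection needs only comparisons with ±1/2, never a trial subtraction against d. Illustrative Python (operands passed as `Fraction`s):

```python
from fractions import Fraction

def srt2_divide(z, d, k):
    """Radix-2 SRT with digit set {-1, 0, 1}: d in [1/2, 1), z in [-1/2, 1/2).
    Invariant: z == q * d + s * 2^-j after iteration j."""
    s, q = Fraction(z), Fraction(0)
    for j in range(1, k + 1):
        s = 2 * s
        if s >= Fraction(1, 2):
            s -= d                      # digit 1
            q += Fraction(1, 2 ** j)
        elif s < Fraction(-1, 2):
            s += d                      # digit -1
            q -= Fraction(1, 2 ** j)
        # otherwise digit 0: just shift
    return q, s / 2 ** k                # z == q*d + remainder exactly
```

Because d ≥ 1/2, each step keeps the remainder in [−1/2, 1/2), which is what makes the constant comparison thresholds valid.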
Final Steps
• The {−1, 1} quotient-conversion algorithm will not work to convert a {−1, 0, 1} quotient to two's complement. Instead:
  - use the on-the-fly algorithm of Ercegovac [1987], or
  - subtract the negative digits from the positive digits.
• Still requires a final correction step to make remainder positive.
Using Carry-Save Adders
Carry-Save Partial Remainders • Two numbers sum to the actual partial remainder • To perform exact comparison, a full CPA would be required • Overlaps in the selection regions allow us to perform approximate comparisons without risk of choosing a wrong digit.
Carry-Save Partial Remainders

2 s^(j−1) = u + v
u = (u1 u0 . u−1 u−2 …) in two's complement
v = (v1 v0 . v−1 v−2 …) in two's complement

Let t = t1 t0 . t−1 t−2 = (u1 u0 . u−1 u−2) + (v1 v0 . v−1 v−2),
the sum of u and v truncated to two fractional bits each.

t is an approximation of u + v; the truncation error is less than 1/4 + 1/4 = 1/2:

    0 ≤ (u + v) − t < 1/2
Tolerating Truncation Error
Truncation Error
Digit Selection

t1 t0 . t−1 t−2    interval           q_{−j}
01.11              [1.75, 2.0)        1
01.10              [1.5, 1.75)        1
01.01              [1.25, 1.5)        1
01.00              [1.0, 1.25)        1
00.11              [0.75, 1.0)        1
00.10              [0.5, 0.75)        1
00.01              [0.25, 0.5)        1
00.00              [0.0, 0.25)        1
11.11              [−0.25, 0.0)       0
11.10              [−0.5, −0.25)      0
11.01              [−0.75, −0.5)      −1
11.00              [−1.0, −0.75)      −1
10.11              [−1.25, −1.0)      −1
10.10              [−1.5, −1.25)      −1
10.01              [−1.75, −1.5)      −1
10.00              [−2.0, −1.75)      −1
Radix-2 Divider with CSA
Select Logic
• Fast 4-bit CPA plus decode logic, or
• 256 × 2 lookup table, or
• 8-input, 2-output PLA (inputs: 4 bits each of u and v)
CLA with SRT Division?
What happens to overlap regions as d → 1 ?
Choosing Quotient Digits Using a p-d Plot

s^(j) = 2 s^(j−1) − q_{−j} d,   plotted with p = 2 s^(j−1) on the vertical axis and d on the horizontal axis.

Horizontal decision lines: the value of d does not affect the choice.
Putting Both Charts Together

s^(j) = 2 s^(j−1) − q_{−j} d
Radix-4 SRT Division
• Radix r = 2^b, b > 1.
• Partial remainder kept in stored-carry form.
• Requires a redundant digit set.
• Example: radix 4, digit set [−3, 3].

New vs. Shifted Old Partial Remainder (radix = 4, digit set [−3, 3])
p-d Plot for Radix-4, [−3, 3 ] SRT Division
Only one quadrant shown
Radix-4 Digit Set [ −2, 2 ] • Avoids having to compute 3d as in digit set [−3, 3] • Fewer comparisons (fewer selection regions) • Less redundancy means less overlap in selection regions • Partial remainder must be restricted to ensure convergence
Restricting the Range of s

−h d ≤ s^(j−1) < h d,  for some h < 1
−4h d ≤ 4 s^(j−1) < 4h d
−4h d + 2d ≤ 4 s^(j−1) − q_{−j} d < 4h d − 2d       (extreme digits q_{−j} = ±2)
             └──────────── s^(j) ───────────┘

Requiring −h d ≤ s^(j) < h d gives  h d = 4h d − 2d,  so  h = 2/3.
p-d Plot for Radix-4, [−2, 2 ] SRT Division
Observations • Restricting digit set to [−2, 2 ] results in less overlap in selection regions • Must examine p and d in greater detail to correctly choose the quotient digit. • Staircase boundaries: 4 bits of p and 4 bits of d are required to make the selection.
Block Diagram 4 bits
2d, d, 0, −d, −2d
Intel's Pentium Division Bug
• Intel used the radix-4 SRT division algorithm.
• Quotient selection was implemented as a PLA.
• The p-d plot was numerically generated.
• The script that downloaded the entries into the PLA inadvertently removed a few table entries.
• When hit, these missing entries produced digit 0 instead of the intended ±2.
• These entries are consulted very rarely, so the bug was very subtle and difficult to detect.
General High-Radix Dividers • Radix-8 is possible. – Minimal quotient digit set [−4, 4] – Partial remainder restricted to [-4d/7, 4d/7) – Requires a 3d multiple
• Digit sets with greater redundancy (such as [−7, 7] ) lead to: – wider overlap regions – more comparisons but simpler digit selection – more difficult multiples (±5, ±7)
Lecture 15 Variations in Dividers
Robertson Diagram with Radix r, Digit Set [−α, α]

Digit set           Partial remainder    Shifted partial remainder
[−(r−1), r−1]       [−d, d)              [−rd, rd)
[−α, α]             [−hd, hd)            [−rhd, rhd)

where h = α / (r − 1).
Range of h

                        α        h
Maximal redundancy    r − 1      1
Minimal redundancy   ⌈r/2⌉    (1/2)+

with h = α / (r − 1)
Derivation of h

Bound on s^(j−1):   −hd ≤ s^(j−1) < hd, for some h < 1
                    −rhd ≤ r s^(j−1) < rhd

With s^(j) = r s^(j−1) − q_{−j} d and digits q_{−j} in [−α, α], requiring
−hd ≤ s^(j) < hd at the extreme digits q_{−j} = ±α gives

hd = rhd − αd   ⇒   h = α / (r − 1)
2
p-d Plot with Overlap Region Uncertainty Rectangle (because of truncation)
A: 4 bits of p, 3 bits of d OK
B: 3 bits of p, 4 bits of d Ambiguous
Choosing the Selection Boundaries 1. Tile with the largest admissible rectangles 2. Verify that no tile intersects both boundaries.
3. Associate a quotient digit with each tile.
3
Tiles = Uncertainty Rectangles The larger the tiles, the fewer bits need to be inspected. If p is in carry-save form (u+v), then to get j bits of accuracy for p, we need to inspect j+1 bits of u and v.
[Figure annotations: truncation error of p and truncation error of d define the uncertainty rectangle.]
Determining Tile Sizes • Goal: Find the coarsest possible grid such that the staircase boundaries are entirely contained in the overlap areas. • There is no closed form for the number of bits required, given the parameters r and α. • However, we can derive lower bounds on the number of bits required.
4
Finding Lower Bounds on Number of Bits • An upper bound on the tile dimensions determines a lower bound on the number of bits needed. • The narrowest overlap area lies between the two largest digits, α and α−1, at d_min • Find the minimum horizontal and vertical dimensions of the overlap area in that narrowest region
Establishing Upper Bound on Uncertainty

Δd = d_min (2h − 1) / (α − h)
Δp = d_min (2h − 1)

(missing symbols are in Fig. 15.4)

bits of p = ⌈−log₂ Δp⌉
bits of d = ⌈−log₂ Δd⌉
5
Automating the Process • Determining the bound on the number of bits required and generating the contents of the digit selection PLA can be easily automated. • However, the Intel Pentium bug teaches us an important lesson.
The Asymmetry of the Quotient Digit Selection Process p can also go negative. The second quadrant is not a simple negation of the first, because truncation affects negative values asymmetrically. Separate table entries must be derived for the other quadrants.
6
Large Uncertainty Rectangles Only one of the large uncertainty rectangles is not totally in the overlap region. Break it into smaller rectangles. One extra bit of both p and d is needed for this case.
Division With Prescaling • The overlap regions are widest toward the high end of the divisor d range. • If we can restrict d to be large, then the selection of the quotient digits may become simpler (require fewer bits of p and d, possibly made independent of d altogether) • Instead of computing z/d, compute zm/dm • This is called prescaling.
7
Prescaling • Multiply both dividend and divisor by a constant m before beginning division. • For multiplier, use existing hardware in divider for multiplying divisor and quotient digits. • Speedup in selection logic must be weighed against extra multiplication steps at the beginning. • Table lookup to determine scaling factor.
Modular Dividers and Reducers • Remainder in modular division is always positive. (Requires a different correction step.) • Modular reduction (computing positive remainder) is faster and needs less work than a full blown division.
8
Restoring Array Divider What is the worst case path?
Nonrestoring Array Divider What is the worst case path?
9
Comparison to Array Multipliers • Similarity between array dividers and array multipliers is deceiving. • Array multipliers have O(k) delay. • Array dividers have O(k2) delay. • Both can be pipelined to increase throughput.
Multiplier and Divider Comparison
10
Combined Multiply/Divide Unit
Array Multiplier and Divider Comparison
11
I/O of a Universal Array Multiplier/Divider
12
Chapter 16 Division by Convergence
General Convergence Methods • Mutually recursive equations – one sequence converges to a constant (residual) – one sequence converges to the desired function
• The complexity depends on – ease of evaluating f and g – the number of iterations required to converge
1
Recurrence Equations for Division
An invariant is a predicate that is true at each step of the iteration. It is useful in debugging and leads to a formula for the computation.
Invariant:
Another Recurrence for Division Scaled Residual:
Invariant : s ( j ) = r j (z − q ( j ) r − j d ) s( j)r − j = z − q( j)r − j d
2
Variations • The many division schemes of Chapters 13–15 correspond to: – variations in the radix r – variations in the bound on the scaled residual – variations in the quotient selection rule
• This chapter explores schemes that require far fewer iterations ( O(log k) instead of O(k) )
Division by Repeated Multiplications
3
Three Questions
Recurrence Equations
x ( i ) is an approximation to 1/d ( i )
4
Substitute 2−d(i) for x(i)
How fast does it converge?
⇒ Called Quadratic convergence
Quadratic Convergence

1 − d^(0) ≤ 2^−1
1 − d^(1) ≤ 2^−2
1 − d^(2) ≤ 2^−4
1 − d^(3) ≤ 2^−8
⋮
1 − d^(m) ≤ 2^(−2^m) = 2^−k,   where m = ⌈log₂ k⌉
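The doubling of matched bits can be observed directly. This sketch runs the repeated-multiplication (scale-by-(2 − d)) recurrence on a normalized divisor; after a handful of iterations d is indistinguishable from 1 in double precision and z holds the quotient:

```python
# Sketch of division by repeated multiplication: scale both z and d by
# x(i) = 2 - d(i) each step; d converges quadratically to 1, z to z/d.

def repeated_mult_divide(z, d, iters=6):
    assert 0.5 <= d < 1.0        # normalized divisor
    for _ in range(iters):
        x = 2.0 - d              # x(i) approximates 1/d(i)
        z, d = z * x, d * x
    return z

q = repeated_mult_divide(0.6, 0.75)
print(q)  # close to 0.8
```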
5
Analysis of Errors
q^(m) = z^(m) can be off from q by up to 1 ulp; for example, when z = d, both d^(i) and q^(i) converge to 1 − ulp rather than 1. To reduce the error to ulp/2, add ulp to q^(m) whenever q_{−1} = 1.
Complexity

A k-bit division requires:
– 2⌈log₂ k⌉ − 1 multiplications
– ⌈log₂ k⌉ complementations

Intermediate computations need to be done with a minimum of k + log₂ m bits.
6
Division By Reciprocation • To compute q = z / d – compute 1/d – multiply z times 1/d
• Particularly efficient if several divisions by the same divisor d need to be performed
Newton-Raphson Iteration x
( i +1)
f ( x (i ) ) =x − f ′( x (i ) ) (i )
Finds the root of f
7
Newton-Raphson for Division
Quadratic Convergence:
Initial Condition For Good:
Better:
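The Newton-Raphson division recurrence referenced above can be sketched as follows; the iteration x ← x(2 − dx) converges to 1/d from any starting estimate 0 < x^(0) < 2/d (the starting value 1.0 here is an arbitrary choice for illustration, not the lookup-based initial condition from the slide):

```python
# Newton-Raphson reciprocal sketch: f(x) = 1/x - d has root 1/d, giving
# x(i+1) = x(i) * (2 - d * x(i)) -- two multiplications per iteration.

def nr_reciprocal(d, x0, iters=5):
    x = x0
    for _ in range(iters):
        x = x * (2.0 - d * x)    # error squares each step
    return x

d = 0.75
print(nr_reciprocal(d, 1.0) * 0.6)  # ~ 0.6 / 0.75 = 0.8
```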
8
Speedup of Convergence Division • Three types of speedup are possible: – reducing the number of multiplications – using narrower multiplications – performing the multiplications faster
• Convergence is slow in the beginning, fast at the end (the number of bits doubles each iteration)
Lookup Table • 8-bit estimate of 1/d replaces 3 iterations • Q: How many bits of d must be inspected to estimate w bits of 1/d ? – A: w – Table size will be 2w×w (proved in section 16.6) – Estimate for 1/d may have a positive or negative error -- the important thing is to reduce the magnitude of the error.
9
Lookup Table Size

To get w (w ≥ 5) bits of convergence in the first iteration of division by repeated multiplication, w bits of d (beyond the leading 0.1) must be inspected. The needed approximation x^(0+) has w bits beyond the leading 1:

0.1 xxx…x (w digits)  —table lookup→  1. xxx…x (w digits)

Table size is 2^w × w. The first pair of multiplications uses the (w+1)-bit multiplier x^(0+).
Convergence from Above and Below
10
Reducing the Width of Multiplications • The first pair of multiplications following the table-lookup involve a narrow multiplier. • If the results of multiplications are suitably truncated, then narrow multipliers can continue to be used.
The Effect of Truncation
Truncation Error:
Approximation Error:
11
Use of Truncated Multiplication
Truncation at Each Step
12
Example: 64-bit Divide
• 256×8 = 2K-bit lookup table (8-bit result)
• Two multiplications (9-bit multiplier) (16-bit result)
• Two multiplications (17-bit multiplier) (32-bit result)
• Two multiplications (33-bit multiplier) (64-bit result)
• One full 64-bit multiplication
Hardware Implementation • Convergence division methods are more likely to be used when a fast parallel tree multiplier is available. • The iterated multiply algorithm can also be pipelined.
13
Using a Two-Stage Pipelined Multiplier
Lookup Tables • The better the approximation – the fewer multiplies are required, but – the larger the lookup table
• Store reciprocal values for fewer points and use linear (one multiply-add operation) or higher-order interpolation to obtain the starting approximation at a specified initial value. • Alternatively, formulate the starting approximation as a multioperand addition problem and use one pass through the multiplier’s CSA tree to compute it.
14
Lecture 17 Floating-Point Representations
Number Representations • No representation method is capable of representing all real numbers. • Most real values must be represented by an approximation. • Various methods can be used:
– Fixed-point number systems (0.xxxxxxxx)
– Rational number systems (xxxx/yyyy)
– Floating-point number systems (1.xxxx × 2^yyyy)
– Logarithmic number systems (2^yyyy.yyyy)
1
Fixed-Point • Maximum absolute error is same for all numbers – ±ulp with truncation – ±ulp/2 with rounding
• Maximum relative error is much worse for small numbers than for large numbers – x = (0000 0000. 0000 1001)two – y = (1001 0000. 0000 0000)two
• Small dynamic range: x2 and y2 cannot be represented
Floating-Point • Floating-point trades off precision for dynamic range – you can represent a wide range, from the very small to the extremely large – precision is acceptable at all points within the range
2
Floating-Point
• A floating-point number has 4 components:
– the sign, ±
– the significand, s
– the exponent base, b (usually 2)
– the exponent, e (which allows the point to float)
• x = ±s × b^e
• Previous example:
– +1.001two × 2^−5
– +1.001two × 2^+7
• More dynamic range
Typical Floating-Point Format
3
Two Signs • The sign of the significand is the sign of the number • The exponent sign is positive for large numbers and negative for small numbers
Representation of Exponent • Signed integer represented in biased number system, and placed to the left of the significand – does not affect speed or cost of exponent arithmetic (addition/subtraction) – Smallest exponent = 0 • facilitates zero detection, zero = all 0’s
– facilitates magnitude comparison • comparing normalized F.P. numbers as if they were integers
4
Range • Intervals [ −max, −min ] and [ min, max ] • max = largest significand × b largest exponent • min = smallest significand × b smallest exponent −∞
+∞
Normal Numbers • Significand is in a set range such as: – –
[1/2, 1) 0.1xxxxxxxx, or [1, 2) 1.xxxxxxxx
• Non-normal numbers may be normalized by shifting and adjusting the exponent. • Zero is never normal.
5
Unrepresentable Numbers • Unrepresentable means not representable as a normalized number. • Underflow – interval from −min to 0, and from 0 to min
• Overflow – interval from −∞ to −max, and from max to ∞
• Three special, singular values : −∞, 0, ∞ Represented by special encodings.
Floating-Point Format • The more bits allocated to the exponent, the larger the dynamic range • The more bits allocated to the significand, the greater the precision • Decisions:
– exponent base, b
– number of exponent bits
– number of significand bits
– representation of the exponent, e
– representation of the significand, s
– placement of the binary point in the significand
6
IEEE Single (Double) Precision
• Exponent base b = 2
• Number of exponent bits = 8 (11)
• Number of significand bits = 23 (52)
• Exponent representation: biased, with bias = 127 (1023)
• Significand representation: signed-magnitude
• Binary point placed just after the leading bit; that leading bit is implicit (hidden), not stored, because it is always 1 for normalized numbers.
Before Standardization • Every vendor had a different floating-point format. • Even after word widths were standardized to 32 and 64 bits, different floating-point standards persisted. • Programs were not predictable. • Data was not portable.
7
ANSI/IEEE Std 754-1985
• Single (32 bit) and double (64 bit) precision formats • Special codes for +0, −0, +∞, −∞ – Number ÷ +∞ = ±0 – Number × ∞ = ±∞ – Number ÷ 0 = ±∞
• NaN (Not a Number) – Ex: 0/0, ∞/∞, 0×∞, Sqrt(negative number) – Any operation involving another NaN
5 Formats (Single Precision)

(1) NaN:          e = 255, f ≠ 0:   v = NaN regardless of s
(2) Infinity:     e = 255, f = 0:   v = (−1)^s ∞
(3) Normalized:   0 < e < 255:      v = (−1)^s 2^(e−127) (1.f)
(4) Denormalized: e = 0, f ≠ 0:     v = (−1)^s 2^−126 (0.f)
(5) Zero:         e = 0, f = 0:     v = (−1)^s 0
8
5 Formats (Double Precision)

(1) NaN:          e = 2047, f ≠ 0:  v = NaN regardless of s
(2) Infinity:     e = 2047, f = 0:  v = (−1)^s ∞
(3) Normalized:   0 < e < 2047:     v = (−1)^s 2^(e−1023) (1.f)
(4) Denormalized: e = 0, f ≠ 0:     v = (−1)^s 2^−1022 (0.f)
(5) Zero:         e = 0, f = 0:     v = (−1)^s 0
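The single-precision cases above can be sketched as a decoder. This example is an illustration, not a hardware design: it pulls out the s, e, f fields of a 32-bit pattern and applies the five rules directly.

```python
# Sketch: decode a 32-bit IEEE single per the five cases, from the raw
# bit fields s (sign), e (biased exponent), f (fraction).
import struct

def classify(bits):
    s, e, f = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
    sign = -1.0 if s else 1.0
    if e == 255:                      # cases (1) and (2)
        return "NaN" if f else sign * float("inf")
    if e == 0:                        # cases (4) and (5)
        return sign * 0.0 if f == 0 else sign * 2.0 ** -126 * (f / 2.0 ** 23)
    return sign * 2.0 ** (e - 127) * (1 + f / 2.0 ** 23)  # case (3)

bits = struct.unpack(">I", struct.pack(">f", -6.25))[0]
print(classify(bits))  # -6.25
```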
Denormalized Numbers • Numbers without a hidden 1 and with the smallest possible exponent • Provided to make underflow less abrupt. • “Graceful underflow” • Ex: 0.0001two × 2^−126 • Hardware support is optional and is often omitted.
9
Operations Defined
• Add
• Subtract
• Multiply
• Divide
• Square Root
• Remainder
• Comparison
• Conversions

Results must be the same as if intermediate computations were done with infinite precision. Care must be taken in hardware to ensure correctness without undue loss of precision.
Addition • Align the exponents of the two operands by right-shifting the significand of the number with the smaller exponent. • Add or subtract the significands depending on the sign bits. – Add if signs are the same – Subtract if signs are different
• In the case of subtract, cancellation may have occurred, and post-normalization is necessary. • Both overflow and underflow are possible
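The alignment-and-add step can be sketched on toy operands. This is a same-sign illustration only (no sign logic, rounding, or sticky-bit tracking), using p-bit integer significands in [2^(p−1), 2^p):

```python
# Toy sketch of FP addition alignment: operands are (exponent, integer
# significand) pairs; the smaller operand is right-shifted to align.

def fp_add(e1, s1, e2, s2, p=8):
    if e1 < e2:
        e1, s1, e2, s2 = e2, s2, e1, s1   # swap so operand 1 is larger
    s2 >>= (e1 - e2)                      # align: shift smaller right
    s = s1 + s2                           # same-sign case: add significands
    if s >= 1 << p:                       # carry-out: post-normalize
        s, e1 = s >> 1, e1 + 1
    return e1, s

print(fp_add(3, 0b10000000, 1, 0b11000000))  # (3, 176), i.e. 0b10110000
```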
10
Multiplication • Add the exponents and multiply the significands. • s1 in [1, 2) and s2 in [1,2) imply s1×s2 in [1, 4) – possible need for a single-bit right shift postnormalization
• Overflow and underflow are possible. – Post-normalization may also cause overflow.
Division • Subtract the exponents and divide the significands. • s1 in [1, 2) and s2 in [1,2) imply s1÷ s2 in (1/2, 2) – possible need for a single-bit left shift postnormalization
• Overflow and underflow are possible. – Post-normalization may also cause underflow.
• Division by zero fault must be detected.
11
Square Root • Make the exponent even by subtracting 1 from odd exponents and shifting the significand left by one bit. s in [1, 4) • Halve the exponent and compute the square root of the significand in [1,2) • Square root never Overflows, Underflows, or needs post-normalization. • Square root of negative non-zero number produces a NaN • − 0 = −0
Conversions • Integer to Floating Point – may require rounding – may be inexact
• Single Precision to Double Precision – always fits, never any problems
• Double Precision to Single Precision – may require rounding – may overflow or underflow
12
Exceptions
• Divide by Zero
• Underflow
• Overflow
• Inexact – result needed rounding
• Invalid – the result is NaN
Rounding Schemes • Round toward zero (inward) – truncation or chopping of significand
• Round toward −∞ – truncation of a 2’s-complement number
• Round toward +∞ • Round toward nearest even – round to nearest value – if it’s a tie, round so that the ulp=0
13
Round toward 0
Round toward −∞
14
Round toward +∞
Round to nearest even
15
Rounding Algorithm
• Inputs needed:
– rounding mode (2 bits)
– sign
– LSB of significand
– guard (bit to the right of LSB)
– round (bit to the right of guard)
– sticky (OR of all bits to the right of round)
• Conditionally add 1 ulp • Possibly post-normalize • Can overflow
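The "conditionally add 1 ulp" decision can be sketched from those inputs. This is an illustrative sketch of the standard decision rules, with the guard bit taken as the first discarded bit:

```python
# Sketch of the rounding decision: given mode, sign, LSB, guard (g),
# round (r), and sticky (s) bits, decide whether to add 1 ulp.

def round_up(mode, sign, lsb, g, r, s):
    if mode == "chop":                   # toward zero: never add
        return 0
    if mode == "up":                     # toward +inf
        return 1 if sign == "+" and (g | r | s) else 0
    if mode == "down":                   # toward -inf
        return 1 if sign == "-" and (g | r | s) else 0
    # nearest even: guard set, and either a nonzero bit below it or odd LSB
    return 1 if g and (r | s | lsb) else 0

print(round_up("even", "+", 1, 1, 0, 0))  # 1: tie rounds up to even LSB
```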
16
Lecture 18 Floating-Point Operations
Unpacking • Separate the sign, exponent, and significand • Put in the implicit “1.” • Convert to internal format (perhaps an extended number) • Testing for special operands:
NaN , ± 0, ± ∞
1
[Figure: floating-point adder block diagram — unpack; operand swapper driven by exponent comparison; right shifter for alignment; adder/subtractor with sign logic; right/left shift normalize; round add; right-shift normalize; pack.]
Right Shifter • Barrel Shifter • Logarithmic Shifter
2
Barrel Shifter
[Figure: single-stage barrel shifter with shift selections Right0–Right3 and Left1.]

Logarithmic Shifter
[Figure: logarithmic shifter built from log₂ stages, each shifting by a fixed power of two.]
Comp Logic

Δe   Δs     Conclusion
+    X      s1 ≥ s2
0    0, +   s1 ≥ s2
0    −      s1 < s2, swap
−    X      s1 < s2, swap
Sign Logic

Sign1  Sign′2  s1 ≥ s2   Signout  Add/Sub
+      +       X         +        Add
−      −       X         −        Add
+      −       1         +        Sub
+      −       0         −        Sub
−      +       1         −        Sub
−      +       0         +        Sub

Sign′2 = Sign2 ⊕ Sub
4
Rounding Logic

Before post-normalization the significand is … z_{−l+1} z_{−l} | G R S. The round (R) and sticky (S) inputs to the rounding logic depend on the normalization shift:

Case                          LSB        R        S
Already normal                z_{−l}     G        R ∨ S
1-bit right-shift normalize   z_{−l+1}   z_{−l}   G ∨ R ∨ S
1-bit left-shift normalize    G          R        S

[Figure: round mode and the right/left normalization shift feed the round logic, which drives the round add.]
Round Logic
R = Guard S = Round ∨ Sticky
Mode   Sign  LSB  R  S   Round
Up     +     X    1  X   1
Up     +     X    X  1   1
Up     +     X    0  0   0
Up     −     X    X  X   0
Down   +     X    X  X   0
Down   −     X    1  X   1
Down   −     X    X  1   1
Down   −     X    0  0   0
Chop   X     X    X  X   0
Even   X     X    1  1   1
Even   X     0    1  0   0
Even   X     1    1  0   1
Even   X     X    0  X   0
5
[Figure: floating-point multiplier — unpack; add exponents and subtract bias; integer multiply of significands; XOR of signs; right-shift normalize; round add; right-shift normalize; pack.]
[Figure: floating-point divider — unpack; subtract exponents and add bias; integer divide of significands (quotient and remainder); XOR of signs; left-shift normalize; round add; right-shift normalize; pack.]
6
[Figure: exponent logic — for multiply, e = e1 + e2 − bias; for divide, e = e1 − e2 + bias; bias = 127.]
Addition of Special Operands

 +    | −∞   −0    +0    +∞
 −∞   | −∞   −∞    −∞    NaN
 −0   | −∞   −0    ±0*   +∞
 +0   | −∞   ±0*   +0    +∞
 +∞   | NaN  +∞    +∞    +∞

* −0 if rounding mode is Down, +0 otherwise
7
Subtraction of Special Operands

 −    | −∞   −0    +0    +∞
 −∞   | NaN  −∞    −∞    −∞
 −0   | +∞   ±0*   −0    −∞
 +0   | +∞   +0    ±0*   −∞
 +∞   | +∞   +∞    +∞    NaN

* −0 if rounding mode is Down, +0 otherwise
Multiplication of Special Operands

 ×    | −∞   −0    +0    +∞
 −∞   | +∞   NaN   NaN   −∞
 −0   | NaN  +0    −0    NaN
 +0   | NaN  −0    +0    NaN
 +∞   | −∞   NaN   NaN   +∞
8
Division of Special Operands

 ÷    | −∞   −0     +0     +∞
 −∞   | NaN  +∞*    −∞*    NaN
 −0   | +0   NaN    NaN    −0
 +0   | −0   NaN    NaN    +0
 +∞   | NaN  −∞*    +∞*    NaN

* Divide-by-zero exception
Addition/Subtraction in LNS

Lz = log z = log(x ± y) = log(x (1 ± y/x))
   = log x + log(1 ± y/x)
   = log x + log(1 ± log⁻¹(log y − log x))
   = Lx + φ(Ly − Lx)

Evaluate φ with a lookup table.
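The derivation can be sketched directly. In this illustration φ is computed with `log2` rather than read from a table, and the operands are swapped so the table argument Ly − Lx stays non-positive (a common convention, assumed here):

```python
# Sketch of LNS addition: Lz = Lx + phi(Ly - Lx), with
# phi(t) = log2(1 + 2**t); a real unit would read phi from a table.
import math

def lns_add(Lx, Ly):
    if Ly > Lx:
        Lx, Ly = Ly, Lx                      # keep table argument <= 0
    return Lx + math.log2(1.0 + 2.0 ** (Ly - Lx))

Lx, Ly = math.log2(6.0), math.log2(2.0)
print(2.0 ** lns_add(Lx, Ly))  # ~ 8.0, i.e. 6 + 2
```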
9
Arithmetic Unit for LNS
10
Lecture 19 Error and Error Control
Sources of Computational Errors • Representational Errors – Limited number of digits – Truncation Error
• Computational Errors
1
Example
Representational Errors FLP(r, p, A) • Radix r • Precision p = Number of radix-r digits • Approximation scheme A – – – –
chop round rtne (round to nearest even) chop(g) (chop with g guard digits kept in intermediate steps)
2
Relative Error
Error in Multiplication
3
Error in Division
Error in Addition
4
Error in Subtraction
• If x−y is small, the relative error can be very large – this is called cancellation, or loss of significance
• Arithmetic error η is also unbounded for subtraction without guard digits
The Need for Guard Digits
5
Example
Invalidated Associative Law
6
Using Unnormalized Arithmetic
Tell the truth about how much significance you are carrying.
Normalized Arithmetic with 2 Guard Digits
7
Other Laws that Do Not Hold True
Which Algebraic Equivalent Computation is Best? • No general procedure exists • Numerous empirical and theoretical results have been developed. • Two examples….
8
Example One

The roots x = −b ± √(b² − c) (of x² + 2bx + c = 0) can be rewritten by rationalizing:

−b ± √(b² − c) = (b² − (b² − c)) / (−b ∓ √(b² − c)) = −c / (b ∓ √(b² − c))

Choosing, for each root, the form that avoids subtracting nearly equal quantities:

x₁ = −b − √(b² − c)
x₂ = −c / (b + √(b² − c))
Example Two
9
Worst-Case Error Accumulation • In a sequence of computations, errors may accumulate
1024 ulp = 10 bits of precision lost
• An absolute error of m ulp has the effect of losing log2m bits of precision
Solution: Use multiple Guard Digits • Do computations in double precision for an accurate single precision result. • Have hardware keep several guard digits. • Reduce the number of cascade operations.
10
Kahan’s Summation Algorithm
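The algorithm referenced by this slide can be sketched in a few lines: carry a running compensation c that recovers the low-order bits lost in each addition.

```python
# Kahan (compensated) summation sketch: c accumulates the error of each
# floating-point addition and feeds it back into the next term.

def kahan_sum(xs):
    s, c = 0.0, 0.0
    for x in xs:
        y = x - c            # apply the correction
        t = s + y            # big + small: low-order bits of y may be lost
        c = (t - s) - y      # recover exactly what was lost
        s = t
    return s

vals = [1.0, 1e-16, 1e-16, 1e-16, 1e-16]
print(kahan_sum(vals), sum(vals))  # compensated sum keeps the tiny terms
```

With naive left-to-right summation each 1e-16 term is lost entirely (it is below half an ulp of 1.0), while the compensated sum retains them.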
Error Distribution and Expected Errors

         Maximum (worst case)   Average (expected)
Chop     r^(−p+1)               (r − 1) r^(−p) / (2 ln r)
Round    r^(−p+1) / 2           (1 + 1/r) (r − 1) r^(−p) / (4 ln r)
Expected error of rounding is 3/4 (not 1/2) that of chopping.
11
Forward Error Analysis • Estimating, or bounding the relative error in a computation • Requires specific constraints on the input operand ranges • Dependent on the specific computation
Automatic Error Analysis • Run selected (worst case) test cases with higher precision and observe the differences. • If differences are insignificant, then the computation is probably safe. • Only as good as test cases.
12
Significance Arithmetic • Roughly same as unnormalized arithmetic. • Information about the precision is carried in the result
Would have been misleading.
Noisy-mode Computation • Pseudo random digits, (rather than zeros) are inserted during left shifts performed for normalization. • Needs hardware support, or significant software overhead. • Several runs are compared. If they are comparable, then computation is good.
13
Interval Arithmetic • A value x is represented by an interval • Upper and lower bounds are found for each computation. Example:
• Unfortunately, intervals tend to widen until, after many steps, they become so wide as to be virtually worthless. • Helpful in choosing best reformulation of a computation, and for certifiable accuracy claims.
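A minimal sketch of the idea: each value is a [lo, hi] pair, and every operation returns bounds that are guaranteed to contain the true result. (For multiplication of intervals that may span zero, taking the min and max of the four endpoint products suffices; a careful implementation would also round the endpoints outward.)

```python
# Minimal interval-arithmetic sketch: values are (lo, hi) pairs.

def iadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

def imul(a, b):
    ps = [x * y for x in a for y in b]   # four endpoint products
    return (min(ps), max(ps))

x = (2.0, 3.0)
y = (4.0, 5.0)
print(iadd(x, y), imul(x, y))  # (6.0, 8.0) (8.0, 15.0)
```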
Backward Error Analysis • Computation error is analyzed in terms of equivalent errors in input values. • If inputs are not precise to this level anyway, then arithmetic errors should not be a concern.
14
Lecture 20 Precise and Certifiable Arithmetic
High Precision and Certifiability • Floating point formats – remarkably accurate in most cases – errors are now reasonably well understood – errors can be controlled with algorithmic methods
• Sometimes, however, this is inadequate – not precise enough – cannot guarantee bounds on the errors – “credibility-gap problem …. We don’t know how much of the computer’s answer to believe.” -- Knuth
1
Approaches for Coping with the Credibility Gap • Perform arithmetic calculations exactly – not always cost-effective
• Make arithmetic highly precise by raising the precision – Multiprecision arithmetic – Variable-precision arithmetic – Both methods make bad results less likely, but provide no guarantee
• Keep track of error accumulation – Certify the result or produce a warning
Other Issues • Algorithm and hardware verification – remember the Pentium
• Fault detection – Detect that a hardware failure has occurred
• Fault tolerance – Continued operation in the presence of hardware failures
2
Exact Arithmetic Proposals have included: • Continued fractions • Rational numbers • p-adic representations
Continued Fractions

Any unsigned rational number x = p/q has a unique continued-fraction expansion:

x = p/q = a₀ + 1/(a₁ + 1/(a₂ + ⋯ + 1/(a_{m−1} + 1/a_m)))

with a₀ ≥ 0;  a₁, …, a_{m−1} ≥ 1;  a_m ≥ 2.
3
Procedure to Convert to CF

s^(0) = x
a^(i) = ⌊s^(i)⌋
s^(i+1) = 1 / (s^(i) − a^(i)),  stopping when s^(i) = a^(i)

Invariant:  s^(i) = a^(i) + 1/s^(i+1)
Example: 277/642

s^(0) = 277/642    a^(0) = ⌊277/642⌋ = 0
s^(1) = 642/277    a^(1) = 2
s^(2) = 277/88     a^(2) = 3
s^(3) = 88/13      a^(3) = 6
s^(4) = 13/10      a^(4) = 1
s^(5) = 10/3       a^(5) = 3
s^(6) = 3/1        a^(6) = 3

So 277/642 = [0; 2, 3, 6, 1, 3, 3]. Truncating the expansion gives successively better rational approximations.
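The conversion procedure runs directly in exact rational arithmetic; applied to 277/642 it reproduces the expansion above:

```python
# Sketch of the CF conversion: a(i) = floor(s(i)),
# s(i+1) = 1/(s(i) - a(i)), done with exact rationals.
from fractions import Fraction

def to_cf(x):
    digits = []
    while True:
        a = x.numerator // x.denominator   # a(i) = floor(s(i))
        digits.append(a)
        if x == a:                          # expansion terminates
            break
        x = 1 / (x - a)                     # s(i+1)
    return digits

print(to_cf(Fraction(277, 642)))  # [0, 2, 3, 6, 1, 3, 3]
```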
4
Continuations • Represent a number by a finite number of digits, plus a continuation [Vuillemin, 1990] • A continuation is a procedure for obtaining: – the next digit – a new continuation • Notation: [digit, digit, …. , digit ; continuation]
• Notation: [digit, digit, …. , digit ; continuation]
Periodic CF Numbers
5
Arithmetic • Unfortunately, arithmetic with continued fractions is quite complicated.
Fixed-Slash Number Systems
• Rational number p/q
– p and q are fixed-width integers
– sign bit
– inexact bit: result has been “rounded” to fit the format
• Normalized if gcd(p, q) = 1
• Special values are representable:
– rational number: p ≠ 0, q ≠ 0
– ±0: p = 0, q odd
– ±∞: p odd, q = 0
– NaN: otherwise
6
Multiple representations • 1/1 = 2/2 = 3/3 = 4/4 = …. • How many bits are wasted due to multiple representations? • Two randomly selected numbers in [1, n] are relatively prime with a probability of 0.608 [Dirichlet] • 61% of the codes represent unique numbers • Waste is < 1 bit.
Rational Arithmetic • Reciprocal – exchange p and q
• Negation – change the sign
• Multiplication/Division – 2 multiplies, normalize
• Addition/Subtraction – 3 multiplies and 1 addition, normalize
• Normalization – compute gcd(p,q), the most costly step – rounding complex
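The operation counts above can be seen in a sketch of rational addition; the gcd normalization at the end is the expensive part:

```python
# Sketch of rational (slash-number) addition: three multiplications and
# one addition, then gcd normalization.
from math import gcd

def rat_add(p1, q1, p2, q2):
    p, q = p1 * q2 + p2 * q1, q1 * q2   # 3 multiplies + 1 add
    g = gcd(p, q)                        # normalization: the costly step
    return p // g, q // g

print(rat_add(1, 6, 1, 10))  # (4, 15): 1/6 + 1/10 = 4/15
```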
7
Floating Slash Number Systems
• Allows bits to be allocated where they are needed. • Ex: Integers q=1 (only needs one bit for q)
Multiprecision Arithmetic • Representing numbers using multi-word structures • Perform arithmetic by means of software routines that manipulate these structures. • Example applications: – Cryptography – Large prime research
8
Fixed Multi-precision Formats
Integer
Floating-Point
Multi-precision Arithmetic • Computation can be very slow • Depends heavily on available hardware arithmetic capabilities. • Research has been done on using parallel computers on distributed multi-precision data.
9
Variable-precision Arithmetic • Like multi-precision, except number of words can vary dynamically. • Helpful for both high precision and low precision needs. • “Little Endian” is slightly more efficient for addition.
Variable-precision Formats
Integer
Floating-Point
10
Variable-precision FP Addition X=u-word, Y=v-word with h shift
Using an exponent base of 2^k instead of 2 allows shifting to be done by indexing, rather than actual data movement.
FP Addition Algorithm
11
Digit Serial • One digit per clock cycle • Allows dispensing precision on demand.
Error Bounding Using Interval Arithmetic • Using interval arithmetic, a result range is computed. • Midpoint of interval is used for approximate value of result. • Interval size, w, is used as the extent of uncertainty, with worst case error w/2
12
Combining and Comparing Intervals
Interval Arithmetic
13
Interval Arithmetic
Using Interval Arithmetic to Choose Precision • Theorem 20.1 implies that when you narrow the intervals of the inputs, you narrow the intervals of the outputs. • Theorem 20.2 states that if you reduce the relative error of the inputs, you reduce the relative error of the outputs by the same amount. • You can devise a practical strategy for obtaining results with a desired bound on error.
14
Choosing a Precision • Run a trial calculation with – – – –
p radix-r digits of precision wmax = maximum interval width of result ε is desired bound on absolute error if wmax ≤ ε then trial result is good, otherwise ….
• Rerun the calculation with – q radix-r digits of precision, where – q = p + ⌈log_r wmax − log_r ε⌉
Adaptive and Lazy Arithmetic • Not all computations require the same precision. • Adaptive arithmetic systems can deliver varying amounts of precision • Lazy evaluation = postpone all computations until they become irrelevant, or unavoidable, produce digits on demand.
15
Lecture 21 Square Root
Pencil & Paper Algorithm

z   Radicand              z_{2k−1} z_{2k−2} … z₁ z₀    (2k digits)
q   Square root           q_{k−1} … q₀                 (k digits)
s   Remainder (z − q²)    s_k s_{k−1} … s₀             (k+1 digits)

z^(0) = q^(0) = s^(0) = 0                              (initialization)

z^(i) = r² z^(i−1) + (z_{2(k−i)+1} z_{2(k−i)})_r
q^(i) = r q^(i−1) + q_{k−i}
s^(i) = z^(i) − (q^(i))²                               (invariant)
      = r² z^(i−1) + (z_{2(k−i)+1} z_{2(k−i)})_r − (r q^(i−1) + q_{k−i})²
      = r² z^(i−1) + (z_{2(k−i)+1} z_{2(k−i)})_r − r² (q^(i−1))² − 2r q^(i−1) q_{k−i} − q_{k−i}²
      = r² s^(i−1) + (z_{2(k−i)+1} z_{2(k−i)})_r − (2r q^(i−1) + q_{k−i}) q_{k−i}
1
Decimal Interpretation

s^(i) = r² s^(i−1) + (z_{2(k−i)+1} z_{2(k−i)})_r − (2r q^(i−1) + q_{k−i}) q_{k−i}

The first two terms shift the remainder and bring down two digits; the last term doubles the partial root, shifts it left one position, appends the new digit, and multiplies by the new digit.

0 ≤ s^(i) < 2q^(i)
If s^(i) ≥ 2q^(i), then q_{k−i} is too small; if s^(i) < 0, then q_{k−i} is too big.
Decimal Example
2
Binary Interpretation

s^(i) = r² s^(i−1) + (z_{2(k−i)+1} z_{2(k−i)})_r − (2r q^(i−1) + q_{k−i}) q_{k−i}

The first two terms shift the remainder two places left and bring down two bits; the subtracted term is 0 when the new digit is 0 and (q^(i−1) 01)₂ when it is 1.

0 ≤ s^(i) < 2q^(i)
If s^(i) ≥ 2q^(i), then the digit is too small; if s^(i) < 0, then it is too big.
Binary Example
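As a worked illustration of the binary restoring recurrence (an integer sketch, not a hardware design): try appending digit 1; if the remainder would go negative, restore it and take digit 0.

```python
# Restoring binary square root of an integer radicand, digit by digit.

def isqrt_restoring(z, k):
    """k-bit square root of z (z < 4**k) and the remainder z - q*q."""
    q, s = 0, 0
    for i in range(k - 1, -1, -1):
        s = (s << 2) | ((z >> (2 * i)) & 3)   # bring down two bits
        trial = (q << 2) | 1                  # (2q shifted left) with 01 appended
        if s >= trial:                        # trial digit 1 succeeds
            s -= trial
            q = (q << 1) | 1
        else:                                 # restore: digit is 0
            q = q << 1
    return q, s

print(isqrt_restoring(118, 4))  # (10, 18): 10**2 + 18 == 118
```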
3
Binary in Dot Notation
Restoring Shift/Subtract Algorithm

z   IEEE radicand          z₁ z₀ . z₋₁ z₋₂ … z₋ₗ
q   Square root            1 . q₋₁ … q₋ₗ
s   Remainder (z − q²)     s₁ s₀ . s₋₁ s₋₂ … s₋ₗ

q^(0) = 1,  s^(0) = z − 1                          (initialization)

q^(i) = q^(i−1) + 2^{−i} q₋ᵢ

s^(i) = z − (q^(i))²                                (invariant)
      = z − (q^(i−1) + 2^{−i} q₋ᵢ)²
      = z − (q^(i−1))² − 2·2^{−i} q^(i−1) q₋ᵢ − 2^{−2i} q₋ᵢ²
      = s^(i−1) − 2^{−i} (2q^(i−1) + 2^{−i} q₋ᵢ) q₋ᵢ          (recurrence)

2^i s^(i) = 2·2^{i−1} s^(i−1) − (2q^(i−1) + 2^{−i} q₋ᵢ) q₋ᵢ   (scaled recurrence)
4
Restoring Square-Root Interpretation

z   Radicand               z₁ z₀ . z₋₁ z₋₂ … z₋ₗ
q   Square root            1 . q₋₁ … q₋ₗ
s   Remainder (z − q²)     s₁ s₀ . s₋₁ s₋₂ … s₋ₗ

q^(0) = 1,  s^(0) = z − 1                          (initialization)
q^(i) = q^(i−1) + 2^{−i} q₋ᵢ
s^(i) = z − (q^(i))²                                (invariant)

2^i s^(i) = 2·2^{i−1} s^(i−1) − (2q^(i−1) + 2^{−i} q₋ᵢ) q₋ᵢ   (recurrence)

The subtracted term is 0 when q₋ᵢ = 0 and (1 q₋₁ . q₋₂ … q₋ᵢ₊₁ 0 1)two when q₋ᵢ = 1.
Example of Restoring Algorithm
5
Sequential Shift/Subtract Restoring Square-Rooter
IEEE Square Root
• The square root of a representable significand can never fall exactly midway between two machine numbers (the exact-midway case of round-to-nearest does not arise).
• Rounding can therefore be decided from the last digit q₋ₗ together with the sign and zero-ness of the final remainder.
6
Binary Non-Restoring Algorithm Interpretation

q^(0) = 1,  s^(0) = z − 1                          (initialization)
q^(i) = q^(i−1) + 2^{−i} q₋ᵢ,   q₋ᵢ ∈ {−1, 1}
s^(i) = z − (q^(i))²                                (invariant)
2^i s^(i) = 2·2^{i−1} s^(i−1) − (2q^(i−1) + 2^{−i} q₋ᵢ) q₋ᵢ   (recurrence)

Keep two registers: Q = q^(i−1) (partial root) and Q* = q^(i−1) − 2^{−(i−1)} (diminished partial root). Then:

q₋ᵢ = 1:   2q^(i−1) + 2^{−i} = (Q 01), so subtract (Q 01) shifted left 1
q₋ᵢ = −1:  −(2q^(i−1) − 2^{−i}) = −(Q* 11), so add (Q* 11) shifted left 1
High-Radix Square Root

z   IEEE radicand          z₁ z₀ . z₋₁ z₋₂ … z₋ₗ
q   Square root            1 . q₋₁ … q₋ₗ
s   Scaled remainder r^i (z − q²)   s₁ s₀ . s₋₁ s₋₂ … s₋ₗ

q^(0) = 1,  s^(0) = z − 1                          (initialization)
q^(i) = q^(i−1) + r^{−i} q₋ᵢ
s^(i) = r^i (z − (q^(i))²)                          (invariant)
s^(i) = r s^(i−1) − (2q^(i−1) + r^{−i} q₋ᵢ) q₋ᵢ     (recurrence)
7
High-Radix Square Root: Radix-4 Digit Cases

For radix 4:  s^(i) = 4 s^(i−1) − (2q^(i−1) + 4^{−i} q₋ᵢ) q₋ᵢ    (recurrence)

Keep registers Q = q^(i−1) and Q* = q^(i−1) − 4^{−(i−1)}:

q₋ᵢ = 2:   s^(i) = 4s^(i−1) − (4q^(i−1) + 4^{−i+1}) = 4s^(i−1) − (Q 010) shifted left 2;   Q ← Q 10,  Q* ← Q 01
q₋ᵢ = 1:   s^(i) = 4s^(i−1) − (2q^(i−1) + 4^{−i})   = 4s^(i−1) − (Q 001) shifted left 1;   Q ← Q 01,  Q* ← Q 00
q₋ᵢ = 0:   s^(i) = 4s^(i−1);                                                               Q ← Q 00,  Q* ← Q* 11
q₋ᵢ = −1:  s^(i) = 4s^(i−1) + (2q^(i−1) − 4^{−i})   = 4s^(i−1) + (Q* 111) shifted left 1;  Q ← Q* 11, Q* ← Q* 10
q₋ᵢ = −2:  s^(i) = 4s^(i−1) + (4q^(i−1) − 4^{−i+1}) = 4s^(i−1) + (Q* 110) shifted left 2;  Q ← Q* 10, Q* ← Q* 01
Digit Selection • As in division, digit selection can be based on examining just a few bits of the partial remainder s(i). • s(i) can be kept in carry-save form • The exact same lookup table can be used for square-root as is used for division if the digit set {−1, −1/2, 0, 1/2, 1} is used.
8
Square-Root by Convergence

Using Newton–Raphson:  x^(i+1) = x^(i) − f(x^(i)) / f′(x^(i))

With f(x) = x² − z:

x^(i+1) = (x^(i) + z / x^(i)) / 2

Error:  δ^(i+1) = √z − x^(i+1) = −(√z − x^(i))² / (2x^(i)) = −(δ^(i))² / (2x^(i))
Square-Root by Convergence • Convergence is quadratic. (Number of bits accuracy doubles each iteration.) • Since δ is negative, the recurrence approaches the answer from above. • An initial table-lookup step can be used to obtain a better initial estimate, and reduce the number of iterations.
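The recurrence is a one-liner; the initial estimate 1.5 below is an arbitrary illustration (a real unit would take it from a lookup table, as noted above):

```python
# Newton-Raphson square root: x(i+1) = (x(i) + z/x(i)) / 2; the number of
# correct bits roughly doubles each iteration.

def newton_sqrt(z, x0, iters=6):
    x = x0
    for _ in range(iters):
        x = 0.5 * (x + z / x)
    return x

print(newton_sqrt(2.0, 1.5))  # ~ 1.41421356...
```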
9
Example
Approximation Functions

For fractional z, 1/2 ≤ z < 1:
    x^(0) = (1 + z) / 2                                              (error < 6.07%)

For integer z = 2^{2m−1} + z_rest:
    x^(0) = 2^{m−1} + 2^{−(m+1)} z = (3 × 2^{m−2}) + 2^{−(m+1)} z_rest   (error < 6.07%)
10
Division-Free Variants
• With a reciprocal circuit or lookup table:
– each iteration requires a table lookup, a 1-bit shift, 2 multiplications, and 2 additions
– multiplication must be twice as fast as division to make this cost-effective
– convergence is less than quadratic because of error in the reciprocal

x^(i+1) = x^(i) + 0.5 (1/x^(i)) (z − (x^(i))²)
Division-Free Variants • With Newton-Raphson approximation of reciprocal – each iteration a 1-bit shift, 3 multiplications, and 2 additions – convergence rate will be less than quadratic because of error in reciprocal – two equations can be computed in parallel
x (i +1) = x (i ) + 0.5 (x ( i ) + zy (i ) ) y (i +1) = y ( i ) (2 − x ( i ) y ( i ) )
11
Example
Division-Free Variants
Using Newton–Raphson with f(x) = 1/x² − z (root at x = 1/√z):
x(i+1) = x(i) − f(x(i)) / f′(x(i)) = 0.5 x(i) ( 3 − z ( x(i) )² )
• Solve for the inverse of the square-root instead:
– Requires 3 multiplications and 1 addition per iteration
– Quadratic convergence
– Final answer: √z = z × x(k)
– Used in the Cray-2 supercomputer, 1989
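A minimal sketch of this inverse-square-root iteration (floats stand in for the machine's internal arithmetic; the constant starting estimate is an assumption — a real design would use a table lookup):

```python
# Inverse square root by Newton-Raphson: x <- 0.5 * x * (3 - z * x * x).
# Each step is 3 multiplications and 1 addition; no division anywhere.
# The final multiply by z recovers sqrt(z) = z * (1/sqrt(z)).
def rsqrt_sqrt(z, iters=6):
    x = 1.0                          # crude initial estimate (assumed)
    for _ in range(iters):
        x = 0.5 * x * (3 - z * x * x)
    return z * x                     # final answer: sqrt(z)
```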
Example
Parallel Hardware Square-Root • Square-Root is very similar to Division • Usually possible to modify divide units to do square root. • A non-restoring square-root array can be derived directly from the dot notation, similar to the way the non-restoring divide array was derived.
Non-Restoring Square-Root Array
Lecture 22 The CORDIC Algorithms
CORDIC • Coordinate Rotation Digital Computer • Invented in the late 1950s • Based on the observation that: – if you rotate a unit-length vector (1, 0) – by an angle z – its new end-point will be at (cos z, sin z)
• Can evaluate virtually all functions of interest • k iterations are required for k bits of accuracy
[Photos: CORDIC hardware implementations from 1959, 1971, and 1977]
Rotations and Pseudo-Rotations
True Rotations
Pseudo-Rotations
After m Real Rotations: rotate by angles α1, α2, …, αm
After m Pseudo-Rotations: rotate by angles α1, α2, …, αm
Expansion Factor K
• Byproduct of the pseudo-rotations: each pseudo-rotation expands the vector length • Depends on the rotation angles • However, if we always use the same rotation angles (with positive or negative signs), then – K is a constant – Can be precomputed and stored – Its reciprocal can also be precomputed and stored
Basic CORDIC Iterations
Pick fixed rotation angles ±α(i) such that: α(i) = tan⁻¹ 2⁻ⁱ, i.e., tan α(i) = 2⁻ⁱ

i  α(i)   tan α(i)
0  45.0°  1.000
1  26.6°  0.500
2  14.0°  0.250
3   7.1°  0.125
CORDIC Pseudo-Rotations
α(i) = tan⁻¹ 2⁻ⁱ, tan α(i) = 2⁻ⁱ
Approximate Angle Table
The angle table can be stored in degrees or in radians.
CORDIC Iterations with Table
Basic CORDIC Iterations • Each CORDIC rotation requires: – 2 shifts – 1 table lookup – 3 additions
• By rotating by the same set of angles (with + or − signs), the expansion factor K can be precomputed
Rotation Mode Rules • Initialize: – z(0) = z – x(0) = x – y(0) = y
• Iterate with di = sign(z(i)) • Finally (after m steps): – x(m) = K ( x cos z − y sin z ) – y(m) = K ( y cos z + x sin z ) – z(m) = 0
Example
Example (First 3 Rotations)
Trig Function Computation • Initialize: – z(0) = z – x(0) = 1/K = 0.607 252 935 … – y(0) = 0
• Iterate with di = sign(z(i)) • Finally (after m steps): – z ≈ 0 – x ≈ cos z – y ≈ sin z – y/x ≈ tan z
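The rotation-mode rules above can be sketched in Python (floats stand in for fixed-point registers; the table of angles α(i) = tan⁻¹ 2⁻ⁱ and the scale factor K are precomputed):

```python
import math

N = 32                                         # iterations ~ bits of accuracy
ALPHAS = [math.atan(2.0 ** -i) for i in range(N)]
K = 1.0
for i in range(N):
    K *= math.sqrt(1 + 2.0 ** (-2 * i))        # K = 1.64676...

def cordic_sin_cos(z):
    """Rotation mode with x(0) = 1/K, y(0) = 0: returns (cos z, sin z)."""
    x, y = 1.0 / K, 0.0
    for i in range(N):
        d = 1.0 if z >= 0 else -1.0            # d_i = sign(z(i))
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ALPHAS[i]                     # drive z toward 0
    return x, y
```

Each iteration is exactly the two shifts, three additions, and one table lookup listed above; only the final scaling has been folded into the x(0) = 1/K initialization.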
Precision in CORDIC • For k bits of precision in trig functions, k iterations are needed. • For large i, tan⁻¹ 2⁻ⁱ ≈ 2⁻ⁱ • For i > k, the change in z is < ulp • Convergence is guaranteed for angles in the range −99.7° ≤ z ≤ 99.7° – (99.7° is the sum of all angles in the table)
• For angles outside this range, use standard trig identities to convert angle to one in the range
Vectoring Mode Rules • Initialize: z(0) = z, x(0) = x, y(0) = y • Iterate with di = −sign( x(i) y(i) )
– this forces y(m) to 0
• Finally (after m steps): – x(m) = K √( x² + y² ) – y(m) = 0 – z(m) = z + tan⁻¹( y / x )
Trig Function Computation • Initialize: – z(0) = 0 – x(0) = 1 – y(0) = y
• Iterate with di = −sign(x(i) y(i)) = −sign(y(i)) • Finally (after m steps): – z ≈ tan⁻¹ y – Use an identity (e.g., tan⁻¹ y = π/2 − tan⁻¹(1/y) for large y) to limit the range of the fixed-point numbers
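A corresponding vectoring-mode sketch (same float-for-fixed-point assumptions as before; valid for x > 0):

```python
import math

N = 40
ALPHAS = [math.atan(2.0 ** -i) for i in range(N)]

def cordic_atan(y, x=1.0):
    """Vectoring mode with z(0) = 0: drives y to 0, returns z ~ atan(y/x)."""
    z = 0.0
    for i in range(N):
        d = -1.0 if y >= 0 else 1.0        # d_i = -sign(y(i)), for x > 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ALPHAS[i]
    return z    # x has grown to K * sqrt(x0^2 + y0^2) along the way
```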
CORDIC Hardware
Bit-Serial CORDIC • For low-cost, low-speed applications (e.g., handheld calculators), bit-serial implementations are possible.
Generalized CORDIC
Circular Rotation Mode
Circular Vectoring Mode
Linear Rotation Mode
Linear Vectoring Mode
Hyperbolic Rotation Mode
Hyperbolic Vectoring Mode
Vector Length and Rotation Angle Circular
Linear
Hyperbolic
Convergence • Circular and linear CORDIC converge for suitably restricted values of x, y, and z. • Hyperbolic CORDIC will not converge in all cases. • A simple solution to this problem is to repeat the steps i = 4, 13, 40, 121, …, j, 3j+1, … • With this precaution, hyperbolic CORDIC also converges for suitably restricted values of x, y, and z
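The repeated-iteration fix can be sketched for hyperbolic rotation mode (floats stand in for fixed-point; here e(i) = tanh⁻¹ 2⁻ⁱ, and the scale factor is the product of √(1 − 2⁻²ⁱ) over the same index sequence):

```python
import math

def hyp_indices(n):
    """Indices 1..n with i = 4, 13, 40, ... repeated once for convergence."""
    out, repeat = [], 4
    for i in range(1, n + 1):
        out.append(i)
        if i == repeat:
            out.append(i)
            repeat = 3 * repeat + 1
    return out

IDX = hyp_indices(40)
EH = [0.0] + [math.atanh(2.0 ** -i) for i in range(1, 41)]  # e(i), i >= 1
KH = 1.0
for i in IDX:
    KH *= math.sqrt(1 - 2.0 ** (-2 * i))       # hyperbolic scale factor < 1

def cordic_sinh_cosh(z):
    """Hyperbolic rotation mode: returns (cosh z, sinh z) for |z| < ~1.11."""
    x, y = 1.0 / KH, 0.0
    for i in IDX:
        d = 1.0 if z >= 0 else -1.0
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * EH[i]
    return x, y
```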
Using CORDIC
Directly computes:
sin, cos, tan⁻¹( y / x ), y + x z, sinh, cosh, tanh⁻¹, ×, ÷
Also directly computes:
√( x² + y² ), √( x² − y² ), e^z = sinh z + cosh z
Using CORDIC Indirectly Computes:
tan z = sin z / cos z
tanh z = sinh z / cosh z
ln w = 2 tanh⁻¹( (w − 1) / (w + 1) )
log_b w = K × ln w  (K = 1 / ln b)
w^t = e^( t ln w )
cos⁻¹ w = tan⁻¹( √(1 − w²) / w )
sin⁻¹ w = tan⁻¹( w / √(1 − w²) )
cosh⁻¹ w = ln( w + √(w² − 1) )
sinh⁻¹ w = ln( w + √(w² + 1) )
√w = √( (w + 1/4)² − (w − 1/4)² )
Summary Table of CORDIC Algorithms
Variable Factor CORDIC • Allows termination of CORDIC before m iterations • Allows the digit set {−1, 0, 1} • Must compute the expansion factor via a recurrence • At the end, results must be divided by the square root of (K(m))² • Constant-factor CORDIC is almost always preferred.
High Speed CORDIC • Do the first k/2 iterations as normal. – z becomes very small
• Combine the remaining k/2 iterations into one step.
• K doesn't change, because the late expansion factors √(1 + 2⁻²ⁱ) are 1 to working precision.
Lecture 23 Variations in Function Evaluation
Choices are Good • CORDIC can compute virtually all elementary functions of interest. • However, alternatives may have advantages: – implementation – adaptability to a specific technology – performance
Recurrence Methods
• Normalization: make u converge to a constant • Additive normalization: make u converge to a constant by adding a term to u each iteration • Multiplicative normalization: make u converge to a constant by multiplying u by a chosen factor each iteration
Which is better ? • Additive is easier to evaluate. – Addition is cheap – CORDIC is additive
• Multiplicative is slower, but usually converges faster – often quadratic convergence – steps are more expensive, but there are fewer of them – sometimes multiplicative reduces to shift & add
Logarithm Multiplicative Normalization:
x(0) = x
y(0) = y
After m steps:
Read from a table
Logarithm, continued Domain of convergence:
Rate of convergence: one bit per iteration for large k.
Invariants: y(i) = y + ln( x / x(i) ), x(i) = x e^( y − y(i) )
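The multiplicative-normalization scheme can be sketched as follows (Python floats; the greedy digit selection below is an illustrative assumption, not the slide's exact selection rule):

```python
import math

# Multiplicative normalization for ln x, x in [0.5, 2):
# drive x toward 1 with factors (1 + d * 2^-i), d in {-1, 0, 1},
# so that y = ln(x0) - ln(x_current) holds at every step.
N = 40
LN = {(i, d): math.log(1 + d * 2.0 ** -i)
      for i in range(1, N) for d in (-1, 1)}

def ln_mult_norm(x):
    y = 0.0
    for i in range(1, N):
        for d in (-1, 1):                  # greedy: apply if it helps
            f = 1 + d * 2.0 ** -i
            if abs(x * f - 1) < abs(x - 1):
                x *= f
                y -= LN[(i, d)]            # table value ln(1 + d*2^-i)
    return y                               # ~ ln(x0) once x ~ 1
```

In hardware the multiply by (1 + d 2⁻ⁱ) is a shift-and-add, and the ln(1 ± 2⁻ⁱ) terms come from a small table, matching the "Read from a table" note above.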
A Radix-4 Logarithm
Scale so digit selection is the same each iteration.
Read from a table
Radix-4 Logarithm Digit Selection
Initialization: u = 4(δx − 1), y = −ln δ
Range of convergence: δ = 2 → 1/2 ≤ x ≤ 5/8; δ = 1 → 5/8 ≤ x ≤ 1
A Clever Base-2 Log Method
For 1 ≤ x < 2: y = log₂ x = 0.y₋₁y₋₂…, i.e., x = 2^(0.y₋₁y₋₂…)
Step 1: Square x: x := x² = 2^(2y) = 2^(y₋₁.y₋₂…)
Step 2: If x ≥ 2, then x = 2^(1.y₋₂…), so y₋₁ = 1;
Step 3: divide x by 2: x := x/2 = 2^(0.y₋₂…). Else x = 2^(0.y₋₂…) < 2, so y₋₁ = 0.
Step 4: Go to step 1 and repeat to get the next digit.
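The four steps above translate almost literally into code (a sketch; in floating point the repeated squaring limits how many bits are trustworthy):

```python
# Base-2 log by repeated squaring, for x in [1, 2):
# squaring doubles y = log2(x); the integer bit that appears (x >= 2)
# is the next fraction bit of y, after which x is renormalized.
def log2_bits(x, k):
    bits = []
    for _ in range(k):
        x = x * x                  # step 1: square
        if x >= 2.0:               # step 2: next bit is 1
            bits.append(1)
            x /= 2.0               # step 3: renormalize into [1, 2)
        else:
            bits.append(0)         # else: next bit is 0
    return bits                    # step 4 is the loop itself
```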
Hardware for Clever Log2 Method
Generalization to Base-b Logarithms
Exponentiation
Invariants: x(i) = x + ln( y / y(i) ), y(i) = y e^( x − x(i) )
The terms ln(1 ± 2⁻ⁱ) are read from a table.
Same recurrence we used for the logarithm, with x and y switched.
Initialization: x(0) = x, y(0) = 1. As x(k) goes to 0, y(k) goes to e^x. Rate of convergence: one bit per iteration for large k.
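A sketch of this additive-normalization exponential with digits {0, 1} (the digit rule and input range here are illustrative assumptions):

```python
import math

# Additive normalization for e^x, x in [0, 0.86]: drive x to 0 by
# subtracting table entries ln(1 + 2^-i) while multiplying y by the
# matching factor (1 + 2^-i) -- a shift-and-add per iteration.
N = 40
LN1P = [math.log(1 + 2.0 ** -i) for i in range(1, N)]

def exp_add_norm(x):
    y = 1.0
    for i in range(1, N):
        if x >= LN1P[i - 1]:       # d_i = 1: the term still fits
            x -= LN1P[i - 1]
            y *= 1 + 2.0 ** -i     # shift-and-add update of y
    return y                       # ~ e^(x0) once x ~ 0
```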
Elimination of k/2 iterations • After k/2 iterations, ln(1 ± 2⁻ⁱ) ≈ ± 2⁻ⁱ • When x(j) = 0.00…0xx…x, ln(1 + x(j)) ≈ x(j), allowing us to perform the final computation step y(k) = y(j) × (1 + x(j))
• This last step combines k/2 iterations, but contains a "true" multiplication
Radix-4 ex
Read from a table
Scale so digit selection is the same each iteration.
Radix-4 Exponentiation Digit Selection
Initialization: u = 4(x − δ), y = e^δ
Range of convergence:
δ = −1/2 → x ≤ −1/4
δ = 0 → −1/4 ≤ x ≤ 1/4
δ = 1/2 → x ≥ 1/4
General Exponentiation: x^y = e^( y ln x )
1. Compute ln x
2. Multiply: y × ln x
3. Compute exp( y ln x )
General Approach to Division Additive Normalization Method
Scale so digit selection is the same each iteration.
Invariants:
Unscaled: s(i) = z − q(i) × d
Scaled: s(i) = z rⁱ − q(i) × d
Digit selection: γ(i) ≈ r ( rⁱ q* − q(i) ), where q* is an estimate of the quotient
General Approach to Square Root Multiplicative Normalization Method
Unscaled: Initialization: x(0) = y(0) = z. Invariant: z x(i) = y(i)².
Convergence: as x(i) goes to 1, y(i) goes to √z.
Scaled: Initialization: u(0) = z − 1, y(0) = z. Invariant: z ( 2⁻ⁱ u(i) + 1 ) = y(i)². Convergence: as 2⁻ⁱ u(i) goes to 0, y(i) goes to √z.
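A sketch of the multiplicative-normalization square root, maintaining the invariant z·x(i) = y(i)² (the greedy digit selection is an illustrative assumption):

```python
# Multiplicative normalization for sqrt(z), z in [0.5, 2):
# multiply x by c^2 and y by c, with c = 1 + d * 2^-i, so the invariant
# z * x = y * y always holds; as x -> 1, y -> sqrt(z).
def sqrt_mult_norm(z, n=40):
    x, y = z, z                       # x(0) = y(0) = z
    for i in range(1, n):
        for d in (-1, 1):             # greedy: apply if it moves x toward 1
            c = 1 + d * 2.0 ** -i
            if abs(x * c * c - 1) < abs(x - 1):
                x *= c * c
                y *= c
    return y
```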
General Approach to Square Root Additive Normalization Method
Unscaled: Initialization: x(0) = z, y(0) = 0. Invariant: x(i) = z − y(i)².
Convergence: as x(i) goes to 0, y(i) goes to √z.
Scaled: Initialization: u(0) = z, y(0) = 0. Invariant: 2⁻ⁱ u(i) = z − y(i)².
Convergence: as 2⁻ⁱ u(i) goes to 0, y(i) goes to √z.
General Computation Steps • Preprocessing steps: – Use identities to bring operand within appropriate ranges – Initialize recurrences • sometimes using approximation functions
• Processing steps: – Iterations
• Postprocessing steps – Compute final result – Normalization
Approximating Functions Taylor Series
Maclaurin Series (a = 0 )
Error Bound
Error Bound
Horner’s Method
Coefficients c(i) can be stored in a table, or computed on the fly from c(i-1). Ex: For Sin(x)
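Horner's rule and a table of Maclaurin coefficients for sin(x) can be sketched as:

```python
import math

def horner(coeffs, x):
    """Evaluate c[0] + c[1] x + ... + c[n] x^n with n multiplies, n adds."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

# Maclaurin coefficients of sin(x): x - x^3/3! + x^5/5! - x^7/7! + x^9/9!
SIN_C = [0.0, 1.0, 0.0, -1 / 6, 0.0, 1 / 120, 0.0, -1 / 5040, 0.0, 1 / 362880]
```

Truncating at x⁹ leaves an error below the x¹¹/11! term, so the approximation is good to roughly 10⁻¹¹ for |x| ≤ 0.5.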
Divide and Conquer Evaluation • Divide the input x into xH (the integer and high-order fraction bits) and xL (the low-order fraction bits)
• Write a Taylor series expansion about x = xH: f(x) = f(xH) + xL f′(xH) + ( xL² / 2 ) f″(xH) + …
• Approximate with the first two terms: f(x) ≈ f(xH) + xL f′(xH)
f(xH) and f′(xH) are obtained by table lookup
Rational Approximation
Merged Arithmetic • When very high performance is required, you can build hardware to evaluate nonelementary functions. – Higher speeds – Lower component count – Lower power
Example
Lecture 24 Arithmetic by Table Lookup
Uses of Lookup Tables • Digit Selection in high-radix arithmetic • Initial result approximation to speed-up iterative methods • Store CORDIC constants • Store polynomial coefficients • Can be mixed with logic in hybrid schemes
Advantages of Table Lookup
• Memory is much denser than logic
• Easy to lay out in VLSI
• Large memories are cheap and practical
• Design and testing are easy
• Flexibility for last-minute changes
• Any function can be implemented
• Can be error-encoded to make it more robust
Direct Table Lookup • Size – input operands = u bits total – output results = v bits total – table size = 2^u × v bits
• Flexible • Size is impractical in most cases – exponential growth in u dominates
• Best for unary (single-operand) functions
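For instance, a direct table for a unary function with u = v = 8 bits (a reciprocal table; the input/output scaling convention here is an assumption for illustration):

```python
# Direct lookup: 2^8 entries, one per input pattern. Entry i holds
# round(256 / x) for x = 1 + i/256, i.e. 1/x in fixed point scaled by 256.
RECIP = [round(256 / (1 + i / 256)) for i in range(256)]

def recip_lookup(i):
    """i encodes x = 1 + i/256 in [1, 2); returns ~256/x."""
    return RECIP[i]
```

Doubling u to 16 bits would already need 2¹⁶ entries, and a binary operation on two 16-bit inputs would need 2³², which is the exponential growth the slide warns about.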
Indirect Table Lookup • Hybrid scheme that uses preprocessing and postprocessing blocks to reduce the size of the table.
Binary to Unary Reduction • Idea: Evaluate a binary function using an auxiliary unary function • Indirect method – Preprocess: Convert binary operands to unary operand(s). – Table lookup – Postprocess: Convert unary results to binary result
Example 1: Log(x ± y)
Lz = log z = log( x ± y ) = log( x (1 ± y/x) ) = log x + log(1 ± y/x)
= log x + log(1 ± log⁻¹( log y − log x ))
= Lx + log(1 ± log⁻¹( Ly − Lx )) = Lx + φ( Ly − Lx )
Implement φ with a lookup table.
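A sketch of this binary-to-unary reduction for log-domain addition (base-2 logs; the table step size is an illustrative assumption):

```python
import math

# LNS-style addition: given Lx = log2 x and Ly = log2 y, compute
# log2(x + y) = Lx + phi(Ly - Lx), with phi(d) = log2(1 + 2^d)
# read from a table indexed by the quantized difference d <= 0.
STEP = 1 / 256
PHI = {k: math.log2(1 + 2.0 ** (k * STEP)) for k in range(-16 * 256, 1)}

def lns_add(lx, ly):
    lx, ly = max(lx, ly), min(lx, ly)   # ensure the difference is <= 0
    k = round((ly - lx) / STEP)
    if k < -16 * 256:                   # y negligible next to x
        return lx
    return lx + PHI[k]
```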
Example 2: Multiplication
xy = ¼ [ (x + y)² − (x − y)² ]
x, y are k-bit values → xy requires two (k+1)-bit squaring lookups
Observation: the least-significant bits of (x + y) and (x − y) are either both even or both odd.
¼ [ (x + y)² − (x − y)² ] = ⌊(x + y)/2⌋² − ⌊(x − y)/2⌋² + εy
where ε = 0 if x + y is even, ε = 1 if x + y is odd.
xy then requires two k-bit lookups and a three-operand addition.
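The squares-table identity, including the εy correction, can be checked directly in code (table width k = 8 is an illustrative choice):

```python
# Multiplication from a table of squares:
#   xy = floor((x+y)/2)^2 - floor((x-y)/2)^2 + (y if x+y is odd else 0)
K = 8
SQUARES = [v * v for v in range(2 ** K)]       # k-bit squaring table

def mul_by_squares(x, y):
    if x < y:
        x, y = y, x                            # multiplication commutes
    s, t = x + y, x - y                        # s and t have the same parity
    p = SQUARES[s // 2] - SQUARES[t // 2]
    if s % 2 == 1:                             # both odd: add correction
        p += y
    return p
```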
Further Reductions for Squaring • The two low-order bits of x² are 00 (x even) or 01 (x odd), so the next-to-lowest bit is always 0 and need not be stored in the table. • Split table approach – the operand is split and two smaller tables are used – the combined size of the two split tables is less than that of a single table – the results of the two tables are combined to form the final result.
Tables in Bit-Serial Arithmetic Example: Autoregressive Filter
Define the Lookup Table
LSB-First Bit Serial Filter
Interpolating Memory • Instead of programming a table f(x) for all possible values of an operand x … • only program the table for a small set of values of x at regular intervals (power-of-2 interval sizes are best) • To evaluate the function f(x): – read the table at the end-points of the interval that contains x, and interpolate between them to approximate f(x)
Interpolation: for xlo ≤ x ≤ xhi, if f(xlo) and f(xhi) are known, then f(x) may be approximated:
f(x) ≈ f(xlo) + ( x − xlo ) · ( f(xhi) − f(xlo) ) / ( xhi − xlo )
= f(xlo) + Δx · ( f(xhi) − f(xlo) ) / ( xhi − xlo )
= a + b · Δx
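A sketch of an interpolating memory for log₂, with 16 power-of-2 intervals over [1, 2):

```python
import math

# Interpolating memory: store f at 2^H + 1 breakpoints, then evaluate
# f(x) ~ a + b * dx inside the interval selected by the high bits of x.
H = 4
F = [math.log2(1 + k / 2 ** H) for k in range(2 ** H + 1)]

def log2_interp(x):                     # 1 <= x < 2
    t = (x - 1) * 2 ** H
    k = int(t)                          # interval index (high-order bits)
    dx = t - k                          # offset within interval, in [0, 1)
    return F[k] + dx * (F[k + 1] - F[k])
```

With power-of-2 intervals, the interval index and the offset Δx are just bit fields of x, so no division is needed to locate the breakpoints.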
Example: Log₂ over 1 ≤ x ≤ 2
f(xlo) = log₂ 1 = 0, f(xhi) = log₂ 2 = 1
log₂ x ≈ x − 1 = the fractional part of x
Maximum absolute error: 0.086071
Maximum relative error: 0.061476
Improved Linear Approximation • Choose a line that minimizes worst case absolute or relative error. • May not be exact at endpoints
Improved Linear Approximation
log₂ x ≈ 0.043036 + Δx, where Δx = x − 1
Maximum absolute error: 0.043036
Half the error of the previous approximation, but still not good enough …
More Improvement • Two choices: – Go to two or more linear intervals – Use one interval, but with 2nd degree polynomial interpolation: f (xk + ∆x) ≈ ak + bk ∆x + ck ∆x2 Program xk → ak, bk, ck in lookup table.
• Both result in larger table, but which is best?
Log2 with 4 Linear Intervals
Trade-Offs in Cost, Speed, and Accuracy • Interpolation using – h bits of x – a degree-m polynomial – requires a table of (m + 1) × 2^h entries
• As m increases, complexity increases, speed goes down. • It is seldom cost effective to go beyond second-degree interpolation.
Maximum Absolute Error in Computing Log2
Piecewise Lookup Tables • Evaluation based on table lookups using fragments of the operands. • An indirect table-lookup method.
Example 1: IEEE Floating Point
x: | t (8 bits) | u (6 bits) | v (6 bits) | w (6 bits) |
The IEEE 26-bit (2.24) significand is split into four fields.
Taylor Series
Method of Evaluation
Example 2: Modular Reduction
Compute z mod p, where z is a b-bit number and p is a d-bit modulus (d < b).
Divide and Conquer Modular Reduction
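A sketch of the divide-and-conquer reduction (the modulus, chunk width, and chunk count below are illustrative assumptions):

```python
# Divide-and-conquer z mod p: split b-bit z into g-bit chunks, look up
# (chunk << g*j) mod p for each position j, then reduce the small sum.
P, G, CHUNKS = 13, 8, 4                       # supports z < 2^32
TABLES = [[(c << (G * j)) % P for c in range(2 ** G)]
          for j in range(CHUNKS)]

def mod_dc(z):
    total = 0
    for j in range(CHUNKS):
        total += TABLES[j][(z >> (G * j)) & (2 ** G - 1)]
    return total % P                          # final reduction of partial sums
```

The per-chunk lookups are independent, so they can all be read in parallel and combined with a small adder tree before the final reduction.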
Alternate Method: Successive Refinement