Hardware Implementation of Finite-Field Arithmetic


Author: Jean-Pierre Deschamps


About the Authors

Jean-Pierre Deschamps received an MS degree in electrical engineering from the University of Louvain, Belgium, in 1967, a PhD degree in computer science from the Autonomous University of Barcelona, Spain, in 1983, and a PhD degree in electrical engineering from the Polytechnic School of Lausanne, Switzerland, in 1984. He worked in several companies and universities. He is currently a professor at the University Rovira i Virgili, Tarragona, Spain. His research interests include ASIC and FPGA design, digital arithmetic, and cryptography. He is the author of seven books and about a hundred international papers.

José Luis Imaña received the MS and PhD degrees, both in Physics, from Complutense University of Madrid, Spain, where he is currently a professor. His research interests include algorithms and VLSI architectures for computations in finite fields, cryptography, computer arithmetic, reconfigurable computing architectures, and formal methods in verification. He is the author of about 30 international papers and communications.

Gustavo D. Sutter received an MS degree in computer science from State University UNCPBA of Tandil (Buenos Aires), Argentina, and a PhD degree from the Autonomous University of Madrid, Spain. He has been a professor at the UNCPBA, Argentina, and is currently a professor at the Autonomous University of Madrid, Spain. His research interests include ASIC and FPGA design, digital arithmetic, and development of embedded systems. He is the author of one book and about 30 international papers and communications.

Jean-Pierre Deschamps
José Luis Imaña
Gustavo D. Sutter

New York Chicago San Francisco Lisbon London Madrid Mexico City Milan New Delhi San Juan Seoul Singapore Sydney Toronto

Copyright © 2009 by The McGraw-Hill Companies, Inc. All rights reserved. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher.

ISBN: 978-0-07-154582-2
MHID: 0-07-154582-4

The material in this eBook also appears in the print version of this title: ISBN: 978-0-07-154581-5, MHID: 0-07-154581-6.

All trademarks are trademarks of their respective owners. Rather than put a trademark symbol after every occurrence of a trademarked name, we use names in an editorial fashion only, and to the benefit of the trademark owner, with no intention of infringement of the trademark. Where such designations appear in this book, they have been printed with initial caps.

McGraw-Hill eBooks are available at special quantity discounts to use as premiums and sales promotions, or for use in corporate training programs. To contact a representative please visit the Contact Us page at www.mhprofessional.com.

Information contained in this work has been obtained by The McGraw-Hill Companies, Inc. (“McGraw-Hill”) from sources believed to be reliable. However, neither McGraw-Hill nor its authors guarantee the accuracy or completeness of any information published herein, and neither McGraw-Hill nor its authors shall be responsible for any errors, omissions, or damages arising out of use of this information. This work is published with the understanding that McGraw-Hill and its authors are supplying information but are not attempting to render engineering or other professional services. If such services are required, the assistance of an appropriate professional should be sought.

TERMS OF USE

This is a copyrighted work and The McGraw-Hill Companies, Inc. (“McGraw-Hill”) and its licensors reserve all rights in and to the work. Use of this work is subject to these terms. Except as permitted under the Copyright Act of 1976 and the right to store and retrieve one copy of the work, you may not decompile, disassemble, reverse engineer, reproduce, modify, create derivative works based upon, transmit, distribute, disseminate, sell, publish or sublicense the work or any part of it without McGraw-Hill’s prior consent. You may use the work for your own noncommercial and personal use; any other use of the work is strictly prohibited. Your right to use the work may be terminated if you fail to comply with these terms.

THE WORK IS PROVIDED “AS IS.” McGRAW-HILL AND ITS LICENSORS MAKE NO GUARANTEES OR WARRANTIES AS TO THE ACCURACY, ADEQUACY OR COMPLETENESS OF OR RESULTS TO BE OBTAINED FROM USING THE WORK, INCLUDING ANY INFORMATION THAT CAN BE ACCESSED THROUGH THE WORK VIA HYPERLINK OR OTHERWISE, AND EXPRESSLY DISCLAIM ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. McGraw-Hill and its licensors do not warrant or guarantee that the functions contained in the work will meet your requirements or that its operation will be uninterrupted or error free. Neither McGraw-Hill nor its licensors shall be liable to you or anyone else for any inaccuracy, error or omission, regardless of cause, in the work or for any damages resulting therefrom. McGraw-Hill has no responsibility for the content of any information accessed through the work. Under no circumstances shall McGraw-Hill and/or its licensors be liable for any indirect, incidental, special, punitive, consequential or similar damages that result from the use of or inability to use the work, even if any of them has been advised of the possibility of such damages. This limitation of liability shall apply to any claim or cause whatsoever whether such claim or cause arises in contract, tort or otherwise.

Contents

Preface
Acknowledgments

1 Mathematical Background
  1.1 Number Theory
    1.1.1 Basic Definitions
    1.1.2 Euclidean Algorithms
    1.1.3 Congruences
  1.2 Algebra
    1.2.1 Groups
    1.2.2 Rings
    1.2.3 Fields
    1.2.4 Polynomials
    1.2.5 Congruences of Polynomials
  1.3 Finite Fields
    1.3.1 Basic Properties
    1.3.2 Field Extensions
    1.3.3 Roots of Irreducible Polynomials
    1.3.4 Bases of Finite Fields
    1.3.5 Finite Fields $GF(2^m)$
  1.4 References

2 mod m Reduction
  2.1 Integer Division
    2.1.1 Digit Recurrence Algorithms
    2.1.2 Nonrestoring Reducer
    2.1.3 SRT Reducer
  2.2 Reduction mod $2^k - a$
  2.3 Precomputation of $2^{ik} \bmod m$
  2.4 Barrett Reduction Algorithm
    2.4.1 n-Digit to (k + t)-Digit Reduction
    2.4.2 An Approximation of q
  2.5 Comparison
  2.6 Specific Circuits
    2.6.1 mod 239 Reducer
    2.6.2 mod $(2^{192} - 2^{64} - 1)$ Reducer
  2.7 FPGA Implementation
    2.7.1 Nonrestoring Reducers
    2.7.2 SRT Reducers
    2.7.3 Reduction mod $2^k - a$
    2.7.4 Precomputation of $2^{ik} \bmod m$
    2.7.5 Barrett Reduction
    2.7.6 Specific Circuits
  2.8 Comments and Conclusions
  2.9 References

3 mod m Operations
  3.1 Addition mod m
  3.2 Subtraction mod m
  3.3 Adder/Subtractor mod m
  3.4 Multiplication mod m
    3.4.1 Multiply and Reduce
    3.4.2 Double, Add, and Reduce
    3.4.3 Montgomery Multiplication
    3.4.4 Comparison
  3.5 Exponentiation
  3.6 FPGA Implementations
    3.6.1 mod m Adders/Subtractors
    3.6.2 mod m Multipliers
    3.6.3 mod m Exponentiators
  3.7 Comments and Conclusions
  3.8 References

4 Operations over $GF(p)$
  4.1 Euclidean Algorithm
    4.1.1 Integer Division
    4.1.2 Multiplication and Subtraction
    4.1.3 mod p Division
  4.2 Binary Algorithm
  4.3 Plus-Minus Algorithm
  4.4 Fermat’s Little Theorem
  4.5 Comparison
  4.6 FPGA Implementations
    4.6.1 Euclidean Algorithm
    4.6.2 Binary Algorithm
    4.6.3 Plus-Minus Algorithm
    4.6.4 Fermat’s Little Theorem
  4.7 Comments and Conclusions
  4.8 References

5 Operations over $Z_p[x]/f(x)$
  5.1 Addition and Subtraction mod f(x)
  5.2 Multiplication mod f(x)
    5.2.1 Two-Step Multiplication
    5.2.2 Serial Multiplication
  5.3 Exponentiation mod f(x)
  5.4 Optimal Extension Fields
  5.5 FPGA Implementations
    5.5.1 Adders of Polynomials mod p
    5.5.2 Subtractors of Polynomials mod p
    5.5.3 Adders/Subtractors of Polynomials mod p
    5.5.4 Serial Multipliers
    5.5.5 Exponentiation
  5.6 Comments and Conclusions
  5.7 References

6 Operations over $GF(p^m)$
  6.1 Euclidean Algorithm
  6.2 Binary Algorithm
  6.3 Reduction to Multiplications over $GF(p^m)$ and Inversion over $Z_p$
  6.4 Optimal Extension Fields
  6.5 FPGA Implementations
  6.6 Comments and Conclusions
  6.7 References

7 Operations over $GF(2^m)$: Polynomial Bases
  7.1 Multiplication
    7.1.1 Two-Step Classic Multiplication
    7.1.2 Karatsuba-Ofman Polynomial Multiplication
    7.1.3 Interleaved Multiplication
    7.1.4 Matrix-Vector Multipliers
    7.1.5 Montgomery Multiplication
  7.2 Squaring
  7.3 Exponentiation
  7.4 Division
  7.5 Inversion
  7.6 Important Irreducible Polynomials
    7.6.1 Equally Spaced Polynomials (ESPs)
    7.6.2 General Irreducible Polynomials
    7.6.3 All-One Polynomials (AOPs)
    7.6.4 Trinomials
    7.6.5 Pentanomials
  7.7 FPGA Implementations
    7.7.1 Classic Multipliers
    7.7.2 Interleaved Multiplication
    7.7.3 Mastrovito Multipliers
    7.7.4 Mastrovito Multipliers, Second Version
    7.7.5 Interleaved Multiplication, Advanced Version
    7.7.6 Montgomery Multipliers
    7.7.7 Classic Squaring
    7.7.8 LSB First Squarer, Second Version
    7.7.9 Montgomery Squarer
    7.7.10 Binary Exponentiation
    7.7.11 Montgomery Exponentiation
    7.7.12 Division
    7.7.13 Extended Euclidean Algorithm (EEA) for Inversion
    7.7.14 Modified Almost Inverse Algorithm (MAIA) for Inversion
    7.7.15 Important Irreducible Polynomials
  7.8 Comments and Conclusions
  7.9 References

8 Operations over $GF(2^m)$: Normal Bases
  8.1 Some Properties of Normal Bases
  8.2 Squaring
  8.3 Multiplication
  8.4 Exponentiation
  8.5 Inversion
  8.6 Optimal Normal Bases
  8.7 FPGA Implementations
    8.7.1 Multiplier
    8.7.2 Exponentiation
    8.7.3 Inversion
    8.7.4 Type-I Optimal Normal Basis Multiplier with AOPs
  8.8 Comments and Conclusions
  8.9 References

9 Operations over $GF(2^m)$: Other Bases
  9.1 Dual Bases
  9.2 Triangular Bases
  9.3 References

10 An Example of Application: Elliptic Curve Cryptography
  10.1 Public-Key Cryptography
  10.2 Elliptic Curve over a Finite Field
  10.3 Group Law
  10.4 Point Multiplication
    10.4.1 Definition
    10.4.2 Basic Algorithms
    10.4.3 Some Alternative Methods
  10.5 Example of Implementation
    10.5.1 Computation Resources
    10.5.2 Point Addition
    10.5.3 Point Multiplication
  10.6 FPGA Implementation
  10.7 References

A $p = 2^{192} - 2^{64} - 1$
  A.1 Hexadecimal Representation
  A.2 mod p Reduction
    A.2.1 Generic Sequential Circuit
    A.2.2 Specific Combinational Circuit
    A.2.3 FPGA Implementation
  A.3 mod p Addition and Subtraction
  A.4 mod p Multiplication
    A.4.1 Generic Circuit
    A.4.2 Specific Circuit
  A.5 mod p Exponentiation
  A.6 mod p Division

B Optimal Extension Fields
  B.1 $GF(239^{17})$
    B.1.1 VHDL Models and Constant Definitions
    B.1.2 FPGA Implementations
  B.2 $GF((2^{32} - 387)^6)$
    B.2.1 Constants
    B.2.2 mod p Reduction
    B.2.3 mod p Addition and Subtraction
    B.2.4 mod p Multiplication
    B.2.5 mod p Division
    B.2.6 mod $(x^6 - 2)$ Multiplication
    B.2.7 mod $(x^6 - 2)$ Division

C Binary Fields
  C.1 $GF(2^{163})$
    C.1.1 mod f(x) Multiplication
    C.1.2 mod f(x) Division
    C.1.3 Squaring
    C.1.4 Elliptic-Curve Operations
  C.2 $GF(2^{233})$
    C.2.1 mod f(x) Multiplication
    C.2.2 mod f(x) Division
    C.2.3 Squaring
    C.2.4 Elliptic-Curve Operations

D Ada versus VHDL

Index

Preface

Finite fields are used in different types of computers and digital communication systems. Two well-known examples are error-correction codes and cryptography. The traditional way of implementing the corresponding algorithms is software, running on general-purpose processors or on digital-signal processors. Nevertheless, in some cases the time constraints cannot be met with instruction-set processors, and specific hardware must be considered, that is, circuits specifically designed for executing those complex algorithms: they implement the particular computation primitives of the algorithms and profit from their inherent parallelism.

Apart from the application-specific integrated circuit (ASIC) solution, another technology at hand for developing specific circuits is the field-programmable gate array (FPGA). FPGAs form an attractive option for small production quantities, as their nonrecurring engineering costs are much lower than those of ASICs. They also offer flexibility and fast time-to-market. Furthermore, in order to reduce their size, and so the unit cost, an interesting possibility is to reconfigure them at run time so that the same programmable device can execute different predefined functions.

This book describes algorithms and circuits for executing the main finite-field operations, that is, addition, subtraction, multiplication, squaring, exponentiation, and division. It is mainly addressed to hardware engineers involved in the development of embedded systems that include finite-field operations. Distinguishing features of this book are the following:

• The emphasis is different from the classic texts on finite fields. It is not limited to the description of algebraic and algorithmic aspects. The main topic is circuit synthesis.

• Special importance has been given to FPGA implementations. The particular architecture of these components leads the designer to use synthesis techniques somewhat different from the ones applied for ASICs, for which standard cell libraries exist. Throughout the book, examples of FPGA implementation are described.

• Most algorithms are described in Ada, a programming language similar to VHDL, so that they can be executed and the correctness of the proposed algorithms can be verified with actual input data.

• As regards the description of the circuits, logic schemes are presented as well as VHDL models, in such a way that the corresponding circuits can be easily simulated and synthesized.

Overview

The book is divided into 10 chapters. The first chapter (mathematical background) gives the main definitions and properties of finite fields.

Chapters 2 to 4 are dedicated to the operations modulo m and the corresponding circuits. Chapter 2 deals with the modulo m reduction, Chap. 3 with the modulo m addition, subtraction, multiplication, and exponentiation, and Chap. 4 with the modulo p division, where p is a prime.

Chapters 5 and 6 are dedicated to the operations modulo f(x), where f(x) is a polynomial over a finite field, and to the corresponding circuits. Chapter 5 deals with the modulo f(x) addition, subtraction, multiplication, and exponentiation, and Chap. 6 with the modulo f(x) division, where f(x) is an irreducible polynomial.

Chapters 7 to 9 are dedicated to the main arithmetic operations over $GF(2^m)$. In Chap. 7 polynomial bases are considered (thus, a particular case of the topics dealt with in Chaps. 5 and 6). In Chap. 8 normal bases are used, and in Chap. 9 dual and triangular bases are considered.

Chapter 10 is dedicated to elliptic-curve cryptography, currently one of the main finite-field applications.

There are four appendices. Three of them describe circuits for performing arithmetic operations over some particular fields, namely a prime field $GF(2^{192} - 2^{64} - 1)$ in App. A, two optimal extension fields $GF(239^{17})$ and $GF((2^{32} - 387)^6)$ in App. B, and two binary extension fields $GF(2^{163})$ and $GF(2^{233})$ in App. C. Appendix D is a brief comparison of the syntaxes of Ada and VHDL.

All the chapters except the first one include algorithms, circuits, and results of FPGA implementations. The algorithms are described in Ada and the circuits are modeled in VHDL. Complete and executable source files (Ada and VHDL) are available at the authors’ Web site www.arithmetic-circuits.org.

Acknowledgments

The authors are grateful to the following universities for providing them the means for carrying this work through to a successful conclusion: University Rovira i Virgili (Tarragona, Spain), Autonomous University of Madrid (Spain), and Complutense University of Madrid (Spain).


CHAPTER

1

Mathematical Background

This chapter presents some topics in mathematics; it is intended to make this book self-contained. For further details the reader can refer to textbooks on Algebra ([Coh93], [GN03], [Her75], [Hun74]), Number Theory ([Kob94], [Ros92], [Ros00], [Gar59]), Finite Fields ([LN83], [LN94], [McC87], [Men93]), and Cryptography [MOV96], from where the following material has been mainly extracted.

1.1 Number Theory

1.1.1 Basic Definitions

Definitions 1.1
1. The set of natural numbers $N = \{0, 1, 2, 3, \ldots\}$. (For convenience, the element zero has been included in $N$.)
2. The set of integers $Z = \{\ldots, -3, -2, -1, 0, 1, 2, 3, \ldots\}$.

Definition 1.2 Given two integers $x$ and $y$, $y$ divides $x$ ($y$ is a divisor of $x$) if there exists an integer $z$ such that $x = zy$.

Definition 1.3 Given two integers $x$ and $y$, with $y > 0$, there exist two integers $q$ (the quotient) and $r$ (the remainder) such that

$$x = qy + r \quad \text{where } 0 \le r < y$$

It can be proven that $q$ and $r$ are unique. Then (notation)

$$r = x \bmod y, \qquad q = x \operatorname{div} y$$

An alternative definition:

Definition 1.4 (integer division) Given two integers $x$ and $y$, with $y > 0$, there exist two integers $q$ (the quotient) and $r$ (the remainder) such that

$$x = qy + r \quad \text{where } 0 \le r < y \text{ if } x \ge 0 \text{ and } -y < r \le 0 \text{ if } x < 0$$

It can be proven that $q$ and $r$ are unique. Then (notation)

$$r = x \operatorname{rem} y, \qquad q = x/y$$

Examples 1.1
1. $x = -16$, $y = 3$:
   $-16 \bmod 3 = 2$, $-16 \operatorname{div} 3 = -6$, $-16 = -6 \cdot 3 + 2$
   $-16 \operatorname{rem} 3 = -1$, $-16/3 = -5$, $-16 = -5 \cdot 3 + (-1)$
2. $x = -15$, $y = 3$:
   $-15 \bmod 3 = 0$, $-15 \operatorname{div} 3 = -5$, $-15 = -5 \cdot 3 + 0$
   $-15 \operatorname{rem} 3 = 0$, $-15/3 = -5$, $-15 = -5 \cdot 3 + 0$
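The difference between the two conventions is easy to check numerically. The following is a minimal sketch in Python, used here only for illustration (the book's own algorithms are written in Ada). The function names `div_mod_floor` and `div_rem_trunc` are hypothetical. For a positive divisor, Python's `//` and `%` operators follow Definition 1.3, while Definition 1.4 corresponds to a quotient truncated toward zero.

```python
def div_mod_floor(x, y):
    """Definition 1.3: x = q*y + r with 0 <= r < y (requires y > 0)."""
    if y <= 0:
        raise ValueError("y must be positive")
    # Python floors the quotient, so the remainder is never negative here.
    return x // y, x % y


def div_rem_trunc(x, y):
    """Definition 1.4: quotient truncated toward zero (requires y > 0).

    The remainder is zero or has the same sign as x.
    """
    if y <= 0:
        raise ValueError("y must be positive")
    q = abs(x) // y
    if x < 0:
        q = -q
    return q, x - q * y
```

With the values of Examples 1.1, `div_mod_floor(-16, 3)` gives `(-6, 2)` and `div_rem_trunc(-16, 3)` gives `(-5, -1)`; for `x = -15` both functions return `(-5, 0)`.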

Definitions 1.5
1. Given two integers $x$ and $y$, $z$ is the greatest common divisor of $x$ and $y$ if $z$ is a natural number (nonnegative integer), $z$ divides both $x$ and $y$, and any other common divisor of $x$ and $y$ is also a divisor of $z$. Notation: $z = \gcd(x, y)$.
2. Given two integers $x$ and $y$, they are said to be relatively prime if $\gcd(x, y) = 1$.
3. An integer $p > 1$ is said to be prime if its only positive divisors are 1 and $p$.

1.1.2 Euclidean Algorithms

Given two natural numbers $x$ and $y$, the Euclidean algorithm for natural numbers computes $\gcd(x, y)$. It is based on a series of integer divisions:

$$r(i-1) = q(i)\,r(i) + r(i+1) \quad \text{where } 0 \le r(i+1) < r(i)$$

Observe that any divisor of $r(i-1)$ and $r(i)$ is also a divisor of $r(i)$ and $r(i+1)$, and conversely, so that

$$\gcd(r(i-1), r(i)) = \gcd(r(i), r(i+1))$$

Initially $r(0) = x$ and $r(1) = y$. Then compute

$$\begin{aligned}
r(0) &= q(1)\,r(1) + r(2) \\
r(1) &= q(2)\,r(2) + r(3) \\
r(2) &= q(3)\,r(3) + r(4) \\
&\;\;\vdots \\
r(n-3) &= q(n-2)\,r(n-2) + r(n-1) \\
r(n-2) &= q(n-1)\,r(n-1) + r(n)
\end{aligned}$$

where $r(1) > r(2) > \cdots > r(n) = 0$ and $\gcd(r(i-1), r(i)) = \gcd(r(i), r(i+1))$, so that

$$\gcd(x, y) = \gcd(r(0), r(1)) = \cdots = \gcd(r(n-1), r(n)) = \gcd(r(n-1), 0) = r(n-1)$$

Example 1.2 Let $r(0) = x = 9520$; $r(1) = y = 3120$;

$$\begin{aligned}
9520 &= 3 \cdot 3120 + 160 \\
3120 &= 19 \cdot 160 + 80 \\
160 &= 2 \cdot 80 + 0
\end{aligned}$$

Then $\gcd(9520, 3120) = 80$.
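The division chain of the Euclidean algorithm can be traced with a few lines of code. This is a minimal Python sketch for illustration only (the book's own algorithms are in Ada); the function name `gcd_steps` and the shape of its return value are assumptions made for this example.

```python
def gcd_steps(x, y):
    """Euclidean algorithm for natural numbers x and y.

    Returns (gcd, steps), where each step is a tuple
    (r_prev, q, r_cur, r_next) with r_prev = q * r_cur + r_next.
    """
    steps = []
    r_prev, r_cur = x, y
    while r_cur > 0:
        q, r_next = divmod(r_prev, r_cur)
        steps.append((r_prev, q, r_cur, r_next))
        r_prev, r_cur = r_cur, r_next
    # The last nonzero remainder is the greatest common divisor.
    return r_prev, steps
```

For `gcd_steps(9520, 3120)` the recorded steps are `(9520, 3, 3120, 160)`, `(3120, 19, 160, 80)`, and `(160, 2, 80, 0)`, and the returned gcd is 80.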

In the extended Euclidean algorithm a series of coefficients $b(i)$ and $c(i)$ is calculated in parallel with the computation of $r(0), r(1), r(2), \ldots, r(n)$:

$$\begin{aligned}
b(0) &= 1 & c(0) &= 0 \\
b(1) &= 0 & c(1) &= 1 \\
b(2) &= b(0) - b(1)\,q(1) & c(2) &= c(0) - c(1)\,q(1) \\
&\;\;\vdots & &\;\;\vdots \\
b(n-1) &= b(n-3) - b(n-2)\,q(n-2) & c(n-1) &= c(n-3) - c(n-2)\,q(n-2)
\end{aligned}$$

It can be demonstrated by induction that

$$r(i) = b(i)\,x + c(i)\,y \quad \forall i = 0, 1, 2, \ldots, n-1$$

In particular

$$\gcd(x, y) = r(n-1) = b(n-1)\,x + c(n-1)\,y$$

In conclusion the extended Euclidean algorithm expresses the greatest common divisor $z$ of two natural numbers $x$ and $y$ as a linear combination of $x$ and $y$, that is,

$$z = bx + cy \tag{1.1}$$

Algorithm 1.1: Extended Euclidean algorithm

    if x = 0 then
       z := y; b := 0; c := 1;
    elsif y = 0 then
       z := x; b := 1; c := 0;
    else
       r_i := x; r_iplus1 := y;
       b_i := 1; c_i := 0;
       b_iplus1 := 0; c_iplus1 := 1;
       while r_iplus1 > 0 loop
          q := r_i/r_iplus1;
          r_iplus2 := r_i mod r_iplus1;
          b_iplus2 := b_i - b_iplus1*q;
          c_iplus2 := c_i - c_iplus1*q;
          r_i := r_iplus1; r_iplus1 := r_iplus2;
          b_i := b_iplus1; b_iplus1 := b_iplus2;
          c_i := c_iplus1; c_iplus1 := c_iplus2;
       end loop;
       z := r_i; b := b_i; c := c_i;
    end if;

Example 1.3 Let $r_i = x = 230490$; $r_{i+1} = y = 43290$; $b_i = c_{i+1} = 1$; $b_{i+1} = c_i = 0$.

Step 1
$q = 230490/43290 = 5$; $r_{i+2} = 230490 \bmod 43290 = 14040$
$b_{i+2} = 1 - 0 \cdot 5 = 1$; $c_{i+2} = 0 - 1 \cdot 5 = -5$
$r_i = 43290$; $r_{i+1} = 14040$; $b_i = 0$; $b_{i+1} = 1$; $c_i = 1$; $c_{i+1} = -5$

Step 2
$q = 43290/14040 = 3$; $r_{i+2} = 43290 \bmod 14040 = 1170$
$b_{i+2} = 0 - 1 \cdot 3 = -3$; $c_{i+2} = 1 + 5 \cdot 3 = 16$
$r_i = 14040$; $r_{i+1} = 1170$; $b_i = 1$; $b_{i+1} = -3$; $c_i = -5$; $c_{i+1} = 16$

Step 3
$q = 14040/1170 = 12$; $r_{i+2} = 14040 \bmod 1170 = 0$
$b_{i+2} = 1 + 3 \cdot 12 = 37$; $c_{i+2} = -5 - 16 \cdot 12 = -197$
$r_i = 1170$; $r_{i+1} = 0$; $b_i = -3$; $b_{i+1} = 37$; $c_i = 16$; $c_{i+1} = -197$

Result: $b = b_i = -3$; $c = c_i = 16$; $\gcd(230490, 43290) = z = r_i = 1170 = -3 \cdot 230490 + 16 \cdot 43290$
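The extended Euclidean algorithm can also be transcribed into an executable form. The sketch below is a Python rendering written for illustration, not the authors' Ada source; the function name `extended_euclid` is an assumption. It keeps the pairs $(r_i, r_{i+1})$, $(b_i, b_{i+1})$, and $(c_i, c_{i+1})$ and updates them with the same recurrence.

```python
def extended_euclid(x, y):
    """Return (z, b, c) with z = gcd(x, y) = b*x + c*y for naturals x, y."""
    r_i, r_next = x, y
    b_i, b_next = 1, 0
    c_i, c_next = 0, 1
    while r_next > 0:
        q = r_i // r_next
        # Same recurrence for remainders and for both coefficient series.
        r_i, r_next = r_next, r_i - q * r_next
        b_i, b_next = b_next, b_i - q * b_next
        c_i, c_next = c_next, c_i - q * c_next
    return r_i, b_i, c_i
```

Calling `extended_euclid(230490, 43290)` returns `(1170, -3, 16)`, the coefficients of Example 1.3. The explicit tests for `x = 0` and `y = 0` are not needed in this form because the loop handles those cases; when both inputs are zero the sketch returns `(0, 1, 0)`.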

1.1.3 Congruences

Definition 1.6 Given two integers $x$ and $y$, and a positive integer $n$, $x$ is congruent to $y$ modulo $n$ if $n$ divides the difference $(x - y)$. Notation: $x \equiv y \pmod{n}$

Properties 1.1 (basic properties of congruences)
1. $x \equiv y \pmod{n}$ if and only if $(x \bmod n) = (y \bmod n)$ (Definition 1.3).
2. The relation $x \equiv y \pmod{n}$ is an equivalence relation (reflexive, symmetric, and transitive).
3. If $x_1 \equiv y_1 \pmod{n}$ and $x_2 \equiv y_2 \pmod{n}$, then

$$\begin{aligned}
(x_1 + x_2) &\equiv (y_1 + y_2) \pmod{n} \\
(x_1 - x_2) &\equiv (y_1 - y_2) \pmod{n} \\
(x_1 x_2) &\equiv (y_1 y_2) \pmod{n}
\end{aligned} \tag{1.2}$$

From Properties 1.1(1 and 2), it can be seen that the mod $n$ congruence relation partitions $Z$ into $n$ equivalence classes. Each equivalence class contains exactly one element of the set $\{0, 1, 2, \ldots, n-1\}$, namely the common value $(x \bmod n)$ for all elements $x$ of the class. Furthermore, according to Property 1.1(3), the addition, subtraction, and multiplication of congruence classes can be defined. As a matter of fact the set of equivalence classes is isomorphic to $Z_n = \{0, 1, 2, \ldots, n-1\}$, where the addition, the subtraction, and the multiplication are defined by

$$(x + y) \bmod n, \qquad (x - y) \bmod n, \qquad (xy) \bmod n \qquad \forall x \text{ and } y \text{ in } Z_n$$
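A short numeric check shows why the operations on classes are well defined: replacing an operand by any other representative of its class does not change the result. This is a minimal Python sketch for illustration; the function name `zn_ops` is hypothetical. It relies on the fact that Python's `%` returns a value in $\{0, \ldots, n-1\}$ for a positive modulus.

```python
def zn_ops(x, y, n):
    """Return ((x + y) mod n, (x - y) mod n, (x * y) mod n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return (x + y) % n, (x - y) % n, (x * y) % n
```

With $n = 7$, the integers 10 and 3 are congruent, and so are −2 and 5, so `zn_ops(10, -2, 7)` and `zn_ops(3, 5, 7)` give the same triple `(1, 5, 1)`.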

Definition 1.7 Given two elements $x$ and $y$ of $Z_n$ such that $xy \bmod n = 1$, then $y$ is said to be the multiplicative inverse of $x$. If such an inverse exists, it is unique. Notation: $y = x^{-1} \bmod n$

Property 1.2 $x$ has a multiplicative inverse if and only if $\gcd(x, n) = 1$.

Proof If $xy \equiv 1 \pmod{n}$, then $xy = qn + 1$ so that any divisor of $x$ and $n$ is also a divisor of 1. Thus, $\gcd(x, n) = 1$. If $\gcd(x, n) = 1$, then (relation 1.1) there exist $b$ and $c$ such that $1 = bx + cn$, so that $x^{-1} = b \bmod n$.
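The proof of Property 1.2 is constructive: the coefficient $b$ of relation (1.1) is the inverse. The following Python sketch illustrates this under the assumption that only $b$ needs to be tracked, since the coefficient of $n$ disappears modulo $n$. The function name `mod_inverse` and the choice of returning `None` when no inverse exists are assumptions of this example.

```python
def mod_inverse(x, n):
    """Return x^(-1) mod n, or None when gcd(x, n) != 1 (n >= 1)."""
    r_i, r_next = x % n, n
    b_i, b_next = 1, 0
    while r_next > 0:
        q = r_i // r_next
        r_i, r_next = r_next, r_i - q * r_next
        b_i, b_next = b_next, b_i - q * b_next
    if r_i != 1:
        return None  # x and n share a nontrivial divisor
    # At this point 1 = b_i * x + c * n for some integer c.
    return b_i % n
```

For instance, `mod_inverse(3, 7)` returns 5 because $3 \cdot 5 = 15 \equiv 1 \pmod{7}$, while `mod_inverse(4, 6)` returns `None` because $\gcd(4, 6) = 2$.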

More generally:

Properties 1.3
1. Let g = gcd(a, n). Then the equation $ax \equiv d \pmod n$ has a solution x if and only if g divides d.
2. If g divides d, the solutions of $ax \equiv d \pmod n$ are the same as the solutions of $(a/g)x \equiv (d/g) \pmod{n/g}$.
3. If g divides d, there are g solutions within $Z_n$, all of them congruent modulo n/g.

Proofs
1. If $ax \equiv d \pmod n$, then ax − d = qn. As g divides both a and n, it also divides d. Conversely, if g divides d, then d = qg. According to Eq. (1.1), g is a linear combination of a and n, that is, g = ba + cn. So d = q(ba + cn) and x = qb is a solution.
2. If g divides d and $ax \equiv d \pmod n$, that is, ax − d = qn, then (a/g)x − (d/g) = q(n/g) and $(a/g)x \equiv (d/g) \pmod{n/g}$. Conversely, if $(a/g)x \equiv (d/g) \pmod{n/g}$, then $ax \equiv d \pmod n$.
3. As a/g and n/g are relatively prime, there is a unique solution within $Z_{n/g}$, namely $x = x_0 = (d/g)(a/g)^{-1} \bmod (n/g)$. The complete set of solutions within $Z_n$ is

$$x_k = x_0 + k(n/g), \qquad \forall k = 0, 1, \ldots, g-1.$$

Observe that if k < g and $x_0 < n/g$, then $x_k \le (n/g) - 1 + (g-1)(n/g) = n - 1$.
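Properties 1.3 translate directly into a procedure for solving $ax \equiv d \pmod n$. The sketch below is only an illustration: the function name `solve_linear_congruence` is hypothetical, n is assumed to be a positive integer, and the modular inverse is obtained with Python's built-in three-argument `pow` (negative exponents require Python 3.8 or later).

```python
from math import gcd


def solve_linear_congruence(a, d, n):
    """Return all x in Z_n with a*x = d (mod n), in increasing order."""
    g = gcd(a, n)
    if d % g != 0:
        return []                      # Property 1.3(1): no solution
    a_r, d_r, n_r = a // g, d // g, n // g
    # Property 1.3(3): unique solution x0 modulo n/g
    x0 = 0 if n_r == 1 else (d_r * pow(a_r, -1, n_r)) % n_r
    return [x0 + k * n_r for k in range(g)]


print(solve_linear_congruence(6, 4, 10))  # [4, 9]
```

For 6x ≡ 4 (mod 10), g = 2 and the reduced equation is 3x ≡ 2 (mod 5), whose solution 4 yields the two solutions 4 and 9 modulo 10.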

Definitions 1.8
1. The set of elements x of $Z_n$ relatively prime to n is the multiplicative group $Z_n^*$: $Z_n^* = \{x \in Z_n \mid \gcd(x, n) = 1\}$.
2. The Euler phi function φ(n) is the number of elements in $Z_n^*$.

According to Property 1.2, $Z_n^*$ is the set of invertible elements of $Z_n$. In particular, if p is a prime number, then $Z_p^* = \{1, 2, \ldots, p-1\}$ and φ(p) = p − 1.

Property 1.4 (Fermat's little theorem) Let p be a prime. Any integer x satisfies $x^p \equiv x \pmod p$, and any integer x not divisible by p satisfies $x^{p-1} \equiv 1 \pmod p$.

Proof If x is not divisible by p and if $ix \equiv jx \pmod p$, that is, (i − j)x = qp, then $i \equiv j \pmod p$. As the p − 1 multiples 1x, 2x, . . . , (p − 1)x are therefore distinct and nonzero modulo p, they must be congruent to 1, 2, 3, . . . , p − 1 in some order. Thus

$$(1x)(2x) \cdots ((p-1)x) \equiv 1 \cdot 2 \cdots (p-1) \pmod p.$$

So $(p-1)!\,x^{p-1} \equiv (p-1)! \pmod p$, or $(p-1)!\,(x^{p-1} - 1) \equiv 0 \pmod p$. As p does not divide (p − 1)!, $x^{p-1} - 1 \equiv 0 \pmod p$, that is,

$$x^{p-1} \equiv 1 \pmod p \quad \text{and} \quad x^p \equiv x \pmod p.$$

If x is divisible by p, then $x^p \equiv x \equiv 0 \pmod p$.

Corollary 1.1 Let p be a prime. If x is not divisible by p and if $r \equiv s \pmod{p-1}$, then $x^r \equiv x^s \pmod p$.

Proof Assume that r > s. Then r = q(p − 1) + s and $1 \equiv 1^q \equiv (x^{p-1})^q \equiv x^{r-s} \pmod p$, so that $x^r \equiv x^s \pmod p$.
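Both statements of Fermat's little theorem, and the exponent reduction allowed by Corollary 1.1, are easy to check numerically. The following sketch is a brute-force illustration only; `fermat_check` and `pow_reduced` are hypothetical helper names, and the exponent r is assumed to be nonnegative.

```python
def fermat_check(p):
    """Brute-force check of both statements of Property 1.4 for the modulus p."""
    first = all(pow(x, p, p) == x % p for x in range(3 * p))
    second = all(pow(x, p - 1, p) == 1 for x in range(3 * p) if x % p != 0)
    return first and second


def pow_reduced(x, r, p):
    """x^r mod p for a prime p not dividing x, with the exponent reduced mod p - 1."""
    return pow(x, r % (p - 1), p)


print(fermat_check(7))                            # True
print(pow_reduced(3, 1000, 7), pow(3, 1000, 7))   # the two values agree
```

Reducing the exponent modulo p − 1 before exponentiating is the practical use of Corollary 1.1: the cost then depends on the size of p rather than on the size of r.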

Definitions 1.9
1. The order of an element x of $Z_n^*$ is the least positive integer t such that $x^t \equiv 1 \pmod n$.
2. If the order of x is equal to the number φ(n) of elements in $Z_n^*$, then x is said to be a generator or primitive element of $Z_n^*$.
3. If $Z_n^*$ has a generator, then $Z_n^*$ is said to be cyclic.

Observe that if x is a generator, then $Z_n^* = \{x^1, x^2, x^3, \ldots, x^{\varphi(n)}\}$.

Example 1.4 $Z_7 = \{0, 1, 2, 3, 4, 5, 6\}$ and $Z_7^* = \{1, 2, 3, 4, 5, 6\}$; 7 is prime and φ(7) = 6;

$1^1 \equiv 1$, $2^3 \equiv 1$, $3^6 \equiv 1$, $4^3 \equiv 1$, $5^6 \equiv 1$, $6^2 \equiv 1 \pmod 7$;

there are two generators: 3 and 5; for example:

$3^1 \equiv 3$, $3^2 \equiv 2$, $3^3 \equiv 6$, $3^4 \equiv 4$, $3^5 \equiv 5$, $3^6 \equiv 1 \pmod 7$.
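The orders and generators listed in Example 1.4 can be reproduced by direct search. This is a minimal sketch, assuming small moduli where brute force is acceptable; the helper names `units`, `element_order`, and `generators` are illustrative.

```python
from math import gcd


def units(n):
    """Elements of Z_n^*: the x in Z_n that are relatively prime to n."""
    return [x for x in range(1, n) if gcd(x, n) == 1]


def element_order(x, n):
    """Least t > 0 with x^t = 1 (mod n); x must be relatively prime to n."""
    t, y = 1, x % n
    while y != 1:
        y = (y * x) % n
        t += 1
    return t


def generators(n):
    """Elements of Z_n^* whose order equals phi(n) (empty if Z_n^* is not cyclic)."""
    group = units(n)
    return [x for x in group if element_order(x, n) == len(group)]


print({x: element_order(x, 7) for x in units(7)})  # {1: 1, 2: 3, 3: 6, 4: 3, 5: 6, 6: 2}
print(generators(7))                                # [3, 5]
```

Calling `generators(8)` returns an empty list, since every element of $Z_8^* = \{1, 3, 5, 7\}$ has order 1 or 2.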


1.2 Algebra

1.2.1 Groups

Definitions 1.10 A group (G, *) consists of a set G with a binary operation * on G satisfying the following three axioms:

1. x * (y * z) = (x * y) * z, ∀ x, y, z ∈ G (associativity).
2. There is an identity (or unity) element 1 in G, such that x * 1 = 1 * x = x, ∀ x ∈ G.
3. For each element x of G there exists an element $x^{-1}$, called the inverse of x, such that $x * x^{-1} = x^{-1} * x = 1$.

If, furthermore,

4. x * y = y * x, ∀ x, y ∈ G (commutativity),

the group is said to be commutative (or abelian). Axioms 1 and 2 define a semigroup.

Examples 1.5
• The set of integers Z with the operation + forms a group, with 0 as identity element.
• The set $Z_n$ with the operation of addition modulo n forms a group, with 0 as identity element.
• The set $Z_n$ with the operation of multiplication modulo n is not a group, since not all elements have multiplicative inverses.
• The set $Z_n^*$ with the operation of multiplication modulo n forms a group, with 1 as identity element.

The following definitions generalize Definitions 1.9:

Definitions 1.11
1. The order of an element x of a finite group G is the least positive integer t such that $x^t = x * x * \cdots * x = 1$.
2. If the order of x is equal to the number n of elements in G, then x is said to be a generator of G.
3. If G has a generator, then G is said to be cyclic.

Property 1.5 The order of an element x of a finite group G divides the number of elements in G.

Proof First observe that if H is a subgroup of G, then an equivalence relation on G can be defined: $g_1 \equiv g_2$ if there exists an element h in H such that $g_1 h = g_2$. The number of elements in an equivalence class is equal to the number |H| of elements in H. Thus the number |G| of elements in G is equal to |H||G/H|, G/H being the set of classes and |G/H| the number of classes. In other words, the number of elements of a subgroup divides the number of elements of the group. It remains to observe that the set $\{x, x^2, \ldots, x^t = 1\}$, where t is the order of x, is a subgroup, so that the number t of elements of the subgroup divides the number of elements in G.

Example 1.6 Consider the multiplicative group $Z_7^* = \{1, 2, 3, 4, 5, 6\}$. In this case, 3 and 5 are generators; the subgroup generated by 2 is {2, 4, 1}; the corresponding classes are then {2, 4, 1} and {6, 5, 3}; the number of elements (3) of the subgroup divides the number of elements (6) of $Z_7^*$.

1.2.2 Rings

Definitions 1.12 A ring (R, +, *) consists of a set R with two binary operations + and *, satisfying the following axioms:

1. (R, +) is a commutative group with additive identity element 0.
2. x * (y * z) = (x * y) * z, ∀ x, y, z ∈ R (associativity).
3. There is a multiplicative identity element 1, with 1 ≠ 0, such that x * 1 = 1 * x = x, ∀ x ∈ R.
4. x * (y + z) = (x * y) + (x * z) and (x + y) * z = (x * z) + (y * z), ∀ x, y, z ∈ R (distributivity).

If, furthermore,

5. x * y = y * x, ∀ x, y ∈ R (commutativity),

the ring is said to be commutative.

Examples 1.7
• The set of integers Z with the usual operations + and · is a commutative ring.
• The set $Z_n$ with the addition and multiplication modulo n operations is a commutative ring.

Definitions 1.13
1. A subset S of a ring R is called a subring of R, provided that S is closed under + and * and forms a ring under these operations.
2. A subset J of a ring R is called an ideal, provided that J is a subring of R and for all a ∈ J and b ∈ R we have that ab ∈ J and ba ∈ J.


1.2.3 Fields

Definitions 1.14 A field (F, +, *) consists of a set F with two binary operations + and *, with an additive identity element 0 and a multiplicative identity element 1, satisfying the following axioms:

1. (F, +, *) is a commutative ring.
2. All nonzero elements of F have a multiplicative inverse.

Definition 1.15 The characteristic of a field is the least positive integer m such that $\sum_{i=1}^{m} 1 = 0$. If $1 + 1 + \cdots + 1$ (m times) is never equal to 0 for any m > 0, the characteristic of the field is 0.

Examples 1.8
• The real numbers R form a field of characteristic 0 under the usual operations.
• The set of integers Z with the usual operations of addition (+) and multiplication (·) is not a field, because the only nonzero elements with multiplicative inverses are 1 and −1.
• The set $Z_p$ with the usual operations of addition and multiplication modulo p is a field if and only if p is a prime. If p is prime, then $Z_p$ has characteristic p.
• Consider the field $Z_5$. The tables for the addition and multiplication operations modulo 5 are as follows (Table 1.1):

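| + | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | 1 | 2 | 3 | 4 | 0 |
| 2 | 2 | 3 | 4 | 0 | 1 |
| 3 | 3 | 4 | 0 | 1 | 2 |
| 4 | 4 | 0 | 1 | 2 | 3 |

| ⋅ | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 2 | 3 | 4 |
| 2 | 0 | 2 | 4 | 1 | 3 |
| 3 | 0 | 3 | 1 | 4 | 2 |
| 4 | 0 | 4 | 3 | 2 | 1 |

TABLE 1.1 Addition and Multiplication over $Z_5$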

Definitions 1.16
1. A subset E of a field F is a subfield of F if E is itself a field with respect to the operations of F. In such a case, F is said to be an extension field of E. If E ≠ F, we say that E is a proper subfield of F.
2. A field containing no proper subfields is called a prime field.

1.2.4 Polynomials

Definitions 1.17
1. If R is a commutative ring, then a polynomial in the indeterminate x over R is an expression of the form

$$f(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$

where $a_i \in R$, ∀ i ∈ {0, 1, . . . , n}. The element $a_i$ is called the coefficient of $x^i$ in f(x).
2. The largest integer m (if any) such that $a_m \ne 0$ is the degree of f(x). It is denoted deg(f), and $a_m$ is called the leading coefficient. If all the coefficients of f(x) are equal to 0, then f(x) is called the zero polynomial and its degree is defined to be −∞. The zero-degree polynomials are also called constant polynomials.
3. A monic polynomial is a polynomial whose leading coefficient is equal to 1.
4. The polynomial ring R[x] is the ring formed by the set of all polynomials in the indeterminate x with coefficients in R. The two operations are the standard polynomial addition and multiplication, with coefficient arithmetic performed in R. The additive identity element 0 is the zero polynomial. The multiplicative identity element 1 is the monic constant polynomial.

Example 1.9 Let $f(x) = x^4 + 3x^3 + 2x + 4$ and $g(x) = 4x^3 + 3x + 4$ be elements of the polynomial ring $Z_5[x]$. The addition and multiplication of the two polynomials are as follows:

$$f(x) + g(x) = x^4 + 2x^3 + 3$$
$$f(x)g(x) = 4x^7 + 2x^6 + 3x^5 + x^4 + 3x^3 + x^2 + 1$$

In the following text, we will deal almost exclusively with polynomials over an arbitrary field F.
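The sum and product of Example 1.9 can be checked with a few lines of Python. The sketch below makes one representation choice that is not in the text: a polynomial is stored as a list of coefficients, lowest degree first, so $x^4 + 3x^3 + 2x + 4$ becomes `[4, 2, 0, 3, 1]`. The names `poly_add` and `poly_mul` are illustrative.

```python
def poly_add(a, b, p):
    """Add two polynomials over Z_p (coefficient lists, lowest degree first)."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [(x + y) % p for x, y in zip(a, b)]


def poly_mul(a, b, p):
    """Multiply two nonzero polynomials over Z_p by the schoolbook method."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return out


f = [4, 2, 0, 3, 1]   # x^4 + 3x^3 + 2x + 4
g = [4, 3, 0, 4]      # 4x^3 + 3x + 4
print(poly_add(f, g, 5))  # [3, 0, 0, 2, 1]          i.e. x^4 + 2x^3 + 3
print(poly_mul(f, g, 5))  # [1, 0, 1, 3, 1, 3, 2, 4] i.e. 4x^7 + 2x^6 + 3x^5 + x^4 + 3x^3 + x^2 + 1
```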

Definition 1.18 Thanks to the fact that F is a field, all the nonzero coefficients have an inverse and the standard polynomial division can also be performed. Thus, if g(x) and h(x) ≠ 0 are polynomials in F[x], then there exist two polynomials q(x) (the quotient) and r(x) (the remainder) in F[x] such that

$$g(x) = q(x)h(x) + r(x), \quad \text{where } \deg(r) < \deg(h).$$

Notation: $r(x) = g(x) \bmod h(x)$ and $q(x) = g(x) \operatorname{div} h(x)$. (1.3)

Definitions 1.19
1. Given two polynomials g(x) and h(x), h(x) divides g(x) (or h(x) is a divisor of g(x)) if there exists a polynomial q(x) such that g(x) = q(x)h(x).
2. Given two polynomials g(x) and h(x), not both equal to 0, the greatest common divisor of g(x) and h(x) is the monic polynomial of greatest degree which divides both g(x) and h(x).
3. gcd(0, 0) = 0.
4. A polynomial f(x) of degree at least 1 is said to be irreducible if it cannot be written as the product of two polynomials, each of positive degree.

A variant of the Euclidean algorithm for polynomials [GG03] expresses the greatest common divisor of two polynomials g(x) and h(x) in the form

$$\gcd(g, h) = b(x)g(x) + c(x)h(x).$$

The algorithm is based on the fact that if u(x) and v(x) are two polynomials such that deg(u) = m, deg(v) = t, and m > t, that is,

$$u(x) = u_m x^m + u_{m-1} x^{m-1} + \cdots + u_1 x + u_0,$$
$$v(x) = v_t x^t + v_{t-1} x^{t-1} + \cdots + v_1 x + v_0,$$

then

$$v(x)\,u_m v_t^{-1} x^{m-t} = (v_t x^t + v_{t-1} x^{t-1} + \cdots + v_1 x + v_0)\,u_m v_t^{-1} x^{m-t} = u_m x^m + r'(x),$$

where deg(r') < m, so that

$$u(x) = \left(v(x)\,u_m v_t^{-1} x^{m-t} - r'(x)\right) + u_{m-1} x^{m-1} + \cdots + u_1 x + u_0 = v(x)\,u_m v_t^{-1} x^{m-t} + r(x),$$

where $r(x) = u_{m-1} x^{m-1} + \cdots + u_1 x + u_0 - r'(x)$, so that deg(r) < m and max(deg(r), deg(v)) < deg(u). Furthermore,

$$\gcd(u, v) = \gcd(v, r). \qquad (1.4)$$

The sequence of operations is almost the same as for computing the greatest common divisor of two integers. A series of polynomials $r^{(0)}, r^{(1)}, r^{(2)}, \ldots$ is generated. Initially assume that deg(g) > deg(h) and define $r^{(0)} = g(x)$ and $r^{(1)} = h(x)$.

At each step the decomposition [Eq. (1.4)] is used with $u(x) = r^{(i-1)}$, $v(x) = r^{(i)}$, $m = \deg(r^{(i-1)})$, $t = \deg(r^{(i)})$, and $\deg(r^{(i-1)}) > \deg(r^{(i)})$, so that

$$r^{(i-1)} = q^{(i)} r^{(i)} + r^{(i+1)},$$

where $q^{(i)} = u_m v_t^{-1} x^{m-t}$, $r^{(i+1)} = r^{(i-1)} - q^{(i)} r^{(i)}$, and $\deg(r^{(i+1)}) < m = \deg(r^{(i-1)})$. At the end of the step, $r^{(i)}$ and $r^{(i+1)}$ are interchanged if $\deg(r^{(i)}) < \deg(r^{(i+1)})$.

Operations:

$r^{(0)} = g(x)$
$r^{(1)} = h(x)$
$r^{(0)} = r^{(1)} q^{(1)} + r^{(2)}$; if $\deg(r^{(1)}) < \deg(r^{(2)})$, interchange $r^{(1)}$ and $r^{(2)}$
$r^{(1)} = r^{(2)} q^{(2)} + r^{(3)}$; if $\deg(r^{(2)}) < \deg(r^{(3)})$, interchange $r^{(2)}$ and $r^{(3)}$
$r^{(2)} = r^{(3)} q^{(3)} + r^{(4)}$; if $\deg(r^{(3)}) < \deg(r^{(4)})$, interchange $r^{(3)}$ and $r^{(4)}$
...
$r^{(n-3)} = r^{(n-2)} q^{(n-2)} + r^{(n-1)}$; if $\deg(r^{(n-2)}) < \deg(r^{(n-1)})$, interchange $r^{(n-2)}$ and $r^{(n-1)}$
$r^{(n-2)} = r^{(n-1)} q^{(n-1)} + r^{(n)}$

where

$$\deg(r^{(0)}) > \deg(r^{(1)}) > \cdots > \deg(r^{(n)}) = 0$$

and $\gcd(r^{(i)}, r^{(i+1)}) = \gcd(r^{(i+1)}, r^{(i+2)})$, so that

$$\gcd(g, h) = \gcd(r^{(0)}, r^{(1)}) = \cdots = \gcd(r^{(n-1)}, r^{(n)}).$$

Let $r_0$ be the coefficient of $x^0$ in $r^{(n)}$. If $r_0 = 0$, then

$$\gcd(g, h) = \gcd(r^{(n-1)}, 0) = r^{(n-1)}.$$

If $r_0 \ne 0$, then

$$\gcd(g, h) = \gcd(r^{(n-1)}, r_0) = 1.$$

In parallel with the computation of $r^{(0)}, r^{(1)}, r^{(2)}, \ldots, r^{(n)}$, two series of polynomials $b^{(i)}$ and $c^{(i)}$ are generated:

$b^{(0)} = 1$
$b^{(1)} = 0$
$b^{(2)} = b^{(0)} - b^{(1)} q^{(1)}$; if $\deg(r^{(1)}) < \deg(r^{(2)})$, interchange $b^{(1)}$ and $b^{(2)}$
...
$b^{(n-1)} = b^{(n-3)} - b^{(n-2)} q^{(n-2)}$; if $\deg(r^{(n-2)}) < \deg(r^{(n-1)})$, interchange $b^{(n-2)}$ and $b^{(n-1)}$
$b^{(n)} = b^{(n-2)} - b^{(n-1)} q^{(n-1)}$

$c^{(0)} = 0$
$c^{(1)} = 1$
$c^{(2)} = c^{(0)} - c^{(1)} q^{(1)}$; if $\deg(r^{(1)}) < \deg(r^{(2)})$, interchange $c^{(1)}$ and $c^{(2)}$
...
$c^{(n-1)} = c^{(n-3)} - c^{(n-2)} q^{(n-2)}$; if $\deg(r^{(n-2)}) < \deg(r^{(n-1)})$, interchange $c^{(n-2)}$ and $c^{(n-1)}$
$c^{(n)} = c^{(n-2)} - c^{(n-1)} q^{(n-1)}$

It can be demonstrated by induction that

$$r^{(i)} = b^{(i)} g(x) + c^{(i)} h(x), \qquad \forall i = 0, 1, 2, \ldots, n.$$

So, if $r_0 = 0$, then

$$\gcd(g, h) = r^{(n-1)} = b^{(n-1)} g(x) + c^{(n-1)} h(x),$$

and if $r_0 \ne 0$, then

$$\gcd(g, h) = 1 = r_0^{-1} r^{(n)} = r_0^{-1} b^{(n)} g(x) + r_0^{-1} c^{(n)} h(x).$$

In the following algorithm u stands for $r^{(i-1)}$, v for $r^{(i)}$, r for $r^{(i+1)}$, b for $b^{(i-1)}$, d for $b^{(i)}$, bb for $b^{(i+1)}$, c for $c^{(i-1)}$, e for $c^{(i)}$, and cc for $c^{(i+1)}$:

Algorithm 1.2: Variant of the extended Euclidean algorithm for polynomials

   u := g; v := h; b := 1; c := 0; d := 0; e := 1;
   m := degree(u); t := degree(v);
   if t = 0 then
      if v(0) = 0 then z := u;
      else z := 1; b := 0; c := (v(0))^(-1);
      end if;
   elsif m = 0 then
      if u(0) = 0 then z := v; b := 0; c := 1;
      else z := 1; b := (u(0))^(-1);
      end if;
   else
      while t > 0 loop
         if m < t then
            swap(u, v); swap(b, d); swap(c, e); swap(m, t);
         end if;
         q := u(m)*(v(t))^(-1)*x^(m-t);
         r := u - v*q; bb := b - d*q; cc := c - e*q;
         u := v; v := r; b := d; c := e; d := bb; e := cc;
         m := t; t := degree(v);
      end loop;
      if v(0) = 0 then z := u;
      else z := 1; b := d*(v(0))^(-1); c := e*(v(0))^(-1);
      end if;
   end if;

1.2.5 Congruences of Polynomials

Definition 1.20 Given three polynomials g(x), h(x), and f(x) in F[x], g(x) is congruent to h(x) modulo f(x) if f(x) divides g(x) − h(x). Notation: $g(x) \equiv h(x) \pmod{f(x)}$.

Properties 1.6 (properties of congruences)
1. $g(x) \equiv h(x) \pmod{f(x)}$ if and only if $(g(x) \bmod f(x)) = (h(x) \bmod f(x))$ (Definition 1.18).
2. The relation $g(x) \equiv h(x) \pmod{f(x)}$ is an equivalence relation (reflexive, symmetric, and transitive).
3. If $g_1(x) \equiv h_1(x) \pmod{f(x)}$ and $g_2(x) \equiv h_2(x) \pmod{f(x)}$, then

$$g_1(x) + g_2(x) \equiv h_1(x) + h_2(x) \pmod{f(x)},$$
$$g_1(x) - g_2(x) \equiv h_1(x) - h_2(x) \pmod{f(x)}, \qquad (1.5)$$
$$g_1(x)\,g_2(x) \equiv h_1(x)\,h_2(x) \pmod{f(x)}.$$

From Properties 1.6(1 and 2) it can be seen that the congruence relation partitions F[x] into equivalence classes. If n is the degree of f(x), then each equivalence class contains exactly one polynomial of degree d < n. So, if F is a finite field, then the number of equivalence classes is equal to $|F|^n$, where |F| is the number of elements in F. Furthermore, according to Property 1.6(3), the addition, subtraction, and multiplication of congruence classes can be defined. As a matter of fact, the set of equivalence classes is isomorphic to $\{g(x) \in F[x] \mid \deg(g) < n\}$, where addition, subtraction, and multiplication are defined by

$$(g(x) + h(x)) \bmod f(x), \qquad (g(x) - h(x)) \bmod f(x), \qquad (g(x)h(x)) \bmod f(x).$$

The set of equivalence classes is denoted by F[x]/f(x).

Properties 1.7
1. F[x]/f(x) is a commutative ring.
2. If f(x) is irreducible, then F[x]/f(x) is a field.

Proofs
1. Consequence of Property 1.6(3).
2. If f(x) is irreducible, then the greatest common divisor of f(x) and any nonzero element g(x) of F[x]/f(x) is 1. Using the Euclidean algorithm, b(x) and c(x) can be computed such that 1 = b(x)f(x) + c(x)g(x), and therefore $c(x) = (g(x))^{-1} \bmod f(x)$.
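The second proof suggests a way to compute inverses in F[x]/f(x). The sketch below illustrates it for $F = Z_p$ under several assumptions that are not in the text: polynomials are coefficient lists (lowest degree first, empty list for the zero polynomial), the classical extended Euclidean algorithm with full polynomial division is used rather than the single-term variant of Algorithm 1.2, and coefficient inverses come from Python's `pow(a, -1, p)` (Python 3.8 or later). All function names are illustrative.

```python
def trim(a):
    """Drop zero leading coefficients (stored at the end of the list)."""
    a = list(a)
    while a and a[-1] == 0:
        a.pop()
    return a


def poly_sub(a, b, p):
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return trim([(x - y) % p for x, y in zip(a, b)])


def poly_mul(a, b, p):
    if not a or not b:
        return []
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return trim(out)


def poly_divmod(g, h, p):
    """Return (q, r) with g = q*h + r and deg(r) < deg(h), over Z_p (p prime)."""
    g, h = trim([c % p for c in g]), trim([c % p for c in h])
    if not h:
        raise ZeroDivisionError("division by the zero polynomial")
    inv_lead = pow(h[-1], -1, p)
    q = [0] * max(len(g) - len(h) + 1, 0)
    r = g
    while len(r) >= len(h):
        shift = len(r) - len(h)
        coef = (r[-1] * inv_lead) % p
        q[shift] = coef
        # cancel the leading term of r with coef * x^shift * h(x)
        r = poly_sub(r, [0] * shift + [(coef * c) % p for c in h], p)
    return trim(q), r


def poly_inverse(g, f, p):
    """Return c(x) with c(x)*g(x) = 1 (mod f(x)) over Z_p, or raise ValueError."""
    r0, r1 = trim([c % p for c in f]), poly_divmod(g, f, p)[1]
    c0, c1 = [], [1]            # invariant: r_i = c_i * g (mod f)
    while r1:
        q, r = poly_divmod(r0, r1, p)
        r0, r1 = r1, r
        c0, c1 = c1, poly_sub(c0, poly_mul(q, c1, p), p)
    if len(r0) != 1:
        raise ValueError("g(x) is not invertible modulo f(x)")
    scale = pow(r0[0], -1, p)   # make the gcd equal to 1
    return poly_divmod([(scale * x) % p for x in c0], f, p)[1]


f = [1, 1, 0, 1]                   # x^3 + x + 1 over Z_2
print(poly_inverse([0, 1], f, 2))  # [1, 0, 1], i.e. x^(-1) = x^2 + 1
```

The result can be confirmed by multiplying back: $x(x^2 + 1) = x^3 + x \equiv 1 \pmod{x^3 + x + 1}$ over $Z_2$.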

Example 1.10 Let $f(x) = x^3 + x + 1 \in Z_2[x]$. From the irreducibility of f(x) over $Z_2$, it follows that $Z_2[x]/f(x)$ is a field. In this case $Z_p = Z_2$, and the field $Z_2[x]/f(x)$ has the $p^n = 2^3$ elements (residue classes) $[0]$, $[1]$, $[x]$, $[x^2]$, $[x+1]$, $[x^2+1]$, $[x^2+x]$, $[x^2+x+1]$. The addition and multiplication tables are obtained by performing the required operations and by carrying out reduction mod f(x) if necessary (Table 1.2):

+

[0]

[1]

[x]

[x2]

[x + 1]

[x2 + 1]

[x2 + x]

[x2 + x + 1] [x2 + x + 1]

[0]

[0]

[1]

[x]

[x ]

[x + 1]

[x + 1]

[x + x]

[1]

[1]

[0]

[x + 1]

[x2 + 1]

[x]

[x2]

[x2 + x + 1] [x2 + x]

[x]

[x]

[x + 1]

[0]

[x2 + x]

[1]

[x2 + x + 1] [x2]

[x2]

[x2]

[x2 + 1]

[x2 + x]

[0]

[x2 + x + 1] [1]

[x2 + x + 1] [0]

2

[x + 1]

[x + 1]

[x]

[1]

[x2 + 1]

[x2 + 1]

[x2]

[x2 + x + 1] [1]

[x2 + x]

[x2 + x]

[x2 + x + 1] [x2]

2

2

[x2 + 1]

[x]

[x + 1]

[x2 + x]

[x2 + 1]

[x2]

[x2 + x]

[0]

[x + 1]

[x]

[x]

[x2 + 1]

[x + 1]

[0]

[1]

[x2 + x + 1] [x2 + x + 1] [x2 + x]

[x2 + 1]

[x + 1]

[x2]

[x]

[1]

[0]

⋅

[0]

[1]

[x]

[x2]

[x + 1]

[x2 + 1]

[x2 + x]

[x2 + x + 1]

[0]

[0]

[0]

[0]

[0]

[0]

[0]

[0]

[0]

[1]

[0]

[1]

[x]

[x2]

[x + 1]

[x2 + 1]

[x2 + x]

[x2 + x + 1]

2

[1]

[x + x + 1] [x2 + 1]

[x]

[0]

[x]

[x ]

[x + 1]

[x + x]

[x2]

[0]

[x2]

[x + 1]

[x2 + x]

[x2 + x + 1] [x]

[x + 1]

[0]

[x + 1]

[x2 + x]

[x2 + x + 1] [x2 + 1]

[x2 + 1]

[0]

[x2 + 1]

[1]

[x]

[x2 + x]

[0]

[x2 + x]

[x2 + x + 1] [x2 + 1]

[x2 + x + 1] [0]

[x2 + x + 1] [x2 + 1]

[1]

2

[x2]

2

[x2 + 1]

[1]

[1]

[x]

[x2]

[x2 + x + 1] [x + 1]

[x2 + x]

[1]

[x + 1]

[x]

[x2]

[x]

[x2 + x]

[x2]

[x + 1]

TABLE 1.2 Addition and Multiplication over Z2 [x]/f(x)
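The entries of Table 1.2 can be recomputed mechanically. The following sketch assumes a common software encoding that the text does not prescribe: each residue class is stored as an integer whose bit i is the coefficient of $x^i$, so that addition is a bitwise XOR and $f(x) = x^3 + x + 1$ is the mask `0b1011`. The function name `gf_mul` and its default arguments are illustrative.

```python
def gf_mul(a, b, f=0b1011, m=3):
    """Multiply a and b in Z_2[x]/f(x), with deg(f) = m and a, b < 2**m.

    Shift-and-add multiplication: a is repeatedly multiplied by x and
    reduced mod f(x); the partial products selected by the bits of b are
    accumulated with XOR (addition in Z_2[x]).
    """
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:   # degree reached m: subtract (XOR) f(x)
            a ^= f
    return result


X, X2 = 0b010, 0b100
print(bin(gf_mul(X, X2)))            # 0b11, i.e. x * x^2 = x + 1
print(bin(gf_mul(0b101, 0b101)))     # 0b111, i.e. (x^2 + 1)^2 = x^2 + x + 1
```

Addition of two classes in this encoding is simply `a ^ b`, which reproduces the first half of the table.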

1.3 Finite Fields

1.3.1 Basic Properties

A finite field is a field F which contains a finite number of elements. The order of a finite field F is the number of elements in F.

Definition 1.21 Let p be a prime, $F = Z_p$, and f(x) an irreducible polynomial of degree n over $Z_p$. The corresponding field F[x]/f(x) contains $q = p^n$ elements and is called either $F_q$ or GF(q) (Galois field).

Two fields are isomorphic if they have the same structure, although the representation of their elements may be different. It can be demonstrated that any finite field contains $q = p^n$ elements, for some prime p and some positive integer n, and is isomorphic to $F_q$ (whatever the irreducible polynomial f(x) of degree n over $Z_p$). In particular, if n = 1, then the corresponding field $F_p$ is isomorphic to $Z_p$. The finite field $F_p$ can henceforth be identified with $Z_p$.

If $F_q$ is a finite field of order $q = p^n$, with p a prime, then the characteristic of $F_q$ is p. Furthermore, $F_q$ contains a copy of $Z_p$ as a subfield. Therefore, $F_q$ can be considered as an extension field of $Z_p$ of degree n.

The set of zero-degree polynomials (the constants) is a subfield of $F_q$ isomorphic to $F_p$. If g(x) is a zero-degree polynomial (an element of $F_p$) then, according to Fermat's little theorem, $(g(x))^p = g(x)$. Conversely, it can be demonstrated that if a polynomial g(x) satisfies the condition $(g(x))^p = g(x)$, then g(x) is a constant.

Another interesting property of $F_q$ is that the set $F_q^*$ of nonzero polynomials is a cyclic group. Let g(x) be a nonzero polynomial, that is, an element of $F_q^*$, and assume that the order of g(x) is t. According to Property 1.5, t divides q − 1, say q − 1 = tk, so that $(g(x))^{q-1} = (g(x))^{tk} = 1^k = 1$.

Consider now a polynomial g(x) and define $h(x) = (g(x))^r$ where r = (q − 1)/(p − 1). According to the previous property, $(h(x))^{p-1} = (g(x))^{q-1} = 1$ and $(h(x))^p = h(x)$, so that h(x) is a constant polynomial.

A last property, useful for performing arithmetic operations, is that $(g(x) + h(x))^p = (g(x))^p + (h(x))^p$. It is a straightforward consequence of the fact that all the binomial coefficients $p!/(i!\,(p-i)!)$ are multiples of p, except for i = 0 or i = p.
To summarize:

Properties 1.8 (some useful properties of finite fields)
1. The set of zero-degree polynomials in $F_q$ is a subfield of $F_q$ isomorphic to $F_p$.
2. Given g(x) in $F_q$ such that $(g(x))^p = g(x)$, then $g(x) \in F_p$.
3. The set of nonzero polynomials of $F_q$ is a cyclic group denoted by $F_q^*$.
4. Given g(x) in $F_q$, then $(g(x))^q = g(x)$ (Fermat's little theorem).
5. Given g(x) and h(x) in $F_q$, then $(g(x) + h(x))^{p^s} = (g(x))^{p^s} + (h(x))^{p^s}$, for all s ≥ 0.
6. If $r = (p^n - 1)/(p - 1)$, that is, $r = 1 + p + p^2 + \cdots + p^{n-1}$, and g(x) is an element of $F_q$, then $(g(x))^r$ is an element of $F_p$.

Example 1.11 p = 2, n = 4, $f(x) = 1 + x + x^4$, so that $x^4 \equiv 1 + x \bmod f(x)$; α = x is a generator of the cyclic group $F_{16}^*$:

$\alpha^1 = x$
$\alpha^2 = x^2$
$\alpha^3 = x^3$
$\alpha^4 = x^4 \equiv 1 + x$
$\alpha^5 = x(1 + x) = x + x^2$
$\alpha^6 = x(x + x^2) = x^2 + x^3$
$\alpha^7 = x(x^2 + x^3) = x^3 + x^4 \equiv 1 + x + x^3$
$\alpha^8 = (\alpha^4)^2 = (1 + x)^2 = 1 + x^2$
$\alpha^9 = x(1 + x^2) = x + x^3$
$\alpha^{10} = x(x + x^3) = x^2 + x^4 \equiv 1 + x + x^2$
$\alpha^{11} = x(1 + x + x^2) = x + x^2 + x^3$
$\alpha^{12} = x(x + x^2 + x^3) = x^2 + x^3 + x^4 \equiv 1 + x + x^2 + x^3$
$\alpha^{13} = x(1 + x + x^2 + x^3) = x + x^2 + x^3 + x^4 \equiv 1 + x^2 + x^3$
$\alpha^{14} = x(1 + x^2 + x^3) = x + x^3 + x^4 \equiv 1 + x^3$
$\alpha^{15} = x(1 + x^3) = x + x^4 \equiv 1$

Given a polynomial $g(x) = g_0 + g_1 x + g_2 x^2 + g_3 x^3$, then

$$(g(x))^2 = g_0 + g_1 x^2 + g_2 x^4 + g_3 x^6 \equiv g_0 + g_1 x^2 + g_2(1 + x) + g_3 x^2(1 + x) = (g_0 + g_2) + g_2 x + (g_1 + g_3)x^2 + g_3 x^3.$$

If $(g(x))^2 = g(x)$, then $g_0 + g_2 = g_0$, $g_2 = g_1$, and $g_1 + g_3 = g_2$; thus $g_1 = g_2 = g_3 = 0$ and $g(x) = g_0$, that is, an element of $F_p$ (Property 1.8(2)).
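The power table and the squaring formula of Example 1.11 can be cross-checked in a few lines. The sketch below assumes a bit-vector encoding that is not part of the text (bit i of an integer is the coefficient of $x^i$, so $f(x) = 1 + x + x^4$ is `0b10011`); the names `times_x`, `powers_of_alpha`, and `square` are illustrative. The function `square` simply transcribes the closed formula $(g_0 + g_2) + g_2 x + (g_1 + g_3)x^2 + g_3 x^3$.

```python
F = 0b10011  # f(x) = x^4 + x + 1


def times_x(a):
    """Multiply the element a of F_16 by alpha = x and reduce mod f(x)."""
    a <<= 1
    if a & 0b10000:
        a ^= F
    return a


def powers_of_alpha():
    """Return [alpha^1, alpha^2, ..., alpha^15]."""
    out, a = [], 1
    for _ in range(15):
        a = times_x(a)
        out.append(a)
    return out


def square(g):
    """Squaring in F_16 through the closed formula of Example 1.11."""
    g0, g1, g2, g3 = ((g >> i) & 1 for i in range(4))
    return (g0 ^ g2) | (g2 << 1) | ((g1 ^ g3) << 2) | (g3 << 3)


pw = powers_of_alpha()
print(len(set(pw)) == 15 and pw[-1] == 1)            # True: alpha has order 15
print([g for g in range(16) if square(g) == g])      # [0, 1]: the elements of F_2
```

The second line of output illustrates Property 1.8(2): the only elements fixed by squaring are the constants 0 and 1.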

1.3.2 Field Extensions

Definition 1.22 Let E be a subfield of the field F and M any subset of F. Then the field E(M) is defined as the intersection of all subfields of F containing both E and M and is called the extension field of E obtained by adjoining the elements in M. For a finite subset $M = \{\theta_1, \ldots, \theta_n\}$, $E(M) = E(\theta_1, \ldots, \theta_n)$. If M consists of a single element θ ∈ F, then L = E(θ) is said to be a simple extension of E and θ is a defining element of L over E.

Definition 1.23 Let E be a subfield of F and θ ∈ F. If θ satisfies a nontrivial polynomial equation with coefficients in E, that is, if $a_0 + a_1\theta + \cdots + a_n\theta^n = 0$ with $a_i \in E$ not all being 0, then θ is said to be algebraic over E. An extension L of E is called an algebraic extension of E if every element of L is algebraic over E.

Definition 1.24 If θ ∈ F is algebraic over E, then the uniquely determined monic polynomial f ∈ E[x] generating the ideal J = {g ∈ E[x] : g(θ) = 0} of E[x] is called the minimal (or irreducible, or defining) polynomial of θ over E. The degree of θ over E means the degree of f.

An extension field L of E may be viewed as a vector space over E. L forms an abelian group under addition. Furthermore, each “vector” α in L can be multiplied by a “scalar” k in E so that kα is in L and the laws for multiplication by scalars are satisfied: (k + r)α = kα + rα, k(α + β) = kα + kβ, (kr)α = k(rα), and 1α = α, where α, β ∈ L and k, r ∈ E [LN94].

Definition 1.25 Let L be an extension field of E. If L, considered as a vector space over E, is finite-dimensional, then L is called a finite extension of E. The dimension of the vector space L over E is called the degree of L over E, and it is represented as [L:E].

Given a simple extension E(θ) of E obtained by adjoining an algebraic element θ, it can be observed that if F is an extension of E and if θ ∈ F is algebraic over E, then E(θ) is an algebraic and finite extension of E. Furthermore, E(θ) is isomorphic to E[x]/f if θ ∈ F is algebraic of degree n over E and f is the minimal polynomial of θ over E. It can also be proven that the elements of the simple algebraic extension E(θ) of E are polynomial expressions in θ, and that any element of E(θ) can be uniquely represented in the form $a_0 + a_1\theta + \cdots + a_{n-1}\theta^{n-1}$ with $a_i \in E$ for 0 ≤ i ≤ n − 1, where n = [E(θ):E] and $\{1, \theta, \theta^2, \ldots, \theta^{n-1}\}$ is a basis of E(θ) over E.

Theorem 1.1 Let f ∈ E[x] be irreducible over the field E. Then there exists a simple algebraic extension of E with a root of f as a defining element.

Following is an example of root adjunction.

Example 1.12 Let $f(x) = x^2 + x + 2 \in F_3[x]$, which is irreducible over $F_3$, and let θ be a root of f. It can be proven that the other root of f in $L = F_3[x]/f$ is 2θ + 2, since $f(2\theta + 2) = (2\theta + 2)^2 + (2\theta + 2) + 2 = \theta^2 + \theta + 2 = 0$. Therefore, the simple algebraic extension $L = F_3(\theta)$ consists of the following nine elements: {0, 1, 2, θ, θ + 1, θ + 2, 2θ, 2θ + 1, 2θ + 2}.

It must be noted that in Example 1.12 we may adjoin either the root θ or the root 2θ + 2 of f, and the same field would be obtained. This fact is covered as follows. Let α and β be two roots of the polynomial f ∈ E[x] that is irreducible over E. Then E(α) and E(β) are isomorphic under an isomorphism mapping α to β and keeping the elements of E fixed.

1.3.3 Roots of Irreducible Polynomials

As described previously, starting from the prime fields $F_p$, other finite fields can be constructed by the process of root adjunction. If $f \in F_p[x]$ is an irreducible polynomial over $F_p$ of degree n, then a finite field with $p^n$ elements can be obtained by adjoining a root of f to $F_p$. Furthermore, for every prime p and every positive integer n there exists a finite field with $p^n$ elements, and therefore we can speak of the finite field (or the Galois field) with $q = p^n$ elements, or of the finite field of order q. This field is denoted by $F_q$ or GF(q), where q is a power of the prime characteristic p of $F_q$.

Theorem 1.2 If f is an irreducible polynomial in $F_q[x]$ of degree m, then f has a root α in $F_{q^m}$. Furthermore, all the roots of f are simple and are given by the m distinct elements $\alpha, \alpha^q, \alpha^{q^2}, \ldots, \alpha^{q^{m-1}}$ of $F_{q^m}$.

Let $F_{q^m}$ be an extension of $F_q$ and let $\alpha \in F_{q^m}$. Then the elements $\alpha, \alpha^q, \alpha^{q^2}, \ldots, \alpha^{q^{m-1}}$ are called the conjugates of α with respect to $F_q$; they are distinct if and only if the minimal polynomial of α over $F_q$ has degree m. It can also be proven that if α is a primitive element of $F_q$, then so are all its conjugates with respect to any subfield of $F_q$.

Example 1.13 Let $\alpha \in F_{2^4} = F_{16}$ be a root of $f(x) = x^4 + x + 1 \in F_2[x]$. The conjugates of α with respect to $F_2$ are $\alpha$, $\alpha^2$, $\alpha^4 = \alpha + 1$, and $\alpha^8 = \alpha^2 + 1$, where each of them is a primitive element of $F_{16}$.

1.3.4 Bases of Finite Fields

Considering a finite extension $F = F_{q^m}$ of the finite field $E = F_q$ as a vector space over E, then F has dimension m over E. Moreover, if $\{\alpha_1, \alpha_2, \ldots, \alpha_m\}$ is a basis of F over E, then each element α in F can be uniquely represented by $\alpha = a_1\alpha_1 + a_2\alpha_2 + \cdots + a_m\alpha_m$, with $a_i \in E$ for 1 ≤ i ≤ m.

Definition 1.26 If $\alpha \in F = F_{q^m}$ and $E = F_q$, then the trace of α over E is defined by

$$\mathrm{Tr}(\alpha) = \alpha + \alpha^q + \cdots + \alpha^{q^{m-1}} \qquad (1.6)$$

It must be noted that the trace of α over E is the sum of the conjugates of α with respect to E. Furthermore, Tr(α) is an element of E. Let $F = F_{q^m}$ and $E = F_q$. Then the trace function satisfies the following properties:

Properties 1.9
1. Tr(α + β) = Tr(α) + Tr(β), for all α, β ∈ F.
2. Tr(aα) = a Tr(α), for all a ∈ E, α ∈ F.
3. The trace is a linear transformation from F onto E, where F and E are viewed as vector spaces over E.
4. Tr(a) = ma, for all a ∈ E.
5. $\mathrm{Tr}(\alpha^q) = \mathrm{Tr}(\alpha)$, for all α ∈ F.

The important definition of duality is given in the following.

Definition 1.27 Let E be a finite field and F a finite extension of E. Then two bases $\{\alpha_1, \alpha_2, \ldots, \alpha_m\}$ and $\{\beta_1, \beta_2, \ldots, \beta_m\}$ of F over E are said to be dual bases if

$$\mathrm{Tr}(\alpha_i \beta_j) = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{if } i \ne j \end{cases} \qquad (1.7)$$

where 1 ≤ i, j ≤ m.

There exist many distinct bases of F over E, but two types of bases are particularly important. The first is a polynomial basis $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$, made up of the powers of a defining element α of F over E, where α is often taken to be a primitive element of F. The other important type of basis is a normal basis, defined by a suitable element of F.

By an E-automorphism of F (or an automorphism of F over E) we mean an automorphism of $F = F_{q^m} = GF(q^m)$ that fixes the elements of $E = F_q = GF(q)$. The set of the E-automorphisms of F is a group, named the Galois group of F over E, generated by the Frobenius automorphism $\varphi(\alpha) = \alpha^q$, for α ∈ F, and made up of the m distinct elements $G_0, G_1, \ldots, G_{m-1}$ defined as follows:

$$G_i : F \to F, \qquad \alpha \mapsto \alpha^{q^i} = \alpha G_i, \quad \alpha \in F, \qquad (1.8)$$

where $G_i = G_1^i$ and $G_1^m = G_1^0 = G_0 = I$ (identity automorphism). Then, a basis $\{\beta_0, \beta_1, \ldots, \beta_{m-1}\}$ is a normal basis for F over E if $\beta_i = \alpha G_i$ for some element α ∈ F. Therefore, the set $\{\alpha, \alpha^q, \alpha^{q^2}, \ldots, \alpha^{q^{m-1}}\}$, where α is a suitable element of F, will be a normal basis if the m elements are linearly independent, and α will be the generator or normal element of the normal basis.

Definition 1.28 Let $F = F_{q^m}$ and $E = F_q$. Then a basis of F over E of the form $\{\alpha, \alpha^q, \alpha^{q^2}, \ldots, \alpha^{q^{m-1}}\}$, consisting of a suitable element α ∈ F and its conjugates with respect to E, is called a normal basis of F over E.

Example 1.14 Let $\alpha \in F_{2^3} = F_8$ be a root of the irreducible polynomial $f(x) = x^3 + x^2 + 1 \in F_2[x]$. Then the basis $\{\alpha, \alpha^2, \alpha^4 = \alpha^2 + \alpha + 1\}$ is a normal basis of $F_8$ over $F_2$, because $\alpha^4 = \alpha\alpha^3 = \alpha(\alpha^2 + 1) = \alpha^3 + \alpha = \alpha^2 + \alpha + 1$.
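Example 1.14 can be verified by computing the conjugates of α and testing their linear independence over $F_2$. The sketch below is an illustration only and rests on assumptions not made in the text: elements are encoded as bit vectors (bit i is the coefficient of $\alpha^i$), squaring is computed term by term using the linearity of the Frobenius map in characteristic 2, and independence is tested by enumerating the span. All function names are hypothetical.

```python
def alpha_pow(k, f, m):
    """Return alpha^k in GF(2^m) = Z_2[x]/f(x) as a bit vector."""
    a = 1
    for _ in range(k):
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return a


def frobenius(g, f, m):
    """g -> g^2, using (a + b)^2 = a^2 + b^2 in characteristic 2."""
    out = 0
    for i in range(m):
        if (g >> i) & 1:
            out ^= alpha_pow(2 * i, f, m)
    return out


def conjugates(g, f, m):
    """Return [g, g^2, g^4, ..., g^(2^(m-1))]."""
    out = [g]
    for _ in range(m - 1):
        out.append(frobenius(out[-1], f, m))
    return out


def trace(g, f, m):
    """Trace of g over F_2: the sum (XOR) of its conjugates."""
    t = 0
    for c in conjugates(g, f, m):
        t ^= c
    return t


def is_basis(vectors, m):
    """m vectors form a basis of GF(2^m) over F_2 iff they span 2^m elements."""
    span = {0}
    for v in vectors:
        span |= {s ^ v for s in span}
    return len(vectors) == m and len(span) == 2 ** m


f, m = 0b1101, 3                       # f(x) = x^3 + x^2 + 1
conj = conjugates(0b010, f, m)
print([bin(c) for c in conj])          # ['0b10', '0b100', '0b111']
print(is_basis(conj, m), trace(0b010, f, m))   # True 1
```

With $f(x) = x^3 + x + 1$ (mask `0b1011`) instead, the conjugates of α are α, α², and α² + α, which are linearly dependent, so in that field α does not generate a normal basis.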

1.3.5 Finite Fields $GF(2^m)$

Finite fields $GF(2^m) = F_{2^m}$ are extension fields of $GF(2) = F_2 = Z_2$. Finite fields of order $2^m$ are characteristic 2 finite fields, also known as binary extension fields. Binary fields $GF(2^m)$ are of fundamental interest due to their wide range of technical applications, such as algebraic codes, cryptographic schemes, random number generators, digital signal processing, and VLSI testing. The elements of the finite field $GF(2^m)$ are the polynomials $\{0, 1, \alpha, \alpha + 1, \alpha^2, \alpha^2 + 1, \ldots, \alpha^{m-1} + \alpha^{m-2} + \cdots + \alpha + 1\}$, where α is a root of an irreducible polynomial f(x) of degree m over GF(2), f(α) = 0, and where the polynomial coefficients are in GF(2) = {0, 1}. All the concepts studied in previous subsections can be easily adapted to this particular case of $GF(2^m)$.

Example 1.15 Let $\alpha \in GF(2^4) = F_{2^4}$ be a root of the irreducible polynomial $f(x) = x^4 + x^3 + 1 \in GF(2)[x]$. Then the elements of $GF(2^4)$ represented in the polynomial basis $\{\alpha^3, \alpha^2, \alpha, 1\}$ are given in Table 1.3.

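| Elements in $GF(2^4)$ | Polynomial | Coordinates |
|---|---|---|
| 0 | 0 | (0,0,0,0) |
| $\alpha$ | $\alpha$ | (0,0,1,0) |
| $\alpha^2$ | $\alpha^2$ | (0,1,0,0) |
| $\alpha^3$ | $\alpha^3$ | (1,0,0,0) |
| $\alpha^4$ | $\alpha^3 + 1$ | (1,0,0,1) |
| $\alpha^5$ | $\alpha^3 + \alpha + 1$ | (1,0,1,1) |
| $\alpha^6$ | $\alpha^3 + \alpha^2 + \alpha + 1$ | (1,1,1,1) |
| $\alpha^7$ | $\alpha^2 + \alpha + 1$ | (0,1,1,1) |
| $\alpha^8$ | $\alpha^3 + \alpha^2 + \alpha$ | (1,1,1,0) |
| $\alpha^9$ | $\alpha^2 + 1$ | (0,1,0,1) |
| $\alpha^{10}$ | $\alpha^3 + \alpha$ | (1,0,1,0) |
| $\alpha^{11}$ | $\alpha^3 + \alpha^2 + 1$ | (1,1,0,1) |
| $\alpha^{12}$ | $\alpha + 1$ | (0,0,1,1) |
| $\alpha^{13}$ | $\alpha^2 + \alpha$ | (0,1,1,0) |
| $\alpha^{14}$ | $\alpha^3 + \alpha^2$ | (1,1,0,0) |
| $\alpha^{15}$ | 1 | (0,0,0,1) |

TABLE 1.3 Representation of Elements of $GF(2^4)$ in the Polynomial Basis $\{\alpha^3, \alpha^2, \alpha, 1\}$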
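A table such as Table 1.3 is also a practical multiplication aid: since every nonzero element is a power of α, products can be obtained by adding exponents modulo 15. The following sketch illustrates this log/antilog technique, which is a standard software method and not one described in the text; the encoding (bit i is the coefficient of $\alpha^i$, so $f(x) = x^4 + x^3 + 1$ is `0b11001`) and all names are assumptions made for the illustration.

```python
F, M = 0b11001, 4          # f(x) = x^4 + x^3 + 1
ORDER = 2 ** M - 1         # number of nonzero elements


def build_tables():
    """EXP[k] = alpha^k as a bit vector; LOG is the inverse mapping."""
    exp, log = [], {}
    a = 1
    for k in range(ORDER):
        exp.append(a)
        log[a] = k
        a <<= 1            # multiply by alpha
        if (a >> M) & 1:
            a ^= F         # reduce mod f(alpha)
    return exp, log


EXP, LOG = build_tables()


def mul(a, b):
    """Multiply in GF(2^4) by adding discrete logarithms."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % ORDER]


def coordinates(a):
    """Coordinates of a in the polynomial basis {alpha^3, alpha^2, alpha, 1}."""
    return tuple((a >> i) & 1 for i in reversed(range(M)))


print(coordinates(EXP[5]))     # (1, 0, 1, 1), the row of alpha^5 in Table 1.3
print(mul(EXP[7], EXP[8]))     # 1, since alpha^15 = 1
```

The table-based product costs two lookups and one modular addition, which is why this representation is popular for small fields; for large m the tables become impractical.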

1.4 References

[Coh93] H. Cohen. A Course in Computational Algebraic Number Theory. Springer-Verlag, Berlin, 1993.
[Gar59] H. Garner. “The residue number system,” IRE Transactions on Electronic Computers, EC-8, 1959, pp. 140–147.
[GG03] J. von zur Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, New York, 2003.
[GN03] W. J. Gilbert and W. K. Nicholson. Modern Algebra with Applications. John Wiley & Sons, New York, 2003.
[Her75] I. N. Herstein. Topics in Algebra. 2d ed. Xerox College Pub., Lexington, Massachusetts, 1975.
[Hun74] T. W. Hungerford. Algebra. Holt, Rinehart and Winston, New York, 1974.
[Kob94] N. Koblitz. A Course in Number Theory and Cryptography. Springer-Verlag, New York, 1994.
[LN83] R. Lidl and H. Niederreiter. Finite Fields. Addison-Wesley, Reading, Massachusetts, 1983.
[LN94] R. Lidl and H. Niederreiter. Introduction to Finite Fields and Their Applications. Cambridge University Press, New York, 1994.
[McC87] R. J. McEliece. Finite Fields for Computer Scientists and Engineers. Kluwer Academic Publishers, Boston, 1987.
[Men93] A. J. Menezes (ed.). Applications of Finite Fields. Kluwer Academic, Boston-London-Dordrecht, 1993.
[MOV96] A. J. Menezes, P. C. van Oorschot, and S. C. Vanstone. Handbook of Applied Cryptography. CRC Press, Boca Raton, Florida, 1996.
[Ros92] K. H. Rosen. Elementary Number Theory and Its Applications. Addison-Wesley, Reading, Massachusetts, 1992.
[Ros00] K. H. Rosen (editor-in-chief). Handbook of Discrete and Combinatorial Mathematics. CRC Press, Boca Raton, 2000.


CHAPTER 2

mod m Reduction

Arithmetic operations over the finite ring $Z_m = \{0, 1, \ldots, m-1\}$ are used as computation primitives for executing numerous cryptographic algorithms, especially those related to the use of public keys (asymmetric cryptography). Classical examples are ciphering/deciphering, authentication, and digital signature protocols based on RSA-type or elliptic-curve algorithms. One of the basic operations is the modulo m reduction. Given two naturals x and m, it computes z = x mod m. Combined with operations over the set Z of integers (sum, subtraction, product, and so on) it allows one to perform the same operations over $Z_m$. In this chapter several algorithms are described, namely, the integer division, the reduction mod $B^k - a$, the precomputation of $B^{ik} \bmod m$, and the Barrett algorithm. All the mentioned algorithms have been synthesized and implemented within field programmable components.

2.1 Integer Division

A straightforward method for computing z = x mod m consists in performing the integer division of x by m, that is,

    $x = qm + z, \quad z < m$

For that purpose, any division algorithm can be used.

2.1.1 Digit Recurrence Algorithms

Digit recurrence algorithms ([Par00], [EL04], [DBS06]) are based on the following property:

Property 2.1 Given a natural $y$ and an integer $s$ belonging to the range $-2y \le s < 2y$, the equation

    $s = qy + r$    (2.1)

has at least one solution with $q \in \{-1, 0, 1\}$ and $-y \le r < y$.

Proof The possible values of $q$ and $r$, as a function of $s$ and $y$, are shown in Fig. 2.1, a so-called Robertson diagram.

FIGURE 2.1 Robertson diagram (r as a function of s over $-2y \le s < 2y$, with one segment for each of q = −1, q = 0, and q = 1; r ranges from −y to y).

Thus, $r$ is equal to either $s - y$ (if $q = 1$), $s$ (if $q = 0$), or $s + y$ (if $q = -1$).

Now consider a natural $m$ belonging to the range

    $2^{k-1} \le m < 2^k$    (2.2)

and an integer $x$ belonging to the range

    $-2^n \le x < 2^n$    (2.3)

with $n \ge k$. Then define

    $y = m \cdot 2^{n-k}$    (2.4)

so that

    $2^{n-1} \le y < 2^n$    (2.5)

From Eqs. (2.3), (2.4), and (2.5)

    $-2y \le -2^n \le x < 2^n \le 2y$    (2.6)

Then, use Property 2.1 and compute

    $x = q_1 y + r_1$
    $2r_1 = q_2 y + r_2$
    $2r_2 = q_3 y + r_3$
    ...
    $2r_{n-k} = q_{n-k+1} y + r_{n-k+1}$    (2.7)

According to Eq. (2.6), $-2y \le x < 2y$, so that Property 2.1 can be used and $-y \le r_1 < y$. Similarly, as $-2y \le 2r_1 < 2y$, Property 2.1 can be used and $-y \le r_2 < y$, and so on. To summarize

    $-y \le r_i < y, \quad \forall i = 1, 2, \ldots, n-k+1$    (2.8)

Multiply the first Eq. (2.7) by $2^{n-k}$, the second one by $2^{n-k-1}$, the third one by $2^{n-k-2}$, and so on, and add up the $n-k+1$ equations. The result is

    $x 2^{n-k} = (q_1 2^{n-k} + q_2 2^{n-k-1} + \cdots + q_{n-k+1}) y + r_{n-k+1}$    (2.9)

Then, according to Eq. (2.4)

    $x 2^{n-k} = (q_1 2^{n-k} + q_2 2^{n-k-1} + \cdots + q_{n-k+1}) m 2^{n-k} + r_{n-k+1}$    (2.10)

so that $r_{n-k+1}$ is divisible by $2^{n-k}$ and

    $x \equiv r_{n-k+1}/2^{n-k} \pmod{m}$    (2.11)

From Eqs. (2.4) and (2.8),

    $-m \le r_{n-k+1}/2^{n-k} < m$    (2.12)

Thus, if $r_{n-k+1} \ge 0$ then $x \bmod m = r_{n-k+1}/2^{n-k}$, and if $r_{n-k+1} < 0$ then $x \bmod m = m + r_{n-k+1}/2^{n-k}$. Assume that a function quotient(s,y) has been defined. It generates a solution $q \in \{-1, 0, 1\}$ of Eq. (2.1). The following formal algorithm computes z = x mod m.

Algorithm 2.1—Generic digit-recurrence reduction algorithm

    y := m*(2**(n-k));
    s := x;
    for i in 0 .. n-k loop
       if quotient(s,y) = 1 then r := s - y;
       elsif quotient(s,y) = 0 then r := s;
       else r := s + y;
       end if;
       s := 2*r;
    end loop;
    z := r/(2**(n-k));
    if z < 0 then z := (z + m); end if;

Observe (Fig. 2.1) that if − y ≤ s < y, then there are two solutions for q. The way a particular solution is chosen corresponds to the definition of a particular digit-recurrence algorithm.
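The structure of Algorithm 2.1 can be sketched in Python as follows. This is only an illustration, not the authors' Ada code: the names digit_recurrence_mod and smallest_quotient are hypothetical, and the selection rule shown (use q = 0 whenever the Robertson diagram allows it) is just one of the admissible choices.

```python
def digit_recurrence_mod(x, m, n, k, quotient):
    """Reduce x modulo m following the structure of Algorithm 2.1.

    Assumes 2**(k-1) <= m < 2**k, n >= k and -2**n <= x < 2**n.
    `quotient(s, y)` must return q in {-1, 0, 1} with -y <= s - q*y < y.
    """
    y = m << (n - k)              # y = m * 2**(n-k), Eq. (2.4)
    s = x
    r = x
    for _ in range(n - k + 1):
        q = quotient(s, y)
        r = s - q * y             # r = s - y, s, or s + y
        assert -y <= r < y        # Eq. (2.8)
        s = 2 * r
    z = r >> (n - k)              # exact: r is a multiple of 2**(n-k)
    return z + m if z < 0 else z


def smallest_quotient(s, y):
    """Hypothetical selection rule: choose q = 0 whenever it is allowed."""
    if s >= y:
        return 1
    if s < -y:
        return -1
    return 0
```

Any other function returning a valid q for Eq. (2.1) can be passed instead of smallest_quotient; the loop body does not change.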

2.1.2 Nonrestoring Reducer

According to Fig. 2.1 the value of q can be defined as follows:

    $q = -1$ if $s < 0$, $\quad q = 1$ if $s \ge 0$    (2.13)

This definition corresponds to the classical nonrestoring algorithm in which, at each step, the value of q only depends on the sign of s. An executable Ada file nr_reducer.adb is available at www.arithmetic-circuits.org.

At each step of Algorithm 2.1, a subtraction s − y or an addition s + y (or no operation) is computed, where $-2y \le s < 2y$, so that [Eq. (2.5)]

    $-2^{n+1} \le s < 2^{n+1}$    (2.14)

which means that s is an (n + 2)-bit 2's-complement integer, and y [Eqs. (2.4) and (2.5)] is an n-bit natural whose n − k least significant bits are 0s. The datapath corresponding to Algorithm 2.1 with the quotient function defined by Eq. (2.13) is shown in Fig. 2.2. The minimum period of the clock signal is equal to the delay of a (k + 1)-bit adder. If carry-ripple adders are used, a lower bound of the total computation time, including the final correction step if z < 0, is given by

    $T = [(n-k+1)(k+1) + k]\,T_{FA} \approx k(n-k)\,T_{FA}$    (2.15)
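As an illustration of Eqs. (2.13) to (2.15), the following Python sketch specializes the digit-recurrence loop to the sign-based quotient and checks the bound of Eq. (2.14) at every step. The function names are hypothetical, and the sketch models the arithmetic only, not the circuit of Fig. 2.2.

```python
def nonrestoring_mod(x, m, n, k):
    """Algorithm 2.1 with the sign-based quotient of Eq. (2.13).

    Assumes 2**(k-1) <= m < 2**k, n >= k and -2**n <= x < 2**n.
    """
    y = m << (n - k)
    s = x
    r = x
    for _ in range(n - k + 1):
        # Eq. (2.14): s fits in an (n+2)-bit two's-complement word.
        assert -(1 << (n + 1)) <= s < (1 << (n + 1))
        r = s - y if s >= 0 else s + y   # q = 1 or q = -1
        s = 2 * r
    z = r >> (n - k)                     # exact division by 2**(n-k)
    return z + m if z < 0 else z


def ripple_carry_time(n, k):
    """Bracketed factor of Eq. (2.15), in units of T_FA."""
    return (n - k + 1) * (k + 1) + k
```

For example, ripple_carry_time(16, 8) evaluates the exact factor for a 16-bit operand and an 8-bit modulus, which can be compared with the approximation k(n − k).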

$T_{FA}$ being the delay of a 1-bit full adder.

FIGURE 2.2 Nonrestoring reducer datapath.

A complete VHDL file nr_reducer.vhd is available at www.arithmetic-circuits.org. The entity declaration is

    entity nr_reducer is
    port (
      x: in std_logic_vector(N downto 0);
      m: in std_logic_vector(K-1 downto 0);
      clk, reset, start: in std_logic;
      z: out std_logic_vector(K-1 downto 0);
      done: out std_logic
    );
    end nr_reducer;

A simple communication protocol, based on a command signal start and a control signal done, is used (Fig. 2.3). The same type of protocol will be used throughout the book for all VHDL models. The VHDL architecture corresponding to the circuit of Fig. 2.2 is the following:

    r(N downto N-K) <= s(N downto N-K) + w;
    r(N-K-1 downto 0) <= s(N-K-1 downto 0);
    with r(N) select z <=
      r(N-1 downto N-K) when '0',
      r(N-1 downto N-K) + m when others;

    registers: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then s <= x(N) & x; --sign extension
        elsif update = '1' then s <= r(N downto 0) & '0';
        end if;
      end if;
    end process registers;

    minus_m <= ('1' & not(m)) + 1; --two's complement of m
    with s(N+1) select w <=
      minus_m when '0',
      ('0' & m) when others;

The complete model additionally includes an (n − k)-state counter and a control unit generating the load, update, and done signals.

FIGURE 2.3 Communication protocol (input data x, m; command signal start; control signal done; output data z = x mod m).

2.1.3 SRT Reducer

In order to reduce the computation time, an interesting idea is to use carry-save adders, that is, to encode the successive values of s under the form

    s = ss + sc

the so-called stored-carry encoding. Then the operations r = s + y, r = s, and r = s − y are substituted by

    rs + rc = ss + sc + y,   rs + rc = ss + sc,   and   rs + rc = ss + sc − y    (2.16)

where rs + rc is the stored-carry representation of r. Assume that all integers are represented as (n + 2)-bit 2's-complement numbers. Then the computation of rs + rc = ss + sc + w, where w is equal to either y, 0, or the 2's-complement representation of −y, is performed as follows:

    rs(i) = (ss(i) + sc(i) + w(i)) mod 2,   ∀i = 0, 1, . . . , n + 2
    rc(0) = 0
    rc(i) = ⌊(ss(i − 1) + sc(i − 1) + w(i − 1))/2⌋,   ∀i = 1, 2, . . . , n + 2    (2.17)

that is, carry-free operations. In the following algorithm the functions csa_sum(ss, sc, w) and csa_carry(ss, sc, w) return rs and rc according to Eq. (2.17).
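A word-level reading of Eq. (2.17) can be sketched in Python as follows. The explicit width parameter is an assumption of this sketch (the text fixes the word length instead); the bitwise expressions compute, for every position at once, the sum bit and the carry into the next position.

```python
def csa_sum(ss, sc, w, width):
    """Sum word rs of Eq. (2.17): rs(i) = (ss(i) + sc(i) + w(i)) mod 2."""
    return (ss ^ sc ^ w) & ((1 << width) - 1)


def csa_carry(ss, sc, w, width):
    """Carry word rc of Eq. (2.17): rc(0) = 0, rc(i) is the carry out of bit i-1."""
    majority = (ss & sc) | (ss & w) | (sc & w)
    return (majority << 1) & ((1 << width) - 1)
```

No carry propagates from one bit position to the next inside either function, which is why the delay of this step does not depend on the word length; the sum rs + rc equals ss + sc + w modulo 2**width.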

Algorithm 2.2—Generic digit-recurrence carry-save reduction algorithm

    y := m*(2**(n-k));
    --2's complement representation of -y and x
    minus_y := (2**(n+2)) - y;
    if x >= 0 then ss := x; sc := 0;
    else ss := (2**(n+2)) + x; sc := 0;
    end if;
    --main loop
    for i in 0 .. n-k loop
       if quotient(ss, sc, y) = 1 then
          rs := csa_sum(ss, sc, minus_y); rc := csa_carry(ss, sc, minus_y);
       elsif quotient(ss, sc, y) = 0 then
          rs := csa_sum(ss, sc, 0); rc := csa_carry(ss, sc, 0);
       else
          rs := csa_sum(ss, sc, y); rc := csa_carry(ss, sc, y);
       end if;
       ss := 2*rs; sc := 2*rc;
    end loop;
    --final step
    z := ((rs+rc) mod (2**(n+1)))/(2**(n-k));
    --correction if z < 0
    if z >= 2**k then z := (z+m) mod 2**k; end if;

In order to get an executable algorithm it remains to define the quotient function. It has been shown ([Par00], [EL04], [DBS06]) that the decision can be taken without computing the exact value of s = ss + sc, but just an approximation of s (the so-called SRT algorithm, after Sweeney, Robertson, and Tocher). Actually, it is enough to know the four most significant bits of ss + sc (Table 6.3 of [DBS06]). Let ss' and sc' be the 4-bit naturals corresponding to the four most significant bits of ss and sc, and compute t = (ss' + sc') mod 16. Then the value of q can be defined as a function of t: q = 1 if t ≤ 2, q = −1 if 11 ≤ t ≤ 14, q = 0 if t = 15, and t never belongs to the interval 3 ≤ t ≤ 10.

Algorithm 2.3—Computation of q-SRT algorithm with stored-carry encoding

    function quotient(ss, sc, y: in natural) return integer is
       ss_high, sc_high, t: natural;
    begin
       ss_high := ss / (2**(n-1));
       sc_high := sc / (2**(n-1));
       t := (ss_high+sc_high) mod 16;
       if t <= 2 then return 1;
       elsif t < 15 then return -1;
       else return 0;
       end if;
    end quotient;
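The carry-save recurrence and the 4-bit quotient selection can be combined in a short Python sketch. It is written in the spirit of Algorithms 2.2 and 2.3 but is not a transcription of them, and it rests on two assumptions that should be read as such: all words are kept modulo 2**(n+3) with x sign-extended (as in the VHDL register load), and t is interpreted as a 4-bit signed estimate, so that every nonnegative estimate selects q = 1. For the values of t listed in the text (0 to 2 and 11 to 15) this gives the same q as Algorithm 2.3. The name srt_mod is hypothetical.

```python
def srt_mod(x, m, n, k):
    """Carry-save digit-recurrence reduction with a 4-bit quotient estimate.

    Assumes 2**(k-1) <= m < 2**k, n >= k and -2**n <= x < 2**n.
    All words are kept modulo 2**(n+3) (an assumption of this sketch).
    """
    width = n + 3
    mask = (1 << width) - 1
    y = m << (n - k)
    minus_y = (-y) & mask             # two's-complement word for -y
    ss, sc = x & mask, 0              # sign-extended x, empty carry word
    rs = rc = 0
    for _ in range(n - k + 1):
        # t: estimate of s from the four most significant bits of ss and sc
        t = ((ss >> (n - 1)) + (sc >> (n - 1))) % 16
        if t <= 7:                    # estimate >= 0 (t <= 2 in Algorithm 2.3)
            w = minus_y               # q = 1
        elif t == 15:                 # estimate = -1
            w = 0                     # q = 0
        else:                         # estimate <= -2
            w = y                     # q = -1
        rs = (ss ^ sc ^ w) & mask     # carry-save addition, Eq. (2.17)
        rc = (((ss & sc) | (ss & w) | (sc & w)) << 1) & mask
        ss, sc = (rs << 1) & mask, (rc << 1) & mask
    r = (rs + rc) & mask              # decode the stored-carry form
    if r >= 1 << (width - 1):         # negative two's-complement value
        r -= 1 << width
    z = r >> (n - k)                  # exact division by 2**(n-k)
    return z + m if z < 0 else z
```

Only the final decoding step uses a carry-propagating addition; inside the loop the selection depends on a 4-bit sum, which is the point of the SRT approach.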

An executable Ada file srt_reducer.adb including Algorithm 2.2 as well as the functions csa_sum, csa_carry, and quotient (Algorithm 2.3) is available at www.arithmetic-circuits.org. The datapath corresponding to Algorithm 2.2, with the quotient function defined by Algorithm 2.3, is shown in Fig. 2.4. The Boolean functions defining q (2's complement) are

    $q_1 = t_3 \oplus t_2 t_1 t_0, \quad q_0 = \overline{t_2 t_1 t_0}$    (2.18)

The minimum period of the clock signal is equal to the delay of a 4-bit adder, plus the computation time of a 4-input Boolean function, plus the delay of a 1-bit adder, that is, $5T_{FA} + T_{Boolean4}$. Assuming that $T_{FA}$ and $T_{Boolean4}$ have the same order of magnitude, the minimum clock period is approximately equal to

    $T_{CLK} \approx 6T_{FA}$    (2.19)

A lower bound of the total computation time, including the final steps (decoding from stored-carry form to normal form and correction if z < 0), is given by

    $T = (n-k+1)\,T_{CLK} + (k+1)\,T_{FA} \approx 6(n-k+1)\,T_{FA} + (k+1)\,T_{FA}$    (2.20)

that is, a linear function of n and k instead of a quadratic function, as in the case of Algorithm 2.1 [Eq. (2.15)]. In Eq. (2.20) the term $(k+1)T_{FA}$ corresponds to the final operations; they are not executable in one clock cycle.

FIGURE 2.4 SRT reducer datapath.

A complete VHDL file srt_reducer.vhd is available at www.arithmetic-circuits.org. The corresponding entity declaration is

    entity srt_reducer is
    port (
      x: in std_logic_vector(N downto 0);
      m: in std_logic_vector(K-1 downto 0);
      clk, reset, start: in std_logic;
      z: out std_logic_vector(K-1 downto 0);
      done: out std_logic
    );
    end srt_reducer;

The VHDL architecture corresponding to the circuit of Fig. 2.4 is the following:

    csa: for i in N-K to N generate
      rs(i) <= ss(i) xor sc(i) xor w(i-N+K);
      rc(i+1) <= (ss(i) and sc(i)) or (ss(i) and w(i-N+K)) or (sc(i) and w(i-N+K));
    end generate;
    rs(N+1) <= ss(N+1) xor sc(N+1) xor w(K+1);
    rs(N-K-1 downto 0) <= ss(N-K-1 downto 0);
    r(0) <= rs(N-K);
    r(K downto 1) <= rs(N downto N-K+1) + rc(N downto N-K+1);
    with r(K) select z <=
      r(K-1 downto 0) when '0',
      r(K-1 downto 0) + m when others;

    registers: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then
          ss <= x(N) & x(N) & x; sc <= (others => '0');
        elsif update = '1' then
          ss(0) <= '0';
          for i in 1 to N+2 loop ss(i) <= rs(i-1); end loop;
          sc(N-K) <= '0'; sc(N-K+1) <= '0';
          for i in N-K+2 to N+2 loop sc(i) <= rc(i-1); end loop;
        end if;
      end if;
    end process registers;

    t <= ss(N+2 downto N-1) + sc(N+2 downto N-1);
    quotient(1) <= t(3) xor (t(2) and t(1) and t(0));
    quotient(0) <= not(t(2) and t(1) and t(0));
    not_gates: for i in 0 to K-1 generate
      not_m(i) <= not(m(i));
    end generate;
    minus_m <= ("11" & not_m) + 1;
    with quotient select w <=
      minus_m when "01",
      ("00" & m) when "11",
      (others => '0') when others;

The complete model additionally includes an (n − k)-state counter and a control unit generating the load, update, and done signals.

Comment 2.1 It is important to note that in the preceding VHDL description the done signal is raised at the end of the main loop of Algorithm 2.2, that is, when the execution of the final operations (those which are not executable in one clock cycle) begins. For the done signal to be raised when the final result z is actually available, some kind of synchronization of the final operations should be introduced, and the control unit modified accordingly.

2.2 Reduction mod $2^k - a$

Assume that

    $2^{k-1} \le m < 2^k$    (2.21)

Then

    $m = 2^k - a$    (2.22)

where

    $1 \le a \le 2^{k-1}$    (2.23)

Given a natural $x$ belonging to the range

    $0 \le x < 2^n$    (2.24)

compute the following quotients $q_i$ and remainders $r_i$:

    $x = q_0 2^k + r_0$
    $q_0 a = q_1 2^k + r_1$
    $q_1 a = q_2 2^k + r_2$
    ...
    $q_{s-2} a = q_{s-1} 2^k + r_{s-1}$    (2.25)

Multiply the second Eq. (2.25) by $(2^k/a)$, the third one by $(2^k/a)^2$, . . . , the last one by $(2^k/a)^{s-1}$, and sum up the s equations; the result is

    $x = r_0 + r_1 (2^k/a) + r_2 (2^k/a)^2 + \cdots + r_{s-1} (2^k/a)^{s-1} + q_{s-1} 2^k (2^k/a)^{s-1}$    (2.26)

where [Eq. (2.23)] $2^k/a \ge 2$. Observe that if $s > n - k$, then $q_{s-1} = 0$. In the contrary case, $x \ge 2^k (2^k/a)^{s-1} \ge 2^{k+s-1} \ge 2^{k+(n-k+1)-1} = 2^n$. Thus, after a finite number s of steps, with $s \le n - k + 1$, a kind of base $(2^k/a)$ expression is obtained:

    $x = r_0 + r_1 (2^k/a) + r_2 (2^k/a)^2 + \cdots + r_{s-1} (2^k/a)^{s-1}$ with $r_{s-1} > 0$    (2.27)

in which every remainder $r_i$ is smaller than $2^k$. By summing up the s equations (2.25), assuming that $q_{s-1} = 0$, the following relation is obtained:

    $x = q_0 (2^k - a) + q_1 (2^k - a) + \cdots + q_{s-2} (2^k - a) + r_0 + r_1 + \cdots + r_{s-1}$    (2.28)

Thus,

    $x \equiv r \bmod m = 2^k - a$ with $r = r_0 + r_1 + \cdots + r_{s-1}$    (2.29)

As every remainder $r_i$ is smaller than $2^k$, the maximum value of r is

    $r < s 2^k \le (n - k + 1) 2^k$    (2.30)

so that r can be expressed as a t-bit natural where $t = k + \lceil \log_2 (n - k + 1) \rceil$. The same method could then be used for generating a t'-bit number $r' \equiv x \bmod m$ where $t' = k + \lceil \log_2 (t - k + 1) \rceil$, and so on. The following algorithm ([MOV96]) generates a t-bit natural z such that

    $z \equiv x \bmod m, \quad z < 2^t, \quad t \le k + \lceil \log_2 (n - k + 1) \rceil$

(2.31)

Algorithm 2.4—Partial mod $2^k - a$ reduction

    a := 2**k - m;
    r := x mod 2**k;
    q := x/2**k;
    loop
       r := r + (q*a mod 2**k);
       q := q*a/2**k;
       if q = 0 then exit; end if;
    end loop;
    z := r;

Algorithm 2.4 generates a natural z smaller than x, except if $r_1 = r_2 = \cdots = r_{s-1} = 0$ in which case $x = r = r_0 < 2^k$. An iterative execution of Algorithm 2.4 eventually generates a natural $z \equiv x \bmod m$ and smaller than $2^k$. A final correction step generates z = r − m if r ≥ m.

Example 2.1 Assume that n = 64, k = 8, and m = 239. Thus $a = 2^8 - 239 = 17$. In what follows, all numbers are represented in hexadecimal. In particular m = (EF), $2^8$ = (100), and 17 = (11). Compute (41C1D298F81A7296) mod (EF):

    (41C1D298F81A7296) = (41C1D298F81A72)(100) + (96)
    (41C1D298F81A72)(11) = (45DDEFC2879C1)(100) + (92)
    (45DDEFC2879C1)(11) = (4A3BCEBEB015)(100) + (D1)
    (4A3BCEBEB015)(11) = (4EDF8BAA9B1)(100) + (65)
    (4EDF8BAA9B1)(11) = (53CD846544)(100) + (C1)
    (53CD846544)(11) = (590A5CAB9)(100) + (84)
    (590A5CAB9)(11) = (5E9B0276)(100) + (49)
    (5E9B0276)(11) = (6484B29)(100) + (D6)
    (6484B29)(11) = (6ACCFD)(100) + (B9)
    (6ACCFD)(11) = (7179C)(100) + (CD)
    (7179C)(11) = (7891)(100) + (5C)
    (7891)(11) = (801)(100) + (A1)
    (801)(11) = (88)(100) + (11)
    (88)(11) = (9)(100) + (08)
    (9)(11) = (0)(100) + (99)

Thus, s = 15 and

    r = (96) + (92) + (D1) + (65) + (C1) + (84) + (49) + (D6) + (B9) + (CD) + (5C) + (A1) + (11) + (08) + (99) = (7F7)

The same method is used for reducing (7F7):

    (7F7) = (7)(100) + (F7)
    (7)(11) = (0)(100) + (77)

Thus, s = 2 and

    r = (F7) + (77) = (16E)

Now reduce (16E):

    (16E) = (1)(100) + (6E)
    (1)(11) = (0)(100) + (11)

so that s = 2 and

    r = (6E) + (11) = (7F)

As $r < 2^k$ it remains to check whether r < m, or not. Actually (7F) < (EF) so that the final result is

    (41C1D298F81A7296) mod (EF) = (7F)

Actually x = (41C1D298F81A7296) = (466F352019D159)(EF) + (7F).

Algorithm 2.5—mod $2^k - a$ reduction

    r := x mod 2**k; q := x/2**k; a := 2**k - m;
    loop
       loop -- main loop
          r := r + (q*a mod 2**k);
          q := q*a/2**k;
          if q = 0 then exit; end if;
       end loop;
       q := r/2**k; r := r mod 2**k;
       if q = 0 then exit; end if;
    end loop;
    --final correction
    if r >= m then z := r-m; else z := r; end if;
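Algorithms 2.4 and 2.5 translate almost line by line into Python, whose unbounded integers make it easy to replay Example 2.1. The sketch below is illustrative only; the function names are hypothetical, and the outer loop is written as a repeated call of the partial reduction, which performs the same arithmetic steps as the nested loops of Algorithm 2.5.

```python
def partial_reduce(x, m, k):
    """One pass of Algorithm 2.4: returns r = r0 + r1 + ..., congruent to x mod m.

    Assumes x >= 0 and 2**(k-1) <= m < 2**k.
    """
    a = (1 << k) - m                  # m = 2**k - a, Eq. (2.22)
    mask = (1 << k) - 1
    r, q = x & mask, x >> k           # x = q0 * 2**k + r0
    while q:
        qa = q * a
        r += qa & mask                # next remainder r_i
        q = qa >> k                   # next quotient q_i
    return r


def reduce_mod_2k_minus_a(x, m, k):
    """Algorithm 2.5: iterate the partial reduction, then correct."""
    r = x
    while r >> k:                     # r >= 2**k: another pass is needed
        r = partial_reduce(r, m, k)
    return r - m if r >= m else r
```

With the data of Example 2.1, reduce_mod_2k_minus_a(0x41C1D298F81A7296, 0xEF, 8) returns 0x7F. Each pass strictly decreases the value whenever the quotient part is nonzero, so the outer loop terminates.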

An executable Ada file two_power_k_minus_a_reducer.adb is available at www.arithmetic-circuits.org. The datapath corresponding to Algorithm 2.5 is shown in Fig. 2.5. In order to get an estimation of the minimum clock period, assume that a carry-save multiplier is used. An example is shown in Fig. 2.6 with n = 7, k = 3, and t = 5. Every product-cell computes two 4-input Boolean functions, namely the 2-bit representation of xy + z + w where x, y, z, and w are the four cell inputs. Let $T_{MULT}$ be the corresponding delay. Then the computation time of product(n − 1..k) is equal to $kT_{MULT} + (n-k)T_{FA}$, and the computation time of sum(t − 1..0) to $kT_{MULT} + (t-k+1)T_{FA}$. Thus, the minimum period of the clock signal is $kT_{MULT} + (n-k)T_{FA}$. An upper bound is $nT_{MULT}$. Assuming that the inner loop of Algorithm 2.5 is executed only once (best case), the total computation time is about

    $T \approx s\,n\,T_{MULT} \le (n-k+1)\,n\,T_{MULT}$    (2.32)

FIGURE 2.5 Datapath of a mod $2^k - a$ reducer.

FIGURE 2.6 Carry-save multiplier and carry-ripple adder.

Obviously faster multipliers and/or adders could be used. A VHDL file tpkma_reducer.vhd is available at www.arithmetic-circuits.org. The corresponding entity declaration is

    entity tpkma_reducer is
    port (
      x: in std_logic_vector(N-1 downto 0);
      m: in std_logic_vector(K-1 downto 0);
      clk, reset, start: in std_logic;
      z: out std_logic_vector(K-1 downto 0);
      done: out std_logic
    );
    end tpkma_reducer;

The VHDL architecture corresponding to the circuit of Fig. 2.5 is the following:

    a <= not(m) + '1';
    product <= q*a;
    sum <= (zero & product(K-1 downto 0)) + r;

    registers: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then
          q <= x(N-1 downto K); r <= zero & x(K-1 downto 0);
        elsif reload = '1' then
          q <= (long_zero & r(T-1 downto k)); r <= zero & r(K-1 downto 0);
        elsif update = '1' then
          q <= product(N-1 downto k); r <= sum;
        end if;
      end if;
    end process registers;

    q_equal_zero <= '1' when q = very_long_zero else '0';
    r_minus_m <= ('0' & r(K-1 downto 0)) + ('1' & a);
    with r_minus_m(k) select z <=
      r_minus_m(K-1 downto 0) when '0',
      r(K-1 downto 0) when others;

The complete model additionally includes a control unit generating the load, reload, update, and done signals.

2.3 Precomputation of $2^{ik} \bmod m$

First define a mixed-radix numeration system [DBS06] based on a set of positive bit-vector lengths $l_0, l_1, l_2, \ldots, l_{s-1}$ such that

    $l_0 + l_1 + l_2 + \cdots + l_{s-1} = n$

The radices are $2^{l_0}, 2^{l_1}, \ldots, 2^{l_{s-1}}$, so that the corresponding weights are

    $W_0 = 1$, $W_1 = 2^{l_0}$, $W_2 = 2^{l_1} \cdot 2^{l_0} = 2^{l_1 + l_0}$, . . . , $W_{s-1} = 2^{l_{s-2}} \cdots 2^{l_1} \cdot 2^{l_0} = 2^{l_{s-2} + \cdots + l_1 + l_0}$

Then any natural x belonging to the interval

    $0 \le x < W_s = 2^{l_{s-1}} \cdots 2^{l_1} \cdot 2^{l_0} = 2^{l_{s-1} + \cdots + l_1 + l_0} = 2^n$

can be represented under the form

    $x = X_{s-1} W_{s-1} + X_{s-2} W_{s-2} + \cdots + X_1 W_1 + X_0 W_0$ where $0 \le X_i < 2^{l_i}$, $\forall i = 0, 1, \ldots, s-1$    (2.33)

The following values are previously computed:

    $b_0 = W_0 = 1$
    $b_1 = W_1 \bmod m$
    $b_2 = W_2 \bmod m$
    ...
    $b_{s-1} = W_{s-1} \bmod m$

Then

    $x \equiv X_{s-1} b_{s-1} + X_{s-2} b_{s-2} + \cdots + X_1 b_1 + X_0 b_0 \bmod m$    (2.34)

and the problem is reduced to the computation of r mod m where

    $r = X_{s-1} b_{s-1} + X_{s-2} b_{s-2} + \cdots + X_1 b_1 + X_0 b_0$    (2.35)

Assume that $2^{k-1} \le m < 2^k$ and that $l_0 \ge k$. Then $W_{s-1} > \cdots > W_1 = 2^{l_0} \ge 2^k > m$ so that $b_i = W_i \bmod m < W_i$, $\forall i = 1, 2, \ldots, s-1$, and r is smaller than x except if $X_1 = X_2 = \cdots = X_{s-1} = 0$ in which case $r = X_0 < 2^{l_0}$. Thus, an iterative use of the preceding method eventually generates a natural $z \equiv x \bmod m$ and smaller than $2^{l_0}$. A final correction step generates z = x mod m. In particular, if $l_0 = k$, so that $z < 2^k$, then x mod m is either z or z − m.

As a particular case assume that n = sk and define $l_0 = l_1 = l_2 = \cdots = l_{s-1} = k$. Then

    $W_0 = 1$, $W_1 = 2^k$, $W_2 = 2^{2k}$, . . . , $W_{s-1} = 2^{(s-1)k}$ and $W_s = 2^{sk} = 2^n$

The corresponding representation [Eq. (2.33)] of x is

    $x = X_{s-1} \cdot 2^{(s-1)k} + X_{s-2} \cdot 2^{(s-2)k} + \cdots + X_1 \cdot 2^k + X_0$ where $0 \le X_i < 2^k$, $\forall i = 0, 1, \ldots, s-1$

and the previously computed values are

    $b_0 = 1$
    $b_1 = 2^k \bmod m$
    $b_2 = 2^{2k} \bmod m$
    ...
    $b_{s-1} = 2^{(s-1)k} \bmod m$
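The mixed-radix representation of Eq. (2.33) and the single reduction step of Eq. (2.35) can be illustrated with a small Python sketch. The function names are hypothetical, and the weights are reduced mod m on the fly here, whereas the method described above assumes that the constants $b_i$ have been computed beforehand.

```python
def mixed_radix_digits(x, lengths):
    """Digits X_0, X_1, ... of x for the radices 2**l_0, 2**l_1, ... (Eq. 2.33)."""
    digits = []
    for length in lengths:
        digits.append(x & ((1 << length) - 1))
        x >>= length
    return digits


def weights(lengths):
    """Weights W_0 = 1, W_1 = 2**l_0, W_2 = 2**(l_1 + l_0), ..."""
    result, shift = [], 0
    for length in lengths:
        result.append(1 << shift)
        shift += length
    return result


def one_pass(x, m, lengths):
    """r = sum of X_i * b_i with b_i = W_i mod m (Eq. 2.35)."""
    digits = mixed_radix_digits(x, lengths)
    return sum(X * (W % m) for X, W in zip(digits, weights(lengths)))
```

For instance, with lengths [8, 16, 8, 32] (so n = 64) and m = 239, one_pass returns a value congruent to x mod m that is smaller than x whenever some digit other than $X_0$ is nonzero.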

Example 2.2 Assume again that n = 64, k = 8, and m = 239, so that s = 8. The values of $b_0$ to $b_7$ have been previously calculated. They are expressed in hexadecimal:

    b0 = (01), b1 = (11), b2 = (32), b3 = (85), b4 = (6E), b5 = (C5), b6 = (03), b7 = (33)

Compute x mod (EF) where x = (41C1D298F81A7296):

    (41)(33) + (C1)(03) + (D2)(C5) + (98)(6E) + (F8)(85) + (1A)(32) + (72)(11) + (96)(01) = (18034)
    (1)(32) + (80)(11) + (34)(01) = (8E6)
    (8)(11) + (E6)(01) = (16E)
    (1)(11) + (6E)(01) = (7F)

In the following Algorithm 2.6 vector_r is the base-$2^k$ representation of r and b_table(m,i) returns $2^{ik} \bmod m$.

Algorithm 2.6—Precomputation of $2^{ik} \bmod m$

    r := x;
    loop
       vector_r(0) := r mod 2**k; q := r/2**k;
       for i in 1 .. s-1 loop
          vector_r(i) := q mod 2**k; q := q/2**k;
       end loop;
       r := vector_r(0);
       for i in 1 .. s-1 loop
          r := r + vector_r(i)*b_table(m)(i);
       end loop;
       if r < 2**k then exit; end if;
    end loop;
    if r >= m then z := r-m; else z := r; end if;
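Algorithm 2.6 and Example 2.2 can be replayed with the following Python sketch. It is only an illustration: the function names are hypothetical, and the table of constants is built with Python's three-argument pow rather than read from a stored look-up table.

```python
def make_b_table(m, k, s):
    """Precomputed constants b_i = 2**(i*k) mod m for i = 0 .. s-1."""
    return [pow(2, i * k, m) for i in range(s)]


def precomputation_pass(r, k, b_table):
    """One iteration of the main loop of Algorithm 2.6."""
    mask = (1 << k) - 1
    # digit i of r in base 2**k, multiplied by b_i
    return sum(((r >> (i * k)) & mask) * b for i, b in enumerate(b_table))


def precomputation_mod(x, m, k, s):
    """x mod m for 0 <= x < 2**(s*k) and 2**(k-1) <= m < 2**k."""
    b_table = make_b_table(m, k, s)
    r = x
    while True:
        r = precomputation_pass(r, k, b_table)
        if r < (1 << k):
            break
    return r - m if r >= m else r
```

With m = 0xEF, k = 8, and s = 8, make_b_table reproduces the constants of Example 2.2, the first pass on x = 0x41C1D298F81A7296 gives 0x18034, and precomputation_mod returns 0x7F.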

An executable Ada file precomputation_reducer.adb is available at www.arithmetic-circuits.org. The datapath corresponding to Algorithm 2.6 is shown in Fig. 2.7. If only the arithmetic circuit delays are taken into account, the minimum period of the clock signal is the delay of a k-bit by k-bit multiplier followed by a d-bit adder. Assume that a carry-save multiplier and carry-ripple adders are used (Fig. 2.8, with k = 3 and d = 8), then the delay is equal to $kT_{MULT} + (1 + d - k)T_{FA}$. An upper bound is $(d + 1)T_{MULT}$, and an upper bound of d can be calculated as follows:

    $X_{s-1} b_{s-1} + X_{s-2} b_{s-2} + \cdots + X_1 b_1 + X_0 b_0 < 2^k (b_{s-1} + b_{s-2} + \cdots + b_1 + b_0) < s\,2^{2k} = (n/k)\,2^{2k}$

so that an upper bound of d is $2k + \log_2 n - \log_2 k$. Assuming that the inner loop of Algorithm 2.6 is executed only once (best case), the total computation time is about

    $T \approx s(2k + \log_2 n - \log_2 k + 1)\,T_{MULT} \approx 2n\,T_{MULT}$    (2.36)

FIGURE 2.7 Datapath of a reducer with precomputation of $2^{ik} \bmod m$.

FIGURE 2.8 Carry-save multiplier and carry-ripple adders (k = 3 and d = 8).

A VHDL file precomputation_reducer.vhd is available at www.arithmetic-circuits.org. The corresponding entity declaration is

    entity precomputation_reducer is
    port (
      x: in std_logic_vector(N-1 downto 0);
      m: in std_logic_vector(K-1 downto 0);
      clk, reset, start: in std_logic;
      z: out std_logic_vector(K-1 downto 0);
      done: out std_logic
    );
    end precomputation_reducer;

The VHDL architecture corresponding to the circuit of Fig. 2.7 is the following:

    digit_selection: for i in 0 to s-1 generate
      vector_r(i) <= r((i+1)*K-1 downto i*K);
    end generate;
    vector_i <= vector_r(conv_integer(sel));
    b_i <= b_table(conv_integer(sel));
    product <= vector_i * b_i;
    next_acc <= acc + product;

    register_acc: process(reset, clk)
    begin
      if reset = '1' then acc <= (others => '0');
      elsif clk'event and clk = '1' then
        if load = '1' or reload = '1' then acc <= (others => '0');
        elsif ce_acc = '1' then acc <= next_acc;
        end if;
      end if;
    end process register_acc;

    register_r: process(reset, clk)
    begin
      if reset = '1' then r <= (others => '0');
      elsif clk'event and clk = '1' then
        if load = '1' then r <= x;
        elsif reload = '1' then r <= ZERO & acc;
        end if;
      end if;
    end process register_r;

    end_of_computation <= '1' when r(D-1 downto K) = SHORT_ZERO else '0';
    r_minus_m <= ('0' & r(K-1 downto 0)) - ('0' & m);
    with r_minus_m(K) select z <=
      r(K-1 downto 0) when '1',
      r_minus_m(K-1 downto 0) when others;

The complete model additionally includes a control unit generating the load, reload, sel, and done signals.

2.4 Barrett Reduction Algorithm

A generalized version of the Barrett algorithm ([MOV96], [HMV04]) is presented.

2.4.1 n-Digit to (k + t)-Digit Reduction

Assume that m belongs to the range $B^{k-1} < m < B^k$ where B is the base of the chosen numeration system (usually 2 or a power of 2). Observe that if m is a power of B the computation is trivial: x mod m is the number represented by the k least significant digits of x. The value of z = x mod m is the remainder of the integer division of x by m, that is,

    $x = qm + z, \quad z < m$

The Barrett algorithm starts with the computation of an approximation q' of $q = \lfloor x/m \rfloor$ such that

    $q - a \le q' \le q$    (2.37)

(a method for computing an approximation q' of q is given in Sec. 2.4.2). First compute

    $r' = x - q'm$    (2.38)

As z = x − qm, then [Eq. (2.37)]

    $z \le r' \le z + am$    (2.39)

Let t be the minimum integer such that

    $B^t \ge a + 1$    (2.40)

Then,

    $r' \le z + am < m + am = (a + 1)m < B^t B^k = B^{k+t}$    (2.41)

Thus,

    $0 \le z \le r' < B^{k+t}$    (2.42)

and [Eq. (2.38)]

    $r' \bmod m = x \bmod m = z$    (2.43)

The following formal algorithm, including a function approximation which generates an approximation q' of $\lfloor x/m \rfloor$ [see Eq. (2.37)], computes a (k + t)-digit number r equivalent to x mod m:

Algorithm 2.7—n-digit to (k + t)-digit reduction

    q := approximation(x, m);
    r := ((x mod B**(k+t)) - (q*m mod B**(k+t))) mod B**(k+t);

If a = 2 and B ≥ 3, then condition Eq. (2.40) amounts to $B^t \ge 3$ and is satisfied if t = 1. Thus x − q'm can be computed mod $B^{k+1}$. This case corresponds to the classical Barrett algorithm.

2.4.2 An Approximation of q

Let x and m be expressed in base B, that is,

    $x = x_{n-1} B^{n-1} + x_{n-2} B^{n-2} + \cdots + x_0 B^0$
    $m = m_{k-1} B^{k-1} + m_{k-2} B^{k-2} + \cdots + m_0 B^0$ where $m_{k-1} > 0$

The approximation q' of $q = \lfloor x/m \rfloor$ is

    $q' = \lfloor \lfloor x/B^{k-1} \rfloor \lfloor B^n/m \rfloor / B^{n-k+1} \rfloor$    (2.44)

Observe that

    $q' \le \lfloor (x/B^{k-1})(B^n/m)/B^{n-k+1} \rfloor = \lfloor x/m \rfloor = q$    (2.45)

and

    $q \le (x/B^{k-1})(B^n/m)/B^{n-k+1} < (\lfloor x/B^{k-1} \rfloor + 1)(\lfloor B^n/m \rfloor + 1)/B^{n-k+1}$
    $\quad = \lfloor x/B^{k-1} \rfloor \lfloor B^n/m \rfloor / B^{n-k+1} + (\lfloor x/B^{k-1} \rfloor + \lfloor B^n/m \rfloor + 1)/B^{n-k+1}$    (2.46)

As $x < B^n$ and $m \ge B^{k-1}$, then

    $x/B^{k-1} < B^{n-k+1}$ and $B^n/m \le B^{n-k+1}$    (2.47)

so that

    $\lfloor x/B^{k-1} \rfloor + \lfloor B^n/m \rfloor + 1 \le (B^{n-k+1} - 1) + B^{n-k+1} + 1 = 2B^{n-k+1}$    (2.48)

Thus, from Eqs. (2.46) and (2.48), $q < \lfloor x/B^{k-1} \rfloor \lfloor B^n/m \rfloor / B^{n-k+1} + 2$ and, as q is an integer, q ≤ q' + 2, that is,

    $a = 2$    (2.49)
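The bounds q − 2 ≤ q' ≤ q can be checked numerically with a few lines of Python. This is a sketch for experimentation only; approximate_quotient is a hypothetical name, and the constant $\lfloor B^n/m \rfloor$ is recomputed at every call instead of being stored.

```python
def approximate_quotient(x, m, B, n, k):
    """q' of Eq. (2.44) for 0 <= x < B**n and B**(k-1) <= m < B**k."""
    c = B**n // m                              # floor(B**n / m)
    return ((x // B**(k - 1)) * c) // B**(n - k + 1)
```

Comparing approximate_quotient(x, m, B, n, k) with x // m over a range of operands shows a difference of 0, 1, or 2, in agreement with Eq. (2.49).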

According to Eqs. (2.40) and (2.49) the value of t must be chosen in such a way that $B^t \ge 3$. Thus if B = 2, then t = 2 (the computation is performed mod $B^{k+2}$), if B > 2, then t = 1 (the computation is performed mod $B^{k+1}$). To summarize, the following algorithm computes z = x mod m. The constant

    $c = \lfloor B^n/m \rfloor$    (2.50)

must have been previously calculated.

Algorithm 2.8—Barrett reduction (complete Ada source code available)

    y := x/B**(k-1);
    w := y*c;
    q := (w/B**(n-k+1)) mod B**(k+t);
    r := ((x mod B**(k+t)) - ((q*m) mod B**(k+t))) mod B**(k+t);
    while r >= m loop r := r-m; end loop;
    z := r;

The division by $B^{k-1}$ or $B^{n-k+1}$ and the mod $B^{k+t}$ reduction are trivial operations. The only nontrivial operations are multiplications by c and m, and subtractions. An executable Ada file Barrett_reducer.adb is available at www.arithmetic-circuits.org.

Example 2.3 As before assume that x is a 64-bit number and m = 239. All numbers will be represented in hexadecimal, so that B = 16, n = 16, k = 2, t = 1, $c = \lfloor 16^{16}/239 \rfloor$ = (112358E75D30336), m = (EF). Compute (41C1D298F81A7296) mod (EF):

    y = x/16 = (41C1D298F81A729)
    w = yc = (41C1D298F81A729)(112358E75D30336) = (. . . 15958562734E19BDA6)
    q = $w/16^{15} \bmod 16^3$ = (. . .159) mod $16^3$ = (159)
    product = qm = (159)(EF) = (14217)
    r = (x mod $16^3$) − (product mod $16^3$) = (296) − (217) = (7F)
    z = (7F)

If B = 2, and thus t = 2, then $2^{k-1} < m < 2^k$, $2^{n-k} < 2^n/m < 2^{n-k+1}$, $2^{n-k} \le c < 2^{n-k+1}$, that is,

    c = c(n − k..0)
    $x/2^{k-1}$ = x(n − 1..k − 1) = y(n − k..0)
    $(w/2^{n-k+1}) \bmod 2^{k+2}$ = w[(n − k + 1) + (k + 1)..n − k + 1] = w(n + 2..n − k + 1) = q(k + 1..0)

and an equivalent version of Algorithm 2.8 is the following.

Algorithm 2.9—Barrett reduction with B = 2

    y := x(n-1..k-1);
    prod := y*c;
    q := prod(n+2..n-k+1);
    prod := q*m;
    r := x(k+1..0) - prod(k+1..0);
    while r >= m loop r := r-m; end loop;
    z := r;
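Algorithm 2.8 maps directly onto Python integer arithmetic, which makes it easy to replay Example 2.3 or to try the B = 2 case of Algorithm 2.9. The sketch below is illustrative; the function names are hypothetical, and the constant c is passed in explicitly to stress that it is computed once per modulus.

```python
def barrett_constant(m, B, n):
    """Precomputed constant c = floor(B**n / m), Eq. (2.50)."""
    return B**n // m


def barrett_mod(x, m, B, n, k, t, c):
    """Algorithm 2.8: x mod m for 0 <= x < B**n and B**(k-1) <= m < B**k.

    t must satisfy B**t >= 3 (t = 2 if B = 2, t = 1 otherwise).
    """
    modulus = B**(k + t)                       # all subtractions are done mod B**(k+t)
    y = x // B**(k - 1)
    w = y * c
    q = (w // B**(n - k + 1)) % modulus
    r = ((x % modulus) - ((q * m) % modulus)) % modulus
    while r >= m:                              # at most two subtractions, since r < 3m
        r -= m
    return r
```

With the data of Example 2.3 (B = 16, n = 16, k = 2, t = 1, m = 0xEF) the function returns 0x7F, and the same result is obtained with B = 2, n = 64, k = 8, t = 2. When B is a power of 2, the divisions and mod operations by powers of B reduce to shifts and bit masks.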

The datapath corresponding to Algorithm 2.9 is shown in Fig. 2.9. The size of the multiplier inputs mul1 and mul2 depends on the relative value of n and k: y and c are (n − k + 1)-bit numbers, q is a (k + 2)-bit number, and m a k-bit number. Thus,

    $n_1 = \max\{n - k + 1, k + 2\}$ and $n_2 = \max\{n - k + 1, k\}$

FIGURE 2.9 Datapath of a Barrett reducer.

The minimum period of the clock signal is the delay of an $n_1$-bit by $n_2$-bit multiplier whose only n + 3 outputs (least significant bits) are used. If a carry-save multiplier is used, the corresponding computation time is about $(n + 3)T_{MULT}$. According to Eqs. (2.41) and (2.49), r' < 3m, so that the final result is either r', r' − m, or r' − 2m. Thus, the whole computation is performed in at most 5 cycles, so that

    $T \approx 5(n + 3)\,T_{MULT}$    (2.51)

A VHDL file Barrett_reducer.vhd is available at www.arithmetic-circuits.org. The corresponding entity declaration is

    entity Barrett_reducer is
    port (
      x: in std_logic_vector(N-1 downto 0);
      m: in std_logic_vector(K-1 downto 0);
      clk, reset, start: in std_logic;
      z: out std_logic_vector(K-1 downto 0);
      done: out std_logic
    );
    end Barrett_reducer;

If n > 2k, and thus $n_1 = n_2 = n - k + 1$, the VHDL architecture corresponding to the circuit of Fig. 2.9 is the following:

    with sel_mul select mul1 <= x(N-1 downto K-1) when '0', (ZERO & q) when others;
    with sel_mul select mul2 <= c when '0', ("00" & ZERO & m) when others;
    mul_out <= mul1 * mul2;
    register_prod: process(clk)
    begin
      if clk'event and clk='1' then
        if ce_prod = '1' then prod <= mul_out(n+2 downto 0); end if;
      end if;
    end process;
    q <= prod(n+2 downto n-k+1);
    with sel_sub select sub1 <= x(k+1 downto 0) when '0', r when others;
    with sel_sub select sub2 <= prod(k+1 downto 0) when '0', ("00"&m) when others;
    dif <= ('0' & sub1) - ('0' & sub2);
    sign <= dif(k+2);
    register_r: process(clk)
    begin
      if clk'event and clk='1' then
        if ce_r = '1' then r <= dif(k+1 downto 0); end if;
      end if;
    end process;
    z <= r(k-1 downto 0);
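The sequence of operations carried out by this datapath (two multiplications, one subtraction, and at most two corrective subtractions) can be checked with a small software model. The following Python sketch is only an illustration under stated assumptions: the function name barrett_reduce is hypothetical, m is assumed to be a genuine k-bit number (2^(k−1) ≤ m < 2^k), c is taken as ⌊2^n/m⌋, and the model works on unbounded integers rather than on the exact register widths of Fig. 2.9.

```python
def barrett_reduce(x: int, m: int, n: int, k: int) -> int:
    """Software model of Barrett reduction: x mod m, x an n-bit and m a k-bit number.

    Assumption (not taken from the VHDL file): c = floor(2**n / m).
    """
    assert 0 <= x < (1 << n)
    assert (1 << (k - 1)) <= m < (1 << k)
    c = (1 << n) // m               # constant depending only on n and m
    y = x >> (k - 1)                # the n - k + 1 most significant bits of x
    q = (y * c) >> (n - k + 1)      # first multiplication: quotient estimate
    r = x - q * m                   # second multiplication and subtraction
    # The estimate q is at most 2 below the true quotient, so r < 3m:
    # at most two corrective subtractions are needed.
    for _ in range(2):
        if r >= m:
            r -= m
    return r


if __name__ == "__main__":
    print(barrett_reduce(54321, 239, 16, 8))
```

With n = 16 and k = 8 the model can be compared exhaustively against the built-in remainder operator, which is a convenient way to convince oneself that two corrective subtractions are enough.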

2.5 Comparison

Throughout this chapter five reduction algorithms were considered: nonrestoring division, SRT division, mod $2^k - a$ reduction, precomputation of $2^{ik}$ mod m, and Barrett algorithm. The corresponding approximate computation times are shown in Table 2.1 [Eqs. (2.15), (2.20), (2.32), (2.36), and (2.51)]:

    Reduction algorithm    Computation time
    Nonrestoring           $k(n - k)T_{FA}$
    SRT                    $(5n - 4k + 6)T_{FA}$
    mod $2^k - a$          $(n - k + 1)nT_{MULT}$
    Precomputation         $2nT_{MULT}$
    Barrett                $5(n + 3)T_{MULT}$

TABLE 2.1 Approximate Computation Times

The nonrestoring reducer includes one k-bit adder and one (n + 2)-bit register. Its cost is a linear function of k and n. The SRT-reducer includes two k-bit adders and one (n + k + 2)-bit register. Its cost is also a linear function of k and n. The other reducers include a multiplier: an (n − k)-bit by k-bit multiplier in the case of the mod $2^k - a$, a k-bit by k-bit multiplier in the case of the precomputation, and an $n_1$-bit by $n_2$-bit multiplier, where $n_1 = \max\{n - k + 1, k + 2\}$ and $n_2 = \max\{n - k + 1, k\}$, in the case of the Barrett reducer.

An interesting particular case is when n ≈ 2k (see Table 2.2). It corresponds, among others, to the second step of a straightforward method for computing the product xy mod m where x, y, and m are k-bit numbers: first compute z = xy, a 2k-bit number, and after that reduce z mod m. In the case of the Barrett reducer, with n ≈ 2k, both numbers of bits $n_1$ and $n_2$ are approximately equal to k.

    Reduction algorithm    Computation time    k-by-k multiplier
    Nonrestoring           $k^2 T_{FA}$        No
    SRT                    $6kT_{FA}$          No
    mod $2^k - a$          $2k^2 T_{MULT}$     Yes
    Precomputation         $4kT_{MULT}$        Yes
    Barrett                $10kT_{MULT}$       Yes

TABLE 2.2 Approximate Computation Times when n ≈ 2k


2.6 Specific Circuits

Throughout this chapter generic reducers based on sequential implementations of five different algorithms have been proposed. For particular values of n and k other structures could be considered (completely combinational, partly sequential, and so on), or even completely specific circuits.

2.6.1 mod 239 Reducer

As a first example, a 16-bit to 8-bit mod 239 reducer can be synthesized as follows: a 16-bit number $x = x_{15} \cdot 2^{15} + x_{14} \cdot 2^{14} + \ldots + x_0$ can be decomposed under the form

$x = (x_{15} \cdot 2^3 + x_{14} \cdot 2^2 + x_{13} \cdot 2 + x_{12})2^{12} + (x_{11} \cdot 2^3 + x_{10} \cdot 2^2 + x_9 \cdot 2 + x_8)2^8 + x_7 \cdot 2^7 + \ldots + x_0$

As $2^{12} \bmod 239 = 33 = 2^5 + 1$, and $2^8 \bmod 239 = 17 = 2^4 + 1$,

$x \equiv x' = (x_{15} \cdot 2^3 + x_{14} \cdot 2^2 + x_{13} \cdot 2 + x_{12})(2^5 + 1) + (x_{11} \cdot 2^3 + x_{10} \cdot 2^2 + x_9 \cdot 2 + x_8)(2^4 + 1) + x_7 \cdot 2^7 + \ldots + x_0$
$= x_{15} \cdot 2^8 + x_{14} \cdot 2^7 + x_{13} \cdot 2^6 + x_{12} \cdot 2^5 + x_{15} \cdot 2^3 + x_{14} \cdot 2^2 + x_{13} \cdot 2 + x_{12}$
$+ x_{11} \cdot 2^7 + x_{10} \cdot 2^6 + x_9 \cdot 2^5 + x_8 \cdot 2^4 + x_{11} \cdot 2^3 + x_{10} \cdot 2^2 + x_9 \cdot 2 + x_8 + x_7 \cdot 2^7 + \ldots + x_0$    (2.52)

An upper bound of x' is 15 · 33 + 15 · 17 + 255 = 1005, so that x' is a 10-bit number, that is,

$x' = x'_9 \cdot 2^9 + x'_8 \cdot 2^8 + \ldots + x'_0$

A similar decomposition gives

$x' = (x'_9 \cdot 2 + x'_8)2^8 + x'_7 \cdot 2^7 + \ldots + x'_0$
$\equiv x'' = (x'_9 \cdot 2 + x'_8)(2^4 + 1) + x'_7 \cdot 2^7 + \ldots + x'_0$
$= x'_9 \cdot 2^5 + x'_8 \cdot 2^4 + x'_9 \cdot 2 + x'_8 + x'_7 \cdot 2^7 + \ldots + x'_0$    (2.53)

An upper bound of x'' is 3 · 17 + 255 = 306, so that x mod 239 is either x'' or x'' − 239. The corresponding specific circuit, implementing Eqs. (2.52) and (2.53), as well as the final conditional subtraction, is shown in Fig. 2.10. A VHDL file mod_239_reducer.vhd is available at www.arithmetic-circuits.org. The corresponding entity declaration is

    entity mod_239_reducer is
    port (
      x: in std_logic_vector(15 downto 0);
      z: out std_logic_vector(7 downto 0)
    );
    end mod_239_reducer;
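The two folding steps of Eqs. (2.52) and (2.53) are easy to verify in software. The following Python sketch is an illustrative model, not a transcription of mod_239_reducer.vhd; the function name mod_239_reduce and the intermediate variable names are chosen here for readability.

```python
def mod_239_reduce(x: int) -> int:
    """Model of the 16-bit to 8-bit mod 239 reduction of Eqs. (2.52) and (2.53)."""
    assert 0 <= x < (1 << 16)
    x2 = (x >> 12) & 0xF         # x(15..12), weight 2**12 = 33 (mod 239)
    x1 = (x >> 8) & 0xF          # x(11..8),  weight 2**8  = 17 (mod 239)
    x0 = x & 0xFF                # x(7..0)
    xx = x2 * 33 + x1 * 17 + x0  # Eq. (2.52): at most 1005, a 10-bit number
    xxx = (xx >> 8) * 17 + (xx & 0xFF)  # Eq. (2.53): at most 306
    dif = xxx - 239              # final conditional subtraction
    return dif if dif >= 0 else xxx


if __name__ == "__main__":
    print(mod_239_reduce(0xFFFF), 0xFFFF % 239)
```

Since the input space has only 65,536 values, the model can be compared exhaustively with x % 239.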

FIGURE 2.10 Reduction mod 239.

The VHDL architecture corresponding to the circuit of Fig. 2.10 is the following:

    x1_by_17 <= '0'&x(11 downto 8)&x(11 downto 8);
    x0 <= '0'&x(7 downto 0);
    sum <= x1_by_17 + x0;
    x2_by_33 <= '0'&x(15 downto 12)&'0'&x(15 downto 12);
    long_sum <= '0'&sum;
    xx <= x2_by_33 + long_sum;
    xx1_by_17 <= "000"&xx(9 downto 8) &"00"&xx(9 downto 8);
    xx0 <= '0'&xx(7 downto 0);
    xxx <= xx1_by_17 + xx0;
    minus_239 <= conv_std_logic_vector(273, 10);
    long_xxx <= '0'&xxx;
    dif <= long_xxx + minus_239;
    with dif(9) select z <= dif(7 downto 0) when '1', xxx(7 downto 0) when others;

2.6.2 mod $(2^{192} - 2^{64} - 1)$ Reducer

As a second example, a 384-bit to 192-bit mod $(2^{192} - 2^{64} - 1)$ reducer is synthesized. For that, decompose $x = x_{383} \cdot 2^{383} + x_{382} \cdot 2^{382} + \ldots + x_1 \cdot 2 + x_0$ under the form

$x = (x_{383} \cdot 2^{63} + x_{382} \cdot 2^{62} + \ldots + x_{321} \cdot 2 + x_{320})2^{320}$
$+ (x_{319} \cdot 2^{63} + x_{318} \cdot 2^{62} + \ldots + x_{257} \cdot 2 + x_{256})2^{256}$
$+ (x_{255} \cdot 2^{63} + x_{254} \cdot 2^{62} + \ldots + x_{193} \cdot 2 + x_{192})2^{192}$
$+ (x_{191} \cdot 2^{191} + x_{190} \cdot 2^{190} + \ldots + x_1 \cdot 2 + x_0)$    (2.54)

Then observe that

$2^{192} \equiv 2^{64} + 1$
$2^{256} = 2^{192} \cdot 2^{64} \equiv (2^{64} + 1)2^{64} = 2^{128} + 2^{64}$
$2^{320} = 2^{192} \cdot 2^{128} \equiv 2^{128}(2^{64} + 1) = 2^{192} + 2^{128} \equiv 2^{128} + 2^{64} + 1$    (2.55)

According to Eqs. (2.54) and (2.55),

$x \equiv (x_{383} \cdot 2^{63} + x_{382} \cdot 2^{62} + \ldots + x_{321} \cdot 2 + x_{320})(2^{128} + 2^{64} + 1)$
$+ (x_{319} \cdot 2^{63} + x_{318} \cdot 2^{62} + \ldots + x_{257} \cdot 2 + x_{256})(2^{128} + 2^{64})$
$+ (x_{255} \cdot 2^{63} + x_{254} \cdot 2^{62} + \ldots + x_{193} \cdot 2 + x_{192})(2^{64} + 1)$
$+ (x_{191} \cdot 2^{191} + x_{190} \cdot 2^{190} + \ldots + x_1 \cdot 2 + x_0)$    (2.56)

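Then define

$x' = x_{383} \cdot 2^{191} + \ldots + x_{320} \cdot 2^{128} + x_{383} \cdot 2^{127} + \ldots + x_{320} \cdot 2^{64} + x_{383} \cdot 2^{63} + \ldots + x_{320}$
$x'' = x_{319} \cdot 2^{191} + \ldots + x_{256} \cdot 2^{128} + x_{319} \cdot 2^{127} + \ldots + x_{256} \cdot 2^{64}$
$x''' = x_{255} \cdot 2^{127} + \ldots + x_{192} \cdot 2^{64} + x_{255} \cdot 2^{63} + \ldots + x_{192}$
$x'''' = x_{191} \cdot 2^{191} + x_{190} \cdot 2^{190} + \ldots + x_1 \cdot 2 + x_0$    (2.57)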

Thus, x ≡ s = x’ + x’’ + x’’’ + x’’’’

(2.58)

An upper limit for s is given by

$s < 2^{192} + (2^{192} - 2^{64}) + 2^{128} + 2^{192} < 4(2^{192} - 2^{64} - 1)$    (2.59)

so that x mod $(2^{192} - 2^{64} - 1)$ is either s, $s - (2^{192} - 2^{64} - 1)$, $s - 2 \cdot (2^{192} - 2^{64} - 1)$, or $s - 3 \cdot (2^{192} - 2^{64} - 1)$.

The corresponding circuit is shown in Fig. 2.11. A VHDL file mod_p192_reducer.vhd is available at www.arithmetic-circuits.org. The corresponding entity declaration is

    entity mod_p192_reducer is
    port(
      x: in std_logic_vector(383 downto 0);
      z: out std_logic_vector(191 downto 0)
    );
    end mod_p192_reducer;
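The decomposition of Eqs. (2.54) to (2.59) can be modeled directly with Python integers. The sketch below is an illustration only: the names mod_p192_reduce, P192, and the 64-bit word variables a0 to a3 are hypothetical, and the three conditional subtractions are evaluated sequentially instead of in parallel.

```python
P192 = (1 << 192) - (1 << 64) - 1
MASK64 = (1 << 64) - 1


def mod_p192_reduce(x: int) -> int:
    """Model of the 384-bit to 192-bit reduction mod 2**192 - 2**64 - 1."""
    assert 0 <= x < (1 << 384)
    a0 = x & ((1 << 192) - 1)        # x(191..0), i.e. x''''
    a1 = (x >> 192) & MASK64         # x(255..192)
    a2 = (x >> 256) & MASK64         # x(319..256)
    a3 = (x >> 320) & MASK64         # x(383..320)
    xx1 = (a3 << 128) | (a3 << 64) | a3   # x'   : a3 * (2**128 + 2**64 + 1)
    xx2 = (a2 << 128) | (a2 << 64)        # x''  : a2 * (2**128 + 2**64)
    xx3 = (a1 << 64) | a1                 # x''' : a1 * (2**64 + 1)
    s = xx1 + xx2 + xx3 + a0              # Eq. (2.58); s < 4 * P192 by Eq. (2.59)
    for multiple in (3, 2, 1):            # pick the largest multiple that fits
        if s >= multiple * P192:
            return s - multiple * P192
    return s


if __name__ == "__main__":
    x = (P192 - 1) ** 2                   # a typical 384-bit product
    print(mod_p192_reduce(x) == x % P192)
```

A typical use is the reduction of the 384-bit product of two 192-bit field elements, as in the usage line above.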

FIGURE 2.11 Reduction mod p = $(2^{192} - 2^{64} - 1)$.

The VHDL architecture corresponding to the circuit of Fig. 2.11 is the following:

    xx1 <= x(383 downto 320) & x(383 downto 320) & x(383 downto 320);
    xx2 <= x(319 downto 256) & x(319 downto 256) & ZEROS;
    xx3 <= ZEROS & x(255 downto 192) & x(255 downto 192);
    xx4 <= x(191 downto 0);
    xx12 <= ('0' & xx1) + xx2;
    xx34 <= ('0' & xx3) + xx4;
    s <= ('0' & xx12 + xx34);
    z3 <= ('0' & s) + minus_3p;
    z2 <= s+minus_2p;
    z1 <= s(192 downto 0) + minus_p;
    process(z1,z2,z3,s)
    begin
      if z3(194) = '0' then z <= z3(191 downto 0);
      elsif z2(193) = '0' then z <= z2(191 downto 0);
      elsif z1(192) = '0' then z <= z1(191 downto 0);
      else z <= s(191 downto 0);
      end if;
    end process;

The algorithm can be slightly modified in order to reduce the cost. According to Eq. (2.59), s is a 194-bit number that can be decomposed under the form

$s = (s_{193} \cdot 2 + s_{192})2^{192} + s_{191} \cdot 2^{191} + \ldots + s_1 \cdot 2 + s_0$
$\equiv (s_{193} \cdot 2 + s_{192})(2^{64} + 1) + s_{191} \cdot 2^{191} + \ldots + s_1 \cdot 2 + s_0$

Then define

$s' = s_{193} \cdot 2^{65} + s_{192} \cdot 2^{64} + s_{193} \cdot 2 + s_{192}$
$s'' = s_{191} \cdot 2^{191} + \ldots + s_1 \cdot 2 + s_0$

Thus s ≡ ss = s’ + s’’

(2.60)

An upper limit for ss is given by

$ss < 2^{192} + 2^{65} + 2^{64} + 3 < 2(2^{192} - 2^{64} - 1)$    (2.61)

so that x mod $(2^{192} - 2^{64} - 1)$ is either ss or $ss - (2^{192} - 2^{64} - 1)$. The corresponding circuit is shown in Fig. 2.12. A VHDL file mod_p192_reducer.vhd is available at www.arithmetic-circuits.org. The VHDL architecture corresponding to the circuit of Fig. 2.12 is the following:

    xx1 <= x(383 downto 320) & x(383 downto 320) & x(383 downto 320);
    xx2 <= x(319 downto 256) & x(319 downto 256) & ZEROS64;
    xx3 <= ZEROS64 & x(255 downto 192) & x(255 downto 192);
    xx4 <= x(191 downto 0);
    xx12 <= ('0' & xx1) + xx2;
    xx34 <= ('0' & xx3) + xx4;
    s <= ('0' & xx12 + xx34);
    ss <= ('0' & s(191 downto 0)) + (s(193 downto 192) & ZEROS62 & s(193 downto 192));
    zz <= ss + minus_p;
    with zz(192) select z <= ss(191 downto 0) when '1', zz(191 downto 0) when others;
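The second version replaces three conditional subtractions by one extra folding step. A Python sketch of this variant is given below; as before, the names mod_p192_reduce_v2 and P192 are illustrative choices, and the code is a software model under the bounds of Eqs. (2.59) and (2.61), not the VHDL itself.

```python
P192 = (1 << 192) - (1 << 64) - 1
MASK64 = (1 << 64) - 1
MASK192 = (1 << 192) - 1


def mod_p192_reduce_v2(x: int) -> int:
    """Model of the second version: fold s once more, then subtract p at most once."""
    assert 0 <= x < (1 << 384)
    a1 = (x >> 192) & MASK64
    a2 = (x >> 256) & MASK64
    a3 = (x >> 320) & MASK64
    s = (((a3 << 128) | (a3 << 64) | a3)      # x'
         + ((a2 << 128) | (a2 << 64))         # x''
         + ((a1 << 64) | a1)                  # x'''
         + (x & MASK192))                     # x''''
    top = s >> 192                            # s(193..192), at most 3
    ss = (s & MASK192) + top * ((1 << 64) + 1)  # Eq. (2.60): ss = s' + s''
    zz = ss - P192                            # ss < 2p by Eq. (2.61)
    return zz if zz >= 0 else ss


if __name__ == "__main__":
    x = (1 << 384) - 1
    print(mod_p192_reduce_v2(x) == x % P192)
```

The trade-off is the one stated above: one more addition in series, but a single final comparison instead of three.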

FIGURE 2.12 Reduction mod p = $(2^{192} - 2^{64} - 1)$, second version.

2.7 FPGA Implementation

Several reducer circuits have been implemented within Spartan3 (speed -5) programmable devices. The times (period, total time) are expressed in ns. The parameters FFs and LUTs represent the numbers of flip-flops and look-up tables, respectively. Every slice includes two flip-flops and two look-up tables. All the source files are available at www.arithmetic-circuits.org.

2.7.1 Nonrestoring Reducers

The cost and delay of several nonrestoring reducers are shown in Table 2.3.

    n     k     FFs    LUTs     Slices   Period   Cycles   Total time
    16    8     20     55       28       4.4      8        35.2
    24    8     28     63       32       4.6      16       73.6
    64    32    70     197      101      7.7      32       246.4
    128   64    134    389      198      9.3      64       595.2
    256   128   263    773      456      14.9     128      1907.2
    384   192   391    1,157    679      21.7     192      4166.4
    512   256   522    1,541    906      26.1     256      6681.6

TABLE 2.3 Cost and Delay of Nonrestoring Reducers

2.7.2 SRT Reducers

The cost and delay of several SRT reducers are shown in Table 2.4.

    n     k     FFs    LUTs     Slices   Period   Cycles   Total time
    16    8     28     106      55       5.1      8        40.8
    24    8     36     114      59       5.2      16       83.2
    64    32    100    425      216      6.2      32       198.4
    128   64    197    852      431      6.5      64       416
    256   128   391    1,688    915      6.6      128      844.8
    384   192   583    2,525    1,365    8.2      192      1574.4
    512   256   778    3,373    1,824    8.8      256      2252.8

TABLE 2.4 Cost and Delay of SRT Reducers

2.7.3 Reduction mod $2^k - a$

In this case the number of cycles depends on the value of x. The parameter Mult represents the number of embedded 18-bit by 18-bit multipliers (Table 2.5).

    n     k     FFs    LUTs     Mult   Slices   Period
    16    8     23     80       1      44       8.1
    24    8     32     88       1      50       8.1
    64    32    74     319      4      172      20.8
    128   64    138    897      16     518      26.4
    256   128   268    2,757    64     1,630    43.6
    512   256                          ∞

Note: ∞ slices means that the circuit does not fit within the device.

TABLE 2.5 Cost and Period of Reducers mod $2^k - a$

In order to get an estimation of the computation time, 1,000 pairs (x, m) have been generated for several values of k, and the corresponding numbers of cycles have been observed by simulation. In this case the Start/Done communication-protocol cycles have been included. Then, the minimum (MinCycles), maximum (MaxCycles), and average (AverCycles) numbers of cycles have been obtained. By multiplying the average number of cycles by the period, an estimation of the average computation time (AverTime) can be computed (Table 2.6).

    k     Period   MinCycles   MaxCycles   AverCycles   AverTime
    8     8.1      6           15          10.33        83.7
    32    20.8     7           12          11.24        233.8
    64    26.4     7           11          8.92         235.5
    128   43.6     7           11          8.94         389.6

TABLE 2.6 Average Delay of Reducers mod $2^k - a$

If m is a constant, then $a = 2^k - m$ is also a constant and it is no longer necessary to use multipliers for computing qa. Specific combinational circuits can be used. As an example, a 384-bit to 192-bit mod $(2^{192} - 2^{64} - 1)$ reducer has been implemented. The implementation results are the following:

    FFs   LUTs     Slices   Period   MinCycles   MaxCycles   AverCycles   AverTime
    686   38,560   19,978   50.6     7           11          8.96         453.6

2.7.4 Precomputation of $2^{ik}$ mod m

The number of cycles depends on the value of x. The cost and minimum period of several reducers are shown in Table 2.7.

    N     K     FFs    LUTs     Mult   Slices   Period
    16    8     36     84       1      44       10.4
    24    8     45     105      1      55       10.4
    32    8     55     115      1      61       10.4
    64    8     92     171      1      93       10.8
    64    32    103    233      3      139      22.0
    128   64    205    583      10     339      24.0
    256   128   408    1,643    36     972      34.3
    512   256                          ∞

TABLE 2.7 Cost and Period of Reducers with Precomputation

In order to get an estimation of the computation time, 1,000 pairs (x, m) have been generated for several values of k and n, and the corresponding numbers of cycles have been observed by simulation. The minimum (MinCycles), maximum (MaxCycles), and average (AverCycles) numbers of cycles have been obtained, and the average computation time (AverTime) has been computed (Table 2.8).

    N     K    Period   MinCycles   MaxCycles   AverCycles   AverTime
    16    8    10.4     6           18          12.01        125
    64    8    10.8     22          52          37.21        401.9

TABLE 2.8 Average Delay of Reducers with Precomputation

Observe that if n = 2k, and thus s = 2, the look-up table only stores $b_0 = 1$ and $b_1 = 2^k \bmod m$. It is no longer necessary to use multipliers for computing $vector\_r(sel) \cdot b_{sel}$. Specific combinational circuits can be used. As an example, a 384-bit to 192-bit mod $(2^{192} - 2^{64} - 1)$ reducer has been implemented. In this case $b_1 = 2^{192} \bmod (2^{192} - 2^{64} - 1) = 2^{64} + 1$. The implementation results are the following.

    FFs   LUTs     Slices   Period   MinCycles   MaxCycles   AverCycles   AverTime
    674   16,574   8,559    46.9     10          10          10           469

2.7.5 Barrett Reduction

The cost and minimum period of several reducers are shown in Table 2.9.

    n     k    FFs   LUTs   Mult   Slices   Period   Cycles   Total time
    16    8    14    57     1      30       7.2      5        36
    24    8    16    63     1      35       8.2      5        41
    32    8    40    97     4      62       12.7     5        63.5
    64    8    45    294    10     168      20.3     5        101.5

TABLE 2.9 Cost and Period of Sequential Barrett Reducers

If m and n are constants, then $c = \lfloor 2^n/m \rfloor$ is also a constant and specific circuits can be used for computing yc and qm. As an example, a 384-bit to 192-bit mod $(2^{192} - 2^{64} - 1)$ reducer has been implemented. The implementation results are the following.

    FFs   LUTs     Mult   Slices   Period   Cycles   Total time
    57    38,957   None   20,067   55.1     5        275.5

Combinational versions have also been implemented for several values of n and m (Table 2.10).

    n     k     M                                              LUTs     Slices   Total time
    16    8     239                                            63       37       17.1
    24    8     239                                            136      83       18.2
    32    8     239                                            257      145      24.6
    64    8     239                                            1,031    538      29.1
    384   192   $2^{192} - 2^{64} - 1$                         1,002    840      61.2
    512   256   $2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$     20,097   10,725   74.8

TABLE 2.10 Cost and Delay of Combinational Barrett Reducers

The obtained costs (LUTs and Slices) are surprising: There is little difference between (N, k) = (64, 8) and (N, k) = (384, 192), and a great difference between the latter and (N, k) = (512, 256). This is due to the fact that specific circuits (instead of multipliers) are used for multiplying by a constant, namely by m and by $c = \lfloor B^n/m \rfloor$. The complexity of such circuits strongly depends on the constant value. In particular, the values of c are

n = 64 and m = 239: $c = \lfloor 2^{64}/239 \rfloor$ = 112358E75D30336 (hexadecimal),
n = 384 and $m = 2^{192} - 2^{64} - 1$: $c = \lfloor 2^{384}/(2^{192} - 2^{64} - 1) \rfloor = 2^{192} + 2^{64} + 1$,
n = 512 and $m = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$: $c = \lfloor 2^{512}/(2^{256} - 2^{224} + 2^{192} + 2^{96} - 1) \rfloor$ = 100000000FFFFFFFFFFFFFFFEFFFFFFFEFFFFFFFEFFFFFFFF0000000000000003 (hexadecimal)

Obviously, the multiplication by c is very simple in the second case, and much more complex in the other ones.
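The constants listed above can be recomputed with arbitrary-precision integers. The Python sketch below does so and, as a very crude indicator that is an assumption of this sketch and not the cost measure used for the synthesis results, counts the nonzero bits of each constant; the helper names barrett_constant and nonzero_bits are hypothetical.

```python
def barrett_constant(n: int, m: int) -> int:
    """c = floor(2**n / m), the constant used by the Barrett reducer."""
    return (1 << n) // m


def nonzero_bits(value: int) -> int:
    """Number of 1s in the binary representation (a rough proxy only)."""
    return bin(value).count("1")


M239 = 239
P192 = (1 << 192) - (1 << 64) - 1
P256 = (1 << 256) - (1 << 224) + (1 << 192) + (1 << 96) - 1

if __name__ == "__main__":
    for n, m in ((64, M239), (384, P192), (512, P256)):
        c = barrett_constant(n, m)
        # Print c in hexadecimal together with its number of nonzero bits.
        print(n, format(c, "X"), nonzero_bits(c))
```

For the second case the closed form can also be checked by hand: $(2^{192} - 2^{64} - 1)(2^{192} + 2^{64} + 1) = 2^{384} - (2^{64} + 1)^2$, and the remainder $(2^{64} + 1)^2$ is smaller than m.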

2.7.6 Specific Circuits

The 16-bit to 8-bit mod 239 reducer of Fig. 2.10, and the 384-bit to 192-bit mod $(2^{192} - 2^{64} - 1)$ reducers of Figs. 2.11 and 2.12 have been implemented (Table 2.11).

    N     K     M                         LUTs    Slices   Total time
    16    8     239                       63      37       17.1
    384   192   $2^{192} - 2^{64} - 1$    1,033   931      30
    384   192   $2^{192} - 2^{64} - 1$    648     642      45

TABLE 2.11 Cost and Delay of Combinational Specific Reducers

2.8 Comments and Conclusions

According to the implementation results, the following conclusions are obtained.

1. For nonfixed-m reducers, the most cost-effective are those based on integer divisions, that is, the nonrestoring and the SRT reducers. The mod $2^k - a$ reducer uses more resources (slices and multipliers) and the same occurs with the precomputation-based reducer. Nevertheless, the latter is more cost-effective. The least cost-effective is the Barrett reducer, because of the great number of multipliers it needs.

2. As regards the computation time for nonfixed-m reducers, the fastest is the Barrett reducer. The mod $2^k - a$ and the precomputation-based reducers are slower, while the nonrestoring and SRT reducers are the slowest. Those experimental results do not completely confirm the theoretical results of Tables 2.1 and 2.2. The SRT reducers are not as fast as was theoretically foreseen. In fact, it is well known that, within FPGA devices, the carry-ripple adders use optimized carry chains, so that the improvement in performance obtained by using carry-save adders is not so great. Regarding the fact that the Barrett reducer is the fastest, and not the precomputation-based reducer as stated in Tables 2.1 and 2.2, remember that for the former a worst-case estimation was done, while for the latter a kind of best-case estimation was considered.

3. For fixed m, specific reducers should be synthesized. For example, in the case of $m = 2^{192} - 2^{64} - 1$, a specific circuit has been designed, with approximately the same number of resources as a nonrestoring reducer, and about two orders of magnitude faster.

2.9 References

[DBS06] J.-P. Deschamps, G. Bioul, and G. Sutter. Synthesis of Arithmetic Circuits. Wiley, Hoboken, New Jersey, 2006.
[EL04] M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann, San Francisco, 2004.
[HMV04] D. Hankerson, A. Menezes, and S. Vanstone. Guide to Elliptic Curve Cryptography. Springer, New York, 2004.
[MOV96] A. J. Menezes, P. C. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. CRC Press, Boca Raton, Florida, 1996.
[Par00] B. Parhami. Computer Arithmetic. Oxford University Press, New York, 2000.

CHAPTER 3

mod m Operations

Algorithms for executing arithmetic operations over the finite ring $Z_m$ are presented, and the corresponding circuits are synthesized. The operations under consideration are the addition, subtraction, multiplication, and exponentiation mod m. Obviously a straightforward solution consists of executing the same operations with integers, and then reducing mod m (Chap. 2) the previously obtained results. Nevertheless more efficient algorithms can be used. Among others the Montgomery arithmetic is introduced and applied to the multiplication and exponentiation operations. All the mentioned algorithms have been synthesized and implemented within field programmable components.

3.1 Addition mod m

Given two natural numbers x and y belonging to $Z_m = \{0, 1, \ldots, m - 1\}$, compute z = (x + y) mod m. Taking into account that

$0 \le x + y < 2m$

z must be equal to either x + y or x + y − m. The following algorithm computes z.

Algorithm 3.1—mod m addition
    z1 := x + y; z2 := z1 - m;
    if z2 >= 0 then z := z2; else z := z1; end if;

Algorithm 3.1 is slightly modified in order to get a more efficient circuit. Assume that m is a k-bit natural and define

$z_2 = (z_1 \bmod 2^k) + (2^k - m)$    (3.1)

(instead of $z_1 - m$). Then, consider three cases:

1. If $z_1 < m$, so that $z_1 < 2^k$, then $z_2 = z_1 + (2^k - m) < m + (2^k - m) = 2^k$;
2. If $m \le z_1 < 2^k$, then $z_2 = z_1 + (2^k - m) \ge m + (2^k - m) = 2^k$, and $z_2 \bmod 2^k = z_1 - m$;
3. If $2^k \le z_1$, so that $m < z_1$, then $z_2 = (z_1 - 2^k) + (2^k - m) = z_1 - m < 2m - m = m < 2^k$, and $z_2 \bmod 2^k = z_1 - m$.

To summarize,

If $z_1 < m$, then $z_1 < 2^k$ and $z_2 < 2^k$.
If $m \le z_1$, then either $z_1 \ge 2^k$ or $z_2 \ge 2^k$, and $z_1 - m = z_2 \bmod 2^k$.

Algorithm 3.2—Binary mod m addition
    z1 := x + y; z2 := (z1 mod 2**k) + (2**k - m);
    c1 := z1/2**k; c2 := z2/2**k;
    if c1 = 0 and c2 = 0 then z := (z1 mod 2**k);
    else z := (z2 mod 2**k); end if;

An executable Ada file binary_mod_m_addition.adb, including Algorithm 3.2, is available at www.arithmetic-circuits.org.
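For readers without an Ada toolchain, Algorithm 3.2 can be transcribed almost line by line into Python. The sketch below is such a transcription, assuming 0 ≤ x, y < m and a k-bit m; it is not the content of binary_mod_m_addition.adb.

```python
def mod_m_addition(x: int, y: int, m: int, k: int) -> int:
    """Algorithm 3.2: (x + y) mod m using only k-bit sums and their carries."""
    assert 0 <= x < m and 0 <= y < m and m < (1 << k)
    z1 = x + y
    z2 = (z1 % 2**k) + (2**k - m)
    c1 = z1 >> k                 # carry out of the first k-bit adder
    c2 = z2 >> k                 # carry out of the second k-bit adder
    if c1 == 0 and c2 == 0:
        return z1 % 2**k         # z1 < m: no correction needed
    return z2 % 2**k             # z1 >= m: z2 mod 2**k = z1 - m


if __name__ == "__main__":
    print(mod_m_addition(129, 105, 239, 8))
    print(mod_m_addition(234, 238, 239, 8))
    print(mod_m_addition(215, 35, 239, 8))
```

The three calls in the usage section correspond to the three cases distinguished in the derivation of Eq. (3.1).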

Example 3.1 Assume that k = 8 and m = 239, so that $2^k - m = 17$.

x = 129 and y = 105: z1 = 129 + 105 = 234, z2 = 234 + 17 = 251, c1 = c2 = 0, z = z1 mod 256 = 234
x = 234 and y = 238: z1 = 234 + 238 = 472, z2 = 216 + 17 = 233, c1 = 1, c2 = 0, z = z2 mod 256 = 233
x = 215 and y = 35: z1 = 215 + 35 = 250, z2 = 250 + 17 = 267, c1 = 0, c2 = 1, z = z2 mod 256 = 11

A combinational circuit that implements Algorithm 3.2 is shown in Fig. 3.1. If ripple-carry adders are used, then its computation time is equal to

$T = (k + 1)T_{FA} + T_{mux2-1} \approx kT_{FA}$    (3.2)

FIGURE 3.1 mod m adder.

3.2 Subtraction mod m

Given two natural numbers x and y belonging to $Z_m = \{0, 1, \ldots, m - 1\}$, compute z = (x − y) mod m. Taking into account that

$-m < x - y < m$

z must be equal to either x − y or x − y + m. The corresponding algorithm is the following:

Algorithm 3.3—mod m subtraction
    z1 := x - y; z2 := z1 + m;
    if z1 < 0 then z := z2; else z := z1; end if;

Algorithm 3.3 is slightly modified in order to get a more efficient circuit. Assume that m is a k-bit natural and define

$z_1 = (x - y + 2^k) \bmod 2^k$    (3.3)

(instead of $z_1 = x - y$). Then consider two cases:

If $-m < x - y < 0$, then $z_1 = x - y + 2^k < 2^k$, $z_2 = z_1 + m = x - y + 2^k + m > 2^k$, and $z_2 \bmod 2^k = x - y + m$.
If $0 \le x - y < m$, then $2^k \le x - y + 2^k < 2^k + m$, and $z_1 = x - y$.

To summarize, either $x - y + 2^k < 2^k$ and $z = z_2 \bmod 2^k$, or $2^k \le x - y + 2^k$ and $z = z_1$. The corresponding algorithm is the following:

Algorithm 3.4—Binary mod m subtraction
    sum := x + (2**k - y); z1 := (sum mod 2**k);
    c1 := sum/2**k; z2 := (z1 + m) mod 2**k;
    if c1 = 1 then z := z1; else z := z2; end if;

An executable Ada file binary_mod_m_subtraction.adb, including Algorithm 3.4, is available at www.arithmetic-circuits.org.
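As for the addition, Algorithm 3.4 can be transcribed into a few lines of Python. This is an illustrative transcription, assuming 0 ≤ x, y < m and a k-bit m, and not the Ada file mentioned above.

```python
def mod_m_subtraction(x: int, y: int, m: int, k: int) -> int:
    """Algorithm 3.4: (x - y) mod m using k-bit sums and one carry."""
    assert 0 <= x < m and 0 <= y < m and m < (1 << k)
    total = x + (2**k - y)       # 'sum' in Algorithm 3.4
    z1 = total % 2**k
    c1 = total >> k              # c1 = 1 exactly when x - y >= 0
    z2 = (z1 + m) % 2**k
    return z1 if c1 == 1 else z2


if __name__ == "__main__":
    print(mod_m_subtraction(159, 238, 239, 8))
    print(mod_m_subtraction(238, 159, 239, 8))
```

The carry c1 plays the role of the sign test of Algorithm 3.3, so no comparison with zero is required.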

Example 3.2 Assume that k = 8, m = 239.

x = 159 and y = 238: sum = 159 + 18 = 177, z1 = 177, z2 = 177 + 239 = 416, c1 = 0, z = z2 mod 256 = 160
x = 238 and y = 159: sum = 238 + 97 = 335, z1 = 79, z2 = 79 + 239 = 318, c1 = 1, z = z1 = 79

A combinational circuit implementing Algorithm 3.4 is shown in Fig. 3.2. If ripple-carry adders are used, then its computation time is equal to

$T = T_{inv} + (k + 1)T_{FA} + T_{mux2-1} \approx kT_{FA}$    (3.4)

FIGURE 3.2 mod m subtractor.

3.3 Adder/Subtractor mod m

The combinational adder/subtractor of Fig. 3.3 is deduced from Figs. 3.1 and 3.2. The combinational circuit generates the following 3-variable Boolean function sel:

$sel = \overline{addb/sub} \cdot (c_1 \vee c_2) \vee addb/sub \cdot \overline{c_1}$

The computation time of the adder/subtractor is approximately equal to

$T \approx kT_{FA}$    (3.5)

A complete VHDL file adder_subtractor.vhd is available at www.arithmetic-circuits.org. The entity declaration is

    entity adder_subtractor is
    port (
      x, y: in std_logic_vector(K-1 downto 0);
      addb_sub: in std_logic;
      z: out std_logic_vector(K-1 downto 0)
    );
    end adder_subtractor;

FIGURE 3.3 Adder/subtractor mod m.

The circuit being a combinational one, no communication protocol is necessary. The VHDL architecture corresponding to the circuit of Fig. 3.3 is the following:

    long_x <= '0' & x;
    xor_gates1: for i in 0 to K-1 generate
      xor_y(i) <= y(i) xor addb_sub;
    end generate;
    xor_y(K) <= '0';
    sum1 <= addb_sub + long_x + xor_y;
    c1 <= sum1(K);
    z1 <= sum1(K-1 downto 0);
    long_z1 <= '0' & z1;
    xor_gates2: for i in 0 to k-1 generate
      xor_m(i) <= m(i) xor not(addb_sub);
    end generate;
    xor_m(k) <= '0';
    sum2 <= not(addb_sub) + long_z1 + xor_m;
    c2 <= sum2(k);
    z2 <= sum2(k-1 downto 0);
    sel <= (not(addb_sub) and (c1 or c2)) or (addb_sub and not(c1));
    with sel select z <= z1 when '0', z2 when others;
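The way a single pair of adders serves both operations can be followed in a bit-level software model. The Python sketch below mirrors the signal names of the architecture above (xor_y, sum1, xor_m, sum2, sel); it is a simulation aid written for this purpose, with m and k passed as ordinary parameters, which is an assumption of the sketch.

```python
def adder_subtractor(x: int, y: int, m: int, k: int, addb_sub: int) -> int:
    """Model of the mod m adder/subtractor: addb_sub = 0 adds, 1 subtracts."""
    assert 0 <= x < m and 0 <= y < m and m < (1 << k) and addb_sub in (0, 1)
    mask = (1 << k) - 1
    # First adder: x + y, or x + not(y) + 1 = x + 2**k - y when subtracting.
    xor_y = y ^ (mask if addb_sub else 0)
    sum1 = addb_sub + x + xor_y
    c1, z1 = sum1 >> k, sum1 & mask
    # Second adder: z1 + not(m) + 1 = z1 + 2**k - m when adding, z1 + m when subtracting.
    xor_m = m ^ (0 if addb_sub else mask)
    sum2 = (1 - addb_sub) + z1 + xor_m
    c2, z2 = sum2 >> k, sum2 & mask
    sel = ((1 - addb_sub) & (c1 | c2)) | (addb_sub & (1 - c1))
    return z2 if sel else z1


if __name__ == "__main__":
    print(adder_subtractor(234, 238, 239, 8, 0))   # addition
    print(adder_subtractor(159, 238, 239, 8, 1))   # subtraction
```

With addb_sub = 0 the model reduces to Algorithm 3.2, and with addb_sub = 1 to Algorithm 3.4.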

3.4 Multiplication mod m

Given x and y ∈ $Z_m = \{0, 1, \ldots, m - 1\}$, compute z = xy mod m, where m is a k-bit natural.

3.4.1 Multiply and Reduce

As already mentioned, a straightforward method consists of multiplying x by y, so that a 2k-bit result product is obtained, and then reducing product mod m. For that, any combination of multiplier and mod m reducer can be used. Nevertheless the chosen option should be coherent. Two examples will be described.

For relatively small values of m, combinational circuits can be considered for both the multiplication and the reduction. As an example, synthesize a mod 239 multiplier. Any 8-by-8 parallel multiplier can be used along with the mod 239 reducer of Fig. 2.10. A VHDL file mod_239_multiplier.vhd is available at www.arithmetic-circuits.org. The corresponding entity declaration is

    entity mod_239_multiplier is
    port (
      x, y: in std_logic_vector(7 downto 0);
      z: out std_logic_vector(7 downto 0)
    );
    end mod_239_multiplier;

The VHDL architecture corresponding to the circuit of Fig. 3.4 is the following:

    product <= x*y;
    reducer_component: mod_239_reducer port map(product, z);

As a second example consider the case of a generic k-bit mod m multiplier. Among the different mod m reduction circuits presented in Chap. 2, one of the fastest is the SRT reducer of Sec. 2.1.3. Its main computation resource is a carry-save adder ([Par00], [EL04], [DBS06]). The multiplier that computes product should also be a fast one, for example a sequential carry-save multiplier. The following shift-and-add algorithm computes product = xy:

Algorithm 3.5—Shift and add multiplication
    p(0) := 0;
    for i in 0 .. k-1 loop
      p(i+1) := (p(i) + x(i)*y)/2;
    end loop;
    product := p(k)*(2**k);
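In Algorithm 3.5 the division by 2 is exact only if the bits shifted out are kept. The Python sketch below makes this explicit by collecting them in a separate variable, in the same way as the circuit stores them in a shift register; the names shift_and_add, acc, and low are illustrative.

```python
def shift_and_add(x: int, y: int, k: int) -> int:
    """Right-shift multiplication of two k-bit naturals, as in Algorithm 3.5."""
    assert 0 <= x < (1 << k) and 0 <= y < (1 << k)
    acc = 0    # integer part of p(i)
    low = 0    # bits shifted out of p(i), least significant first
    for i in range(k):
        acc += ((x >> i) & 1) * y      # p(i) + x(i)*y
        low |= (acc & 1) << i          # keep the bit lost by the division by 2
        acc >>= 1                      # divide by 2
    return (acc << k) | low            # product = p(k) * 2**k


if __name__ == "__main__":
    print(shift_and_add(200, 123, 8), 200 * 123)
```

After k iterations acc holds the k most significant bits of the product and low the k least significant ones.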

FIGURE 3.4 Mod 239 multiplier.

If p is represented under carry-stored form, the circuit of Fig. 3.5 is obtained, in which

$product = p_s(k - 1..0) \cdot 2^k + p_c(k - 1..0) \cdot 2^k + p(k - 1..0)$    (3.6)

Nevertheless, as the SRT reducer also operates with data under carry-stored form, a final addition for generating the decoded value of product is not necessary.

FIGURE 3.5 Carry-save shift-and-add multiplier.

A complete VHDL file modified_csa_multiplier.vhd is available at www.arithmetic-circuits.org. The entity declaration is

    entity modified_csa_multiplier is
    port (
      x, y: in std_logic_vector (K-1 downto 0);
      clk, reset, start: in std_logic;
      ps, pc, p: inout std_logic_vector (K-1 downto 0);
      done: out std_logic
    );
    end modified_csa_multiplier;

The VHDL architecture corresponding to the circuit of Fig. 3.5 is the following:

    csa: for i in 0 to K-1 generate
      next_s(i) <= ps(i) xor pc(i) xor y_by_xi(i);
      next_c(i+1) <= (ps(i) and pc(i)) or (ps(i) and y_by_xi(i)) or (pc(i) and y_by_xi(i));
    end generate;
    and_gates: for i in 0 to K-1 generate
      y_by_xi(i) <= y(i) and p(0);
    end generate;
    registers: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then
          ps <= (others => '0'); pc <= (others => '0'); p <= x;
        elsif update = '1' then
          ps <= '0' & next_s(K-1 downto 1);
          pc <= next_c(k) & next_c(K-1 downto 1);
          p <= next_s(0) & p(K-1 downto 1);
        end if;
      end if;
    end process;
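The register updates above can be simulated with integers in order to check Eq. (3.6). The following Python sketch models k clock cycles with update = '1'; the counter and control unit are replaced by a plain loop, which is a simplification made for this sketch, and the function name csa_multiply is hypothetical.

```python
def csa_multiply(x: int, y: int, k: int):
    """Model of the carry-save shift-and-add multiplier; returns (ps, pc, p)."""
    assert 0 <= x < (1 << k) and 0 <= y < (1 << k)
    ps, pc, p = 0, 0, x                     # state after load = '1'
    for _ in range(k):                      # k cycles with update = '1'
        y_by_xi = y if (p & 1) else 0       # y AND-ed with the current bit of x
        next_s = ps ^ pc ^ y_by_xi          # sum bits of the full adders
        next_c = ((ps & pc) | (ps & y_by_xi) | (pc & y_by_xi)) << 1  # carries, weight 2
        ps = next_s >> 1                    # '0' & next_s(K-1 downto 1)
        pc = next_c >> 1                    # next_c(K downto 1)
        p = ((next_s & 1) << (k - 1)) | (p >> 1)   # next_s(0) & p(K-1 downto 1)
    return ps, pc, p


def decode(ps: int, pc: int, p: int, k: int) -> int:
    """Eq. (3.6): product = ps * 2**k + pc * 2**k + p."""
    return ((ps + pc) << k) + p


if __name__ == "__main__":
    ps, pc, p = csa_multiply(200, 123, 8)
    print(decode(ps, pc, p, 8), 200 * 123)
```

Note that the model never performs a carry-propagating addition inside the loop; the only such addition is in decode, which the SRT reducer makes unnecessary.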

The complete model additionally includes a k-state counter and a control unit. The structure of the complete carry-save mod m multiplier is shown in Fig. 3.6. The first block is the modified_csa_multiplier entity and the second one the srt_reducer entity (Sec. 2.1.3) with n = 2k, and thus n − k = k. The circuit must be slightly modified: the initial value of the register in Fig. 2.4 must be ss = 000 & ps(k − 1..0) & p(k − 1..0) = $p_s(k - 1..0) \cdot 2^k + p(k - 1..0)$ and sc = 000 & pc(k − 1..0) & 00...0 = $p_c(k - 1..0) \cdot 2^k$, so that initially ss + sc = product [see Eq. (3.6)]. Within the VHDL architecture (Sec. 2.1.3) the process registers is modified accordingly, that is,

    registers: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then
          ss <= ("000"&ps)&p; sc <= "000"&pc;
        elsif update = '1' then
          ...
        end if;
      end if;
    end process registers;

FIGURE 3.6 Carry-save mod m multiplier.

A complete VHDL file csa_mod_multiplier.vhd is available at www.arithmetic-circuits.org. The entity declaration is

    entity csa_mod_multiplier is
    port (
      x, y, m: in std_logic_vector(K-1 downto 0);
      clk, reset, start: in std_logic;
      z: out std_logic_vector(K-1 downto 0);
      done: inout std_logic
    );
    end csa_mod_multiplier;

The VHDL architecture corresponding to the circuit of Fig. 3.6 is the following:

    first_step: modified_csa_multiplier port map(
      x => x, y => y, clk => clk, reset => reset, start => start1,
      ps => ps, pc => pc, p => p, done => done1 );
    second_step: modified_srt_reducer port map(
      ps => ps, pc => pc, p => p, m => m, clk => clk, reset => reset,
      start => start2, z => z, done => done2 );
    control_unit: process(clk, reset)
    begin
      case current_state is
        when 0 to 1 => done <= '1'; start2 <= '0';
        when 2 => done <= '0'; start2 <= '0';
        when 3 => done <= '0'; start2 <= '1';
        when 4 => done <= '0'; start2 <= '0';
      end case;
      start1 <= start;
      if reset = '1' then current_state <= 0;
      elsif clk'event and clk = '1' then
        case current_state is
          when 0 => if start = '0' then current_state <= 1; end if;
          when 1 => if start = '1' then current_state <= 2; end if;
          when 2 => if done1 = '1' then current_state <= 3; end if;
          when 3 => current_state <= 4;
          when 4 => if done2 = '1' then current_state <= 0; end if;
        end case;
      end if;
    end process;

The total computation time is the sum of the computation times of both blocks. The minimum clock period of the circuit of Fig. 3.5 is approximately equal to $T_{FA}$, and the minimum clock period of the circuit of Fig. 2.4 is about $6T_{FA}$ [Eq. (2.19)]. The multiplication takes k cycles and the SRT reduction n − k + 1 = k + 1 cycles, plus $(k + 1)T_{FA}$ time units for the final steps [Eq. (2.20)]. Thus the total number of cycles is approximately equal to 2k, the cycle duration about $6T_{FA}$, and the total computation time is

$T \approx 12kT_{FA} + kT_{FA}$    (3.7)

In Eq. (3.7) the term kTFA corresponds to the final operations; they are not executable in one clock cycle, and a comment similar to Comment 2.1 must be done.

3.4.2 Double, Add, and Reduce

This section describes circuits based on the Interleaving Multiplication Algorithm ([RSDK06]). Given a k-bit natural x and a natural y the product z = xy can be computed as follows:

$xy = (x_{k-1} \cdot 2^{k-1} + x_{k-2} \cdot 2^{k-2} + \ldots + x_0 \cdot 2^0)y = (\ldots((0 \cdot 2 + x_{k-1}y)2 + x_{k-2}y)2 + \ldots + x_1 y)2 + x_0 y$    (3.8)

If all operations (addition and doubling) are executed mod m, the result is product = xy mod m. The corresponding (left to right) algorithm is

Algorithm 3.6—Double, add, and reduce
    p := 0;
    for i in 0 .. k-1 loop
      p := (p*2 + x(k-i-1)*y) mod m;
    end loop;
    product := p;

In the following equivalent algorithm, the function mod_m_addition(x, y, m, k) computes (x + y) mod m, x, y, and m being k-bit numbers, according to Algorithm 3.2:

Algorithm 3.7—Double, add, and reduce (second version)
    p := 0;
    for i in 0 .. k-1 loop
      p := mod_m_addition(p, p, m, k);
      if x(k-i-1) = 1 then p := mod_m_addition(p, y, m, k); end if;
    end loop;
    product := p;
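Algorithm 3.7 can be exercised in software with a direct transcription. The Python sketch below bundles a transcription of Algorithm 3.2 so that it is self-contained; the function name dar_mod_multiplication is borrowed from the Ada file name for convenience, but the code is an independent illustration and not that file.

```python
def mod_m_addition(x: int, y: int, m: int, k: int) -> int:
    """Algorithm 3.2: (x + y) mod m for 0 <= x, y < m and a k-bit m."""
    z1 = x + y
    z2 = (z1 % 2**k) + (2**k - m)
    if (z1 >> k) == 0 and (z2 >> k) == 0:
        return z1 % 2**k
    return z2 % 2**k


def dar_mod_multiplication(x: int, y: int, m: int, k: int) -> int:
    """Algorithm 3.7: xy mod m by interleaved doubling and addition mod m."""
    assert 0 <= x < m and 0 <= y < m and m < (1 << k)
    p = 0
    for i in range(k):
        p = mod_m_addition(p, p, m, k)          # doubling step
        if (x >> (k - i - 1)) & 1:              # scan x from the most significant bit
            p = mod_m_addition(p, y, m, k)      # conditional addition step
    return p


if __name__ == "__main__":
    print(dar_mod_multiplication(200, 123, 239, 8), (200 * 123) % 239)
```

Every intermediate value stays below m, which is why a k-bit register is sufficient in the corresponding datapath.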

An executable Ada file dar_mod_multiplication, including Algorithm 3.7, is available at www.arithmetic-circuits.org. The datapath corresponding to Algorithm 3.7 is shown in Fig. 3.7. The combinational circuit generates

$condition = ce\_p \wedge (\overline{step\_type} \vee x(i))$

FIGURE 3.7 Double, add, and reduce multiplier.

The minimum clock period of the circuit of Fig. 3.7 is determined by the mod m adder computation time, that is, approximately $kT_{FA}$ [Eq. (3.2)]. The number of clock cycles is equal to 2k, so that the total computation time is

$$T \approx 2k^2 T_{FA} \qquad (3.9)$$

a rather bad result compared to Eq. (3.7). A complete VHDL file dar_mod_multiplier.vhd is available at www.arithmetic-circuits.org. The entity declaration is

    entity dar_mod_multiplier is
    port (
      x, y: in std_logic_vector(K-1 downto 0);
      clk, reset, start: in std_logic;
      z: out std_logic_vector(K-1 downto 0);
      done: out std_logic
    );
    end dar_mod_multiplier;

The VHDL architecture corresponding to the circuit of Fig. 3.7 is the following:

    with step_type select second_operand <= p when '0', y when others;
    main_component: adder port map(p, second_operand, sum);
    condition <= ce_p and (not(step_type) or x_i);

    parallel_register: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then p <= (others => '0');
        elsif condition = '1' then p <= sum;
        end if;
      end if;
    end process parallel_register;

    equal_zero <= '1' when count = ZERO else '0';
    z <= p;

    shift_register: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then int_x <= x;
        elsif update = '1' then
          for i in k-1 downto 1 loop int_x(i) <= int_x(i-1); end loop;
          int_x(0) <= '0';
        end if;
      end if;
    end process shift_register;
    x_i <= int_x(k-1);

The complete model additionally includes a k-state counter and a control unit.

In order to reduce the computation time [Eq. (3.9)], the stored-carry encoding principle can be used. For that, Algorithm 3.7 is modified as follows:

- p is represented in the form p = ps + pc;
- the sum p + y is computed in stored-carry form, that is, ps + pc + y = ws + wc (csa_sum and csa_carry functions, Sec. 2.1.3);
- every operation (multiplication by 2 or addition of y) is followed by one step of SRT reduction (Sec. 2.1.3);
- the final value of p = pc + ps belongs to the interval −m ≤ p < m, so that the final result is either p or p + m.

Algorithm 3.8—Double, add, and reduce with carry-stored encoding

    ps := 0; pc := 0;
    for i in 0 .. k-1 loop
      -- doubling
      ws := (2*ps) mod 2**(k+2); wc := (2*pc) mod 2**(k+2);
      -- SRT reduction
      if quotient(ws, wc) = 1 then
        ps := csa_sum(ws, wc, minus_m); pc := csa_carry(ws, wc, minus_m);
      elsif quotient(ws, wc) = 0 then
        ps := csa_sum(ws, wc, 0); pc := csa_carry(ws, wc, 0);
      else
        ps := csa_sum(ws, wc, m); pc := csa_carry(ws, wc, m);
      end if;
      -- adding
      if binary_x(k-i-1) = 1 then
        ws := csa_sum(ps, pc, y); wc := csa_carry(ps, pc, y);
        -- SRT reduction
        if quotient(ws, wc) = 1 then
          ps := csa_sum(ws, wc, minus_m); pc := csa_carry(ws, wc, minus_m);
        elsif quotient(ws, wc) = 0 then
          ps := csa_sum(ws, wc, 0); pc := csa_carry(ws, wc, 0);
        else
          ps := csa_sum(ws, wc, m); pc := csa_carry(ws, wc, m);
        end if;
      end if;
    end loop;
    p := (ps + pc) mod 2**(k+1);
    if p >= 2**k then p := (p + m) mod 2**k; end if;

Part of the corresponding datapath is shown in Fig. 3.8. It must be completed by the k-bit shift register and the combinational circuit (generation of condition) of Fig. 3.7, as well as the adders corresponding to the final operations, that is, p = pc + ps and p + m if p is negative. The minimum clock period is about $2T_{FA}$ and the number of clock cycles is equal to 2k, so that the total computation time, without the final operations, is approximately equal to $4kT_{FA}$. If carry-propagate adders are used for the final operations, the total computation time is approximately equal to

$$T \approx 4kT_{FA} + kT_{FA} \qquad (3.10)$$

FIGURE 3.8 Double, add, and reduce with stored-carry encoding.

In the preceding relation, the second term corresponds to the final operations, which cannot be executed in one clock cycle. A complete VHDL file dar_csa_multiplier.vhd is available at www.arithmetic-circuits.org. The entity declaration is

    entity dar_csa_multiplier is
    port (
      x, y: in std_logic_vector(K-1 downto 0);
      clk, reset, start: in std_logic;
      z: out std_logic_vector(K-1 downto 0);
      done: out std_logic
    );
    end dar_csa_multiplier;

The VHDL architecture corresponding to the circuit of Fig. 3.8 is the following:

    long_y <= "00" & y;
    sc(0) <= '0'; pc(0) <= '0';
    first_csa: for i in 0 to K generate
      ss(i) <= ps(i) xor pc(i) xor long_y(i);
      sc(i+1) <= (ps(i) and pc(i)) or (ps(i) and long_y(i)) or (pc(i) and long_y(i));
    end generate;
    ss(K+1) <= ps(K+1) xor pc(K+1) xor long_y(K+1);
    two_ps <= ps(K downto 0) & '0';
    two_pc <= pc(K downto 0) & '0';
    with step_type select ws <= two_ps when '0', ss when others;
    with step_type select wc <= two_pc when '0', sc when others;
    t <= ws(k+1 downto k-1) + wc(k+1 downto k-1);
    quotient(1) <= t(2) xor (t(1) and t(0));
    quotient(0) <= not(t(2) and t(1) and t(0));
    with quotient select u <= minus_M when "01", long_ZERO when "00", M when others;
    second_csa: for i in 0 to k generate
      next_ps(i) <= ws(i) xor wc(i) xor u(i);
      next_pc(i+1) <= (ws(i) and wc(i)) or (ws(i) and u(i)) or (wc(i) and u(i));
    end generate;
    next_ps(k+1) <= ws(k+1) xor wc(k+1) xor u(k+1);

    parallel_register: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then
          ps <= (others => '0'); pc <= (others => '0');
        elsif condition = '1' then
          ps <= next_ps; pc(k+1 downto 1) <= next_pc;
        end if;
      end if;
    end process parallel_register;

The complete model additionally includes a k-bit shift register, a 3-input combinational circuit generating condition, the adders corresponding to the final steps, a k-state counter, and a control unit. As regards the done flag, a comment similar to Comment 2.1 applies.

3.4.3 Montgomery Multiplication

3.4.3.1 Montgomery Arithmetic

Consider two natural numbers m and R > m, and assume that they are relatively prime, that is, gcd(m, R) = 1. Then there exists an element $R^{-1}$ of $Z_m$ such that

$$RR^{-1} \bmod m = 1 \qquad (3.11)$$

Define a mapping T from $Z_m$ to $Z_m$:

$$T(x) = xR \bmod m \qquad (3.12)$$

The inverse mapping $T^{-1}$ is defined as follows:

$$T^{-1}(y) = yR^{-1} \bmod m \qquad (3.13)$$

Thus, T is a one-to-one and onto mapping.

Property 3.1 Given x and y in $Z_m$, then T((x + y) mod m) = (T(x) + T(y)) mod m and T((x − y) mod m) = (T(x) − T(y)) mod m.

Proof T((x ± y) mod m) = (x ± y)R mod m = ((xR mod m) ± (yR mod m)) mod m = (T(x) ± T(y)) mod m.

As regards the multiplication, observe that

$$T(xy \bmod m) = xyR \bmod m = ((xR \bmod m)(yR \bmod m))R^{-1} \bmod m = T(x)T(y)R^{-1} \bmod m \qquad (3.14)$$

The latter suggests the definition of a new operation on $Z_m$, the so-called Montgomery product MP [Mon85]:

$$MP(x, y) = xyR^{-1} \bmod m \qquad (3.15)$$

A straightforward consequence of Eqs. (3.14) and (3.15) is the following property:

Property 3.2 Given x and y in $Z_m$, then

$$T(xy \bmod m) = MP(T(x), T(y)) \qquad (3.16)$$

Assume now that the value of

$$R_2 = RR \bmod m \qquad (3.17)$$

has been previously computed. Then, given x and y in $Z_m$,

$$T(x) = xR \bmod m = xR_2R^{-1} \bmod m = MP(x, R_2) \qquad (3.18)$$

$$T^{-1}(y) = yR^{-1} \bmod m = MP(y, 1) \qquad (3.19)$$

Assuming that an efficient algorithm exists for computing the Montgomery product MP, any set of operations on $Z_m$, including sums, subtractions, and multiplications, can be performed in the following way:

- Substitute all the operands, say $x_1, x_2, \ldots$, by $T(x_1) = MP(x_1, R_2)$, $T(x_2) = MP(x_2, R_2)$, . . .
- Execute all the operations, substituting the products by Montgomery products.
- Substitute all the results, say $y_1, y_2, \ldots$, by $T^{-1}(y_1) = MP(y_1, 1)$, $T^{-1}(y_2) = MP(y_2, 1)$, . . .
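The three steps above can be illustrated with a small reference model. The following Python sketch is only a numerical check under stated assumptions: the Montgomery product is computed directly from its definition [Eq. (3.15)] using Python's built-in modular inverse (Python 3.8 or later), not with a hardware-oriented algorithm, and the names to_montgomery, from_montgomery, and evaluate are illustrative.

```python
M = 239                    # modulus used in the examples of this chapter
R = 256                    # R > m with gcd(m, R) = 1
R_INV = pow(R, -1, M)      # R^-1 mod m, Eq. (3.11); requires Python 3.8+
R2 = (R * R) % M           # Eq. (3.17)


def mp(x, y):
    """Montgomery product by definition, Eq. (3.15)."""
    return (x * y * R_INV) % M


def to_montgomery(x):
    """T(x) = MP(x, R2), Eq. (3.18)."""
    return mp(x, R2)


def from_montgomery(y):
    """T^-1(y) = MP(y, 1), Eq. (3.19)."""
    return mp(y, 1)


def evaluate(x1, x2, x3):
    """Compute (x1*x2 + x3) mod M entirely on encoded operands."""
    t1, t2, t3 = to_montgomery(x1), to_montgomery(x2), to_montgomery(x3)
    t = (mp(t1, t2) + t3) % M      # products become Montgomery products
    return from_montgomery(t)      # decode the result


if __name__ == "__main__":
    print(evaluate(202, 236, 17), (202 * 236 + 17) % M)
```

Sums and subtractions are unchanged by the encoding (Property 3.1); only the products are replaced (Property 3.2).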

3.4.3.2 Montgomery Product

First define the Montgomery reduction MR ([Mon85]): given a natural number x, then

$$MR(x) = xR^{-1} \bmod m \qquad (3.20)$$

As gcd(m, R) = 1, there exists an element $-m^{-1}$ of $Z_R$ such that $m(-m^{-1}) \equiv -1 \pmod R$. Assuming that the value minus_inv_m of $-m^{-1}$ has been previously computed, and that x < mR, the following algorithm computes MR(x):

Algorithm 3.9—Montgomery reduction

    q := (x + ((x*minus_inv_m) mod r)*m) / r;
    if q >= m then z := q-m; else z := q; end if;

The correctness of Algorithm 3.9 is deduced from the following facts:

- $x + (x(-m^{-1}) \bmod R)m \equiv 0 \pmod R$, so that $x + (x(-m^{-1}) \bmod R)m = qR$;
- $x + (x(-m^{-1}) \bmod R)m < 2mR$, so that q < 2m;
- $q \bmod m = qRR^{-1} \bmod m = (x + (x(-m^{-1}) \bmod R)m)R^{-1} \bmod m = xR^{-1} \bmod m = MR(x)$.

Example 3.3 Assume that m = 239 and R = 256, and compute MR(47672). First check that $R^{-1} \bmod m = 225$ and $-m^{-1} \bmod R = 241$. Then compute

    47672 + (((47672 × 241) mod 256) × 239) = 47672 + (184 × 239) = 91648
    91648/256 = 358
    358 − 239 = 119

and check that 47672 × 225 = 44879 × 239 + 119.
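Example 3.3 can be reproduced with a few lines of software. The Python sketch below follows Algorithm 3.9; the function name montgomery_reduction is illustrative, and the constant minus_inv_m is obtained here with Python's built-in modular inverse (Python 3.8 or later) rather than being precomputed as the text assumes.

```python
def montgomery_reduction(x, m, r):
    """Return x * r^-1 mod m for 0 <= x < m*r, with gcd(m, r) = 1 (Algorithm 3.9)."""
    minus_inv_m = (-pow(m, -1, r)) % r           # -m^-1 mod r (241 for m = 239, r = 256)
    q = (x + ((x * minus_inv_m) % r) * m) // r   # the numerator is a multiple of r
    return q - m if q >= m else q                # q < 2m, so one subtraction suffices


if __name__ == "__main__":
    print(montgomery_reduction(47672, 239, 256))   # value computed in Example 3.3
```

When r is a power of 2, the division by r and the reduction mod r become a shift and a bit mask, which is what makes the method attractive in hardware.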

Given x and y in $Z_m$, their product p = xy is smaller than $m^2 < mR$, so that their Montgomery product [Eq. (3.15)] can be computed as follows:

$$MP(x, y) = MR(xy) \qquad (3.21)$$

The corresponding algorithm is the following ([Mon85]):

Algorithm 3.10—Montgomery product

    product := x*y;
    q := (product + ((product*minus_inv_m) mod r)*m) / r;
    if q >= m then z := q-m; else z := q; end if;

Example 3.4 Assume again that m = 239 and R = 256; then compute MP(202, 236). As 202 × 236 = 47672 (Example 3.3), MP(202, 236) = 119.

An executable Ada file Montgomery_product.adb, including Algorithm 3.10, is available at www.arithmetic-circuits.org.

If m is odd, then there exists an element $2^{-1}$ of $Z_m$ such that $2 \cdot 2^{-1} \bmod m = 1$. Assuming that m is a k-bit number, choose $R = 2^k$. Given $x = x_{k-1} \cdot 2^{k-1} + x_{k-2} \cdot 2^{k-2} + \cdots + x_0 \cdot 2^0$ and y in $Z_m$, then

$$MP(x, y) = xy(2^k)^{-1} \bmod m = (x_{k-1}y2^{k-1} + x_{k-2}y2^{k-2} + \cdots + x_0y2^0)(2^k)^{-1} \bmod m$$
$$= ((\cdots(((0 + x_0y)2^{-1} + x_1y)2^{-1} + x_2y)2^{-1} + \cdots)2^{-1} + x_{k-1}y)2^{-1} \bmod m \qquad (3.22)$$

The multiplication of a natural a by $2^{-1}$ can be performed as follows:

$$\text{if } a \text{ is even then } a2^{-1} \equiv a/2 \pmod m; \quad \text{if } a \text{ is odd then } a2^{-1} \equiv (a + m)/2 \pmod m \qquad (3.23)$$

The following algorithm is deduced from Eqs. (3.22) and (3.23):

Algorithm 3.11

    p := 0;
    for i in 0 .. k-1 loop
      a := p + x(i)*y;
      if (a mod 2) = 0 then p := a/2;
      else p := (a + m)/2; end if;
    end loop;
    z := p mod m;

In the preceding algorithm p is always smaller than 2m (by induction): if p < 2m, then $a = p + x_iy < 2m + y < 3m$, so that a/2 < (3/2)m < 2m and (a + m)/2 < 4m/2 = 2m. Thus, at the last step z is either p or p − m.

Algorithm 3.12—Binary Montgomery product

    p := 0;
    for i in 0 .. k-1 loop
      a := p + vector_x(i)*y;
      if (a mod 2) = 0 then p := a/2;
      else p := (a + m)/2; end if;
    end loop;
    if p >= m then z := p-m; else z := p; end if;
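A software model of Algorithm 3.12 makes the bit-serial structure explicit. The following Python sketch is a minimal version, assuming that m is odd, y < m, and x is a k-bit number; the function name binary_montgomery_product is illustrative.

```python
def binary_montgomery_product(x, y, m, k):
    """Return x*y*2^(-k) mod m, scanning the bits of x from LSB to MSB (Algorithm 3.12)."""
    p = 0
    for i in range(k):
        a = p + ((x >> i) & 1) * y     # add y if bit x(i) is 1
        if a % 2 == 0:
            p = a // 2                 # a even: exact halving
        else:
            p = (a + m) // 2           # a odd: a + m is even because m is odd
    return p - m if p >= m else p      # p < 2m, so one subtraction suffices


if __name__ == "__main__":
    print(binary_montgomery_product(202, 236, 239, 8))   # operands of Example 3.4
```

No division by m and no multiplication appear in the loop; each iteration needs only additions, a parity test, and a one-bit right shift.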

Example 3.5 Same example as before, that is, m = 239 and $R = 2^8 = 256$; compute MP(202, 236). First represent 202 in binary: x = 11001010. Then compute, scanning the bits of x from the least significant one:

    p = 0
    a = 0 + 0 · 236 = 0, p = a/2 = 0
    a = 0 + 1 · 236 = 236, p = a/2 = 118
    a = 118 + 0 · 236 = 118, p = a/2 = 59
    a = 59 + 1 · 236 = 295, p = (a + m)/2 = 534/2 = 267
    a = 267 + 0 · 236 = 267, p = (a + m)/2 = 506/2 = 253
    a = 253 + 0 · 236 = 253, p = (a + m)/2 = 492/2 = 246
    a = 246 + 1 · 236 = 482, p = a/2 = 241
    a = 241 + 1 · 236 = 477, p = (a + m)/2 = 716/2 = 358

As 358 ≥ m, the final result is 358 − 239 = 119.

An executable Ada file binary_Montgomery_product.adb, including Algorithm 3.12, is available at www.arithmetic-circuits.org. A datapath corresponding to Algorithm 3.12 is shown in Fig. 3.9. The variable p is represented in stored-carry form so that carry-save adders can be used. The final steps, that is, p = pc + ps and z equal to either p or p − m, are missing.

FIGURE 3.9 Montgomery product.

The minimum clock period of the circuit of Fig. 3.9 is approximately $2T_{FA}$. The number of clock cycles is equal to k, so that the total computation time, without the final operations, is about $2kT_{FA}$. The final steps consist of computing p = pc + ps and p − m, where pc and ps are (k + 1)-bit numbers and m is a k-bit number. If carry-propagate adders are used, the corresponding computation time is about $kT_{FA}$. Thus, the total computation time is approximately equal to

$$T \approx 2kT_{FA} + kT_{FA} \qquad (3.24)$$

In Eq. (3.24) the second term $kT_{FA}$ corresponds to the final operations, which cannot be executed in one clock cycle. A complete VHDL file Montgomery_multiplier.vhd is available at www.arithmetic-circuits.org. The entity declaration is

    entity Montgomery_multiplier is
    port (
      x, y: in std_logic_vector(K-1 downto 0);
      clk, reset, start: in std_logic;
      z: out std_logic_vector(K-1 downto 0);
      done: out std_logic
    );
    end Montgomery_multiplier;

The VHDL architecture corresponding to the circuit of Fig. 3.9 is the following:

    and_gates: for i in 0 to k-1 generate
      y_by_xi(i) <= y(i) and xi;
    end generate;
    y_by_xi(K) <= '0';
    first_csa: for i in 0 to k generate
      as(i) <= pc(i) xor ps(i) xor y_by_xi(i);
      ac(i+1) <= (pc(i) and ps(i)) or (pc(i) and y_by_xi(i)) or (ps(i) and y_by_xi(i));
    end generate;
    ac(0) <= '0'; as(K+1) <= '0';
    long_m <= "00" & m;
    second_csa: for i in 0 to k generate
      bs(i) <= ac(i) xor as(i) xor long_m(i);
      bc(i+1) <= (ac(i) and as(i)) or (ac(i) and long_m(i)) or (as(i) and long_m(i));
    end generate;
    bc(0) <= '0'; bs(K+1) <= ac(K+1);
    half_as <= as(K+1 downto 1); half_ac <= ac(K+1 downto 1);
    half_bs <= bs(K+1 downto 1); half_bc <= bc(K+1 downto 1);
    with as(0) select next_pc <= half_ac when '0', half_bc when others;
    with as(0) select next_ps <= half_as when '0', half_bs when others;

    parallel_register: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then
          pc <= (others => '0'); ps <= (others => '0');
        elsif ce_p = '1' then
          pc <= next_pc; ps <= next_ps;
        end if;
      end if;
    end process parallel_register;

    shift_register: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then int_x <= x;
        elsif ce_p = '1' then
          for i in 0 to K-2 loop int_x(i) <= int_x(i+1); end loop;
          int_x(K-1) <= '0';
        end if;
      end if;
    end process shift_register;
    xi <= int_x(0);

The complete model additionally includes the circuits corresponding to the final steps, that is,

    p <= ps + pc;
    p_minus_m <= p + minus_m;
    with p_minus_m(K) select z <= p(K-1 downto 0) when '0', p_minus_m(K-1 downto 0) when others;

as well as a k-state counter and a control unit. As regards the done variable, a comment similar to Comment 2.1 applies.

3.4.4 Comparison

In this section three multiplication algorithms were considered: multiply and reduce; double, add, and reduce; and Montgomery product. The corresponding approximate computation times are the following [Eqs. (3.7), (3.9), (3.10), and (3.24)] (Table 3.1):

| Multiplication algorithm | Computation time |
|---|---|
| Multiply and reduce (stored-carry) | $12kT_{FA} + kT_{FA}$ |
| Double, add, and reduce | $2k^2T_{FA}$ |
| Double, add, and reduce (stored-carry) | $4kT_{FA} + kT_{FA}$ |
| Montgomery (stored-carry) | $2kT_{FA} + kT_{FA}$ |

TABLE 3.1 Approximate Computation Times

Obviously, the Montgomery algorithm gives the shortest computation time. As pointed out in Sec. 3.4.3.1, the Montgomery method is effective when many multiplications involving a reduced number of different operands are performed, that is, when the initial encoding (operands → T(operands)) and the final decoding (results → $T^{-1}$(results)) do not substantially increase the computation time. This is the case when computing an exponential function such as $y^x$. If a single multiplication must be performed, the other algorithms should be considered.

3.5 Exponentiation

Given x and y belonging to $Z_m$ = {0, 1, . . . , m − 1}, compute $z = y^x \bmod m$. Assume that m is a k-bit number and that x is represented in base 2, that is, $x = x_{k-1} \cdot 2^{k-1} + x_{k-2} \cdot 2^{k-2} + \cdots + x_0 \cdot 2^0$. Then z can be written in the form [Knu81]

$$z = ((\cdots((1^2\, y^{x_{k-1}})^2\, y^{x_{k-2}})^2 \cdots)^2\, y^{x_1})^2\, y^{x_0} \bmod m$$

to which corresponds the following algorithm:

Algorithm 3.13—Base 2 mod m exponentiation, MSB-first

    e := 1;
    for i in 1 .. k loop
      e := (e*e) mod m;
      if binary_x(k-i) = 1 then e := (e*y) mod m; end if;
    end loop;
    z := e;
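As a quick check of the MSB-first scheme, Algorithm 3.13 can be transcribed almost literally. The Python sketch below is a plain software model; the function name mod_m_exponentiation_msb is illustrative and the result is compared against Python's built-in pow.

```python
def mod_m_exponentiation_msb(x, y, m, k):
    """Return y**x mod m, scanning the k bits of x from MSB to LSB (Algorithm 3.13)."""
    e = 1
    for i in range(1, k + 1):
        e = (e * e) % m                # squaring, executed at every step
        if (x >> (k - i)) & 1:         # bit x(k-i)
            e = (e * y) % m            # multiplication, only if the bit is 1
    return e


if __name__ == "__main__":
    print(mod_m_exponentiation_msb(202, 236, 239, 8), pow(236, 202, 239))
```

The multiplication uses the result of the squaring of the same iteration, so the two products of an iteration cannot be executed in parallel.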

An executable Ada file mod_m_exponentiation_msb.adb, including Algorithm 3.13, is available at www.arithmetic-circuits.org. This algorithm includes between k and 2k mod m products. Nevertheless, all the operands are either 1, y, or a previously obtained value (e), so that an alternative solution is the use of the Montgomery product. The computation is performed as follows:

- Substitute the initial operands 1 and y by $T(1) = 2^k \bmod m$ and T(y) = MP(y, exp_2k), where exp_2k = $2^{2k} \bmod m$.
- Execute the main loop of Algorithm 3.13, substituting the mod m products by Montgomery products.
- Compute $T^{-1}(e) = MP(e, 1)$.

Assume that exp_k = $2^k \bmod m$ and exp_2k = $2^{2k} \bmod m$ have been previously computed and that the function mp that computes the Montgomery product has been defined:

Algorithm 3.14—Base 2 mod m exponentiation, MSB-first (Montgomery product)

    e := exp_k;
    ty := mp(y, exp_2k);
    for i in 1 .. k loop
      e := mp(e, e);
      if binary_x(k-i) = 1 then e := mp(e, ty); end if;
    end loop;
    z := mp(e, 1);
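The following Python sketch models Algorithm 3.14 in software. It assumes that mp is the binary Montgomery product of Algorithm 3.12 with R = 2^k and m odd; the constants exp_k and exp_2k are computed on the fly with pow instead of being precomputed, and the function names are illustrative.

```python
def mp(x, y, m, k):
    """Binary Montgomery product x*y*2^(-k) mod m (Algorithm 3.12), m odd, y < m."""
    p = 0
    for i in range(k):
        a = p + ((x >> i) & 1) * y
        p = a // 2 if a % 2 == 0 else (a + m) // 2
    return p - m if p >= m else p


def montgomery_exponentiation_msb(x, y, m, k):
    """Return y**x mod m using only Montgomery products (Algorithm 3.14)."""
    exp_k = pow(2, k, m)               # T(1) = 2^k mod m
    exp_2k = pow(2, 2 * k, m)          # 2^(2k) mod m
    e = exp_k
    ty = mp(y, exp_2k, m, k)           # T(y)
    for i in range(1, k + 1):
        e = mp(e, e, m, k)
        if (x >> (k - i)) & 1:
            e = mp(e, ty, m, k)
    return mp(e, 1, m, k)              # decode: T^-1(e)


if __name__ == "__main__":
    print(montgomery_exponentiation_msb(202, 236, 239, 8), pow(236, 202, 239))
```

The encoding and decoding cost two extra Montgomery products in total, which is small compared with the k to 2k products of the main loop.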

An executable Ada file Montgomery_exponentiation_msb.adb, including Algorithm 3.14, is available at www.arithmetic-circuits.org. A datapath corresponding to Algorithm 3.14 is shown in Fig. 3.10. In this circuit it is essential that the mp_done signal is raised only when the final result is available (Comment 2.1). For that, a modified model of the Montgomery multiplier has been generated. It includes a timer that delays the rising of the done signal. The delay is a parameter that can be defined by the designer according to the type of adder used for the final steps. The modified model Montgomery_multiplier_modif.vhd and the exponentiator model Montgomery_exponentiator_msb.vhd are available at www.arithmetic-circuits.org. The declaration of the entity Montgomery_exponentiator_msb is the following:

    entity Montgomery_exponentiator_msb is
    port (
      x, y: in std_logic_vector(K-1 downto 0);
      clk, reset, start: in std_logic;
      z: out std_logic_vector(K-1 downto 0);
      done: out std_logic
    );
    end Montgomery_exponentiator_msb;

FIGURE 3.10 mod m exponentiation, MSB-first algorithm.

The VHDL architecture corresponding to the circuit of Fig. 3.10 is

    with control select operand1 <= y when "00", e when others;
    with control select operand2 <= exp_2k when "00", e when "01", ty when "10", one when others;
    main_component: Montgomery_multiplier_modif port map(operand1, operand2, clk, reset, start_mp, result, mp_done);
    z <= result;

    register_e: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then e <= exp_k;
        elsif ce_e = '1' then e <= result;
        end if;
      end if;
    end process register_e;

    register_ty: process(clk)
    begin
      if clk'event and clk = '1' then
        if ce_ty = '1' then ty <= result; end if;
      end if;
    end process register_ty;

    shift_register: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then int_x <= x;
        elsif update = '1' then int_x <= int_x(K-2 downto 0) & '0';
        end if;
      end if;
    end process shift_register;
    xkminusi <= int_x(K-1);

Algorithm 3.14 mainly consists of k executions of a loop that in turn includes at most two Montgomery products. According to Eq. (3.24) the computation time of a Montgomery product is about $3kT_{FA}$, so that the total computation time of $y^x \bmod m$ is approximately equal to

$$T \approx 6k^2T_{FA} \qquad (3.25)$$

Another way to compute $z = y^x \bmod m$ is given by the following expression

$$z = y^{x_0}\,(y^2)^{x_1}\,(y^{2^2})^{x_2} \cdots (y^{2^{k-1}})^{x_{k-1}} \bmod m \qquad (3.26)$$

to which corresponds the following algorithm:

Algorithm 3.15—Base 2 mod m exponentiation, LSB-first

    e := 1;
    for i in 0 .. k-1 loop
      if binary_x(i) = 1 then e := (e*y) mod m; end if;
      y := (y*y) mod m;
    end loop;
    z := e;

An executable Ada file mod_m_exponentiation_lsb.adb, including Algorithm 3.15, is available at www.arithmetic-circuits.org. As before, the Montgomery product can substitute the mod m product.

Algorithm 3.16—Base 2 mod m exponentiation (Montgomery product), LSB-first

    e := exp_k;
    ty := mp(y, exp_2k);
    for i in 0 .. k-1 loop
      if binary_x(i) = 1 then e := mp(e, ty); end if;
      ty := mp(ty, ty);
    end loop;
    z := mp(e, 1);
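In Algorithm 3.16 both products of an iteration read the same value of ty, so neither depends on the other. The Python sketch below makes this explicit by computing next_e and next_ty from the old register values before updating them. It is a reference model only: mp is evaluated from its definition with Python's modular inverse (Python 3.8 or later) rather than with Algorithm 3.12, and the names are illustrative.

```python
def montgomery_exponentiation_lsb(x, y, m, k):
    """Return y**x mod m, scanning the bits of x from LSB to MSB (Algorithm 3.16)."""
    r_inv = pow(pow(2, k, m), -1, m)          # (2^k)^-1 mod m, m odd

    def mp(a, b):
        """Montgomery product by definition: a*b*2^(-k) mod m."""
        return (a * b * r_inv) % m

    e = pow(2, k, m)                          # exp_k = T(1)
    ty = mp(y, pow(2, 2 * k, m))              # T(y) = MP(y, exp_2k)
    for i in range(k):
        # Both results depend only on the current e and ty,
        # so two multipliers could compute them at the same time.
        next_e = mp(e, ty) if (x >> i) & 1 else e
        next_ty = mp(ty, ty)
        e, ty = next_e, next_ty
    return mp(e, 1)                           # decode: T^-1(e)


if __name__ == "__main__":
    print(montgomery_exponentiation_lsb(202, 236, 239, 8), pow(236, 202, 239))
```

This independence is what allows the LSB-first circuit to use two multipliers in parallel, at the cost of additional area.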

Algorithm 3.16 mainly consists of k executions of a loop that in turn includes at most two Montgomery products. In this case both products can be executed in parallel. A datapath corresponding to Algorithm 3.16 is shown in Fig. 3.11. According to Eq. (3.24) the computation time of a Montgomery product is about $3kT_{FA}$, so that the total computation time of $y^x \bmod m$ is approximately equal to

$$T \approx 3k^2T_{FA} \qquad (3.27)$$

A complete VHDL file Montgomery_exponentiator_lsb.vhd is available at www.arithmetic-circuits.org. The entity declaration is

    entity Montgomery_exponentiator_lsb is
    port (
      x, y: in std_logic_vector(k-1 downto 0);
      clk, reset, start: in std_logic;
      z: out std_logic_vector(k-1 downto 0);
      done: out std_logic
    );
    end Montgomery_exponentiator_lsb;

FIGURE 3.11 mod m exponentiation, LSB-first algorithm.

The VHDL architecture corresponding to the circuit of Fig. 3.11 is

    with last select second <= ty when '0', one when others;
    with first select operand1 <= y when '1', ty when others;
    with first select operand2 <= exp_2k when '1', ty when others;

    main_component1: Montgomery_multiplier_modif port map(
      x => e, y => second, clk => clk, reset => reset,
      start => start_mp1, z => next_e, done => mp1_done );
    main_component2: Montgomery_multiplier_modif port map(
      x => operand1, y => operand2, clk => clk, reset => reset,
      start => start_mp2, z => next_y, done => mp2_done );
    mp_done <= mp1_done and mp2_done;
    z <= next_e;

    register_e: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then e <= exp_k;
        elsif ce_e = '1' then e <= next_e;
        end if;
      end if;
    end process register_e;

    register_ty: process(clk)
    begin
      if clk'event and clk = '1' then
        if ce_ty = '1' then ty <= next_y; end if;
      end if;
    end process register_ty;

    shift_register: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then int_x <= x;
        elsif update = '1' then int_x <= '0' & int_x(K-1 downto 1);
        end if;
      end if;
    end process shift_register;
    xi <= int_x(0);

The complete model additionally includes a k-state counter and a control unit.

3.6 FPGA Implementations

Several multipliers have been implemented within Spartan3 (speed -5) programmable devices. As before, the times (Period, Total time) are expressed in ns, and the parameters FFs and LUTs represent the numbers of flip-flops and look-up tables, respectively. Every slice includes two flip-flops and two look-up tables. All the source files are available at www.arithmetic-circuits.org.

3.6.1 mod m Adders/Subtractors

The cost and delay of several mod m adders/subtractors are shown in Table 3.2.

| K | m | LUTs | Slices | Total time |
|---|---|---|---|---|
| 8 | 239 | 25 | 13 | 9 |
| 192 | $2^{192} - 2^{64} - 1$ | 577 | 384 | 45 |
| 256 | $2^{256} - 2^{224} - 2^{192} - 2^{96} - 1$ | 770 | 514 | 52 |
| 521 | $2^{521} - 1$ | 1,567 | 1,044 | 105 |

TABLE 3.2 Cost and Delay of mod m Adders/Subtractors

3.6.2 mod m Multipliers

Two values of m have been considered. For m = 239 (k = 8), the best option is the combinational circuit described in Sec. 3.4.1 (Fig. 3.4). The implementation results are the following:

| LUTs | Slices | Total time |
|---|---|---|
| 31 | 18 | 15 |

For $m = 2^{192} - 2^{64} - 1$ (k = 192), four circuits have been implemented: csa_mod_multiplier, dar_mod_multiplier, dar_csa_multiplier, and Montgomery_multiplier (remember that the Montgomery multiplier does not compute xy mod m but $xy2^{-k} \bmod m$). The cost and delay of these multipliers are shown in Table 3.3.

| | FF | LUTs | Slices | Period | Cycles | Total time |
|---|---|---|---|---|---|---|
| csa_mod | 1,271 | 3,678 | 2,053 | 6.233 | 384 | 2393.5 |
| dar_mod | 400 | 593 | 400 | 23.615 | 384 | 9068.2 |
| dar_csa | 597 | 1,835 | 1,113 | 9.796 | 384 | 3761.7 |
| Montgomery | 612 | 1,398 | 922 | 6.765 | 198 | 1339.5 |

TABLE 3.3 Cost and Delay of mod $2^{192} - 2^{64} - 1$ Multipliers

Another mod $2^{192} - 2^{64} - 1$ multiplier implementation is reported in App. A (Sec. A.4.1). It uses a sequential multiplier and the combinational reducer of Sec. 2.6.2.

3.6.3 mod m Exponentiators

Two values of m are considered: m = 239 (k = 8) and $m = 2^{192} - 2^{64} - 1$ (k = 192). The implementation results are the following (Tables 3.4 and 3.5):

| | FF | LUTs | Slices | Period | Cycles | Total time |
|---|---|---|---|---|---|---|
| MSB-first | 70 | 166 | 93 | 6.960 | 128 | 891 |
| LSB-first | 97 | 265 | 140 | 7.533 | 64 | 482 |

TABLE 3.4 Cost and Delay of mod 239 Exponentiators

| | FF | LUTs | Slices | Period | Cycles | Total time |
|---|---|---|---|---|---|---|
| MSB-first | 1,185 | 1,993 | 1,199 | 8.176 | 73,733 | 602,841 |
| LSB-first | 1,779 | 3,554 | 1,983 | 8.871 | 36,869 | 327,065 |

TABLE 3.5 Cost and Delay of mod $2^{192} - 2^{64} - 1$ Exponentiators

3.7 Comments and Conclusions

The experimental results do not completely confirm the theoretical results of Table 3.1. The fastest multiplier is obtained with the csa_mod_multiplier entity and not with the dar_csa_multiplier entity. On the other hand, the latter uses fewer slices. As regards the exponentiation, the fastest circuit is obtained with the LSB-first algorithm, and the most cost-effective with the MSB-first algorithm.

3.8 References

[DBS06] J.-P. Deschamps, G. Bioul, and G. Sutter. Synthesis of Arithmetic Circuits. Wiley, Hoboken, New Jersey, 2006.
[EL04] M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann, San Francisco, 2004.
[Knu81] D. E. Knuth. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. Addison-Wesley, Massachusetts, 2d ed., 1981.
[Mon85] P. Montgomery. "Modular Multiplication without Trial Division." Mathematics of Computation, vol. 44, American Mathematical Society, Providence, Rhode Island, pp. 519–521, April 1985.
[Par00] B. Parhami. Computer Arithmetic. Oxford University Press, New York, 2000.
[RSDK06] F. Rodríguez-Henríquez, N. A. Saqib, A. Díaz Pérez, and Ç. K. Koç. Cryptographic Algorithms on Reconfigurable Hardware. Springer, New York, 2006.

CHAPTER 4 Operations over GF(p)

If p is prime, any nonzero element y of the ring $Z_p$ has an inverse $y^{-1}$ such that

$$yy^{-1} \bmod p = 1 \qquad (4.1)$$

Thus $Z_p$ is a finite field, also called Galois field, GF(p). Algorithms and circuits for executing additions, subtractions, multiplications, and exponentiations over $Z_p$ have already been studied in Chap. 3. In this chapter the inversion or, more generally, the division operation will be dealt with. The problem under study is the following: given x and y in $Z_p$, where y ≠ 0, compute z such that x = yz mod p, that is,

$$z = xy^{-1} \bmod p \qquad (4.2)$$

Observe that if p = 2, then $y^{-1} = 1$ and z = x. So, throughout this chapter p will be assumed to be a k-bit natural greater than 2. In particular, k must be greater than 1. A first method for computing the mod p inverse of an element y of $Z_p$ consists of using an algorithm that expresses the gcd (greatest common divisor) of two naturals a and b in the form of a linear combination αa + βb of a and b, where α and β are integers. Assume that a = p and b = y, and express the gcd of p and y in the form αp + βy. As p is prime and y is smaller than p, their gcd is 1, so that αp + βy = 1 and βy mod p = 1, that is,

$$y^{-1} = \beta \bmod p \qquad (4.3)$$

The extended Euclidean algorithm and the binary algorithm belong to this class of algorithms. Another method is based on Fermat's little theorem, which states that $y^{p-1} \bmod p = 1$ for any nonzero y. Thus, $y^{p-2}y \bmod p = 1$ and

$$y^{-1} = y^{p-2} \bmod p \qquad (4.4)$$

In this way inversion is substituted by exponentiation.
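Equation (4.4) is easy to try out in software. The following Python sketch computes the inverse as an exponentiation with the built-in three-argument pow; the function names mod_p_inverse and mod_p_division are illustrative, and p is assumed to be prime.

```python
def mod_p_inverse(y, p):
    """Return y^-1 mod p as y^(p-2) mod p (Fermat's little theorem), p prime."""
    if y % p == 0:
        raise ZeroDivisionError("0 has no inverse mod p")
    return pow(y, p - 2, p)


def mod_p_division(x, y, p):
    """Return z such that x = y*z mod p, Eq. (4.2)."""
    return (x * mod_p_inverse(y, p)) % p


if __name__ == "__main__":
    z = mod_p_division(202, 236, 239)
    print(z, (z * 236) % 239)    # the second value should equal 202
```

In a circuit the exponentiation would be carried out by one of the exponentiators of Chap. 3 with the constant exponent p − 2.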

4.1 Euclidean Algorithm

The classical Euclidean algorithm ([HMV04], [MOV96]) for computing the gcd of two naturals a and b consists of a set of integer divisions:

$$r_0 = a,\ r_1 = b$$
$$r_0 = r_1q_1 + r_2$$
$$r_1 = r_2q_2 + r_3 \qquad (4.5)$$
$$\cdots$$
$$r_{n-2} = r_{n-1}q_{n-1} + r_n$$

As $r_1 > r_2 > r_3 > \cdots$, after a finite number of steps the remainder, say $r_n$, will be equal to 0. Furthermore $\gcd(r_{n-1}, r_n) = \gcd(r_{n-2}, r_{n-1}) = \cdots = \gcd(r_0, r_1) = \gcd(a, b)$. Thus,

$$\gcd(a, b) = \gcd(r_{n-1}, 0) = r_{n-1} \qquad (4.6)$$

In the particular case where a = p and b = y < p, the gcd is equal to 1, that is,

$$r_{n-1} = 1 \qquad (4.7)$$

For computing $z = xy^{-1} \bmod p$, another set of values $u_0, u_1, u_2, \ldots$ is computed in parallel with the computation of $q_1, q_2, q_3, \ldots$:

$$u_0 = 0,\ u_1 = x$$
$$u_2 = u_0 - u_1q_1$$
$$u_3 = u_1 - u_2q_2 \qquad (4.8)$$
$$\cdots$$
$$u_n = u_{n-2} - u_{n-1}q_{n-1}$$

The following lemma is demonstrated by induction.

Lemma 4.1 $u_iy \equiv r_ix \pmod p$.

Proof For i = 0 and 1: $u_0y = 0$ and $r_0x = px \equiv 0 \pmod p$; $u_1y = xy$ and $r_1x = yx$. For i ≥ 2: $u_iy = (u_{i-2} - u_{i-1}q_{i-1})y = u_{i-2}y - u_{i-1}q_{i-1}y \equiv r_{i-2}x - r_{i-1}q_{i-1}x = (r_{i-2} - r_{i-1}q_{i-1})x = r_ix$.

Thus, according to Eq. (4.7) and Lemma 4.1, $u_{n-1}y \equiv x \pmod p$, that is,

$$xy^{-1} = u_{n-1} \bmod p \qquad (4.9)$$

Algorithm 4.1—mod p division, Euclidean algorithm

    a := p; b := y; c := 0; d := x;
    while b > 1 loop
      q := a/b; r := a mod b;
      next_d := (c - d*q);
      a := b; b := r; c := d; d := next_d;
    end loop;
    z := d mod p;
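Algorithm 4.1 can be exercised with a direct software transcription. The Python sketch below keeps the variable names of the algorithm; the function name euclidean_mod_p_division and the explicit check on y are additions made for this illustration, and p is assumed to be prime with 0 < y < p.

```python
def euclidean_mod_p_division(x, y, p):
    """Return z = x * y^-1 mod p for a prime p and 0 < y < p (Algorithm 4.1)."""
    if not 0 < y < p:
        raise ValueError("y must satisfy 0 < y < p")
    a, b, c, d = p, y, 0, x                # (r0, r1) and (u0, u1)
    while b > 1:
        q, r = divmod(a, b)                # integer division step
        a, b, c, d = b, r, d, c - d * q    # next remainder and next u value
    return d % p                           # d may be negative; reduce into 0 .. p-1


if __name__ == "__main__":
    z = euclidean_mod_p_division(202, 236, 239)
    print(z, (z * 236) % 239)    # the second value should equal 202
```

The loop stops as soon as the remainder b equals 1, which always happens when p is prime, so the last division of Eq. (4.5), the one that yields a zero remainder, is never executed.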

An executable Ada file Euclidean_mod_p_division.adb, including Algorithm 4.1, is available at www.arithmetic-circuits.org. In order to execute Algorithm 4.1, the following computation primitives must be implemented: integer division, multiplication and subtraction, and mod p reduction.

4.1.1 Integer Division

At each step of Algorithm 4.1 the quotient q = a/b and the remainder r = a mod b must be computed. For using the SRT algorithm ([Par00], [EL04], [DBS06]) the divisor b should be normalized (most significant bit equal to 1), a rather complex operation. Instead, the nonrestoring algorithm will be used. The range of a and b is deduced from the fact that $p = r_0 > r_1 > r_2 > \cdots > 0$, so that $0 < r_i \le p$ and $1 < r_{i-1}/r_i \le p$. Thus, if p is a k-bit number, so are a and b, with b ≥ 2. A slightly more general case is considered: a is assumed to be a (k + 1)-bit 2's complement integer, that is, $-2^k \le a < 2^k$. Define

$$s_0 = a \quad \text{and} \quad y = b \cdot 2^{k-2} \qquad (4.10)$$

so that $-2^k \le s_0 < 2^k$ and $y \ge 2 \cdot 2^{k-2} = 2^{k-1}$. Thus $2y \ge 2^k > s_0$, $-2y \le -2^k \le s_0$, and $-2y \le s_0 < 2y$. According to Property 2.1 the equation $s_0 = q_1y + r_1$ has at least one solution with $q_1 \in \{-1, 0, 1\}$ and $-y \le r_1 < y$, so that $s_1 = 2r_1$ belongs to the same interval $-2y \le s_1 < 2y$. By repeatedly applying Property 2.1 the following set of equations is generated:

$$a = s_0 = q_1y + r_1, \quad -y \le r_1 < y$$
$$2r_1 = s_1 = q_2y + r_2, \quad -y \le r_2 < y$$
$$2r_2 = s_2 = q_3y + r_3, \quad -y \le r_3 < y \qquad (4.11)$$
$$\cdots$$
$$2r_{k-2} = s_{k-2} = q_{k-1}y + r_{k-1}, \quad -y \le r_{k-1} < y$$

According to the Robertson diagram of Fig. 2.1 the value of $q_i$ can be chosen as follows: if $s_{i-1} < 0$ then $q_i = -1$ and $r_i = s_{i-1} + y$, and if $s_{i-1} \ge 0$ then $q_i = 1$ and $r_i = s_{i-1} - y$. Then multiply the first equation by $2^{k-2}$, the second one by $2^{k-3}$, and so on, and sum up the k − 1 equations. The result is

$$a \cdot 2^{k-2} = (q_1 \cdot 2^{k-2} + q_2 \cdot 2^{k-3} + q_3 \cdot 2^{k-4} + \cdots + q_{k-1} \cdot 2^0)\,y + r_{k-1}$$

that is, according to Eq. (4.10),

$a = (q_1 \cdot 2^{k-2} + q_2 \cdot 2^{k-3} + q_3 \cdot 2^{k-4} + \dots + q_{k-1} \cdot 2^0)\,b + r_{k-1}/2^{k-2}$

Thus the quotient q and the remainder r are

$q = q_1 \cdot 2^{k-2} + q_2 \cdot 2^{k-3} + q_3 \cdot 2^{k-4} + \dots + q_{k-1} \cdot 2^0$ and $r = r_{k-1}/2^{k-2}$   (4.12)

with

$-b \le r < b$   (4.13)

The bound (4.13) follows from $-y \le r_{k-1} < y$ and $y = b \cdot 2^{k-2}$. Then define

$q_i' = (q_i + 1)/2$   (4.14)

so that $q_i = 2q_i' - 1$ and

$q = (q_1' \cdot 2^{k-1} + q_2' \cdot 2^{k-2} + q_3' \cdot 2^{k-3} + \dots + q_{k-1}' \cdot 2) - (2^{k-1} - 1)$
$\;\; = (q_1' - 1) \cdot 2^{k-1} + q_2' \cdot 2^{k-2} + q_3' \cdot 2^{k-3} + \dots + q_{k-1}' \cdot 2 + 1 \cdot 2^0$   (4.15)

According to Eq. (4.14) $q_i' \in \{0, 1\}$ and the k-bit binary vector $(1 - q_1')\; q_2'\; q_3' \dots q_{k-1}'\; 1$ is the 2’s complement representation of q. Nevertheless, as r could be negative [Eq. (4.13)], a final correction step could be necessary:

If r < 0, then substitute q by q − 1 and r by r + b

According to the definition Eq. (4.14) of $q_i'$, its value is defined as follows:

If $s_{i-1} < 0$ then $q_i' = 0$ and $r_i = s_{i-1} + y$, and if $s_{i-1} \ge 0$ then $q_i' = 1$ and $r_i = s_{i-1} - y$

Algorithm 4.2—Nonrestoring algorithm

y := b*(2**(k-2)); s := a;
for i in 1 .. k-1 loop
   if s < 0 then q(k-i) := 0; r := s + y;
   else q(k-i) := 1; r := s - y;
   end if;
   s := 2*r;
end loop;
remainder := r / (2**(k-2));
q(k-1) := 1 - q(k-1); q(0) := 1;
acc := 0;
for i in 0 .. k-2 loop
   acc := acc + (q(i)*(2**i));
end loop;
acc := acc - (q(k-1)*(2**(k-1)));
if remainder < 0 then
   quotient := acc - 1; remainder := remainder + b;
else
   quotient := acc;
end if;
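The following Python sketch transcribes Algorithm 4.2 step by step so that it can be compared against ordinary integer division. It is a software model only, not the Ada file or the VHDL circuit; the function name nr_divide is hypothetical, and the operand ranges ($-2^k \le a < 2^k$, $2 \le b < 2^k$, $k \ge 2$) are the ones assumed in the text.

```python
def nr_divide(a, b, k):
    """Model of Algorithm 4.2 for -2**k <= a < 2**k and 2 <= b < 2**k."""
    y = b << (k - 2)              # y = b * 2**(k-2), Eq. (4.10)
    s = a
    q = [0] * k                   # q[i] holds bit q(i)
    for i in range(1, k):
        if s < 0:
            q[k - i] = 0
            r = s + y
        else:
            q[k - i] = 1
            r = s - y
        s = 2 * r
    remainder = r >> (k - 2)      # exact: r is a multiple of 2**(k-2)
    q[k - 1] = 1 - q[k - 1]       # sign bit of the 2's complement quotient
    q[0] = 1
    acc = sum(q[i] << i for i in range(k - 1)) - (q[k - 1] << (k - 1))
    if remainder < 0:             # final correction step
        return acc - 1, remainder + b
    return acc, remainder


if __name__ == "__main__":
    print(nr_divide(7, 3, 3))     # quotient and remainder of 7 / 3
```

After the correction step the remainder lies in the interval 0 ≤ r < b, so the result should coincide with Python's divmod(a, b) for both positive and negative a.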

An executable Ada file nr_divider.adb, including Algorithm 4.2, is available at www.arithmetic-circuits.org. The datapath corresponding to Algorithm 4.2 is shown in Fig. 4.1. The minimum clock period is about $kT_{FA}$ if a carry-ripple adder-subtractor is used for computing r at each step. The number of clock cycles is equal to k − 1 plus the cycles corresponding to the initial and final operations, so that the computation time is about

$T \approx k^2 T_{FA}$   (4.16)

FIGURE 4.1 Nonrestoring divider datapath.

A complete VHDL file nr_divider.vhd is available at www.arithmetic-circuits.org. The entity declaration is

entity nr_divider is
port (
  a, b: in std_logic_vector (K-1 downto 0);
  clk, reset, start: in std_logic;
  quotient, remainder: out std_logic_vector (K-1 downto 0);
  done: out std_logic
);
end nr_divider;

The VHDL architecture corresponding to the circuit of Fig. 4.1 is the following:

parallel_register: process(clk)
begin
  if clk'event and clk = '1' then
    if load = '1' then s <= zero & a;
    elsif update = '1' then
      s(2*K-1 downto 1) <= r(2*K-2 downto 0); s(0) <= '0';
    end if;
  end if;
end process parallel_register;

shift_register: process(clk)
begin
  if clk'event and clk = '1' then
    if load = '1' then q <= short_zero;
    elsif update = '1' then
      for i in K-1 downto 2 loop q(i) <= q(i-1); end loop;
      q(1) <= not(s(2*K-1));
    end if;
  end if;
end process shift_register;

with s(2*K-1) select r(2*K-2 downto K-2) <=
  s(2*K-2 downto K-2) + b when '1',
  s(2*K-2 downto K-2) - b when others;
r(K-3 downto 0) <= s(K-3 downto 0);
modified_q <= not(q(K-1)) & q(K-2 downto 1) & '1';
with s(2*K-1) select quotient <=
  modified_q when '0',
  modified_q - 1 when others;
with s(2*K-1) select remainder <=
  s(2*K-2 downto K-1) when '0',
  s(2*K-2 downto K-1) + b when others;

The complete model additionally includes a (k − 1)-state counter and a control unit.

4.1.2 Multiplication and Subtraction

It has been observed [DBS06] that in system [Eq. (4.8)] u(i) belongs to the interval $-px/2 \le u(i) < px/2$, for all i in 0, 1, . . . , n − 1. As both p and x are k-bit naturals, $-2^{2k-1} \le u(i) < 2^{2k-1}$, so that c and d are 2k-bit 2’s complement integers. For computing z = c − dq, where q is a k-bit natural, a slightly modified version of the right-to-left multiplication algorithm can be used:

$c - dq = c - d(q_{k-1} \cdot 2^{k-1} + q_{k-2} \cdot 2^{k-2} + \dots + q_0 \cdot 2^0)$
$\;\; = [((((c - q_0 d)2^{-1} - q_1 d)2^{-1} - \dots)2^{-1} - q_{k-1} d)2^{-1}]\,2^k$   (4.17)

Algorithm 4.3—Subtract and shift algorithm

w := c;
for i in 0 .. k-1 loop
   w := (w - q(i)*d)/2;
end loop;
z := w*(2**k);
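In Algorithm 4.3 the division by 2 is exact only if the bits shifted out of w are kept somewhere. The Python sketch below models one way of doing this, under the assumption that the discarded bit of each step is saved in a second register, in the spirit of the registers u and v of Fig. 4.2. The function name mult_subt_model and the unbounded integer widths are illustrative choices, not part of the original design.

```python
def mult_subt_model(c, d, q, k):
    """Return c - d*q computed by k subtract-and-shift steps."""
    u = c                          # upper part, initially c
    v = 0                          # bits shifted out of u, least significant first
    for i in range(k):
        q_i = (q >> i) & 1         # bit q(i), scanned right to left
        dif = u - q_i * d          # conditional subtraction
        v |= (dif & 1) << i        # keep the bit that the shift would drop
        u = dif >> 1               # arithmetic shift right (floor division by 2)
    return (u << k) + v            # w * 2**k reassembled from both parts


if __name__ == "__main__":
    print(mult_subt_model(100, 7, 5, 3), 100 - 7 * 5)
```

Because dif = 2·(dif >> 1) + (dif & 1) holds for negative Python integers as well, the reassembled value equals c − dq exactly, which is the identity stated in Eq. (4.17).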

The structure of the corresponding datapath is shown in Fig. 4.2. The minimum clock period is about $2kT_{FA}$ if a carry-ripple conditional subtractor is used, and the number of clock cycles is equal to k, so that the computation time is about

$T \approx 2k^2 T_{FA}$   (4.18)

FIGURE 4.2 Multiplication and subtraction.

A complete VHDL file mult_subt.vhd is available at www.arithmetic-circuits.org. The entity declaration is

entity mult_subt is
port (
  c, d: in std_logic_vector (2*K-1 downto 0);
  q: in std_logic_vector(K-1 downto 0);
  clk, reset, start: in std_logic;
  z: out std_logic_vector (2*K-1 downto 0);
  done: out std_logic
);
end mult_subt;

The VHDL architecture corresponding to the circuit of Fig. 4.2 is the following:

parallel_register: process(clk)
begin
  if clk'event and clk = '1' then
    if load = '1' then u <= c;
    elsif update = '1' then u <= dif(2*k downto 1);
    end if;
  end if;
end process parallel_register;

shift_register: process(clk)
begin
  if clk'event and clk = '1' then
    if load = '1' then v <= q;
    elsif update = '1' then
      for i in 0 to k-2 loop v(i) <= v(i+1); end loop;
      v(k-1) <= dif(0);
    end if;
  end if;
end process shift_register;

with v(0) select dif <=
  (u(2*K-1) & u) - (d(2*K-1) & d) when '1',
  (u(2*k-1) & u) when others;
z <= u(K-1 downto 0) & v;

4.1.3 mod p Division

A datapath for executing Algorithm 4.1 is shown in Fig. 4.3. The nonrestoring divider and the multiplication-and-subtraction circuits have been described in Secs. 4.1.1 and 4.1.2, respectively. The reduction can be executed with the nonrestoring reducer of Sec. 2.1.2. As a matter of fact the circuit of Sec. 2.1.2 computes x mod m where x is an (n + 1)-bit 2’s complement integer and m a k-bit natural. As d is a 2k-bit integer, n = 2k − 1.

FIGURE 4.3 Euclidean algorithm.

Let t be the number of executions of the main loop of Algorithm 4.1. The total computation time T is approximately equal to

$T \approx t\,(T_{division} + T_{multiplication\text{-}and\text{-}subtraction}) + T_{reduction}$   (4.19)

where $T_{division}$, $T_{multiplication\text{-}and\text{-}subtraction}$, and $T_{reduction}$ are the computation times of the three blocks of Fig. 4.3. The minimum clock period of the multiplication-and-subtraction circuit is $2kT_{FA}$ while it is $kT_{FA}$ for the other blocks. Unless a different clock period is used for the circuit of multiplication and subtraction, the minimum clock period of the circuit of Fig. 4.3 is about $2kT_{FA}$. As the three algorithms (division, multiplication and subtraction, and reduction) are k-step iterations, the total computation time is approximately equal to $(2kt + k) \cdot 2kT_{FA}$, that is,

$T \approx 4k^2 t\,T_{FA}$   (4.20)

As in Eq. (4.5) $r_1 > r_2 > r_3 > \dots$, an upper bound of t is $p < 2^k$, so that an upper (very pessimistic) bound of the computation time is

$T < k^2 \cdot 2^{k+2}\,T_{FA}$   (4.21)

A complete VHDL file Euclidean_divider.vhd is available at www.arithmetic-circuits.org. The entity declaration is

entity Euclidean_divider is
port(
  x, y: in std_logic_vector(K-1 downto 0);
  clk, reset, start: in std_logic;
  z: out std_logic_vector(K-1 downto 0);
  done: out std_logic
);
end Euclidean_divider;

The VHDL architecture corresponding to the circuit of Fig. 4.3 is the following:

divider: nr_divider port map(a, b, clk, reset, start_division, q, r, division_done);
multiplier: mult_subt port map(c, d, q, clk, reset, start_product, next_d, product_done);
reducer: nr_reducer port map(d, p, clk, reset, start_reduction, z, reduction_done);

registers: process(clk)
begin
  if clk'event and clk = '1' then
    if load = '1' then a <= p; b <= y; c <= long_zero; d <= zero & x;
    elsif update = '1' then a <= b; b <= r; c <= d; d <= next_d;
    end if;
  end if;
end process registers;

b_equal_1 <= '1' when b = one else '0';

The complete model additionally includes a control unit.

4.2 Binary Algorithm

The binary algorithm ([Knu81]) for computing the gcd of two naturals is based on the following observation: Given two naturals a and b, if both are even then gcd(a, b) = 2gcd(a/2, b/2); if one of them, say b, is even and the other odd then gcd(a, b) = gcd(a, b/2); if both are odd and b greater than or equal to a then gcd(a, b) = gcd(a, b − a) and b − a is even. Assume that a is odd and define $a_0 = a$ and $b_0 = b$. Two sequences $a_1, a_2, a_3, \dots$ and $b_1, b_2, b_3, \dots$ of naturals are generated ([BK83]): given $a_i$ and $b_i$ with $a_i$ odd, then

If $b_i$ is even: $a_{i+1} = a_i$, $b_{i+1} = b_i/2$;
If $b_i$ is odd and $b_i \ge a_i$: $a_{i+1} = a_i$, $b_{i+1} = b_i - a_i$;
If $b_i$ is odd and $b_i < a_i$: $a_{i+1} = b_i$, $b_{i+1} = a_i - b_i$.

Obviously $a_{i+1}$ is odd and $\gcd(a_{i+1}, b_{i+1}) = \gcd(a_i, b_i)$ so that

$\gcd(a_{i+1}, b_{i+1}) = \gcd(a_i, b_i) = \dots = \gcd(a_0, b_0) = \gcd(a, b)$   (4.22)

In order to demonstrate the convergence of this iteration, compare $a_{i+1} + b_{i+1}$ with $a_i + b_i$: in the first case $a_{i+1} + b_{i+1} = a_i + b_i/2 \le a_i + b_i$; in the second case $a_{i+1} + b_{i+1} = b_i < a_i + b_i$ ($a_i$ is odd); in the third case $a_{i+1} + b_{i+1} = a_i < a_i + b_i$ ($b_i$ is odd). Thus $a_{i+1} + b_{i+1} < a_i + b_i$, unless $b_i = 0$. In conclusion, after a finite number of steps $b_i = 0$ so that, according to Eq. (4.22), $\gcd(a, b) = \gcd(a_i, 0) = a_i$.

Lemma 4.2 After a finite number of steps, ai = gcd(a, b).
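The three update rules can be sketched in a few lines of Python. This is only an illustration of the iteration behind Lemma 4.2, assuming an odd natural a and a natural b; the function name binary_gcd is hypothetical.

```python
def binary_gcd(a, b):
    """gcd(a, b) for odd a > 0 and b >= 0, using the three update rules."""
    if a <= 0 or a % 2 == 0:
        raise ValueError("a must be an odd natural")
    while b != 0:
        if b % 2 == 0:            # b even: halve b
            b //= 2
        elif b >= a:              # b odd, b >= a: replace b by b - a (even)
            b -= a
        else:                     # b odd, b < a: swap roles, a stays odd
            a, b = b, a - b
    return a


if __name__ == "__main__":
    print(binary_gcd(21, 14), binary_gcd(9, 28))
```

Only shifts, comparisons, and subtractions are needed, which is what makes the iteration attractive for hardware.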

In the particular case where a = p (a prime greater than 2 and thus an odd natural) and b = y, the gcd is equal to 1, that is, $a_i = 1$. For computing $z = xy^{-1} \bmod p$, two additional sets of values $c_1, c_2, c_3, \dots$ and $d_1, d_2, d_3, \dots$ are computed in parallel. The initial values are $c_0 = 0$ and $d_0 = x$. Then

If $b_i$ is even: $c_{i+1} = c_i$, $d_{i+1} = d_i \cdot 2^{-1} \bmod p$;
If $b_i$ is odd and $b_i \ge a_i$: $c_{i+1} = c_i$, $d_{i+1} = d_i - c_i \bmod p$;
If $b_i$ is odd and $b_i < a_i$: $c_{i+1} = d_i$, $d_{i+1} = c_i - d_i \bmod p$.

The following lemma is demonstrated by induction:

Lemma 4.3 $c_i y \equiv a_i x \bmod p$ and $d_i y \equiv b_i x \bmod p$

Proof For i = 0: $c_0 y = 0$ and $a_0 x = px \equiv 0 \bmod p$, $d_0 y = xy$ and $b_0 x = yx$. For the induction step, the values of $c_{i+1}$ and $d_{i+1}$ are computed from $c_i$ and $d_i$ in the same way as the values of $a_{i+1}$ and $b_{i+1}$ are computed from $a_i$ and $b_i$, except that the conventional arithmetic operations are replaced by mod p operations.

In conclusion, after a finite number of steps $a_i = 1$ (Lemma 4.2) and $c_i y \equiv x \bmod p$ (Lemma 4.3), that is,

$xy^{-1} \bmod p = c_i$   (4.23)

Given an element w of $Z_p$, the value of $w \cdot 2^{-1} \bmod p$ is computed as follows: if w is even, $w \cdot 2^{-1} \bmod p = w/2$; if w is odd, $w \cdot 2^{-1} \bmod p = (w + p)/2$. Assume that a function

function divide_by_2(w, p: in integer) return integer

returning $w \cdot 2^{-1} \bmod p$ has been defined.

Algorithm 4.4—mod p division, binary algorithm

a := p; b := y; c := 0; d := x;
while a > 1 loop
   while (b mod 2) = 0 loop
      b := b/2; d := Divide_By_2(d, P);
   end loop;
   if b >= a then
      b := b-a; d := (d-c) mod P;
   else
      Old_b := b; b := a-b; a := Old_b;
      Old_d := d; d := (c-d) mod P; c := Old_d;
   end if;
end loop;
Z := c;
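A direct Python transcription of Algorithm 4.4 can be used to check the invariants of Lemma 4.3 on small primes. The sketch assumes p is an odd prime, 0 < y < p, and 0 ≤ x < p; it models the algorithm only, not the circuit of Fig. 4.4, and the Python function names are illustrative.

```python
def divide_by_2(w, p):
    """w * 2^(-1) mod p for 0 <= w < p and odd p."""
    return w // 2 if w % 2 == 0 else (w + p) // 2


def binary_mod_p_division(x, y, p):
    """Sketch of Algorithm 4.4: z = x * y^(-1) mod p."""
    a, b, c, d = p, y, 0, x
    while a > 1:
        while b % 2 == 0:                 # strip factors of 2 from b
            b //= 2
            d = divide_by_2(d, p)
        if b >= a:
            b -= a
            d = (d - c) % p
        else:
            a, b = b, a - b               # old b becomes the new a
            c, d = d, (c - d) % p         # old d becomes the new c
        # Lemma 4.3: c*y ≡ a*x and d*y ≡ b*x (mod p) hold at this point
    return c


if __name__ == "__main__":
    z = binary_mod_p_division(1, 3, 7)
    print(z, (z * 3) % 7)
```

The tuple assignments play the role of the temporaries Old_b and Old_d of the Ada text.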

An executable Ada file binary_algorithm.adb, including Algorithm 4.4, is available at www.arithmetic-circuits.org. The datapath corresponding to Algorithm 4.4 is shown in Fig. 4.4. The minimum clock period is determined by the k-bit adders and subtractors, that is, about $kT_{FA}$ if ripple adders are used. Let t be the number of executions of the main loop of Algorithm 4.4. The total computation time T is approximately equal to

$T \approx tkT_{FA}$   (4.24)

As $a_0 + b_0 = p + y > a_1 + b_1 > \dots > a_i + b_i = 1 + b_i$, with y < p, an upper bound of t is $2p < 2^{k+1}$, so that an upper (very pessimistic) bound of the computation time is

$T < k \cdot 2^{k+1}\,T_{FA}$   (4.25)

FIGURE 4.4 Binary algorithm.

A complete VHDL file binary_algorithm.vhd is available at www.arithmetic-circuits.org. The entity declaration is

entity binary_algorithm is
port(
  x, y: in std_logic_vector(K-1 downto 0);
  clk, reset, start: in std_logic;
  z: out std_logic_vector(K-1 downto 0);
  done: out std_logic
);
end binary_algorithm;

The VHDL architecture corresponding to the circuit of Fig. 4.4 is the following:

long_b <= '0' & b; long_a <= '0' & a; long_d <= '0' & d;
b_minus_a <= long_b - long_a;
sign <= b_minus_a(K);
a_minus_b <= a - b;
with sign select ba_minus_ab <= b_minus_a(K-1 downto 0) when '0', a_minus_b when others;
with sign select ab <= a when '0', b when others;
and_gates: for i in 0 to K-1 generate
  p_by_d0(i) <= d(0) and p(i);
end generate;
p_by_d0(K) <= '0';
d_plus_p <= long_d + p_by_d0;
divide_by_2: for i in 0 to K-1 generate
  half_d(i) <= d_plus_p(i+1);
end generate;
subtractor1: subtractor port map(d, c, d_minus_c);
subtractor2: subtractor port map(c, d, c_minus_d);
with sign select dc_minus_cd <= d_minus_c when '0', c_minus_d when others;
with sign select cd <= c when '0', d when others;
half_b <= '0' & b(K-1 downto 1);
with sel select next_a <= p when "00", a when "01", ab when others;
with sel select next_b <= y when "00", half_b when "01", ba_minus_ab when others;
with sel select next_c <= ZERO when "00", c when "01", cd when others;
with sel select next_d <= x when "00", half_d when "01", dc_minus_cd when others;
a_equal_1 <= '1' when a = ONE else '0';
b_0 <= b(0);
z <= c;

parallel_registers: process(clk)
begin
  if clk'event and clk = '1' then
    if ce = '1' then
      a <= next_a; b <= next_b; c <= next_c; d <= next_d;
    end if;
  end if;
end process parallel_registers;

The complete model additionally includes a control unit.

4.3 Plus-Minus Algorithm

The plus-minus algorithm ([Tak98], [MBQ04], [DS06]) is a modified version of the binary algorithm. It is based on the following observations: if $a_i$ and $b_i$ are odd, $b_i + a_i$ and $b_i - a_i$ are even and their sum $(b_i + a_i) + (b_i - a_i) = 2b_i$ cannot be a multiple of 4 ($b_i$ is odd), so that either $(b_i + a_i) \bmod 4 = 0$ and $(b_i - a_i) \bmod 4 = 2$, or $(b_i + a_i) \bmod 4 = 2$ and $(b_i - a_i) \bmod 4 = 0$. This allows dividing by 4, instead of 2, and possibly speeding up the computation. Another modification consists of allowing negative values of $a_i$ and $b_i$, and using a convergence criterion based on their absolute values $|a_i|$ and $|b_i|$. Actually, logarithmic estimations of $|a_i|$ and $|b_i|$ are used, that is, integers $\alpha_i$ and $\beta_i$ such that

$|a_i| < 2^{\alpha_i}$ and $|b_i| < 2^{\beta_i}$   (4.26)

Initially $\alpha_0 = \beta_0 = k$. As before assume that a is odd and define $a_0 = a$ and $b_0 = b$. Two sequences $a_1, a_2, a_3, \dots$ and $b_1, b_2, b_3, \dots$ of integers are generated: given $a_i$ and $b_i$ with $a_i$ odd, then

If $b_i \bmod 4 = 0$: $a_{i+1} = a_i$, $b_{i+1} = b_i/4$, $\alpha_{i+1} = \alpha_i$ and $\beta_{i+1} = \beta_i - 2$
If $b_i \bmod 4 \ne 0$ and $b_i$ is even: $a_{i+1} = a_i$, $b_{i+1} = b_i/2$, $\alpha_{i+1} = \alpha_i$ and $\beta_{i+1} = \beta_i - 1$
If $b_i$ is odd, $(b_i + a_i) \bmod 4 = 0$ and $\beta_i \ge \alpha_i$: $a_{i+1} = a_i$, $b_{i+1} = (b_i + a_i)/4$, $\alpha_{i+1} = \alpha_i$ and $\beta_{i+1} = \beta_i - 1$
If $b_i$ is odd, $(b_i + a_i) \bmod 4 = 0$ and $\beta_i < \alpha_i$: $a_{i+1} = b_i$, $b_{i+1} = (b_i + a_i)/4$, $\alpha_{i+1} = \beta_i$ and $\beta_{i+1} = \alpha_i - 1$
If $b_i$ is odd, $(b_i - a_i) \bmod 4 = 0$ and $\beta_i \ge \alpha_i$: $a_{i+1} = a_i$, $b_{i+1} = (b_i - a_i)/4$, $\alpha_{i+1} = \alpha_i$ and $\beta_{i+1} = \beta_i - 1$
If $b_i$ is odd, $(b_i - a_i) \bmod 4 = 0$ and $\beta_i < \alpha_i$: $a_{i+1} = b_i$, $b_{i+1} = (b_i - a_i)/4$, $\alpha_{i+1} = \beta_i$ and $\beta_{i+1} = \alpha_i - 1$

Obviously $a_{i+1}$ is odd and $\gcd(a_{i+1}, b_{i+1}) = \gcd(a_i, b_i)$ so that

$\gcd(a_{i+1}, b_{i+1}) = \gcd(a_i, b_i) = \dots = \gcd(a_0, b_0) = \gcd(a, b)$   (4.27)

In order to demonstrate the convergence of this iteration, observe that $\alpha_{i+1} + \beta_{i+1} < \alpha_i + \beta_i$. Observe also that as long as $\alpha_i > 0$ and $\beta_i > 0$, $\alpha_{i+1} > 0$. In conclusion, after a finite number of steps $\beta_i \le 0$, that is, $|b_i| < 1$, and thus $b_i = 0$, so that $\gcd(a, b) = \gcd(a_i, 0) = |a_i|$.

Lemma 4.4 After a finite number of steps, ⏐ai⏐ = gcd(a, b).

In the particular case where a = p and b = y, the gcd is equal to 1, that is, $|a_i| = 1$. For computing $z = xy^{-1} \bmod p$, two additional sets of values $c_1, c_2, c_3, \dots$ and $d_1, d_2, d_3, \dots$ are computed in parallel. The initial values are $c_0 = 0$ and $d_0 = x$. Then

If $b_i \bmod 4 = 0$: $c_{i+1} = c_i$, $d_{i+1} = d_i \cdot 4^{-1} \bmod p$
If $b_i \bmod 4 \ne 0$ and $b_i$ is even: $c_{i+1} = c_i$, $d_{i+1} = d_i \cdot 2^{-1} \bmod p$
If $b_i$ is odd, $(b_i + a_i) \bmod 4 = 0$ and $\beta_i \ge \alpha_i$: $c_{i+1} = c_i$, $d_{i+1} = (d_i + c_i) \cdot 4^{-1} \bmod p$
If $b_i$ is odd, $(b_i + a_i) \bmod 4 = 0$ and $\beta_i < \alpha_i$: $c_{i+1} = d_i$, $d_{i+1} = (d_i + c_i) \cdot 4^{-1} \bmod p$
If $b_i$ is odd, $(b_i - a_i) \bmod 4 = 0$ and $\beta_i \ge \alpha_i$: $c_{i+1} = c_i$, $d_{i+1} = (d_i - c_i) \cdot 4^{-1} \bmod p$
If $b_i$ is odd, $(b_i - a_i) \bmod 4 = 0$ and $\beta_i < \alpha_i$: $c_{i+1} = d_i$, $d_{i+1} = (d_i - c_i) \cdot 4^{-1} \bmod p$

In conclusion, after a finite number of steps $|a_i| = 1$ (Lemma 4.4) and $c_i y \equiv a_i x \bmod p$ (same proof as Lemma 4.3), that is,

$xy^{-1} \bmod p = c_i$ if $a_i = 1$
$xy^{-1} \bmod p = p - c_i$ if $a_i = -1$   (4.28)

Given an integer w, the value of an integer equivalent to $w \cdot 2^{-1} \bmod p$ can be computed as follows:

if w mod 2 = 0 then $w/2 \equiv w \cdot 2^{-1} \bmod p$
if w mod 2 = 1 then $(w + p)/2 \equiv w \cdot 2^{-1} \bmod p$   (4.29)

and the value of an integer equivalent to $w \cdot 4^{-1} \bmod p$ as follows: if p mod 4 = 1 then

if w mod 4 = 0 then $w/4 \equiv w \cdot 4^{-1} \bmod p$
if w mod 4 = 1 then $(w - p)/4 \equiv w \cdot 4^{-1} \bmod p$
if w mod 4 = 2 then $(w + 2p)/4 \equiv w \cdot 4^{-1} \bmod p$
if w mod 4 = 3 then $(w + p)/4 \equiv w \cdot 4^{-1} \bmod p$   (4.30)

and if p mod 4 = 3 then

if w mod 4 = 0 then $w/4 \equiv w \cdot 4^{-1} \bmod p$
if w mod 4 = 1 then $(w + p)/4 \equiv w \cdot 4^{-1} \bmod p$
if w mod 4 = 2 then $(w + 2p)/4 \equiv w \cdot 4^{-1} \bmod p$
if w mod 4 = 3 then $(w - p)/4 \equiv w \cdot 4^{-1} \bmod p$   (4.31)

If w belongs to the interval −p < w < p then −p < w/2 < p and −p < (w + p)/2 < p, and if w belongs to the interval −2p < w < 2p then −p < w/4 < p, −p < (w + p)/4 < p, −p < (w + 2p)/4 < p, and −p < (w − p)/4 < p. Assume that the functions

function divide_by_2(w, p: in integer) return integer
function divide_by_4(w, p: in integer) return integer

returning integers equivalent to $w \cdot 2^{-1} \bmod p$ and $w \cdot 4^{-1} \bmod p$, respectively, according to the sets of Eqs. (4.29) to (4.31), have been defined. During the execution of the following algorithm, a, b, c, and d will remain included between −p and p, so that they can be represented as (k + 1)-bit 2’s complement numbers. A final correction step is necessary if c < 0.
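The two helper functions can be sketched in Python as table lookups indexed by w mod 4, following Eqs. (4.29) to (4.31). The bodies below are an assumption about how the functions might be written; only their names and purpose come from the text. Python's % operator returns a nonnegative residue for negative w, which matches the case distinctions used in the equations.

```python
def divide_by_2(w, p):
    """Integer congruent to w * 2^(-1) mod p, Eq. (4.29); p odd."""
    return w // 2 if w % 2 == 0 else (w + p) // 2


def divide_by_4(w, p):
    """Integer congruent to w * 4^(-1) mod p, Eqs. (4.30) and (4.31); p odd."""
    if p % 4 == 1:
        correction = (0, -p, 2 * p, p)[w % 4]     # Eq. (4.30)
    else:                                         # p mod 4 = 3
        correction = (0, p, 2 * p, -p)[w % 4]     # Eq. (4.31)
    return (w + correction) // 4                  # exact division


if __name__ == "__main__":
    p = 13
    v = divide_by_4(25, p)
    print(v, (4 * v - 25) % p)    # second value is 0 when 4*v ≡ 25 (mod p)
```

In each case the correction term is the multiple of p that makes w divisible by 4, so no modular inverse has to be computed explicitly.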

Algorithm 4.5—mod p division, plus-minus algorithm

a := p; b := y; c := 0; d := x; alpha := k; beta := k;
while beta > 0 loop
   if b mod 4 = 0 then
      b := b/4; d := divide_by_4(d, p); beta := beta - 2;
   elsif b mod 2 = 0 then
      b := b/2; d := divide_by_2(d, p); beta := beta - 1;
   else
      old_b := b; old_d := d; old_alpha := alpha; old_beta := beta;
      if (b+a) mod 4 = 0 then
         b := (b+a)/4; d := divide_by_4(d+c, p);
      else
         b := (b-a)/4; d := divide_by_4(d-c, p);
      end if;
      if beta < alpha then
         a := old_b; c := old_d; alpha := old_beta; beta := old_alpha - 1;
      else
         beta := beta - 1;
      end if;
   end if;
end loop;
if c < 0 then c := c + p; end if;
if a = 1 then z := c; else z := p-c; end if;
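The following Python sketch mirrors Algorithm 4.5 line by line so that the result can be checked against $xy^{-1} \bmod p$ on small primes. It assumes p is an odd prime with $p < 2^k$, 0 < y < p, and 0 < x < p; the helper bodies and the function name plus_minus_division are illustrative, and the sketch models the algorithm rather than the circuit of Fig. 4.5.

```python
def divide_by_2(w, p):
    return w // 2 if w % 2 == 0 else (w + p) // 2


def divide_by_4(w, p):
    # correction chosen according to w mod 4, Eqs. (4.30) and (4.31)
    table = (0, -p, 2 * p, p) if p % 4 == 1 else (0, p, 2 * p, -p)
    return (w + table[w % 4]) // 4


def plus_minus_division(x, y, p, k):
    """Sketch of Algorithm 4.5: z = x * y^(-1) mod p, with p < 2**k."""
    a, b, c, d = p, y, 0, x
    alpha = beta = k
    while beta > 0:
        if b % 4 == 0:
            b //= 4
            d = divide_by_4(d, p)
            beta -= 2
        elif b % 2 == 0:
            b //= 2
            d = divide_by_2(d, p)
            beta -= 1
        else:
            old_b, old_d = b, d
            if (b + a) % 4 == 0:
                b = (b + a) // 4
                d = divide_by_4(d + c, p)
            else:                          # then (b - a) mod 4 = 0
                b = (b - a) // 4
                d = divide_by_4(d - c, p)
            if beta < alpha:               # swap the roles of a and b
                a, c = old_b, old_d
                alpha, beta = beta, alpha - 1
            else:
                beta -= 1
    if c < 0:                              # final correction step
        c += p
    return c if a == 1 else p - c


if __name__ == "__main__":
    z = plus_minus_division(1, 5, 13, 4)
    print(z, (z * 5) % 13)
```

Since every branch decreases alpha + beta by at least 1, the loop body is executed fewer than 2k times.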

In order to avoid the comparison of $\alpha_i$ and $\beta_i$, an alternative method consists of updating at each step

$dif_i = \beta_i - \alpha_i$ and $min_i = \min(\alpha_i, \beta_i)$   (4.32)

Algorithm 4.6—mod p division, plus-minus algorithm, second version

a := p; b := y; c := 0; d := x; dif := 0; min := k;
while min > 0 loop
   if b mod 4 = 0 then
      b := b/4; d := divide_by_4(d, p);
      if dif <= 0 then min := min - 2;
      elsif dif = 1 then min := min - 1;
      end if;
      dif := dif - 2;
   elsif b mod 2 = 0 then
      b := b/2; d := divide_by_2(d, p);
      if dif <= 0 then min := min - 1; end if;
      dif := dif - 1;
   else
      old_b := b; old_d := d;
      if (b+a) mod 4 = 0 then
         b := (b+a)/4; d := divide_by_4(d+c, p);
      else
         b := (b-a)/4; d := divide_by_4(d-c, p);
      end if;
      if dif < 0 then
         a := old_b; c := old_d; dif := -dif - 1;
      elsif dif = 0 then
         dif := dif - 1; min := min - 1;
      else
         dif := dif - 1;
      end if;
   end if;
end loop;
if c < 0 then c := c + p; end if;
if a = 1 then z := c; else z := p-c; end if;

Executable Ada files plus_minus_algorithm.adb and plus_minus_algorithm2.adb, including Algorithms 4.5 and 4.6, are available at www.arithmetic-circuits.org. A datapath for executing either Algorithm 4.5 or 4.6 is shown in Fig. 4.5. The value of next_b is defined by the control signals oper(1..0) and sel_bd (Table 4.1). Observe that if a and b are odd, then (b + a)/2 = ⌊b/2⌋ + ⌊a/2⌋ + 1 and (b − a)/2 = ⌊b/2⌋ − ⌊a/2⌋. The final value of z, generated when lastb = 0, can be equal to p + c (if a = 1 and c < 0), p − c (if a = −1 and c ≥ 0), c (if a = 1 and c ≥ 0), or −c (if a = −1 and c < 0).

FIGURE 4.5 Plus-minus algorithm.

sel_bd   oper   next_b
0        − −    y
1        00     b/2
2        00     b/4
2        10     (b + a)/4
2        11     (b − a)/4

TABLE 4.1 next_b in function of sel_bd and oper

For computing next_d the value of corrected_sum must be first computed; it can be equal to d, d + p, d + 2p, d − p, d + c, d + c + p, d + c + 2p, d + c − p, d − c, d − c + p, d − c + 2p, or d − c − p. Tables 4.2 and 4.3 define the value of the control signals for generating z and corrected_sum.

lastb   oper   sel_correction   z
0       10     0                c
0       10     1                p + c
0       11     0                −c
0       11     1                p − c

TABLE 4.2 z in function of oper (lastb = 0)

lastb   oper   sel_correction   corrected_sum
1       0−     0                d
1       0−     1                d + p
1       0−     2                d + 2p
1       0−     3                d − p
1       10     0                d + c
1       10     1                d + c + p
1       10     2                d + c + 2p
1       10     3                d + c − p
1       11     0                d − c
1       11     1                d − c + p
1       11     2                d − c + 2p
1       11     3                d − c − p

TABLE 4.3 corrected_sum in function of oper and sel_correction (lastb = 1)

The minimum clock period is determined by the adders, that is, about $(k + 4)T_{FA}$ if ripple adders are used. Let t be the number of executions of the main loop of Algorithm 4.5 or 4.6. The total computation time T is approximately equal to

$T \approx tkT_{FA}$   (4.33)

As $\alpha_0 + \beta_0 = 2k > \alpha_1 + \beta_1 > \dots > \alpha_{i-1} + \beta_{i-1} > \alpha_i + \beta_i$ with $\alpha_{i-1} > 0$, $\beta_{i-1} > 0$, and $\beta_i \le 0$, an upper bound of t is 2k, so that an upper bound of the computation time is

$T < 2k^2 T_{FA}$   (4.34)

A complete VHDL file plus_minus.vhd is available at www.arithmetic-circuits.org. The entity declaration is

entity plus_minus is
port(
  x, y: in std_logic_vector(K-1 downto 0);
  clk, reset, start: in std_logic;
  z: out std_logic_vector(K-1 downto 0);
  done: out std_logic
);
end plus_minus;

The VHDL architecture corresponding to the circuit of Fig. 4.5 is the following:

half_b <= b(K) & b(K downto 1);
half_a <= a(K) & a(K downto 1);
gates1: for i in 0 to K generate
  aa(i) <= (oper(1) and (oper(0) xor half_a(i)));
end generate;
sum_ab <= half_b + aa + oper(1);
half_sum_ab <= sum_ab(K) & sum_ab(K downto 1);
with sel_bd select next_b <=
  '0' & y when "00", sum_ab when "01", half_sum_ab when others;
gates2: for i in 0 to k generate
  dd(i) <= lastb and d(i);
  cc(i) <= (oper(1) and (oper(0) xor c(i)));
end generate;
sum_cd <= (dd(k) & dd) + (cc(k) & cc) + oper(0);
with sel_correction select pp <=
  '0' & ZERO when "00", pp1 when "01", TWO_P when "10", pp3 when others;
corrected_sum <= (sum_cd(K+1) & sum_cd) + (pp(K+1) & pp);
z <= corrected_sum(K-1 downto 0);
with sel_bd select next_d <=
  '0' & x when "00",
  corrected_sum(K+1 downto 1) when "01",
  corrected_sum(K+2 downto 2) when others;
with sel_ac select next_a <= p when '0', b when others;
with sel_ac select next_c <= ZERO when '0', d when others;

registers_ac: process(clk)
begin
  if clk'event and clk = '1' then
    if ce_ac = '1' then a <= next_a; c <= next_c; end if;
  end if;
end process registers_ac;

registers_bd: process(clk)
begin
  if clk'event and clk = '1' then
    if ce_bd = '1' then b <= next_b; d <= next_d; end if;
  end if;
end process registers_bd;

The complete model additionally includes registers for storing dif and min, combinational circuits that compute branching conditions, such as min < 0, b mod 4 = 0, dif ≤ 0, and so on, and a control unit.

4.4 Fermat’s Little Theorem

Fermat’s little theorem states that if y is a nonzero element of $Z_p$, then $y^{p-1} \bmod p = 1$. In particular, given x in $Z_p$, then $y(xy^{p-2}) \bmod p = xy^{p-1} \bmod p = x$. Thus,

$z = xy^{p-2} \bmod p$   (4.35)

A straightforward modification of Algorithm 3.14 computes $z = xy^{p-2} \bmod p$.

Algorithm 4.7—mod p division based on Fermat’s little theorem

e := exp_k; tx := mp(x, exp_2k); ty := mp(y, exp_2k);
for i in 1 .. k loop
   e := mp(e, e);
   if binary_p_minus_2(k-i) = 1 then e := mp(e, ty); end if;
end loop;
e := mp(e, tx);
z := mp(e, 1);
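Algorithm 4.7 can be exercised in software with a purely functional model of the Montgomery product. In the Python sketch below, mp(a, b) is assumed to return $ab \cdot 2^{-k} \bmod p$, exp_k stands for $2^k \bmod p$, and exp_2k for $2^{2k} \bmod p$; these definitions are inferred from the way the constants are used and are not a description of the Montgomery multiplier circuit. The three-argument pow with a negative exponent requires Python 3.8 or later.

```python
def mp(a, b, p, k):
    """Functional model of a Montgomery product: a * b * 2^(-k) mod p."""
    return a * b * pow(2, -k, p) % p


def fermat_division(x, y, p, k):
    """Sketch of Algorithm 4.7: z = x * y^(p-2) mod p, with p < 2**k."""
    exp_k = pow(2, k, p)             # Montgomery representation of 1
    exp_2k = pow(2, 2 * k, p)
    e = exp_k
    tx = mp(x, exp_2k, p, k)         # x * 2^k mod p
    ty = mp(y, exp_2k, p, k)         # y * 2^k mod p
    for i in range(1, k + 1):
        e = mp(e, e, p, k)           # square
        if ((p - 2) >> (k - i)) & 1: # bit k-i of p - 2, scanned from the top
            e = mp(e, ty, p, k)      # multiply
    e = mp(e, tx, p, k)
    return mp(e, 1, p, k)            # leave the Montgomery domain


if __name__ == "__main__":
    z = fermat_division(1, 5, 13, 4)
    print(z, (z * 5) % 13)
```

The loop is a left-to-right square-and-multiply exponentiation carried out entirely on Montgomery representations, which is why a single conversion, mp(e, 1), suffices at the end.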

An executable Ada file Fermat_division.adb, including Algorithm 4.7, is available at www.arithmetic-circuits.org. The corresponding datapath, similar to Fig. 3.10, is shown in Fig. 4.6. Algorithm 4.7 mainly consists of k executions of a loop that in turn includes at most two Montgomery products. According to Eq. (3.24) the computation time of a Montgomery product is about $3kT_{FA}$, so that the total computation time is approximately equal to

$T \approx 6k^2 T_{FA}$   (4.36)

FIGURE 4.6 Divider based on Fermat’s little theorem.

A complete VHDL file Fermat_divider.vhd is available at www.arithmetic-circuits.org. The entity declaration is

entity Fermat_divider is
port (
  x, y: in std_logic_vector(K-1 downto 0);
  clk, reset, start: in std_logic;
  z: out std_logic_vector(K-1 downto 0);
  done: out std_logic
);
end Fermat_divider;

The VHDL architecture corresponding to the circuit of Fig. 4.6 is the following:

with cont1 select operand1 <= xy when "00", e when others;
with cont1 select operand2 <=
  EXP_2K when "00", e when "01", txy when "10", ONE when others;
with cont2 select xy <= x when '0', y when others;
main_component: Montgomery_multiplier port map(operand1, operand2, clk, reset, start_mp, result, mp_done);
z <= result;

register_e: process(clk)
begin
  if clk'event and clk = '1' then
    if load = '1' then e <= EXP_K;
    elsif ce_e = '1' then e <= result;
    end if;
  end if;
end process register_e;

register_txy: process(clk)
begin
  if clk'event and clk = '1' then
    if ce_txy = '1' then txy <= result; end if;
  end if;
end process register_txy;

shift_register: process(clk)
begin
  if clk'event and clk = '1' then
    if load = '1' then int_P_MINUS_2 <= P_MINUS_2;
    elsif update = '1' then
      for i in 1 to k-1 loop
        int_P_MINUS_2(K-i) <= int_P_MINUS_2(K-i-1);
      end loop;
      int_P_MINUS_2(0) <= '0';
    end if;
  end if;
end process shift_register;

ser_out <= int_P_MINUS_2(K-1);

The complete model additionally includes a (k + 1)-state counter and a control unit.

4.5 Comparison

In this chapter four division algorithms were considered: the Euclidean algorithm, the binary algorithm, the plus-minus algorithm, and an algorithm based on Fermat’s little theorem. The corresponding approximate computation times are the following (Eqs. 4.20, 4.24, 4.33, and 4.36):

Division algorithm         Computation time
Euclidean                  $4k^2 t\,T_{FA}$
Binary                     $kt\,T_{FA}$
Plus-minus                 $kt\,T_{FA}$
Fermat’s little theorem    $6k^2\,T_{FA}$

Note: t is the number of executions of the main loop.

TABLE 4.4 Approximate Computation Times

Obviously, the binary and plus-minus algorithms give the shortest computation times. Furthermore, the number of executions of the main loop is smaller in the case of the plus-minus algorithm. So, the latter usually generates the fastest divider.

4.6 FPGA Implementations

Several dividers have been implemented within Spartan3 (speed-5) programmable devices. As before, the times (period, total time) are expressed in ns, and the parameters FFs and LUTs represent the numbers of flip-flops and look-up tables, respectively. Every slice includes two flip-flops and two look-up tables. All the source files are available at www.arithmetic-circuits.org.

4.6.1 Euclidean Algorithm

The divider of Fig. 4.3, based on the Euclidean algorithm, has been implemented. The number of cycles depends on the value of the operands. The cost and period of several Euclidean dividers are shown in Table 4.5.

k      FFs     LUTs    Slices   Period
8      123     197     120      5.1
32     459     729     435      7.8
64     906     1,426   849      13.1
128    1,805   2,809   1,772    18.7
192    2,703   4,205   2,644    26.1
256    3,605   5,579   3,519    36.6

TABLE 4.5 Cost and Period of Euclidean Dividers

In order to get an estimation of the computation time, 1,000 pairs (x, y) have been generated for several values of k, and the corresponding numbers of cycles have been observed by simulation. Then, the minimum (MinCycles), maximum (MaxCycles), and average (AverCycles) numbers of cycles have been obtained. By multiplying the average number of cycles by the period, an estimation of the average computation time AverTime can be computed (Table 4.6).

k      Period   MinCycles   MaxCycles   AverCycles   AverTime
8      5.1      35          203         110.3        563
32     7.8      683         2,051       1,348.3      10,517
64     13.1     3,331       7,003       5,112.6      66,975
128    18.7     15,179      24,155      19,622.1     366,933
192    26.1     32,731      55,467      43,782.7     1,142,728
256    36.6     64,219      88,659      77,624.6     2,841,060

TABLE 4.6 Average Delay of Euclidean Dividers

4.6.2 Binary Algorithm

The divider of Fig. 4.4, based on the binary algorithm, has been implemented. The number of cycles depends on the value of the operands. The cost and period of several dividers are shown in Table 4.7.

k      FFs     LUTs    Slices   Period
8      34      145     78       6.9
32     130     510     285      8.9
64     259     1,069   590      11.6
128    515     2,198   1,364    15.8
192    771     3,401   2,091    19.9
256    1,027   4,164   2,754    26.5

TABLE 4.7 Cost and Period of Dividers Based on the Binary Algorithm

In order to get an estimation of the computation time, 1,000 pairs (x, y) have been generated for several values of k, and the corresponding numbers of cycles have been observed by simulation. Then, the minimum (MinCycles), maximum (MaxCycles), and average (AverCycles) numbers of cycles have been obtained, and the average computation time AverTime has been computed (Table 4.8).

k      Period   MinCycles   MaxCycles   AverCycles   AverTime
8      6.9      3           19          14.2         98
32     8.9      41          77          65.3         582
64     11.6     112         149         132.8        1,541
128    15.8     243         287         268.8        4,247
192    19.9     376         425         404.3        8,046
256    26.5     509         568         539.6        14,299

TABLE 4.8 Average Delay of Dividers Based on the Binary Algorithm

4.6.3 Plus-Minus Algorithm

The divider of Fig. 4.5, based on the plus-minus algorithm, has been implemented. The number of cycles depends on the value of the operands. The cost and period of several dividers are shown in Table 4.9.

k      FFs     LUTs    Slices   Period
8      52      188     99       10.5
32     151     402     206      13.9
64     282     727     395      17.0
128    542     1,372   750      21.9
192    798     2,016   1,103    26.2
256    1,057   2,697   1,467    28.3

TABLE 4.9 Cost and Period of Dividers Based on the Plus-Minus Algorithm

In order to get an estimation of the computation time, 1,000 pairs (x, y) have been generated for several values of k, and the corresponding numbers of cycles have been observed by simulation. Then, the minimum (MinCycles), maximum (MaxCycles), and average (AverCycles) numbers of cycles have been obtained, and the average computation time AverTime has been computed (Table 4.10).

k     Period   MinCycles   MaxCycles   AverCycles   AverTime
8     10.5     12          17          14.4         152
32    13.9     41          53          46.5         647
64    17.0     81          100         89.2         1,517
128   21.9     161         187         175.0        3,833
192   26.2     246         276         260.7        6,831
256   28.3     329         364         346.4        10,087

TABLE 4.10 Average Delay of Dividers Based on the Plus-Minus Algorithm

4.6.4 Fermat’s Little Theorem

The divider of Fig. 4.6, based on Fermat’s little theorem, has been implemented. The average delay of several dividers is shown in Table 4.11.

k     FFs     LUTs    Slices   Period   Cycles    Total time
8     85      167     100      6.9      323       2,229
32    228     445     267      8.8      3,457     30,422
64    420     767     477      11.5     13,360    153,640
128   800     1,412   995      15.2     51,280    779,456
192   1,143   2,012   1,460    19.4     113,483   2,201,570
256   1,530   2,947   1,958    24.4     151,445   3,695,258

TABLE 4.11 Average Delay of Dividers Based on Fermat’s Little Theorem

4.7 Comments and Conclusions

The experimental results confirm the theoretical results. The plus-minus algorithm gives the shortest computation time. Furthermore, it also gives the most cost-effective circuit.

4.8 References

[BK83] R. P. Brent and H. T. Kung. “Systolic Arrays for Linear Time GCD Computation.” Proceedings of VLSI’83, pp. 145–154, 1983.
[DBS06] J.-P. Deschamps, G. Bioul, and G. Sutter. Synthesis of Arithmetic Circuits. Wiley, Hoboken, New Jersey, 2006.
[DS06] J.-P. Deschamps and G. Sutter. “Hardware Implementation of Finite-Field Division.” Acta Applicandae Mathematicae, vol. 93, pp. 119–147, September 2006.
[EL04] M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann, San Francisco, 2004.
[HMV04] D. Hankerson, A. Menezes, and S. Vanstone. Guide to Elliptic Curve Cryptography. Springer, New York, 2004.
[Knu81] D. E. Knuth. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, 2nd ed. Addison-Wesley, MA, USA, 1981.
[MBQ04] G. Meurice de Dormale, Ph. Bulens, and J.-J. Quisquater. “Efficient Modular Division Implementation.” Lecture Notes in Computer Science, vol. 3203, pp. 231–240, 2004.
[MOV96] A. J. Menezes, P. C. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. CRC Press, Boca Raton, Florida, 1996.
[Par00] B. Parhami. Computer Arithmetic. Oxford University Press, New York, 2000.
[Tak98] N. Takagi. “A VLSI Algorithm for Modular Division Based on the Binary GCD Algorithm.” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E81-A, no. 5, pp. 724–728, May 1998.

CHAPTER 5

Operations over Zp[x]/f(x)

Chapter 5 deals with the arithmetic operations over the commutative ring Zp[x]/f(x), with p prime and f(x) not necessarily irreducible. If f(x) is an irreducible polynomial of degree m over Zp, then Zp[x]/f(x) is a field with $p^m$ elements and is called $GF(p^m)$. The elements of Zp[x]/f(x) are polynomials of degree at most m − 1 with coefficients from Zp.

5.1 Addition and Subtraction mod f(x)

Let f(x) be a polynomial of degree m > 0 over Zp. Addition and subtraction of two elements a(x), b(x) ∈ Zp[x]/f(x) are accomplished in a straightforward way by addition/subtraction of the corresponding coefficients [MOV96]. Let $a(x) = \sum_{i=0}^{m-1} a_i x^i$, $b(x) = \sum_{i=0}^{m-1} b_i x^i$, and $c(x) = \sum_{i=0}^{m-1} c_i x^i$ be polynomials in Zp[x]/f(x), where the coefficients $a_i, b_i, c_i \in Z_p$. Then the addition/subtraction c(x) = a(x) ± b(x) is as follows:

$$c(x) = a(x) \pm b(x) = \sum_{i=0}^{m-1} c_i x^i \quad \text{with } c_i = (a_i \pm b_i) \bmod p \tag{5.1}$$

A reduction modulo p (i.e., an addition or subtraction of p) is necessary whenever the sum or difference of two coefficients ai and bi is outside the range of [0, . . . , p – 1]. There are no carries propagating between the coefficients. Operations modulo p have been studied in Chap. 3. The addition of two elements a(x) + b(x) in Zp[x]/f(x) is accomplished using Eq. (5.1) as follows:

Algorithm 5.1—Addition of polynomials mod p

    for i in 0 .. m-1 loop
      c(i) := (a(i)+b(i)) mod p;
    end loop;

where a(x) and b(x) are defined as polynomials with maximum degree m − 1. Assume that the function

    function mod_m_addition(x, y, p, k: natural) return natural

computing (x + y) mod p, with p a k-bit natural, is available. This function implements the optimized binary mod p addition given in Algorithm 3.2. Then the addition of two polynomials a(x) + b(x) in Zp[x]/f(x) is accomplished using Eq. (5.1) as follows:

Algorithm 5.2—Addition of polynomials mod p, version 2

    for i in 0 .. m-1 loop
      c(i) := mod_m_addition(a(i),b(i),p,m);
    end loop;

where k has been particularized to be equal to m, and where the coefficient vectors a, b, and c are indexed from 0 to m − 1. An executable Ada file addition_mod_f_poly.adb, including Algorithm 5.2, is available at www.arithmetic-circuits.org. A VHDL model for the second version of the addition of polynomials mod p (Algorithm 5.2) is given in the file adder_polynom.vhd, which is available at www.arithmetic-circuits.org. The entity declaration is

    entity adder_polynom is
      port(
        a, b: in polynomial;
        z: out polynomial
      );
    end adder_polynom;

The VHDL architecture is the following:

    gen: for i in 0 to M-1 generate
      addition: process(a,b)
        variable z1, z2: std_logic_vector(K downto 0);
      begin
        z1 := a(i) + ('0' & b(i));     -- (K+1)-bit sum a(i) + b(i)
        z2 := z1 - P;                  -- sum minus the modulus
        if z2(K) = '0' then            -- no borrow: the sum was >= P
          z(i) <= z2(K-1 downto 0);
        else
          z(i) <= z1(K-1 downto 0);
        end if;
      end process;
    end generate;

The subtraction of two elements a(x) − b(x) in Zp[x]/f(x) is accomplished using Eq. (5.1) as follows:

Algorithm 5.3—Subtraction of polynomials mod p

    for i in 0 .. m-1 loop
      c(i) := (a(i)-b(i)) mod p;
    end loop;

where a(x) and b(x) are defined as polynomials with maximum degree m − 1. Assume that the function

    function mod_m_subtraction(x, y, p, k: natural) return natural

computing (x − y) mod p, with p a k-bit natural, is available. This function implements the optimized binary mod p subtraction given in Algorithm 3.4. Then the subtraction of two polynomials a(x) − b(x) in Zp[x]/f(x) is accomplished using Eq. (5.1) as follows:

Algorithm 5.4—Subtraction of polynomials mod p, version 2

    for i in 0 .. m-1 loop
      c(i) := mod_m_subtraction(a(i),b(i),p,m);
    end loop;

where k has been particularized to be equal to m, and where the coefficient vectors a, b, and c are indexed from 0 to m − 1. An executable Ada file subtraction_mod_f_poly.adb, including Algorithm 5.4, is available at www.arithmetic-circuits.org. A VHDL model for the second version of the subtraction of polynomials mod p (Algorithm 5.4) is given in the file subtractor_polynom.vhd, which is available at www.arithmetic-circuits.org. The entity declaration is

    entity subtractor_polynom is
      port(
        a, b: in polynomial;
        z: out polynomial
      );
    end subtractor_polynom;

The VHDL architecture is the following:

    gen: for i in 0 to M-1 generate
      subt: process(a,b)
        variable z1, z2: std_logic_vector(K downto 0);
      begin
        z1 := ('0' & a(i)) - b(i);     -- (K+1)-bit difference a(i) - b(i)
        z2 := z1 + P;                  -- difference plus the modulus
        if z1(K) = '0' then            -- no borrow: a(i) >= b(i)
          z(i) <= z1(K-1 downto 0);
        else
          z(i) <= z2(K-1 downto 0);
        end if;
      end process;
    end generate;
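As a software cross-check of Algorithms 5.2 and 5.4, the following Python sketch models the coefficient-wise operations with a single conditional correction per coefficient, in the spirit of the z1/z2 selection used in the VHDL models. The function names are illustrative assumptions and are not taken from the Ada or VHDL files; coefficients are assumed to be already reduced to the range 0 .. p − 1.

```python
def mod_p_add(x, y, p):
    """Return (x + y) mod p for 0 <= x, y < p using one conditional subtraction."""
    z1 = x + y          # raw sum, at most 2p - 2
    z2 = z1 - p         # candidate result after removing the modulus once
    return z2 if z2 >= 0 else z1


def mod_p_sub(x, y, p):
    """Return (x - y) mod p for 0 <= x, y < p using one conditional addition."""
    z1 = x - y          # raw difference, between -(p - 1) and p - 1
    z2 = z1 + p         # candidate result after adding the modulus once
    return z1 if z1 >= 0 else z2


def poly_add(a, b, p):
    """Coefficient-wise sum of two equal-length coefficient lists over Z_p."""
    assert len(a) == len(b)
    return [mod_p_add(ai, bi, p) for ai, bi in zip(a, b)]


def poly_sub(a, b, p):
    """Coefficient-wise difference of two equal-length coefficient lists over Z_p."""
    assert len(a) == len(b)
    return [mod_p_sub(ai, bi, p) for ai, bi in zip(a, b)]
```

No carry crosses from one coefficient to the next, which is why the hardware version can instantiate m independent mod p adders or subtractors.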

A parallel adder in Zp[x]/f(x) requires m Zp adders, and its critical path delay is one Zp adder. For the addition given in Algorithm 5.2, where the function mod_m_addition that implements the optimized binary mod p addition presented in Algorithm 3.2 is used, the delay is given in Eq. (3.2).

In a similar way, a parallel subtractor in Zp[x]/f(x) requires m Zp subtractors, and its critical path delay is one Zp subtractor. For the subtraction given in Algorithm 5.4, where the function mod_m_subtraction that implements the optimized binary mod p subtraction presented in Algorithm 3.4 is used, the delay is given in Eq. (3.4). Additionally, a VHDL file adder_subt_polynom.vhd modeling an adder/subtractor is available at www.arithmetic-circuits.org. This model includes the component adder_subtractor (described in Chap. 3) with the following entity declaration:

    entity adder_subtractor is
      port (
        x, y: in std_logic_vector(K-1 downto 0);
        add_sub: in std_logic;
        z: out std_logic_vector(K-1 downto 0)
      );
    end adder_subtractor;

The VHDL architecture is the following:

    long_x <= '0' & x;
    xor_gates1: for i in 0 to K-1 generate
      xor_y(i) <= y(i) xor add_sub;
    end generate;
    xor_y(K) <= '0';
    sum1 <= add_sub + long_x + xor_y;
    c1 <= sum1(K);
    z1 <= sum1(K-1 downto 0);
    long_z1 <= '0' & z1;
    xor_gates2: for i in 0 to K-1 generate
      xor_p(i) <= P(i) xor not(add_sub);
    end generate;
    xor_p(K) <= '0';
    sum2 <= not(add_sub) + long_z1 + xor_p;
    c2 <= sum2(K);
    z2 <= sum2(K-1 downto 0);
    sel <= (not(add_sub) and (c1 or c2)) or (add_sub and not(c1));
    with sel select z <= z1 when '0', z2 when others;

Using the above component, the entity declaration of the adder/subtractor of polynomials mod p is as follows:

    entity add_sub_polynom is
      port(
        a, b: in polynomial;
        add_sub: in std_logic;
        z: out polynomial
      );
    end add_sub_polynom;

The VHDL architecture with the instantiation of the adder_subtractor component is the following:

    main_component: for i in 0 to M-1 generate
      comp1: adder_subtractor port map(x => a(i), y => b(i), add_sub => add_sub, z => z(i));
    end generate;
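The selection logic of the adder_subtractor component can be checked with a small bit-level model. The Python sketch below mirrors the signal names of the VHDL architecture (sum1, c1, z1, sum2, c2, z2, sel); the function signature and the explicit p and k arguments are assumptions made for illustration, since in the VHDL model P and K are constants.

```python
def adder_subtractor(x, y, add_sub, p, k):
    """Bit-level model: (x + y) mod p if add_sub == 0, (x - y) mod p if add_sub == 1.

    x and y are k-bit values in 0 .. p-1, and p is a k-bit modulus.
    """
    mask = (1 << k) - 1

    # First stage: x + y, or x - y computed as x + not(y) + 1 (two's complement).
    xor_y = (y ^ mask) if add_sub else y
    sum1 = add_sub + x + xor_y
    c1 = (sum1 >> k) & 1          # carry out of the first stage
    z1 = sum1 & mask

    # Second stage: correct by -p after an addition, by +p after a subtraction.
    not_add_sub = 1 - add_sub
    xor_p = (p ^ mask) if not_add_sub else p
    sum2 = not_add_sub + z1 + xor_p
    c2 = (sum2 >> k) & 1          # carry out of the second stage
    z2 = sum2 & mask

    # Addition: use z2 when the sum overflowed k bits or was >= p.
    # Subtraction: use z2 when the first stage produced a borrow (c1 = 0).
    sel = (not_add_sub & (c1 | c2)) | (add_sub & (1 - c1))
    return z2 if sel else z1
```

For instance, with p = 239 and k = 8, adding 200 and 100 overflows the 8-bit sum (c1 = 1), so the corrected value z2 = 61 is selected.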

5.2 Multiplication mod f(x)

Let a(x) and b(x) be two elements from Zp[x]/f(x) and c(x) be their product, in such a way that $a(x) = \sum_{i=0}^{m-1} a_i x^i$, $b(x) = \sum_{i=0}^{m-1} b_i x^i$, and $c(x) = \sum_{i=0}^{m-1} c_i x^i$, with coefficients $a_i, b_i, c_i \in Z_p$. Then,

$$c(x) = a(x)b(x) \bmod f(x) \tag{5.2}$$

Thus, the multiplication involves two steps: polynomial multiplication and reduction modulo f(x). The product d(x) of the polynomials representing the elements a(x) and b(x), d(x) = a(x)b(x), is a polynomial of degree 2m − 2, that is, $d(x) = \sum_{i=0}^{2m-2} d_i x^i$. In the modular reduction c(x) = d(x) mod f(x), the polynomial d(x) is reduced iteratively by the degree-m polynomial f(x). The following algorithms ([MS99]) can be used to implement the multiplication of Eq. (5.2).

5.2.1 Two-Step Multiplication

The two-step multiplication in Zp[x]/f(x) is a straightforward translation of the classic school multiplication algorithm. In the two-step multiplication, the product c(x) given in Eq. (5.2) involves two steps: polynomial multiplication and reduction modulo an irreducible polynomial. The polynomial d(x) of degree 2m − 2 can be written in matrix form as follows:

$$
\begin{pmatrix} d_0 \\ d_1 \\ d_2 \\ \vdots \\ d_{m-2} \\ d_{m-1} \\ d_m \\ d_{m+1} \\ \vdots \\ d_{2m-3} \\ d_{2m-2} \end{pmatrix}
=
\begin{pmatrix}
a_0 & 0 & 0 & \cdots & 0 & 0 \\
a_1 & a_0 & 0 & \cdots & 0 & 0 \\
a_2 & a_1 & a_0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
a_{m-2} & a_{m-3} & a_{m-4} & \cdots & a_0 & 0 \\
a_{m-1} & a_{m-2} & a_{m-3} & \cdots & a_1 & a_0 \\
0 & a_{m-1} & a_{m-2} & \cdots & a_2 & a_1 \\
0 & 0 & a_{m-1} & \cdots & a_3 & a_2 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & a_{m-1} & a_{m-2} \\
0 & 0 & 0 & \cdots & 0 & a_{m-1}
\end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_{m-2} \\ b_{m-1} \end{pmatrix}
\tag{5.3}
$$

From Eq. (5.3), the coefficients of d(x) are determined by the following expressions:

$$
d_k = \begin{cases}
\sum_{i=0}^{k} a_i\, b_{k-i}, & k = 0, \ldots, m-1 \\
\sum_{i=k}^{2m-2} a_{k-i+(m-1)}\, b_{i-(m-1)}, & k = m, \ldots, 2m-2
\end{cases}
\tag{5.4}
$$

where additions and multiplications are mod p operations (performed over Zp). Assume that the function dar_mod_multiplication(x, y, m, k) computes xy mod m, with x, y, and m being k-bit numbers, according to the double, add, and reduce multiplication mod m given in Chap. 3 (Algorithm 3.7). Then the function poly_mult_zp(a, b) performing the polynomial multiplication of a(x) and b(x), d(x) = a(x)b(x), where $a(x) = \sum_{i=0}^{m-1} a_i x^i$, $b(x) = \sum_{i=0}^{m-1} b_i x^i$, and $d(x) = \sum_{i=0}^{2m-2} d_i x^i$, with $a_i, b_i, d_i \in Z_p$, can be easily implemented using Eq. (5.4). After the polynomial multiplication, reduction modulo the polynomial f(x) must be performed. In the modular reduction c(x) = d(x) mod f(x), the degree 2m − 2 polynomial d(x) is reduced by the degree m polynomial f(x), resulting in a polynomial c(x) with degree deg(c(x)) ≤ m − 1:

$$c(x) = d(x) \bmod f(x) = (d_{2m-2}x^{2m-2} + \ldots + d_1 x + d_0) \bmod f(x) = c_{m-1}x^{m-1} + \ldots + c_1 x + c_0 \tag{5.5}$$
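The two index ranges of Eq. (5.4) translate directly into two loops. The following Python sketch is one possible software rendering of poly_mult_zp; it is an assumption-laden illustration in which Python's % operator stands in for the mod p multiplier and adder of Chap. 3, and the explicit p argument is added for self-containment.

```python
def poly_mult_zp(a, b, p):
    """Product d(x) = a(x) b(x) over Z_p, following the two cases of Eq. (5.4).

    a and b are coefficient lists [x^0, ..., x^(m-1)]; the result has 2m - 1 entries.
    """
    m = len(a)
    assert len(b) == m
    d = [0] * (2 * m - 1)
    # Lower coefficients: k = 0 .. m-1
    for k in range(m):
        d[k] = sum(a[i] * b[k - i] for i in range(k + 1)) % p
    # Upper coefficients: k = m .. 2m-2
    for k in range(m, 2 * m - 1):
        d[k] = sum(a[k - i + (m - 1)] * b[i - (m - 1)]
                   for i in range(k, 2 * m - 1)) % p
    return d
```

As a check with p = 7, (1 + 2x + 3x^2)(4 + 5x + 6x^2) = 4 + 13x + 28x^2 + 27x^3 + 18x^4, which reduces coefficient-wise to [4, 6, 0, 6, 4].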

The reduction modulo f(x) can be viewed as a linear mapping of the 2m − 1 coefficients of d(x) into the m coefficients of c(x). This mapping can be represented in matrix notation as follows:

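$$
\begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{m-1} \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & \cdots & 0 & r_{0,0} & \cdots & r_{0,m-2} \\
0 & 1 & \cdots & 0 & r_{1,0} & \cdots & r_{1,m-2} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 & r_{m-1,0} & \cdots & r_{m-1,m-2}
\end{pmatrix}
\begin{pmatrix} d_0 \\ \vdots \\ d_{m-1} \\ d_m \\ \vdots \\ d_{2m-2} \end{pmatrix}
\tag{5.6}
$$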

The matrix in Eq. (5.6) consists of an (m × m) identity matrix and an (m × (m − 1)) matrix R, called the reduction matrix. R is a function of the selected polynomial $f(x) = x^m + f_{m-1}x^{m-1} + \ldots + f_1 x + f_0$. Therefore, a reduction matrix R is uniquely assigned to every f(x). The coefficients $r_{j,i} \in Z_p$ of R can be computed recursively from f(x) as follows:

$$
r_{j,i} = \begin{cases}
-f_j, & j = 0, \ldots, m-1;\ i = 0 \\
r_{j-1,i-1} + r_{m-1,i-1}\, r_{j,0}, & j = 0, \ldots, m-1;\ i = 1, \ldots, m-2
\end{cases}
\tag{5.7}
$$

where $r_{j-1,i-1} = 0$ if j = 0. Equation (5.7) follows from the fact that $x^m \equiv -f_{m-1}x^{m-1} - \ldots - f_1 x - f_0 \pmod{f(x)}$. Furthermore, the additions and multiplications involved in Eq. (5.7) are mod p operations. Therefore, the term $(-f_j)$ for i = 0 must be reduced mod p. Mod m reduction was dealt with in Chap. 2. Assume that the function nr_reducer(x, p, n, k) computes x mod p, where x is an integer belonging to the range $-2^n \le x < 2^n$ and p is a natural belonging to the range $2^{k-1} \le p < 2^k$. The function nr_reducer implements the generic digit-recurrence reduction algorithm given in Chap. 2 (Algorithm 2.1). Then the function reduction_matrix_R_zp(f) computing the reduction matrix R can be implemented using Eq. (5.7). Finally, the two-step multiplication performing c(x) = a(x)b(x) mod f(x) = d(x) mod f(x) using Eq. (5.6) can be given, where the previously defined functions poly_mult_zp and reduction_matrix_R_zp are used.

Algorithm 5.5—Multiplication mod f

    d := poly_mult_zp(a,b);
    R := reduction_matrix_R_zp(f);
    for j in 0 .. m-1 loop
      c(j) := d(j);
    end loop;
    for j in 0 .. m-1 loop
      for i in 0 .. m-2 loop
        c(j) := mod_m_addition(c(j), dar_mod_multiplication(R(j,i), d(m+i), p, m), p, m);
      end loop;
    end loop;

An executable Ada file multiplication_mod_f_poly.adb, including Algorithm 5.5, is available at www.arithmetic-circuits.org. The corresponding combinational hardware implementations are very inefficient. Serial implementations, like those presented in Section 5.2.2, should be used.
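To make the two-step scheme concrete, the sketch below builds the reduction matrix from the recurrence of Eq. (5.7) and then applies Eq. (5.6), following the structure of Algorithm 5.5. It is a minimal Python model under stated assumptions: f is passed as the list of its low coefficients [f_0, ..., f_(m-1)] of a monic polynomial with m ≥ 2, the function names are hypothetical, and Python's % operator replaces the mod p components of Chaps. 2 and 3.

```python
def reduction_matrix(f, p):
    """m x (m-1) matrix R of Eq. (5.7); column i holds x^(m+i) mod f(x)."""
    m = len(f)
    assert m >= 2
    R = [[0] * (m - 1) for _ in range(m)]
    for j in range(m):
        R[j][0] = (-f[j]) % p                       # x^m = -f_(m-1) x^(m-1) - ... - f_0
    for i in range(1, m - 1):
        for j in range(m):
            shifted = R[j - 1][i - 1] if j > 0 else 0
            R[j][i] = (shifted + R[m - 1][i - 1] * R[j][0]) % p
    return R


def two_step_mult(a, b, f, p):
    """c(x) = a(x) b(x) mod f(x): full product first, then reduction with R."""
    m = len(f)
    # Step 1: schoolbook product d(x), 2m - 1 coefficients.
    d = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            d[i + j] = (d[i + j] + a[i] * b[j]) % p
    # Step 2: c = [I | R] d, as in Eq. (5.6).
    R = reduction_matrix(f, p)
    c = d[:m]
    for j in range(m):
        for i in range(m - 1):
            c[j] = (c[j] + R[j][i] * d[m + i]) % p
    return c
```

For the illustrative choice p = 7 and f(x) = x^3 + x + 1, the columns of R are x^3 ≡ 6 + 6x and x^4 ≡ 6x + 6x^2, and (1 + 2x + 3x^2)(4 + 5x + 6x^2) reduces to 5 + 3x + 3x^2.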

5.2.2 Serial Multiplication

Another way of computing the multiplication is serial multiplication. Serial multipliers process all coefficients of the multiplicand in parallel in the first step, while the coefficients of the multiplier are processed serially. Serial multiplication can be performed in two different ways, depending on the order in which the coefficients of the multiplier are processed: Most Significant Element (MSE) first multiplier and Least Significant Element (LSE) first multiplier [GGK06]. The Most Significant Element (MSE) first multiplication starts with the highest coefficient $b_{m-1}$ of the multiplier polynomial and continues with the remaining coefficients one at a time in descending order. Hence, multiplication according to this scheme can be performed in the following way. Given a polynomial $f(x) = x^m + f_{m-1}x^{m-1} + \ldots + f_1 x + f_0$ of degree m over Zp, and two polynomials $a(x) = a_{m-1}x^{m-1} + \ldots + a_1 x + a_0$ and $b(x) = b_{m-1}x^{m-1} + \ldots + b_1 x + b_0$ of degrees less than m over Zp, the computation of c(x) = a(x)b(x) mod f(x) can be done as follows:

$$a(x)b(x) = (\ldots((0 \cdot x + a(x)b_{m-1})x + a(x)b_{m-2})x + \ldots)x + a(x)b_0 \tag{5.8}$$

In order to compute Eq. (5.8), a quantity of the form s(x)x has to be reduced modulo f(x). This can be computed as follows:

$$
\begin{aligned}
s(x)x &= s_{m-1}x^m + s_{m-2}x^{m-1} + \ldots + s_1 x^2 + s_0 x \\
&\equiv s_{m-1}x^m + s_{m-2}x^{m-1} + \ldots + s_1 x^2 + s_0 x - s_{m-1} f(x) \\
&= (s_{m-2} - s_{m-1}f_{m-1})x^{m-1} + (s_{m-3} - s_{m-1}f_{m-2})x^{m-2} + \ldots + (s_0 - s_{m-1}f_1)x + (0 - s_{m-1}f_0)
\end{aligned}
\tag{5.9}
$$

where all operations are done modulo p. If $d(x) = s(x)x \bmod f(x) = d_{m-1}x^{m-1} + \ldots + d_1 x + d_0$, then using Eq. (5.9) we have that

$$
\begin{aligned}
d_0 &= -s_{m-1} f_0 \\
d_i &= s_{i-1} - s_{m-1} f_i, \quad i = 1, 2, \ldots, m-1
\end{aligned}
\tag{5.10}
$$

Assume that the function

    function multiply_by_x_zp(s,f: polynomial_m1) return polynomial_m1

implementing Eq. (5.9) according to Eq. (5.10), and therefore returning the polynomial s(x)x mod f(x), has been defined, where polynomial_m1 is a vector indexed from 0 to m − 1 with coefficients in Zp. Also assume that the functions product(a, b) and addition_mod_f_poly(a, b) compute, respectively, the multiplication of the polynomial a(x) by an element b of Zp, that is, $(b a_0 \bmod p,\ b a_1 \bmod p,\ \ldots,\ b a_{m-1} \bmod p)$, and the addition of two polynomials a(x) and b(x) over Zp according to Algorithm 5.2. Then the following algorithm implements the MSE-first multiplication scheme given in Eq. (5.8):

Algorithm 5.6—MSE-first multiplier

    -- c is assumed to hold the zero polynomial on entry, as in Eq. (5.8)
    for i in 0 .. m-1 loop
      c := addition_mod_f_poly(multiply_by_x_zp(c,f), product(a,b(m-1-i)));
    end loop;
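The Horner-style recurrence of Eq. (5.8), together with the multiply-by-x step of Eq. (5.10), can be modelled in a few lines of Python. This is only a sketch: the names multiply_by_x and mse_first_mult are chosen here for illustration, f is passed as the list [f_0, ..., f_(m-1)] of a monic polynomial, and the accumulator c is explicitly initialized to zero.

```python
def multiply_by_x(s, f, p):
    """Return s(x) * x mod f(x) over Z_p, following Eq. (5.10)."""
    m = len(s)
    top = s[m - 1]                                  # coefficient that overflows into x^m
    d = [(-top * f[0]) % p]                         # d_0 = -s_(m-1) f_0
    for i in range(1, m):
        d.append((s[i - 1] - top * f[i]) % p)       # d_i = s_(i-1) - s_(m-1) f_i
    return d


def mse_first_mult(a, b, f, p):
    """c(x) = a(x) b(x) mod f(x), processing b from b_(m-1) down to b_0."""
    m = len(f)
    c = [0] * m
    for i in range(m):
        c = multiply_by_x(c, f, p)                  # c := c * x mod f
        b_coef = b[m - 1 - i]
        c = [(cj + aj * b_coef) % p for cj, aj in zip(c, a)]   # c := c + a * b_(m-1-i)
    return c
```

With the illustrative values p = 7 and f(x) = x^3 + x + 1, the product of 1 + 2x + 3x^2 and 4 + 5x + 6x^2 evaluates to 5 + 3x + 3x^2, the same result obtained by reducing the full product.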

An executable Ada file MSEfirst.adb, including Algorithm 5.6, is available at www.arithmetic-circuits.org. The circuit of Fig. 5.1 can execute the main step of Algorithm 5.6. In order to reduce the number of computation resources, the circuit of Fig. 5.2 can also be used: with control = 1 the circuit computes c(x)x mod f(x), and with control = 0 it computes $c(x) + a(x)b_{m-1-i}$. The minimum clock period is approximately equal to $T_{\text{mod-}p\text{-product}} + T_{\text{mod-}p\text{-sum}}$, and the number of cycles is 2m, so that the computation time is about

$$T \approx 2m\,(T_{\text{mod-}p\text{-product}} + T_{\text{mod-}p\text{-sum}})$$

FIGURE 5.1 MSE-first multiplier, first circuit.

A VHDL model has been generated for p = 239. It uses two components described in Chap. 3 (mod_239_multiplier.vhd and adder_subtractor.vhd). The complete VHDL file MSE_first_mod_f_multiplier.vhd is available at www.arithmetic-circuits.org.

FIGURE 5.2 MSE-first multiplier, second circuit.

The Least Significant Element (LSE) first multiplication starts with the lowest coefficient $b_0$ of the multiplier polynomial and continues with the remaining coefficients one at a time in ascending order. Hence, multiplication according to this scheme can be performed in the following way:

$$a(x)b(x) = b_0 a(x) + b_1 (a(x)x) + b_2 (a(x)x^2) + \ldots + b_{m-1}(a(x)x^{m-1}) \tag{5.11}$$

where all coefficient arithmetic is done modulo p.

Algorithm 5.7—LSE-first multiplier

    for i in 0 .. m-1 loop
      c := addition_mod_f_poly(product(a,b(i)),c);
      a := multiply_by_x_zp(a,f);
    end loop;
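A software model of the LSE-first scheme of Eq. (5.11) and Algorithm 5.7 is sketched below in Python. The names are illustrative assumptions; f is given as the list [f_0, ..., f_(m-1)] of a monic polynomial, the accumulator c starts at zero, and the multiplicand is copied so that the caller's list is not modified.

```python
def multiply_by_x(s, f, p):
    """Return s(x) * x mod f(x) over Z_p, following Eq. (5.10)."""
    m = len(s)
    top = s[m - 1]
    return [(-top * f[0]) % p] + [(s[i - 1] - top * f[i]) % p for i in range(1, m)]


def lse_first_mult(a, b, f, p):
    """c(x) = a(x) b(x) mod f(x), processing b from b_0 up to b_(m-1)."""
    m = len(f)
    a = list(a)                 # running value a(x) * x^i mod f(x)
    c = [0] * m                 # accumulator for the partial sums of Eq. (5.11)
    for i in range(m):
        c = [(cj + aj * b[i]) % p for cj, aj in zip(c, a)]   # c := c + b_i * a
        a = multiply_by_x(a, f, p)                            # a := a * x mod f
    return c
```

The two assignments inside the loop read the old values of c and a and do not depend on each other, which is what allows a hardware implementation to update both registers in the same clock cycle.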

An executable Ada file LSEfirst.adb, including Algorithm 5.7, is available at www.arithmetic-circuits.org. It can be noted that the implementation of the LSE-first multiplier presented in Algorithm 5.7 is slightly more complex than that of the MSE-first multiplier (Algorithm 5.6), because the LSE-first multiplier requires one register for c(x) and another one for a(x). However, LSE-first multipliers are generally faster than MSE-first ones, because c(x) and a(x) can be updated in parallel. A VHDL model has been generated for p = 239. It uses two components described in Chap. 3 (mod_239_reducer.vhd and subtractor_mod_p.vhd). The complete VHDL file LSE_first_mod_f_multiplier.vhd is available at www.arithmetic-circuits.org. The datapath is given in Fig. 5.3. The entity declaration is the following:

    entity LSE_first_mod_f_multiplier is
      port(
        a, b: in polynomial;
        clk, reset, start: in std_logic;
        z: out polynomial;
        done: out std_logic
      );
    end LSE_first_mod_f_multiplier;

The VHDL architecture is the following:

    next_c_calc: for i in 0 to m-1 generate
      mult_add(i) <= ( int_b(0) * int_a(i) ) + c(i);
      comp1: mod_239_reducer port map(mult_add(i), next_c(i));
    end generate;

FIGURE 5.3 LSE-first multiplier datapath.

    next_a_calc: for i in 1 to m-1 generate
      mult_f_x_a(i) <= F(i) * int_a(M-1);
      comp1: mod_239_reducer port map (mult_f_x_a(i), mult_sub(i));
      comp2: subtractor_mod_P port map (int_a(i-1), mult_sub(i), next_a(i));
    end generate;
    mult_f_x_a(0) <= ( F(0) * int_a(M-1) );
    comp1: mod_239_reducer port map(mult_f_x_a(0), mult_sub(0));
    comp2: subtractor_mod_P port map(ZERO, mult_sub(0), next_a(0));

    registers_abc: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then
          c <= ZERO_POLY; int_b <= b; int_a <= a;
        elsif update = '1' then
          c <= next_c;
          int_b <= zero_coef & int_b(M-1 downto 1);
          int_a <= next_a;
        end if;
      end if;
    end process registers_abc;
    z <= c;

    counter: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then
          count <= conv_std_logic_vector(M-1, LOGM);
        elsif update = '1' then
          count <= count - 1;
        end if;
      end if;
    end process counter;
    count_equal_zero <= '1' when count = 0 else '0';

    control_unit: process(clk, reset, count_equal_zero, current_state)
    begin
      case current_state is
        when 0 to 1 => load <= '0'; update <= '0'; done <= '1';
        when 2 => load <= '1'; update <= '0'; done <= '0';
        when 3 => load <= '0'; update <= '1'; done <= '0';
      end case;
      if reset = '1' then
        current_state <= 0;
      elsif clk'event and clk = '1' then
        case current_state is
          when 0 => if start = '0' then current_state <= 1; end if;
          when 1 => if start = '1' then current_state <= 2; end if;
          when 2 => current_state <= 3;
          when 3 => if count_equal_zero = '1' then current_state <= 0; end if;
        end case;
      end if;
    end process control_unit;

5.3 Exponentiation mod f(x)

In general, an arbitrary integer power k of an element a(x) ∈ Zp[x]/f(x) can be computed using the repeated “square and multiply” method [MOV96], also known as the binary method [Knu81], which breaks the exponentiation operation into a series of squaring and multiplication operations in Zp[x]/f(x). This method is based on the observation that the binary representation of any integer k, with $0 \le k \le p^m - 1$, is given by $k = \sum_{i=0}^{t-1} k_i 2^i$, with $k_i \in \{0,1\}$.

In this method, repeated squaring of the partial results is used to reduce the required number of multiplications. Each integer exponent k can be written in its binary representation as a t-bit vector, $k = k_0 + k_1 2 + k_2 2^2 + \ldots + k_{t-1} 2^{t-1} = (k_0, k_1, \ldots, k_{t-1})$. According to this method, we can obtain:

$$a^k = a^{\sum_{i=0}^{t-1} k_i 2^i} = a^{k_0} (a^2)^{k_1} (a^{2^2})^{k_2} \cdots (a^{2^{t-1}})^{k_{t-1}} = \prod_{i=0}^{t-1} B_i \tag{5.12}$$

where

$$
B_i = (a^{2^i})^{k_i} = \begin{cases}
a^{2^i}, & \text{if } k_i = 1 \\
1, & \text{if } k_i = 0
\end{cases}
\tag{5.13}
$$
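As a small worked instance of Eqs. (5.12) and (5.13), consider the exponent k = 13 with t = 4. These values are chosen here purely for illustration and do not come from the text.

```latex
\begin{aligned}
k &= 13 = 1 + 0\cdot 2 + 1\cdot 2^{2} + 1\cdot 2^{3},
  \qquad (k_0, k_1, k_2, k_3) = (1, 0, 1, 1), \\
B_0 &= a, \quad B_1 = 1, \quad B_2 = a^{2^{2}} = a^{4}, \quad B_3 = a^{2^{3}} = a^{8}, \\
a^{13} &= B_0\, B_1\, B_2\, B_3 = a \cdot a^{4} \cdot a^{8}.
\end{aligned}
```

Three squarings produce a^2, a^4, and a^8, and two multiplications of nontrivial factors combine them, compared with the twelve multiplications needed to form a^13 by repeated multiplication by a.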

The square and multiply method given in Eqs. (5.12) and (5.13) can be implemented in the following algorithm:

Algorithm 5.8—Square-and-multiply exponentiation mod f

    for i in 0 .. m-1 loop
      b(i) := 0;
    end loop;
    c := a;
    b(0) := 1;
    for i in 0 .. m-1 loop
      if k(i) = 1 then
        b := LSEfirst(b,c,f);
      end if;
      c := LSEfirst(c,c,f);
    end loop;

where the result of the exponentiation is the final value of the polynomial b(x), and where the multiplication and squaring operations are both computed with the function LSEfirst given in Algorithm 5.7. Furthermore, in Algorithm 5.8, t has been selected to be equal to m. An executable Ada file exp_mod_f.adb, including Algorithm 5.8, is available at www.arithmetic-circuits.org. A VHDL model for the square-and-multiply exponentiation mod f algorithm is given in the file exp_sq_mult.vhd, available at www.arithmetic-circuits.org. The datapath corresponding to Algorithm 5.8 is shown in Fig. 5.4. The entity declaration of the square-and-multiply exponentiator mod f given in the file exp_sq_mult.vhd is

    entity exp_sq_mult is
      port (
        A: in polynomial;
        E: in std_logic_vector (N-1 downto 0);
        clk, reset, start: in std_logic;
        B: out polynomial;
        done: out std_logic
      );
    end exp_sq_mult;

FIGURE 5.4 Square-and-multiply exponentiation mod f datapath.

The VHDL architecture corresponding to the circuit of Fig. 5.4 is the following, where the LSE-first multiplier mod f given in Algorithm 5.7 has been used:

    inst_mult: LSE_first_mod_f_multiplier port map (
      A => cc, B => bb, clk => clk, reset => reset,
      start => start_mult, Z => new_B, done => done_mult);
    inst_square: LSE_first_mod_f_multiplier port map (
      A => cc, B => cc, clk => clk, reset => reset,
      start => start_sq, Z => new_c, done => done_sq);

    counter: process(reset, clk)
    begin
      if reset = '1' then count <= 0;
      elsif clk'event and clk = '1' then
        if inic = '1' then count <= 0;
        elsif shift_r = '1' then count <= count+1;
        end if;
      end if;
    end process counter;

    sh_reg_e: process(reset, clk)
    begin
      if reset = '1' then ee <= (others => '0');
      elsif clk'event and clk = '1' then
        if inic = '1' then ee <= e;
        elsif shift_r = '1' then ee <= '0' & ee(N-1 downto 1);
        end if;
      end if;
    end process sh_reg_e;

    register_c: process(reset, clk)
    begin
      if reset = '1' then cc <= ZERO_POLY;
      elsif clk'event and clk = '1' then
        if inic = '1' then cc <= a;
        elsif shift_r = '1' then cc <= new_c;
        end if;
      end if;
    end process register_c;

    register_b: process(reset, clk)
    begin
      if reset = '1' then bb <= ZERO_POLY;
      elsif clk'event and clk = '1' then
        if inic = '1' then bb <= ONE_POLY;
        elsif shift_r = '1' and ee(0) = '1' then bb <= new_b;
        end if;
      end if;
    end process register_b;
    b <= bb;

    control_unit: process(clk, reset, current_state, ee(0))
    begin
      case current_state is
        when 0 to 1 => inic <= '0'; shift_r <= '0'; done <= '1'; ce_c <= '0';
                       start_sq <= '0'; start_mult <= '0';
        when 2 => inic <= '1'; shift_r <= '0'; done <= '0'; ce_c <= '0';
                  start_sq <= '0'; start_mult <= '0';
        when 3 => inic <= '0'; shift_r <= '0'; done <= '0'; ce_c <= '1';
                  start_sq <= '1'; start_mult <= ee(0);
        when 4 => inic <= '0'; shift_r <= '0'; done <= '0'; ce_c <= '1';
                  start_sq <= '0'; start_mult <= '0';
        when 5 => inic <= '0'; shift_r <= '1'; done <= '0'; ce_c <= '1';
                  start_sq <= '0'; start_mult <= '0';
      end case;
      if reset = '1' then
        current_state <= 0;
      elsif clk'event and clk = '1' then
        case current_state is
          when 0 => if start = '0' then current_state <= 1; end if;
          when 1 => if start = '1' then current_state <= 2; end if;
          when 2 => current_state <= 3;  --capture operands
          when 3 => current_state <= 4;  --start operations
          when 4 => if (done_sq = '1' and (ee(0) = '0' or done_mult = '1')) then
                      current_state <= 5;
                    end if;
          when 5 => if count = N-1 then current_state <= 0;
                    else current_state <= 3;
                    end if;
        end case;
      end if;
    end process control_unit;
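The square-and-multiply scheme of Algorithm 5.8 can also be exercised in software. The Python sketch below is a minimal model under stated assumptions: the function names are hypothetical, f is the list [f_0, ..., f_(m-1)] of a monic polynomial, the multiplication follows the LSE-first scheme of Algorithm 5.7, and the loop runs over the bits of an integer exponent k rather than over a fixed number of m bits as in Algorithm 5.8.

```python
def multiply_by_x(s, f, p):
    """Return s(x) * x mod f(x) over Z_p, following Eq. (5.10)."""
    m = len(s)
    top = s[m - 1]
    return [(-top * f[0]) % p] + [(s[i - 1] - top * f[i]) % p for i in range(1, m)]


def mult_mod_f(a, b, f, p):
    """LSE-first product a(x) b(x) mod f(x) over Z_p."""
    a = list(a)
    c = [0] * len(f)
    for b_coef in b:
        c = [(cj + aj * b_coef) % p for cj, aj in zip(c, a)]
        a = multiply_by_x(a, f, p)
    return c


def exp_mod_f(a, k, f, p):
    """Return a(x)^k mod f(x) for an integer k >= 0, scanning k from its lowest bit."""
    m = len(f)
    b = [1] + [0] * (m - 1)     # running product, starts at the constant polynomial 1
    c = list(a)                 # holds a^(2^i) at iteration i
    while k > 0:
        if k & 1:
            b = mult_mod_f(b, c, f, p)
        c = mult_mod_f(c, c, f, p)
        k >>= 1
    return b
```

With the illustrative choice p = 7 and f(x) = x^3 + x + 1, which has no roots in Z_7 and is therefore irreducible, every nonzero element raised to the power 7^3 − 1 = 342 should give 1; this provides a simple consistency check of the model.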

5.4 Optimal Extension Fields

Optimal extension fields (OEFs) are a family of extension fields $GF(p^m)$ with special properties defined as follows ([BP01], [Bai98], [GG03], [GKP04]):

Definition 5.1 An optimal extension field is a finite field $GF(p^m)$ such that:
1. The prime p is a pseudo-Mersenne prime of the form $p = 2^n \pm b$ with $\log_2(b) \le \lfloor n/2 \rfloor$.
2. An irreducible binomial $f(x) = x^m - c$ exists over GF(p).

The following theorem from [LN83] describes the cases when an irreducible binomial exists:

Theorem 5.1 Let m ≥ 2 be an integer and c ∈ GF(p). Then the binomial $x^m - c$ is irreducible in GF(p) if and only if the following two conditions are satisfied: (i) each prime factor of m divides the order e of c over GF(p), but not (p − 1)/e; (ii) p ≡ 1 mod 4 if m ≡ 0 mod 4.

An important corollary is also given in [Jun93]:

Corollary 5.1 Let c be a primitive element for GF(p) and let m be a divisor of p − 1. Then the polynomial $x^m - c$ is irreducible.

It must be noted that irreducible binomials do not exist over GF(2). Furthermore, the following corollary follows directly from the above, since p − 1 is always an even number [Bai98]:

Corollary 5.2 Let c be a primitive element for GF(p). Then $x^2 - c$ is irreducible over GF(p).

There are two special cases of OEFs which yield additional arithmetic advantages, called Type I and Type II OEFs [BP01]. A Type I OEF has $p = 2^n \pm 1$. This OEF allows subfield modular reduction with low complexity. For Elliptic Curve Cryptography (ECC), particularly good choices of p are $2^{31} - 1$ and $2^{61} - 1$. A Type II OEF has an irreducible binomial $f(x) = x^m - 2$. This OEF allows a reduction in the complexity of extension field modular reduction, as will be shown later on.

The elements of an OEF can be represented by polynomials of degree m − 1 with coefficients from the subfield GF(p). Addition and subtraction of two field elements are implemented in a straightforward way by adding or subtracting, respectively, the coefficients of their polynomial representations and performing, if necessary, a reduction modulo p. Addition and subtraction in OEFs can be implemented using the algorithms given in Sec. 5.1.

Extension field multiplication comprises polynomial multiplication over GF(p) and a reduction modulo the irreducible binomial f(x). Ordinary polynomial multiplication of two field elements a(x) and b(x) results in an intermediate product d(x) of degree less than or equal to 2m − 2, as given in Eqs. (5.3) and (5.4). After polynomial multiplication, the reduction c(x) = d(x) mod f(x) must be performed. The reduction modulo the binomial $f(x) = x^m - c$ can be represented using Eq. (5.6) as follows:

$$
\begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{m-2} \\ c_{m-1} \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & \cdots & 0 & 0 & c & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 & 0 & 0 & c & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 & 0 & 0 & 0 & \cdots & c \\
0 & 0 & \cdots & 0 & 1 & 0 & 0 & \cdots & 0
\end{pmatrix}
\begin{pmatrix} d_0 \\ \vdots \\ d_{m-1} \\ d_m \\ \vdots \\ d_{2m-2} \end{pmatrix}
\tag{5.14}
$$

The reduction matrix in Eq. (5.14) is easily computed for $f(x) = x^m - c$ using Eq. (5.7) and the fact that $d_{m+i}x^{m+i} \equiv c\,d_{m+i}x^{i} \bmod f(x)$ [Bai98]. Therefore, from Eq. (5.14), the reduced polynomial is given by:

$$c(x) \equiv d_{m-1}x^{m-1} + (c\,d_{2m-2} + d_{m-2})x^{m-2} + \ldots + (c\,d_m + d_0) \bmod f(x) \tag{5.15}$$

From Eq. (5.15), it must be noted that a polynomial d(x) over GF(p) of degree less than or equal to 2m − 2 can be reduced modulo the binomial $f(x) = x^m - c$ with at most m − 1 multiplications by c and m − 1 additions, where both multiplications and additions are performed in GF(p) ([Bai98], [BP01]). It can also be noted that for a Type II OEF with $f(x) = x^m - 2$, the multiplications by c in Eq. (5.15) can be implemented as shifts, therefore reducing the complexity of the modular reduction. The OEF multiplication method given by Eqs. (5.14) and (5.15) for the computation of the product e(x) = d(x) mod f(x) can be implemented in the following algorithm:

133

134

Chapter Five Algorithm 5.9—OEF multiplication d := poly_mult_zp(a,b); e(m-1) := d(m-1); for i in 0 .. m-2 loop e(i) := mod_m_addition(d(i),dar_mod_multiplication(d(m+i), c,p,m),p,m); end loop;

where the intermediate product $d(x)$ of degree less than or equal to $2m - 2$ is computed using the function poly_mult_zp, and where the functions mod_m_addition and dar_mod_multiplication compute the addition and multiplication mod $p$, respectively. Furthermore, in Algorithm 5.9, $c$ in GF($p$) is the term corresponding to the irreducible binomial $f(x) = x^m - c$. An executable Ada file OEF_mult_mod_f.adb, including Algorithm 5.9, is available at www.arithmetic-circuits.org.

For an OEF with $f(x) = x^m - c$, the quantity $s(x)x \bmod f(x)$ given in Eq. (5.9) can be computed as follows:

\[
\begin{aligned}
s(x)x &= s_{m-1}x^m + s_{m-2}x^{m-1} + \cdots + s_1x^2 + s_0x \\
      &\equiv s_{m-1}x^m + s_{m-2}x^{m-1} + \cdots + s_1x^2 + s_0x - s_{m-1}f(x) \\
      &= s_{m-2}x^{m-1} + s_{m-3}x^{m-2} + \cdots + s_0x + s_{m-1}c
\end{aligned}
\tag{5.16}
\]

where coefficient arithmetic is done modulo $p$. If $d(x) = s(x)x \bmod f(x) = d_{m-1}x^{m-1} + \cdots + d_1x + d_0$, then using Eq. (5.16) we have

\[
d_0 = s_{m-1}c, \qquad d_i = s_{i-1}, \quad i = 1, 2, \ldots, m-1
\tag{5.17}
\]
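Eq. (5.17) amounts to a rotation of the coefficient vector with a single multiplication by $c$. A minimal Python sketch follows, assuming that coefficients are stored with index `i` holding the coefficient of $x^i$; the name `mult_x_oef` is chosen here only to mirror the Ada function introduced next.

```python
def mult_x_oef(s, c, p):
    """Return s(x)*x mod (x^m - c) over GF(p), following Eq. (5.17).

    s is a list of m coefficients, s[i] being the coefficient of x^i.
    """
    d0 = (s[-1] * c) % p          # d_0 = s_(m-1) * c, the only multiplication
    return [d0] + list(s[:-1])    # d_i = s_(i-1) for i = 1 .. m-1
```

With $p = 7$ and $c = 2$, the polynomial $1 + 2x + 3x^2$ is mapped to $6 + x + 2x^2$.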

Therefore, $s(x)x \bmod f(x)$ requires only one multiplication mod $p$ when an OEF is used. Assume that the function

    function mult_x_OEF(s: polynomial_m1; c: integer) return polynomial_m1

implementing Eq. (5.17) according to Eq. (5.16), and therefore returning the polynomial $s(x)x \bmod f(x)$, has been defined. Then the following algorithm implements the MSE-first multiplication scheme given in Algorithm 5.6 for an OEF with $f(x) = x^m - c$:

Algorithm 5.10: MSE-first multiplier for OEF

    for i in 0 .. m-1 loop
      d := addition_mod_f_poly(mult_x_OEF(d,c), product(a,b(m-1-i)));
    end loop;
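The MSE-first scheme can be tried out with the following Python sketch. It is an illustration under stated assumptions: the accumulator `d` is taken to start at zero (the listing above does not show its initialization), coefficients are stored low degree first, and the helper names are hypothetical.

```python
def mult_x_oef(s, c, p):
    """s(x)*x mod (x^m - c), per Eq. (5.17)."""
    return [(s[-1] * c) % p] + list(s[:-1])

def oef_mse_mult(a, b, c, p):
    """a(x)*b(x) mod (x^m - c), scanning b from its most significant element."""
    m = len(a)
    d = [0] * m                                  # assumed initial value
    for i in range(m):
        shifted = mult_x_oef(d, c, p)            # d(x)*x mod f(x)
        b_i = b[m - 1 - i]
        d = [(shifted[j] + a[j] * b_i) % p for j in range(m)]
    return d
```

For $p = 7$, $m = 3$ and $f(x) = x^3 - 2$, the product of $1 + 2x + 3x^2$ and $4 + 5x + 6x^2$ is the constant polynomial 2, which agrees with a schoolbook multiplication followed by the reduction of Eq. (5.15).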

An executable Ada file OEF_MSE_mult.adb, including Algorithm 5.10, is available at www.arithmetic-circuits.org. In a similar way, the LSE-first multiplication scheme given in Algorithm 5.7 can be implemented for an OEF with $f(x) = x^m - c$ as follows:

Algorithm 5.11: LSE-first multiplier for OEF

    for i in 0 .. m-1 loop
      d := addition_mod_f_poly(product(a,b(i)),d);
      a := mult_x_OEF(a,c);
    end loop;
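A Python sketch of the LSE-first scheme is given below for experimentation. As before, the zero initial value of `d`, the low-degree-first list layout, and the function name are assumptions of this example rather than details taken from the Ada file.

```python
def oef_lse_mult(a, b, c, p):
    """a(x)*b(x) mod (x^m - c), scanning b from its least significant element."""
    m = len(a)
    a = [x % p for x in a]
    d = [0] * m                                   # assumed initial value
    for i in range(m):
        d = [(d[j] + a[j] * b[i]) % p for j in range(m)]
        a = [(a[-1] * c) % p] + a[:-1]            # a := a(x)*x mod f(x), Eq. (5.17)
    return d
```

The accumulation and the update of `a` inside the loop body are independent of each other, which is what allows a hardware implementation to perform them in the same cycle.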

An executable Ada file OEF_LSE_mult.adb, including Algorithm 5.11, is also available at www.arithmetic-circuits.org. The datapath for a circuit implementing the LSE-first multiplier for OEF is given in Fig. 5.5. It can be observed that the circuit is similar to the circuit shown in Fig. 5.3 with the simplifications given for OEF.

FIGURE 5.5 LSE-first multiplier datapath for OEF. (Figure not reproduced. Labeled blocks: mod p multiplier, mod p subtractor, k-bit by k-bit multipliers, 2k-bit adders, mod p reducers. Signals: a_{m-1}, ..., a_0, next_a_{m-1}, ..., next_a_0, c_{m-1}, ..., c_0, next_c_{m-1}, ..., next_c_0, b_i.)

Exponentiation can also be computed for OEFs using the binary or "square and multiply" method given in Sec. 5.3. If the function OEF_LSE_mult(a, b, c) computes the LSE-first multiplication $a(x)b(x) \bmod f(x)$, with $f(x) = x^m - c$, given in Algorithm 5.11, then the following algorithm implements the exponentiation given in Algorithm 5.8 for OEFs:

Algorithm 5.12: Square-and-multiply exponentiation for OEF

    for i in 0 .. m-1 loop b(i) := 0; end loop;
    d := a; b(0) := 1;
    for i in 0 .. m-1 loop
      if k(i) = 1 then b := OEF_LSE_mult(b,d,c); end if;
      d := OEF_LSE_mult(d,d,c);
    end loop;

where the result of the exponentiation is the final value of the b(x) polynomial, and where the multiplication and squaring operations are both computed with the function OEF_LSE_mult given in Algorithm 5.11. Furthermore, in Algorithm 5.12, t has been selected to be equal to m. An executable Ada file OEF_exp.adb, including Algorithm 5.12, is available at www.arithmetic-circuits.org.
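The following Python sketch mirrors the structure of Algorithm 5.12. It is an illustration only: the exponent is passed as a list of bits, least significant first, of arbitrary length (the algorithm above fixes that length to $m$), and the multiplier is a software stand-in for OEF_LSE_mult.

```python
def oef_mult(a, b, c, p):
    """LSE-first product a(x)*b(x) mod (x^m - c) over GF(p)."""
    m = len(a)
    a = [x % p for x in a]
    d = [0] * m
    for i in range(m):
        d = [(d[j] + a[j] * b[i]) % p for j in range(m)]
        a = [(a[-1] * c) % p] + a[:-1]
    return d

def oef_exp(a, k_bits, c, p):
    """a(x)^k mod (x^m - c); k_bits[i] is bit i of k (least significant first)."""
    m = len(a)
    b = [1] + [0] * (m - 1)          # b(x) := 1
    d = list(a)                      # d(x) := a(x)
    for bit in k_bits:
        if bit == 1:
            b = oef_mult(b, d, c, p)
        d = oef_mult(d, d, c, p)     # squaring step
    return b
```

As a sanity check, in the field of $7^3 = 343$ elements defined by $f(x) = x^3 - 2$, raising any nonzero element to the power 342 should return 1:

```python
bits_342 = [(342 >> i) & 1 for i in range(9)]
print(oef_exp([1, 2, 3], bits_342, 2, 7))   # expected to print [1, 0, 0]
```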

5.5 FPGA Implementations

Several circuits described in this chapter have been implemented in Xilinx Spartan3 (speed grade -5) programmable devices. The times (period, total time) are expressed in ns. The parameters FFs and LUTs represent the numbers of flip-flops and look-up tables, respectively. Every slice includes two flip-flops and two look-up tables. All the source files are available at www.arithmetic-circuits.org.

5.5.1 Adders of Polynomials mod p

The cost and delay of several adders are shown in Table 5.1.

    p      m     LUTs    Slices    Total time (ns)
    17     8     40      24        8
    239    17    136     68        10

TABLE 5.1 Cost and Delay of Adders of Polynomials mod p

5.5.2 Subtractors of Polynomials mod p

The cost and delay of several subtractors are shown in Table 5.2.

    p      m     LUTs    Slices    Total time (ns)
    17     8     40      24        8
    239    17    136     68        10

TABLE 5.2 Cost and Delay of Subtractors of Polynomials mod p

5.5.3 Adders/Subtractors of Polynomials mod p

The cost and delay of several adders/subtractors are shown in Table 5.3.

    p      m     LUTs    Slices    Total time (ns)
    17     8     112     64        10
    239    17    425     221       12

TABLE 5.3 Cost and Delay of Adders/Subtractors of Polynomials mod p

5.5.4 Serial Multipliers

The circuits are for $p^m = 239^{17}$ with irreducible polynomial $f(x) = x^{17} + 237$. The parameter Mult 18 × 18 represents the number of embedded multipliers used in the Xilinx FPGAs. The cost and delay of these serial multipliers are shown in Table 5.4.

                 p      m     FFs    LUTs     Slices    Mult 18 × 18    Period (ns)    Cycles    Total time (ns)
    MSE-first    239    17    303    1,690    937       17              26.1           34        888
    LSE-first    239    17    415    1,779    1,061     18              19.4           17        330

TABLE 5.4 Cost and Delay of Serial Multipliers

5.5.5 Exponentiation

The circuit is for $p^m = 239^{17}$ with irreducible polynomial $f(x) = x^{17} + 237$. The parameter Mult 18 × 18 represents the number of embedded multipliers used in the Xilinx FPGAs.

    p      m     FFs      LUTs     Slices    Mult 18 × 18    Period (ns)    Cycles    Total time (ns)
    239    17    1,241    3,989    2,318     36              19.5           289       5,636

5.6 Comments and Conclusions

The experimental results confirm the conclusions obtained in Section 5.2.2 regarding the comparison between MSE-first and LSE-first multipliers: LSE-first multipliers are more complex (area) than MSE-first multipliers, but they are also faster. It can also be noted that the squaring of polynomials could be performed with half the multiplications required by a general-purpose multiplier.

5.7 References

[Bai98] D. V. Bailey. “Optimal Extension Fields for Fast Arithmetic in Public-Key Algorithms.” BS Thesis, Worcester Polytechnic Institute, 1998.
[BP01] D. V. Bailey and C. Paar. “Efficient arithmetic in finite field extensions with application in elliptic curve cryptography.” Journal of Cryptology, vol. 14, no. 3, pp. 153–176, 2001.
[GG03] J. von zur Gathen and J. Gerhard. Modern Computer Algebra. 2d ed., Cambridge University Press, New York, 2003.
[GGK06] J. Guajardo, T. Güneysu, S. Kumar, C. Paar, and J. Pelzl. “Efficient Hardware Implementation of Finite Fields with Applications to Cryptography.” Acta Applicandae Mathematicae, Special Issue, Finite Fields: Applications and Implementations, vol. 93, pp. 75–118, September 2006.
[GKP04] J. Großschädl, S. Kumar, and C. Paar. “Architectural Support for Arithmetic in Optimal Extension Fields.” IEEE Conf. on Application-specific Systems, Architectures and Processors - ASAP 2004, Galveston, Texas, pp. 111–124, 2004.
[Jun93] D. Jungnickel. Finite Fields. B.I.-Wissenschaftsverlag, Mannheim, Leipzig, Wien, Zürich, 1993.
[Knu81] D. E. Knuth. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. 2d ed., Addison-Wesley, MA, 1981.
[LN83] R. Lidl and H. Niederreiter. Finite Fields, vol. 20 of Encyclopedia of Mathematics and Its Applications. Addison-Wesley, Reading, Massachusetts, 1983.
[MOV96] A. J. Menezes, P. C. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. CRC Press, Boca Raton, Florida, 1996.
[MS99] M. Mignotte and D. Stefanescu. Polynomials. An Algorithmic Approach. Springer, New York, 1999.

CHAPTER 6

Operations over GF($p^m$)

Let $f(x)$ be a polynomial of degree $m > 0$ over $Z_p$. If $f(x)$ is irreducible, any nonzero polynomial $h(x)$ of the ring $Z_p[x]/f(x)$ has an inverse $h^{-1}(x)$ such that

\[
h(x)h^{-1}(x) \bmod f(x) = 1
\tag{6.1}
\]

Thus $Z_p[x]/f(x)$ is a finite field, an extension of the finite field $Z_p$, also called the Galois field GF($p^m$). Algorithms and circuits for executing additions, subtractions, multiplications, and exponentiations over $Z_p[x]/f(x)$ have already been studied in Chap. 5. In this chapter the inversion or, more generally, the division operation will be dealt with. The problem under study is the following: given $g(x)$ and $h(x)$ in $Z_p[x]/f(x)$, where $h(x)$ is a nonzero polynomial, compute $z(x)$ such that $g(x) = h(x)z(x) \bmod f(x)$, that is,

\[
z(x) = g(x)h^{-1}(x) \bmod f(x)
\tag{6.2}
\]

As in the case of $Z_p$, there are two types of algorithms for computing the inverse of an element $h(x)$ of $Z_p[x]/f(x)$. A first method uses an algorithm that expresses the gcd (greatest common divisor) of two polynomials $a(x)$ and $b(x)$ over $Z_p$ in the form $\alpha(x)a(x) + \beta(x)b(x)$, where $\alpha(x)$ and $\beta(x)$ are polynomials over $Z_p$. Assume that $a(x) = f(x)$ and $b(x) = h(x)$ and express the gcd of $f(x)$ and $h(x)$ in the form $\alpha(x)f(x) + \beta(x)h(x)$. As $f(x)$ is irreducible and the degree of $h(x)$ is smaller than the degree $m$ of $f(x)$, their gcd is 1, so that $\alpha(x)f(x) + \beta(x)h(x) = 1$ and $\beta(x)h(x) \bmod f(x) = 1$, that is,

\[
h^{-1}(x) = \beta(x) \bmod f(x)
\tag{6.3}
\]

To this class of algorithms belong the extended Euclidean algorithm and the binary algorithm.

Another method is based on the fact that $(h(x))^{q-1} \bmod f(x) = 1$ for any nonzero polynomial $h(x)$, $q = p^m$ being the number of field elements. Thus, $(h(x))^{q-2}h(x) \bmod f(x) = 1$ and

\[
h^{-1}(x) = (h(x))^{q-2} \bmod f(x)
\tag{6.4}
\]

In this way inversion is replaced by exponentiation.
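Eq. (6.4) can be illustrated with a short Python sketch. It is not taken from the book's source files: the function names, the low-degree-first coefficient lists, and the assumption that $f(x)$ is monic are choices made for this example, and a plain square-and-multiply loop stands in for the exponentiation circuits of Chap. 5.

```python
def poly_mulmod(a, b, f, p):
    """a(x)*b(x) mod f(x) over Z_p; f is assumed monic of degree m = len(f) - 1,
    and a, b have at most m coefficients (index i holds the coefficient of x^i)."""
    m = len(f) - 1
    d = [0] * (2 * m - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            d[i + j] = (d[i + j] + ai * bj) % p
    for k in range(2 * m - 2, m - 1, -1):       # cancel x^k for k >= m
        t = d[k]
        for j in range(m + 1):
            d[k - m + j] = (d[k - m + j] - t * f[j]) % p
    return d[:m]

def fermat_inverse(h, f, p):
    """h(x)^(q-2) mod f(x) with q = p^m, i.e. the inverse of a nonzero h(x)."""
    m = len(f) - 1
    e = p ** m - 2
    result = [1] + [0] * (m - 1)
    base = list(h) + [0] * (m - len(h))
    while e > 0:
        if e & 1:
            result = poly_mulmod(result, base, f, p)
        base = poly_mulmod(base, base, f, p)
        e >>= 1
    return result
```

For example, over $Z_2$ with $f(x) = x^2 + x + 1$, the inverse of $x$ is $x + 1$, and `fermat_inverse([0, 1], [1, 1, 1], 2)` returns `[1, 1]`.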

6.1 Euclidean Algorithm

The classical Euclidean algorithm ([MOV96], [HMV04]) for computing the gcd of two polynomials $a(x)$ and $b(x)$ consists of a sequence of polynomial divisions with remainder:

\[
\begin{aligned}
r_0(x) &= a(x), \quad r_1(x) = b(x) \\
r_0(x) &= r_1(x)q_1(x) + r_2(x) \\
r_1(x) &= r_2(x)q_2(x) + r_3(x) \\
&\;\;\vdots \\
r_{n-2}(x) &= r_{n-1}(x)q_{n-1}(x) + r_n(x)
\end{aligned}
\tag{6.5}
\]

As $\deg(r_1(x)) > \deg(r_2(x)) > \deg(r_3(x)) > \cdots$, after a finite number of steps, say $n$, $r_n(x)$ will be a constant polynomial, that is, an element of $Z_p$. Furthermore, $\gcd(r_{n-1}(x), r_n(x)) = \gcd(r_{n-2}(x), r_{n-1}(x)) = \cdots = \gcd(r_0(x), r_1(x)) = \gcd(a(x), b(x))$. Thus,

\[
\gcd(a(x), b(x)) = \gcd(r_{n-1}(x), r_n(x)) \quad \text{with } r_n(x) \in Z_p
\tag{6.6}
\]

In the particular case where $a(x) = f(x)$ and $b(x) = h(x)$ with $\deg(h(x)) < m$, the gcd is equal to 1. If $r_n(x) = 0$ then $\gcd(r_{n-1}(x), r_n(x)) = \gcd(r_{n-1}(x), 0) = 1$, so that $\deg(r_{n-1}(x)) = 0$. Assuming that $n$ is the first index such that $\deg(r_n(x)) \le 0$, the conclusion is that $r_n(x)$ is a nonzero element of $Z_p$:

\[
r_n(x) \in Z_p, \qquad r_n(x) \ne 0
\tag{6.7}
\]

For computing $z(x) = g(x)h^{-1}(x) \bmod f(x)$, another sequence of polynomials $u_0(x), u_1(x), u_2(x), \ldots$ is computed in parallel with the computation of $q_1(x), q_2(x), q_3(x), \ldots$:

\[
\begin{aligned}
u_0(x) &= 0, \quad u_1(x) = g(x) \\
u_2(x) &= u_0(x) - u_1(x)q_1(x) \\
u_3(x) &= u_1(x) - u_2(x)q_2(x) \\
&\;\;\vdots \\
u_n(x) &= u_{n-2}(x) - u_{n-1}(x)q_{n-1}(x)
\end{aligned}
\tag{6.8}
\]

The following lemma is proved by induction.

Lemma 6.1 $u_i(x)h(x) \equiv r_i(x)g(x) \bmod f(x)$

Proof For $i = 0$ and 1: $u_0(x)h(x) = 0$ and $r_0(x)g(x) = f(x)g(x) \equiv 0 \bmod f(x)$; $u_1(x)h(x) = g(x)h(x)$ and $r_1(x)g(x) = h(x)g(x)$.
For $i \ge 2$: $u_i(x)h(x) = (u_{i-2}(x) - u_{i-1}(x)q_{i-1}(x))h(x) = u_{i-2}(x)h(x) - u_{i-1}(x)q_{i-1}(x)h(x) \equiv r_{i-2}(x)g(x) - r_{i-1}(x)q_{i-1}(x)g(x) = (r_{i-2}(x) - r_{i-1}(x)q_{i-1}(x))g(x) = r_i(x)g(x)$.

Thus, according to Eq. (6.7) and Lemma 6.1, $u_n(x)h(x) \equiv r_n(x)g(x) \bmod f(x)$, where $r_n(x)$ is a nonzero element of $Z_p$, so that $(r_n(x))^{-1}u_n(x)h(x) \equiv g(x) \bmod f(x)$ and

\[
g(x)h^{-1}(x) = (r_n(x))^{-1}u_n(x) \bmod f(x)
\tag{6.9}
\]

Algorithm 6.1: Euclidean algorithm

    A := F; B := H; C := zero; D := G;
    while Degree(B) > 0 loop
      Q := Quotient(A, B); R := Remainder(A, B);
      Next_D := Subtract(C, Product_mod_f(D, Q, f));
      A := B; B := R; C := D; D := Next_D;
    end loop;
    Z := Product(D, Invert(B(0)));
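To make the data flow of Algorithm 6.1 concrete, here is a Python sketch. It is a software illustration under stated assumptions, not the Ada program mentioned below: polynomials are coefficient lists with the coefficient of $x^i$ at index `i`, $p$ is prime so that inverses mod $p$ can be obtained as `pow(x, p - 2, p)`, $h(x)$ is nonzero, and all helper names are hypothetical.

```python
def trim(a):
    """Drop leading (high-degree) zero coefficients."""
    a = list(a)
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_sub(a, b, p):
    n = max(len(a), len(b))
    a = list(a) + [0] * (n - len(a))
    b = list(b) + [0] * (n - len(b))
    return [(x - y) % p for x, y in zip(a, b)]

def poly_mul(a, b, p):
    if not a or not b:
        return []
    d = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            d[i + j] = (d[i + j] + ai * bj) % p
    return d

def poly_divmod(a, b, p):
    """Quotient and remainder of a(x) divided by the nonzero b(x) over Z_p."""
    a, b = trim(a), trim(b)
    inv_lead = pow(b[-1], p - 2, p)
    q = [0] * max(len(a) - len(b) + 1, 1)
    while len(a) >= len(b):
        shift = len(a) - len(b)
        coef = (a[-1] * inv_lead) % p
        q[shift] = coef
        for j, bj in enumerate(b):
            a[shift + j] = (a[shift + j] - coef * bj) % p
        a = trim(a)
    return q, a

def euclid_divide(g, h, f, p):
    """z(x) = g(x) * h(x)^-1 mod f(x), following the steps of Algorithm 6.1."""
    A, B, C, D = trim(f), trim(h), [], trim(g)
    while len(B) - 1 > 0:                          # Degree(B) > 0
        Q, R = poly_divmod(A, B, p)
        DQ_mod_f = poly_divmod(poly_mul(D, Q, p), f, p)[1]
        A, B, C, D = B, R, D, trim(poly_sub(C, DQ_mod_f, p))
    inv_b0 = pow(B[0], p - 2, p)
    z = [(x * inv_b0) % p for x in D]
    return z + [0] * (len(trim(f)) - 1 - len(z))   # pad to m coefficients
```

With $p = 7$, $f(x) = x^3 - 2$ (stored as `[5, 0, 0, 1]`), $h(x) = 1 + 2x + 3x^2$ and $g(x) = 1$, the sketch returns $2 + 6x + 3x^2$, and multiplying this result by $h(x)$ modulo $f(x)$ gives 1.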

An executable Ada file Euclidean_algorithm_polynomials.adb, including Algorithm 6.1, is available at www.arithmetic-circuits.org. As regards the degree of the polynomials $u_i(x)$, first prove the following lemma.

Lemma 6.2 $\deg(u_i(x)) = \deg(g(x)) + m - \deg(r_{i-1}(x)), \ \forall i > 0$

Proof For $i = 1$: $u_1(x) = g(x)$, $r_0(x) = f(x)$, $\deg(r_0(x)) = m$, so $\deg(g(x)) + m - \deg(r_0(x)) = \deg(g(x))$.
For $i = 2$: $u_2(x) = u_0(x) - u_1(x)q_1(x) = -g(x)q_1(x)$, so that $\deg(u_2(x)) = \deg(g(x)) + \deg(q_1(x))$. Furthermore $r_0(x) = r_1(x)q_1(x) + r_2(x)$, so that $\deg(q_1(x)) = \deg(r_0(x)) - \deg(r_1(x)) = m - \deg(r_1(x))$. Finally $\deg(u_2(x)) = \deg(g(x)) + m - \deg(r_1(x))$.
For $i \ge 3$: $u_i(x) = u_{i-2}(x) - u_{i-1}(x)q_{i-1}(x)$. First observe that $\deg(r_{i-3}(x)) > \deg(r_{i-2}(x))$; so, by induction, $\deg(u_{i-2}(x)) < \deg(u_{i-1}(x))$ and $\deg(u_i(x)) = \deg(u_{i-1}(x)) + \deg(q_{i-1}(x)) = \deg(g(x)) + m - \deg(r_{i-2}(x)) + \deg(q_{i-1}(x))$. Furthermore $r_{i-2}(x) = r_{i-1}(x)q_{i-1}(x) + r_i(x)$, so that $\deg(q_{i-1}(x)) = \deg(r_{i-2}(x)) - \deg(r_{i-1}(x))$. Finally $\deg(u_i(x)) = \deg(g(x)) + m - \deg(r_{i-2}(x)) + \deg(r_{i-2}(x)) - \deg(r_{i-1}(x)) = \deg(g(x)) + m - \deg(r_{i-1}(x))$.

As $\deg(r_1(x)) > \deg(r_2(x)) > \cdots > \deg(r_n(x)) = 0$, a consequence of the preceding lemma is that

\[
\deg(u_i(x)) \le \deg(g(x)) + m - 1, \quad \forall i \le n
\tag{6.10}
\]

In particular, if $g(x) = 1$ (computation of the inverse $h^{-1}(x)$ of $h(x)$), the degree of $u_i(x)$ is always smaller than $m$.

In order to avoid computing the quotient and the remainder of the division of $a(x)$ by $b(x)$, a simpler operation can be used ([HMV04]). Assume that $a(x)$ is a polynomial of degree $s$, $b(x)$ a polynomial of degree $t$, and that $s \ge t$. Then define

\[
q(x) = a_s(b_t)^{-1}x^{s-t}, \qquad r(x) = a(x) - b(x)q(x)
\tag{6.11}
\]

Actually, these operations correspond to the first step of the classical division algorithm for polynomials. The coefficient of degree $s$ of $r(x)$ is equal to $a_s - b_ta_s(b_t)^{-1} = 0$, so that $r(x)$ is a polynomial of degree less than $s$ and

\[
\deg(r(x)) < s = \max\{\deg(a(x)), \deg(b(x))\}
\tag{6.12}
\]
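The single step of Eq. (6.11) is easy to express in software. The Python sketch below is an illustration with assumed names (`degree`, `pseudo_step`) and an assumed representation (coefficient of $x^i$ at index `i`, $p$ prime); it returns the coefficient and exponent of the monomial $q(x)$ together with $r(x)$.

```python
def degree(a):
    """Degree of a coefficient list; -1 is used here for the zero polynomial."""
    for i in range(len(a) - 1, -1, -1):
        if a[i] != 0:
            return i
    return -1

def pseudo_step(a, b, p):
    """One step of Eq. (6.11): q(x) = a_s * b_t^-1 * x^(s-t), r(x) = a(x) - b(x)q(x).

    Requires deg(a) >= deg(b) >= 0. Returns (coefficient of q, exponent s-t, r).
    """
    s, t = degree(a), degree(b)
    coef = (a[s] * pow(b[t], p - 2, p)) % p
    r = list(a)
    for j in range(t + 1):
        r[s - t + j] = (r[s - t + j] - coef * b[j]) % p
    return coef, s - t, r
```

For $p = 7$, $a(x) = x^3 + 5$ and $b(x) = 3x^2 + 2x + 1$, the step gives $q(x) = 5x$ and $r(x) = 4x^2 + 2x + 5$, whose degree is indeed smaller than $s = 3$, in agreement with Eq. (6.12).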

Taking into account that initially $a(x) = f(x)$ and $b(x) = h(x)$, so that $s > t$, the main step of the Euclidean algorithm can be replaced by the following one:

\[
\begin{aligned}
&q(x) = a_s(b_t)^{-1}x^{s-t}, \qquad r(x) = a(x) - b(x)q(x) \\
&a(x) = b(x), \qquad b(x) = r(x) \\
&\text{if } \deg(a(x)) < \deg(b(x)), \text{ permute } a(x) \text{ and } b(x)
\end{aligned}
\tag{6.13}
\]

After a number $n$ of steps, smaller than two times the degree of $f$, the degree of $r_n(x)$ will be equal to 0. In the following algorithm, the functions Pseudo_Quotient and Pseudo_Remainder compute $q(x)$ and $r(x)$ according to Eq. (6.13).

Algorithm 6.2: Euclidean algorithm, version 2

    A := F; B := H; C := Zero; D := G;
    while Degree(B) > 0 loop
      Q := Pseudo_Quotient(A, B); R := Pseudo_Remainder(A, B);
      Next_D := Subtract(C, Product_Mod_F(D, Q, F));
      A := B; B := R; C := D; D := Next_D;
      if Degree(A) < Degree(B) then Swap(A,B); Swap(C,D); end if;
    end loop;
    Z := Product(D, Invert(B(0)));

An executable Ada file Euclidean_algorithm_polynomials2.adb, including Algorithm 6.2, is available at www.arithmetic-circuits.org. As the degrees of $a(x)$ and $b(x)$ must be known at every step, a kind of normalized representation can be used. Given a polynomial

\[
e(x) = e_ux^u + e_{u-1}x^{u-1} + \cdots + e_0 \quad \text{with } e_u \ne 0,\ u \le m
\]

define

\[
n\_e(x) = e(x)x^{m-u} = e_ux^m + e_{u-1}x^{m-1} + \cdots + e_0x^{m-u}, \qquad deg\_e = u
\tag{6.14}
\]

so that

\[
e(x) = n\_e(x)\,x^{deg\_e - m}
\tag{6.15}
\]

where $n\_e(x)$ is a polynomial of degree $m$. Then, according to Eqs. (6.13) and (6.15),

\[
r(x) = a(x) - b(x)a_sb_t^{-1}x^{s-t} = n\_a(x)x^{s-m} - n\_b(x)x^{t-m}a_sb_t^{-1}x^{s-t} = (n\_a(x) - n\_b(x)a_sb_t^{-1})x^{s-m}
\]

that is, $r(x) = n\_r(x)\,x^{deg\_r - m}$ where

\[
n\_r(x) = n\_a(x) - n\_b(x)a_sb_t^{-1}, \qquad deg\_r = s
\tag{6.16}
\]

Actually, $n\_r(x)$ is not of degree $m$ and must be normalized. The following operations must be executed:

    deg_r := deg_a;
    while n_r(m) = 0 loop
      n_r := multiply_by_x(n_r); deg_r := deg_r-1;
    end loop;

Initially, $a(x) = f(x)$, so that $n\_a(x) = f(x)$ and $deg\_a = m$. The degree of $b(x) = h(x)$ is smaller than $m$ and a preliminary normalization step must be executed.

Algorithm 6.3: Euclidean algorithm, version 3

    n_a := f; deg_a := m; n_b := multiply_by_x(h); deg_b := m-1;
    c := zero; d := g;
    --previous step:
    while n_b(m) = 0 loop
      n_b := multiply_by_x(n_b); deg_b := deg_b-1;
    end loop;
    --
    while deg_b > 0 loop
      dif := deg_a - deg_b;
      coef := (n_a(m)*invert(n_b(m))) mod p;
      --thread1:
      n_r := subtract(n_a, product(n_b,coef)); deg_r := deg_a;
      while n_r(m) = 0 loop
        n_r := multiply_by_x(n_r); deg_r := deg_r-1;
      end loop;
      --thread2:
      e := d;
      for i in 1 .. dif loop e := multiply_by_x(e, f); end loop;
      e := subtract(c, product(e,coef));
      --
      if deg_b >= deg_r then
        n_a := n_b; deg_a := deg_b; n_b := n_r; deg_b := deg_r;
        c := d; d := e;
      else
        n_a := n_r; deg_a := deg_r; c := e;
      end if;
    end loop;
    z := product(d,invert(n_b(m)));

An executable Ada file pseudo_Euclidean_algorithm.adb, including Algorithm 6.3, is available at www.arithmetic-circuits.org. An example of datapath corresponding to Algorithm 6.3 is shown in Fig. 6.1. It is important to note that the sequences of instructions thread1 and thread2 can be executed in parallel. The control unit executes the main loop in at least four clock cycles:

1. Load the initial values:

    dif := deg_a - deg_b;
    coef := (n_a(m)*invert(n_b(m))) mod p;
    --thread1:
    n_r := subtract(n_a, product(n_b,coef)); deg_r := deg_a;
    --thread2:
    e := d;

2. Normalize n_r(x) and update e(x) (executed at least once):

    --thread1:
    n_r := multiply_by_x(n_r); deg_r := deg_r-1;
    --thread2:
    e := multiply_by_x(e, f);

3. Final value of e(x):

    e := subtract(c, product(e,coef));

4. Swap if deg_r > deg_b.

As already noted above, the number of executions of the main loop is smaller than two times the degree of $f$, that is, $2m$. As regards the number of cycles, the worst case probably occurs when at each iteration step 2 is executed just once (minimum reduction of the remainder degree). So, the number of cycles is approximately equal to $8m$. The most time-consuming operation is $n\_r(x) = n\_a(x) - n\_a_m(n\_b_m)^{-1}n\_b(x)$ (cycle 1). It includes one mod $p$ inversion, two mod $p$ products, and one mod $p$ subtraction, so that the total computation time is about

\[
T \approx 8m\,(T_{\text{mod-p-inverter}} + 2T_{\text{mod-p-multiplier}} + T_{\text{mod-p-subtractor}})
\tag{6.17}
\]

FIGURE 6.1 Euclidean algorithm. (Datapath figure not reproduced. Labeled blocks: mod p inverter, mod p multipliers, mod p subtractors, and registers n_a(x) (initially f(x)), n_b(x) (initially h(x)x), n_r(x), c(x) (initially zero), d(x) (initially g(x)), e(x), deg_a (initially m), deg_b (initially m-1), deg_r. Control signals: sel_sub, sel_r, sel_e, sel_a, sel_b, ce_a, ce_b, ce_r, ce_d, ce_e.)

A VHDL model has been generated for $p = 239$. The mod 239 inverter is a table storing $x^{-1} \bmod 239$ for all $x$ in $\{1, 2, \ldots, 238\}$, and the other components have been described in Chap. 3 (mod_239_multiplier.vhd, reduced version of adder_subtractor.vhd). The complete VHDL file pseudo_Euclidean_divider.vhd is available at www.arithmetic-circuits.org. The entity declaration is

    entity pseudo_Euclidean_divider is
    port(
      g, h: in polynomial;
      clk, reset, start: in std_logic;
      z: out polynomial;
      done: out std_logic
    );
    end pseudo_Euclidean_divider;

The VHDL architecture corresponding to the circuit of Fig. 6.1 is the following:

    long_c(m) <= "00000000"; long_e(m) <= "00000000";
    definition: for i in 0 to m-1 generate
      long_c(i) <= c(i); long_e(i) <= e(i);
    end generate;
    with sel_sub select sub1 <= n_a when '0', long_c when others;
    with sel_sub select nbe <= n_b when '0', long_e when others;
    functions1: for i in 0 to m generate
      m1: mod_239_multiplier port map(coef, nbe(i), sub2(i));
      s1: subtractor port map(sub1(i), sub2(i), out1(i));
    end generate;
    by_x1: for i in 1 to m-1 generate sub3(i) <= e(i-1); end generate;
    sub3(0) <= "00000000";
    functions2: for i in 0 to m-1 generate
      m2: mod_239_multiplier port map(e(m-1), f(i), sub4(i));
      s2: subtractor port map(sub3(i), sub4(i), out2(i));
    end generate;
    dr_minus1 <= deg_r - 1; db_minus1 <= deg_b - 1; dif <= deg_a - deg_b;
    inverter: mod_239_inverter port map(clk, n_b(m), inv_out);
    m3: mod_239_multiplier port map(n_a(m), inv_out, coef);
    functions3: for i in 0 to m-1 generate
      m4: mod_239_multiplier port map(inv_out, d(i), z(i));
    end generate;
    by_x2: for i in 1 to m generate nr_by_x(i) <= n_r(i-1); end generate;
    nr_by_x(0) <= "00000000";
    with sel_r select next_r <= out1 when '0', nr_by_x when others;
    with sel_r select next_dr <= deg_a when '0', dr_minus1 when others;
    definition2: for i in 0 to m-1 generate
      short_out1(i) <= out1(i);
    end generate;
    with sel_e select next_e <= d when "00", out2 when "01", short_out1 when others;
    with sel_a select next_a <= n_b when '0', n_r when others;
    with sel_a select next_da <= deg_b when '0', deg_r when others;
    by_x3: for i in 1 to m generate nb_by_x(i) <= n_b(i-1); end generate;
    nb_by_x(0) <= "00000000";
    with sel_b select next_b <= nb_by_x when '0', n_r when others;
    with sel_b select next_db <= db_minus1 when '0', deg_r when others;
    with sel_a select next_c <= d when '0', e when others;
    registers_ac: process(clk) ... end process registers_ac;
    by_x4: for i in 1 to m generate h_by_x(i) <= h(i-1); end generate;
    h_by_x(0) <= "00000000";
    register_b: process(clk) ... end process register_b;
    register_r: process(clk) ... end process register_r;
    register_d: process(clk) ... end process register_d;
    register_e: process(clk) ... end process register_e;

The complete model additionally includes a counter (variable dif of Algorithm 6.3), combinational circuits that compute the branching conditions, and a control unit. For large values of $p$, the mod $p$ inverter and multiplier components are sequential circuits, so that additional state and control signals must be added (start_inversion, inversion_done, start_product, product_done). In Eq. (6.17), $T_{\text{mod-p-inverter}}$ and $T_{\text{mod-p-multiplier}}$ must then be replaced by the total computation time (number of cycles times the minimum clock period) of the corresponding operations.

6.2 Binary Algorithm

The binary algorithm ([WWSH02], [DS06], [HMV04]) for computing the gcd of two polynomials is based on the following observation. Given two polynomials $a(x)$ and $b(x)$:

    if both are divisible by $x$, that is, if $a_0 = b_0 = 0$, then $\gcd(a(x), b(x)) = x \cdot \gcd(a(x)/x, b(x)/x)$;
    if one of them, say $b(x)$, is divisible by $x$ ($b_0 = 0$) and the other is not ($a_0 \ne 0$), then $\gcd(a(x), b(x)) = \gcd(a(x), b(x)/x)$;
    if neither of them is divisible by $x$, then define a new polynomial $ab(x) = a(x) - a_0b_0^{-1}b(x)$, so that $\gcd(a(x), b(x)) = \gcd(ab(x), b(x)) = \gcd(ab(x), a(x))$, and $ab(x)$ is divisible by $x$.

Assume that $a(x)$ is not divisible by $x$ and $\deg(a(x)) > 0$, and define $a(0, x) = a(x)$ and $b(0, x) = b(x)$. Two sequences $a(1, x), a(2, x), a(3, x), \ldots$ and $b(1, x), b(2, x), b(3, x), \ldots$ of polynomials are generated: given $a(i, x)$ and $b(i, x)$ where $a(i, x)$ is not divisible by $x$, then

    if $b_0(i, x) = 0$: $a(i+1, x) = a(i, x)$, $b(i+1, x) = b(i, x)/x$
    if $b_0(i, x) \ne 0$ and $\deg(b(i, x)) \ge \deg(a(i, x))$: $a(i+1, x) = a(i, x)$, $b(i+1, x) = ab(i, x)/x$, where $ab(i, x) = a(i, x) - a_0(i, x)(b_0(i, x))^{-1}b(i, x)$
    if $b_0(i, x) \ne 0$ and $\deg(b(i, x)) < \deg(a(i, x))$: $a(i+1, x) = b(i, x)$, $b(i+1, x) = ab(i, x)/x$, where $ab(i, x) = a(i, x) - a_0(i, x)(b_0(i, x))^{-1}b(i, x)$

Obviously $a(i+1, x)$ is not divisible by $x$ and $\gcd(a(i+1, x), b(i+1, x)) = \gcd(a(i, x), b(i, x))$, so that

\[
\gcd(a(i+1, x), b(i+1, x)) = \gcd(a(i, x), b(i, x)) = \cdots = \gcd(a(0, x), b(0, x)) = \gcd(a(x), b(x))
\tag{6.18}
\]

In order to demonstrate the convergence of this iteration, compare $\deg(a(i+1, x)) + \deg(b(i+1, x))$ with $\deg(a(i, x)) + \deg(b(i, x))$:

    in the first case, $\deg(a(i+1, x)) = \deg(a(i, x))$, and $\deg(b(i+1, x)) = \deg(b(i, x)) - 1$ if $b(i, x) \ne 0$, while $\deg(b(i+1, x)) = \deg(b(i, x))$ if $b(i, x) = 0$;
    in the second case, $\deg(a(i+1, x)) = \deg(a(i, x))$, and $\deg(b(i+1, x)) = \deg(ab(i, x)) - 1 \le \deg(b(i, x)) - 1$ if $ab(i, x) \ne 0$, while $\deg(b(i+1, x)) < \deg(b(i, x))$ if $ab(i, x) = 0$;
    in the third case, $\deg(a(i+1, x)) = \deg(b(i, x))$, and $\deg(b(i+1, x)) = \deg(ab(i, x)) - 1 \le \deg(a(i, x)) - 1$ if $ab(i, x) \ne 0$, while $\deg(b(i+1, x)) < \deg(a(i, x))$ if $ab(i, x) = 0$.

Thus, unless $b(i, x) = 0$, $\deg(a(i+1, x)) + \deg(b(i+1, x)) < \deg(a(i, x)) + \deg(b(i, x))$. Furthermore, as long as $\deg(a(i, x)) > 0$ and $\deg(b(i, x)) > 0$, then $\deg(a(i+1, x)) > 0$.

Lemma 6.3 If $\deg(a(i, x)) > 0$ and $\deg(b(i, x)) > 0$, then $\deg(a(i+1, x)) + \deg(b(i+1, x)) < \deg(a(i, x)) + \deg(b(i, x))$, and $\deg(a(i+1, x)) > 0$.

In conclusion, after a finite number of steps, say $n$, $b(n, x)$ will be a constant polynomial, that is, an element of $Z_p$. Thus

\[
\gcd(a(x), b(x)) = \gcd(a(n, x), b(n, x)) \quad \text{with } b(n, x) \in Z_p
\tag{6.19}
\]

In the case where $a(x) = f(x)$ and $b(x) = h(x)$ with $\deg(h(x)) < m$, the gcd is equal to 1. If $b(n, x) = 0$, then $\gcd(a(n, x), b(n, x)) = \gcd(a(n, x), 0) = 1$, so that $\deg(a(n, x)) = 0$. Assuming that $n$ is the first index such that $\deg(b(n, x)) \le 0$, then $\deg(a(n, x)) > 0$ (Lemma 6.3), and $b(n, x)$ is a nonzero element of $Z_p$:

\[
b(n, x) \in Z_p, \qquad b(n, x) \ne 0
\tag{6.20}
\]

For computing $z(x) = g(x)h^{-1}(x) \bmod f(x)$, two additional sequences of polynomials $c(1, x), c(2, x), c(3, x), \ldots$ and $d(1, x), d(2, x), d(3, x), \ldots$ are computed in parallel. The initial values are $c(0, x) = 0$ and $d(0, x) = g(x)$. Then

    if $b_0(i, x) = 0$: $c(i+1, x) = c(i, x)$, $d(i+1, x) = d(i, x)x^{-1} \bmod f(x)$
    if $b_0(i, x) \ne 0$ and $\deg(b(i, x)) \ge \deg(a(i, x))$: $c(i+1, x) = c(i, x)$, $d(i+1, x) = cd(i, x)x^{-1} \bmod f(x)$, where $cd(i, x) = c(i, x) - a_0(i, x)(b_0(i, x))^{-1}d(i, x)$
    if $b_0(i, x) \ne 0$ and $\deg(b(i, x)) < \deg(a(i, x))$: $c(i+1, x) = d(i, x)$, $d(i+1, x) = cd(i, x)x^{-1} \bmod f(x)$, where $cd(i, x) = c(i, x) - a_0(i, x)(b_0(i, x))^{-1}d(i, x)$

The following lemma is proved by induction:

Lemma 6.4 $c(i, x)h(x) \equiv a(i, x)g(x) \bmod f(x)$ and $d(i, x)h(x) \equiv b(i, x)g(x) \bmod f(x)$

Proof For $i = 0$: $c(0, x)h(x) = 0$ and $a(0, x)g(x) = f(x)g(x) \equiv 0 \bmod f(x)$; $d(0, x)h(x) = g(x)h(x)$ and $b(0, x)g(x) = h(x)g(x)$. For the induction step, the values of $c(i+1, x)$ and $d(i+1, x)$ are computed from $c(i, x)$ and $d(i, x)$ in the same way as the values of $a(i+1, x)$ and $b(i+1, x)$ are computed from $a(i, x)$ and $b(i, x)$, except that the conventional arithmetic operations are replaced by mod $f(x)$ operations.

Thus, according to Eq. (6.20) and Lemma 6.4, $d(n, x)h(x) \equiv b(n, x)g(x) \bmod f(x)$, where $b(n, x)$ is a nonzero element of $Z_p$, so that $(b(n, x))^{-1}d(n, x)h(x) \equiv g(x) \bmod f(x)$ and

\[
g(x)h^{-1}(x) = (b(n, x))^{-1}d(n, x) \bmod f(x)
\tag{6.21}
\]

Given a polynomial $w(x)$ of degree smaller than $m$ over $Z_p$, the value of $w(x)x^{-1} \bmod f(x)$ is computed as follows: $w(x)x^{-1} \bmod f(x) = (w(x) - w_0f_0^{-1}f(x))/x$. Assume that a function

    function divide_by_x(a, f: in polynomial) return polynomial

returning $w(x)x^{-1} \bmod f(x)$ has been defined.

Algorithm 6.4: Binary algorithm

    A := f; B := h; C := zero; d := g;
    while Degree(b) > 0 loop
      if b(0) = 0 then
        b := Shift_One(b); d := Divide_By_X(d, F);
      else
        coef := (A(0)*Invert(B(0))) mod p;
        Old_b := b; Old_d := d;
        b := Shift_One(Subtract(A, Product(B, coef)));
        d := Divide_By_X(Subtract(C, Product(D, coef)),F);
        if Degree(a) > Degree(Old_b) then a := Old_b; c := Old_d; end if;
      end if;
    end loop;
    Z := Product(d, Invert(b(0)));
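Algorithm 6.4, together with the division by $x$ modulo $f(x)$, can be sketched in Python as follows. The sketch is illustrative and rests on several assumptions: $p$ is prime, $f(x)$ has degree $m$ and a nonzero constant term, $h(x)$ is nonzero, polynomials are lists with the coefficient of $x^i$ at index `i`, and the function names are hypothetical.

```python
def degree(a):
    for i in range(len(a) - 1, -1, -1):
        if a[i] != 0:
            return i
    return -1

def inv(x, p):
    return pow(x, p - 2, p)          # inverse mod a prime p

def divide_by_x(w, f, p):
    """w(x) * x^-1 mod f(x) = (w(x) - w_0 * f_0^-1 * f(x)) / x, with deg(w) < m."""
    m = len(f) - 1
    k = (w[0] * inv(f[0], p)) % p
    t = [((w[i] if i < len(w) else 0) - k * f[i]) % p for i in range(m + 1)]
    return t[1:]                     # t[0] is zero, so dropping it divides by x

def binary_divide(g, h, f, p):
    """z(x) = g(x) * h(x)^-1 mod f(x), following the steps of Algorithm 6.4."""
    m = len(f) - 1
    a, b = list(f), list(h) + [0] * (m + 1 - len(h))
    c, d = [0] * m, list(g) + [0] * (m - len(g))
    while degree(b) > 0:
        if b[0] == 0:
            b = b[1:] + [0]                                   # b(x)/x
            d = divide_by_x(d, f, p)
        else:
            coef = (a[0] * inv(b[0], p)) % p
            old_b, old_d = b, d
            b = [(x - coef * y) % p for x, y in zip(a, b)][1:] + [0]
            d = divide_by_x([(x - coef * y) % p for x, y in zip(c, d)], f, p)
            if degree(a) > degree(old_b):
                a, c = old_b, old_d
    k = inv(b[0], p)
    return [(x * k) % p for x in d]
```

With $p = 7$, $f(x) = x^3 - 2$ (stored as `[5, 0, 0, 1]`), $h(x) = 1 + 2x + 3x^2$ and $g(x) = 1$, the loop runs four times and the result is $2 + 6x + 3x^2$, the inverse of $h(x)$.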

An executable Ada file binary_polynomials.adb, including Algorithm 6.4, is available at www.arithmetic-circuits.org. Instead of computing the degree of $a(i, x)$ and $b(i, x)$ at each step, a better solution consists of defining upper bounds $\alpha_i$ and $\beta_i$ ([DS06]) such that

\[
\deg(a(i, x)) \le \alpha_i, \qquad \deg(b(i, x)) \le \beta_i
\]

Initially $a(0, x) = f(x)$, $\alpha_0 = m$, $b(0, x) = h(x)$, $\beta_0 = m - 1$. The two sequences $a(1, x), a(2, x), a(3, x), \ldots$ and $b(1, x), b(2, x), b(3, x), \ldots$ of polynomials and the two sequences $\alpha_1, \alpha_2, \alpha_3, \ldots$ and $\beta_1, \beta_2, \beta_3, \ldots$ of integers are generated as follows:

    if $b_0(i, x) = 0$: $a(i+1, x) = a(i, x)$, $b(i+1, x) = b(i, x)/x$, $\alpha_{i+1} = \alpha_i$, $\beta_{i+1} = \beta_i - 1$
    if $b_0(i, x) \ne 0$ and $\beta_i \ge \alpha_i$: $a(i+1, x) = a(i, x)$, $b(i+1, x) = ab(i, x)/x$, where $ab(i, x) = a(i, x) - a_0(i, x)(b_0(i, x))^{-1}b(i, x)$, $\alpha_{i+1} = \alpha_i$, $\beta_{i+1} = \beta_i - 1$
    if $b_0(i, x) \ne 0$ and $\beta_i < \alpha_i$: $a(i+1, x) = b(i, x)$, $b(i+1, x) = ab(i, x)/x$, where $ab(i, x) = a(i, x) - a_0(i, x)(b_0(i, x))^{-1}b(i, x)$, $\alpha_{i+1} = \beta_i$, $\beta_{i+1} = \alpha_i - 1$

In order to demonstrate the convergence of this iteration, observe that

\[
\alpha_{i+1} + \beta_{i+1} = \alpha_i + \beta_i - 1
\tag{6.22}
\]

Also note that as long as $\alpha_i \ge 0$ and $\beta_i \ge 0$, $\alpha_{i+1} \ge 0$. In conclusion, after a finite number of steps, say $n$, $\beta_n < 0$, that is, $\deg(b(n, x)) < 0$ ($= -\infty$), and thus $b(n, x) = 0$. As $\gcd(a(n, x), b(n, x)) = \gcd(a(0, x), b(0, x)) = \gcd(f(x), h(x)) = 1$ and $b(n, x) = 0$, then

\[
a(n, x) \in Z_p, \qquad a(n, x) \ne 0
\tag{6.23}
\]

According to Eq. (6.23) and Lemma 6.4, $c(n, x)h(x) \equiv a(n, x)g(x) \bmod f(x)$, where $a(n, x)$ is a nonzero element of $Z_p$, so that $(a(n, x))^{-1}c(n, x)h(x) \equiv g(x) \bmod f(x)$ and

\[
g(x)h^{-1}(x) = (a(n, x))^{-1}c(n, x) \bmod f(x)
\tag{6.24}
\]

Algorithm 6.5: Binary algorithm, version 2

    A := f; B := h; C := zero; d := g;
    alpha := m; beta := m-1;
    while beta >= 0 loop
      if b(0) = 0 then
        b := Shift_One(b); d := Divide_By_X(d, F);
        beta := beta - 1;
      else
        coef := (A(0)*Invert(B(0))) mod p;
        Old_b := b; Old_d := d; old_beta := beta;
        b := Shift_One(Subtract(A, Product(B, coef)));
        d := Divide_By_X(Subtract(C, Product(D, coef)),F);
        if alpha > beta then
          a := Old_b; c := Old_d;
          beta := alpha - 1; alpha := old_beta;
        else
          beta := beta - 1;
        end if;
      end if;
    end loop;
    Z := Product(c, Invert(a(0)));
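The degree-bound variant can be sketched in Python in the same way. Again this is an illustration under assumptions (prime $p$, $f(x)$ of degree $m$ with nonzero constant term, nonzero $h(x)$, low-degree-first coefficient lists, hypothetical function names); the only structural difference from the previous version is that the loop is driven by the counters `alpha` and `beta` instead of computed degrees, and the result is read from `c` and `a`.

```python
def inv(x, p):
    return pow(x, p - 2, p)          # inverse mod a prime p

def divide_by_x(w, f, p):
    """w(x) * x^-1 mod f(x) = (w(x) - w_0 * f_0^-1 * f(x)) / x."""
    m = len(f) - 1
    k = (w[0] * inv(f[0], p)) % p
    t = [((w[i] if i < len(w) else 0) - k * f[i]) % p for i in range(m + 1)]
    return t[1:]

def binary_divide_v2(g, h, f, p):
    """z(x) = g(x) * h(x)^-1 mod f(x), following the steps of Algorithm 6.5."""
    m = len(f) - 1
    a, b = list(f), list(h) + [0] * (m + 1 - len(h))
    c, d = [0] * m, list(g) + [0] * (m - len(g))
    alpha, beta = m, m - 1
    steps = 0
    while beta >= 0:
        steps += 1
        if b[0] == 0:
            b = b[1:] + [0]
            d = divide_by_x(d, f, p)
            beta -= 1
        else:
            coef = (a[0] * inv(b[0], p)) % p
            old_b, old_d, old_beta = b, d, beta
            b = [(x - coef * y) % p for x, y in zip(a, b)][1:] + [0]
            d = divide_by_x([(x - coef * y) % p for x, y in zip(c, d)], f, p)
            if alpha > beta:
                a, c = old_b, old_d
                beta, alpha = alpha - 1, old_beta
            else:
                beta -= 1
    k = inv(a[0], p)
    return [(x * k) % p for x in c], steps
```

Because the sum `alpha + beta` starts at $2m - 1$ and decreases by one per iteration, the returned step count never exceeds $2m$; for $p = 7$, $f(x) = x^3 - 2$ and $h(x) = 1 + 2x + 3x^2$ it is exactly 6.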

An executable Ada file binary_polynomials2.adb, including Algorithm 6.5, is available at www.arithmetic-circuits.org. An example of datapath corresponding to Algorithm 6.5 is shown in Fig. 6.2. The minimum clock period is equal to $T_{\text{mod-p-inverter}} + 2T_{\text{mod-p-multiplier}} + 2T_{\text{mod-p-subtractor}}$. To obtain an upper bound on the number of cycles, take into account that $\alpha_0 = m$ and $\beta_0 < m$, so that, according to Eq. (6.22), the number of cycles is at most $2m$. Thus, the total computation time is about

\[
T \approx 2m\,(T_{\text{mod-p-inverter}} + 2T_{\text{mod-p-multiplier}} + 2T_{\text{mod-p-subtractor}})
\tag{6.25}
\]

A VHDL model has been generated for $p = 239$. As in the case of the Euclidean algorithm (Sec. 6.1), the mod 239 inverter is a table storing $x^{-1} \bmod 239$ for all $x$ in $\{1, 2, \ldots, 238\}$, and the other components have been described in Chap. 3 (mod_239_multiplier.vhd, reduced version of adder_subtractor.vhd). The complete VHDL file binary_algorithm_polynomials.vhd is available at www.arithmetic-circuits.org. The entity declaration is

    entity binary_algorithm_polynomials is
    port(
      g, h: in polynomial;
      clk, reset, start: in std_logic;
      z: out polynomial;
      done: out std_logic
    );
    end binary_algorithm_polynomials;

FIGURE 6.2 Binary algorithm. (Datapath figure not reproduced. Labeled blocks: mod p inverter, mod p multipliers, mod p subtractors, and registers a(x) (initially f(x)), b(x) (initially h(x)), c(x) (initially zero), d(x) (initially g(x)). Control signals: final, sel_bd, ce_ac, ce_bd.)

The VHDL architecture corresponding to the circuit of Fig. 6.2 follows:

    with final select inv_input <= b(0) when '0', a(0) when others;
    inverter1: mod_239_inverter port map(clk, inv_input, inv_ab);
    multiplier1: mod_239_multiplier port map(a(0), inv_ab, coef);
    multipliers2: for i in 0 to m-1 generate
      a_multiplier: mod_239_multiplier port map(coef, b(i), coef_by_b(i));
    end generate;
    coef_by_b(m) <= conv_std_logic_vector(0, k);
    subtractor1: for i in 0 to m generate
      sub1: subtractor port map(a(i), coef_by_b(i), ab(i));
    end generate;
    divide_by_x1: for i in 0 to m-2 generate
      b_div_x(i) <= b(i+1);
    end generate;
    b_div_x(m-1) <= conv_std_logic_vector(0, k);
    divide_by_x2: for i in 0 to m-1 generate
      ab_div_x(i) <= ab(i+1);
    end generate;
    with sel_bd select next_b <= b_div_x when '0', ab_div_x when others;
    with final select mult_in <= coef when '0', inv_ab when others;
    with final select c_or_d <= d when '0', c when others;
    multipliers3: for i in 0 to m-1 generate
      a_second_multiplier: mod_239_multiplier port map(mult_in, c_or_d(i), coef_by_cd(i));
    end generate;
    z <= coef_by_cd;
    subtractor2: for i in 0 to m-1 generate
      sub2: subtractor port map(c(i), coef_by_cd(i), cd(i));
    end generate;
    with sel_bd select w <= d when '0', cd when others;
    multipliers4: for i in 0 to m generate
      a_third_multiplier: mod_239_multiplier port map(w(0), inv_f0_by_f(i), subtractor_input(i));
    end generate;
    subtractor3: for i in 0 to m-1 generate
      sub3: subtractor port map(w(i), subtractor_input(i), subtractor3_output(i));
    end generate;
    w_m <= conv_std_logic_vector(0, k);
    sub4: subtractor port map(w_m, subtractor_input(m), subtractor3_output(m));
    divide_by_x3: for i in 0 to m-1 generate
      next_d(i) <= subtractor3_output(i+1);
    end generate;
    registers_ac: process(clk)
    begin
      if clk'event and clk = '1' then
        if first = '1' then a <= f; c <= zero_poly;
        elsif ce_ac = '1' then
          a(m) <= conv_std_logic_vector(0, k);
          for i in 0 to m-1 loop a(i) <= b(i); end loop;
          c <= d;
        end if;
      end if;
    end process registers_ac;
    registers_bd: process(clk)
    begin
      if clk'event and clk = '1' then
        if first = '1' then b <= h; d <= g;
        elsif ce_bd = '1' then b <= next_b; d <= next_d;
        end if;
      end if;
    end process registers_bd;

The complete model additionally includes counters for storing α and β, combinational circuits that compute the branching conditions and a control unit.

6.3 Reduction to Multiplications over $GF(p^m)$ and Inversion over $Z_p$

The set of nonzero elements of $GF(p^m)$ is a cyclic group containing $q - 1$ elements ($q = p^m$), so that, given a nonzero polynomial $h(x)$, $h(x)^{q-1} = 1$. Let $r = 1 + p + p^2 + \cdots + p^{m-1} = (p^m - 1)/(p - 1)$. Thus, $(h(x)^r)^{p-1} = h(x)^{q-1} = 1$ and $(h(x)^r)^p = h(x)^r$. Taking into account that the fixed elements of the Frobenius automorphism $\varphi: a \to a^p$, $\forall a \in GF(p^m)$, are the elements of $Z_p$ ([Kob94]), one deduces that $h(x)^r$ is a nonzero element of $Z_p$:

$$h(x)^r \in Z_p, \qquad h(x)^r \neq 0 \tag{6.26}$$

Thus, $g(x)\,h(x)\,(h(x)^r)^{-1}\,h(x)^{r-1} \equiv g(x) \bmod f(x)$, so that

$$z(x) = g(x)\,(h(x)^r)^{-1}\,h(x)^{r-1} \bmod f(x) \tag{6.27}$$

The problem has been reduced to multiplications over $GF(p^m)$, and to an inversion over $Z_p$ (calculation of $(h(x)^r)^{-1}$). Assume that the binary representation $r_{s-1} r_{s-2} \ldots r_0$ of $r - 1 = p + p^2 + \cdots + p^{m-1}$ has been previously computed. The following algorithm computes $z(x)$:

Algorithm 6.6—mod f(x) division, multiplications over GF(p^m), and inversion over Zp
    E := One;
    for I in 0 .. S-1 loop
      E := Product_Mod_F(E, E, F);
      if R(S-1-I) = 1 then
        E := Product_Mod_F(E, H, F);
      end if;
    end loop;
    A := Product_Mod_F(E, H, F);
    E := Product(E, Invert(A(0)));
    Z := Product_Mod_F(E, G, F);
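As an illustration of Algorithm 6.6, the following Python sketch computes h(x)^(r−1) by square-and-multiply, checks that h(x)^r is a constant, and then performs the single inversion over Z_p. It is a minimal model written for this text, not the book's Ada file. The names `mul_mod` and `divide_by_powering` and the example field GF(7^3) with f(x) = x^3 − 2 are assumptions.

```python
def mul_mod(u, v, f, p):
    """Product u*v mod f over Z_p; u and v hold m coefficients, lowest degree first."""
    m = len(f) - 1
    prod = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            prod[i + j] = (prod[i + j] + u[i] * v[j]) % p
    for k in range(2 * m - 2, m - 1, -1):      # cancel x^k using the monic f
        t = prod[k]
        for j in range(m + 1):
            prod[k - m + j] = (prod[k - m + j] - t * f[j]) % p
    return prod[:m]


def divide_by_powering(g, h, f, p):
    """Return z = g * h^(-1) mod f using z = g * (h^r)^(-1) * h^(r-1), Eq. (6.27)."""
    m = len(f) - 1
    r_minus_1 = sum(p**i for i in range(1, m))  # p + p^2 + ... + p^(m-1)
    e = [1] + [0] * (m - 1)
    for bit in bin(r_minus_1)[2:]:              # bits of r-1, most significant first
        e = mul_mod(e, e, f, p)
        if bit == "1":
            e = mul_mod(e, h, f, p)
    a = mul_mod(e, h, f, p)                     # a = h^r, expected to lie in Z_p
    assert all(coef == 0 for coef in a[1:])
    k = pow(a[0], -1, p)                        # the only inversion, over Z_p
    e = [coef * k % p for coef in e]
    return mul_mod(e, g, f, p)
```

The internal `assert` makes Eq. (6.26) visible at run time: for a nonzero h(x), all coefficients of h(x)^r except the constant one vanish.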

An executable Ada file reduction_to_multiplications.adb, including Algorithm 6.6, is available at www.arithmetic-circuits.org. An example of datapath corresponding to Algorithm 6.6 is shown in Fig. 6.3. Its maximum computation time is approximately equal to $2s$ times the computation time of a mod f(x) multiplier, where $2^s \ge r = (p^m - 1)/(p - 1)$, that is, $s \approx m \log_2 p = \log_2 q$. Thus the computation time is prohibitively long, except for small values of $q = p^m$. A VHDL model has been generated for p = 239. As in Secs. 6.1 and 6.2 the mod 239 inverter is a table storing $x^{-1} \bmod 239$ for all x in {1, 2, . . . , 238}, and the other components have been described in Chaps. 3 and 5 (mod_239_multiplier.vhd and mod_f_multiplier.vhd). The complete VHDL file reduction_to_multiplications.vhd is available at www.arithmetic-circuits.org. The entity declaration is

    entity reduction_to_multiplications is
    port(
      g, h: in polynomial;
      clk, reset, start: in std_logic;
      z: out polynomial;
      done: out std_logic
    );
    end reduction_to_multiplications;

FIGURE 6.3 Reduction to multiplications.

The VHDL architecture corresponding to the circuit of Fig. 6.3 follows:

    with sel_mult select mult_in2 <= e when "00", h when "01", g when others;
    a_mod_f_multiplier: LSE_first_mod_f_mult port map (e, mult_in2, clk, reset, start_mult, mult_out1, mult_done);
    with sel_e select next_ea <= mult_out1 when '0', mult_out2 when others;
    an_inverter: mod_239_inverter port map(clk, a0, inv_a0);
    mod_p_multipliers: for i in 0 to m-1 generate
      a_mod_p_multiplier: mod_239_multiplier port map(e(i), inv_a0, mult_out2(i));
    end generate;
    register_e: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then e <= one_poly;
        elsif ce_e = '1' then e <= next_ea;
        end if;
      end if;
    end process;
    z <= e;
    register_a: process(clk)
    begin
      if clk'event and clk = '1' then
        if ce_a = '1' then a0 <= next_ea(0); end if;
      end if;
    end process;

The complete model additionally includes an s-state counter, a shift register initially storing r − 1, and a control unit.

6.4 Optimal Extension Fields

A particularly interesting case is when f is a binomial and p a multiple of m plus 1, that is,

$$f(x) = x^m - c \quad \text{and} \quad p \bmod m = 1 \tag{6.28}$$

in which case $Z_p[x]/f(x)$ is an optimal extension field (OEF).

Property 6.1 In an OEF,

$$a(x)^{p^i} = a_{m-1} f_{m-1,i}\, x^{m-1} + a_{m-2} f_{m-2,i}\, x^{m-2} + \cdots + a_0 f_{0,i}$$

with $f_{j,i} = c^{jt} \bmod p$ and $t = \lfloor p^i/m \rfloor$.

Proof Taking into account that $a^p = a$ for any $a$ in $Z_p$ (Fermat's little theorem) and that $(\alpha + \beta)^p = \alpha^p + \beta^p$ for any $\alpha$ and $\beta$ belonging to a finite field of characteristic $p$, one deduces that

$$a^{p^i} = a,\ \forall a \in Z_p \quad \text{and} \quad (\alpha + \beta)^{p^i} = \alpha^{p^i} + \beta^{p^i},\ \forall \alpha, \beta \in GF(p^m)$$

Thus

$$(a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \cdots + a_1 x + a_0)^{p^i} = a_{m-1}x^{(m-1)p^i} + a_{m-2}x^{(m-2)p^i} + \cdots + a_1 x^{p^i} + a_0$$

Then, according to Eq. (6.28), $p^i \equiv 1 \bmod m$, that is, $p^i = tm + 1$, so that

$$a_j x^{jp^i} = a_j (x^{tm+1})^j = a_j (x^m)^{jt} x^j = a_j c^{jt} x^j = a_j f_{j,i}\, x^j$$

The Frobenius constants $f_{j,i}$ can be computed in advance, so that the factor $h(x)^{r-1}$ of Algorithm 6.6, that is,

$$h(x)^{r-1} = h(x)^{p}\, h(x)^{p^2} \cdots h(x)^{p^{m-1}}$$

can be computed using Property 6.1. In the following algorithm the function frobenius(j, i) returns $f_{j,i}$.

Algorithm 6.7—mod f(x) division, optimal extension field, version 1
    e := one;
    for I in 1 .. M-1 loop
      for J in 0 .. M-1 loop
        D(J) := (H(J)*Frobenius(J,I)) mod P;
      end loop;
      E := Product_Mod_F(E, D, F);
    end loop;
    A := Product_Mod_F(E, H, F);
    e := Product(E, invert(a(0)));
    Z := Product_Mod_F(e, G, F);
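Property 6.1 and Algorithm 6.7 can be checked numerically on a small field. The Python sketch below is an illustrative model under stated assumptions: the helper names are hypothetical, and the toy field GF(7^3) with f(x) = x^3 − 2 (so that 7 mod 3 = 1) is chosen here, not taken from the book, which uses p = 239 and m = 17. It compares the coefficient-wise Frobenius map with a direct exponentiation and then divides as in Algorithm 6.7.

```python
def mul_mod(u, v, f, p):
    """Product u*v mod f over Z_p; coefficient lists, lowest degree first."""
    m = len(f) - 1
    prod = [0] * (2 * m - 1)
    for i in range(m):
        for j in range(m):
            prod[i + j] = (prod[i + j] + u[i] * v[j]) % p
    for k in range(2 * m - 2, m - 1, -1):
        t = prod[k]
        for j in range(m + 1):
            prod[k - m + j] = (prod[k - m + j] - t * f[j]) % p
    return prod[:m]


def poly_pow(a, n, f, p):
    """a(x)^n mod f by square-and-multiply (reference for the check), n >= 1."""
    result = [1] + [0] * (len(a) - 1)
    for bit in bin(n)[2:]:
        result = mul_mod(result, result, f, p)
        if bit == "1":
            result = mul_mod(result, a, f, p)
    return result


def frobenius_constant(j, i, c, p, m):
    """f_{j,i} = c^(j*t) mod p with t = floor(p^i / m)."""
    t = p**i // m
    return pow(c, j * t, p)


def frobenius_power(a, i, c, p):
    """a(x)^(p^i) in the OEF defined by x^m - c, computed coefficient by coefficient."""
    m = len(a)
    return [a[j] * frobenius_constant(j, i, c, p, m) % p for j in range(m)]


def oef_divide(g, h, c, p):
    """z = g/h in Z_p[x]/(x^m - c), following the structure of Algorithm 6.7."""
    m = len(h)
    f = [(-c) % p] + [0] * (m - 1) + [1]
    e = [1] + [0] * (m - 1)
    for i in range(1, m):                       # e = h^p * h^(p^2) * ... * h^(p^(m-1))
        e = mul_mod(e, frobenius_power(h, i, c, p), f, p)
    a = mul_mod(e, h, f, p)                     # h^r, a constant of Z_p
    k = pow(a[0], -1, p)
    return mul_mod([coef * k % p for coef in e], g, f, p)
```

In this model each Frobenius power costs m multiplications mod p instead of a chain of mod f(x) products, which is where the saving of the OEF approach comes from.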

An executable Ada file oef1.adb, including Algorithm 6.7, in the particular case where p = 239 and $f(x) = x^{17} - 2$, is available at www.arithmetic-circuits.org.

The most complex operation of Algorithms 6.6 and 6.7 is the mod f(x) product. In Algorithm 6.6, it is executed approximately $2s \approx 2m\log_2 p$ times, while in Algorithm 6.7 it is executed approximately $m$ times. Furthermore, it is worthwhile to look for optimized computation schemes. For instance, if $m$ is odd, then $r - 1 = p + p^2 + \cdots + p^{m-1}$ can be decomposed under the form $(p + p^2 + \cdots + p^u) + (p + p^2 + \cdots + p^u)p^u$, where $u = (m - 1)/2$, and the computation of $h(x)^{r-1}$ can be performed as follows:

$$e(x) = h(x)^{p}\, h(x)^{p^2} \cdots h(x)^{p^u}$$

$$h(x)^{r-1} = e(x)\, e(x)^{p^u}$$

In this way the number of mod f(x) products is approximately equal to $m/2$. By recursively using this type of decomposition, the number of mod f(x) products can be reduced to about $\log_2 m$. As an example consider the case where m = 17. Then $h(x)^{r-1}$ can be computed according to the following scheme ([WBP00][DCW00]):

$$
\begin{aligned}
d_0(x) &= h(x) \\
d_1(x) &= h(x)^{p} \\
d_2(x) &= d_0(x)\,d_1(x) = h(x)^{1+p} \\
d_3(x) &= d_2(x)^{p^2} = h(x)^{p^2+p^3} \\
d_4(x) &= d_2(x)\,d_3(x) = h(x)^{1+p+p^2+p^3} \\
d_5(x) &= d_4(x)^{p^4} = h(x)^{p^4+p^5+p^6+p^7} \\
d_6(x) &= d_4(x)\,d_5(x) = h(x)^{1+p+p^2+\cdots+p^7} \\
d_7(x) &= d_6(x)^{p^8} = h(x)^{p^8+p^9+\cdots+p^{15}} \\
d_8(x) &= d_6(x)\,d_7(x) = h(x)^{1+p+p^2+\cdots+p^{15}} \\
d_9(x) &= d_8(x)^{p} = h(x)^{p+p^2+\cdots+p^{16}} = h(x)^{r-1}
\end{aligned}
$$

It includes 4 mod f(x) products.
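The exponent bookkeeping of this scheme can be verified with plain integer arithmetic: a product of powers of h(x) adds exponents, and a Frobenius map x → x^(p^i) multiplies the exponent by p^i. The short Python check below is a sketch written for this purpose. The function name `chain_exponent` is hypothetical.

```python
def chain_exponent(p):
    """Exponent of h(x) reached by the chain d0 .. d9 for m = 17."""
    d0 = 1                 # h
    d1 = d0 * p            # h^p (Frobenius, i = 1)
    d2 = d0 + d1           # product 1
    d3 = d2 * p**2         # Frobenius, i = 2
    d4 = d2 + d3           # product 2
    d5 = d4 * p**4         # Frobenius, i = 4
    d6 = d4 + d5           # product 3
    d7 = d6 * p**8         # Frobenius, i = 8
    d8 = d6 + d7           # product 4
    d9 = d8 * p            # Frobenius, i = 1
    return d9


p, m = 239, 17
r_minus_1 = sum(p**i for i in range(1, m))   # p + p^2 + ... + p^16
print(chain_exponent(p) == r_minus_1)
```

The comparison holds for any p, since d8 accumulates 1 + p + ... + p^15 and the last Frobenius map shifts every term by one power of p.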

Algorithm 6.8—mod f(x) division, optimal extension field, m = 17, version 2
    a := h;
    for j in 0 .. m-1 loop e(J) := (a(J)*Frobenius(j,1)) mod P; end loop;
    a := product_mod_f(e, a, f);
    for j in 0 .. m-1 loop e(J) := (a(J)*Frobenius(j,2)) mod P; end loop;
    a := product_mod_f(e, a, f);
    for j in 0 .. m-1 loop e(J) := (a(J)*Frobenius(j,4)) mod P; end loop;
    a := product_mod_f(e, a, f);
    for j in 0 .. m-1 loop e(J) := (a(J)*Frobenius(j,8)) mod P; end loop;
    a := product_mod_f(e, a, f);
    for j in 0 .. m-1 loop e(J) := (a(J)*Frobenius(j,1)) mod P; end loop;
    A := Product_Mod_F(E, H, F);
    e := Product(E, invert(a(0)));
    Z := Product_Mod_F(e, G, F);

An executable Ada file oef2.adb, including Algorithm 6.8, in the particular case where p = 239 and $f(x) = x^{17} - 2$, is available at www.arithmetic-circuits.org. An example of datapath corresponding to Algorithm 6.8 is shown in Fig. 6.4. The total computation time approximately amounts to six times the computation time of a mod f(x) multiplier. A VHDL model has been generated. The complete VHDL file oef.vhd is available at www.arithmetic-circuits.org. The entity declaration is

    entity oef is
    port(
      g, h: in polynomial;
      clk, reset, start: in std_logic;
      z: out polynomial;
      done: out std_logic
    );
    end oef;

FIGURE 6.4 Division over an optimal extension field.

The VHDL architecture corresponding to the circuit of Fig. 6.4 follows:

    main_iteration: for j in 1 to m-1 generate
      with sel_f select f_coef(j) <= f1(j) when "00", f2(j) when "01", f4(j) when "10", f8(j) when others;
      with sel_e select op1(j) <= a(j) when '0', e(j) when others;
      with sel_e select op2(j) <= f_coef(j) when '0', inv_a0 when others;
      mod_239_multipliers: mod_239_multiplier port map(op1(j), op2(j), next_e(j));
    end generate;
    mod_239_multiplier2: mod_239_multiplier port map(e(0), inv_a0, mult_out);
    with sel_e select next_e(0) <= a(0) when '0', mult_out when others;
    with sel_a select ahg <= a when "00", h when "01", g when others;
    mod_f_multiplier1: LSE_first_mod_f_mult port map(e, ahg, clk, reset, start_mult, next_a, mult_done);
    z <= next_a;
    inverter: mod_239_inverter port map(clk, a(0), inv_a0);
    register_e: process(clk)
    begin
      if clk'event and clk = '1' then
        if ce_e = '1' then e <= next_e; end if;
      end if;
    end process;
    register_a: process(clk)
    begin
      if clk'event and clk = '1' then
        if load = '1' then a <= h;
        elsif ce_a = '1' then a <= next_a;
        end if;
      end if;
    end process;

The complete model additionally includes a control unit.

| Divider | FFs | LUTs | Slices | Mult | RAM | Period (ns) | Cycles | Total time (ns) |
|---|---|---|---|---|---|---|---|---|
| Pseudo Euclidean | 871 | 3,923 | 2,272 | 39 | 1 | 36 | 147 | 5,292 |
| Binary | 623 | 3,235 | 2,001 | 37 | – | 56 | 37 | 2,072 |
| Reduction to multiplications (MSE) | 562 | 2,607 | 1,594 | 34 | 1 | 25 | 7,602 | 190,050 |
| Reduction to multiplications (LSE) | 672 | 2,794 | 1,754 | 35 | 1 | 19 | 4,202 | 79,838 |
| Optimal extension field (MSE) | 603 | 2,873 | 1,609 | 34 | 1 | 25 | 235 | 5,875 |
| Optimal extension field (LSE) | 715 | 3,268 | 1,894 | 35 | 1 | 19 | 133 | 2,527 |

TABLE 6.1 Cost and Delay of Dividers over GF(239^17)

6.5 FPGA Implementations

Several dividers over GF(239^17) have been implemented within Spartan3 (speed-5) programmable devices, namely: pseudo Euclidean algorithm (Algorithm 6.3), binary algorithm (Algorithm 6.5), reduction to multiplications (Algorithm 6.6) with either LSE-first or MSE-first multipliers, optimal extension field (Algorithm 6.8) with either LSE-first or MSE-first multipliers. Their costs and delays are shown in Table 6.1. As noted previously, the times (period, total time) are expressed in ns, and the parameters FFs, LUTs, Mult and RAM represent the numbers of flip-flops, look-up tables, embedded 18-bit-by-18-bit multipliers, and RAM blocks, respectively. All the source files are available at www.arithmetic-circuits.org.

6.6 Comments and Conclusions

The binary algorithm is a good option as it gives the fastest circuit, with a number of slices similar to that of other options, and furthermore does not need RAM blocks for storing the inverses mod 239. Another advantage is that it can be used for any extension field, not necessarily an optimal one. If the delay is not an issue, then the reduction to multiplications can also be considered.

6.7 References

[DCW00] J. Domingo-Ferrer, D. Chan, and A. Watson, eds. Smart Card Research and Advanced Applications. Kluwer, Dordrecht, Netherlands, 2000.
[DS06] J.-P. Deschamps and G. Sutter. "Hardware Implementation of Finite-Field Division." Acta Applicandae Mathematicae, vol. 93, pp. 119–147, September 2006.
[HMV04] D. Hankerson, A. Menezes, and S. Vanstone. Guide to Elliptic Curve Cryptography. Springer, New York, 2004.
[Kob94] N. Koblitz. A Course in Number Theory and Cryptography. Springer-Verlag, New York, 1994.
[MOV96] A. J. Menezes, P. C. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. CRC Press, Boca Raton, Florida, 1996.
[WBP00] A. D. Woodbury, D. V. Bailey, and C. Paar. "Elliptic curve cryptography on smart cards without coprocessors," in [DCW00], pp. 71–92, 2000.
[WWSH02] Ch.-H. Wu, Ch.-M. Wu, M.-D. Shieh, and Y.-T. Hwang. "Novel Algorithm and VLSI Design for Division over GF(2^m)." IEICE Transactions Fundamentals, vol. E85-A, no. 5, pp. 1129–1139, May 2002.

CHAPTER 7 Operations over $GF(2^m)$—Polynomial Bases

The goal of this chapter is the study of the operations over the binary field $GF(2^m)$, where the elements of the finite field are represented in the polynomial basis. Let $\alpha \in GF(2^m)$ be a root of the irreducible polynomial $f(x) = x^m + f_{m-1}x^{m-1} + \cdots + f_1 x + f_0$ over $GF(2)$. Then, the set $\{1, \alpha, \ldots, \alpha^{m-1}\}$ constitutes the polynomial basis in $GF(2^m)$. With polynomial basis, the elements in $GF(2^m)$ can be represented as polynomials of degree at most $m - 1$ in the form $GF(2^m) = \{a(\alpha) \mid a(\alpha) = a_{m-1}\alpha^{m-1} + \cdots + a_1\alpha + a_0,\ a_i \in GF(2)\}$, where the coefficients $a_i$ are the polynomial basis coordinates in $GF(2)$. Polynomial basis can also be represented as the set $\{1, x, x^2, \ldots, x^{m-1}\}$ and, therefore, $GF(2^m) = \{a(x) \mid a(x) = a_{m-1}x^{m-1} + \cdots + a_1 x + a_0,\ a_i \in GF(2)\}$.

Arithmetic operations in $GF(2^m)$ are performed modulo an irreducible polynomial $f(x)$ over $GF(2)$. Addition of polynomials is carried out under modulo 2 arithmetic. Therefore, the addition of two polynomials becomes the bitwise exclusive-or (XOR) of their binary representations. Subtraction is exactly the same as addition in modulo 2 arithmetic, so $1 - x$ equals $1 + x$.

Among the $GF(2^m)$ arithmetic operations, multiplication is usually considered the most important, complex, and time-consuming operation. Complexity could depend on many factors, such as the selection of the irreducible polynomial or the basis selected for the representation of the field elements. A number of efficient $GF(2^m)$ multiplication approaches and architectures have been proposed in which different basis representations of field elements are used. Among them, the most widely used are the polynomial (or standard or canonical) basis and normal basis, although other bases can also be used. The complexity of basis conversion is heavily dependent on the irreducible polynomial selected. If the polynomial is adequately chosen, the basis conversion is a simple operation.

Polynomial basis is more promising in the sense that it gives designers more freedom on irreducible polynomial selection and hardware optimization. Among the important irreducible polynomials usually selected, trinomials, pentanomials, ESPs (equally-spaced polynomials), and AOPs (all-one polynomials) can be considered.

7.1 Multiplication

In Chap. 5, the multiplication modulo f(x) has been dealt with in the general case where the basic field is $Z_p$. In this section, we will consider as basic field the binary field $Z_2 = GF(2)$. Let f(x) be a degree m irreducible polynomial over GF(2) in the form

$$f(x) = x^m + f_{m-1}x^{m-1} + \cdots + f_1 x + f_0 \tag{7.1}$$

where $f_i \in GF(2) = \{0, 1\}$. Then, the set $\{1, x, \ldots, x^{m-1}\}$ is the polynomial basis in $GF(2^m)$, and we can represent arbitrary elements in $GF(2^m)$ defined by f(x) as $a(x) = a_{m-1}x^{m-1} + \cdots + a_1 x + a_0$, where $a_i \in GF(2)$. Let a(x) and b(x) be two field elements and c(x) be their product. Then,

$$c(x) = a(x)\,b(x) \bmod f(x) \tag{7.2}$$

Thus, the polynomial basis multiplication involves two steps: polynomial multiplication and reduction modulo an irreducible polynomial. The product d(x) of the polynomials representing the field elements a(x) and b(x), d(x) = a(x)b(x), is a degree $2m - 2$ polynomial. In the modular reduction c(x) = d(x) mod f(x), the degree $2m - 2$ polynomial d(x) is reduced by the degree m irreducible polynomial f(x) iteratively. The choice of the irreducible polynomial f(x) may ease the modular reduction. Sparse irreducible polynomials having fewer nonzero terms are usually preferred for efficiency. In Sec. 7.6, important irreducible polynomials will be considered. Several algorithms can be used for the implementation of the field multiplication given in Eq. (7.2).

7.1.1 Two-Step Classic Multiplication

The two-step classic multiplication in $GF(2^m)$ is a straightforward translation of the classic school multiplication algorithm, and corresponds to the binary version of the two-step multiplication given in Chap. 5. In the two-step multiplication, the field product c(x) given in Eq. (7.2) involves two steps: polynomial multiplication and reduction modulo an irreducible polynomial. The product d(x) of the polynomials a(x) and b(x), d(x) = a(x)b(x), is a polynomial with maximum degree $2m - 2$. Polynomial multiplication d(x) can be written in matrix form [RSDK06] as:

$$
\begin{pmatrix} d_0 \\ d_1 \\ d_2 \\ \vdots \\ d_{m-2} \\ d_{m-1} \\ d_m \\ d_{m+1} \\ \vdots \\ d_{2m-3} \\ d_{2m-2} \end{pmatrix}
=
\begin{pmatrix}
a_0 & 0 & 0 & \cdots & 0 & 0 \\
a_1 & a_0 & 0 & \cdots & 0 & 0 \\
a_2 & a_1 & a_0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
a_{m-2} & a_{m-3} & a_{m-4} & \cdots & a_0 & 0 \\
a_{m-1} & a_{m-2} & a_{m-3} & \cdots & a_1 & a_0 \\
0 & a_{m-1} & a_{m-2} & \cdots & a_2 & a_1 \\
0 & 0 & a_{m-1} & \cdots & a_3 & a_2 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & a_{m-1} & a_{m-2} \\
0 & 0 & 0 & \cdots & 0 & a_{m-1}
\end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_{m-2} \\ b_{m-1} \end{pmatrix}
\tag{7.3}
$$

The above equation is equal to Eq. (5.3) but with coefficients in GF(2). From Eq. (7.3), the coefficients of d(x) are determined by the following expressions:

$$
d_k =
\begin{cases}
\displaystyle\sum_{i=0}^{k} a_i\, b_{k-i}; & k = 0, \ldots, m-1 \\[2ex]
\displaystyle\sum_{i=k}^{2m-2} a_{k-i+(m-1)}\, b_{i-(m-1)}; & k = m, \ldots, 2m-2
\end{cases}
\tag{7.4}
$$

These expressions are equal to Eq. (5.4) along with addition and multiplication in GF(2). Assume that the functions

    function m2xor(x, y: bit) return bit
    function m2and(x, y: bit) return bit

computing x XOR y (addition mod 2) and x AND y (multiplication mod 2) have been defined. Then the function

    function poly_multiplication(A,B: poly_vector) return poly2_vector

performing the polynomial multiplication of a(x) and b(x), d(x) = a(x)b(x), can be implemented using Eq. (7.4) as follows

    for i in 0 .. 2*m-2 loop d(i) := 0; end loop;
    for k in 0 .. m-1 loop
      for i in 0 .. k loop
        d(k) := m2xor(d(k),m2and(a(i),b(k-i)));
      end loop;
    end loop;
    for k in m .. 2*m-2 loop
      for i in k .. 2*m-2 loop
        d(k) := m2xor(d(k),m2and(a(k-i+(m-1)),b(i-(m-1))));
      end loop;
    end loop;
    return d;

where poly_vector and poly2_vector are bit vectors from 0 to m – 1, and 0 to 2m – 2, respectively.

The total gate complexity for the bit-parallel computation of the matrix-vector product given in Eq. (7.3) is $m^2$ AND gates and $(m - 1)^2$ XOR gates ([RSDK06], [PL07]). The AND gates operate all in parallel and require a single AND gate delay $T_{AND}$, while the XOR gates are organized as a binary tree of depth $\lceil \log_2 j \rceil$ in order to add $j$ operands. The total time complexity is then found by considering the largest number of terms, which is equal to $m$ for the computation of $d_{m-1}$. Therefore, the total delay complexity for the bit-parallel matrix-vector product is $T_{AND} + \lceil \log_2 m \rceil\, T_{XOR}$.

After the above polynomial multiplication d(x) = a(x)b(x), a reduction modulo an irreducible polynomial f(x) must be performed. In modular reduction c(x) = d(x) mod f(x), the degree $2m - 2$ polynomial d(x) is reduced by the degree m irreducible polynomial f(x), resulting in a polynomial c(x) with degree $\deg(c(x)) \le m - 1$:

$$c(x) = d(x) \bmod f(x) = (d_{2m-2}x^{2m-2} + \cdots + d_1 x + d_0) \bmod f(x) = c_{m-1}x^{m-1} + \cdots + c_1 x + c_0 \tag{7.5}$$
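The coefficient formula of Eq. (7.4) and the gate counts quoted above can be cross-checked with a small software model. The Python sketch below is illustrative only. The function name matches the Ada function for readability, but the returned gate counters are an addition made here, counting one AND per partial product and one XOR less than the number of terms per coefficient.

```python
def poly_multiplication(a, b):
    """d(x) = a(x)b(x) over GF(2) following Eq. (7.4); also returns (ANDs, XORs)."""
    m = len(a)
    d = [0] * (2 * m - 1)
    ands = xors = 0
    for k in range(2 * m - 1):
        if k <= m - 1:                       # first case of Eq. (7.4)
            terms = [a[i] & b[k - i] for i in range(k + 1)]
        else:                                # second case of Eq. (7.4)
            terms = [a[k - i + (m - 1)] & b[i - (m - 1)]
                     for i in range(k, 2 * m - 1)]
        acc = 0
        for t in terms:                      # XOR tree modeled as a plain fold
            acc ^= t
        d[k] = acc
        ands += len(terms)
        xors += len(terms) - 1
    return d, ands, xors


m = 4
d, ands, xors = poly_multiplication([1, 1, 0, 0], [1, 1, 0, 0])   # (1 + x)^2
print(d, ands == m * m, xors == (m - 1) ** 2)
```

For m = 4 the counters give 16 AND and 9 XOR operations, in agreement with $m^2$ and $(m-1)^2$.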

Reduction modulo f(x) can be viewed as a linear mapping of the $2m - 1$ coefficients of d(x) into the $m$ coefficients of c(x). This mapping can be represented in a matrix notation as follows [Paa94]:

$$
\begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{m-1} \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & \cdots & 0 & r_{0,0} & \cdots & r_{0,m-2} \\
0 & 1 & \cdots & 0 & r_{1,0} & \cdots & r_{1,m-2} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 & r_{m-1,0} & \cdots & r_{m-1,m-2}
\end{pmatrix}
\begin{pmatrix} d_0 \\ \vdots \\ d_{m-1} \\ d_m \\ \vdots \\ d_{2m-2} \end{pmatrix}
\tag{7.6}
$$

It must be noted that Eq. (7.6) matches Eq. (5.6) given for $Z_p[x]/f(x)$. The matrix in Eq. (7.6) consists of an $(m \times m)$ identity matrix and an $(m \times (m - 1))$ matrix R named reduction matrix. The R matrix is a function only of the irreducible polynomial $f(x) = x^m + f_{m-1}x^{m-1} + \cdots + f_1 x + f_0$. Therefore, a reduction matrix R is uniquely assigned to every f(x). The coefficients $r_{j,i} \in GF(2)$ of R can be recursively computed as a function of f(x) as follows:

$$
r_{j,i} =
\begin{cases}
f_j; & j = 0, \ldots, m-1;\ i = 0 \\
r_{j-1,i-1} + r_{m-1,i-1}\, r_{j,0}; & j = 0, \ldots, m-1;\ i = 1, \ldots, m-2
\end{cases}
\tag{7.7}
$$

where $r_{j-1,i-1} = 0$ if $j = 0$. It must also be noted that Eq. (7.7) is the binary version of Eq. (5.7). Since R depends only on the selected irreducible polynomial, the complexity of the reduction can be lowered by choosing an appropriate reduction polynomial f(x).

The function

    function reduction_matrix_R(f: poly_vector) return poly_matrix_m1m2

computing the reduction matrix R can be implemented using Eq. (7.7) as follows

    for j in 0 .. m-1 loop
      for i in 0 .. m-2 loop R(j,i) := 0; end loop;
    end loop;
    for j in 0 .. m-1 loop R(j,0) := f(j); end loop;
    for i in 1 .. m-2 loop
      for j in 0 .. m-1 loop
        if j = 0 then
          R(j,i) := m2and(R(m-1,i-1),R(j,0));
        else
          R(j,i) := m2xor(R(j-1,i-1),m2and(R(m-1,i-1), R(j,0)));
        end if;
      end loop;
    end loop;
    return R;

where poly_matrix_m1m2 is an (m × m – 1) matrix of bits. Finally, the two-step classic multiplication performing c(x) = a(x)b(x) mod f(x) = d(x) mod f(x) using Eq. (7.6) and the reduction matrix computed with Eq. (7.7) can be given, where the previously defined functions poly_multiplication and reduction_matrix_R are used.

Algorithm 7.1—Classic multiplication
    d := poly_multiplication(a,b);
    R := reduction_matrix_R(f);
    for j in 0 .. m-1 loop c(j) := d(j); end loop;
    for j in 0 .. m-1 loop
      for i in 0 .. m-2 loop
        c(j) := m2xor(c(j),m2and(R(j,i),d(m+i)));
      end loop;
    end loop;
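A compact software model may help to see how the reduction matrix of Eq. (7.7) and Algorithm 7.1 fit together. The Python sketch below is an illustration under assumptions: the function names are hypothetical, bit vectors are Python lists with the lowest degree first, `f` holds only the m low-order coefficients of f(x), and the example field GF(2^4) with f(x) = x^4 + x + 1 is chosen here for its small size.

```python
def reduction_matrix(f):
    """R of Eq. (7.7): column i holds the coefficients of x^(m+i) mod f(x)."""
    m = len(f)
    R = [[0] * (m - 1) for _ in range(m)]
    for j in range(m):
        R[j][0] = f[j]
    for i in range(1, m - 1):
        for j in range(m):
            shifted = R[j - 1][i - 1] if j > 0 else 0
            R[j][i] = shifted ^ (R[m - 1][i - 1] & R[j][0])
    return R


def classic_multiply(a, b, f):
    """c(x) = a(x)b(x) mod f(x) over GF(2): full product, then reduction by R."""
    m = len(a)
    d = [0] * (2 * m - 1)
    for i in range(m):                      # schoolbook product, Eq. (7.3)
        for j in range(m):
            d[i + j] ^= a[i] & b[j]
    R = reduction_matrix(f)
    c = d[:m]
    for j in range(m):                      # Eq. (7.6): c = d_low + R * d_high
        for i in range(m - 1):
            c[j] ^= R[j][i] & d[m + i]
    return c


f = [1, 1, 0, 0]                            # f(x) = x^4 + x + 1
print(reduction_matrix(f))
print(classic_multiply([0, 0, 0, 1], [0, 1, 0, 0], f))   # x^3 * x
```

With this f(x), the three columns of R are the coordinates of x^4 = x + 1, x^5 = x^2 + x and x^6 = x^3 + x^2, so a sparse f(x) yields a sparse R and fewer XOR gates in the reducer.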

An executable Ada file classic_multiplication.adb, including Algorithm 7.1, is available at www.arithmetic-circuits.org. Algorithm 7.1 is the binary version of Algorithm 5.5. A VHDL model for the classic multiplication algorithm (Algorithm 7.1) is given in the file classic_multiplier.vhd which is available at www.arithmetic-circuits.org. This model includes two components poly_multiplier and poly_reducer that implement the polynomial multiplication and the reduction modulo f(x), respectively. The datapaths for these two components are shown in Fig. 7.1.

FIGURE 7.1 Datapaths for Poly_multiplier and Poly_reducer in classic multiplication.

The entity declaration of the classic multiplier given in the file classic_multiplier.vhd is

    entity classic_multiplication is
    port (
      a, b: in std_logic_vector(M-1 downto 0);
      c: out std_logic_vector(M-1 downto 0)
    );
    end classic_multiplication;

The VHDL architecture is the following:

    inst_mult: poly_multiplier port map(a => a, b => b, d => d);
    inst_reduc: poly_reducer port map(d => d, c => c);

That is the simple instantiation of the poly_multiplier and poly_reducer components given in Fig. 7.1.

7.1.2 Karatsuba-Ofman Polynomial Multiplication

The Karatsuba-Ofman algorithm ([KO63], [RMSC06]) is a recursive method for efficient polynomial multiplication or efficient multiplication in positional number systems. It is known [Paa94] that two arbitrary polynomials in one variable of degree less than or equal to $m - 1$ with coefficients from a field $GF(2^m)$ can be multiplied with not more than $m^2$ multiplications in $GF(2^m)$ and $(m - 1)^2$ additions in $GF(2^m)$. The Karatsuba-Ofman algorithm provides a recursive algorithm which reduces the above multiplicative and additive (for large enough m) complexities. A Karatsuba-Ofman algorithm restricted to polynomials, where $m = 2^t$ with $t$ an integer, is given in [Paa94] and is outlined as follows: Let a(x) and b(x) be two elements in $GF(2^m)$. We are interested in finding the product d(x) = a(x)b(x), with degree $\le 2m - 2$. Both elements can be represented in the polynomial basis as

$$
\begin{aligned}
a(x) &= x^{m/2}\,(x^{m/2-1}a_{m-1} + \cdots + a_{m/2}) + (x^{m/2-1}a_{m/2-1} + \cdots + a_0) = x^{m/2}A_H + A_L \\
b(x) &= x^{m/2}\,(x^{m/2-1}b_{m-1} + \cdots + b_{m/2}) + (x^{m/2-1}b_{m/2-1} + \cdots + b_0) = x^{m/2}B_H + B_L
\end{aligned}
\tag{7.8}
$$

Using Eq. (7.8), the polynomial product is given as

$$d(x) = x^m A_H B_H + x^{m/2}(A_H B_L + A_L B_H) + A_L B_L \tag{7.9}$$

Let us define the following auxiliary polynomials $M^{(1)}(x)$:

$$
\begin{aligned}
M_0^{(1)}(x) &= A_L(x)\,B_L(x) \\
M_1^{(1)}(x) &= [A_L(x) + A_H(x)]\,[B_L(x) + B_H(x)] \\
M_2^{(1)}(x) &= A_H(x)\,B_H(x)
\end{aligned}
\tag{7.10}
$$

Then the product given in Eq. (7.9) can be obtained by:

$$d(x) = x^m M_2^{(1)}(x) + x^{m/2}\,[M_1^{(1)}(x) + M_0^{(1)}(x) + M_2^{(1)}(x)] + M_0^{(1)}(x) \tag{7.11}$$

The algorithm becomes recursive if it is applied again to the polynomials given in Eq. (7.10). The next iteration step splits the polynomials $A_L$, $B_L$, $A_H$, $B_H$, $(A_L + A_H)$, and $(B_L + B_H)$ again in half. With these newly halved polynomials, new auxiliary polynomials $M^{(2)}(x)$ can be defined in a similar way to Eq. (7.10). In the final step the polynomials $M^{(t)}(x)$ degenerate into single coefficients. Since every step halves the number of coefficients, the algorithm terminates after $t = \log_2 m$ steps.

A VHDL model for the Karatsuba-Ofman multiplication (for m even) is given in the file Karatsuba_multiplier_even.vhd, which is available at www.arithmetic-circuits.org. This model includes the component polynom_multiplier that implements the polynomial multiplication as given in Section 7.1.1.

    entity karatsuba_multiplier_even is
    port (
      a, b: in std_logic_vector(M-1 downto 0);
      d: out std_logic_vector(2*M-2 downto 0)
    );
    end karatsuba_multiplier_even;

The VHDL architecture is the following:

    mult1: polynom_multiplier generic map(M => half_M)
      port map(a => a(half_M-1 downto 0), b => b(half_M-1 downto 0), d => x0y0);
    mult2: polynom_multiplier generic map(M => half_M)
      port map(a => a(M-1 downto half_M), b => b(M-1 downto half_M), d => x1y1);
    mult3: polynom_multiplier generic map(M => half_M)
      port map(a => x0_p_X1, b => y0_p_y1, d => x01y01);
    gen_x0x1y0y1: for i in 0 to half_M-1 generate
      x0_p_X1(i) <= a(i) xor a(i + half_M);
      y0_p_y1(i) <= b(i) xor b(i + half_M);
    end generate;
    gen_prod1: for i in 0 to half_M-2 generate
      d(half_M+i) <= x01y01(i) xor x0y0(i) xor x1y1(i) xor x0y0(i+half_M);
    end generate;
    d(2*half_M-1) <= x01y01(half_M-1) xor x0y0(half_M-1) xor x1y1(half_M-1);
    gen_prod2: for i in half_M to 2*half_M-2 generate
      d(half_M+i) <= x01y01(i) xor x0y0(i) xor x1y1(i) xor x1y1(i-half_M);
    end generate;
    d(3*half_M-1) <= x1y1(half_M-1);
    d(half_M-1 downto 0) <= x0y0(half_M-1 downto 0);
    d(2*M-2 downto 3*half_M) <= x1y1(2*half_M-2 downto half_M);

The above model of the Karatsuba-Ofman multiplier, restricted to an even m, has been included for simplicity. However, a VHDL file Karatsuba_multiplier.vhd, modeling the Karatsuba-Ofman multiplication for any m (even or odd), is also available at www.arithmetic-circuits.org.
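The recursion of Eqs. (7.8) to (7.11) can also be written as a few lines of software. The following Python sketch is an assumption-laden illustration, not a translation of the VHDL files: polynomials over GF(2) are stored as Python integers (bit i is the coefficient of x^i), m is assumed to be a power of two, and the function names `clmul` and `karatsuba` are hypothetical.

```python
def clmul(a, b):
    """Schoolbook (shift-and-XOR) product of two GF(2) polynomials, for reference."""
    d = 0
    while b:
        if b & 1:
            d ^= a
        a <<= 1
        b >>= 1
    return d


def karatsuba(a, b, m):
    """Product of two GF(2) polynomials of degree < m, with m a power of two."""
    if m == 1:
        return a & b                         # single coefficients: one AND
    half = m // 2
    mask = (1 << half) - 1
    a_low, a_high = a & mask, a >> half      # a = x^(m/2) * A_H + A_L, Eq. (7.8)
    b_low, b_high = b & mask, b >> half
    m0 = karatsuba(a_low, b_low, half)                   # M0 = A_L * B_L
    m1 = karatsuba(a_low ^ a_high, b_low ^ b_high, half) # M1 = (A_L+A_H)(B_L+B_H)
    m2 = karatsuba(a_high, b_high, half)                 # M2 = A_H * B_H
    return (m2 << m) ^ ((m1 ^ m0 ^ m2) << half) ^ m0     # Eq. (7.11)


print(karatsuba(0b1011, 0b0110, 4) == clmul(0b1011, 0b0110))
```

Each level replaces four half-size products by three, at the price of extra XORs to form the sums and to recombine the partial products.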

7.1.3 Interleaved Multiplication

The simplest algorithm for $GF(2^m)$ multiplication is the shift-and-add method [Knu81] with the reduction step interleaved ([GGKPP06], [RSDK06]). Multiplication of two elements a(x), b(x) in $GF(2^m)$ can be expressed as:

$$c(x) = a(x)\,b(x) \bmod f(x) = a(x)\left(\sum_{i=0}^{m-1} b_i x^i\right) \bmod f(x) = \left(\sum_{i=0}^{m-1} b_i\, a(x)\, x^i\right) \bmod f(x) \tag{7.12}$$

Therefore, the product c(x) can be computed as

$$c(x) = (b_0 a(x) + b_1 a(x)x + b_2 a(x)x^2 + \cdots + b_{m-1}a(x)x^{m-1}) \bmod f(x) \tag{7.13}$$

In order to compute Eq. (7.13), a quantity of the form $x\,a(x)$, where $a(x) = a_{m-1}x^{m-1} + \cdots + a_1 x + a_0$, with $a_i \in GF(2)$, has to be reduced modulo f(x). The product $d = x\,a(x)$ can be computed as follows:

$$d = x(a_0 + a_1 x + \cdots + a_{m-1}x^{m-1}) = a_0 x + a_1 x^2 + \cdots + a_{m-1}x^m \tag{7.14}$$

Using the fact that $f(x) = x^m + f_{m-1}x^{m-1} + \cdots + f_1 x + f_0$, we have $x^m \equiv f_0 + f_1 x + \cdots + f_{m-1}x^{m-1} \pmod{f(x)}$, where the $f_i$ are the coefficients of the irreducible polynomial. Substituting this expression into Eq. (7.14) we obtain

$$d = d_0 + d_1 x + \cdots + d_{m-1}x^{m-1} \tag{7.15}$$

where

$$d_0 = a_{m-1} f_0, \qquad d_i = a_{i-1} + a_{m-1} f_i, \quad i = 1, 2, \ldots, m - 1 \tag{7.16}$$

It can be noted that Eq. (7.16) is the binary $GF(2^m)$ version of Eq. (5.10). Assume that the function

    function Product_alpha_A(a,f: poly_vector) return poly_vector

implementing Eq. (7.14) according to Eqs. (7.15) and (7.16), and therefore computing the polynomial $x\,a(x) \bmod f(x)$, has been defined, where poly_vector is a bit vector from 0 to m – 1. Assume also that the functions

    function m2abv(x: bit; y: poly_vector) return poly_vector
    function m2xvv(x, y: poly_vector) return poly_vector

returning the multiplication of a bit x by a bit vector y ($x$ AND $y_0$, $x$ AND $y_1$, . . . , $x$ AND $y_{m-1}$), and the bit-wise XOR of two bit vectors ($x_0$ XOR $y_0$, $x_1$ XOR $y_1$, . . . , $x_{m-1}$ XOR $y_{m-1}$), respectively, have also been defined. When the bits of b(x) are processed from the most-significant bit to the least-significant bit, the shift-and-add method is called the most-significant-bit (MSB) first, bit-serial multiplier.

Algorithm 7.2—MSB-first multiplier

    for i in 0 .. m-1 loop c(i) := 0; end loop;
    for i in reverse 0 .. m-1 loop
      -- c := x*c mod f, plus a if bit b(i) is set
      c := m2xvv(Product_alpha_A(c,f), m2abv(b(i),a));
    end loop;
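The MSB-first loop can also be sketched in Python with polynomials packed into integers. This is a sketch under stated assumptions, not the MSBfirst.adb file: bit i of an integer is the coefficient of $x^i$, f includes the $x^m$ term, and the names xtimes and msb_first_mult are hypothetical.

```python
def xtimes(a, f, m):
    """x*a(x) mod f(x); f includes the x^m term."""
    a <<= 1
    if (a >> m) & 1:      # degree reached m: subtract (XOR) f(x)
        a ^= f
    return a


def msb_first_mult(a, b, f, m):
    """a(x)*b(x) mod f(x), scanning b from b_{m-1} down to b_0."""
    c = 0
    for i in reversed(range(m)):
        c = xtimes(c, f, m)       # c := x*c mod f
        if (b >> i) & 1:          # c := c + b_i * a
            c ^= a
    return c


F = 0b11001   # f(x) = x^4 + x^3 + 1
M = 4
p = msb_first_mult(0b1000, 0b0010, F, M)   # x^3 * x = x^4 = 1 + x^3
```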

An executable Ada file MSBfirst.adb, including Algorithm 7.2, is available at www.arithmetic-circuits.org. It is important to note that Algorithm 7.2 is the binary version of the MSE-first multiplier given in Algorithm 5.6. Notice that in Algorithm 7.2, the computation of $b_i a(x)$ and $xc(x) \bmod f(x)$ can be performed in parallel, as they are independent of each other. However, the value of c in each iteration depends on both the value of c at the previous iteration and the value of $b_i a(x)$. This dependency gives the MSB multiplier a longer critical path than that of the least-significant-bit (LSB) multiplier.

In an LSB multiplier, the coefficients of $b(x)$ are processed starting from the least-significant bit $b_0$ and continuing with the remaining coefficients one at a time in ascending order. Multiplication according to this scheme is thus performed in the following way:

$$\begin{aligned} c(x) &= a(x)b(x) \bmod f(x) \\ &= \left(b_0a(x) + b_1a(x)x + b_2a(x)x^2 + \cdots + b_{m-1}a(x)x^{m-1}\right) \bmod f(x) \\ &= \left(b_0a(x) + b_1(a(x)x) + b_2(a(x)x^2) + \cdots + b_{m-1}(a(x)x^{m-1})\right) \bmod f(x) \\ &= \left(b_0a(x) + b_1(a(x)x) + b_2(a(x)x)x + \cdots + b_{m-1}(a(x)x^{m-2})x\right) \bmod f(x) \end{aligned} \tag{7.17}$$

Algorithm 7.3—LSB-first multiplier

    for i in 0 .. m-1 loop c(i) := 0; end loop;
    for i in 0 .. m-1 loop
      c := m2xvv(m2abv(b(i),a),c);   -- c := c + b(i)*a
      a := Product_alpha_A(a,f);     -- a := x*a mod f
    end loop;
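A minimal Python sketch of the LSB-first scheme of Eq. (7.17) follows. It is an illustration rather than the LSBfirst.adb file; the integer packing (bit i is the coefficient of $x^i$, f includes the $x^m$ term) and the name lsb_first_mult are assumptions.

```python
def lsb_first_mult(a, b, f, m):
    """a(x)*b(x) mod f(x), scanning b from b_0 up to b_{m-1}."""
    c = 0
    for i in range(m):
        if (b >> i) & 1:      # c := c + b_i * (a * x^i mod f)
            c ^= a
        a <<= 1               # a := x*a mod f
        if (a >> m) & 1:
            a ^= f
    return c


F = 0b11001   # f(x) = x^4 + x^3 + 1
M = 4
p = lsb_first_mult(0b1000, 0b1000, F, M)   # x^3 * x^3 = x^6
```

The two updates inside the loop read only the old values of c and a, so they can be evaluated in the same clock cycle, which is the property the datapath of this section relies on.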

An executable Ada file LSBfirst.adb, including Algorithm 7.3, is available at www.arithmetic-circuits.org. It can be observed that Algorithm 7.3 is the binary version of the LSE-first multiplier given in Algorithm 5.7. It is also important to note that in the MSB- and LSB-first multiplication schemes, several coefficients could be processed at each step.

A VHDL model for the LSB-first multiplication algorithm is given in the file interleaved_mult.vhd available at www.arithmetic-circuits.org. The datapath corresponding to Algorithm 7.3 is shown in Fig. 7.2. The entity declaration of the LSB-first multiplier given in the file interleaved_mult.vhd is

    entity interleaved_mult is
    port (
      A, B: in std_logic_vector (M-1 downto 0);
      clk, reset, start: in std_logic;
      Z: out std_logic_vector (M-1 downto 0);
      done: out std_logic
    );
    end interleaved_mult;

FIGURE 7.2 Interleaved LSB-first multiplier datapath.

The VHDL architecture corresponding to the circuit of Fig. 7.2 is the following:

    register_A: process(clk)
    begin
      if reset = '1' then aa <= (others => '0');
      elsif clk'event and clk = '1' then
        if inic = '1' then aa <= a;
        else aa <= new_a;
        end if;
      end if;
    end process register_A;

    sh_register_B: process(clk)
    begin
      if reset = '1' then bb <= (others => '0');
      elsif clk'event and clk = '1' then
        if inic = '1' then bb <= b; end if;
        if shift_r = '1' then bb <= '0' & bb(M-1 downto 1); end if;
      end if;
    end process sh_register_B;

    register_C: process(inic, clk)
    begin
      if inic = '1' or reset = '1' then cc <= (others => '0');
      elsif clk'event and clk = '1' then
        if ce_c = '1' then cc <= new_c; end if;
      end if;
    end process register_C;

    z <= cc;

    -- new_a := x*a mod f, Eq. (7.16)
    new_a(0) <= aa(m-1) and F(0);
    new_a_calc: for i in 1 to M-1 generate
      new_a(i) <= aa(i-1) xor (aa(m-1) and F(i));
    end generate;

    -- new_c := c + b(0)*a
    new_c_calc: for i in 0 to M-1 generate
      new_c(i) <= cc(i) xor (aa(i) and bb(0));
    end generate;

The complete model additionally includes a counter and a control unit.

7.1.4 Matrix-Vector Multipliers

The $GF(2^m)$ multiplication given by $c(x) = a(x)b(x) \bmod f(x)$ can also be described in terms of matrix-vector operations. There are mainly two different approaches based on matrix-vector operations for the computation of a field product. The first one is the two-step classic multiplication studied in Section 7.1.1, in which the polynomial multiplication is performed by any method, and then the resulting product is reduced by using a reduction matrix. In the second approach, the polynomial multiplication and modular reduction parts are combined in a single step by using the so-called Mastrovito product matrix ([Mas88], [Mas91]).

First, we introduce a matrix notation [Paa94] for the multiplication $c(x) = a(x)b(x) \bmod f(x)$ in the field $GF(2^m)$. All elements are binary polynomials of degree less than m:

$$c_{m-1}x^{m-1} + \cdots + c_1x + c_0 = (a_{m-1}x^{m-1} + \cdots + a_1x + a_0)(b_{m-1}x^{m-1} + \cdots + b_1x + b_0) \bmod f(x) \tag{7.18}$$

The elements $b(x)$ and $c(x)$ can also be represented as column vectors with the polynomial coefficients. The matrix $Z = h(a(x), f(x))$ can be introduced in such a way that the multiplication can be described as

$$C = \begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{m-1} \end{pmatrix} = ZB = \begin{pmatrix} z_{0,0} & \cdots & z_{0,m-1} \\ \vdots & \ddots & \vdots \\ z_{m-1,0} & \cdots & z_{m-1,m-1} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{m-1} \end{pmatrix} \tag{7.19}$$

where $C = (c_0, c_1, \ldots, c_{m-1})^T$ and $B = (b_0, b_1, \ldots, b_{m-1})^T$ are the vectors associated with $c(x)$ and $b(x)$, respectively. The (m × m) matrix Z is named the product matrix or Mastrovito matrix. Its coefficients $z_{i,j} \in GF(2)$ depend recursively on the coefficients $a_i$ and on the coefficients $p_{i,j}$ of the P matrix [introduced in Eq. (7.21)] as follows:

$$z_{i,j} = \begin{cases} a_i; & j = 0;\ i = 0, \ldots, m-1 \\ u(i-j)\,a_{i-j} + \sum_{t=0}^{j-1} p_{j-1-t,i}\,a_{m-1-t}; & j, i = 0, \ldots, m-1;\ j \neq 0 \end{cases} \tag{7.20}$$

where the step function $u(\mu)$ is 1 for $\mu \geq 0$ and 0 for $\mu < 0$. The matrix-vector product given in Eq. (7.19) describes the entire field multiplication [Paa94]. The P matrix required for the computation of the Z matrix is a function of the irreducible polynomial $f(x)$ of degree m. Its binary entries $p_{i,j}$ are defined below:

$$\begin{pmatrix} x^m \\ x^{m+1} \\ \vdots \\ x^{2m-2} \end{pmatrix} \bmod f(x) = P \begin{pmatrix} 1 \\ x \\ \vdots \\ x^{m-1} \end{pmatrix} = \begin{pmatrix} p_{0,0} & \cdots & p_{0,m-1} \\ p_{1,0} & \cdots & p_{1,m-1} \\ \vdots & \ddots & \vdots \\ p_{m-2,0} & \cdots & p_{m-2,m-1} \end{pmatrix} \begin{pmatrix} 1 \\ x \\ \vdots \\ x^{m-1} \end{pmatrix} \tag{7.21}$$

The binary entries $p_{i,j}$ of P given in Eq. (7.21) can be recursively computed as a function of the coefficients of the irreducible polynomial $f(x) = x^m + f_{m-1}x^{m-1} + \cdots + f_1x + f_0$ as follows:

$$p_{i,j} = \begin{cases} p_{i-1,m-1}; & i = 1, \ldots, m-2;\ j = 0 \\ p_{i-1,j-1} + p_{i-1,m-1}\,p_{0,j}; & i = 1, \ldots, m-2;\ j = 1, \ldots, m-1 \end{cases} \tag{7.22}$$

where $p_{0,j} = f_j$. It must be noted that the P matrix given by Eqs. (7.21) and (7.22) is equivalent to the reduction matrix R given in Eqs. (7.6) and (7.7) and used in the two-step classic multiplication. In fact, $R = P^T$ (transposed matrix). The matrix-vector operation given in Eq. (7.19) requires $m^2$ modulo 2 multiplications. It can be proven that the space complexity of a bit-parallel Mastrovito multiplier is $m^2$ AND gates and at least $(m^2 - 1)$ XOR gates [equality, i.e., $(m^2 - 1)$ XOR gates, corresponds to the irreducible trinomial $f(x) = x^m + x + 1$] [Paa94]. The delay of the bit-parallel Mastrovito multiplier can also be upper bounded by

$$T \leq T_{AND} + 2\lceil \log_2 m \rceil\, T_{XOR}$$

Example 7.1 Multiplication for $f(x) = x^4 + x^3 + 1$

Let $f(x) = x^4 + x^3 + 1$ be the generating irreducible polynomial for $GF(2^4)$. The polynomials $x^4$, $x^5$, and $x^6$, reduced modulo $f(x)$, are given as:

$$\begin{aligned} x^4 &= 1 + x^3 \\ x^5 &= x + x^4 = 1 + x + x^3 \\ x^6 &= x^2 + x^5 = 1 + x + x^2 + x^3 \end{aligned}$$

These equations can be rewritten in matrix form in order to obtain the P matrix, also obtained using Eqs. (7.21) and (7.22), as follows:

$$\begin{pmatrix} x^4 \\ x^5 \\ x^6 \end{pmatrix} \bmod (x^4 + x^3 + 1) = \begin{pmatrix} 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ x \\ x^2 \\ x^3 \end{pmatrix} \tag{7.23}$$
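The P matrix of Eq. (7.23) can be reproduced with a short Python sketch of the recursion in Eq. (7.22). The function name matrix_p and the list-of-lists representation are assumptions of this sketch, which also relies on $f_0 = 1$ (true for an irreducible polynomial of degree at least 2), as Eq. (7.22) does for the case j = 0.

```python
def matrix_p(f):
    """Build the (m-1) x m matrix P of Eq. (7.21) from f = (f_0, ..., f_{m-1})."""
    m = len(f)
    P = [list(f)]                            # row 0: p_{0,j} = f_j
    for i in range(1, m - 1):
        prev = P[i - 1]
        row = [prev[m - 1]]                  # j = 0
        for j in range(1, m):                # j = 1, ..., m-1
            row.append(prev[j - 1] ^ (prev[m - 1] & f[j]))
        P.append(row)
    return P


# f(x) = x^4 + x^3 + 1  ->  (f_0, f_1, f_2, f_3) = (1, 0, 0, 1)
P = matrix_p([1, 0, 0, 1])
```

Row i of the result holds the coefficients of $x^{m+i} \bmod f(x)$, lowest degree first.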

The product matrix Z can finally be obtained using Eqs. (7.19) and (7.20):

$$C = ZB = \begin{pmatrix} a_0 & a_3 & a_2 + a_3 & a_1 + a_2 + a_3 \\ a_1 & a_0 & a_3 & a_2 + a_3 \\ a_2 & a_1 & a_0 & a_3 \\ a_3 & a_2 + a_3 & a_1 + a_2 + a_3 & a_0 + a_1 + a_2 + a_3 \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix} \tag{7.24}$$

Assume that the function

    function matrix_P (f: poly_vector) return poly_matrix_m2m1

computing the P matrix is implemented using Eq. (7.22) as follows

    -- row 0 holds the coefficients of f(x)
    for j in 0 .. m-1 loop P(0,j) := f(j); end loop;
    for i in 1 .. m-2 loop
      for j in 0 .. m-1 loop P(i,j) := 0; end loop;
    end loop;
    -- remaining rows follow the recursion of Eq. (7.22)
    for i in 1 .. m-2 loop
      for j in 0 .. m-1 loop
        if j = 0 then P(i,j) := P(i-1,m-1);
        else P(i,j) := m2xor(P(i-1,j-1),m2and(P(i-1,m-1),P(0,j)));
        end if;
      end loop;
    end loop;
    return P;

where poly_matrix_m2m1 is an ((m − 1) × m) matrix of bits. Assume also that the function

    function mastrovito_matrix (a: poly_vector; P: poly_matrix_m2m1) return poly_matrix

computing the Mastrovito matrix Z has also been implemented using Eq. (7.20) as follows

    for i in 0 .. m-1 loop Z(i,0) := a(i); end loop;
    for i in 0 .. m-1 loop
      for j in 1 .. m-1 loop Z(i,j) := 0; end loop;
    end loop;
    for i in 0 .. m-1 loop
      for j in 1 .. m-1 loop
        for t in 0 .. j-1 loop
          Z(i,j) := m2xor(Z(i,j),m2and(P(j-1-t,i),a(m-1-t)));
        end loop;
        -- the step function u(i-j) of Eq. (7.20)
        if i >= j then Z(i,j) := m2xor(a(i-j),Z(i,j)); end if;
      end loop;
    end loop;
    return Z;

where the P matrix has been previously computed and where poly_matrix is an (m × m) matrix of bits. The Mastrovito multiplication in Eq. (7.19) can therefore be given in the following algorithm, where the functions matrix_P and mastrovito_matrix are used.

Algorithm 7.4—Mastrovito multiplication

    for j in 0 .. m-1 loop C(j) := 0; end loop;
    P := matrix_P(f);
    Z := mastrovito_matrix(a,P);
    for i in 0 .. m-1 loop
      for j in 0 .. m-1 loop
        C(i) := m2xor(C(i),m2and(Z(i,j),b(j)));
      end loop;
    end loop;

An executable Ada file mastrovito_multiplication.adb, including Algorithm 7.4, is available at www.arithmetic-circuits.org.

A VHDL file mastrovito_multiplier.vhd, which models the Mastrovito multiplication given in Algorithm 7.4, is available at www.arithmetic-circuits.org. The corresponding entity declaration is

    entity mastrovito_multiplication is
    port (
      a, b: in std_logic_vector(M-1 downto 0);
      c: out std_logic_vector(M-1 downto 0)
    );
    end mastrovito_multiplication;

The VHDL architecture follows:

    z_matrix: process(a,z)  -- Gen Z matrix
      variable Zi: matrix_mastrovito;
    begin
      for i in 0 to M-1 loop
        zi(i)(0) := a(i);
        zi(i)(1) := (P(0)(i) and a(M-1));
        if i >= 1 then zi(i)(1) := (a(i-1) xor zi(i)(1)); end if;
        for j in 2 to M-1 loop
          zi(i)(j) := (P(j-1)(i) and a(M-1));
          for t in 1 to j-1 loop
            zi(i)(j) := (zi(i)(j) xor (P(j-1-t)(i) and a(M-1-t)));
          end loop;
          if i >= j then zi(i)(j) := (a(i-j) xor zi(i)(j)); end if;
        end loop;
      end loop;
      Z <= zi;
    end process;

    mastrovito: process(b,z)  -- Mastrovito multiplication
      variable ci: std_logic_vector(M-1 downto 0);
    begin
      for i in 0 to m-1 loop
        ci(i) := (Z(i)(0) and b(0));
        for j in 1 to m-1 loop
          ci(i) := (ci(i) xor (Z(i)(j) and b(j)));
        end loop;
      end loop;
      c <= ci;
    end process;

Several works have used the Mastrovito scheme outlined above for different irreducible polynomials ([HK00], [HK99], [IHT06], [IST06], [RH04], [SK99], [ZP01]). Most of these papers decompose the Mastrovito matrix Z into a sum of matrices. The essence of all these works is to find an architecture that efficiently exploits subexpression sharing [Par99] for specific irreducible polynomials. Some important irreducible polynomials will be studied in Sec. 7.6.

An efficient multiplication scheme was introduced in [RH04] for the computation of the coordinates of the product C = ZB. In this approach, the product or Mastrovito matrix Z is decomposed as:

$$Z = L + P^TU = L + RU \tag{7.25}$$

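where L and U are the following (m × m) and ((m − 1) × m) Toeplitz matrices (diagonal-constant matrices, in which each descending diagonal from left to right is constant):

$$L = \begin{pmatrix} a_0 & 0 & 0 & \cdots & 0 & 0 \\ a_1 & a_0 & 0 & \cdots & 0 & 0 \\ a_2 & a_1 & a_0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{m-2} & a_{m-3} & a_{m-4} & \cdots & a_0 & 0 \\ a_{m-1} & a_{m-2} & a_{m-3} & \cdots & a_1 & a_0 \end{pmatrix} \tag{7.26}$$

$$U = \begin{pmatrix} 0 & a_{m-1} & a_{m-2} & \cdots & a_2 & a_1 \\ 0 & 0 & a_{m-1} & \cdots & a_3 & a_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & a_{m-1} & a_{m-2} \\ 0 & 0 & 0 & \cdots & 0 & a_{m-1} \end{pmatrix} \tag{7.27}$$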

where the $a_i$ are the binary coordinates of the vector A representing the field element $a(x)$, that is, $A = (a_0, a_1, \ldots, a_{m-1})^T$. The following two vectors

$$D = LB, \qquad E = UB \tag{7.28}$$

that are functions of A and B, can be defined in such a way that the product C in $GF(2^m)$ can also be defined as:

$$C = D + P^TE = D + RE \tag{7.29}$$

where P and R are the P matrix defined in Eq. (7.22) and the reduction matrix R defined in Eq. (7.6), respectively. The coefficients of the vectors D and E given in Eq. (7.28) can be computed using Eqs. (7.26) and (7.27) as follows:

$$d_i = \sum_{k=0}^{i} a_k b_{i-k}; \quad i = 0, \ldots, m-1 \qquad\qquad e_i = \sum_{k=i}^{m-2} a_{m-1-(k-i)}\, b_{k+1}; \quad i = 0, \ldots, m-2 \tag{7.30}$$

Assume that the following functions

    function vector_D (a,b: poly_vector) return poly_vector
    function vector_E (a,b: poly_vector) return poly_vector2

computing the D and E vectors given in Eq. (7.30) are available, where poly_vector2 is a bit vector indexed from 0 to m − 2. The new version of the Mastrovito multiplication given in Eq. (7.29) can therefore be expressed as the following algorithm, where the functions vector_D, vector_E, and reduction_matrix_R are used, and where re: poly_vector holds the product RE of Eq. (7.29).

Algorithm 7.5—Mastrovito multiplication, version 2

    for j in 0 .. m-1 loop c(j) := 0; re(j) := 0; end loop;
    D := vector_D(a,b);
    E := vector_E(a,b);
    R := reduction_matrix_R(f);
    -- re := R*E
    for i in 0 .. m-1 loop
      for j in 0 .. m-2 loop
        re(i) := m2xor(re(i),m2and(R(i,j),E(j)));
      end loop;
    end loop;
    -- c := D + R*E
    for i in 0 .. m-1 loop c(i) := m2xor(D(i),re(i)); end loop;
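The decomposition C = D + RE of Eq. (7.29) can be sketched in Python as shown below. This is an illustration under assumptions, not the mastrovito_multiplication_v2.adb file: polynomials are lists of bits (index i holds the coefficient of $x^i$), f holds $f_0, \ldots, f_{m-1}$, the function names are hypothetical, and R is obtained as the transpose of P, as stated in the text.

```python
def matrix_p(f):
    """(m-1) x m matrix P of Eq. (7.21), via the recursion of Eq. (7.22)."""
    m = len(f)
    P = [list(f)]
    for i in range(1, m - 1):
        prev = P[i - 1]
        P.append([prev[m - 1]] +
                 [prev[j - 1] ^ (prev[m - 1] & f[j]) for j in range(1, m)])
    return P


def vector_d(a, b):
    """d_i of Eq. (7.30): low half of the polynomial product."""
    m = len(a)
    d = [0] * m
    for i in range(m):
        for k in range(i + 1):
            d[i] ^= a[k] & b[i - k]
    return d


def vector_e(a, b):
    """e_i of Eq. (7.30): coefficients of x^m, ..., x^(2m-2)."""
    m = len(a)
    e = [0] * (m - 1)
    for i in range(m - 1):
        for k in range(i, m - 1):
            e[i] ^= a[m - 1 - (k - i)] & b[k + 1]
    return e


def mastrovito_v2(a, b, f):
    """C = D + R E with R = P^T, so R(i, j) = P[j][i]."""
    m = len(a)
    P = matrix_p(f)
    d, e = vector_d(a, b), vector_e(a, b)
    c = []
    for i in range(m):
        re_i = 0
        for j in range(m - 1):
            re_i ^= P[j][i] & e[j]
        c.append(d[i] ^ re_i)
    return c


F = [1, 0, 0, 1]                                 # f(x) = x^4 + x^3 + 1
c = mastrovito_v2([0, 0, 0, 1], [0, 0, 0, 1], F)  # x^3 * x^3 = x^6
```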

An executable Ada file mastrovito_multiplication_v2.adb, including Algorithm 7.5, is available at www.arithmetic-circuits.org.

A VHDL file mastrovito_v2_multiplier.vhd, modeling the Mastrovito multiplication (version 2) given in Algorithm 7.5, is available at www.arithmetic-circuits.org. This model includes the processes genD, genE, and mastrovitoV2 that implement the generation of the D and E vectors and the Mastrovito multiplication. The datapaths for these components are shown in Fig. 7.3. The entity declaration of the Mastrovito multiplier, version 2, given in the VHDL file mastrovito_v2_multiplier.vhd is

    entity mastrovito_V2_multiplication is
    port (
      a, b: in std_logic_vector(M-1 downto 0);
      c: out std_logic_vector(M-1 downto 0)
    );
    end mastrovito_V2_multiplication;

The VHDL architecture is the following:

    genD: process(a,b)
      variable di: std_logic_vector(M-1 downto 0);
    begin
      for i in 0 to M-1 loop
        di(i) := '0';
        for k in 0 to i loop
          di(i) := (di(i) xor (a(k) and b(i-k)));
        end loop;
      end loop;
      D <= di;
    end process genD;

    genE: process(a,b)
      variable ei: std_logic_vector(M-2 downto 0);
    begin
      for i in 0 to M-2 loop
        ei(i) := '0';
        for k in i to M-2 loop
          ei(i) := (ei(i) xor (a(M-1-(k-i)) and b(k+1)));
        end loop;
      end loop;
      E <= ei;
    end process genE;

    -- Mastrovito multiplication, second version
    mastrovitoV2: process(e,d)
      variable ci, re: std_logic_vector(M-1 downto 0);
    begin
      for i in 0 to M-1 loop
        re(i) := (R(i)(0) and E(0));
        for j in 1 to M-2 loop
          re(i) := (re(i) xor (R(i)(j) and E(j)));
        end loop;
        ci(i) := (D(i) xor re(i));
      end loop;
      C <= ci;
    end process mastrovitoV2;

FIGURE 7.3 Mastrovito multiplier, version 2 (panels: D vector generation, E vector generation, and Mastrovito multiplier, second version).

Some important irreducible polynomials will be studied in Sec. 7.6 using the above Mastrovito schemes.

7.1.5 Montgomery Multiplication

Montgomery multiplication was first proposed for integer modular multiplication as a way to avoid trial division [Mon85] (Montgomery modular multiplication was studied in Chap. 3). Later, it was extended to finite field multiplication in $GF(2^m)$ [KA98], where it was shown that the operation can be simplified if a certain type of element $r(x)$ is selected. In the following, Montgomery multiplication in $GF(2^m)$ as proposed in [KA98] is considered. Let $f(x)$ be an irreducible polynomial over GF(2) that defines the finite field $GF(2^m)$. Rather than computing Eq. (7.2), the Montgomery multiplication calculates

$$c(x) = a(x)b(x)r^{-1}(x) \bmod f(x) \tag{7.31}$$

where $r(x)$ is a fixed element and $\gcd(r(x), f(x)) = 1$. Because of Bezout's identity, one can find two polynomials $r^{-1}(x)$ and $f'(x)$ such that

$$r(x)r^{-1}(x) + f(x)f'(x) = 1 \tag{7.32}$$

where $r^{-1}(x)$ is the inverse of $r(x)$ modulo $f(x)$. These two polynomials can be calculated with the extended Euclidean algorithm. The Montgomery multiplication over $GF(2^m)$ given in Eq. (7.31) can therefore be computed using the following algorithm:

Algorithm 7.6—Montgomery multiplication

    Input:  a(x), b(x), r(x), and f'(x)
    Output: c(x) = a(x)b(x)r^(-1)(x) mod f(x)
    1. t(x) := a(x)b(x)
    2. u(x) := t(x)f'(x) mod r(x)
    3. c(x) := [t(x) + u(x)f(x)]/r(x)

A proof of correctness of the above algorithm can be found in [KA98], where it was also noted that efficient multiplication can be obtained if $r(x)$ is properly chosen. In fact, $r(x)$ was chosen to be the monomial $r(x) = x^m$. Thus, r is the element of the finite field represented by the polynomial $r(x) \bmod f(x)$: since $f(x) = x^m + f_{m-1}x^{m-1} + \cdots + f_1x + f_0$, we have $x^m = f_{m-1}x^{m-1} + \cdots + f_1x + f_0$, that is, $r = (f_{m-1}, f_{m-2}, \ldots, f_1, f_0)$, where the $f_i$ are the coefficients of the irreducible polynomial $f(x)$. The Montgomery multiplication method requires that $\gcd(r(x), f(x)) = 1$, which in $GF(2^m)$ is always true because the polynomial $f(x)$ is irreducible over GF(2).

The computation of $c(x)$ in Algorithm 7.6 requires a polynomial multiplication in step 1, a polynomial multiplication and a modulo $r(x)$ operation in step 2, and finally an addition, a polynomial multiplication, and a division by $r(x)$ in step 3. The modular reduction and division operations in steps 2 and 3 are very fast because $r(x) = x^m$. The remainder operation using modulus $r(x) = x^m$ can be performed by simply ignoring the terms that have powers greater than or equal to m, and the division of an arbitrary polynomial by $r(x) = x^m$ is accomplished by just shifting the polynomial to the right by m places. It also turns out that the computation of $f'(x)$ can be completely avoided if the coefficients of $a(x)$ are scanned one bit at a time. From the above remarks, the following bit-level algorithm for Montgomery multiplication in $GF(2^m)$ was given in [KA98]:

Algorithm 7.7—Bit-level algorithm for Montgomery multiplication

    Input:  a(x), b(x), f(x)
    Output: c(x) = a(x)b(x)x^(-m) mod f(x)
    1. c(x) := 0
    2. for i = 0 to m – 1 do
    3.   c(x) := c(x) + a_i b(x)
    4.   c(x) := c(x) + c_0 f(x)
    5.   c(x) := c(x)/x

Algorithm 7.7 needs m identical rounds to produce the correct result. In step 3, the polynomial $b(x)$ is added to the polynomial $c(x)$ if the corresponding coefficient $a_i$ of the polynomial $a(x)$ is 1. In step 4, if the coefficient $c_0$ of $c(x)$ is 1, then the irreducible polynomial $f(x)$ is added to $c(x)$. Finally, the polynomial $c(x)$ is divided by x in step 5. The overall operation count in each algorithmic round is two additions, two multiplications, and one division by x in the worst case ($a_i = c_0 = 1$). If the function

    function lshift(x: poly_vector) return poly_vector

computing the 1-bit left shift of a given binary polynomial is available, then the bit-level algorithm for Montgomery multiplication over $GF(2^m)$ given in Algorithm 7.7 can be rewritten as the following algorithm, where the function lshift is used:

Algorithm 7.8—Bit-level Montgomery multiplication

    for i in 0 .. m-1 loop c(i) := 0; end loop;
    for i in 0 .. m-1 loop
      c := m2xvv(c,m2abv(a(i),b));
      if c(0) = 1 then
        c := m2xvv(c,m2abv(c(0),f));
        c := lshift(c);
        c(m-1) := 1;   -- contribution of the x^m term of f(x)
      else
        c := lshift(c);
      end if;
    end loop;
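The bit-level Montgomery loop can be sketched in Python with polynomials packed into integers. This is a sketch under stated assumptions rather than the bmult_montgomery.adb file: bit i of an integer is the coefficient of $x^i$ (so the division by x is a right shift here), f includes the $x^m$ term, and the name montgomery_mult is hypothetical.

```python
def montgomery_mult(a, b, f, m):
    """Return a(x)*b(x)*x^(-m) mod f(x), following the steps of Algorithm 7.7."""
    c = 0
    for i in range(m):
        if (a >> i) & 1:   # step 3: c := c + a_i * b
            c ^= b
        if c & 1:          # step 4: c := c + c_0 * f, which clears bit 0
            c ^= f
        c >>= 1            # step 5: exact division by x
    return c


F = 0b11001      # f(x) = x^4 + x^3 + 1
M = 4
R = 0b1001       # r = x^4 mod f(x) = 1 + x^3
one = montgomery_mult(0b0001, R, F, M)   # 1 * r * r^(-1) = 1
```

Because f includes its leading term, the XOR in step 4 handles the $x^m$ coefficient directly and no separate assignment to the top bit is needed in this representation.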

An executable Ada file bmult_montgomery.adb, including Algorithm 7.8, is available at www.arithmetic-circuits.org. It must be noted that a left-shift function is used to implement the division by x in Algorithm 7.8 because the binary coefficients of the polynomials defined by poly_vector are stored in the order 0 . . . m – 1 (left-to-right). Furthermore, the assignment c(m − 1) := 1 for c(0) = 1 is needed because $f(x)$ is a polynomial with m + 1 coefficients, while $c(x)$ has m coefficients; for the addition $c(x) := c(x) + c_0 f(x)$, the above assignment represents the addition of the term $x^m$ from $f(x)$. A slightly different version of Algorithm 7.8 is also available at www.arithmetic-circuits.org. This second version is given in the executable Ada file bmult_montgomery_v2.adb, which rewrites Algorithm 7.8 as follows:

    for i in 0 .. m-1 loop
      c := m2xvv(c,m2abv(a(i),b));
      prev_c0 := c(0);
      c := m2xvv(c,m2abv(c(0),f));
      c := lshift(c);
      c(m-1) := prev_c0;
    end loop;

A VHDL file montgomery_mult.vhd, modeling a sequential implementation of the above second version of Algorithm 7.8, is available at www.arithmetic-circuits.org. The corresponding datapath is shown in Fig. 7.4. This model includes the component montg_cell with the following VHDL description:

    prev_c0 <= c(0) xor (a_i and b(0));
    datapath: for i in 1 to M-1 generate
      new_c(i-1) <= c(i) xor (a_i and b(i)) xor (F(i) and prev_c0);
    end generate;
    new_c(M-1) <= prev_c0;

The entity declaration of the sequential implementation of the Montgomery multiplier given in the VHDL file montgomery_mult.vhd is

    entity montgomery_mult is
    port (
      a, b: in std_logic_vector (M-1 downto 0);
      clk, reset, start: in std_logic;
      z: out std_logic_vector (M-1 downto 0);
      done: out std_logic
    );
    end montgomery_mult;

and the VHDL architecture corresponding to the circuit of Fig. 7.4 is the following:

    data_path: montg_cell port map (C => cc, B => bb, a_i => aa(0), new_c => new_c);

    counter: process(reset, clk)
    begin
      if reset = '1' then count <= 0;
      elsif clk'event and clk = '1' then
        if inic = '1' then count <= 0;
        elsif shift_r = '1' then count <= count+1;
        end if;
      end if;
    end process counter;

    sh_register_A: process(clk)
    begin
      if reset = '1' then aa <= (others => '0');
      elsif clk'event and clk = '1' then
        if inic = '1' then aa <= a;
        else aa <= '0' & aa(M-1 downto 1);
        end if;
      end if;
    end process sh_register_A;

    register_B: process(clk)
    begin
      if reset = '1' then bb <= (others => '0');
      elsif clk'event and clk = '1' then
        if inic = '1' then bb <= b; end if;
      end if;
    end process register_B;

    register_C: process(inic, clk)
    begin
      if inic = '1' or reset = '1' then cc <= (others => '0');
      elsif clk'event and clk = '1' then
        if ce_c = '1' then cc <= new_c; end if;
      end if;
    end process register_C;

    z <= cc;

    control_unit: process(clk, reset, current_state)
    begin
      case current_state is
        when 0 to 1 => inic <= '0'; shift_r <= '0'; done <= '1'; ce_c <= '0';
        when 2 => inic <= '1'; shift_r <= '0'; done <= '0'; ce_c <= '0';
        when 3 => inic <= '0'; shift_r <= '1'; done <= '0'; ce_c <= '1';
      end case;
      if reset = '1' then current_state <= 0;
      elsif clk'event and clk = '1' then
        case current_state is
          when 0 => if start = '0' then current_state <= 1; end if;
          when 1 => if start = '1' then current_state <= 2; end if;
          when 2 => current_state <= 3;
          when 3 => if count = M-1 then current_state <= 0; end if;
        end case;
      end if;
    end process control_unit;

FIGURE 7.4 Montgomery multiplier sequential datapath.

A combinational implementation of this Montgomery multiplier is also given in the VHDL file montg_comb_mult.vhd, that is available at www.arithmetic-circuits.org.

7.2 Squaring

A straightforward way of implementing field squaring in $GF(2^m)$ is to use the multiplication algorithms given in Sec. 7.1 with only one input operand in order to perform $c(x) = a(x)a(x) \bmod f(x) = a(x)^2 \bmod f(x)$, that is, the operand $b(x)$ is replaced by $a(x)$. For example, two-step classic squaring could be implemented using the following algorithm:

Algorithm 7.9—Classic squaring

    d := poly_multiplication(a,a);
    R := reduction_matrix_R(f);
    for j in 0 .. m-1 loop c(j) := d(j); end loop;
    for j in 0 .. m-1 loop
      for i in 0 .. m-2 loop
        c(j) := m2xor(c(j),m2and(R(j,i),d(m+i)));
      end loop;
    end loop;

An executable Ada file classic_squaring.adb, including Algorithm 7.9, is available at www.arithmetic-circuits.org. MSB-first and LSB-first approaches for squaring can also be given in a similar manner. For example, LSB-first squaring could be implemented as follows:

Algorithm 7.10—LSB-first squaring

    for i in 0 .. m-1 loop c(i) := 0; end loop;
    for i in 0 .. m-1 loop
      c := m2xvv(m2abv(aux(i),a),c);
      a := Product_alpha_A(a,f);
    end loop;

where an additional variable aux is needed to hold the original input operand $a(x)$, because a itself is modified by the Product_alpha_A function. An executable Ada file LSBfirst_squarer.adb, including Algorithm 7.10, is available at www.arithmetic-circuits.org. Bit-level Montgomery squaring can also be computed by slightly modifying Algorithm 7.8 for multiplication.

Algorithm 7.11—Bit-level Montgomery squaring

    for i in 0 .. m-1 loop c(i) := 0; end loop;
    for i in 0 .. m-1 loop
      c := m2xvv(c,m2abv(a(i),a));
      if c(0) = 1 then
        c := m2xvv(c,m2abv(c(0),f));
        c := lshift(c);
        c(m-1) := 1;
      else
        c := lshift(c);
      end if;
    end loop;

An executable Ada file bsquarer_montgomery.adb, including Algorithm 7.11, is also available at www.arithmetic-circuits.org. However, the above multiplication-based algorithms can be further optimized because squaring is a linear operation in $GF(2^m)$, that is,

$$c(x) = a(x)^2 \bmod f(x) = \left(a_{m-1}x^{2(m-1)} + a_{m-2}x^{2(m-2)} + \cdots + a_1x^2 + a_0\right) \bmod f(x) \tag{7.33}$$

Therefore, in the classic squaring given in Algorithm 7.9, the polynomial multiplication $d(x) = a(x)a(x)$ computed by d := poly_multiplication(a,a) can be replaced by the degree-(2m − 2) polynomial $d(x) = a_{m-1}x^{2(m-1)} + a_{m-2}x^{2(m-2)} + \cdots + a_1x^2 + a_0 = (a_{m-1}, 0, a_{m-2}, 0, \ldots, 0, a_1, 0, a_0)$. The new algorithm is given below.

Algorithm 7.12—Classic squaring, version 2

    for i in 0 .. 2*m-2 loop d(i) := 0; end loop;
    for i in 0 .. m-1 loop d(2*i) := a(i); end loop;   -- interleave zeros, Eq. (7.33)
    R := reduction_matrix_R(f);
    for j in 0 .. m-1 loop c(j) := d(j); end loop;
    for j in 0 .. m-1 loop
      for i in 0 .. m-2 loop
        c(j) := m2xor(c(j),m2and(R(j,i),d(m+i)));
      end loop;
    end loop;
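The zero-interleaving idea of Eq. (7.33) can be sketched in Python as follows. This is an illustration, not the classic_squaring_v2.adb file: polynomials are packed into integers (bit i is the coefficient of $x^i$), f includes the $x^m$ term, the name square is hypothetical, and the reduction is done here by shift-and-XOR from the top bit down instead of with the reduction matrix R used in Algorithm 7.12.

```python
def square(a, f, m):
    """Return a(x)^2 mod f(x) by spreading the bits of a, then reducing."""
    d = 0
    for i in range(m):
        if (a >> i) & 1:
            d |= 1 << (2 * i)            # d(2i) := a(i); odd positions stay 0
    for i in range(2 * m - 2, m - 1, -1):
        if (d >> i) & 1:                 # cancel x^i with x^(i-m) * f(x)
            d ^= f << (i - m)
    return d


F = 0b11001   # f(x) = x^4 + x^3 + 1
M = 4
s = square(0b1000, F, M)   # (x^3)^2 = x^6 = 1 + x + x^2 + x^3
```

No AND gates are needed to form d(x); only the reduction costs logic, which is what makes squaring cheaper than a general multiplication in a polynomial basis.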

An executable Ada file classic_squaring_v2.adb, including Algorithm 7.12, is available at www.arithmetic-circuits.org.

A VHDL model for the classic squaring algorithm (version 2, Algorithm 7.12) is given in the file classic_squarer.vhd, which is available at www.arithmetic-circuits.org. This model includes the component poly_reducer, the datapath of which is shown in Fig. 7.5. The entity declaration of the classic squarer given in the VHDL file classic_squarer.vhd is

    entity classic_squarer is
    port (
      a: in std_logic_vector(M-1 downto 0);
      c: out std_logic_vector(M-1 downto 0)
    );
    end classic_squarer;

The corresponding VHDL architecture follows:

    D(0) <= A(0);
    square: for i in 1 to M-1 generate
      D(2*i-1) <= '0';
      D(2*i) <= A(i);
    end generate;
    inst_reduc: poly_reducer port map(d => d, c => c);

FIGURE 7.5 Classic squaring, version 2.

Bit-level Montgomery squaring can also be modified using Eq. (7.33), in such a way that the multiplication step can be skipped. The following bit-level algorithm for Montgomery squaring in $GF(2^m)$ can be found in [KA98]:

Algorithm 7.13—Bit-level algorithm for Montgomery squaring

    Input:  a(x), f(x)
    Output: c(x) = a(x)^2 x^(-m) mod f(x)
    1. c(x) := sum for i = 0 to m – 1 of a_i x^(2i)
    2. for i = 0 to m – 1 do
    3.   c(x) := c(x) + c_0 f(x)
    4.   c(x) := c(x)/x

Assume that the following functions are available:

    function m2xvv2(x: poly2_vector; y: poly_vector) return poly2_vector
    function lshift2(x: poly2_vector) return poly2_vector

where m2xvv2 computes the bit-wise XOR of two bit vectors x (with 2m – 1 bits) and y (with m bits) in the form ($x_0$ XOR $y_0$, $x_1$ XOR $y_1$, . . . , $x_{m-1}$ XOR $y_{m-1}$, $x_m$, $x_{m+1}$, . . . , $x_{2m-2}$), and where lshift2 computes the left shift of a given bit vector x (with 2m – 1 bits). Therefore, the bit-level Montgomery squaring over $GF(2^m)$ can be rewritten as the following algorithm:

Algorithm 7.14—Bit-level Montgomery squaring, version 2

    for i in 0 .. 2*m-2 loop c(i) := 0; end loop;
    for i in 0 .. m-1 loop c(2*i) := a(i); d(i) := 0; end loop;
    for i in 0 .. m-1 loop
      if c(0) = 1 then
        c := m2xvv2(c,f);
        c(m) := m2xor(c(m),1);   -- x^m term of f(x)
      end if;
      c := lshift2(c);
    end loop;
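The steps of Algorithm 7.13 can be sketched in Python with polynomials packed into integers. This is a sketch under stated assumptions, not the bsquarer_montgomery_v2.adb file: bit i of an integer is the coefficient of $x^i$ (so division by x is a right shift), f includes the $x^m$ term, and the name montgomery_square is hypothetical.

```python
def montgomery_square(a, f, m):
    """Return a(x)^2 * x^(-m) mod f(x), following the steps of Algorithm 7.13."""
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c |= 1 << (2 * i)   # step 1: c(x) := sum of a_i x^(2i)
    for _ in range(m):
        if c & 1:               # step 3: c := c + c_0 * f
            c ^= f
        c >>= 1                 # step 4: exact division by x
    return c


F = 0b11001      # f(x) = x^4 + x^3 + 1
M = 4
R = 0b1001       # r = x^4 mod f(x)
sq = montgomery_square(R, F, M)   # r^2 * r^(-1) = r
```

The loop body contains no term depending on a, so each of the m rounds is only a conditional XOR with f(x) and a shift.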

An executable Ada file bsquarer_montgomery_v2.adb, including Algorithm 7.14, is available at www.arithmetic-circuits.org. The assignment c(m) := m2xor(c(m),1) for c(0) = 1 is needed because $f(x)$ is a polynomial with m + 1 coefficients; for the addition $c(x) := c(x) + c_0 f(x)$, the above assignment represents the addition of the term $x^m$ from $f(x)$.

A VHDL model for the combinational implementation of the bit-level Montgomery squaring (version 2, Algorithm 7.14) is given in the file montg_comb_squarer.vhd, which is available at www.arithmetic-circuits.org. This model includes the component montg_sq_c_cell, with c and new_c as std_logic_vector(2*m − 2 downto 0) as input and output, and includes the following architecture:

    datapath: for i in 1 to M-1 generate
      new_c(i-1) <= c(i) xor (F(i) and c(0));
    end generate;
    new_c(M-1) <= c(0) xor c(M);
    new_c(2*M-2 downto M) <= '0' & c(2*M-2 downto M+1);

The entity declaration of the bit-level Montgomery squaring (version 2, Algorithm 7.14) given in the VHDL file montg_comb_squarer.vhd is

    entity montgomery_comb_squarer is
    port (
      a: in std_logic_vector (M-1 downto 0);
      c: out std_logic_vector (M-1 downto 0)
    );
    end montgomery_comb_squarer;

The corresponding VHDL architecture follows:

    cc(0)(0) <= a(0);
    gen_aa: for i in 1 to M-1 generate
      cc(0)(2*i-1) <= '0';
      cc(0)(2*i) <= a(i);
    end generate;
    forg: for i in 0 to M-1 generate
      cell: montg_sq_c_cell port map (C => cc(i), new_c => cc(i+1));
    end generate;
    c <= cc(M)(M-1 downto 0);

A sequential VHDL model is also given in the file montgomery_square.vhd, which is available at www.arithmetic-circuits.org.

MSB-first and LSB-first methods for squaring can also be modified using Eq. (7.33). For example, LSB-first squaring can be computed as follows [JSP98]. Let $a(x) = a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \cdots + a_1x + a_0 \in GF(2^m)$. Then $W = a^2 \bmod f(x)$ can be computed in an LSB-first manner as:

$$\begin{aligned} W = a^2 &= a_{m-1}x^{2m-2} + \cdots + a_1x^2 + a_0 \\ &= \left[a_{\lfloor (m-1)/2 \rfloor}x^{2\lfloor (m-1)/2 \rfloor} + \cdots + a_1x^2 + a_0\right] + a_{\lfloor (m-1)/2 \rfloor + 1}\left[x^{2\lfloor (m-1)/2 \rfloor}x^2\right] + \cdots + a_{m-1}\left(x^{2m-4}x^2\right) \end{aligned} \tag{7.34}$$

where the least-significant bits of the operand $a(x)$ are processed first. The basic computations that can be performed in parallel in step k are the following:

$$a^{(k)} = a^{(k-1)}x^2 \bmod f(x), \qquad W^{(k)} = W^{(k-1)} + a^{(k-1)}a_{\lceil m/2 \rceil + k - 1} \tag{7.35}$$

191

192

Chapter Seven ⎢ ⎥ with the initial values W ( 0) = a⎢( m − 1)/2⎥ x 2 ⎣( m − 1)/2⎦ + . . . + a1x 2 + a0, ⎣ ⎦ ⎡ ⎤ 2 1 2 2 2 ⎢ ( m − )/ ⎥ m / ⎦ x2 = x ⎢ ⎥ , and k = 1, 2, . . . , ⎣m/2⎦. For the coma( 0 ) = x ⎣ putation, multiply-by-x2, we can assume that a(2) = ax2 mod f(x). A straightforward multiply-by-x2 operation is equivalent to shift-left by 2 bits operation. Therefore, the following intermediate result is obtained:

$$
a^{(2)} = a_{m-1}x^{m+1} + a_{m-2}x^{m} + \cdots + a_1x^3 + a_0x^2
\tag{7.36}
$$

where polynomial modulo operations have to be performed in order to reduce the degree of $a^{(2)}$ from $m + 1$ to at most $m - 1$. The following polynomial $f'(x)$ can be defined for that purpose:

$$
f'(x) = f(x)\,x
\tag{7.37}
$$

with $f(x) = f_{m-1}x^{m-1} + f_{m-2}x^{m-2} + \cdots + f_1x + f_0$, and where the coefficients $f'_i$ of $f'(x)$ can be computed from the coefficients $f_i$ of $f(x)$ as follows:

$$
f'_i =
\begin{cases}
f_{i-1} + f_{m-1}f_i, & 1 \le i \le m-1 \\
f_{m-1}f_0, & i = 0
\end{cases}
\tag{7.38}
$$

For the irreducible polynomial $x^m + f_{m-1}x^{m-1} + \cdots + f_0$, we have $x^m = f_0 + f_1x + \cdots + f_{m-1}x^{m-1}$. Therefore, with $f(x)$ denoting this reduced form as in Eq. (7.37),

$$
x^m = f(x), \qquad x^{m+1} = f'(x)
\tag{7.39}
$$

Substituting Eq. (7.39) into Eq. (7.36), we have

$$
a_i^{(2)} =
\begin{cases}
a_{i-2} + a_{m-1}f'_i + a_{m-2}f_i, & 2 \le i \le m-1 \\
a_{m-1}f'_1 + a_{m-2}f_1, & i = 1 \\
a_{m-1}f'_0 + a_{m-2}f_0, & i = 0
\end{cases}
\tag{7.40}
$$

where $a_i$ and $a_i^{(2)}$ are coordinates of $a$ and $a^{(2)}$, respectively. Therefore, the LSB-first squaring operation in $GF(2^m)$ can be computed with the following steps [JSP98]:

1. Initially,

$$
W^{(0)} = a_{\lfloor (m-1)/2\rfloor}\,x^{2\lfloor (m-1)/2\rfloor} + \cdots + a_1x^2 + a_0, \qquad
a^{(0)} = x^{2\lceil m/2\rceil} =
\begin{cases}
f(x), & \text{even } m \\
f'(x), & \text{odd } m
\end{cases}
\tag{7.41}
$$

2. At step $k$, with $1 \le k \le \lfloor m/2\rfloor - 1$, the following values are computed:

$$
a_i^{(k)} =
\begin{cases}
a_{i-2}^{(k-1)} + a_{m-1}^{(k-1)}f'_i + a_{m-2}^{(k-1)}f_i, & 2 \le i \le m-1 \\
a_{m-1}^{(k-1)}f'_1 + a_{m-2}^{(k-1)}f_1, & i = 1 \\
a_{m-1}^{(k-1)}f'_0 + a_{m-2}^{(k-1)}f_0, & i = 0
\end{cases}
\tag{7.42}
$$

$$
w_i^{(k)} = w_i^{(k-1)} + a_i^{(k-1)}\,a_{\lceil m/2\rceil + k - 1}
$$

3. For $k = \lfloor m/2\rfloor$,

$$
w_i^{(k)} = w_i^{(k-1)} + a_i^{(k-1)}\,a_{m-1}
\tag{7.43}
$$

4. Finally, the result is $W = W^{(\lfloor m/2\rfloor)}$.

For even values of the field degree m, m/2 steps are required. For odd values of m, (m − 1)/2 steps are required. Assume that the function

function Product_alpha(f: poly_vector) return poly_vector

performing the product f'(x) = f(x)x given in Eq. (7.38) is available. Then the above LSB-first squaring operation in $GF(2^m)$ can be given in the following algorithm:

Algorithm 7.15—LSB-first squaring, version 2

for i in 0 .. m-1 loop W(i) := 0; baux(i) := 0; end loop;
for i in 0 .. (m-1)/2 loop W(2*i) := a(i); end loop;
if (m rem 2) = 0 then b := f; else b := Product_alpha(f); end if;
faux := Product_alpha(f);
for k in 1 .. (m/2)-1 loop
  for i in 2 .. m-1 loop
    baux(i) := m2xor(b(i-2), m2xor(m2and(b(m-1),faux(i)), m2and(b(m-2),f(i))));
  end loop;
  baux(1) := m2xor(m2and(b(m-1),faux(1)), m2and(b(m-2),f(1)));
  baux(0) := m2xor(m2and(b(m-1),faux(0)), m2and(b(m-2),f(0)));
  W := m2xvv(W, m2abv(a(((m+1)/2)+k-1), b));
  b := baux;
end loop;
W := m2xvv(W, m2abv(a(m-1), b));
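As a software cross-check of Eqs. (7.38) to (7.43), the steps above can be sketched in Python (not the book's Ada or VHDL). The sketch assumes polynomials are packed into integers, with bit i holding the coefficient of x^i, and that `flow` holds only the low coefficients f_{m-1}, ..., f_0 of the irreducible polynomial. The names `times_x2`, `lsb_first_square` and `square_reference` are illustrative and do not come from the book's files.

```python
def product_alpha(flow, m):
    """Eq. (7.38): f'(x) = x * f(x), reduced with x^m = f(x)."""
    top = (flow >> (m - 1)) & 1                  # f_{m-1}
    shifted = (flow << 1) & ((1 << m) - 1)       # f_{i-1} moves to position i
    return shifted ^ (flow if top else 0)


def times_x2(b, flow, fprime, m):
    """Eqs. (7.40)/(7.42): b(x) * x^2 reduced to degree < m (assumes m >= 2)."""
    r = (b << 2) & ((1 << m) - 1)                # b_{i-2} moves to position i
    if (b >> (m - 1)) & 1:                       # b_{m-1} * x^(m+1) = b_{m-1} * f'(x)
        r ^= fprime
    if (b >> (m - 2)) & 1:                       # b_{m-2} * x^m = b_{m-2} * f(x)
        r ^= flow
    return r


def lsb_first_square(a, flow, m):
    """Steps 1 to 4 of the LSB-first squaring method."""
    fprime = product_alpha(flow, m)
    w = 0
    for i in range((m - 1) // 2 + 1):            # W(0): low half of a, spread out
        w |= ((a >> i) & 1) << (2 * i)
    b = flow if m % 2 == 0 else fprime           # a(0) = x^(2*ceil(m/2))
    for k in range(1, m // 2 + 1):
        if (a >> ((m + 1) // 2 + k - 1)) & 1:    # coefficient a_{ceil(m/2)+k-1}
            w ^= b
        b = times_x2(b, flow, fprime, m)
    return w


def square_reference(a, flow, m):
    """Plain squaring followed by bit-by-bit reduction, for comparison only."""
    f = (1 << m) | flow
    sq = 0
    for i in range(m):
        sq |= ((a >> i) & 1) << (2 * i)
    for deg in range(2 * m - 2, m - 1, -1):
        if (sq >> deg) & 1:
            sq ^= f << (deg - m)
    return sq


if __name__ == "__main__":
    m, flow = 4, 0b0011                          # x^4 + x + 1
    print(bin(lsb_first_square(0b1000, flow, m)))  # (x^3)^2 = x^3 + x^2
```

Unlike Algorithm 7.15, the sketch folds the last accumulation (step 3) into the main loop, at the cost of one unused multiply-by-x^2 at the end.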


An executable Ada file LSBfirst_squarer_v2.adb, including Algorithm 7.15, is available at www.arithmetic-circuits.org. A VHDL model for the LSB-first squaring algorithm (version 2, Algorithm 7.15) is given in the file LSB_first_squarer_V2.vhd, which is available at www.arithmetic-circuits.org. This model includes the component LSB_first_squarer_cell, with the following architecture:

new_b_calc: for i in 2 to M-1 generate
  new_b(i) <= B(i-2) xor (B(m-1) and Faux(i)) xor (B(m-2) and F(i));
end generate;
new_b(1) <= (B(m-1) and Faux(1)) xor (B(m-2) and F(1));
new_b(0) <= (B(m-1) and Faux(0)) xor (B(m-2) and F(0));
new_w_calc: for i in 0 to M-1 generate
  new_w(i) <= w(i) xor (b(i) and a_k);
end generate;

The entity declaration of the LSB-first squaring circuit given in the VHDL file LSB_first_squarer_V2.vhd is

entity LSB_first_squarer is
port (
  A: in std_logic_vector (M-1 downto 0);
  clk, reset, start: in std_logic;
  Z: out std_logic_vector (M-1 downto 0);
  done: out std_logic
);
end LSB_first_squarer;

The corresponding VHDL architecture follows:

basicCell: LSB_first_squarer_cell port map (
  b => b, w => w, a_k => aa(0), new_b => new_b, new_w => new_w );

register_A: process(reset, clk)
begin
  if reset = '1' then aa <= (others => '0');
  elsif clk'event and clk = '1' then
    if inic = '1' then aa <= a(M-1 downto (M+1)/2);
    else aa <= '0' & aa(M/2-1 downto 1);
    end if;
  end if;
end process register_A;

register_b: process(reset, clk)
begin
  if reset = '1' then b <= (others => '0');
  elsif clk'event and clk = '1' then
    if inic = '1' then
      if M mod 2 = 0 then b <= F;
      else b <= product_alphaF;
      end if;
    elsif ce_c = '1' then b <= new_b;
    end if;
  end if;
end process register_b;

register_w: process(reset, inic, clk)
begin
  if reset = '1' then w <= (others => '0');
  elsif clk'event and clk = '1' then
    if inic = '1' then
      w(0) <= A(0);
      square: for i in 1 to (M-1)/2 loop
        w(2*i-1) <= '0'; w(2*i) <= A(i);
      end loop;
      if M mod 2 = 0 then w(M-1) <= '0'; end if;
    elsif ce_c = '1' then w <= new_w;
    end if;
  end if;
end process register_w;

z <= w;

counter: process(reset, clk)
begin
  if reset = '1' then count <= 0;
  elsif clk'event and clk = '1' then
    if inic = '1' then count <= 0;
    elsif shift_r = '1' then count <= count+1;
    end if;
  end if;
end process counter;

control_unit: process(clk, reset, current_state)
begin
  case current_state is
    when 0 to 1 => inic <= '0'; shift_r <= '0'; done <= '1'; ce_c <= '0';
    when 2 => inic <= '1'; shift_r <= '0'; done <= '0'; ce_c <= '0';
    when 3 => inic <= '0'; shift_r <= '1'; done <= '0'; ce_c <= '1';
  end case;
  if reset = '1' then current_state <= 0;
  elsif clk'event and clk = '1' then
    case current_state is
      when 0 => if start = '0' then current_state <= 1; end if;
      when 1 => if start = '1' then current_state <= 2; end if;
      when 2 => current_state <= 3;
      when 3 => if count = (M+1)/2-1 then current_state <= 0; end if;
    end case;
  end if;
end process control_unit;

7.3 Exponentiation

Let a be an arbitrary element of a finite field $GF(2^m)$, and e an arbitrary positive integer. Then, field exponentiation is defined as the problem of finding an element $b \in GF(2^m)$ such that $b = a^e$. In general, an arbitrary integer power of an element $a \in GF(2^m)$ can be computed using the binary method [Knu81], also known as the square-and-multiply method, which breaks the exponentiation into a series of squaring and multiplication operations in $GF(2^m)$. The binary method [Knu81] is a popular algorithm for computing $b = a^e$ because it is suitable for hardware implementation ([SSTP88], [Wan94], [Ara93]). In this method, repeated squaring of the partial results is used to reduce the required number of multiplications. Each integer exponent e can be written in binary as an m-bit vector, $e = e_0 + e_1 2 + e_2 2^2 + \cdots + e_{m-1}2^{m-1} = (e_0, e_1, \ldots, e_{m-1})$. According to this method, we obtain:

$$
b = a^e = a^{\sum_{i=0}^{m-1} e_i 2^i}
= a^{e_0}\,(a^2)^{e_1}\,\left(a^{2^2}\right)^{e_2} \cdots \left(a^{2^{m-1}}\right)^{e_{m-1}}
= \prod_{i=0}^{m-1} B_i
\tag{7.44}
$$

where

$$
B_i = \left(a^{2^i}\right)^{e_i} =
\begin{cases}
a^{2^i}, & \text{if } e_i = 1 \\
1, & \text{if } e_i = 0
\end{cases}
\tag{7.45}
$$

If the exponentiation $a^e$ is performed from the least significant bit ($e_0$) to the most significant bit ($e_{m-1}$) of the exponent, it can be proved [Knu81] that this method requires $\lfloor \log_2 e\rfloor + v(e) - 1$ multiplications, where $v(e)$ is the number of binary ones in the exponent. The binary or square-and-multiply method given in Eqs. (7.44) and (7.45) can be implemented in the following algorithm:

Algorithm 7.16—Binary or square-and-multiply exponentiation

for i in 0 .. m-1 loop b(i) := 0; end loop;
c := a; b(0) := 1;
for i in 0 .. m-1 loop
  if e(i) = 1 then b := LSBfirst(b,c,f); end if;
  c := LSBfirst_squarer(c,f);
end loop;

where the result of the exponentiation is finally loaded in the bit vector b, and where the multiplication and squaring operations are computed with the functions LSBfirst and LSBfirst_squarer given in Algorithms 7.3 and 7.10, respectively. An executable Ada file SQandMult_exp.adb, including Algorithm 7.16, is available at www.arithmetic-circuits.org. A VHDL file exponentiation_sq_mult.vhd, modeling a sequential implementation of Algorithm 7.16, is available at www.arithmetic-circuits.org. The corresponding circuit is shown in Fig. 7.6.

FIGURE 7.6 Binary or square-and-multiply exponentiation. (Block diagram: inputs A(m−1:0) and E(n−1:0); two m-bit registers holding b and c; a modular multiplier and a modular squarer producing new_b and new_c; an n-bit shift register for the exponent; a control state machine with ce_c = shift_right and capt = ce_c and e(0).)

The entity declaration of the binary or square-and-multiply exponentiation circuit given in the VHDL file exponentiation_sq_mult.vhd follows:

entity exponentiation_sq_mult is
port (
  A: in std_logic_vector (M-1 downto 0);
  E: in std_logic_vector (N-1 downto 0);
  clk, reset, start: in std_logic;
  B: out std_logic_vector (M-1 downto 0);
  done: out std_logic
);
end exponentiation_sq_mult;

The corresponding VHDL architecture is the following:

inst_mult: interleaved_mult port map (A => cc, B => bb,
  clk => clk, reset => reset, start => start_mult,
  Z => new_B, done => done_mult);
inst_square: classic_squarer port map (a => cc, c => new_c);

counter_sq: process(reset, clk)
begin
  if reset = '1' then count_sq <= 0; done_sq <= '0';
  elsif clk'event and clk = '1' then
    if start_sq = '1' then count_sq <= 0;
    elsif count_sq = COUNT_SQ then done_sq <= '1';
    else count_sq <= count_sq + 1;
    end if;
  end if;
end process counter_sq;

counter: process(reset, clk)
begin
  if reset = '1' then count <= 0;
  elsif clk'event and clk = '1' then
    if inic = '1' then count <= 0;
    elsif shift_r = '1' then count <= count+1;
    end if;
  end if;
end process counter;

sh_reg_e: process(reset, clk)
begin
  if reset = '1' then ee <= (others => '0');
  elsif clk'event and clk = '1' then
    if inic = '1' then ee <= e;
    elsif shift_r = '1' then ee <= '0' & ee(N-1 downto 1);
    end if;
  end if;
end process sh_reg_e;

register_c: process(reset, clk)
begin
  if reset = '1' then cc <= (others => '0');
  elsif clk'event and clk = '1' then
    if inic = '1' then cc <= a;
    elsif shift_r = '1' then cc <= new_c;
    end if;
  end if;
end process register_c;

register_b: process(reset, clk)
begin
  if reset = '1' then bb <= (0 => '1', others => '0');
  elsif clk'event and clk = '1' then
    if inic = '1' then bb <= (0 => '1', others => '0');
    elsif shift_r = '1' and ee(0) = '1' then bb <= new_b;
    end if;
  end if;
end process register_b;

control_unit: process(clk, reset, current_state, ee(0))
begin
  case current_state is
    when 0 to 1 => inic <= '0'; shift_r <= '0'; done <= '1'; ce_c <= '0'; start_sq <= '0'; start_mult <= '0';
    when 2 => inic <= '1'; shift_r <= '0'; done <= '0'; ce_c <= '0'; start_sq <= '0'; start_mult <= '0';
    when 3 => inic <= '0'; shift_r <= '0'; done <= '0'; ce_c <= '1'; start_sq <= '1'; start_mult <= ee(0);
    when 4 => inic <= '0'; shift_r <= '0'; done <= '0'; ce_c <= '1'; start_sq <= '0'; start_mult <= '0';
    when 5 => inic <= '0'; shift_r <= '1'; done <= '0'; ce_c <= '1'; start_sq <= '0'; start_mult <= '0';
  end case;
  if reset = '1' then current_state <= 0;
  elsif clk'event and clk = '1' then
    case current_state is
      when 0 => if start = '0' then current_state <= 1; end if;
      when 1 => if start = '1' then current_state <= 2; end if;
      when 2 => current_state <= 3; --capture operands
      when 3 => current_state <= 4; --start operations
      when 4 => if (done_sq = '1' and (ee(0) = '0' or done_mult = '1')) then current_state <= 5; end if;
      when 5 => if count = N-1 then current_state <= 0; else current_state <= 3; end if;
    end case;
  end if;
end process control_unit;
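The data flow of Algorithm 7.16 can also be sketched in software. The following Python fragment is a minimal illustration, not the book's Ada or VHDL: it assumes polynomials are packed into integers (bit i is the coefficient of x^i), that `f` includes the x^m term, and it uses a generic shift-and-add product `gf2_mulmod` in place of both LSBfirst and LSBfirst_squarer. Both function names are hypothetical.

```python
def gf2_mulmod(a, b, f, m):
    """LSB-first shift-and-add product a(x)*b(x) mod f(x); f includes x^m."""
    result = 0
    for i in range(m):
        if (b >> i) & 1:
            result ^= a
        a <<= 1                      # a := a * x
        if (a >> m) & 1:             # degree reached m: reduce with f(x)
            a ^= f
    return result


def gf2_pow(a, e, f, m):
    """Square-and-multiply, scanning the exponent from e_0 upwards."""
    b = 1                            # running product of the selected B_i
    c = a                            # c = a^(2^i) at iteration i
    while e:
        if e & 1:
            b = gf2_mulmod(b, c, f, m)
        c = gf2_mulmod(c, c, f, m)
        e >>= 1
    return b


if __name__ == "__main__":
    f, m = 0b10011, 4                        # x^4 + x + 1
    print(bin(gf2_pow(0b0010, 4, f, m)))     # x^4 reduces to x + 1
    print(gf2_pow(0b0010, 15, f, m))         # x^15 = 1 in this field
```

The loop stops as soon as the remaining exponent bits are zero, whereas Algorithm 7.16 and the circuit of Fig. 7.6 always run a fixed number of iterations; the result is the same.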

It must be noted that Algorithm 7.16 is the binary version of the square-and-multiply exponentiation mod f(x) given in Algorithm 5.8. Furthermore, for the multiplication and squaring operations in Algorithm 7.16, any of the algorithms given in Secs. 7.1 and 7.2 could be used. Therefore, if Montgomery multiplication and squaring algorithms are used, a Montgomery exponentiation method can be given [KA97]. This approach is based on the binary method, where standard multiplication and squaring operations in $GF(2^m)$ are simply replaced by Montgomery multiplication and squaring operations. The method also requires a small amount of pre- and postprocessing. Let $r(x)$ and $a(x)$ be a fixed element and an arbitrary element, respectively, of $GF(2^m)$. The Montgomery image of $a(x)$ under $r(x)$ is represented as $\hat{a}(x)$, and is defined as $\hat{a}(x) = a(x)\cdot r(x)$, where “⋅” denotes standard multiplication modulo $f(x)$. Given two Montgomery images $\hat{a}(x)$ and $\hat{b}(x)$, their Montgomery product “×” is defined as

$$
\hat{a}(x) \times \hat{b}(x) = \hat{a}(x)\cdot\hat{b}(x)\cdot r(x)^{-1}
\tag{7.46}
$$

The Montgomery product of $\hat{a}(x)$ and $\hat{b}(x)$ is equal to the Montgomery image of the product $a(x)\cdot b(x)$, which is easily proved: $\hat{c} = \hat{a}\times\hat{b} = \hat{a}\cdot\hat{b}\cdot r^{-1} = (a\cdot r)\cdot(b\cdot r)\cdot r^{-1} = a\cdot b\cdot r = c\cdot r$. Let the exponent e be given in binary as an m-bit vector $e = (e_0, e_1, \ldots, e_{m-1})$. In order to compute $b = a^e$ for a given field element $a \in GF(2^m)$, the Montgomery images of 1 and a must first be computed using standard multiplications. The Montgomery exponentiation algorithm based on the binary square-and-multiply method then computes $b = a^e$ using only Montgomery multiplication and squaring operations as follows [KA97]:

Algorithm 7.17—Montgomery exponentiation algorithm

Input: a, r, e
Output: $b = a^e$

1. $\hat{b} := 1\cdot r$
2. $\hat{a} := a\cdot r$
3. for i = 0 to m − 1 do
4.   if $e_i = 1$ then $\hat{b} := \hat{b}\times\hat{a}$
5.   $\hat{a} := \hat{a}\times\hat{a}$
6. $b := \hat{b}\times 1$

Algorithm 7.17 differs from the binary method using standard multiplication and squaring in that steps 4 and 5 perform Montgomery multiplication and squaring, respectively. When a Montgomery operation is performed, the multiplicative factor r remains in place, that is, $\hat{c}\times\hat{c} = (c\cdot r)\cdot(c\cdot r)\cdot r^{-1} = (c\cdot c)\cdot r$ and $\hat{c}\times\hat{a} = (c\cdot r)\cdot(a\cdot r)\cdot r^{-1} = (c\cdot a)\cdot r$. This multiplicative factor r is removed from $\hat{b}$ in step 6, because $\hat{b}\times 1 = (b\cdot r)\cdot 1\cdot r^{-1} = b$, therefore obtaining the final result $b = a^e$. The Montgomery exponentiation method given above can be implemented in the following algorithm:

Algorithm 7.18—Bit-level Montgomery exponentiation

for i in 0 .. m-1 loop b(i) := 0; one(i) := 0; end loop;
one(0) := 1;
c := LSBfirst(f,f,f);
c := bmult_montgomery(a,c,f);
b := f;
for i in 0 .. m-1 loop
  if e(i) = 1 then b := bmult_montgomery(b,c,f); end if;
  c := bsquarer_montgomery(c,f);
end loop;
b := bmult_montgomery(b,one,f);

where the Montgomery multiplication and squaring operations are computed with the functions bmult_montgomery and bsquarer_montgomery given in Algorithms 7.8 and 7.11, respectively, and where the standard multiplication modulo f(x) has been computed with the function LSBfirst given in Algorithm 7.3. Step 1 in Algorithm 7.17 is implemented in Algorithm 7.18 with the assignment b := f because the fixed element r(x) was chosen to be $r(x) = x^m$. Therefore r is the element of the finite field represented by the polynomial r(x) mod f(x). If $f(x) = x^m + f_{m-1}x^{m-1} + \cdots + f_1x + f_0$, then $x^m = f_{m-1}x^{m-1} + \cdots + f_1x + f_0$, that is, $r = (f_{m-1}, f_{m-2}, \ldots, f_1, f_0)$, where the $f_i$ are the coefficients of the irreducible polynomial f(x). In a similar way, step 2 (Algorithm 7.17) computes the image a⋅r, which equals LSBfirst(a,f,f); Algorithm 7.18 obtains the same value as the Montgomery product of a and r⋅r, where r⋅r = LSBfirst(f,f,f), since $a \times (r\cdot r) = a\cdot r\cdot r\cdot r^{-1} = a\cdot r$. The result of the exponentiation is finally loaded in the bit vector b. An executable Ada file Exp_montgomery.adb, including Algorithm 7.18, is available at www.arithmetic-circuits.org.
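The pre- and postprocessing around the Montgomery loop can be illustrated with a short Python sketch (again a software illustration, not the book's Ada or VHDL). It assumes r(x) = x^m, polynomials packed into integers with bit i as the coefficient of x^i, `f` including the x^m term with f_0 = 1, and a generic bit-serial Montgomery product `mont_mul` that is also used for squaring. The names `mont_mul`, `mont_pow` and `gf2_mulmod` are hypothetical.

```python
def gf2_mulmod(a, b, f, m):
    """Standard shift-and-add product a(x)*b(x) mod f(x)."""
    result = 0
    for i in range(m):
        if (b >> i) & 1:
            result ^= a
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return result


def mont_mul(a, b, f, m):
    """Bit-serial Montgomery product a(x)*b(x)*x^(-m) mod f(x); needs f_0 = 1."""
    c = 0
    for i in range(m):
        if (a >> i) & 1:
            c ^= b
        if c & 1:                    # make c divisible by x by adding f(x)
            c ^= f
        c >>= 1                      # exact division by x
    return c


def mont_pow(a, e, f, m):
    """Montgomery exponentiation following the structure of Algorithm 7.18."""
    r = f & ((1 << m) - 1)           # r = x^m mod f(x): the low coefficients of f
    r2 = gf2_mulmod(r, r, f, m)      # r*r, computed with a standard product
    b_hat = r                        # Montgomery image of 1
    c_hat = mont_mul(a, r2, f, m)    # Montgomery image of a: a*r
    while e:
        if e & 1:
            b_hat = mont_mul(b_hat, c_hat, f, m)
        c_hat = mont_mul(c_hat, c_hat, f, m)
        e >>= 1
    return mont_mul(b_hat, 1, f, m)  # strip the factor r


if __name__ == "__main__":
    f, m = 0b10011, 4                # x^4 + x + 1
    print(bin(mont_pow(0b0010, 4, f, m)))
```

In a hardware setting r and r⋅r would normally be precomputed constants, which is how the VHDL model treats them.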

A VHDL file exponentiation_montgomery.vhd, modeling a sequential implementation of Algorithm 7.18, is available at www.arithmetic-circuits.org. The corresponding circuit is shown in Fig. 7.7. The entity declaration of the bit-level Montgomery exponentiation circuit given in the VHDL file exponentiation_montgomery.vhd is

entity exponentiation_montgomery is
port (
  A: in std_logic_vector (M-1 downto 0);
  E: in std_logic_vector (N-1 downto 0);
  clk, reset, start: in std_logic;
  B: out std_logic_vector (M-1 downto 0);
  done: out std_logic
);
end exponentiation_montgomery;

FIGURE 7.7 Montgomery exponentiation. (Block diagram: inputs A(m−1:0) and E(n−1:0); constants one, r mod f and r × r mod f; two m-bit registers holding b and c, loaded through multiplexers driven by control 1 and control 2; a Montgomery multiplier and a Montgomery squarer producing new_b and new_c; an n-bit shift register for the exponent; a control state machine with ce_c = shift_right.)

The corresponding VHDL architecture follows:

RR <= R_BY_R; -- precomputed constant (R*R mod F), R = 2^K
inst_mult: montgomery_mult port map (A => cc, B => bb,
  clk => clk, reset => reset, start => strt_mul,
  Z => new_B, done => done_mult);
inst_square: montgomery_squarer port map (A => cc,
  clk => clk, reset => reset, start => strt_sq,
  Z => new_c, done => done_sq);

counter: process(reset, clk)
begin
  if reset = '1' then count <= 0;
  elsif clk'event and clk = '1' then
    if inic = '1' then count <= 0;
    elsif shift_r = '1' then count <= count+1;
    end if;
  end if;
end process counter;

sh_reg_e: process(reset, clk)
begin
  if reset = '1' then ee <= (others => '0');
  elsif clk'event and clk = '1' then
    if inic = '1' then ee <= e;
    elsif shift_r = '1' then ee <= '0' & ee(N-1 downto 1);
    end if;
  end if;
end process sh_reg_e;

register_c: process(reset, clk)
begin
  if reset = '1' then cc <= (others => '0');
  elsif clk'event and clk = '1' then
    if first = '1' then cc <= r_by_r;
    elsif inic = '1' then cc <= new_b;
    elsif shift_r = '1' then cc <= new_c;
    elsif last = '1' then cc <= (0 => '1', others => '0');
    end if;
  end if;
end process register_c;

register_b: process(reset, clk)
begin
  if reset = '1' then bb <= (others => '0');
  elsif clk'event and clk = '1' then
    if first = '1' then bb <= a;
    elsif inic = '1' then bb <= F;
    elsif (shift_r = '1' and ee(0) = '1') or (capt = '1') then bb <= new_b;
    end if;
  end if;
end process register_b;

b <= bb;

control_unit: process(clk, reset, current_state, ee(0))
begin
  case current_state is
    when 0 to 1 => inic <= '0'; shift_r <= '0'; done <= '1'; first <= '0'; strt_sq <= '0'; strt_mul <= '0'; last <= '0'; capt <= '0';
    when 2 => inic <= '0'; shift_r <= '0'; done <= '0'; first <= '1'; strt_sq <= '0'; strt_mul <= '0'; last <= '0'; capt <= '0';
    when 3 => inic <= '0'; shift_r <= '0'; done <= '0'; first <= '0'; strt_sq <= '0'; strt_mul <= '1'; last <= '0'; capt <= '0';
    when 4 => inic <= '0'; shift_r <= '0'; done <= '0'; first <= '0'; strt_sq <= '0'; strt_mul <= '0'; last <= '0'; capt <= '0';
    when 5 => inic <= '1'; shift_r <= '0'; done <= '0'; first <= '0'; strt_sq <= '0'; strt_mul <= '0'; last <= '0'; capt <= '0';
    when 6 => inic <= '0'; shift_r <= '0'; done <= '0'; first <= '0'; strt_sq <= '1'; strt_mul <= ee(0); last <= '0'; capt <= '0';
    when 7 => inic <= '0'; shift_r <= '0'; done <= '0'; first <= '0'; strt_sq <= '0'; strt_mul <= '0'; last <= '0'; capt <= '0';
    when 8 => inic <= '0'; shift_r <= '1'; done <= '0'; first <= '0'; strt_sq <= '0'; strt_mul <= '0'; last <= '0'; capt <= '0';
    when 9 => inic <= '0'; shift_r <= '0'; done <= '0'; first <= '0'; strt_sq <= '0'; strt_mul <= '0'; last <= '1'; capt <= '0';
    when 10 => inic <= '0'; shift_r <= '0'; done <= '0'; first <= '0'; strt_sq <= '0'; strt_mul <= '1'; last <= '0'; capt <= '0';
    when 11 => inic <= '0'; shift_r <= '0'; done <= '0'; first <= '0'; strt_sq <= '0'; strt_mul <= '0'; last <= '0'; capt <= '0';
    when 12 => inic <= '0'; shift_r <= '0'; done <= '0'; first <= '0'; strt_sq <= '0'; strt_mul <= '0'; last <= '0'; capt <= '1';
  end case;
  if reset = '1' then current_state <= 0;
  elsif clk'event and clk = '1' then
    case current_state is
      when 0 => if start = '0' then current_state <= 1; end if;
      when 1 => if start = '1' then current_state <= 2; end if;
      when 2 => current_state <= 3;
      when 3 => current_state <= 4;
      when 4 => if (done_mult = '1') then current_state <= 5; end if;
      when 5 => current_state <= 6;
      when 6 => current_state <= 7;
      when 7 => if (done_sq = '1' and (ee(0) = '0' or done_mult = '1')) then current_state <= 8; end if;
      when 8 => if count = N-1 then current_state <= 9; else current_state <= 6; end if;
      when 9 => current_state <= 10; --Adjust B = B/R
      when 10 => current_state <= 11;
      when 11 => current_state <= 12;
      when 12 => if (done_mult = '1') then current_state <= 0; end if;
    end case;
  end if;
end process control_unit;

An iterative implementation is also given in the VHDL file exponentiation_montgomery_adv.vhd, which can be found at www.arithmetic-circuits.org. It must be noted that the combinational squarer module used in this iterative circuit has a significant delay; therefore squaring is performed in two clock cycles. For this reason, a counter is used so that a done signal is activated when the count value reaches two.

7.4 Division

The quotient of two polynomials in $GF(2^m)$ can be computed using the binary algorithm specialized to p = 2. The binary algorithm for computing $z(x) = g(x)h^{-1}(x) \bmod f(x)$ has been described in Sec. 6.2 (Algorithm 6.5). If p = 2, it can be simplified.

Algorithm 7.19—Binary algorithm with p = 2

a := f; b := h; c := zero; d := g;
alpha := m; beta := m-1;
while beta >= 0 loop
  if b(0) = 0 then
    b := shift_one(b); d := divide_by_x(d, f);
    beta := beta - 1;
  else
    old_b := b; old_d := d; old_beta := beta;
    b := shift_one(add(a, b));
    d := divide_by_x(add(c, d), f);
    if alpha > beta then
      a := old_b; c := old_d;
      beta := alpha - 1; alpha := old_beta;
    else
      beta := beta - 1;
    end if;
  end if;
end loop;
z := c;

The computation primitives are

b(x)/x and $d(x)x^{-1} \bmod f(x)$, if $b_0 = 0$,
(a(x) + b(x))/x and $(c(x) + d(x))x^{-1} \bmod f(x)$, if $b_0 = 1$.

Given a polynomial α(x), the computation of $\alpha(x)x^{-1} \bmod f(x)$, with $f(x) = x^m + f_{m-1}x^{m-1} + f_{m-2}x^{m-2} + \cdots + f_1x + 1$, is performed as follows:

$$
\begin{aligned}
\alpha(x) \equiv \alpha(x) + \alpha_0 f(x)
&= \alpha_{m-1}x^{m-1} + \alpha_{m-2}x^{m-2} + \cdots + \alpha_1x + \alpha_0 \\
&\quad + \alpha_0x^m + \alpha_0f_{m-1}x^{m-1} + \alpha_0f_{m-2}x^{m-2} + \cdots + \alpha_0f_1x + \alpha_0 \\
&= \alpha_0x^m + (\alpha_{m-1} + \alpha_0f_{m-1})x^{m-1} + (\alpha_{m-2} + \alpha_0f_{m-2})x^{m-2} + \cdots + (\alpha_1 + \alpha_0f_1)x
\end{aligned}
$$

so that

$$
\alpha(x)x^{-1} \bmod f(x) = \alpha_0x^{m-1} + (\alpha_{m-1} + \alpha_0f_{m-1})x^{m-2} + (\alpha_{m-2} + \alpha_0f_{m-2})x^{m-3} + \cdots + (\alpha_1 + \alpha_0f_1)
$$
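The division-by-x primitive derived above amounts to a conditional XOR followed by a one-bit shift. A minimal Python sketch (a software illustration, not the book's VHDL) makes this explicit; it assumes polynomials are packed into integers with bit i as the coefficient of x^i and that `f` includes the x^m term and has constant term 1. The helper `times_x` is a hypothetical name added only to check the result.

```python
def divide_by_x(alpha, f):
    """alpha(x) * x^(-1) mod f(x): add alpha_0 * f(x), then shift right."""
    if alpha & 1:            # alpha_0 = 1: alpha + f is divisible by x
        alpha ^= f
    return alpha >> 1


def times_x(alpha, f, m):
    """alpha(x) * x mod f(x), the inverse operation, used only as a check."""
    alpha <<= 1
    if (alpha >> m) & 1:
        alpha ^= f
    return alpha


if __name__ == "__main__":
    f, m = 0b10011, 4                          # x^4 + x + 1
    inv_x = divide_by_x(1, f)                  # x^(-1) = x^3 + 1
    print(bin(inv_x), times_x(inv_x, f, m))    # second value is 1
```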

The corresponding datapath is shown in Fig. 7.8.

FIGURE 7.8 Binary algorithm: datapath. (Four registers: a, initially f(x), and c, initially 0, both enabled by ce_ac; b, initially h(x), and d, initially g(x), both enabled by ce_bd; multiplexers controlled by b0 select the shifted or added values, and the coefficients f1, ..., fm−1 feed the update of d.)

A generic VHDL model binary_algorithm_polynomials.vhd has been generated. The complete VHDL file is available at www.arithmetic-circuits.org. The entity declaration is

entity binary_algorithm_polynomials is
port(
  g, h: in std_logic_vector(m-1 downto 0);
  clk, reset, start: in std_logic;
  z: out std_logic_vector(m-1 downto 0);
  done: out std_logic
);
end binary_algorithm_polynomials;

The VHDL architecture corresponding to the circuit of Fig. 7.8 follows:

first_iteration: for i in 0 to m-2 generate
  next_b(i) <= (b(0) and (b(i+1) xor a(i+1))) or (not(b(0)) and b(i+1));
end generate;
next_b(m-1) <= b(0) and a(m);
next_d(m-1) <= (b(0) and (d(0) xor c(0))) or (not(b(0)) and d(0));
second_iteration: for i in 0 to m-2 generate
  next_d(i) <= (f(i+1) and next_d(m-1)) xor
               ((b(0) and (d(i+1) xor c(i+1))) or (not(b(0)) and d(i+1)));
end generate;

registers_ac: process(clk)
begin
  if clk'event and clk = '1' then
    if load = '1' then a <= f; c <= (others => '0');
    elsif ce_ac = '1' then a <= '0'&b; c <= d;
    end if;
  end if;
end process registers_ac;

registers_bd: process(clk)
begin
  if clk'event and clk = '1' then
    if load = '1' then b <= h; d <= g;
    elsif ce_bd = '1' then b <= next_b; d <= next_d;
    end if;
  end if;
end process registers_bd;

Additionally, the circuit includes components for storing and updating the variables alpha and beta as well as a control unit. It is important to note that the algorithms used for division can also be used for inversion.
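To make the remark about inversion concrete, the binary algorithm with p = 2 (Algorithm 7.19) can be transcribed into Python, with alpha and beta kept as plain integers. This is a hedged software sketch rather than the book's Ada or VHDL: polynomials are assumed to be packed into integers (bit i is the coefficient of x^i), `f` includes the x^m term and is assumed irreducible, and h must be nonzero. The names `gf2_divide` and `gf2_mulmod` are hypothetical.

```python
def divide_by_x(d, f):
    """d(x) * x^(-1) mod f(x), assuming f has constant term 1."""
    if d & 1:
        d ^= f
    return d >> 1


def gf2_divide(g, h, f, m):
    """Return z(x) = g(x) * h(x)^(-1) mod f(x), following Algorithm 7.19."""
    a, b, c, d = f, h, 0, g
    alpha, beta = m, m - 1
    while beta >= 0:
        if (b & 1) == 0:
            b >>= 1                          # b := b / x
            d = divide_by_x(d, f)
            beta -= 1
        else:
            old_b, old_d, old_beta = b, d, beta
            b = (a ^ b) >> 1                 # b := (a + b) / x
            d = divide_by_x(c ^ d, f)        # d := (c + d) / x mod f
            if alpha > beta:
                a, c = old_b, old_d
                beta = alpha - 1
                alpha = old_beta
            else:
                beta -= 1
    return c


def gf2_mulmod(a, b, f, m):
    """Shift-and-add product mod f(x), used only to check the quotient."""
    result = 0
    for i in range(m):
        if (b >> i) & 1:
            result ^= a
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return result


if __name__ == "__main__":
    f, m = 0b10011, 4                        # x^4 + x + 1
    inv_x = gf2_divide(1, 0b0010, f, m)      # g = 1 turns division into inversion
    print(bin(inv_x), gf2_mulmod(inv_x, 0b0010, f, m))
```

Calling `gf2_divide` with g = 1 returns the inverse of h, which is the sense in which the division circuit doubles as an inverter.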

7.5 Inversion

The multiplicative inverse $a^{-1}(x)$ of a(x) in the finite field $GF(2^m)$ is defined as the element that satisfies $a(x)\cdot a^{-1}(x) = 1$, where “⋅” denotes multiplication in $GF(2^m)$. The most popular methods for finite field inversion over $GF(2^m)$ are mainly based on Fermat’s theorem and on Euclid’s algorithm. Using Fermat’s theorem, the inverse of an element in $GF(2^m)$ can be found by successive squaring and multiplication. In normal basis representation of a Galois field, squaring is done by a simple cyclic shift. Hence, the algorithms based on Fermat’s theorem for inversion mainly choose this basis ([IT88], [Fen89], [ABMV93]). Euclid’s algorithm for polynomials calculates the greatest common divisor (gcd) polynomial of two polynomials. The algorithm can be extended for calculating the two polynomials, u(x) and w(x), that satisfy gcd(a(x), b(x)) = u(x) × a(x) + w(x) × b(x). The extended Euclid’s algorithm has been studied in Chap. 6. Let f(x) be the irreducible polynomial of degree m that defines the field, and a(x) be the polynomial representation of an element in the field. Since the greatest common divisor polynomial of a(x) and f(x) is 1, the multiplicative inverse $a^{-1}(x)$ of a(x) can be obtained as u(x) mod f(x) by replacing b(x) with f(x), that is, gcd(a(x), f(x)) = u(x) × a(x) + w(x) × f(x). Therefore, 1 = u(x) × a(x) mod f(x), and finally $a^{-1}(x) = u(x) \bmod f(x)$. An optimized algorithm for inversion in $GF(2^m)$ based on the extended Euclid’s algorithm is the following ([KTT07], [BCH93]). The algorithm tests only the mth coefficients of two polynomials in the computation of the gcd. In the listing below, {Op1, Op2} means that the two operations Op1 and Op2 are performed in parallel. Furthermore, $r_m$ and $s_m$ denote the mth coefficients of the polynomials r(x) and s(x), respectively, and d holds the difference of deg(r(x)) and deg(s(x)), where deg(⋅) denotes an upper bound on the degree of its argument [KTT07].

Algorithm 7.20—Algorithm for inversion in GF(2^m)

Input: a(x), f(x)
Output: u(x) = a^{-1}(x)

1. s(x) := f(x); v(x) := 0; r(x) := a(x); u(x) := 1; d := 0;
2. for i = 1 to 2m do
3.   if r_m = 0 then
4.     r(x) := x × r(x);
5.     u(x) := x × u(x);
6.     d := d + 1;
7.   else
8.     if s_m = 1 then
9.       s(x) := s(x) − r(x);
10.      v(x) := v(x) − u(x);
11.    end if
12.    s(x) := x × s(x);
13.    if d = 0 then
14.      {r(x) := s(x), s(x) := r(x)};
15.      {u(x) := x × v(x), v(x) := u(x)};
16.      d := 1;
17.    else
18.      u(x) := u(x)/x;
19.      d := d − 1;
20.    end if
21.  end if
22. end for

If u(x) and v(x) are held in registers with m + 1 bits, then reductions modulo f(x) are not needed for u(x) and v(x). Assume that functions rshiftm and lshiftm, performing 1-bit right and left shifts, respectively, of bit vectors with m + 1 bits (from 0 to m), are available, and that a function m2xvvm, performing the bit-wise XOR of two bit vectors x and y with m + 1 bits (x0 XOR y0, x1 XOR y1, . . . , xm XOR ym), is also available. Then, if the polynomials s(x), r(x), f(x), u(x), v(x), and a(x) are represented with bit vectors with m + 1 bits (type poly_vectorm), the above algorithm can be implemented as follows:

Algorithm 7.21—Extended Euclidean algorithm for inversion in GF(2^m)

for i in 0 .. m loop
  s(i) := f(i); r(i) := a(i); v(i) := 0; u(i) := 0; auxm(i) := 0;
end loop;
u(0) := 1; d := 0;
for i in 1 .. 2*m loop
  if r(m) = 0 then
    r := rshiftm(r); u := rshiftm(u); d := d + 1;
  else
    if s(m) = 1 then
      s := m2xvvm(s,r); v := m2xvvm(v,u);
    end if;
    s := rshiftm(s);
    if d = 0 then
      auxm := s; s := r; r := auxm;
      auxm := v; v := u; u := rshiftm(auxm);
      d := 1;
    else
      u := lshiftm(u); d := d - 1;
    end if;
  end if;
end loop;

where multiplications in steps 4, 5, 12, and 15 (Algorithm 7.20) are implemented with the rshiftm function, division in step 18 (Algorithm 7.20) is implemented with the lshiftm function, and subtractions in steps 9, 10 (Algorithm 7.20) are implemented with the m2xvvm function. An executable Ada file EEA_inversion.adb, including Algorithm 7.21, is available at www.arithmetic-circuits.org.
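The control flow of Algorithm 7.20 can be cross-checked with a short Python sketch (a software illustration, not the book's Ada or VHDL). It assumes polynomials are packed into integers with bit i as the coefficient of x^i, so multiplication by x is an integer left shift and division by x a right shift; `f` includes the x^m term and is assumed irreducible, and a must be nonzero with degree below m. The function names are hypothetical.

```python
def eea_inverse(a, f, m):
    """Return a(x)^(-1) mod f(x) following the steps of Algorithm 7.20."""
    if a == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    s, v = f, 0
    r, u = a, 1
    d = 0
    for _ in range(2 * m):
        if ((r >> m) & 1) == 0:          # r_m = 0
            r <<= 1                      # r := x * r
            u <<= 1                      # u := x * u
            d += 1
        else:
            if (s >> m) & 1:             # s_m = 1
                s ^= r                   # subtraction is XOR over GF(2)
                v ^= u
            s <<= 1                      # s := x * s
            if d == 0:
                r, s = s, r              # parallel swap {r := s, s := r}
                u, v = v << 1, u         # {u := x * v, v := u}
                d = 1
            else:
                u >>= 1                  # u := u / x
                d -= 1
    return u


def gf2_mulmod(a, b, f, m):
    """Shift-and-add product mod f(x), used only to check the inverse."""
    result = 0
    for i in range(m):
        if (b >> i) & 1:
            result ^= a
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return result


if __name__ == "__main__":
    f, m = 0b10011, 4                    # x^4 + x + 1
    inv = eea_inverse(0b1000, f, m)      # inverse of x^3
    print(bin(inv), gf2_mulmod(inv, 0b1000, f, m))
```

Python tuple assignment evaluates the right-hand side before storing, which matches the parallel {Op1, Op2} notation without a temporary variable such as auxm.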

A generic VHDL model EEA_inversion.vhd has been generated. The complete VHDL file is available at www.arithmetic-circuits.org. The entity declaration is

entity eea_inversion is
port (
  A: in std_logic_vector (M-1 downto 0);
  clk, reset, start: in std_logic;
  Z: out std_logic_vector (M-1 downto 0);
  done: out std_logic
);
end eea_inversion;

The VHDL architecture corresponding to the EEA_inversion follows:

Comb: process(r,s,u,v,d)
begin
  if R(m) = '0' then
    new_R <= R(M-1 downto 0) & '0';
    new_U <= U(M-1 downto 0) & '0';
    new_S <= S; new_V <= V;
    new_d <= d + 1;
  else
    if d = ZERO then
      if S(m) = '1' then
        new_R <= (S(M-1 downto 0) xor R(M-1 downto 0)) & '0';
        new_U <= (V(M-1 downto 0) xor U(M-1 downto 0)) & '0';
      else
        new_R <= S(M-1 downto 0) & '0';
        new_U <= V(M-1 downto 0) & '0';
      end if;
      new_S <= R; new_V <= U;
      new_d <= (0 => '1', others => '0');
    else --d /= ZERO
      new_R <= R;
      new_U <= '0' & U(M downto 1);
      if S(m) = '1' then
        new_S <= (S(M-1 downto 0) xor R(M-1 downto 0)) & '0';
        new_V <= (V xor U);
      else
        new_S <= S(M-1 downto 0) & '0';
        new_V <= V;
      end if;
      new_d <= d - 1;
    end if;
  end if;
end process;

registers: process(clk, reset)
begin
  if reset = '1' or first_step = '1' then
    r <= ('0' & A); s <= ('1' & F);
    u <= (0 => '1', others => '0');
    v <= (others => '0');
    d <= (others => '0');
  elsif clk'event and clk = '1' then
    if capture = '1' then
      r <= new_r; s <= new_s;
      u <= new_u; v <= new_v;
      d <= new_d;
    end if;
  end if;
end process;

A VHDL process models the combinational part. Additionally, a simple state machine, a counter, and registers are necessary to store intermediate data.

The Almost Inverse Algorithm [SOOS95] is a modification of the binary Euclidean algorithm and computes $a^{-1}(x)x^k \bmod f(x)$ as an intermediate result. The inverse $a^{-1}(x)$ is finally obtained by removing the factor $x^k$. Algorithm 7.22 is a modified version of the Almost Inverse Algorithm given in [HLM00], where the inverse is produced directly. The algorithm divides b(x) by x whenever u(x) is divided by x. If b(x) is not divisible by x, then b(x) is replaced by b(x) + f(x) before the division. Finally, $b(x) = a^{-1}(x) \bmod f(x)$. The algorithm in [HLM00] follows:

Algorithm 7.22—Almost Inverse Algorithm (modified) for inversion in GF(2^m)

Input: a(x), f(x)
Output: b(x) = a^{-1}(x)

1. v(x) := f(x); c(x) := 0; u(x) := a(x); b(x) := 1;
2. while x divides u(x) do
3.   u(x) := u(x)/x
4.   if x divides b(x) then
5.     b(x) := b(x)/x
6.   else
7.     b(x) := (b(x) + f(x))/x
8. if u(x) = 1 then return(b(x))
9. if deg(u(x)) < deg(v(x)) then
10.  {u(x) := v(x), v(x) := u(x)};
11.  {b(x) := c(x), c(x) := b(x)};
12. u(x) := u(x) + v(x);
13. b(x) := b(x) + c(x);
14. Go to step 2.

If functions degreem and unitym are available, which compute the degree of an (m + 1)-bit polynomial and determine whether an (m + 1)-bit polynomial is the unity, respectively, and if the polynomials b(x), c(x), f(x), u(x), v(x), and a(x) are represented with bit vectors with m + 1 bits (type poly_vectorm), then the above algorithm can be implemented as follows:

Algorithm 7.23—Modified Almost Inverse Algorithm for inversion in GF(2^m)

for i in 0 .. m loop
  v(i) := f(i); u(i) := a(i); b(i) := 0; c(i) := 0;
  auxm(i) := 0; inv(i) := 0;
end loop;
b(0) := 1;
degree_v := degreem(v); degree_u := degreem(u);
aux := 0; ext := 0;
while ext = 0 loop
  while u(0) /= 1 loop
    u := lshiftm(u); degree_u := degree_u - 1;
    if b(0) /= 1 then
      b := lshiftm(b);
    else
      auxm := b; b := lshiftm(m2xvvm(b,f));
    end if;
  end loop;
  ext := unitym(u);
  if ext = 1 then inv := b; end if;
  if degree_u < degree_v then
    auxm := v; v := u; u := auxm;
    aux := degree_u; degree_u := degree_v; degree_v := aux;
    auxm := c; c := b; b := auxm;
  end if;
  u := m2xvvm(u,v); degree_u := degreem(u);
  b := m2xvvm(b,c);
end loop;
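The modified Almost Inverse Algorithm can likewise be sketched in a few lines of Python (a software illustration, not the book's Ada or VHDL). The sketch assumes polynomials packed into integers with bit i as the coefficient of x^i, an irreducible `f` that includes the x^m term, and a nonzero input; it uses Python's `int.bit_length` to compare degrees instead of tracking degree_u and degree_v explicitly. The function names are hypothetical.

```python
def almost_inverse(a, f):
    """Return a(x)^(-1) mod f(x) following the steps of Algorithm 7.22."""
    if a == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    u, v = a, f
    b, c = 1, 0
    while True:
        while (u & 1) == 0:                      # x divides u(x)
            u >>= 1
            if (b & 1) == 0:
                b >>= 1                          # b := b / x
            else:
                b = (b ^ f) >> 1                 # b := (b + f) / x
        if u == 1:
            return b
        if u.bit_length() < v.bit_length():      # deg(u) < deg(v)
            u, v = v, u
            b, c = c, b
        u ^= v
        b ^= c


def gf2_mulmod(a, b, f, m):
    """Shift-and-add product mod f(x), used only to check the inverse."""
    result = 0
    for i in range(m):
        if (b >> i) & 1:
            result ^= a
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return result


if __name__ == "__main__":
    f, m = 0b10011, 4                            # x^4 + x + 1
    inv = almost_inverse(0b0011, f)              # inverse of x + 1
    print(bin(inv), gf2_mulmod(inv, 0b0011, f, m))
```

Unlike Algorithm 7.23, the sketch returns as soon as u(x) = 1 instead of completing the current pass of the outer loop.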

An executable Ada file MAIA_inversion.adb, including Algorithm 7.23, is available at www.arithmetic-circuits.org. The corresponding VHDL model MAIA_inversion.vhd has been generated. The entity declaration is

entity maia_inversion is
port (
   A: in std_logic_vector (M-1 downto 0);
   clk, reset, start: in std_logic;
   Z: out std_logic_vector (M-1 downto 0);
   done: out std_logic
);
end maia_inversion;

The VHDL architecture corresponding to the MAIA_inversion follows:

vb_comp: process(b,c,u,v,degree_u)
   variable d: natural;
begin
   if U(0) = '0' then
      new_u <= '0' & u(M downto 1);
      new_degree_u <= degree_u - 1;
      if B(0) = '0' then
         new_b <= '0' & b(M downto 1);
      else
         new_b <= '0' & (b(M downto 1) xor (F(M downto 1)));
      end if;
   else
      new_U <= U xor V;
      new_B <= B xor C;
      d := 0; --degree calculation
      for i in 0 to m loop
         if (U(i) xor V(i)) = '1' then d := i; end if;
      end loop;
      new_degree_u <= conv_std_logic_vector(d,logM+1);
   end if;
end process;

uc_comp: process(b,c,u,v,degree_u,degree_v)
begin
   if degree_u < degree_v then
      new_v <= u; new_c <= b; new_degree_v <= degree_u;
   else
      new_v <= v; new_c <= c; new_degree_v <= degree_v;
   end if;
end process;

degr_a: process(A)
   variable d: natural;
begin
   d := 0;
   for i in 0 to m-1 loop
      if A(i) = '1' then d := i; end if;
   end loop;
   degree_a <= conv_std_logic_vector(d,logM+1);
end process;

reg: process(clk, reset)
   variable d: natural;
begin
   if reset = '1' then
      u <= (others => '0'); v <= (others => '0');
      b <= (others => '0'); c <= (others => '0');
      degree_v <= (others => '0'); degree_u <= (others => '0');
   elsif clk'event and clk = '1' then
      if first_step = '1' then
         v <= ('1' & F); c <= (others => '0');
         degree_v <= conv_std_logic_vector(M, logM+1);
      elsif ce_vc = '1' then
         v <= new_v; c <= new_c; degree_v <= new_degree_v;
      end if;
      if first_step = '1' then
         u <= '0' & A; b <= ONE; degree_u <= degree_a;
      elsif ce_ub = '1' then
         u <= new_u; b <= new_b; degree_u <= new_degree_u;
      end if;
   end if;
end process;

end_inv <= '1' when U = ONE else '0';

control_unit: process(clk, reset, current_state, count)
begin
   case current_state is
      when 0 to 1 => first_step <= '0'; done <= '1'; ce_vc <= '0'; ce_ub <= '0';
      when 2 => first_step <= '1'; done <= '0'; ce_vc <= '0'; ce_ub <= '0';
      when 3 => first_step <= '0'; done <= '0'; ce_vc <= U(0); ce_ub <= '1';
   end case;
   if reset = '1' then
      current_state <= 0;
   elsif clk'event and clk = '1' then
      case current_state is
         when 0 => if start = '0' then current_state <= 1; end if;
         when 1 => if start = '1' then current_state <= 2; end if;
         when 2 => current_state <= 3;
         when 3 => if end_inv = '1' then current_state <= 0; end if;
      end case;
   end if;
end process control_unit;

7.6 Important Irreducible Polynomials

The choice of the irreducible polynomial f(x) may ease the arithmetic operations over GF(2^m), mainly the multiplication. Among the important irreducible polynomials usually selected, trinomials, pentanomials, ESPs (equally spaced polynomials), and AOPs (all-one polynomials) can be considered.

7.6.1 Equally Spaced Polynomials (ESPs)

A polynomial of the form $f(x) = x^{ns} + x^{(n-1)s} + \cdots + x^{s} + 1$ over the binary field GF(2), with m = ns, is called an equally spaced polynomial (also denoted as s-ESP) of degree m, where both n and s are integers and 1 ≤ s ≤ m/2. When s = 1, a 1-ESP is obtained and it is the same as the all-one polynomial (denoted as AOP). An AOP has the highest Hamming weight (i.e., the number of 1s) among all the polynomials of degree m. When s = m/2, the least Hamming weight irreducible polynomial (i.e., a trinomial) of degree m is obtained. For an s-ESP, the following expression is obtained:

$$x^{m+i} = \begin{cases} x^{i} + x^{s+i} + \cdots + x^{(n-1)s+i}; & 0 \le i < s \\ x^{i-s}; & s \le i \le m-2 \end{cases} \tag{7.47}$$

Equation (7.47) can be used to reduce the complexity of the arithmetic operations over GF(2^m) studied in previous sections.

FIGURE 7.9 Matrix P for a general s-ESP.

For example, using Eq. (7.47), the matrix P given in Eqs. (7.21) and (7.22) for multiplication is obtained as [RH04]:

$$P = \begin{pmatrix} \begin{matrix} I_s & I_s & \cdots & I_s \end{matrix} \\ \begin{matrix} I_{m-s-1} & 0_{s+1} \end{matrix} \end{pmatrix} \tag{7.48}$$

where $I_j$ is the j × j unity matrix and $0_{s+1}$ is a zero matrix with m − s − 1 rows and s + 1 columns. The graphical representation of P for a general s-ESP is given in Fig. 7.9, where nonzero entries of P are represented with "x" [RH04]. Using Eq. (7.48) for the computation of the multiplication given in Eq. (7.29) leads to the following expressions for the coordinates $c_j$ of the product $C = D + P^T E$:

$$c_j = g_j + e_{j \bmod s}, \qquad 0 \le j \le m-1 \tag{7.49}$$

where

$$g_j = \begin{cases} d_j + e_{j+s}; & 0 \le j \le m-s-2 \\ d_j; & m-s-1 \le j \le m-1 \end{cases} \tag{7.50}$$
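Equations (7.49) and (7.50) can be checked numerically. The Python sketch below is an illustration under stated assumptions: since Eq. (7.28) is not reproduced in this section, it assumes that D holds the coefficients of x^0 to x^(m-1) of the unreduced product and E the coefficients of x^m to x^(2m-2). The function names clmul, reduce_mod, and esp_multiply are chosen here for the example; reduce_mod is a generic reduction used only as a reference.

```python
def clmul(a: int, b: int) -> int:
    """Unreduced (carry-less) product of two binary polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_mod(p: int, f: int) -> int:
    """Generic reduction of p(x) modulo f(x), used only as a reference."""
    m = f.bit_length() - 1
    for i in range(p.bit_length() - 1, m - 1, -1):
        if (p >> i) & 1:
            p ^= f << (i - m)
    return p


def esp_multiply(a: int, b: int, m: int, s: int) -> int:
    """Product modulo the s-ESP of degree m, using Eqs. (7.49) and (7.50)."""
    p = clmul(a, b)
    d = [(p >> j) & 1 for j in range(m)]            # assumed vector D
    e = [(p >> (m + i)) & 1 for i in range(m - 1)]  # assumed vector E
    c = 0
    for j in range(m):
        g = (d[j] ^ e[j + s]) if j <= m - s - 2 else d[j]   # Eq. (7.50)
        c |= (g ^ e[j % s]) << j                            # Eq. (7.49)
    return c


if __name__ == "__main__":
    m, s = 6, 3
    f = (1 << 6) | (1 << 3) | 1          # x^6 + x^3 + 1, a 3-ESP
    a, b = 0b101101, 0b011011
    print(esp_multiply(a, b, m, s) == reduce_mod(clmul(a, b), f))
```

Each output coordinate needs at most two additions on top of D and E, which is the complexity reduction that Eq. (7.47) provides.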

7.6.2 General Irreducible Polynomials

Different P matrices are obtained for different irreducible polynomials. Furthermore, the matrix P can be decomposed into a sum of matrices $P_i$ that depends on the irreducible polynomial selected for the field. Let $f(x) = x^m + x^{k_t} + \cdots + x^{k_2} + x^{k_1} + 1$ be an irreducible polynomial, with $1 \le k_1 < k_2 < \cdots < k_t \le m/2$, and therefore with Hamming weight equal to t + 2. Using this irreducible polynomial, we see that $x^m = x^{k_t} + \cdots + x^{k_2} + x^{k_1} + 1$. It must be noted that the rows of the matrix P are the representations of $x^{m+i}$, with 0 ≤ i ≤ m − 2 [RH04]. Therefore, row 0 of P has 1s in the t + 1 columns $0, k_1, k_2, \ldots, k_t$. The consecutive rows of P can be obtained by using a linear feedback shift register, and thus the number of 1s in each of the rows 0 to $m - k_t - 1$ is equal to t + 1, corresponding to the t + 1 segmented lines. This fact can be observed, for example, in the matrices P given in Figs. 7.11 and 7.12 for trinomials. The last column of P contains a 1 in rows $i = m - k_j - 1$, with j = t, . . . , 2, 1. When a row ends with a 1, the following row originates new lines in columns $0, k_1, k_2, \ldots, k_t$, if there are no previous lines that pass through these columns [RH04]. If there exists a previous line that passes through the column $k_j$, 1 ≤ j ≤ t, then the previous line terminates in the column $k_j - 1$ and no new line originates from column $k_j$, due to the XOR of the two lines. This fact can be observed in row s and columns s, . . . , m − s in Fig. 7.9 for an s-ESP. If the lines of matrix P are divided into t + 1 sets, then P can be decomposed into a sum of matrices $P = P_0 + P_1 + \cdots + P_t$, where the nonzero entries of $P_i$, 0 ≤ i ≤ t, start from column $k_i$, assuming that $k_0 = 0$ [RH04]. The graphical representations of the matrices $P_0$, $P_1$, $P_2$, and $P_3$ for pentanomials $f(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1$, $1 \le k_1 < k_2 < k_3 \le m/2$, are shown in Fig. 7.10.

FIGURE 7.10 Submatrices of $P = P_0 + P_1 + P_2 + P_3$ for pentanomials $f(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1$, for $1 \le k_1 < k_2 < k_3 \le m/2$. (a) $P_0$, (b) $P_1$, (c) $P_2$, (d) $P_3$.

If the product of two field elements A and B is computed as given in Eq. (7.29), that is, $C = D + P^T E$, and if the vectors $E^{(i)}$ are defined as $E^{(i)} = (e_0^{(i)}, e_1^{(i)}, \ldots, e_{m-1}^{(i)})^T = P_i^T E$, then the product C can be written as

$$C = D + E^{(0)} + E^{(1)} + E^{(2)} + \cdots + E^{(t)} \tag{7.51}$$

Assuming that $k_1 \ne 1$ and using $P_0$ as shown in Fig. 7.10a, the elements of $E^{(0)}$ in Eq. (7.51) are given as follows [RH04]:

$$e_j^{(0)} = \begin{cases} e_j + e_{j+m-k_t} + \cdots + e_{j+m-k_2} + e_{j+m-k_1}; & 0 \le j \le k_1 - 2 \\ e_j + e_{j+m-k_t} + \cdots + e_{j+m-k_2}; & k_1 - 1 \le j \le k_2 - 2 \\ \quad\vdots \\ e_j + e_{j+m-k_t}; & k_{t-1} - 1 \le j \le k_t - 2 \\ e_j; & k_t - 1 \le j \le m - 2 \\ 0; & j = m - 1 \end{cases} \tag{7.52}$$

By reusing the terms $e_j^{(0)}$ given in Eq. (7.52), the coordinates of $E^{(i)}$, for 1 ≤ i ≤ t, can be given as

$$e_j^{(i)} = \begin{cases} 0; & 0 \le j \le k_i - 1 \\ e_{j-k_i}^{(0)}; & \text{otherwise} \end{cases} \tag{7.53}$$

Using Eqs. (7.52) and (7.53), the coordinates of the product C given as Eq. (7.51) can be computed as follows [RH04]:

$$c_j = d_j + \begin{cases} e_j^{(0)}; & 0 \le j \le k_1 - 1 \\ e_j^{(0)} + e_j^{(1)}; & k_1 \le j \le k_2 - 1 \\ \quad\vdots \\ e_j^{(0)} + e_j^{(1)} + \cdots + e_j^{(t-1)}; & k_{t-1} \le j \le k_t - 1 \\ e_j^{(0)} + e_j^{(1)} + \cdots + e_j^{(t)}; & k_t \le j \le m - 2 \\ e_j^{(1)} + e_j^{(2)} + \cdots + e_j^{(t)}; & j = m - 1 \end{cases} \tag{7.54}$$
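Equations (7.52) to (7.54) translate almost directly into software. The Python sketch below is an illustration, not the architecture of [RH04]: it assumes that D and E are the low and high halves of the unreduced product (Eq. (7.28) is not reproduced in this section), takes the exponents $k_1 < \cdots < k_t \le m/2$ as a list ks, and uses function names chosen for the example. A generic reduction serves as the reference.

```python
def clmul(a: int, b: int) -> int:
    """Unreduced (carry-less) product of two binary polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_mod(p: int, f: int) -> int:
    """Generic reduction modulo f(x), used only as a reference."""
    m = f.bit_length() - 1
    for i in range(p.bit_length() - 1, m - 1, -1):
        if (p >> i) & 1:
            p ^= f << (i - m)
    return p


def low_weight_multiply(a: int, b: int, m: int, ks: list) -> int:
    """Product modulo x^m + x^kt + ... + x^k1 + 1 with kt <= m/2."""
    p = clmul(a, b)
    d = [(p >> j) & 1 for j in range(m)]            # assumed vector D
    e = [(p >> (m + i)) & 1 for i in range(m - 1)]  # assumed vector E
    # Eq. (7.52): e0[j] = e[j] plus every e[j+m-k] with j <= k-2; e0[m-1] = 0
    e0 = [0] * m
    for j in range(m - 1):
        e0[j] = e[j]
        for k in ks:
            if j <= k - 2:
                e0[j] ^= e[j + m - k]
    # Eqs. (7.53)-(7.54): E(i) is E(0) shifted by k_i positions (k_0 = 0)
    c = 0
    for j in range(m):
        cj = d[j]
        for k in [0] + list(ks):
            if j >= k:
                cj ^= e0[j - k]
        c |= cj << j
    return c


if __name__ == "__main__":
    f = 0x11B                                  # x^8 + x^4 + x^3 + x + 1
    a, b = 0x57, 0x83
    print(low_weight_multiply(a, b, 8, [1, 3, 4]) == reduce_mod(clmul(a, b), f))
```

The condition $k_t \le m/2$ matters here: it guarantees that a term pushed above degree m − 1 needs only one further reduction, which is what makes the single vector $E^{(0)}$ reusable for every $E^{(i)}$.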

7.6.3 All-One Polynomials (AOPs)

An AOP is a polynomial of the form $f(x) = x^m + x^{m-1} + \cdots + x + 1$, that is, with all its coefficients nonzero. An AOP is irreducible, and therefore generates a field GF(2^m), if and only if m + 1 is a prime and 2 is a primitive element modulo m + 1 [MBGMVY93]. For example, for m ≤ 300, the AOP is irreducible for the following values of m: 2, 4, 10, 12, 18, 28, 36, 52, 58, 60, 66, 82, 100, 106, 130, 138, 148, 162, 172, 178, 180, 196, 210, 226, 268, and 292. For an AOP f(x), we see that $x^m = 1 + x + \cdots + x^{m-1}$, and therefore $x^{m+1} = 1$. Using this identity in Eqs. (7.21) and (7.22), one can find that the matrix P for AOPs is as follows:

$$P = \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 & 1 & 1 \\ 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 1 & 0 & \cdots & 0 & 0 & 0 \\ \vdots & & \ddots & & & & \vdots \\ 0 & 0 & 0 & \cdots & 1 & 0 & 0 \end{pmatrix} \tag{7.55}$$
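The irreducibility condition for AOPs is easy to evaluate. The following Python sketch applies the stated criterion (m + 1 prime and 2 of multiplicative order m modulo m + 1) to list the admissible degrees; the function names are chosen for this example, and trial division is used only because the range is small.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def order_of_two(p: int) -> int:
    """Multiplicative order of 2 modulo an odd number p > 1."""
    k, x = 1, 2 % p
    while x != 1:
        x = (2 * x) % p
        k += 1
    return k


def aop_degrees(limit: int) -> list:
    """Degrees m <= limit for which the AOP of degree m is irreducible."""
    # is_prime is tested first, so order_of_two only sees odd primes
    return [m for m in range(2, limit + 1)
            if is_prime(m + 1) and order_of_two(m + 1) == m]


if __name__ == "__main__":
    print(aop_degrees(300))
```

For m ≤ 300 this reproduces the list of degrees given in the text.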

Using Eq. (7.55), we see that the product or Mastrovito matrix Z can be decomposed as the sum of the following two matrices $Z_1$ and $Z_2$ [KS98]:

$$Z_1 = \begin{pmatrix} a_0 & 0 & a_{m-1} & a_{m-2} & \cdots & a_2 \\ a_1 & a_0 & 0 & a_{m-1} & \cdots & a_3 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{m-2} & a_{m-3} & a_{m-4} & a_{m-5} & \cdots & 0 \\ a_{m-1} & a_{m-2} & a_{m-3} & a_{m-4} & \cdots & a_0 \end{pmatrix} \tag{7.56}$$

$$Z_2 = \begin{pmatrix} 0 & a_{m-1} & a_{m-2} & a_{m-3} & \cdots & a_1 \\ 0 & a_{m-1} & a_{m-2} & a_{m-3} & \cdots & a_1 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & a_{m-1} & a_{m-2} & a_{m-3} & \cdots & a_1 \\ 0 & a_{m-1} & a_{m-2} & a_{m-3} & \cdots & a_1 \end{pmatrix} \tag{7.57}$$

In order to compute the product $C = ZB = (Z_1 + Z_2)B$, the two vectors $D = Z_1 B$ and $E = Z_2 B$ can be computed in parallel, and the result is then obtained as C = D + E. From Eq. (7.56), the coefficients $d_k$ of vector D can be computed as

$$d_k = \sum_{i=0}^{k} a_{k-i} b_i + \sum_{i=k+2}^{m-1} a_{m-1-(i-k-2)} b_i, \qquad k = 0, \ldots, m-1 \tag{7.58}$$

and the coefficients $e_k$ of vector E can be computed from Eq. (7.57) as

$$e = e_0 = e_1 = \cdots = e_{m-1} = \sum_{i=1}^{m-1} a_{m-1-(i-1)} b_i \tag{7.59}$$

because $Z_2$ is a matrix with identical rows. Finally, the coefficients $c_k$ of the result C will be

$$c_k = e + d_k, \qquad k = 0, \ldots, m-1 \tag{7.60}$$

Algorithm 7.24: Mastrovito multiplication for AOPs

for j in 0 .. m-1 loop
   c(j) := 0; d(j) := 0;
end loop;
e := 0;
for k in 0 .. m-1 loop
   for i in 0 .. k loop
      d(k) := m2xor(d(k),m2and(a(k-i),b(i)));
   end loop;
   for i in k+2 .. m-1 loop
      d(k) := m2xor(d(k),m2and(a(m-1-(i-k-2)),b(i)));
   end loop;
end loop;
for i in 1 .. m-1 loop
   e := m2xor(e,m2and(a(m-1-(i-1)),b(i)));
end loop;
for i in 0 .. m-1 loop
   c(i) := m2xor(e,d(i));
end loop;

An executable Ada file mastrovito_multiplication_AOP.adb, including Algorithm 7.24, is available at www.arithmetic-circuits.org. The VHDL model mastrovito_AOP_multiplication.vhd that implements the algorithm has been generated. The entity declaration is

entity mastrovito_AOP_multiplication is
generic(M : natural := 8);
port (
   a, b: in std_logic_vector(M-1 downto 0);
   c: out std_logic_vector(M-1 downto 0)
);
end mastrovito_AOP_multiplication;

The VHDL architecture is the following:

d1: for k in 0 to m-1 generate
   d2: process(d, a, b)
      variable aux: std_logic;
   begin
      aux := '0';
      for i in 0 to k loop
         aux := aux xor (a(k-i) and b(i));
      end loop;
      for i in k+2 to m-1 loop
         aux := aux xor (a(m-i+k+1) and b(i));
      end loop;
      d(k) <= aux;
   end process;
end generate;

e1: process(a, b)
   variable aux: std_logic;
begin
   aux := (a(m-1) and b(1));
   for i in 2 to m-1 loop
      aux := aux xor (a(m-i) and b(i));
   end loop;
   e <= aux;
end process;

c1: for i in 0 to m-1 generate
   c(i) <= e xor d(i);
end generate;
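Equations (7.58) to (7.60) can also be checked in software before synthesis. The Python sketch below mirrors Algorithm 7.24 with bit lists; it is an illustration only, and the names aop_multiply and gf2m_mul are chosen for the example. The shift-and-add routine gf2m_mul serves as an independent reference.

```python
def gf2m_mul(a: int, b: int, f: int) -> int:
    """Reference shift-and-add product modulo f(x); assumes deg(a) < deg(f)."""
    m = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r


def aop_multiply(a: int, b: int, m: int) -> int:
    """Product modulo the AOP of degree m, following Eqs. (7.58)-(7.60)."""
    A = [(a >> i) & 1 for i in range(m)]
    B = [(b >> i) & 1 for i in range(m)]
    e = 0
    for i in range(1, m):                 # Eq. (7.59): common term of all rows
        e ^= A[m - i] & B[i]
    c = 0
    for k in range(m):
        dk = 0
        for i in range(k + 1):            # first sum of Eq. (7.58)
            dk ^= A[k - i] & B[i]
        for i in range(k + 2, m):         # second sum of Eq. (7.58)
            dk ^= A[m - 1 - (i - k - 2)] & B[i]
        c |= (dk ^ e) << k                # Eq. (7.60)
    return c


if __name__ == "__main__":
    m = 4
    f = (1 << (m + 1)) - 1                # x^4 + x^3 + x^2 + x + 1
    a, b = 0b1011, 0b0110
    print(aop_multiply(a, b, m) == gf2m_mul(a, b, f))
```

Because e is shared by all m output coordinates, it is computed once, which corresponds to the single process e1 in the VHDL architecture.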

7.6.4 Trinomials

Let $f(x) = x^m + x^k + 1$ be an irreducible trinomial generating GF(2^m). The trinomial f(x) has only three nonzero coefficients, and no other binary irreducible polynomial of degree greater than 1 has fewer nonzero coefficients. Irreducible trinomials have drawn significant attention because polynomials with a low Hamming weight can reduce the complexity of finite field arithmetic ([HK99], [HK00], [IST06], [RH04], [SK99], [Wu02], [ZP01]). Furthermore, irreducible trinomials are abundant, although one does not exist for every degree m [Ser98]. For trinomials $f(x) = x^m + x^k + 1$, different matrices P are obtained for different values of k when a multiplication operation is considered [RH04]. In Fig. 7.11, the graphical representations of the location of the nonzero entries of P are shown for irreducible trinomials with k = 1 and 1 < k < m/2; the matrices P for m/2 < k < m and for k = m − 1 are given in Fig. 7.12. It must be noted that trinomials with k = m/2, that is, $f(x) = x^m + x^{m/2} + 1$, are (m/2)-ESPs. For m/2 < k ≤ m − 1, the following expressions for the coordinates $c_j$ of the product $C = D + P^T E$ given in Eq. (7.29) can be found using the matrices given in Fig. 7.12 [RH04].

FIGURE 7.11 Matrices P for trinomials $f(x) = x^m + x^k + 1$, with (a) k = 1, (b) 1 < k < m/2.

FIGURE 7.12 Matrices P for trinomials $f(x) = x^m + x^k + 1$, with (a) m/2 < k < m, (b) k = m − 1.

$$c_j = d_j + \begin{cases} h_j; & 0 \le j \le k - 2 \\ e_{k-1}; & j = k - 1 \\ e_j + h_{j-k}; & k \le j \le 2k - 2 \\ e_j + e_{j-k}; & 2k - 1 \le j \le m - 2 \\ h_{m-k-1}; & j = m - 1 \end{cases} \tag{7.61}$$

where the terms $h_j$ can be obtained recursively as follows:

$$h_j = \begin{cases} e_j + e_{j+m-k}; & k - 2 \ge j \ge 2k - m - 1 \\ e_j + h_{j+m-k}; & 2k - m - 2 \ge j \ge 0 \end{cases} \tag{7.62}$$

It can also be found that for k = m − 1, Eq. (7.62) becomes

$$h_j = \begin{cases} \sum_{i=j}^{m-2} e_i; & 0 \le j \le m - 2 \\ h_0; & j = m - 1 \end{cases} \tag{7.63}$$

Example 7.2 Multiplication in GF(2^4) for trinomial $f(x) = x^4 + x^3 + 1$

Let $f(x) = x^4 + x^3 + 1$ be the generating irreducible trinomial for GF(2^4). From Fig. 7.12b, the P matrix is as follows:

$$P = \begin{pmatrix} 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix} \tag{7.64}$$

From Eq. (7.64), the coordinates of the product $C = D + P^T E$ given in Eq. (7.29) can be computed as follows:

$$\begin{aligned} c_0 &= d_0 + e_0 + (e_1 + e_2) &&= d_0 + e_0 + h_1 &&= d_0 + h_0 \\ c_1 &= d_1 + (e_1 + e_2) &&= d_1 + h_1 &&= d_1 + h_1 \\ c_2 &= d_2 + e_2 &&= d_2 + e_2 &&= d_2 + e_2 \\ c_3 &= d_3 + e_0 + (e_1 + e_2) &&= d_3 + e_0 + h_1 &&= d_3 + h_0 \end{aligned} \tag{7.65}$$

where the E and D vectors are computed as in Eq. (7.28). The coordinates in Eq. (7.65) can also be obtained using Eqs. (7.61) and (7.62) or Eq. (7.63). The following algorithm can be given for the computation of the product using an irreducible trinomial $f(x) = x^m + x^k + 1$, with m/2 < k ≤ m − 1, which implements the expressions given in Eqs. (7.61) and (7.62).

Algorithm 7.25: Mastrovito multiplication for trinomials

d := vector_D(a,b); e := vector_E(a,b);
for j in reverse (2*k-m-1) .. k-2 loop
   h(j) := m2xor(e(j),e(j+m-k));
end loop;
for j in reverse 0 .. (2*k-m-2) loop
   h(j) := m2xor(e(j),h(j+m-k));
end loop;
for j in 0 .. k-2 loop
   c(j) := m2xor(d(j),h(j));
end loop;
c(k-1) := m2xor(d(k-1),e(k-1));
for j in k .. (2*k-2) loop
   if j < m-1 then
      c(j) := m2xor(d(j),m2xor(e(j),h(j-k)));
   end if;
end loop;
for j in (2*k-1) .. m-2 loop
   c(j) := m2xor(d(j),m2xor(e(j),e(j-k)));
end loop;
c(m-1) := m2xor(d(m-1),h(m-k-1));

An executable Ada file mastrovito_multiplication_v2_trinomials.adb, including Algorithm 7.25, is available at www.arithmetic-circuits.org. The corresponding VHDL code describing the circuit mastrovito_trinomials_multiplication.vhd is also available at the example page.
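The recursion of Eq. (7.62) and the case analysis of Eq. (7.61) can be exercised with a short Python sketch. It is an illustration under the assumption that D and E are the low and high halves of the unreduced product (Eq. (7.28) is not reproduced in this section); the function names are chosen for the example, and a generic reduction is used as the reference.

```python
def clmul(a: int, b: int) -> int:
    """Unreduced (carry-less) product of two binary polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_mod(p: int, f: int) -> int:
    """Generic reduction modulo f(x), used only as a reference."""
    m = f.bit_length() - 1
    for i in range(p.bit_length() - 1, m - 1, -1):
        if (p >> i) & 1:
            p ^= f << (i - m)
    return p


def trinomial_multiply(a: int, b: int, m: int, k: int) -> int:
    """Product modulo x^m + x^k + 1 for m/2 < k <= m-1, Eqs. (7.61)-(7.62)."""
    assert m < 2 * k and k <= m - 1
    p = clmul(a, b)
    d = [(p >> j) & 1 for j in range(m)]            # assumed vector D
    e = [(p >> (m + i)) & 1 for i in range(m - 1)]  # assumed vector E
    h = [0] * (k - 1)                               # h_0 .. h_{k-2}
    for j in range(k - 2, 2 * k - m - 2, -1):       # k-2 >= j >= 2k-m-1
        h[j] = e[j] ^ e[j + m - k]
    for j in range(2 * k - m - 2, -1, -1):          # 2k-m-2 >= j >= 0
        h[j] = e[j] ^ h[j + m - k]
    c = [0] * m
    for j in range(k - 1):
        c[j] = d[j] ^ h[j]
    c[k - 1] = d[k - 1] ^ e[k - 1]
    for j in range(k, min(2 * k - 2, m - 2) + 1):   # same guard as j < m-1
        c[j] = d[j] ^ e[j] ^ h[j - k]
    for j in range(2 * k - 1, m - 1):               # empty when k > m/2
        c[j] = d[j] ^ e[j] ^ e[j - k]
    c[m - 1] = d[m - 1] ^ h[m - k - 1]
    return sum(bit << j for j, bit in enumerate(c))


if __name__ == "__main__":
    f = 0b11001                                     # x^4 + x^3 + 1 (Example 7.2)
    a, b = 0b1011, 0b0111
    print(trinomial_multiply(a, b, 4, 3) == reduce_mod(clmul(a, b), f))
```

The second loop must run downward because each $h_j$ in that range depends on an $h$ with a larger index, exactly as the reverse loops in Algorithm 7.25 do.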

7.6.5 Pentanomials

A polynomial with five nonzero coefficients, that is, $f(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1$, is called a pentanomial of degree m, where $1 \le k_1 < k_2 < k_3 \le m - 1$. Irreducible pentanomials have also drawn significant attention because using them can reduce the complexity of finite field arithmetic in GF(2^m). Furthermore, it was shown in [Ser98] that there exists either an irreducible trinomial or an irreducible pentanomial of degree m for every m ∈ [2, 10,000]; therefore, an irreducible pentanomial can be used whenever an irreducible trinomial of degree m does not exist. Several classes of pentanomials have been considered in the literature ([ZP01], [RK03], [RH04], [IHT06]). Let us consider one of the special classes of irreducible pentanomials proposed in [ZP01], known as Class 1 pentanomials, for which $k_3 \le m/2$. Let us also consider pentanomials such that $k_1 = k_3 - k_2$. In such a case, the product $C = D + P^T E$ given in Eq. (7.29) can be computed as follows. Let the terms $h_j$ be intermediate terms defined as follows [RH04]:

$$h_j = e_{j+m-k_3} + e_{j+m-k_2}, \qquad 0 \le j \le k_2 - 2 \tag{7.66}$$

The above terms can be used to generate $e_j^{(0)}$, 0 ≤ j ≤ k_2 − 2, given in Eq. (7.52), by substituting t = 3 in Eq. (7.52) as follows:

$$e_j^{(0)} = \begin{cases} e_j + h_j + e_{j+m-k_1}; & 0 \le j \le k_1 - 2 \\ e_j + h_j; & k_1 - 1 \le j \le k_2 - 2 \\ e_j + e_{j+m-k_3}; & k_2 - 1 \le j \le k_3 - 2 \\ e_j; & k_3 - 1 \le j \le m - 2 \\ 0; & j = m - 1 \end{cases} \tag{7.67}$$

It was proven in [RH04] that the following equalities hold:

$$\begin{aligned} e_j^{(0)} + e_j^{(1)} &= h_{j+k_2-m}; & m - k_2 \le j \le m - 2 \\ e_j^{(2)} + e_j^{(3)} &= e_{j-k_2}^{(0)} + e_{j-k_2}^{(1)}; & k_3 \le j \le m - 1 \end{aligned} \tag{7.68}$$

Let $e_j^{(01)}$, 0 ≤ j ≤ m − 1, represent the elements of $(P_0 + P_1)^T E$, where $P_0$ and $P_1$ are the submatrices shown in Fig. 7.10a and 7.10b, respectively. Then, substituting t = 3 in Eq. (7.54) and using Eq. (7.68), the coordinates of the product C = AB given in Eq. (7.29) as $C = D + P^T E$ can be found as follows [RH04]:

$$c_j = d_j + e_j^{(01)} + e_{j-k_2}^{(01)}; \qquad 0 \le j \le m - 1 \tag{7.69}$$

where $e_{j-k_2}^{(01)} = 0$ for $j < k_2$ and

$$e_j^{(01)} = \begin{cases} e_j^{(0)}; & 0 \le j \le k_1 - 1 \\ e_j^{(0)} + e_j^{(1)}; & k_1 \le j \le m - k_2 - 1 \\ h_{j+k_2-m}; & m - k_2 \le j \le m - 2 \\ e_j^{(1)}; & j = m - 1 \end{cases} \tag{7.70}$$

Example 7.3 Multiplication in GF(2^8) for class 1 pentanomial $f(x) = x^8 + x^4 + x^3 + x + 1$

Let $f(x) = x^8 + x^4 + x^3 + x + 1$ be the generating irreducible pentanomial for GF(2^8). From Fig. 7.10, the P matrix is as follows:

$$P = \begin{pmatrix} 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 1 & 0 & 0 & 1 \end{pmatrix} \tag{7.71}$$

From Eq. (7.71), the product $C = D + P^T E$ given in Eq. (7.29) can be computed as follows:

$$\begin{aligned} c_0 &= d_0 + (e_0 + e_4 + e_5) \\ c_1 &= d_1 + (e_0 + e_1 + e_4 + e_6) \\ c_2 &= d_2 + (e_1 + e_2 + e_5) \\ c_3 &= d_3 + (e_0 + e_2 + e_3 + e_4 + e_5 + e_6) \\ c_4 &= d_4 + (e_0 + e_1 + e_3 + e_6) \\ c_5 &= d_5 + (e_1 + e_2 + e_4) \\ c_6 &= d_6 + (e_2 + e_3 + e_5) \\ c_7 &= d_7 + (e_3 + e_4 + e_6) \end{aligned} \tag{7.72}$$
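The matrix in Eq. (7.71) can be regenerated mechanically, since row i of P is the representation of $x^{m+i}$ and each row follows from the previous one by a multiplication by x modulo f(x), that is, by one step of a linear feedback shift register. The Python sketch below is an illustration with names chosen for the example; it builds P for any f(x) given as an integer and lists, for each $c_j$, the indices of the $e_i$ terms that appear in it.

```python
def reduction_matrix(f: int) -> list:
    """Rows 0..m-2 of P: row i holds the coefficients of x^(m+i) mod f(x)."""
    m = f.bit_length() - 1
    row = f ^ (1 << m)               # x^m mod f(x)
    rows = []
    for _ in range(m - 1):
        rows.append([(row >> j) & 1 for j in range(m)])
        row <<= 1                    # multiply by x (one LFSR step)
        if (row >> m) & 1:
            row ^= f
    return rows


def e_terms(f: int) -> list:
    """For each j, the indices i such that e_i appears in c_j = d_j + ..."""
    P = reduction_matrix(f)
    m = f.bit_length() - 1
    return [[i for i in range(m - 1) if P[i][j]] for j in range(m)]


if __name__ == "__main__":
    f = 0x11B                        # x^8 + x^4 + x^3 + x + 1
    for row in reduction_matrix(f):
        print(row)
    for j, idx in enumerate(e_terms(f)):
        print("c%d = d%d + " % (j, j) + " + ".join("e%d" % i for i in idx))
```

Reading the columns of P in this way gives the groups of $e_i$ terms of Eq. (7.72).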

The coordinates in Eq. (7.72) can also be obtained using Eqs. (7.66) to (7.70). It must be noted that at least one class 1 irreducible pentanomial exists for every m in the range of 160 to 600 [RH04]. Therefore, class 1 irreducible pentanomials are good candidates for the implementation of elliptic curve cryptosystems. The following algorithm can be given for the computation of the product using class 1 irreducible pentanomials $f(x) = x^m + x^{k_3} + x^{k_2} + x^{k_1} + 1$, with $k_3 \le m/2$ and $k_1 = k_3 - k_2$, which implements the expressions given in Eqs. (7.66) to (7.70).

Algorithm 7.26: Mastrovito multiplication for class 1 pentanomials

d := vector_D(a,b); e := vector_E(a,b);
for j in 0 .. k2-2 loop
   h(j) := m2xor(e(j+m-k3), e(j+m-k2));
end loop;
for j in 0 .. k1-2 loop
   e0(j) := m2xor(e(j),m2xor(h(j),e(j+m-k1)));
end loop;
for j in k1-1 .. k2-2 loop
   e0(j) := m2xor(e(j),h(j));
end loop;
for j in k2-1 .. k3-2 loop
   e0(j) := m2xor(e(j),e(j+m-k3));
end loop;
for j in k3-1 .. m-2 loop
   e0(j) := e(j);
end loop;
e0(m-1) := 0;
for j in 0 .. k1-1 loop
   e1(j) := 0;
end loop;
for j in k1 .. m-1 loop
   e1(j) := e0(j-k1);
end loop;
for j in 0 .. k1-1 loop
   e01(j) := e0(j);
end loop;
for j in k1 .. m-k2-1 loop
   e01(j) := m2xor(e0(j),e1(j));
end loop;
for j in m-k2 .. m-2 loop
   e01(j) := h(j+k2-m);
end loop;
e01(m-1) := e1(m-1);
for j in 0 .. m-1 loop
   if j < k2 then
      c(j) := m2xor(d(j),e01(j));
   else
      c(j) := m2xor(d(j),m2xor(e01(j),e01(j-k2)));
   end if;
end loop;

An executable Ada file mastrovito_multiplication_v2_pentanomials.adb, including Algorithm 7.26, and the corresponding VHDL description of the circuit mastrovito_pentanom_multiplication.vhd, are available at www.arithmetic-circuits.org.
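Algorithm 7.26 can be transcribed into Python for a quick functional check. The sketch below is an illustration under the assumption that vector_D and vector_E return the low and high halves of the unreduced product (their definition, Eq. (7.28), is not reproduced in this section); the function names are chosen for the example, and a generic reduction is used as the reference.

```python
def clmul(a: int, b: int) -> int:
    """Unreduced (carry-less) product of two binary polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_mod(p: int, f: int) -> int:
    """Generic reduction modulo f(x), used only as a reference."""
    m = f.bit_length() - 1
    for i in range(p.bit_length() - 1, m - 1, -1):
        if (p >> i) & 1:
            p ^= f << (i - m)
    return p


def class1_pentanomial_multiply(a, b, m, k1, k2, k3):
    """Product modulo x^m + x^k3 + x^k2 + x^k1 + 1, Eqs. (7.66)-(7.70)."""
    assert k1 == k3 - k2 and 2 * k3 <= m
    p = clmul(a, b)
    d = [(p >> j) & 1 for j in range(m)]            # assumed vector D
    e = [(p >> (m + i)) & 1 for i in range(m - 1)]  # assumed vector E
    h = [e[j + m - k3] ^ e[j + m - k2] for j in range(k2 - 1)]   # Eq. (7.66)
    e0 = [0] * m                                                 # Eq. (7.67)
    for j in range(m - 1):
        if j <= k1 - 2:
            e0[j] = e[j] ^ h[j] ^ e[j + m - k1]
        elif j <= k2 - 2:
            e0[j] = e[j] ^ h[j]
        elif j <= k3 - 2:
            e0[j] = e[j] ^ e[j + m - k3]
        else:
            e0[j] = e[j]
    e1 = [0] * k1 + e0[:m - k1]                     # Eq. (7.53) with i = 1
    e01 = [0] * m                                   # Eq. (7.70)
    for j in range(m):
        if j < k1:
            e01[j] = e0[j]
        elif j <= m - k2 - 1:
            e01[j] = e0[j] ^ e1[j]
        elif j <= m - 2:
            e01[j] = h[j + k2 - m]
        else:
            e01[j] = e1[j]
    c = 0
    for j in range(m):                              # Eq. (7.69)
        cj = d[j] ^ e01[j] ^ (e01[j - k2] if j >= k2 else 0)
        c |= cj << j
    return c


if __name__ == "__main__":
    f = 0x11B                                       # k1, k2, k3 = 1, 3, 4
    a, b = 0x57, 0x83
    print(class1_pentanomial_multiply(a, b, 8, 1, 3, 4) == reduce_mod(clmul(a, b), f))
```

The condition $k_1 = k_3 - k_2$ is what allows the contributions of $P_2$ and $P_3$ to be obtained as a shifted copy of $e^{(01)}$, so only one intermediate vector has to be computed.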

7.7 FPGA Implementations

Several circuits described in this chapter have been implemented within a Xilinx Spartan3 (speed grade -5) programmable device. The times (period, total time) are expressed in ns. Timing constraints were utilized when necessary. The parameters FFs and LUTs represent the number of flip-flops and look-up tables, respectively. Every slice includes two flip-flops and two look-up tables. All the source files are available at www.arithmetic-circuits.org. The results for m = 163 and m = 233 use the NIST-recommended polynomials $f(x) = x^{163} + x^7 + x^6 + x^3 + 1$ and $f(x) = x^{233} + x^{74} + 1$.

7.7.1 Classic Multipliers

The circuits were generated for specific polynomials and are fully combinational (see Table 7.1).

  m     LUTs      Slices    Total time (ns)
  8     52        28        3
  16    221       113       8
  32    941       477       14
  64    3,754     1,885     21
  128   14,279    9,602     33
  163   22,356    15,171    39
  233   ∞

∞ means that the circuit does not fit within the device.

TABLE 7.1 Cost and Delay of Classic Multipliers

7.7.2 Interleaved Multiplication

The circuits are for specific polynomials and are sequential implementations that produce one bit of the result per cycle (see Table 7.2).

  m     FFs    LUTs   Slices   Period (ns)   Cycles   Total time (ns)
  8     32     36     20       3.1           8        25
  32    108    115    62       3.5           32       112
  64    208    211    114      3.9           64       250
  128   405    405    216      4.8           128      614
  163   527    511    287      5.0           163      815
  233   765    725    420      5.0           233      1,165

TABLE 7.2 Cost and Delay of Interleaved Multipliers

7.7.3 Mastrovito Multipliers

The circuits were generated for specific polynomials and are fully combinational. The cost and delay of several Mastrovito multipliers are shown in Table 7.3.

  m     LUTs     Slices   Total time (ns)
  8     59       30       4
  16    255      128      9
  32    1,068    537      14
  64    4,466    2,242    22
  128   ∞
  163   ∞
  233   ∞

∞ means that the circuit does not fit within the device.

TABLE 7.3 Cost and Delay of Mastrovito Multipliers

7.7.4 Mastrovito Multipliers, Second Version

The circuits were generated for specific polynomials and are fully combinational (see Table 7.4).

  m     LUTs      Slices    Total time (ns)
  8     53        28        3
  16    222       112       8
  32    934       473       13
  64    3,728     1,877     18
  128   14,249    14,249    30
  163   22,347    15,201    36
  233   ∞

∞ means that the circuit does not fit within the device.

TABLE 7.4 Cost and Delay for Second Version of Mastrovito Multipliers

7.7.5 Interleaved Multiplication, Advanced Version

The combinational circuits are area-hungry; on the other hand, sequential circuits that compute one bit of the result at each cycle are slow. A trade-off between area and speed can be reached by computing G bits per clock cycle. The cost and delay results for the 163- and 233-bit NIST-recommended polynomials are shown in Tables 7.5 and 7.6, respectively.
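The effect of G on latency can be estimated with simple arithmetic. The Python sketch below assumes that a multiplier processing G bits per cycle needs ceil(m/G) cycles and that the total time is the clock period multiplied by the number of cycles; this relation is not stated in the text, but it is consistent with the 163-bit entries of Table 7.5, which are copied into the example. The function name interleaved_cost is chosen for the illustration.

```python
def interleaved_cost(m: int, g: int, period_ns: float):
    """Cycle count and total time, assuming ceil(m/G) cycles per product."""
    cycles = -(-m // g)                  # integer ceiling of m / g
    return cycles, cycles * period_ns


# (G, period in ns, cycles, total time in ns) as listed in Table 7.5 for m = 163
TABLE_7_5 = [
    (1, 5.0, 163, 815), (2, 4.5, 82, 369), (4, 4.8, 41, 197),
    (6, 4.8, 28, 134), (8, 5.0, 21, 105), (11, 5.2, 15, 78),
    (13, 5.7, 13, 74), (15, 5.8, 11, 64), (33, 7.5, 5, 37),
    (55, 9.7, 3, 29),
]

if __name__ == "__main__":
    for g, period, cycles_tab, total_tab in TABLE_7_5:
        cycles, total = interleaved_cost(163, g, period)
        print(g, cycles, cycles_tab, round(total, 1), total_tab)
```

The period grows with G because more reduction logic sits between registers, so the total time decreases more slowly than 1/G.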

7.7.6 Montgomery Multipliers

The circuits were generated for specific polynomials, and results for fully combinational and sequential circuits are presented. The cost and delay of several combinational and sequential Montgomery multipliers are shown in Tables 7.7 and 7.8, respectively.

  m     G    FFs   LUTs    Slices   Period (ns)   Cycles   Total time (ns)
  163   1    509   511     271      5.0           163      815
  163   2    527   676     369      4.5           82       369
  163   4    531   849     463      4.8           41       197
  163   6    538   1,017   555      4.8           28       134
  163   8    555   1,356   745      5.0           21       105
  163   11   546   1,843   965      5.2           15       78
  163   13   515   1,884   975      5.7           13       74
  163   15   528   2,215   1,161    5.8           11       64
  163   33   560   4,449   2,304    7.5           5        37
  163   55   589   6,956   3,588    9.7           3        29

G is the number of bits computed per clock cycle.

TABLE 7.5 Cost and Delay of Interleaved Advanced Multipliers and $f(x) = x^{163} + x^7 + x^6 + x^3 + 1$

  m     G    FFs     LUTs     Slices   Period (ns)   Cycles   Total time (ns)
  233   1    763     723      417      6.4           223      1,427
  233   2    769     957      541      5.6           112      627
  233   4    794     1,192    689      5.5           56       308
  233   8    780     1,919    1,045    5.7           28       160
  233   15   880     3,112    1,736    5.9           15       88
  233   16   879     3,130    1,743    5.9           14       83
  233   32   932     6,213    3,346    7.5           7        52
  233   56   1,321   10,112   5,718    11.5          4        46

G is the number of bits computed per clock cycle.

TABLE 7.6 Cost and Delay of Interleaved Advanced Multipliers and $f(x) = x^{233} + x^{74} + 1$

  m     LUTs     Slices   Total time (ns)
  8     61       31       10
  16    243      122      15
  32    997      500      32
  64    4,045    2,028    74
  128   16,270   8,508    165
  163            ∞
  233            ∞

∞ means that the circuit does not fit within the device.

TABLE 7.7 Cost and Delay of Combinational Montgomery Multiplier

  m     FFs   LUTs   Slices   Period (ns)   Cycles   Total time (ns)
  8     23    30     16       3.1           8        25
  16    42    50     29       3.5           16       56
  32    76    84     47       3.6           32       115
  64    142   148    80       5.9           64       378
  128   273   276    147      6.3           128      806
  163   344   347    184      7.4           163      1,206
  233   484   489    255      7.5           233      1,748

TABLE 7.8 Cost and Delay of Sequential Montgomery Multiplier

7.7.7 Classic Squaring

The circuits were generated for specific polynomials and are fully combinational (see Table 7.9).

  m     LUTs   Slices   Total time (ns)
  8     8      5        2
  16    17     9        2
  32    47     24       3
  64    170    87       3
  128   339    173      4
  163   165    86       3
  233   153    99       3

TABLE 7.9 Cost and Delay of Classic Squarers

7.7.8 LSB-First Squarer, Second Version

The circuits are for specific polynomials and are sequential implementations that produce a bit of result per cycle. The cost and delay of several LSB-first squarers are shown in Table 7.10. The results show that for a specific polynomial a fully combinational circuit uses fewer resources and is faster.

  m     FFs   LUTs   Slices   Period (ns)   Cycles   Total time (ns)
  8     28    31     18       3.1           4        12
  16    49    57     31       3.1           8        25
  32    92    109    59       3.5           16       56
  64    175   207    111      4.3           32       138
  128   369   402    243      4.9           64       314
  163   464   510    306      5.3           82       435
  233   659   723    436      6.0           117      702

TABLE 7.10 Cost and Delay of LSB-First Squarer

7.7.9 Montgomery Squarer

The circuits were generated for specific polynomials, and results for fully combinational and sequential circuits are shown in Tables 7.11 and 7.12, respectively.

  m     LUTs   Slices   Total time (ns)
  8     6      3        3
  16    16     8        3
  32    67     35       6
  64    193    99       8
  128   418    214      25
  163   267    147      20
  233   117    74       7

TABLE 7.11 Cost and Delay of Combinational Montgomery Squarer

  m     FFs   LUTs   Slices   Period (ns)   Cycles   Total time (ns)
  8     23    24     14       3.1           8        25
  16    41    44     25       3.2           16       52
  32    73    79     41       3.4           32       109
  64    141   147    80       3.4           64       218
  128   274   271    147      3.9           128      500
  163   361   341    199      4.3           163      701
  233   542   484    309      4.4           233      1,025

TABLE 7.12 Cost and Delay of Sequential Montgomery Squarer

7.7.10 Binary Exponentiation

The circuits were generated for specific polynomials. The circuits are sequential and the number of cycles depends on the number of ones in the exponent. The worst case occurs when the exponent is $e = 2^m - 1$ (i.e., all ones). The cost and delay of binary exponentiation are shown in Table 7.13.

  m     FFs     LUTs    Slices   Period (ns)   Average time (μs)   Worst time (μs)
  8     68      82      48       4.0           0.22                0.35
  163   1,042   1,124   614      5.0           69                  135
  233   1,539   1,477   865      5.0           140                 275

TABLE 7.13 Cost and Delay of Binary Exponentiation
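The dependence of the computation time on the number of ones in the exponent can be seen in a generic square-and-multiply loop. The Python sketch below is an illustration and not the circuit measured in Table 7.13: it scans the exponent from the least significant bit (the scanning direction of the book's circuit is not stated in this section), counts squarings and multiplications, and uses function names chosen for the example.

```python
def gf2m_mul(a: int, b: int, f: int) -> int:
    """Shift-and-add product modulo f(x); assumes deg(a) < deg(f)."""
    m = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r


def gf2m_pow(a: int, e: int, f: int):
    """Return (a^e mod f, number of squarings, number of multiplications)."""
    result, base = 1, a
    squarings = mults = 0
    while e:
        if e & 1:                        # one multiplication per 1 in e
            result = gf2m_mul(result, base, f)
            mults += 1
        e >>= 1
        if e:                            # one squaring per remaining bit
            base = gf2m_mul(base, base, f)
            squarings += 1
    return result, squarings, mults


if __name__ == "__main__":
    f = 0x11B                            # x^8 + x^4 + x^3 + x + 1
    m = 8
    print(gf2m_pow(0x57, (1 << m) - 1, f))   # worst case: all-ones exponent
    print(gf2m_pow(0x57, 1 << (m - 1), f))   # a single 1 in the exponent
```

With an all-ones exponent every bit triggers a multiplication, so the worst case costs roughly twice as many field operations as an exponent with about half of its bits set, which agrees with the ratio between the worst and average times in Table 7.13.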

7.7.11 Montgomery Exponentiation

The circuits were generated for specific polynomials. The circuits are sequential implementations (see Table 7.14).

  m     FFs     LUTs    Slices   Period (ns)   Total time (μs)
  8     93      112     68       3.8           0.38
  163   1,359   1,260   752      5.4           147
  233   1,926   1,773   1,050    5.9           326

TABLE 7.14 Cost and Delay of Montgomery Exponentiation

7.7.12 Division

The circuits were generated for specific polynomials. The circuits are sequential implementations and use 2m cycles in the worst case (see Table 7.15).

  m     FFs   LUTs    Slices   Period (ns)   Cycles   Total time (ns)
  8     43    64      42       4.5           16       72
  16    77    96      68       5.0           32       160
  32    146   170     124      6.2           64       397
  64    275   300     223      7.2           128      922
  128   537   571     430      7.2           256      1,843
  163   679   726     544      7.4           326      2,412
  233   962   1,013   763      7.4           466      3,448

TABLE 7.15 Cost and Delay of Binary Polynomial Divisor

7.7.13 Extended Euclidean Algorithm (EEA) for Inversion

The circuits were generated for specific polynomials. The circuits are sequential implementations and use 2m cycles (see Table 7.16).

  m     FFs   LUTs    Slices   Period (ns)   Cycles   Total time (ns)
  8     49    110     58       4.2           16       67
  16    82    239     127      4.8           32       154
  32    147   441     235      5.4           64       346
  64    291   879     460      5.6           128      717
  128   581   1,612   923      5.8           256      1,485
  163   693   1,927   984      6.7           326      2,184
  233   980   2,742   1,401    7.5           466      3,495

TABLE 7.16 Cost and Delay of EEA for Inversion

7.7.14 Modified Almost Inverse Algorithm (MAIA) for Inversion

The circuits were generated for specific polynomials. The circuits are sequential implementations (see Table 7.17).

  m     FFs   LUTs    Slices   Period (ns)   Cycles   Total time (ns)
  8     48    105     55       4.2           16       67
  16    86    199     108      5.4           32       173
  32    150   377     199      6.8           64       435
  64    280   751     398      7.4           128      947
  128   540   1,527   812      8.9           256      2,278
  163   678   1,931   1,026    9.8           326      3,195
  233   963   2,763   1,461    10.8          466      5,033

TABLE 7.17 Cost and Delay of MAIA for Inversion

7.7.15 Important Irreducible Polynomials

The circuits were generated for specific polynomials. Results are presented for AOPs, trinomials, and class 1 pentanomials. The circuits are fully combinational. The cost and delay of several multipliers using AOPs, trinomials, and class 1 pentanomials are shown in Tables 7.18, 7.19, and 7.20, respectively.

  m     LUTs     Slices   Total time (ns)
  10    46       24       4
  60    3,055    2,101    9
  130   14,154   8,590    15
  196   32,121   20,455   ∞

∞ means that the circuit does not fit within the device.

TABLE 7.18 Cost and Delay for AOPs

  m     LUTs     Slices   Total time (ns)   Polynomial
  9     67       44       5                 x^9 + x^8 + 1
  65    3,557    2,029    9                 x^65 + x^47 + 1
  129   13,978   8,266    ∞                 x^129 + x^83 + 1
  167   23,370   14,362   ∞                 x^167 + x^108 + 1

∞ means that the circuit does not fit within the device.

TABLE 7.19 Cost and Delay for Trinomials

  m     LUTs     Slices   Total time (ns)   Polynomial
  8     56       28       4                 x^8 + x^4 + x^3 + x + 1
  64    3,480    2,011    10                x^64 + x^4 + x^3 + x + 1
  200   33,624   19,306   ∞                 x^200 + x^5 + x^3 + x^2 + 1
  283   67,110   33,555   ∞                 x^283 + x^12 + x^7 + x^5 + 1

∞ means that the circuit does not fit within the device.

TABLE 7.20 Cost and Delay for Class 1 Pentanomials

7.8 Comments and Conclusions

According to the implementation results, the following conclusions are obtained.

1. For modular multipliers, combinational circuits are too expensive in terms of area for large polynomials; in several cases they cannot be implemented in a single device. Sequential implementations need m cycles (m being the degree of f(x)) to obtain a result and could be too slow. A trade-off can be obtained using a sequential circuit that computes G bits per cycle. Tables 7.5 and 7.6 show results for the 163- and 233-bit NIST-recommended polynomials.
2. Regarding squaring, combinational circuits are simpler and faster than the corresponding sequential circuits.
3. For exponentiation, the computation time depends on the number of ones in the exponent, and the multiplication determines the worst-case time. For faster exponentiation, a multiplier such as the one in Sec. 7.7.5 should be used.
4. For division and inversion, the binary division can be used for inversion with good results. The critical path of the MAIA inversion is in the computation of the degree of the polynomials.
5. For multipliers with special irreducible polynomials (AOPs, trinomials, pentanomials), combinational circuits have the same area problems as combinational multipliers with general irreducible polynomials, but with a lower complexity (area, delay).

7.9

References [ABMV93] G. B. Agnew, T. Beth, R. C. Mullin, and S. A. Vanstone. “Arithmetic operations in GF(2m).” Journal of Cryptology, vol. 6, no. 1, pp. 3–13, 1993. [Ara93] B. Arazi. “Architectures for Exponentiation Over GF(2n) Adopted for Smartcard Application.” IEEE Transactions on Computers, vol. 42, no. 4, pp. 494–497, April 1993. [BCH93] H. Brunner, A. Curiger, and M. Hofstetter. “On Computing Multiplicative Inverses in GF(2m).” IEEE Transactions on Computers, vol. 42, no. 8, pp. 1010– 1015, August 1993.

231

232

[Fen89] G. L. Feng. “A VLSI Architecture for Fast Inversion in $GF(2^m)$.” IEEE Transactions on Computers, vol. 38, no. 10, pp. 1383–1386, October 1989.
[GGKPP06] J. Guajardo, T. Güneysu, S. S. Kumar, C. Paar, and J. Pelzl. “Efficient Hardware Implementation of Finite Fields with Applications to Cryptography.” Acta Applicandae Mathematicae, ed. J. L. Imaña, vol. 93, nos. 1–3, pp. 75–118, September 2006, Springer Verlag, Netherlands.
[HK00] A. Halbutogullari and Ç. K. Koç. “Mastrovito Multiplier for General Irreducible Polynomials.” IEEE Transactions on Computers, vol. 49, no. 5, pp. 503–518, May 2000.
[HK99] A. Halbutogullari and Ç. K. Koç. “Mastrovito Multiplier for General Irreducible Polynomials.” Applied Algebra, Algebraic Algorithms, and Error-Correcting Codes, Lecture Notes in Computer Science, no. 1719, pp. 498–507, Springer-Verlag, Berlin, 1999.
[HLM00] D. Hankerson, J. López, and A. Menezes. “Software Implementation of Elliptic Curve Cryptography Over Binary Fields.” Proceedings of Cryptographic Hardware and Embedded Systems (CHES 2000), LNCS 1965, pp. 1–24, August 2000.
[IHT06] J. L. Imaña, R. Hermida, and F. Tirado. “Low Complexity Bit-Parallel Multipliers Based on a Class of Irreducible Pentanomials.” IEEE Transactions on VLSI Systems, vol. 14, no. 12, pp. 1388–1393, December 2006.
[IST06] J. L. Imaña, J. M. Sánchez, and F. Tirado. “Bit-Parallel Finite Field Multipliers for Irreducible Trinomials.” IEEE Transactions on Computers, vol. 55, no. 5, pp. 520–533, May 2006.
[IT88] T. Itoh and S. Tsujii. “A Fast Algorithm for Computing Multiplicative Inverses in $GF(2^m)$ Using Normal Bases.” Information and Computation, vol. 78, no. 3, pp. 21–40, September 1988.
[JSP98] S. K. Jain, L. Song, and K. K. Parhi. “Efficient Semisystolic Architectures for Finite-Field Arithmetic.” IEEE Transactions on Computers, vol. 6, no. 1, pp. 101–113, March 1998.
[KA97] Ç. K. Koç and T. Acar. “Fast Software Exponentiation in $GF(2^k)$.” 13th IEEE Symposium on Computer Arithmetic, pp. 225–231, July 1997.
[KA98] Ç. K. Koç and T. Acar. “Montgomery Multiplication in $GF(2^k)$.” Designs, Codes and Cryptography, vol. 14, no. 1, pp. 57–69, April 1998.
[Knu81] D. E. Knuth. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. Addison-Wesley, MA, USA, 2nd ed., 1981.
[KO63] A. Karatsuba and Y. Ofman. “Multiplication of Multidigit Numbers on Automata.” Sov. Phys.-Dokl., vol. 7, no. 7, pp. 595–596, 1963.
[KS98] Ç. K. Koç and B. Sunar. “Low-Complexity Bit-Parallel Canonical and Normal Basis Multipliers for a Class of Finite Fields.” IEEE Transactions on Computers, vol. 47, no. 3, pp. 353–356, March 1998.
[KTT07] K. Kobayashi, N. Takagi, and K. Takagi. “An Algorithm for Inversion in $GF(2^m)$ Suitable for Implementation Using a Polynomial Multiply Instruction on $GF(2)$.” 18th IEEE Symposium on Computer Arithmetic, pp. 105–112, June 2007.
[Mas88] E. D. Mastrovito. “VLSI Designs for Multiplication over Finite Fields $GF(2^m)$.” Proc. Sixth Int’l Conf. Applied Algebra, Algebraic Algorithms, and Error-Correcting Codes (AAECC-6), pp. 297–309, July 1988.
[Mas91] E. D. Mastrovito. “VLSI Architectures for Computation in Galois Fields.” PhD thesis, Linköping University, Dept. Electr. Eng., Linköping, Sweden, 1991.
[MBGMVY93] A. J. Menezes, I. Blake, X. Gao, R. Mullin, S. Vanstone, and T. Yaghoobian. Applications of Finite Fields. Kluwer Academic Publisher, Boston, MA, 1993.
[Mon85] P. L. Montgomery. “Modular Multiplication without Trial Division.” Mathematics of Computation, vol. 44, pp. 519–521, 1985.
[Paa94] C. Paar. “Efficient VLSI Architectures for Bit Parallel Computation in Galois Fields.” PhD thesis, Universität GH Essen, 1994.
[Par99] K. K. Parhi. VLSI Digital Signal Processing Systems: Design and Implementation. John Wiley & Sons, New York, 1999.

[PL07] S. Peter and P. Langendörfer. “An Efficient Polynomial Multiplier in $GF(2^m)$ and Its Application to ECC Designs.” Design, Automation and Test in Europe, pp. 1253–1258, 2007.
[RH04] A. Reyhani-Masoleh and A. Hasan. “Low Complexity Bit Parallel Architectures for Polynomial Basis Multiplication over $GF(2^m)$.” IEEE Transactions on Computers, vol. 53, no. 8, pp. 945–959, August 2004.
[RK03] F. Rodríguez-Henríquez and Ç. K. Koç. “Parallel Multipliers Based on Special Irreducible Pentanomials.” IEEE Transactions on Computers, vol. 52, no. 12, pp. 1535–1542, December 2003.
[RMSC06] F. Rodríguez-Henríquez, G. Morales-Luna, N. Saqib, and N. Cruz-Cortés. “Parallel Itoh-Tsujii Multiplicative Inversion Algorithm for a Special Class of Trinomials.” Cryptology ePrint Archive, Report 2006/035, 2006. http://eprint.iacr.org/.
[RSDK06] F. Rodríguez-Henríquez, N. Saqib, A. Díaz-Pérez, and Ç. K. Koç. Cryptographic Algorithms on Reconfigurable Hardware. Springer, New York, 2006.
[SOOS95] R. Schroeppel, H. Orman, S. O’Malley, and O. Spatscheck. “Fast Key Exchange with Elliptic Curve Systems.” Advances in Cryptology – Crypto’95, LNCS 963, pp. 43–56, 1995.
[Ser98] G. Seroussi. “Table of Low-Weight Binary Irreducible Polynomials.” HP Labs Technical Report HPL-98-135, August 1998.
[SK99] B. Sunar and Ç. K. Koç. “Mastrovito Multiplier for All Trinomials.” IEEE Transactions on Computers, vol. 48, no. 5, pp. 522–527, May 1999.
[SSTP88] P. A. Scott, S. J. Simmons, S. E. Tavares, and L. E. Peppard. “Architectures for Exponentiation in $GF(2^m)$.” IEEE Journal on Selected Areas in Communications, vol. 6, no. 3, pp. 578–586, April 1988.
[Wan94] C.-L. Wang. “Bit-Level Systolic Array for Fast Exponentiation in $GF(2^m)$.” IEEE Transactions on Computers, vol. 43, no. 7, pp. 838–841, July 1994.
[Wu02] H. Wu. “Bit-Parallel Finite Field Multiplier and Squarer Using Polynomial Basis.” IEEE Transactions on Computers, vol. 51, no. 7, pp. 750–758, July 2002.
[ZP01] T. Zhang and K. K. Parhi. “Systematic Design of Original and Modified Mastrovito Multipliers for General Irreducible Polynomials.” IEEE Transactions on Computers, vol. 50, no. 7, pp. 734–749, July 2001.


CHAPTER 8

Operations over $GF(2^m)$: Normal Bases

Chapter 8 deals with the arithmetic operations over the binary field $GF(2^m)$ when the elements of the finite field are represented in a normal basis. For an element $\beta \in GF(2^m)$, the set $N = \{\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}\}$ is called a normal basis of $GF(2^m)$ over $GF(2)$ if $\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}$ are linearly independent ([LN94], [MBGMVY93]). In such a case, we say that $\beta$ generates the normal basis $N$, or that $\beta$ is a normal element of $GF(2^m)$ over $GF(2)$. It is well known that a normal basis of $GF(2^m)$ over $GF(2)$ exists for every positive integer $m$. Using a normal basis, any element $A \in GF(2^m)$ can be represented as

$$A = \sum_{i=0}^{m-1} a_i \beta^{2^i} = a_0\beta + a_1\beta^{2} + \cdots + a_{m-1}\beta^{2^{m-1}} \tag{8.1}$$

where $a_i \in GF(2)$, $0 \le i \le m-1$, is the $i$th coordinate of $A$. In short, this normal basis representation of $A$ is written as $A = (a_0, a_1, \ldots, a_{m-1})$.

The simplest arithmetic operation in a normal basis is squaring, which is carried out by a cyclic right shift and is hence almost free of cost in hardware. This cost advantage often makes a normal basis the preferred choice of representation. Normal basis multiplication, however, is not so simple. A circuit design for the multiplication of two finite field elements represented in a normal basis was first described by Massey and Omura [MO86], and normal basis multipliers are therefore sometimes referred to as Massey-Omura multipliers. Although the original description focuses on a bit-serial multiplier, bit-parallel versions are easily constructed. Bit-parallel normal basis multipliers for $GF(2^m)$ offer high modularity. However, their space complexity is considerably higher than that of other $GF(2^m)$ multipliers, such as the polynomial basis multipliers described in Chap. 7. Sequential normal basis multipliers for $GF(2^m)$ are much more area efficient than their bit-parallel counterparts, but in general take $m$ iterations (or clock cycles) for one multiplication. Several methods have been proposed in the literature to obtain efficient normal basis multipliers ([HWB93], [RH00], [RH02], [RH03a], [RH03b], [AA06], [WTSDOR85]).

On the other hand, Mullin et al. [MOVW88] gave a lower bound on the complexity of normal bases and defined the normal bases that attain this lower bound as optimal normal bases. They defined two special types of normal bases, known as Type-I and Type-II optimal normal bases [MBGMVY93]. Gao and Lenstra [GL92] showed that these two types are the only optimal normal bases in $GF(2^m)$. The use of optimal normal bases can considerably reduce the complexity of the different arithmetic operations ([BRS98], [HWB93], [KKH03], [KS98], [RH00], [RH02], [RH03a], [RH03b], [RH05], [YKPKL05], and [YL04]).

8.1 Some Properties of Normal Bases

Squaring in $GF(2^m)$ is a linear operation. That is, given any two elements $\alpha$ and $\beta$ in $GF(2^m)$, $(\alpha + \beta)^2 = \alpha^2 + \beta^2$. Furthermore, as shown in Chap. 1, for any element $\alpha \in GF(2^m)$, $\alpha^{2^m} = \alpha$. It is also readily seen that $1 = \beta + \beta^2 + \beta^4 + \cdots + \beta^{2^{m-1}}$ for any normal element $\beta$ in $GF(2^m)$: this sum is the trace of $\beta$, which lies in $GF(2)$ and cannot be 0 because the conjugates of $\beta$ are linearly independent. This implies that the normal basis representation of 1 is $(1, 1, 1, \ldots, 1)$.

Let $N = \{\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}\}$ be a normal basis of $GF(2^m)$ over $GF(2)$, where the elements $\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}$ are linearly independent over $GF(2)$. A polynomial in $GF(2)[x]$ is called an N-polynomial if it is irreducible and its roots are linearly independent over $GF(2)$. It can be proven that the elements of a normal basis are exactly the roots of an N-polynomial. Hence, an N-polynomial is just another way of describing a normal basis [MBGMVY93]. An important problem is: given an integer $m$ and the ground field $GF(2)$, construct a normal basis of $GF(2^m)$ over $GF(2)$ or, equivalently, construct an N-polynomial in $GF(2)[x]$ of degree $m$. For practical applications, the normal basis should have a complexity as low as possible. The construction of low complexity normal bases will be treated later on in this chapter.

If $\beta$ is a root of an irreducible polynomial $f(x) = x^m + f_{m-1}x^{m-1} + \cdots + f_1 x + f_0$ of degree $m$ over $GF(2)$, the powers $\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}$ (i.e., the conjugates of $\beta$) are in $GF(2^m)$ and constitute a complete set of roots of $f(x)$ [Wan86]. If these elements $\beta^{2^i}$, $i = 0, \ldots, m-1$, are linearly independent, then they constitute a normal basis of $GF(2^m)$ over $GF(2)$ [MBGMVY93]. In general, the linear independence of the roots of $f(x)$ is difficult to verify. A straightforward way to do this is as follows. Let $\beta$ be a root of $f(x)$. Then $\{1, \beta, \beta^2, \ldots, \beta^{m-1}\}$ forms a polynomial basis of $GF(2^m)$ over $GF(2)$, and $\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}$ are all the roots of $f(x)$ in $GF(2^m)$. Now express each $\beta^{2^i}$, $0 \le i \le m-1$, as an $m$-dimensional vector in the polynomial basis in the form

$$\beta^{2^i} = \sum_{j=0}^{m-1} b_{ij}\beta^{j}, \qquad b_{ij} \in GF(2) \tag{8.2}$$

If the $m \times m$ matrix $B = (b_{ij})$ composed of the above $m$ vectors is nonsingular, then $\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}$ are linearly independent, and hence $f(x)$ is an N-polynomial ([Wan86], [MBGMVY93]). For large values of $m$, this method requires a great number of computations. However, in certain cases there is a simple criterion that can be used to identify N-polynomials, as stated in the following. Let $m = 2^n$ and let $f(x) = x^m + f_{m-1}x^{m-1} + \cdots + f_1 x + f_0$ be an irreducible polynomial over $GF(2)$. Then $f(x)$ is an N-polynomial if and only if the coefficient $f_{m-1} \ne 0$ [MBGMVY93]. Equivalently, for $m = 2^n$, a necessary and sufficient condition for the set $\{\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}\}$ to be a normal basis of $GF(2^m)$ (so that $\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}$ are linearly independent and $\beta$ is a normal element) is that the trace of $\beta$ obeys the following relation [Wan86]:

$$\mathrm{Tr}(\beta) = \beta + \beta^{2} + \beta^{2^2} + \cdots + \beta^{2^{m-1}} = 1 \tag{8.3}$$

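Furthermore, if $\beta \in GF(2^m)$ and $\beta_i = \beta^{2^i}$, $i = 0, 1, \ldots, m-1$, then $\beta$ is a normal element, and therefore generates a normal basis $\{\beta_0, \beta_1, \ldots, \beta_{m-1}\}$ of $GF(2^m)$ over $GF(2)$, if and only if the following matrix

$$\begin{pmatrix}
\mathrm{Tr}(\beta_0\beta_0) & \mathrm{Tr}(\beta_0\beta_1) & \cdots & \mathrm{Tr}(\beta_0\beta_{m-1}) \\
\mathrm{Tr}(\beta_1\beta_0) & \mathrm{Tr}(\beta_1\beta_1) & \cdots & \mathrm{Tr}(\beta_1\beta_{m-1}) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Tr}(\beta_{m-1}\beta_0) & \mathrm{Tr}(\beta_{m-1}\beta_1) & \cdots & \mathrm{Tr}(\beta_{m-1}\beta_{m-1})
\end{pmatrix} \tag{8.4}$$

is nonsingular [MBGMVY93].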

Example 8.1 Let $f(x) = x^8 + x^4 + x^3 + x^2 + 1$ be an irreducible polynomial for $GF(2^8)$. If $\beta$ is a root of $f(x)$, it can be found that the matrix $B = (b_{ij})$ composed of its conjugates $\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}$ represented in the polynomial basis is singular. Hence $\beta$ and its conjugates are not linearly independent and do not constitute a basis. Therefore, $f(x) = x^8 + x^4 + x^3 + x^2 + 1$ is not an N-polynomial and $\beta$ is not a normal element in $GF(2^8)$. In this example $m = 2^3$, so the same conclusion could also be reached from the fact that the trace of $\beta$ is zero, or because the coefficient $f_{m-1} = f_7 = 0$. The elements $1 + \beta$, $\beta^3$, $1 + \beta^3$, $\beta + \beta^3$, and $\beta^{253}$ are not normal elements either. On the other hand, the field element $\gamma = \beta^5$ satisfies the normality requirements and generates the normal basis $\{\gamma, \gamma^2, \ldots, \gamma^{2^{m-1}}\}$. It can be proven that the elements $1 + \beta^5$ and $\beta + \beta^5$ are also normal elements.
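The conjugate-matrix test of Eq. (8.2) can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's Ada code: polynomials over $GF(2)$ are stored as integer bitmasks (bit $k$ is the coefficient of $x^k$), the helper names `gf2_mulmod`, `conjugates`, `gf2_rank`, and `is_n_polynomial` are hypothetical, and $f(x)$ is assumed to be irreducible already.

```python
def gf2_mulmod(a, b, f, m):
    """Multiply two reduced field elements (bitmask polynomials) modulo f, deg f = m."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:      # degree reached m: reduce with f
            a ^= f
    return r

def conjugates(f, m):
    """Polynomial-basis vectors of beta^(2^i), i = 0..m-1, where beta is a root of f."""
    x = 0b10                  # beta itself
    rows = []
    for _ in range(m):
        rows.append(x)
        x = gf2_mulmod(x, x, f, m)
    return rows

def gf2_rank(rows):
    """Rank over GF(2) of a list of bitmask row vectors (Gaussian elimination)."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot          # lowest set bit of the pivot row
            rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def is_n_polynomial(f, m):
    """True if the matrix B of Eq. (8.2) is nonsingular (f assumed irreducible)."""
    return gf2_rank(conjugates(f, m)) == m
```

For instance, `is_n_polynomial(0b100011101, 8)` evaluates the polynomial of Example 8.1, while `is_n_polynomial(0b11001, 4)` evaluates $x^4 + x^3 + 1$.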

8.2 Squaring

Suppose that $N = \{\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}\}$ is a normal basis of $GF(2^m)$ over $GF(2)$. As given above, for any $A$ and $B \in GF(2^m)$, $(A + B)^2 = A^2 + B^2$ is true because $2AB = 0$. From Fermat's theorem, that is, $A^{2^m - 1} = 1$ for $A \ne 0$, it follows that $A^{2^m} = A$ holds. Therefore, if $A = a_0\beta^{2^0} + a_1\beta^{2^1} + \cdots + a_{m-1}\beta^{2^{m-1}}$, then

$$A^2 = a_0\beta^{2^1} + a_1\beta^{2^2} + \cdots + a_{m-1}\beta^{2^m} = a_{m-1}\beta^{2^0} + a_0\beta^{2^1} + \cdots + a_{m-2}\beta^{2^{m-1}} \tag{8.5}$$

Therefore, in normal basis, when $A = (a_0, a_1, \ldots, a_{m-1})$, $A^2 = (a_{m-1}, a_0, \ldots, a_{m-2})$. In other words, squaring is carried out by a simple cyclic right shift, and thus in arithmetic hardware it is almost free of cost. Assuming that the function

    function rshift(x: poly_vector) return poly_vector

performing a 1-bit right shift is available, the normal basis squaring in $GF(2^m)$ of an element $A = (a_0, a_1, \ldots, a_{m-1})$ is easily given in the following algorithm:

Algorithm 8.1: Normal basis squaring

    a0 := a(m-1);
    a := rshift(a);
    a(0) := a0;

An executable Ada file NB_sq.adb, including Algorithm 8.1, is available at www.arithmetic-circuits.org. This algorithm has also been implemented in the function

    function NB_sq(a: poly_vector) return poly_vector

for use in the remaining arithmetic operations in normal basis.
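As a small illustration of Eq. (8.5), the cyclic right shift can be written as follows in Python. This is a sketch rather than the book's Ada routine: an element is assumed to be a Python list of coordinates $(a_0, \ldots, a_{m-1})$, and the names `nb_square` and `nb_power_of_two` are hypothetical.

```python
def nb_square(a):
    """Square A = (a0, ..., a_{m-1}): a cyclic right shift of the coordinates."""
    return a[-1:] + a[:-1]

def nb_power_of_two(a, k):
    """Compute A^(2^k) as k cyclic right shifts (k is reduced modulo m)."""
    k %= len(a)
    return a[-k:] + a[:-k] if k else list(a)
```

Because $A^{2^m} = A$, shifting $m$ times returns the original coordinate vector, which is why `k` can be reduced modulo `m`.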

8.3 Multiplication

In this section, the work originally described by Massey and Omura [MO86] is reviewed. Let $\{\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}\}$ be a normal basis of $GF(2^m)$ over $GF(2)$ and let $A$ and $B$ be any two elements represented in the normal basis as $A = \sum_{i=0}^{m-1} a_i\beta^{2^i}$ and $B = \sum_{j=0}^{m-1} b_j\beta^{2^j}$, respectively. In vector notation, $A = (a_0, a_1, \ldots, a_{m-1})$, $B = (b_0, b_1, \ldots, b_{m-1})$, and the product is $C = AB = (c_0, c_1, \ldots, c_{m-1})$. The last coordinate $c_{m-1}$ of the product is some binary function of the components of $A$ and $B$:

$$c_{m-1} = h(a_0, a_1, \ldots, a_{m-1};\ b_0, b_1, \ldots, b_{m-1}) \tag{8.6}$$

Since squaring means a cyclic shift of an element in a normal basis representation, we have

$$C^2 = A^2B^2 = (a_{m-1}, a_0, \ldots, a_{m-2})(b_{m-1}, b_0, \ldots, b_{m-2}) = (c_{m-1}, c_0, \ldots, c_{m-2}) \tag{8.7}$$

Hence, the last component $c_{m-2}$ of $C^2$ is obtained by the same function $h$ given in Eq. (8.6) operating on the components of $A^2$ and $B^2$, that is, $c_{m-2} = h(a_{m-1}, a_0, \ldots, a_{m-2};\ b_{m-1}, b_0, \ldots, b_{m-2})$. By squaring $C$ repeatedly, we find that

$$\begin{aligned}
c_{m-1} &= h(a_0, a_1, \ldots, a_{m-1};\ b_0, b_1, \ldots, b_{m-1}) \\
c_{m-2} &= h(a_{m-1}, a_0, \ldots, a_{m-2};\ b_{m-1}, b_0, \ldots, b_{m-2}) \\
&\ \ \vdots \\
c_0 &= h(a_1, a_2, \ldots, a_{m-1}, a_0;\ b_1, b_2, \ldots, b_{m-1}, b_0)
\end{aligned} \tag{8.8}$$

Equations (8.8) define the Massey-Omura multiplier ([MO86], [WTSDOR85]). In normal basis representation, this multiplier has the property that the same logic function $h$ that is used to compute the last component $c_{m-1}$ of the product can be used to compute its remaining components $c_{m-2}, c_{m-3}, \ldots, c_1, c_0$.

Example 8.2 Normal basis multiplication for $f(x) = x^4 + x^3 + 1$

Let $f(x) = x^4 + x^3 + 1$ be the generating irreducible polynomial for $GF(2^4)$. This polynomial is an N-polynomial because, if $\alpha$ is a root of $f(x)$, the elements $\alpha$, $\alpha^2$, $\alpha^4$, and $\alpha^8$ are linearly independent, and therefore the set of roots $\{\alpha, \alpha^2, \alpha^4, \alpha^8\}$ constitutes a normal basis of $GF(2^4)$. Alternatively, we can conclude that $f(x)$ is an N-polynomial because $m = 4 = 2^2$ and $f_{m-1} = f_3 \ne 0$ (see Sec. 8.1). Any two elements $A$ and $B$ in $GF(2^4)$ can be expressed as $A = a_0\alpha + a_1\alpha^2 + a_2\alpha^4 + a_3\alpha^8$ and $B = b_0\alpha + b_1\alpha^2 + b_2\alpha^4 + b_3\alpha^8$. Therefore, the product $C = AB$ can be computed as follows:

$$\begin{aligned}
C = AB &= (a_0\alpha + a_1\alpha^2 + a_2\alpha^4 + a_3\alpha^8)(b_0\alpha + b_1\alpha^2 + b_2\alpha^4 + b_3\alpha^8) \\
&= c_0\alpha + c_1\alpha^2 + c_2\alpha^4 + c_3\alpha^8 \\
&= \alpha^{12}(a_2b_3 + a_3b_2) + \alpha^{10}(a_1b_3 + a_3b_1) + \alpha^{9}(a_3b_0 + a_0b_3) + \alpha^{8}(a_2b_2) \\
&\quad + \alpha^{6}(a_2b_1 + a_1b_2) + \alpha^{5}(a_2b_0 + a_0b_2) + \alpha^{4}(a_1b_1) \\
&\quad + \alpha^{3}(a_0b_1 + a_1b_0) + \alpha^{2}(a_0b_0) + \alpha(a_3b_3)
\end{aligned} \tag{8.9}$$

The elements $\alpha^{12}$, $\alpha^{10}$, $\alpha^{9}$, $\alpha^{6}$, $\alpha^{5}$, and $\alpha^{3}$ have been created, and these elements must be represented in the normal basis. Using the fact that $f(\alpha) = \alpha^4 + \alpha^3 + 1 = 0$, we have $\alpha^4 = \alpha^3 + 1$. Therefore, one can find

$$\begin{aligned}
\alpha^{12} &= \alpha^8 + \alpha^4 + \alpha^2 \\
\alpha^{10} &= \alpha^8 + \alpha^2 \\
\alpha^{9} &= \alpha^8 + \alpha^4 + \alpha \\
\alpha^{6} &= \alpha^4 + \alpha^2 + \alpha \\
\alpha^{5} &= \alpha^4 + \alpha \\
\alpha^{3} &= \alpha^8 + \alpha^2 + \alpha
\end{aligned} \tag{8.10}$$

Therefore, we have the following:

$$\begin{aligned}
c_3 &= a_2b_2 + a_3b_2 + a_2b_3 + a_3b_1 + a_1b_3 + a_3b_0 + a_0b_3 + a_1b_0 + a_0b_1 \\
c_2 &= a_1b_1 + a_2b_1 + a_1b_2 + a_2b_0 + a_0b_2 + a_2b_3 + a_3b_2 + a_0b_3 + a_3b_0 \\
c_1 &= a_0b_0 + a_1b_0 + a_0b_1 + a_1b_3 + a_3b_1 + a_1b_2 + a_2b_1 + a_3b_2 + a_2b_3 \\
c_0 &= a_3b_3 + a_0b_3 + a_3b_0 + a_0b_2 + a_2b_0 + a_0b_1 + a_1b_0 + a_2b_1 + a_1b_2
\end{aligned} \tag{8.11}$$

Comparing Eq. (8.11) to Eqs. (8.6) and (8.8), the function $h$ is given by

$$c_3 = h(a_0, a_1, a_2, a_3;\ b_0, b_1, b_2, b_3) = a_2b_2 + a_3b_2 + a_2b_3 + a_3b_1 + a_1b_3 + a_3b_0 + a_0b_3 + a_1b_0 + a_0b_1 \tag{8.12}$$
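The Massey-Omura principle of Eq. (8.8), specialized to the $h$ function of Eq. (8.12), can be checked with a short Python sketch. The code is illustrative only and is not the book's Ada program: elements are lists of four coordinates, the names `h`, `nb_square`, `mo_multiply`, `to_poly`, and `poly_mulmod` are hypothetical, and the table `NB` holds the polynomial-basis bitmasks of $\alpha, \alpha^2, \alpha^4, \alpha^8$ (bit $k$ is the coefficient of $\alpha^k$), which are only used to cross-check the result.

```python
F, M = 0b11001, 4                         # f(x) = x^4 + x^3 + 1
NB = [0b0010, 0b0100, 0b1001, 0b1110]     # alpha, alpha^2, alpha^4, alpha^8

def h(a, b):
    """The h function of Eq. (8.12); it yields the last coordinate c3."""
    return ((a[2] & b[2]) ^ (a[3] & b[2]) ^ (a[2] & b[3]) ^ (a[3] & b[1]) ^
            (a[1] & b[3]) ^ (a[3] & b[0]) ^ (a[0] & b[3]) ^ (a[1] & b[0]) ^
            (a[0] & b[1]))

def nb_square(a):
    """Normal basis squaring: cyclic right shift."""
    return list(a[-1:]) + list(a[:-1])

def mo_multiply(a, b):
    """Compute c3, c2, c1, c0 in turn with h, squaring both operands each step."""
    c = [0] * M
    for i in reversed(range(M)):
        c[i] = h(a, b)
        a, b = nb_square(a), nb_square(b)
    return c

def to_poly(a):
    """Convert normal-basis coordinates to a polynomial-basis bitmask."""
    acc = 0
    for coord, vec in zip(a, NB):
        if coord:
            acc ^= vec
    return acc

def poly_mulmod(x, y):
    """Reference multiplication of two bitmask polynomials modulo f(x)."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if (x >> M) & 1:
            x ^= F
    return r
```

Comparing `to_poly(mo_multiply(a, b))` with `poly_mulmod(to_poly(a), to_poly(b))` over all pairs of inputs is one way to confirm Eq. (8.11) independently of the derivation.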

A sequential architecture for a $GF(2^m)$ Massey-Omura multiplier is shown in Fig. 8.1. In this scheme, two shift registers (implementing the cyclic shifts of the operands) and a combinational block implementing the $h$ function are needed. The product is computed after $m$ clock cycles. For the particular case given in Example 8.2, that is, a Massey-Omura normal basis multiplier with generating irreducible polynomial $f(x) = x^4 + x^3 + 1$, the following algorithm implements the product given in Eq. (8.11):

Algorithm 8.2: Massey-Omura normal basis multiplication in $GF(2^4)$

    for i in reverse 0 .. 3 loop
       c(i) := h_function_GF2_4(a,b);
       a := NB_sq(a);
       b := NB_sq(b);
    end loop;

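FIGURE 8.1 Sequential architecture of Massey-Omura multiplier. [Figure: two cyclic shift registers holding $a_0, a_1, \ldots, a_{m-1}$ and $b_0, b_1, \ldots, b_{m-1}$ feed the h-function block, which outputs $c_0, c_1, \ldots, c_{m-1}$.]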

In Algorithm 8.2, h_function_GF2_4 implements the $h$ function given in Eq. (8.12) and NB_sq implements normal basis squaring. Note that Algorithm 8.2 can be implemented with the sequential architecture given in Fig. 8.1. An executable Ada file NB_seqmult_GF2_4.adb, including Algorithm 8.2, is available at www.arithmetic-circuits.org.

Alternatively, a parallel architecture for a $GF(2^m)$ Massey-Omura multiplier can be easily implemented [WTSDOR85]. In this scheme, $m$ identical logic functions $h$ that simultaneously calculate all components of the product are needed. The inputs of the $m$ logic functions $h$ are connected directly to the components of $A$ and $B$, as given in Eq. (8.8). The only difference between the connections of the components of $A$ and $B$ to the different functions $h$ is that they are cyclically shifted versions of one another.

An efficient multiplication scheme with normal basis in $GF(2^m)$ was introduced in [RH00] and [RH03a]. In this approach, let $A$ and $B$ be any two elements of $GF(2^m)$ represented in the normal basis $N = \{\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}\}$ as $A = \sum_{i=0}^{m-1} a_i\beta^{2^i}$ and $B = \sum_{j=0}^{m-1} b_j\beta^{2^j}$, respectively. In vector notation, let $A = (a_0, a_1, \ldots, a_{m-1})$, $B = (b_0, b_1, \ldots, b_{m-1})$, and $\boldsymbol{\beta} = (\beta, \beta^2, \ldots, \beta^{2^{m-1}})$. Then the product $C = AB = (c_0, c_1, \ldots, c_{m-1})$ can be computed as

$$C = AB = (A\boldsymbol{\beta}^T)(\boldsymbol{\beta}B^T) = AMB^T \tag{8.13}$$

where $T$ denotes vector transposition and where the multiplication matrix is defined by

$$M = \boldsymbol{\beta}^T\boldsymbol{\beta} = \left(\beta^{2^i + 2^j}\right)_{i,j=0}^{m-1} =
\begin{pmatrix}
\beta^{2^0 + 2^0} & \beta^{2^0 + 2^1} & \cdots & \beta^{2^0 + 2^{m-1}} \\
\beta^{2^1 + 2^0} & \beta^{2^1 + 2^1} & \cdots & \beta^{2^1 + 2^{m-1}} \\
\vdots & \vdots & \ddots & \vdots \\
\beta^{2^{m-1} + 2^0} & \beta^{2^{m-1} + 2^1} & \cdots & \beta^{2^{m-1} + 2^{m-1}}
\end{pmatrix} \tag{8.14}$$

All entries of $M$ belong to $GF(2^m)$, and if they are written with respect to the normal basis $\{\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}\}$, then we see that [RH00]

$$M = M_0\beta + M_1\beta^2 + \cdots + M_{m-1}\beta^{2^{m-1}} \tag{8.15}$$

where the $M_i$'s are $m \times m$ matrices with entries in $GF(2)$. Substituting Eq. (8.15) into Eq. (8.13), the coordinates of the product $C$ can be found as follows:

$$c_i = AM_iB^T = A^{(i)}M_0B^{(i)T}, \qquad 0 \le i \le m-1 \tag{8.16}$$

where $A^{(i)} = (a_i, a_{i+1}, \ldots, a_{i-1})$ is the $i$-fold left cyclic shift of $A$, and the same holds for $B^{(i)}$ ([HWB93], [RH00]). It is easy to prove that the numbers of nonzero entries in all $M_i$'s are equal [MBGMVY93]. Following Mullin et al. [MOVW88], the number of nonzero entries in $M_i$ is known as the complexity of the normal basis $N$, which is defined by

$$C_N = H(M_i), \qquad 0 \le i \le m-1 \tag{8.17}$$

where $H(M_i)$ refers to the Hamming weight, that is, the number of 1s in $M_i$.

It can be proven ([RH00], [RH02]) that the multiplication matrix $M$ given in Eq. (8.14) is symmetric and that its diagonal entries are the elements of the normal basis $\{\beta^{2^0}, \beta^{2^1}, \ldots, \beta^{2^{m-1}}\}$. Denoting $M$ as $M = (\mu_{i,j})_{i,j=0}^{m-1}$, where $\mu_{i,j} = \mu_{j,i} = \beta^{2^i + 2^j}$, it is easy to see that $\mu_{i,j} = \mu_{i-1,j-1}^2$, $0 < i, j \le m-1$. In this way, given the $m$ entries of the 0th row of the $M$ matrix, the remaining entries (except the leftmost ones) can be generated by using some squaring operations (free of cost in normal basis representations).

In Eq. (8.14), the exponents of $\beta$ can be represented in binary form using $m$ bits, where each exponent has only two 1s and zeros elsewhere. Formally, if the set of exponents of $\beta$ in the $M$ matrix is represented by $R = \{2^i + 2^j \mid 0 \le i, j \le m-1,\ i \ne j\}$, then the elements of $R$ belong to the ring of integers modulo $2^m - 1$. For these elements of $R$, it can also be proven [RH00] that the cyclic distance between the two 1s is in the range $[1, v]$, where $v = \lfloor m/2 \rfloor$. If we let

$$\delta_j = \beta^{1 + 2^j}, \qquad j = 0, 1, \ldots, v \tag{8.18}$$

then the entries of $M$ can be obtained from the $\delta_j$'s as follows [RH03a]:

$$\mu_{i,j} = \mu_{j,i} = \begin{cases} \delta_{j-i}^{2^i}, & 0 < j - i \le v \\ \delta_{m+i-j}^{2^j}, & v < j - i \le m-1 \end{cases} \tag{8.19}$$

Noting that

$$\delta_{m-1-v} = \begin{cases} \delta_v, & \text{for } m \text{ odd} \\ \delta_{v-1}, & \text{for } m \text{ even} \end{cases} \tag{8.20}$$

and

$$\delta_v = \delta_v^{2^v} \qquad \text{for } m \text{ even} \tag{8.21}$$

then the multiplication matrix $M$ can be written as [RH03a]:

$$M = \begin{pmatrix}
\delta_0 & \delta_1 & \cdots & \delta_v & \delta_{m-1-v}^{2^{v+1}} & \cdots & \delta_1^{2^{m-1}} \\
\delta_1 & \delta_0^{2} & \cdots & \delta_{v-1}^{2} & \delta_v^{2} & \cdots & \delta_2^{2^{m-1}} \\
\vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\
\delta_v & \delta_{v-1}^{2} & \cdots & \delta_0^{2^v} & \delta_1^{2^v} & \cdots & \delta_{m-1-v}^{2^v} \\
\delta_{m-1-v}^{2^{v+1}} & \delta_v^{2} & \cdots & \delta_1^{2^v} & \delta_0^{2^{v+1}} & \cdots & \delta_{m-2-v}^{2^{v+1}} \\
\vdots & \vdots & & \vdots & \vdots & \ddots & \vdots \\
\delta_1^{2^{m-1}} & \delta_2^{2^{m-1}} & \cdots & \delta_{m-1-v}^{2^v} & \delta_{m-2-v}^{2^{v+1}} & \cdots & \delta_0^{2^{m-1}}
\end{pmatrix} \tag{8.22}$$

Therefore, $M$ can be written as

$$M = M^{(0)} + M^{(1)} + \cdots + M^{(m-1)} \tag{8.23}$$

such that the nonzero entries of $M^{(l)}$, $0 \le l \le m-1$, belong to $\{\delta_0^{2^l}, \delta_1^{2^l}, \ldots, \delta_v^{2^l}\}$. Using the above equations, the following lemma was given in [RH03a].

Lemma 8.1 Let $A$ and $B$ be two elements of $GF(2^m)$ and $C$ be their product. Then the product $C$ is given as

$$C = \begin{cases}
\displaystyle\sum_{i=0}^{m-1} a_ib_i\,\delta_0^{2^{i-1}} + \sum_{i=0}^{m-1}\sum_{j=1}^{v} y_{i,j}\,\delta_j^{2^i}, & \text{for } m \text{ odd} \\[2ex]
\displaystyle\sum_{i=0}^{m-1} a_ib_i\,\delta_0^{2^{i-1}} + \sum_{i=0}^{m-1}\sum_{j=1}^{v-1} y_{i,j}\,\delta_j^{2^i} + \sum_{i=0}^{v-1} y_{i,v}\,\delta_v^{2^i}, & \text{for } m \text{ even}
\end{cases} \tag{8.24}$$

with

$$y_{i,j} = (a_i + a_{((i+j))})(b_i + b_{((i+j))}), \qquad 1 \le j \le v, \quad 0 \le i \le m-1 \tag{8.25}$$

where $((k))$ means “$k$ reduced modulo $m$.”

Let $h_j$, $1 \le j \le v$, be the number of nonzero coordinates of the normal basis representation of $\delta_j$, that is, $h_j = H(\delta_j)$, and let $w_{j,1}, w_{j,2}, \ldots, w_{j,h_j}$ denote the positions of such nonzero coordinates in the normal basis representation of $\delta_j$, that is,

$$\delta_j = \sum_{k=1}^{h_j} \beta^{2^{w_{j,k}}}, \qquad 1 \le j \le v \tag{8.26}$$

where $0 \le w_{j,1} < w_{j,2} < \cdots < w_{j,h_j} \le m-1$. Furthermore, for even values of $m$, we have $v = m/2$ and $\delta_v = \delta_v^{2^{m/2}}$. Therefore, in the normal basis representation of $\delta_v$, its $i$th coordinate is equal to its $((m/2 + i))$th coordinate. Thus, $h_v$ is even and one can obtain that

$$\delta_v = \sum_{k=1}^{h_v/2} \left(\beta^{2^{w_{v,k}}} + \beta^{2^{w_{v,k}+v}}\right), \qquad v = m/2 \tag{8.27}$$

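Example 8.3 In order to illustrate the above terms, we use again the field $GF(2^4)$ generated by the irreducible polynomial $f(x) = x^4 + x^3 + 1$, as described in Example 8.2. If $\beta$ is a root of $f(x)$, then the set of roots $\{\beta, \beta^2, \beta^4, \beta^8\}$ constitutes a normal basis of $GF(2^4)$. In this case, $m = 4$ and hence $v = \lfloor m/2 \rfloor = 2$. Using Eq. (8.18), the terms are $\delta_0 = \beta^2$, $\delta_1 = \beta^{1+2} = \beta^3$, and $\delta_2 = \beta^{1+2^2} = \beta^5$. The $M$ matrix given in Eq. (8.22) can be written as

$$M = \begin{pmatrix}
\delta_0 & \delta_1 & \delta_2 & \delta_1^{2^3} \\
\delta_1 & \delta_0^{2} & \delta_1^{2} & \delta_2^{2} \\
\delta_2 & \delta_1^{2} & \delta_0^{2^2} & \delta_1^{2^2} \\
\delta_1^{2^3} & \delta_2^{2} & \delta_1^{2^2} & \delta_0^{2^3}
\end{pmatrix} =
\begin{pmatrix}
\beta^{2} & \beta^{3} & \beta^{5} & \beta^{9} \\
\beta^{3} & \beta^{4} & \beta^{6} & \beta^{10} \\
\beta^{5} & \beta^{6} & \beta^{8} & \beta^{12} \\
\beta^{9} & \beta^{10} & \beta^{12} & \beta^{16}
\end{pmatrix} \tag{8.28}$$

Using Eq. (8.23), $M$ can also be decomposed as follows:

$$\begin{aligned}
M &= M^{(0)} + M^{(1)} + M^{(2)} + M^{(3)} \\
&= \begin{pmatrix} \delta_0 & \delta_1 & \delta_2 & 0 \\ \delta_1 & 0 & 0 & 0 \\ \delta_2 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}
+ \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & \delta_0^{2} & \delta_1^{2} & \delta_2^{2} \\ 0 & \delta_1^{2} & 0 & 0 \\ 0 & \delta_2^{2} & 0 & 0 \end{pmatrix}
+ \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & \delta_0^{2^2} & \delta_1^{2^2} \\ 0 & 0 & \delta_1^{2^2} & 0 \end{pmatrix}
+ \begin{pmatrix} 0 & 0 & 0 & \delta_1^{2^3} \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ \delta_1^{2^3} & 0 & 0 & \delta_0^{2^3} \end{pmatrix}
\end{aligned} \tag{8.29}$$

Furthermore, using Eqs. (8.26) and (8.27), the terms $h_j$ and $w_{j,k}$ can be determined as follows. If $\beta$ is a root of $f(x) = x^4 + x^3 + 1$, then $f(\beta) = \beta^4 + \beta^3 + 1 = 0$, and therefore $\beta^3 = \beta^4 + 1 = \beta + \beta^2 + \beta^8$, because $1 = \beta + \beta^2 + \beta^4 + \beta^8$ in normal basis. In the same way, $\beta^5 = \beta\beta^4 = \beta(\beta^3 + 1) = \beta^4 + \beta$. It can be observed that these expressions were given in Eq. (8.10). Using the above expressions for $\beta^3$ and $\beta^5$, we have $h_1 = 3$ and $h_2 = 2$, respectively. Finally, from Eqs. (8.26) and (8.27), the terms $w_{j,k}$ can also be computed as follows:

$$\begin{aligned}
\delta_1 &= \beta^3 = \beta + \beta^2 + \beta^8 = \beta^{2^{w_{1,1}}} + \beta^{2^{w_{1,2}}} + \beta^{2^{w_{1,3}}} &&\Rightarrow\ w_{1,1} = 0,\ w_{1,2} = 1,\ w_{1,3} = 3 \\
\delta_2 &= \beta^5 = \beta + \beta^4 = \beta^{2^{w_{2,1}}} + \beta^{2^{w_{2,2}}} &&\Rightarrow\ w_{2,1} = 0,\ w_{2,2} = 2
\end{aligned} \tag{8.30}$$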
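The values $h_j$ and $w_{j,k}$ of Example 8.3 can be reproduced with a brute-force Python sketch. It is only illustrative: field elements are polynomial-basis bitmasks, the coordinate search enumerates all $2^m$ vectors (so it is suitable for small $m$ only), and the names `gf2_mulmod`, `gf2_pow`, `normal_basis`, `nb_coordinates`, `delta_positions`, and `complexity` are hypothetical. The polynomial is assumed to be an N-polynomial with root $\beta = x$.

```python
from itertools import product

def gf2_mulmod(a, b, f, m):
    """Multiply two reduced bitmask polynomials modulo f, deg f = m."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r

def gf2_pow(a, e, f, m):
    """Square-and-multiply exponentiation in the polynomial basis."""
    r = 1
    while e:
        if e & 1:
            r = gf2_mulmod(r, a, f, m)
        a = gf2_mulmod(a, a, f, m)
        e >>= 1
    return r

def normal_basis(f, m):
    """Bitmasks of beta, beta^2, ..., beta^(2^(m-1)) for the root beta = x."""
    x, basis = 0b10, []
    for _ in range(m):
        basis.append(x)
        x = gf2_mulmod(x, x, f, m)
    return basis

def nb_coordinates(value, basis):
    """Normal-basis coordinates of a field element, found by exhaustive search."""
    for coords in product((0, 1), repeat=len(basis)):
        acc = 0
        for c, vec in zip(coords, basis):
            if c:
                acc ^= vec
        if acc == value:
            return list(coords)
    raise ValueError("the given set does not span the field")

def delta_positions(f, m):
    """Return {j: [w_{j,1}, ..., w_{j,h_j}]} for delta_j = beta^(1+2^j), 1 <= j <= v."""
    basis = normal_basis(f, m)
    w = {}
    for j in range(1, m // 2 + 1):
        coords = nb_coordinates(gf2_pow(0b10, 1 + 2 ** j, f, m), basis)
        w[j] = [k for k, c in enumerate(coords) if c]
    return w

def complexity(w, m):
    """C_N = 2(sum_{j<v} h_j + eps*h_v) + 1, with eps = 1 (m odd) or 0.5 (m even)."""
    v = m // 2
    total = 2 * sum(len(w[j]) for j in range(1, v))
    total += 2 * len(w[v]) if m % 2 else len(w[v])
    return total + 1
```

For $f(x) = x^4 + x^3 + 1$, `delta_positions(0b11001, 4)` gives the positions listed in Eq. (8.30), and `complexity` applied to that result evaluates the expression for $C_N$ quoted later in this section.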

Substituting Eqs. (8.26) and (8.27) into Eq. (8.24) and using $\delta_0^{2^{i-1}} = \beta^{2^i}$, the following theorem was given in [RH03a].

Theorem 8.1 Let $A$ and $B$ be two elements of $GF(2^m)$ and $C$ be their product. Then the product $C$ is given as

$$C = \begin{cases}
\displaystyle\sum_{i=0}^{m-1} a_ib_i\beta^{2^i} + \sum_{j=1}^{v}\sum_{k=1}^{h_j}\left(\sum_{i=0}^{m-1} y_{((i-w_{j,k})),j}\,\beta^{2^i}\right), & \text{for } m \text{ odd} \\[2ex]
\displaystyle\sum_{i=0}^{m-1} a_ib_i\beta^{2^i} + \sum_{j=1}^{v-1}\sum_{k=1}^{h_j}\left(\sum_{i=0}^{m-1} y_{((i-w_{j,k})),j}\,\beta^{2^i}\right) + F, & \text{for } m \text{ even}
\end{cases} \tag{8.31}$$

where

$$F = \sum_{k=1}^{h_v/2}\sum_{i=0}^{v-1} y_{((i-w_{v,k})),v}\left(\beta^{2^i} + \beta^{2^{i+v}}\right) \quad \text{and} \quad v = m/2 \tag{8.32}$$

It is important to note that for a given normal basis, the representation of $\delta_j$ is fixed and so is $w_{j,k}$, with $1 \le j \le v$, $1 \le k \le h_j$. These terms can be computed for a given irreducible generating polynomial as described in Example 8.3. Using Eqs. (8.31) and (8.32), the following algorithm for normal basis multiplication was given in [RH03a]:

Algorithm 8.3: Algorithm for normal basis multiplication in $GF(2^m)$

    Input:  A, B ∈ GF(2^m), w_{j,k}, 1 ≤ j ≤ v, 1 ≤ k ≤ h_j
    Output: C = AB

    Generate y_{i,j} = (a_i + a_{((i+j))})(b_i + b_{((i+j))}), 1 ≤ j ≤ v, 0 ≤ i ≤ m−1
    c_i := a_i b_i, 0 ≤ i ≤ m−1, C := (c_0, c_1, ..., c_{m−1})
    for j = 1 to v − 1 do
        T := (t_0, t_1, ..., t_{m−1}) = 0
        for k = 1 to h_j do
            r_i := y_{((i−w_{j,k})),j}, 0 ≤ i ≤ m−1, R := (r_0, r_1, ..., r_{m−1})
            T := T + R
        end for
        C := C + T
    end for
    T := 0
    if m is odd then
        s := h_v, t := m
    else
        s := h_v/2, t := m/2
    end if
    Generate y_{i,v} = (a_i + a_{((v+i))})(b_i + b_{((v+i))}), 0 ≤ i ≤ t−1
    if m is even then
        y_{i+v,v} := y_{i,v}, 0 ≤ i ≤ m/2 − 1
    end if
    for k = 1 to s do
        r_i := y_{((i−w_{v,k})),v}, 0 ≤ i ≤ t−1
        if m is even then
            r_{i+m/2} := r_i, 0 ≤ i ≤ m/2 − 1
            R := (r_0, r_1, ..., r_{m/2−1}, r_0, r_1, ..., r_{m/2−1})
        end if
        T := T + R
    end for
    C := C + T

Assume that h_array is defined as an array of integers from 1 to m/2 holding the values $h_j$, with $1 \le j \le v$, representing the number of nonzero coordinates of the normal basis representation of $\delta_j$. Assume that w_array is an array of integers (1 .. m/2, 1 .. m−1) holding the values $w_{j,k}$, with $1 \le j \le v$, $1 \le k \le h_j$, where $w_{j,1}, w_{j,2}, \ldots, w_{j,h_j}$ denote the positions of the nonzero coordinates in the normal basis representation of $\delta_j$. Then Algorithm 8.3 can be implemented as follows:

Algorithm 8.4: Normal basis multiplication in $GF(2^m)$

    v := m/2;
    for i in 0 .. m-1 loop
       for j in 1 .. v loop
          yij(i,j) := m2and(m2xor(a(i),a((i+j) mod m)),
                            m2xor(b(i),b((i+j) mod m)));
       end loop;
    end loop;
    for i in 0 .. m-1 loop c(i) := m2and(a(i),b(i)); end loop;
    for j in 1 .. v-1 loop
       for i in 0 .. m-1 loop t(i) := 0; end loop;
       for k in 1 .. h(j) loop
          for i in 0 .. m-1 loop
             r(i) := yij((i-w(j,k)) mod m,j);
          end loop;
          t := m2xvv(t,r);
       end loop;
       c := m2xvv(c,t);
    end loop;
    for i in 0 .. m-1 loop t(i) := 0; end loop;
    if (m rem 2) /= 0 then s := h(v); te := m;
    else s := h(v)/2; te := m/2;
    end if;
    for i in 0 .. te-1 loop
       yij(i,v) := m2and(m2xor(a(i),a((v+i) mod m)),
                         m2xor(b(i),b((v+i) mod m)));
    end loop;
    if (m rem 2) = 0 then
       for i in 0 .. (m/2)-1 loop yij(i+v,v) := yij(i,v); end loop;
    end if;
    for k in 1 .. s loop
       for i in 0 .. te-1 loop
          r(i) := yij((i-w(v,k)) mod m,v);
       end loop;
       if (m rem 2) = 0 then
          for i in 0 .. (m/2)-1 loop r(i+m/2) := r(i); end loop;
       end if;
       t := m2xvv(t,r);
    end loop;
    c := m2xvv(c,t);

In Algorithm 8.4, yij_array is defined as an array of bits (0 .. m−1, 1 .. m/2) used to hold the terms $y_{i,j}$ defined in Eq. (8.25). An executable Ada file NB_multiplier.adb, including Algorithm 8.4, is available at www.arithmetic-circuits.org.

For a bit-parallel implementation of Algorithm 8.4, the $y_{i,j}$s and $a_ib_i$s must be generated using $m(m+1)/2$ two-input AND gates and $m(m-1)$ two-input XOR gates, with corresponding delay $T_{AND} + T_{XOR}$. For the addition of the $r_i$s and $a_ib_i$s in order to obtain the $c_i$s, a total of $\sum_{j=1}^{v-1} h_j + \varepsilon h_v = (C_N - 1)/2$ XOR gates will be needed for each coordinate, where the complexity is $C_N = 2\left(\sum_{j=1}^{v-1} h_j + \varepsilon h_v\right) + 1$ and $\varepsilon$ is equal to 1 for $m$ odd and 0.5 for $m$ even [RH03a]. If these gates are arranged as a binary tree, then the corresponding time complexity will be $\lceil \log_2(C_N + 1) - 1 \rceil\, T_{XOR}$. Since $C_N$ is an odd integer, $\lceil \log_2(C_N + 1) \rceil = \lceil \log_2 C_N \rceil$, and thus the overall time complexity [RH03a] of the bit-parallel implementation is $T_{NB\_multiplier} = T_{AND} + \lceil \log_2 C_N \rceil\, T_{XOR}$.

A simple VHDL model for the normal basis multiplication algorithm is given in the file NB_multiplier.vhd, available at www.arithmetic-circuits.org. This model includes two processes for the computation of the product, where the $y_{i,j}$s must be generated. The datapath for this generation is shown in Fig. 8.2. The entity declaration of the combinational implementation of the normal basis multiplier given in the file NB_multiplier.vhd is

    entity NB_multiplier is
    port (
      a, b: in std_logic_vector(M-1 downto 0);
      c: out std_logic_vector(M-1 downto 0)
    );
    end NB_multiplier;

The corresponding VHDL architecture follows:

    P1: process(a,b)
      variable yij: yij_array;
      variable caux,r,t1: std_logic_vector(M-1 downto 0);
      variable s,te,aux1: integer;
    begin
      for i in 0 to m-1 loop
        for j in 1 to v loop
          yij(i)(j) := (a(i) xor a((i+j) mod m)) and (b(i) xor b((i+j) mod m));
        end loop;
      end loop;
      for i in 0 to m-1 loop caux(i) := a(i) and b(i); end loop;
      for j in 1 to v-1 loop
        for i in 0 to m-1 loop t1(i) := '0'; end loop;
        for k in 1 to h(j) loop
          for i in 0 to m-1 loop
            aux1 := (i - w(j)(k)) mod m;
            r(i) := yij(aux1)(j);
          end loop;
          for i in 0 to m-1 loop t1(i) := t1(i) xor r(i); end loop;
        end loop;
        for i in 0 to m-1 loop caux(i) := caux(i) xor t1(i); end loop;
      end loop;
      c2 <= caux;
    end process;

    P2: process(a,b)
      variable yij: yij_array;
      variable r,t2: std_logic_vector(M-1 downto 0);
      variable s,te,aux2: integer;
    begin
      for i in 0 to m-1 loop t2(i) := '0'; end loop;
      if (m rem 2) /= 0 then s := h(v); te := m;
      else s := h(v)/2; te := m/2;
      end if;
      for i in 0 to te-1 loop
        yij(i)(v) := (a(i) xor a((v+i) mod m)) and (b(i) xor b((v+i) mod m));
      end loop;
      if (m rem 2) = 0 then
        for i in 0 to (m/2)-1 loop yij(i+v)(v) := yij(i)(v); end loop;
      end if;
      for k in 1 to s loop
        for i in 0 to te-1 loop
          aux2 := (i - w(v)(k)) mod m;
          r(i) := yij(aux2)(v);
        end loop;
        if (m rem 2) = 0 then
          for i in 0 to (m/2)-1 loop r(i+m/2) := r(i); end loop;
        end if;
        for i in 0 to m-1 loop t2(i) := t2(i) xor r(i); end loop;
      end loop;
      t <= t2;
    end process;

    c <= c2 xor t;

FIGURE 8.2 Generation of $y_{i,j}$s in the normal basis multiplier.
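For a quick software cross-check of Theorem 8.1, the multiplication of Algorithm 8.3 can be condensed into the following Python sketch. It is not a translation of the Ada or VHDL code: the $y_{i,j}$ terms are recomputed on demand instead of being stored, the name `nb_multiply` is hypothetical, and the argument `w` is assumed to be a dictionary mapping each $j$ to the sorted list $w_{j,1}, \ldots, w_{j,h_j}$.

```python
def nb_multiply(a, b, w):
    """Normal basis product C = AB following Eqs. (8.31) and (8.32).

    a, b: lists of m coordinates in GF(2); w: {j: sorted positions w_{j,k}}.
    """
    m = len(a)
    v = m // 2

    def y(i, j):
        # y_{i,j} of Eq. (8.25); indices are reduced modulo m
        i %= m
        return (a[i] ^ a[(i + j) % m]) & (b[i] ^ b[(i + j) % m])

    c = [ai & bi for ai, bi in zip(a, b)]          # the a_i b_i terms
    top = v if m % 2 else v - 1                    # j = v is special only for even m
    for j in range(1, top + 1):
        for wjk in w[j]:
            for i in range(m):
                c[i] ^= y(i - wjk, j)
    if m % 2 == 0:
        # Term F of Eq. (8.32): only the first h_v/2 positions are used,
        # and y_{i,v} is periodic in i with period v = m/2.
        for wvk in w[v][: len(w[v]) // 2]:
            for i in range(m):
                c[i] ^= y(i - wvk, v)
    return c
```

With the values of Example 8.3, `w = {1: [0, 1, 3], 2: [0, 2]}`, the call `nb_multiply([1, 0, 0, 0], [0, 1, 0, 0], w)` computes $\beta \cdot \beta^2 = \beta^3$ in $GF(2^4)$.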

8.4 Exponentiation

For an arbitrary $a$ in the finite field $GF(2^m)$ and an integer $e$ ($1 \le e \le 2^m - 1$), let $b = a^e$, where $b$ is in $GF(2^m)$. In general, an arbitrary integer power of an element $a \in GF(2^m)$ can be computed using the binary method [Knu81], also known as the square-and-multiply method, which breaks the exponentiation operation into a series of squaring and multiplication operations in $GF(2^m)$. The binary method [Knu81] has already been dealt with in Chaps. 5 and 7. In this method, repeated squaring of the partial results is used to reduce the required number of multiplications. Each integer exponent $e$ can be written in its binary representation as an $m$-bit vector, $e = e_0 + e_1 2 + e_2 2^2 + \cdots + e_{m-1}2^{m-1} = (e_0, e_1, \ldots, e_{m-1})$. According to this method, we can obtain:

$$b = a^e = a^{\sum_{i=0}^{m-1} e_i 2^i} = a^{e_0}(a^{2})^{e_1}(a^{2^2})^{e_2} \cdots (a^{2^{m-1}})^{e_{m-1}} = \prod_{i=0}^{m-1} B_i \tag{8.33}$$

where

$$B_i = (a^{2^i})^{e_i} = \begin{cases} a^{2^i}, & \text{if } e_i = 1 \\ 1, & \text{if } e_i = 0 \end{cases} \tag{8.34}$$

Thus, exponentiation can be accomplished by $m$ successive multiplications. However, in normal basis, $a^{2^i}$ can be obtained by a cyclic shift of the normal basis representation of $a^{2^{i-1}}$. Hence, from Eq. (8.34), $B_i$ is either the $i$th cyclically shifted version of $a$ or 1. Therefore, the binary or square-and-multiply method given in Eqs. (8.33) and (8.34) can be implemented in the following algorithm:

Algorithm 8.5—Binary or square-and-multiply exponentiation in normal basis

c := a;
for i in 0 .. m-1 loop b(i) := 1; end loop;
for i in 0 .. m-1 loop
  if e(i) = 1 then
    b := NB_multiplier(b,c,h,w);
  end if;
  c := NB_sq(c);
end loop;

where NB_multiplier performs the normal basis multiplication presented in Algorithm 8.4 for a given field $GF(2^m)$ with values $h_j$ and $w_{j,k}$, and where NB_sq implements the normal basis squaring. It must be noted that in normal basis representation, 1 = (1, 1, . . . , 1). An executable Ada file NB_exp.adb, including Algorithm 8.5, is available at www.arithmetic-circuits.org.

An example of a datapath corresponding to Algorithm 8.5 is shown in Fig. 8.3. The squaring operation is a simple signal rewiring; therefore the minimum clock period is equal to $T_{NB\_multiplier}$. The total computation time (worst case) is about $T \approx m\,T_{NB\_multiplier}$. A VHDL model for the normal basis binary exponentiation algorithm is given in the file NB_binary_exponentiation.vhd, available at www.arithmetic-circuits.org.

FIGURE 8.3 Normal basis binary exponentiation datapath. (Block diagram: an m-bit register cc, loaded with A when inic = 1 and with sq_c otherwise; an m-bit shift register ee, loaded with E and shifted right; an NB_squarer that produces sq_c by rewiring cc; an NB_multiplier that produces mult_bc from bb and cc; and an m-bit register bb, initialized to 1 and controlled by ce_c and ee(0), that drives the output B.)

The entity declaration of the sequential implementation of the normal basis binary exponentiation algorithm given in the VHDL file NB_binary_exponentiation.vhd is

entity NB_binary_exp is
port (
  a, e: in std_logic_vector(M-1 downto 0);
  clk, reset, start: in std_logic;
  done: out std_logic;
  b: out std_logic_vector(M-1 downto 0)
);
end NB_binary_exp;

The VHDL architecture corresponding to the circuit of Fig. 8.3 follows:

multiplier: NB_multiplier port map (a => bb, b => cc, c => mult_bc);

-- reset is tested asynchronously, so it is listed in the sensitivity lists
register_C: process(clk, reset)
begin
  if reset = '1' then cc <= (others => '0');
  elsif clk'event and clk = '1' then
    if inic = '1' then cc <= a;
    else cc <= sq_c;
    end if;
  end if;
end process register_C;

sh_register_E: process(clk, reset)
begin
  if reset = '1' then ee <= (others => '0');
  elsif clk'event and clk = '1' then
    if inic = '1' then ee <= e; end if;
    if shift_r = '1' then ee <= '0' & ee(M-1 downto 1); end if;
  end if;
end process sh_register_E;

register_B: process(inic, reset, clk)
begin
  if inic = '1' or reset = '1' then bb <= (others => '1');
  elsif clk'event and clk = '1' then
    if ce_c = '1' and ee(0) = '1' then bb <= mult_bc;
    else bb <= bb;
    end if;
  end if;
end process register_B;

-- Squaring operation:
sq_c(0) <= cc(M-1);
sq_c_calc: for i in 1 to M-1 generate
  sq_c(i) <= cc(i-1);
end generate;

b <= bb;
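As a software cross-check of Algorithm 8.5 and of the datapath of Fig. 8.3, the loop can be modeled in a few lines of Python. This is only a sketch under stated assumptions: field elements are lists of m bits (a_0, . . . , a_{m-1}), and `nb_mul` is a hypothetical stand-in for NB_multiplier that is valid only when m + 1 is prime and 2 is primitive modulo m + 1 (the Type-I optimal normal basis case of Sec. 8.6). It is not the multiplier of Algorithm 8.4, and the names `nb_sq`, `nb_mul`, and `nb_exp` are chosen here for illustration.

```python
def nb_sq(a):
    """Normal basis squaring: cyclic shift, sq(0) = a(m-1), sq(i) = a(i-1)."""
    return a[-1:] + a[:-1]


def nb_mul(a, b):
    """Assumed stand-in multiplier (Type-I ONB only: m+1 prime, 2 primitive mod m+1)."""
    m = len(a)
    p = m + 1
    pa, pb, pc = [0] * p, [0] * p, [0] * p
    for i in range(m):                       # a_i multiplies beta^(2^i mod p)
        pa[pow(2, i, p)] = a[i]
        pb[pow(2, i, p)] = b[i]
    for s in range(p):
        for t in range(p):
            pc[(s + t) % p] ^= pa[s] & pb[t]  # uses beta^(m+1) = 1
    if pc[0]:                                # 1 = beta + beta^2 + ... + beta^m
        pc = [0] + [x ^ 1 for x in pc[1:]]
    return [pc[pow(2, i, p)] for i in range(m)]


def nb_exp(a, e):
    """Square-and-multiply as in Algorithm 8.5; e is an integer with at most m bits."""
    m = len(a)
    b = [1] * m                              # 1 = (1, 1, ..., 1) in normal basis
    c = list(a)
    for i in range(m):
        if (e >> i) & 1:                     # bit e(i)
            b = nb_mul(b, c)
        c = nb_sq(c)                         # c = a^(2^(i+1)), a rewiring in hardware
    return b
```

For m = 4 and a = β = (1, 0, 0, 0), `nb_exp(a, 3)` returns (0, 0, 0, 1), that is, β^8 = β^3, because β^5 = 1 for a root of the all-one polynomial.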

The complete model additionally includes a counter and a control unit.

The above binary method has an obvious generalization: use a base larger than two ([BGMW92], [Gor98]). In such a case, let

\[ e = \sum_{i=0}^{l} r_i m^i \tag{8.35} \]

The m-ary method for exponentiation computes $A^e$ using Eq. (8.35) by means of the following algorithm:

Algorithm 8.6—m-ary method for exponentiation

Input: A, e
Output: B = A^e
1. Compute A^2, A^3, . . . , A^(m-1).
2. B := 1
3. for d = l downto 0 by -1 do
4.   B := B^m
5.   B := B·A^(r_d)
6. end for
7. return B
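The steps of Algorithm 8.6 can be sketched in Python for any multiplicative structure. In this illustration the base is called `base` (the algorithm calls it m) to avoid confusion with the field degree, and the multiplication `mul` and the identity `one` are passed in as parameters; these names are assumptions made for the example, and integers modulo a fixed number can stand in for field elements when trying it out.

```python
def m_ary_pow(A, e, base, mul, one):
    """Compute A**e with the m-ary method of Algorithm 8.6 (base >= 2)."""
    # Digits r_0 .. r_l of e in the chosen base, Eq. (8.35).
    digits = []
    while e > 0:
        digits.append(e % base)
        e //= base
    # Step 1: table[r] = A^r for r = 0 .. base-1.
    table = [one]
    for _ in range(1, base):
        table.append(mul(table[-1], A))
    B = one
    for r in reversed(digits):               # d = l downto 0
        Bm = one
        for _ in range(base):                # B := B^base (naive repeated product)
            Bm = mul(Bm, B)
        B = mul(Bm, table[r])                # B := B * A^(r_d)
    return B
```

When the base is a power of two, the inner loop that raises B to the power `base` would normally be replaced by repeated squarings, which is the case discussed next.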

This method is particularly attractive if $m = 2^k$, so that raising a to the mth power only involves k squarings ([Sti90], [Gat91]). Therefore, if the field $GF(2^m)$ is used with a normal basis, then the squarings can be done with just a cyclic shift. In this case, the $2^k$-ary method takes only $\lceil m/k \rceil + 2^{k-1} - 2$ multiplications, since only odd powers up to $2^k - 1$ need to be computed. The $2^k$-ary method is given in the following algorithm [Gor98]:

Algorithm 8.7—2^k-ary method for exponentiation

Input: A, e
Output: B = A^e
1. Compute A^3, A^5, . . . , A^(2^k - 1).
2. B := 1
3. for d = 2^k - 1 downto 1 by -2 do
4.   for each i such that r_i = d·2^(j_i) do
5.     B := B·(A^d)^(2^(k·i + j_i))
6.   end for
7. end for
8. return B

We can illustrate the $2^k$-ary method with the following example:

Example 8.4 Assume that we want to compute the value of $A^r = A^{3370}$. The binary representation of the power r = 3370 is r = 110100101010. Let k = 3; therefore the binary representation of r must be considered in groups of 3 bits each, in such a way that

\[ r = \underbrace{110}_{r_3}\,\underbrace{100}_{r_2}\,\underbrace{101}_{r_1}\,\underbrace{010}_{r_0} = r_3 (2^3)^3 + r_2 (2^3)^2 + r_1 (2^3)^1 + r_0 (2^3)^0 = 6(2^3)^3 + 4(2^3)^2 + 5(2^3)^1 + 2(2^3)^0 = 3370_{10} \tag{8.36} \]

The values of the $r_i$ are in the range $r_i \in \{0, 1, 2, \ldots, 7\}$. From Algorithm 8.7, only the terms $A^3$, $A^5$, and $A^7$ should be computed. The execution of Algorithm 8.7 will give the following intermediate results:

• d = 7 ⇒ There is no $r_i$ of the form $r_i = d\,2^{j_i}$.

• d = 5 ⇒ $r_1 = 5 = 5 \cdot 2^0$ ⇒ $B = 1 \cdot (A^5)^{2^{3 \cdot 1 + 0}} = (A^5)^{2^3}$

• d = 3 ⇒ $r_3 = 6 = 3 \cdot 2^1$ ⇒ $B = (A^5)^{2^3} (A^3)^{2^{3 \cdot 3 + 1}} = (A^5)^{2^3} (A^3)^{2^{10}}$

• d = 1 ⇒ $r_0 = 2 = 1 \cdot 2^1$ ⇒ $B = (A^5)^{2^3} (A^3)^{2^{10}} (A^1)^{2^{3 \cdot 0 + 1}} = (A^5)^{2^3} (A^3)^{2^{10}} (A)^{2}$
  $r_2 = 4 = 1 \cdot 2^2$ ⇒ $B = (A^5)^{2^3} (A^3)^{2^{10}} (A)^{2} (A^1)^{2^{3 \cdot 2 + 2}} = (A^5)^{2^3} (A^3)^{2^{10}} (A)^{2} (A)^{2^8} = A^{3370}$
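The bookkeeping of Example 8.4 can be reproduced with a short Python sketch. It is an illustration only: `decompose` and `two_k_ary_pow` are names invented here, the group operations `mul` and `sq` and the identity `one` are supplied by the caller, and integers modulo a fixed number are assumed as a stand-in for field elements when checking the result.

```python
def decompose(e, k):
    """Return pairs (d, s), d odd, with e = sum of d * 2**s; one pair per nonzero digit r_i."""
    terms = []
    i = 0
    while e > 0:
        r = e & ((1 << k) - 1)               # digit r_i of e in base 2^k
        if r:
            j = 0
            while r % 2 == 0:                # r_i = d * 2^(j_i) with d odd
                r //= 2
                j += 1
            terms.append((r, k * i + j))     # exponent of the squaring: k*i + j_i
        e >>= k
        i += 1
    return terms


def two_k_ary_pow(A, e, k, mul, sq, one):
    """2^k-ary exponentiation following the structure of Algorithm 8.7."""
    terms = decompose(e, k)
    A2 = sq(A)
    odd = {1: A}
    for d in range(3, 1 << k, 2):            # A^3, A^5, ..., A^(2^k - 1)
        odd[d] = mul(odd[d - 2], A2)
    B = one
    for d in range((1 << k) - 1, 0, -2):     # d = 2^k - 1 downto 1 by -2
        for dd, s in terms:
            if dd == d:
                t = odd[d]
                for _ in range(s):           # (A^d)^(2^s): s squarings (cyclic shifts in NB)
                    t = sq(t)
                B = mul(B, t)
    return B
```

For e = 3370 and k = 3, `decompose` returns the pairs (1, 1), (5, 3), (1, 8), (3, 10), which are exactly the factors $(A)^{2}$, $(A^5)^{2^3}$, $(A)^{2^8}$, and $(A^3)^{2^{10}}$ obtained in the example.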

Assume that r_array has been defined as an array (0 .. m/2 − 1) of integers holding the values of the coefficients $r_i$ of the exponent as in Eq. (8.36). Assume also that the following functions are available:

function vectoint_k(x: poly_vector; k,p: integer) return integer
function pow2j(q: integer) return integer
function inttovect(x: integer) return poly_vector

They, respectively, convert k bits (starting at the pth bit) of an m-bit vector to their integer value; determine whether the integer q is a power of 2 (i.e., whether $q = 2^j$; Algorithm 8.8 uses the returned value as the exponent j); and convert an integer to its m-bit vector representation. Then the following algorithm implements the $2^k$-ary method for exponentiation given in Algorithm 8.7, which computes $A^e$.

Algorithm 8.8—2^k-ary method for exponentiation in normal basis

for i in 0 .. m/2-1 loop r(i) := 0; end loop;
e := inttovect(g);
for i in 0 .. m/k-1 loop r(i) := vectoint_k(e,k,i*k); end loop;
for i in 0 .. m-1 loop b(i) := 1; end loop;
d := 2**k-1;
while d >= 1 loop
  aux := NB_exp(a,d,h,w);
  for i in 0 .. m/k-1 loop
    q := r(i)/d;
    if r(i) = d*(2**pow2j(q)) then
      if k*i+pow2j(q) = 0 then
        aux1 := aux;
      elsif k*i+pow2j(q) = 1 then
        aux1 := NB_sq(aux);
      elsif k*i+pow2j(q) > 1 then
        aux1 := NB_sq(aux);
        for l in 1 .. (k*i+pow2j(q))-1 loop
          aux1 := NB_multiplier(aux1,aux1,h,w);
        end loop;
      end if;
      b := NB_multiplier(b,aux1,h,w);
    end if;
  end loop;
  d := d - 2;
end loop;

where NB_exp performs the square-and-multiply exponentiation in normal basis given in Algorithm 8.5 for a field $GF(2^m)$ with values $h_j$ and $w_{j,k}$. An executable Ada file NB_2kary_exp.adb, including Algorithm 8.8, is available at www.arithmetic-circuits.org.

8.5 Inversion

Efforts in developing normal basis multiplicative inversion algorithms in finite fields $GF(2^m)$ have produced only a limited number of choices. The most popular methods for finite field inversion over $GF(2^m)$ are mainly based on Fermat's theorem and on Euclid's algorithm [Sun06]. Using Fermat's theorem, the inverse of an element in $GF(2^m)$ can be found by successive squaring and multiplication. In normal basis representation of a Galois field, squaring is done by a simple cyclic shift. Hence, the algorithms based on Fermat's theorem for inversion mainly choose this basis ([IT88], [Fen89], [WTSDOR85]).

From Fermat's theorem, that is, $A^{2^m - 1} = 1$, it follows that $A^{2^m} = A$ holds. Therefore, the inversion can be carried out by computing the exponentiation $A^{-1} = A^{2^m - 2}$, for $A \neq 0 \in GF(2^m)$. Since $2^m - 2 = 2 + 2^2 + 2^3 + \cdots + 2^{m-1}$, $A^{-1}$ can be expressed as ([WTSDOR85], [TYT01])

\[ A^{-1} = A^{2^m - 2} = (A^{2^1}) (A^{2^2}) (A^{2^3}) \cdots (A^{2^{m-1}}) \tag{8.37} \]

As stated, squaring can be realized in normal basis representation by a cyclic shift operation. The following algorithm [WTSDOR85] implements the inversion given in Eq. (8.37):

Algorithm 8.9—Inversion in normal basis

b := NB_sq(a);
for i in 0 .. m-1 loop c(i) := 1; end loop;
k := 0;
while k < m-1 loop
  d := NB_multiplier(b,c,h,w);
  k := k + 1;
  if k = m-1 then
    inv := d;
  end if;
  if k < m-1 then
    b := NB_sq(b);
    c := d;
  end if;
end loop;

where NB_multiplier performs the normal basis multiplication given in Algorithm 8.4 for a field $GF(2^m)$ with values $h_j$ and $w_{j,k}$, and where NB_sq implements the normal basis squaring. An executable Ada file NB_inversion.adb, including Algorithm 8.9, is available at www.arithmetic-circuits.org.

An example of a datapath corresponding to Algorithm 8.9 is shown in Fig. 8.4. The minimum clock period is $T_{NB\_multiplier}$, and the total computation time is about $T \approx m\,T_{NB\_multiplier}$. A VHDL model for the normal basis inversion algorithm is given in the file NB_inversion.vhd, which is available at www.arithmetic-circuits.org. The entity declaration of the sequential implementation of the normal basis inversion algorithm given in the VHDL file NB_inversion.vhd is

entity NB_inversion is
port (
  a: in std_logic_vector(M-1 downto 0);
  clk, reset, start: in std_logic;
  done: out std_logic;
  inv: out std_logic_vector(M-1 downto 0)
);
end NB_inversion;

The VHDL architecture corresponding to the circuit of Fig. 8.4 follows:

multiplier: NB_multiplier port map (a => bb, b => cc, c => dd);

-- reset is tested asynchronously, so it is listed in the sensitivity lists
sq_register: process(clk, reset)
begin
  if reset = '1' then bb <= (others => '0');
  elsif clk'event and clk = '1' then
    if inic = '1' then bb <= a; end if;
    -- cyclic shift = normal basis squaring
    if shift_r = '1' then bb <= bb(M-2 downto 0) & bb(M-1); end if;
  end if;
end process sq_register;

register_C: process(inic, reset, clk)
begin
  if inic = '1' or reset = '1' then cc <= (others => '1');
  elsif clk'event and clk = '1' then
    if ce_c = '1' then cc <= dd;
    else cc <= cc;
    end if;
  end if;
end process register_C;

inv <= dd;

The complete model additionally includes a counter and a control unit.
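The behavior of Algorithm 8.9 can be checked with a small Python model. This is a sketch under assumptions: elements are bit lists (a_0, . . . , a_{m-1}), and `nb_mul` is a hypothetical stand-in for NB_multiplier that only works when m + 1 is prime and 2 is primitive modulo m + 1 (Type-I optimal normal basis, Sec. 8.6); the function names are illustrative.

```python
def nb_sq(a):
    """Normal basis squaring as a cyclic shift."""
    return a[-1:] + a[:-1]


def nb_mul(a, b):
    """Assumed stand-in multiplier (Type-I ONB only: m+1 prime, 2 primitive mod m+1)."""
    m = len(a)
    p = m + 1
    pa, pb, pc = [0] * p, [0] * p, [0] * p
    for i in range(m):                       # a_i multiplies beta^(2^i mod p)
        pa[pow(2, i, p)] = a[i]
        pb[pow(2, i, p)] = b[i]
    for s in range(p):
        for t in range(p):
            pc[(s + t) % p] ^= pa[s] & pb[t]  # uses beta^(m+1) = 1
    if pc[0]:                                # 1 = beta + beta^2 + ... + beta^m
        pc = [0] + [x ^ 1 for x in pc[1:]]
    return [pc[pow(2, i, p)] for i in range(m)]


def nb_inv(a):
    """A^(-1) = A^(2^1) * A^(2^2) * ... * A^(2^(m-1)), Eq. (8.37)."""
    m = len(a)
    b = nb_sq(a)                             # A^2
    c = [1] * m                              # running product, starts at 1
    for _ in range(m - 1):                   # m-1 multiplications
        c = nb_mul(b, c)
        b = nb_sq(b)                         # next power A^(2^(k+1))
    return c
```

A quick sanity check is that `nb_mul(a, nb_inv(a))` gives the all-ones vector, the normal basis representation of 1, for every nonzero a.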

FIGURE 8.4 Normal basis inversion datapath. (Block diagram: an m-bit cyclic shift register bb (NB_squaring), loaded with A when inic = 1 and rotated when shift_r = 1; an NB_multiplier that produces dd from bb and cc; an m-bit register cc, initialized to 1 and loaded with dd when ce_c = 1; the output INV is dd.)

Perhaps the most popular inversion algorithm is the parallel Itoh and Tsujii algorithm ([IT88], [TYT01]), which is also derived from Fermat's little theorem, that is, $A^{2^m - 1} = 1$ for $A \neq 0 \in GF(2^m)$, from which $A^{2^m} = A$ holds. The basic idea is to decompose the exponent $2^m - 2$ as follows:

\[ A^{-1} = A^{2^m - 2} = \left(A^{2^{m-1} - 1}\right)^2 \tag{8.38} \]

The exponent $2^{m-1} - 1$ is further decomposed as follows:

1. If m is odd, then $2^{m-1} - 1 = (2^{(m-1)/2} - 1)(2^{(m-1)/2} + 1)$; therefore $A^{2^{m-1} - 1} = \left(A^{2^{(m-1)/2} - 1}\right)^{2^{(m-1)/2} + 1}$.

2. If m is even, then $2^{m-1} - 1 = 2(2^{m-2} - 1) + 1 = 2(2^{(m-2)/2} - 1)(2^{(m-2)/2} + 1) + 1$; therefore $A^{2^{m-1} - 1} = A^{2(2^{(m-2)/2} - 1)(2^{(m-2)/2} + 1) + 1}$.

The Itoh-Tsujii inversion algorithm is shown in the following ([IT88], [AA06]):

Algorithm 8.10—Itoh-Tsujii inversion algorithm in normal basis

Input: A
Output: L = A^(-1)
1. S := ⌊log2(m − 1)⌋ − 1
2. P := A
3. for i = S downto 0 do
4.   R := Shift (m − 1) to right by S bits
5.   Q := P
6.   Rotate Q to left by ⌊R/2⌋ bits
7.   T := PQ
8.   if last bit of R = 1 then
9.     Rotate T to left by 1 bit
10.    P := TA
11.  else
12.    P := T
13.  end if
14.  S := S − 1
15. end for
16. Rotate P to left by 1 bit
17. L := P
18. return L

The Itoh-Tsujii algorithm achieves inversion by computing the exponentiation $A^{-1} = A^{2^m - 2}$, using a clever recursive decomposition technique applied to the exponent. The efficiency of the algorithm is based on the efficient squaring property of the normal basis and on the reduction of the number of required multiplications to O(log m). It must be noted that the shift (left, right) and rotate operations in Algorithm 8.10 refer to a bit ordering from (m − 1) downto 0. Assume that the functions

function log(x: integer) return integer
function vectoint(x: poly_vector) return integer

which compute $\lfloor \log_2 x \rfloor$ and convert a bit vector to its integer value, respectively, are available. Then the following algorithm implements the Itoh-Tsujii inversion scheme given in Algorithm 8.10.

Algorithm 8.11—Itoh-Tsujii inversion algorithm in normal basis for GF(2^m)

s := log(m-1) - 1;
p := a;
for i in reverse 0 .. s loop
  -- R := (m-1) shifted by i positions (step 4 of Algorithm 8.10)
  r := inttovect(m-1);
  for j in 1 .. i loop r := lshift(r); end loop;
  rint := vectoint(r);
  q := p;
  -- rotate Q by floor(R/2) positions, i.e., floor(R/2) squarings
  for j in 1 .. rint/2 loop q := NB_sq(q); end loop;
  t := NB_multiplier(p,q,h,w);
  if r(0) = 1 then
    t := NB_sq(t);
    p := NB_multiplier(t,a,h,w);
  else
    p := t;
  end if;
end loop;
p := NB_sq(p);
l := p;

In Algorithm 8.11, the bit ordering used is from 0 to (m – 1). Therefore, lshift and NB_sq functions are used for the shift and rotate operations given in Algorithm 8.10. An executable Ada file NB_Itoh_Tsujii_inv.adb, including Algorithm 8.11, is available at www.arithmetic-circuits.org.
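The addition chain followed by Algorithm 8.10 can also be traced with a short Python model. As before, this is a sketch under assumptions: `nb_mul` is a hypothetical stand-in multiplier that is only valid for a Type-I optimal normal basis (m + 1 prime, 2 primitive modulo m + 1), the function names are illustrative, and the "rotate" operations of the algorithm are modeled by the squaring `nb_sq`.

```python
def nb_sq(a):
    """Normal basis squaring as a cyclic shift."""
    return a[-1:] + a[:-1]


def nb_mul(a, b):
    """Assumed stand-in multiplier (Type-I ONB only: m+1 prime, 2 primitive mod m+1)."""
    m = len(a)
    p = m + 1
    pa, pb, pc = [0] * p, [0] * p, [0] * p
    for i in range(m):                       # a_i multiplies beta^(2^i mod p)
        pa[pow(2, i, p)] = a[i]
        pb[pow(2, i, p)] = b[i]
    for s in range(p):
        for t in range(p):
            pc[(s + t) % p] ^= pa[s] & pb[t]  # uses beta^(m+1) = 1
    if pc[0]:                                # 1 = beta + beta^2 + ... + beta^m
        pc = [0] + [x ^ 1 for x in pc[1:]]
    return [pc[pow(2, i, p)] for i in range(m)]


def itoh_tsujii_inv(a):
    """Inverse via the exponent chain of Algorithm 8.10 (requires m >= 2)."""
    m = len(a)
    s = (m - 1).bit_length() - 2             # floor(log2(m-1)) - 1
    p = list(a)                              # invariant: p = A^(2^n - 1), n = (m-1) >> (s+1)
    while s >= 0:
        r = (m - 1) >> s                     # R: leading bits of m-1
        q = p
        for _ in range(r // 2):              # Q = P^(2^floor(R/2))
            q = nb_sq(q)
        t = nb_mul(p, q)                     # A^(2^(2*floor(R/2)) - 1)
        if r & 1:
            p = nb_mul(nb_sq(t), a)          # odd R: one extra squaring and a product by A
        else:
            p = t
        s -= 1
    return nb_sq(p)                          # (A^(2^(m-1) - 1))^2 = A^(2^m - 2)
```

For m = 10 the loop runs for S = 2, 1, 0 with R = 2, 4, 9, so P takes the values $A^{2^2-1}$, $A^{2^4-1}$, and $A^{2^9-1}$ using four multiplications, compared with the m − 1 = 9 multiplications of Algorithm 8.9.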

8.6 Optimal Normal Bases

In Sec. 8.3, the complexity of a normal basis N of $GF(2^m)$ over GF(2) was defined in Eq. (8.17) by $C_N = H(M_i)$, with 0 ≤ i ≤ m − 1. It can be proven ([MBGMVY93], [MOVW88]) that the complexity of a normal basis is lower bounded by $C_N \ge 2m - 1$. Therefore, normal bases with $C_N = 2m - 1$ are said to be optimal normal bases. Normal bases of low complexity are desirable in hardware or software implementations of finite fields. Only the following two types of optimal normal bases [MBGMVY93] exist in $GF(2^m)$:

Type-I optimal normal basis: m + 1 is a prime p and 2 is primitive modulo p (namely, the multiplicative order of 2 modulo p is m).

Type-II optimal normal basis: 2m + 1 is a prime p and either
a. 2 is primitive modulo p, or
b. p ≡ 3 (mod 4) and the multiplicative order of 2 modulo p is m (namely, 2 generates the quadratic residues modulo p).

A Type-I optimal normal basis is generated by elements $\beta \in GF(2^m)$ of order p = m + 1. It can be observed that the minimal polynomial of β is the AOP (all-one polynomial, studied in Chap. 7) $f(x) = x^m + x^{m-1} + \cdots + x + 1$ and that the sets $\{\beta, \beta^2, \beta^{2^2}, \ldots, \beta^{2^{m-1}}\}$ and $\{\beta, \beta^2, \beta^3, \ldots, \beta^m\}$ are identical [MBGMVY93]. Thus, after suitable permutation, we can operate on elements in optimal normal basis representation as polynomials modulo f(x). Results expressed in terms of $1, \beta, \beta^2, \ldots, \beta^m$ can be brought back to the desired basis set by using, when needed, the equality $1 = \beta + \beta^2 + \cdots + \beta^m$. So, in addition to being attractive for hardware applications, the Type-I optimal normal basis representation inherits the advantages of the polynomial representation.

A Type-II optimal normal basis is constructed using the normal elements $\beta = \gamma + \gamma^{-1}$ [MBGMVY93], where γ is a primitive (2m + 1)th root of unity, that is, $\gamma^{2m+1} = 1$ and $\gamma^i \neq 1$ for any 1 ≤ i < 2m + 1. A Type-II optimal normal basis can be constructed if p = 2m + 1 is prime and if either of the above two conditions also holds.

Complexities of the arithmetic operations studied in previous sections can be further reduced when optimal normal bases are considered. For example, the multiplication scheme given in Eq. (8.24), Lemma 8.1, can be optimized when a Type-I optimal normal basis is used, as given below [RH03a]. As stated, a Type-I optimal normal basis is generated by the roots of an irreducible AOP. The AOP $f(x) = x^m + x^{m-1} + \cdots + x + 1$ is irreducible if m + 1 is prime and 2 is primitive modulo m + 1. Thus, the roots of an AOP, $\beta^{2^j}$, with j = 0, 1, . . . , m − 1, form a Type-I optimal normal basis if and only if m + 1 is prime and 2 is primitive modulo m + 1. The terms $\delta_j$, with 1 ≤ j ≤ m/2, can be determined in this case by ([RH02], [RH03a])

\[ \delta_j = \begin{cases} \beta^{2^{k_j}}, & j = 1, 2, \ldots, m/2 - 1 \\ 1 = \sum_{i=0}^{m-1} \beta^{2^i}, & j = m/2 \end{cases} \tag{8.39} \]

where $k_j$ can be obtained from

\[ 2^j + 1 \equiv 2^{k_j} \bmod (m + 1) \tag{8.40} \]
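The existence conditions for both types of optimal normal basis, and the values $k_j$ of Eq. (8.40), are easy to tabulate. The following Python sketch does so by brute force; the function names are illustrative and are not taken from the book's source files.

```python
def is_prime(n):
    """Trial-division primality test (adequate for small n)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def mult_order(a, p):
    """Multiplicative order of a modulo a prime p that does not divide a."""
    k, x = 1, a % p
    while x != 1:
        x = x * a % p
        k += 1
    return k


def has_type1_onb(m):
    """m + 1 is a prime p and 2 is primitive modulo p."""
    p = m + 1
    return p > 2 and is_prime(p) and mult_order(2, p) == m


def has_type2_onb(m):
    """2m + 1 is a prime p and (2 primitive mod p, or p = 3 mod 4 with order of 2 equal to m)."""
    p = 2 * m + 1
    if not is_prime(p):
        return False
    order = mult_order(2, p)
    return order == 2 * m or (p % 4 == 3 and order == m)


def k_values(m):
    """k_j with 2**k_j = 2**j + 1 (mod m+1) for j = 1 .. m/2 - 1, Eq. (8.40); Type-I m only."""
    p = m + 1
    log2 = {pow(2, k, p): k for k in range(m)}   # discrete logarithm table, base 2
    return [log2[(pow(2, j, p) + 1) % p] for j in range(1, m // 2)]
```

For instance, `k_values(4)` gives [3] and `k_values(10)` gives [8, 4, 6, 9]; the degrees m = 10, 28, 36, 58, and 66 used later for the Type-I multipliers all satisfy `has_type1_onb`.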

It must be noted that there exists a unique $k_j$, 0 ≤ $k_j$ < m, for which Eq. (8.40) holds. Substituting Eq. (8.39) into Eq. (8.24) leads to the following expression of the product C [RH03a]:

\[ C = \left( \sum_{i=0}^{m-1} a_i b_i \beta^{2^i} \right) + \sum_{j=1}^{v-1} \left( \sum_{i=0}^{m-1} y_{i,j} \beta^{2^i} \right)^{2^{k_j}} + \left( \sum_{i=0}^{v-1} y_{i,v} \right) \tag{8.41} \]

where the right most summation results in 0 or 1, represented in normal basis by (0, 0, . . . , 0) and (1, 1, . . . , 1), respectively. Using Eq. (8.41), the following algorithm for Type-I optimal normal basis multiplication was given in [RH03a]:

Algorithm 8.12—Algorithm for Type-I optimal normal basis multiplication in GF(2^m)

Input: A, B ∈ GF(2^m), k_j, 1 ≤ j ≤ v − 1, v = m/2.
Output: C = AB
1. Generate y_{i,j} = (a_i + a_((i+j)))(b_i + b_((i+j))), 1 ≤ j ≤ v, 0 ≤ i ≤ m − 1
2. Generate y_{i,v} = (a_i + a_((v+i)))(b_i + b_((v+i))), 0 ≤ i ≤ v − 1
3. Initialize c_i := a_i b_i, 0 ≤ i ≤ m − 1, f := y_{0,v}, f ∈ GF(2)
4. for j = 1 to v − 1 do
5.   r_i := y_{i,j}, 0 ≤ i ≤ m − 1, R := (r_0, r_1, . . . , r_{m−1})
6.   R := R^(2^(k_j))
7.   C := C + R
8.   f := f + y_{j,v}
9. end for
10. if f = 1 then
11.   C := C + (1, 1, . . . , 1)
12. end if
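The steps of Algorithm 8.12 translate almost line by line into Python. The sketch below is an illustration under assumptions: elements are bit lists, `t1_onb_mul` computes the $k_j$ from Eq. (8.40) on the fly instead of reading a precomputed array, and `ref_mul` is an independent reference multiplier (a cyclic convolution using $\beta^{m+1} = 1$) that is included only to cross-check the result. Both names are invented for this example.

```python
def nb_sq(a):
    """Normal basis squaring as a cyclic shift."""
    return a[-1:] + a[:-1]


def t1_onb_mul(a, b):
    """Type-I ONB product following Algorithm 8.12 (m even, m+1 prime, 2 primitive)."""
    m = len(a)
    v = m // 2
    p = m + 1
    log2 = {pow(2, k, p): k for k in range(m)}

    def y(i, j):                             # y_{i,j} = (a_i + a_{i+j})(b_i + b_{i+j})
        return (a[i] ^ a[(i + j) % m]) & (b[i] ^ b[(i + j) % m])

    c = [a[i] & b[i] for i in range(m)]      # step 3: c_i := a_i b_i
    f = y(0, v)                              # step 3: f := y_{0,v}
    for j in range(1, v):
        kj = log2[(pow(2, j, p) + 1) % p]    # Eq. (8.40)
        r = [y(i, j) for i in range(m)]      # step 5
        for _ in range(kj):                  # step 6: R := R^(2^kj), kj cyclic shifts
            r = nb_sq(r)
        c = [ci ^ ri for ci, ri in zip(c, r)]  # step 7
        f ^= y(j, v)                         # step 8
    if f:                                    # steps 10-12: add 1 = (1, 1, ..., 1)
        c = [ci ^ 1 for ci in c]
    return c


def ref_mul(a, b):
    """Reference product (assumption: same Type-I ONB), used only for cross-checking."""
    m = len(a)
    p = m + 1
    pa, pb, pc = [0] * p, [0] * p, [0] * p
    for i in range(m):
        pa[pow(2, i, p)] = a[i]
        pb[pow(2, i, p)] = b[i]
    for s in range(p):
        for t in range(p):
            pc[(s + t) % p] ^= pa[s] & pb[t]
    if pc[0]:
        pc = [0] + [x ^ 1 for x in pc[1:]]
    return [pc[pow(2, i, p)] for i in range(m)]
```

For m = 4, multiplying β = (1, 0, 0, 0) by itself gives (0, 1, 0, 0), that is, β^2, as expected from the cyclic-shift squaring property.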

Assume that k_array is defined as an array of integers from 1 to (m/2 – 1) that holds the values kj, with 1 ≤ j ≤ v – 1, computed using Eq. (8.40). Then Algorithm 8.12 can be implemented as follows:

Algorithm 8.13—Type-I optimal normal basis multiplication with AOPs

v := m/2;
for i in 0 .. m-1 loop r(i) := 0; one(i) := 1; end loop;
for i in 0 .. m-1 loop
  for j in 1 .. v loop
    yij(i,j) := m2and(m2xor(a(i),a((i+j) mod m)), m2xor(b(i),b((i+j) mod m)));
  end loop;
end loop;
for i in 0 .. v-1 loop
  yiv(i) := m2and(m2xor(a(i),a((v+i) mod m)), m2xor(b(i),b((v+i) mod m)));
end loop;
for i in 0 .. m-1 loop c(i) := m2and(a(i),b(i)); end loop;
f := yiv(0);
for j in 1 .. v-1 loop
  for i in 0 .. m-1 loop r(i) := yij(i,j); end loop;
  for i in 1 .. k(j) loop r := NB_sq(r); end loop;
  c := m2xvv(c,r);
  f := m2xor(f,yiv(j));
end loop;
if f = 1 then c := m2xvv(c,one); end if;

In Algorithm 8.13, the operation $R := R^{2^{k_j}}$ is accomplished by a $k_j$-fold cyclic shift using normal basis squaring with the NB_sq function. An executable Ada file NB_T1_multiplier.adb, including Algorithm 8.13, is available at www.arithmetic-circuits.org. A VHDL file NB_T1_multiplier.vhd, which models the Type-I optimal normal basis multiplication given in Algorithm 8.13, is available at www.arithmetic-circuits.org. The corresponding entity declaration is

entity NB_T1_multiplier is
port (
  a, b: in std_logic_vector(M-1 downto 0);
  c: out std_logic_vector(M-1 downto 0)
);
end NB_T1_multiplier;

The VHDL architecture follows:

yij_process: process(a,b)
  variable yij: yij_array;
begin
  for i in 0 to m-1 loop
    for j in 1 to v loop
      yij(i)(j) := (a(i) xor a((i+j) mod m)) and (b(i) xor b((i+j) mod m));
    end loop;
  end loop;
  yij_s <= yij;
end process;

yiv_process: process(a,b)
  variable yiv: yiv_array;
begin
  for i in 0 to v-1 loop
    yiv(i) := (a(i) xor a((v+i) mod m)) and (b(i) xor b((v+i) mod m));
  end loop;
  yiv_s <= yiv;
end process;

c_s_process: process(a,b)
  variable caux: std_logic_vector(M-1 downto 0);
begin
  for i in 0 to m-1 loop caux(i) := a(i) and b(i); end loop;
  c_s <= caux;
end process;

P1: process(yiv_s, yij_s, c_s)
  variable f: std_logic;
  variable r,r2,c_v: std_logic_vector(M-1 downto 0);
begin
  f := yiv_s(0);
  c_v := c_s;
  for j in 1 to v-1 loop
    for i in 0 to m-1 loop r(i) := yij_s(i)(j); end loop;
    for i in 1 to k(j) loop
      -- Squaring
      r2(0) := r(m-1);
      for i in 1 to m-1 loop r2(i) := r(i-1); end loop;
      r := r2;
    end loop;
    for i in 0 to m-1 loop c_v(i) := c_v(i) xor r(i); end loop;
    f := f xor yiv_s(j);
  end loop;
  if f = '1' then
    for i in 0 to m-1 loop c_v(i) := not c_v(i); end loop;
  end if;
  c_aux <= c_v;
end process;

c <= c_aux;

For the Type-I optimal normal basis, another approach was given in [KS98]. As stated, for a Type-I optimal normal basis with an AOP as generating polynomial, the sets $\{\beta, \beta^2, \beta^{2^2}, \ldots, \beta^{2^{m-1}}\}$ and $\{\beta, \beta^2, \beta^3, \ldots, \beta^m\}$ are identical [MBGMVY93]. Furthermore, the basis $\{\beta, \beta^2, \beta^3, \ldots, \beta^m\}$ is a shifted version of the polynomial basis. An element of the field $GF(2^m)$ in the normal basis representation can be converted to the shifted polynomial representation using a permutation of the binary coordinates. The root β of an AOP has the property $\beta^{m+1} = 1$. Hence the conversion

\[ A = \sum_{i=0}^{m-1} a_i \beta^{2^i} = \sum_{i=1}^{m} a'_i \beta^i \tag{8.42} \]

can be performed using the following permutation [KS98]:

\[ a'_{2^i \bmod (m+1)} = a_i \quad \text{for } i = 0, 1, \ldots, m - 1 \tag{8.43} \]

Therefore, in order to perform a Type-I optimal normal basis multiplication using this method, the inputs A and B represented in the normal basis are taken. Then they must be converted to the shifted polynomial basis using the permutation given in Eq. (8.43), and a polynomial basis multiplication for AOPs is performed using the equations and algorithms given in Chap. 7 (Section 7.6.3). At the end of this computation, the result $F = AB/\beta^2$ is obtained and represented in the polynomial basis as

\[ F = f_0 + f_1 \beta + f_2 \beta^2 + \cdots + f_{m-1} \beta^{m-1} \tag{8.44} \]

where the coefficients $f_i$ are the outputs of the polynomial basis multiplier given in Section 7.6.3. Using Eq. (7.60), the coefficients are $f_i = e + d_i$, for i = 0, 1, . . . , m − 1. Then the product F must be multiplied by $\beta^2$ in order to obtain $G = F\beta^2$ as

\[ G = (d_0 + e)\beta^2 + (d_1 + e)\beta^3 + \cdots + (d_{m-1} + e)\beta^{m+1} \tag{8.45} \]

Using the fact that

\[ \beta^{m+1} = \beta + \beta^2 + \cdots + \beta^m \tag{8.46} \]

and substituting Eq. (8.46) into Eq. (8.45), the following final expression is obtained:

\[ \begin{aligned} G &= g_0 \beta + g_1 \beta^2 + \cdots + g_{m-1} \beta^m \\ &= (d_{m-1} + e)\beta + (d_0 + e + d_{m-1} + e)\beta^2 + (d_1 + e + d_{m-1} + e)\beta^3 + \cdots + (d_{m-2} + e + d_{m-1} + e)\beta^m \\ &= (d_{m-1} + e)\beta + (d_0 + d_{m-1})\beta^2 + (d_1 + d_{m-1})\beta^3 + \cdots + (d_{m-2} + d_{m-1})\beta^m \end{aligned} \tag{8.47} \]

where the coordinates are computed as:

\[ g_0 = d_{m-1} + e, \qquad g_i = d_{i-1} + d_{m-1}, \quad i = 1, 2, \ldots, m - 1 \tag{8.48} \]

Therefore, the expressions given in Eq. (8.48) are the coordinates of G in the shifted polynomial basis. Now if the inverse of the permutation given in Eq. (8.43) is applied to G, then the coordinates of the product C in the Type-I optimal normal basis are finally obtained. It must be noted that the implementation of the permutation and inverse permutation operations are accomplished simply by wiring. Thus, the Type-I optimal normal basis multiplier has exactly the same area and delay complexities as that of the polynomial basis multiplier for AOPs given in Section 7.6.3.
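The permutation of Eq. (8.43) and its inverse are simple index maps, as the following Python sketch shows. It is an illustration with assumed names, and it replaces the AOP polynomial multiplier of Section 7.6.3 (with its $d_i$ and e terms, which are not reproduced here) by a plain cyclic convolution that uses $\beta^{m+1} = 1$ and $1 = \beta + \beta^2 + \cdots + \beta^m$.

```python
def to_shifted_poly(a):
    """Eq. (8.43): a'_(2^i mod (m+1)) = a_i. Index t of the result multiplies beta^t; index 0 is unused."""
    m = len(a)
    ap = [0] * (m + 1)
    for i in range(m):
        ap[pow(2, i, m + 1)] = a[i]
    return ap


def from_shifted_poly(ap):
    """Inverse permutation: a_i = a'_(2^i mod (m+1))."""
    m = len(ap) - 1
    return [ap[pow(2, i, m + 1)] for i in range(m)]


def t1_mul_via_permutation(a, b):
    """Normal basis in, normal basis out; the product is formed in the shifted polynomial basis."""
    pa, pb = to_shifted_poly(a), to_shifted_poly(b)
    p = len(pa)
    g = [0] * p
    for s in range(1, p):
        for t in range(1, p):
            g[(s + t) % p] ^= pa[s] & pb[t]  # beta^s * beta^t = beta^((s+t) mod (m+1))
    if g[0]:                                 # fold the constant term: 1 = beta + ... + beta^m
        g = [0] + [x ^ 1 for x in g[1:]]
    return from_shifted_poly(g)
```

In hardware both index maps cost nothing, since they are fixed wirings; in this model they are the two list comprehensions around the convolution.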

8.7 FPGA Implementations

Several circuits described in this chapter have been implemented within a Xilinx Spartan-3 (speed grade -5) programmable device. The times (period, total time) are expressed in ns. The parameters FFs and LUTs represent the number of flip-flops and look-up tables. Every slice includes two flip-flops and two look-up tables. All the source files are available at www.arithmetic-circuits.org.

8.7.1 Multiplier

The circuits are fully combinational. The cost and delay of several multipliers are shown in Table 8.1.

TABLE 8.1 Cost and Delay of Multiplication

  m     LUTs   Slices   Total time (ns)
  4       12        6   4
  5       25       13   5
  13     195      101   5
  17     413      213   5
  23     459      237   5
  29     725      374   5
  163      ∞

∞ means that the circuit does not fit within the device.

8.7.2 Exponentiation

The circuits given in Table 8.2 are sequential implementations.

TABLE 8.2 Cost and Delay of Exponentiation

  m     FFs    LUTs   Slices   Period (ns)
  4      17      33       17   3.5
  5      20      43       22   3.5
  13     45     232      117   5.9
  17     58     459      233   6.8
  23    122     519      309   7.6
  29    204     796      506   8.9
  163     ∞

∞ means that the circuit does not fit within the device.

8.7.3 Inversion

The circuits given in Table 8.3 are sequential implementations.

TABLE 8.3 Cost and Delay of Inversion

  m     FFs    LUTs   Slices   Period (ns)
  4      14      32       17   3.4
  5      16      41       22   3.9
  13     33     222      112   6.0
  17     42     444      225   6.9
  23    100     500      300   7.3
  29    190     770      507   8.7
  163     ∞

∞ means that the circuit does not fit within the device.

8.7.4 Type-I Optimal Normal Basis Multiplier with AOPs

The circuits are fully combinational. The cost and delay of several Type-I ONB multipliers are shown in Table 8.4.

TABLE 8.4 Cost and Delay of Type-I ONB Multiplication with AOPs

  m     LUTs    Slices   Total time (ns)
  10      87        44   5
  28     551       276   5
  36     888       445   5
  58   2,300     1,151   5
  66   2,947     1,475   5

8.8 Comments and Conclusions

The experimental results show that combinational normal basis multipliers are more complex (in area) than the combinational polynomial basis multipliers given in Chap. 7. Furthermore, combinational Type-I ONB multipliers with AOPs have lower area complexity than combinational normal basis multipliers, and similar complexity to their polynomial basis counterparts.

8.9 References

[AA06] T. F. Al-Somani and A. Amin. “Hardware Implementations of GF(2^m) Arithmetic Using Normal Basis.” Journal of Applied Sciences, vol. 6, no. 6, pp. 1362–1372, 2006.
[BGMW92] E. F. Brickell, D. M. Gordon, K. S. McCurley, and D. B. Wilson. “Fast Exponentiation with Precomputation: Algorithms and Lower Bounds.” Advances in Cryptology: Eurocrypt’92, Lecture Notes in Computer Science LNCS 658, pp. 200–207, 1992.
[BRS98] I. F. Blake, R. M. Roth, and G. Seroussi. “Efficient Arithmetic in GF(2^n) through Palindromic Representation.” Technical Report, Hewlett-Packard, HPL-98-134, August 1998.
[Fen89] G. L. Feng. “A VLSI Architecture for Fast Inversion in GF(2^m).” IEEE Transactions on Computers, vol. 38, no. 10, pp. 1383–1386, October 1989.
[GL92] S. Gao and H. W. Lenstra Jr. “Optimal Normal Bases.” Designs, Codes, and Cryptography, vol. 2, pp. 315–323, 1992.
[Gat91] J. von zur Gathen. “Efficient Exponentiation in Finite Fields.” Proc. of the 32nd IEEE Symposium on the Foundations of Computer Science, pp. 384–391, 1991.
[Gor98] D. M. Gordon. “A Survey of Fast Exponentiation Methods.” Journal of Algorithms, vol. 27, no. 1, pp. 129–146, April 1998.
[HWB93] M. A. Hasan, M. Z. Wang, and V. K. Bhargava. “A Modified Massey-Omura Parallel Multiplier for a Class of Finite Fields.” IEEE Transactions on Computers, vol. 42, no. 10, pp. 1278–1280, October 1993.
[IT88] T. Itoh and S. Tsujii. “A Fast Algorithm for Computing Multiplicative Inverses in GF(2^m) Using Normal Basis.” Information and Computation, vol. 78, pp. 171–177, 1988.
[KKH03] S. Kwon, C. H. Kim, and C. P. Hong. “Efficient Exponentiation for a Class of Finite Fields GF(2^m) Determined by Gauss Periods.” CHES 2003, LNCS 2779, pp. 228–242, 2003.
[Knu81] D. E. Knuth. The Art of Computer Programming, vol. 2: Seminumerical Algorithms. Addison-Wesley, MA, USA, 2nd ed., 1981.
[KS98] Ç. K. Koç and B. Sunar. “Low-Complexity Bit-Parallel Canonical and Normal Basis Multipliers for a Class of Finite Fields.” IEEE Transactions on Computers, vol. 47, no. 3, pp. 353–356, March 1998.
[LN94] R. Lidl and H. Niederreiter. Introduction to Finite Fields and Their Applications. Cambridge University Press, Cambridge, 1994.
[MBGMVY93] A. J. Menezes, I. Blake, X. Gao, R. Mullin, S. Vanstone, and T. Yaghoobian. Applications of Finite Fields. Kluwer Academic Publishers, Boston, MA, 1993.
[MO86] J. L. Massey and J. K. Omura. “Computational Method and Apparatus for Finite Field Arithmetic.” US Patent No. 4,587,627, 1986.
[MOVW88] R. C. Mullin, I. M. Onyszchuk, S. A. Vanstone, and R. M. Wilson. “Optimal Normal Bases in GF(p^n).” Discrete Applied Mathematics, vol. 22, pp. 149–161, 1988/1989.
[RH00] A. Reyhani-Masoleh and M. Anwar Hasan. “On Efficient Normal Basis Multiplication.” INDOCRYPT 2000, LNCS 1977, pp. 213–224, 2000.
[RH02] A. Reyhani-Masoleh and M. Anwar Hasan. “A New Construction of Massey-Omura Parallel Multiplier over GF(2^m).” IEEE Transactions on Computers, vol. 51, no. 5, pp. 511–520, May 2002.
[RH03a] A. Reyhani-Masoleh and M. Anwar Hasan. “Efficient Multiplication Beyond Optimal Normal Bases.” IEEE Transactions on Computers, vol. 52, no. 4, pp. 428–439, April 2003.
[RH03b] A. Reyhani-Masoleh and M. Anwar Hasan. “Low Complexity Sequential Normal Basis Multipliers over GF(2^m).” 16th IEEE Symposium on Computer Arithmetic, ARITH’03, pp. 188–195, June 2003.
[RH05] A. Reyhani-Masoleh and M. Anwar Hasan. “Low Complexity Word-Level Sequential Normal Basis Multipliers.” IEEE Transactions on Computers, vol. 54, no. 2, pp. 98–110, February 2005.
[Sti90] D. R. Stinson. “Some Observations on Parallel Algorithms for Fast Exponentiation in GF(2^m).” SIAM Journal on Computing, vol. 19, pp. 711–717, 1990.
[Sun06] B. Sunar. “A Euclidean Algorithm for Normal Bases.” Acta Applicandae Mathematicae, vol. 93, pp. 57–74, September 2006.
[TYT01] N. Takagi, J. Yoshiki, and K. Takagi. “A Fast Algorithm for Multiplicative Inversion in GF(2^m) Using Normal Basis.” IEEE Transactions on Computers, vol. 50, no. 5, pp. 394–398, May 2001.
[Wan86] C. C. Wang. “A Generalized Algorithm to Design Finite Field Normal Basis Multipliers.” The Telecommunications and Data Acquisition Progress Report 42-87, pp. 125–139, July-September 1986.
[WTSDOR85] C. C. Wang, T. K. Truong, H. M. Shao, L. J. Deutsch, J. K. Omura, and I. S. Reed. “VLSI Architectures for Computing Multiplications and Inverses in GF(2^m).” IEEE Transactions on Computers, vol. C-34, no. 8, pp. 709–717, August 1985.
[YKPKL05] D. J. Yang, C. H. Kim, Y. Park, Y. Kim, and J. Lim. “Modified Sequential Normal Basis Multipliers for Type II Optimal Normal Bases.” ICCSA 2005, LNCS 3481, pp. 647–656, 2005.
[YL04] H. S. Yoo and D. Lee. “Computation of Multiplicative Inverses in GF(2^m) Using Palindromic Representation.” ICCSA 2004, LNCS 3043, pp. 510–516, 2004.

CHAPTER 9

Operations over GF (2m)—Other Bases

Polynomial and normal bases, studied in Chaps. 7 and 8, respectively, are the two most used bases for the representation of the field elements over the binary field $GF(2^m)$. However, there are other bases that can be used for the efficient computation of arithmetic operations over $GF(2^m)$.

9.1 Dual Bases

The definition of dual bases is based on the trace function and the concept of duality [LN83]. Duality was already introduced in Chap. 1, where the definition of the trace function was also given. A nonzero linear function h from $GF(2^m)$ to GF(2) is a function such that for all $\chi, \delta \in GF(2^m)$ and $c \in GF(2)$, $h(\chi + \delta) = h(\chi) + h(\delta)$ and $h(c \cdot \chi) = c \cdot h(\chi)$ hold. The trace function is a linear function from $GF(2^m)$ to GF(2) in such a way that the trace of $\beta \in GF(2^m)$ is defined to be $\mathrm{Tr}(\beta) = \sum_{i=0}^{m-1} \beta^{2^i}$. As shown in Chap. 1, two bases $\{\lambda_0, \lambda_1, \ldots, \lambda_{m-1}\}$ and $\{\mu_0, \mu_1, \ldots, \mu_{m-1}\}$ are said to be dual to one another if the following condition is satisfied by the trace values of the basis elements [Ber82]:

\[ \mathrm{Tr}(\lambda_i \mu_j) = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{if } i \neq j \end{cases} \tag{9.1} \]

Let $\{\lambda_0, \lambda_1, \ldots, \lambda_{m-1}\}$ be a basis of $GF(2^m)$ and let $\{\mu_0, \mu_1, \ldots, \mu_{m-1}\}$ be its dual basis. Then a field element $A \in GF(2^m)$ can be represented in the dual basis by the following expansion [HTDR88]:

\[ A = \sum_{i=0}^{m-1} a_i \mu_i = \sum_{i=0}^{m-1} \mathrm{Tr}(A \lambda_i) \mu_i \tag{9.2} \]

Using Eq. (9.2), the multiplication of two field elements can be given as follows [HTDR88]. Let $\{\lambda_0, \lambda_1, \ldots, \lambda_{m-1}\}$ be a basis of $GF(2^m)$ and let $\{\mu_0, \mu_1, \ldots, \mu_{m-1}\}$ be its dual basis. Then the product $C = AB$ of two field elements $A, B \in GF(2^m)$ can be represented in the dual basis as follows:

$$C = \sum_{i=0}^{m-1} c_i \mu_i = \sum_{i=0}^{m-1} \mathrm{Tr}(C\lambda_i)\,\mu_i = \sum_{i=0}^{m-1} \mathrm{Tr}(AB\lambda_i)\,\mu_i \tag{9.3}$$

where $c_i = \mathrm{Tr}(C\lambda_i)$ is the $i$th coefficient of the product in the dual basis, $A = \sum_{i=0}^{m-1} a_i \mu_i$, and $B = \sum_{i=0}^{m-1} b_i \lambda_i$. Therefore, the element $B$ is represented in the basis $\{\lambda_0, \lambda_1, \ldots, \lambda_{m-1}\}$, whereas the element $A$ and the product $C$ are represented in the dual basis $\{\mu_0, \mu_1, \ldots, \mu_{m-1}\}$.

Some bases other than the dual basis can still achieve the dual basis style. These bases, normally referred to as weakly dual bases, are obtained by inserting a fixed nonzero field element into the trace function that defines the dual basis. Weakly dual bases can be defined as follows ([WH98], [WH01]). Let $\{\lambda_0, \lambda_1, \ldots, \lambda_{m-1}\}$ and $\{\mu_0, \mu_1, \ldots, \mu_{m-1}\}$ be two bases for $GF(2^m)$ and let $\beta \in GF(2^m)$, $\beta \neq 0$. Then $\{\mu_0, \mu_1, \ldots, \mu_{m-1}\}$ is a weakly dual basis of $\{\lambda_0, \lambda_1, \ldots, \lambda_{m-1}\}$, and the two bases are said to be weakly dual to each other, if $\mathrm{Tr}(\beta\lambda_i\mu_j) = \delta_{ij}$, $i, j = 0, 1, \ldots, m-1$, where $\delta_{ij}$ is the Kronecker delta function, which is equal to 1 if $i = j$ and 0 otherwise ([WH98], [WHB98], [WH01]). It must be noted that this condition is the condition given in Eq. (9.1) with the element $\beta$ inserted into the trace function. Therefore, when $\beta = 1$, $\{\lambda_0, \lambda_1, \ldots, \lambda_{m-1}\}$ and $\{\mu_0, \mu_1, \ldots, \mu_{m-1}\}$ are a pair of dual bases.

The above idea of the trace function can be extended to include any general linear function [FBT96a], because any linear function $h$ from $GF(2^m)$ to $GF(2)$ can be written in the form $h(z) = \mathrm{Tr}(\beta z)$, $\forall z \in GF(2^m)$, for some $\beta$ in $GF(2^m)$. As a result, there are $2^m$ linear functions $h$ from $GF(2^m)$ to $GF(2)$. Furthermore, these functions are of the form

$$h(z) = \sum_{i=0}^{m-1} x_i z_i \qquad \forall z \in GF(2^m) \tag{9.4}$$

where $x_i \in GF(2)$ and the addition is performed modulo 2 [FBT96a]. Let $\{\lambda_0, \lambda_1, \ldots, \lambda_{m-1}\}$ be a basis for $GF(2^m)$ and let $h$ be a nonzero linear function from $GF(2^m)$ to $GF(2)$. Then the dual basis of $\{\lambda_0, \lambda_1, \ldots, \lambda_{m-1}\}$ with respect to the function $h$ is the basis $\{\mu_0, \mu_1, \ldots, \mu_{m-1}\}$ such that $h(\lambda_i\mu_j) = 1$ if $i = j$, and 0 otherwise [KL99].

When $\{\lambda_0, \lambda_1, \ldots, \lambda_{m-1}\}$ is the polynomial basis $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$, linear functions of the form $h(z) = z_i$, $i \in \{0, 1, \ldots, m-1\}$, are very useful because the values of $h(z)$, for every $z$ in $GF(2^m)$, can be directly obtained from the polynomial basis representation of $z$ without any further computation. From this fact, the following new definition of duality was given in [FBT96a]. Let $\{\lambda_0, \lambda_1, \ldots, \lambda_{m-1}\}$ and $\{\mu_0, \mu_1, \ldots, \mu_{m-1}\}$ be two bases for $GF(2^m)$, let $h$ be a nonzero linear function from $GF(2^m)$ to $GF(2)$, and let $\beta \in GF(2^m)$, $\beta \neq 0$. Then the two bases are said to be dual with respect to $h$ and $\beta$ if

$$h(\beta\lambda_i\mu_j) = \delta_{ij}, \qquad i, j = 0, 1, \ldots, m-1 \tag{9.5}$$

where $\delta_{ij}$ is the Kronecker delta function, and $\{\lambda_0, \lambda_1, \ldots, \lambda_{m-1}\}$ and $\{\mu_0, \mu_1, \ldots, \mu_{m-1}\}$ are the polynomial and dual bases, respectively. In this case, any element $A \in GF(2^m)$ can be represented in the dual basis as follows [FBT96a]:

$$A = \sum_{i=0}^{m-1} h(A\beta\lambda_i)\,\mu_i \tag{9.6}$$

By using different values of $h$ and $\beta$ in Eq. (9.5), for any given basis there are now $2^m - 1$ dual bases instead of only one. Therefore, the most convenient dual basis can be selected so that the complexity of the dual to polynomial basis conversion is reduced. For certain $GF(2^m)$, $\beta$ can be chosen so that the dual basis is just a reordering of the polynomial basis [MKW89]. Morii et al. [MKW89] considered the case where $h$ is the trace function. However, these results were extended to any general function $h$ in [Fen93], [FBT96a].

Using the above definition of duality, the following multiplication algorithm over $GF(p^m)$ was given in [Fen93], [FBT96a]. Let $A, B, C \in GF(p^m)$ so that $C = AB$. Let $\alpha$ be a root of the defining irreducible polynomial $f(x)$ for the field, and let $h$ be a linear function from $GF(p^m)$ to $GF(p)$. If $\beta \in GF(p^m)$ and $B$ is represented over the polynomial basis as $B = \sum_{i=0}^{m-1} b_i\alpha^i$, then the multiplication can be obtained as [FBT96a]

$$\begin{pmatrix} h(A\beta) & h(A\beta\alpha) & \cdots & h(A\beta\alpha^{m-1}) \\ h(A\beta\alpha) & h(A\beta\alpha^2) & \cdots & h(A\beta\alpha^{m}) \\ \vdots & \vdots & \ddots & \vdots \\ h(A\beta\alpha^{m-1}) & h(A\beta\alpha^{m}) & \cdots & h(A\beta\alpha^{2m-2}) \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{m-1} \end{pmatrix} = \begin{pmatrix} h(C\beta) \\ h(C\beta\alpha) \\ \vdots \\ h(C\beta\alpha^{m-1}) \end{pmatrix} \tag{9.7}$$

A multiplier using dual basis was first proposed by Berlekamp [Ber82]. Using the above multiplication algorithm, the Berlekamp multiplier can be obtained [Ber82].

For $GF(2^m)$, if $a_j = h(A\beta\alpha^j)$, $j = 0, 1, \ldots, 2m-2$, and $c_j = h(C\beta\alpha^j)$, $j = 0, 1, \ldots, m-1$, then Eq. (9.7) can be rewritten as follows [FBT96a]:

$$\begin{pmatrix} a_0 & a_1 & \cdots & a_{m-1} \\ a_1 & a_2 & \cdots & a_m \\ \vdots & \vdots & \ddots & \vdots \\ a_{m-1} & a_m & \cdots & a_{2m-2} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{m-1} \end{pmatrix} = \begin{pmatrix} c_0 \\ c_1 \\ \vdots \\ c_{m-1} \end{pmatrix} \tag{9.8}$$

If $h$ and $\beta$ are taken as in Eq. (9.5), $a_j$ and $c_j$ ($j = 0, 1, \ldots, m-1$) in Eq. (9.8) are the dual basis coefficients of $A$ and $C$, respectively. Therefore, if the terms $a_j$ (with $j = m, m+1, \ldots, 2m-2$) can be generated, then Eq. (9.8) represents a dual basis multiplication algorithm. If $f(x)$ is the generating irreducible polynomial for $GF(2^m)$, then

$$a_m = h(A\beta\alpha^m) = h\!\left(A\left(\beta\sum_{i=0}^{m-1} f_i\alpha^i\right)\right) = \sum_{i=0}^{m-1} f_i\, h(A\beta\alpha^i) = \sum_{i=0}^{m-1} f_i a_i \tag{9.9}$$

where the $f_i$ are the coefficients of $f(x)$. In general, it can be proven that [FBT96a]

$$a_{m+k} = \sum_{i=0}^{m-1} f_i a_{i+k}, \qquad k = 0, 1, \ldots, m-2 \tag{9.10}$$

where the $a_i$, with $i = 0, 1, \ldots, m-1$, are the dual basis coefficients of $A$. Therefore, the above equations compute the product $C$ of the field elements $A$ and $B$, where $A$ and $C$ are represented in the dual basis, and $B$ is represented in the polynomial basis. Using the above equations, the following algorithm implements the dual-basis multiplication.

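Algorithm 9.1: Dual basis multiplication

   -- A(0 .. m-1) holds the dual basis coordinates of A;
   -- A(m .. 2*m-2) and C(0 .. m-1) are assumed to be initialized to 0
   -- Extension terms a(m+k), Eq. (9.10)
   for k in 0 .. m-2 loop
      for i in 0 .. m-1 loop
         A(m + k) := m2xor(A(m + k),m2and(F(i),A(i + k)));
      end loop;
   end loop;
   -- Dual basis coordinates of the product, Eq. (9.8)
   for j in 0 .. m-1 loop
      for i in 0 .. m-1 loop
         C(j) := m2xor(C(j),m2and(A(j + i),B(i)));
      end loop;
   end loop;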

In Algorithm 9.1, A has been defined as a bit vector from 0 to 2m – 2. An executable Ada file dual_mult.adb, including Algorithm 9.1, is available at www.arithmetic-circuits.org.
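The following Python sketch mirrors Algorithm 9.1 and compares it with an ordinary shift-and-add product in the polynomial basis. It is only an illustration under stated assumptions: the field is $GF(2^4)$ with $f(x) = x^4 + x + 1$, the linear function is taken as $h(z) = z_0$ with $\beta = 1$, coordinates are stored as Python lists (least significant coefficient first), and the helper names are hypothetical rather than those of dual_mult.adb.

```python
M = 4
F = [1, 1, 0, 0, 1]  # f_0 .. f_m of the assumed polynomial f(x) = x^4 + x + 1

def mul_alpha(p):
    """Multiply a polynomial-basis coordinate list by alpha and reduce with f(x)."""
    top = p[M - 1]
    return [top & F[0]] + [p[i - 1] ^ (top & F[i]) for i in range(1, M)]

def poly_mult(a, b):
    """Reference product in the polynomial basis (shift and add)."""
    c = [0] * M
    for bit in b:
        if bit:
            c = [x ^ y for x, y in zip(c, a)]
        a = mul_alpha(a)
    return c

def to_dual(p):
    """a_j = h(A * alpha^j), assuming h(z) = z_0 and beta = 1."""
    d = []
    for _ in range(M):
        d.append(p[0])
        p = mul_alpha(p)
    return d

def dual_mult(ad, b):
    """Algorithm 9.1: ad and the result in dual basis, b in polynomial basis."""
    a = list(ad) + [0] * (M - 1)        # A(0 .. 2m-2)
    for k in range(M - 1):              # extension terms, Eq. (9.10)
        for i in range(M):
            a[M + k] ^= F[i] & a[i + k]
    c = [0] * M
    for j in range(M):                  # Hankel product, Eq. (9.8)
        for i in range(M):
            c[j] ^= a[j + i] & b[i]
    return c

def bits(n):
    """Integer to coordinate list, least significant coefficient first."""
    return [(n >> i) & 1 for i in range(M)]

if __name__ == "__main__":
    for x in range(1 << M):
        for y in range(1 << M):
            expected = to_dual(poly_mult(bits(x), bits(y)))
            assert dual_mult(to_dual(bits(x)), bits(y)) == expected
    print("Algorithm 9.1 agrees with the polynomial-basis product")
```

Computing the dual coordinates by repeated multiplication by $\alpha$ keeps the sketch independent of any particular closed-form conversion.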

It must be noted that, from the above equations, the irreducible polynomial $f(x)$ selected for the field $GF(2^m)$ influences the multiplication complexity. Furthermore, field elements are normally represented in a single basis (usually the polynomial basis). Therefore, a convenient dual basis must be selected so that the complexity of the dual to polynomial (and polynomial to dual) basis conversion is reduced. Morii et al. [MKW89] demonstrated that when the irreducible polynomial for $GF(2^m)$ is a trinomial of the form $f(x) = x^m + x^k + 1$ ($m > k$) or a pentanomial of the form $f(x) = x^m + x^{k+2} + x^{k+1} + x^k + 1$ ($m > k + 2$), then optimal dual bases can be found [FBT96a]. When the defining irreducible polynomial is a trinomial of the form $f(x) = x^m + x^k + 1$ ($m > k$) and if $\beta$ is selected in Eq. (9.5) so that [FBT96a]

$$h(\beta\alpha^i) = \begin{cases} 1, & i = k-1 \\ 0, & i = 0, 1, \ldots, m-1 \ (\text{with } i \neq k-1) \end{cases} \tag{9.11}$$

then the optimal dual basis to the polynomial basis is ([MKW89], [Fen93])

$$\{\alpha^{k-1}, \alpha^{k-2}, \ldots, \alpha, 1, \alpha^{m-1}, \alpha^{m-2}, \ldots, \alpha^{k}\} \tag{9.12}$$

Therefore, the dual basis is merely a permutation of the polynomial basis and the basis conversion can be implemented simply by wiring. On the other hand, if the defining irreducible polynomial for $GF(2^m)$ is a pentanomial of the form $f(x) = x^m + x^{k+2} + x^{k+1} + x^k + 1$ ($m > k + 2$) and if $\beta$ is selected in Eq. (9.5) in such a way that [FBT96a]

$$h(\beta\alpha^i) = \begin{cases} 1, & i = 0, k \\ 0, & i = 1, \ldots, m-1 \ (\text{with } i \neq k) \end{cases} \tag{9.13}$$

then the optimal dual basis to the polynomial basis is ([MKW89], [Fen93])

$$\{\alpha^{k}, \alpha^{k-1}, \alpha^{k-2}, \ldots, \alpha, 1 + \alpha^{k}, \alpha^{k+1} + \alpha^{m-1}, \alpha^{m-2}, \alpha^{m-3}, \ldots, \alpha^{k+2}, \alpha^{k+1}\} \tag{9.14}$$

Therefore, the dual basis can be obtained from the polynomial basis with two mod 2 additions and a reordering of basis coefficients. Optimal dual bases for $m = 2, 3, \ldots, 10$, using irreducible trinomials and pentanomials, were given in [FBT96a].
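Eqs. (9.12) and (9.14) can be checked numerically by computing the dual coordinates $d_j = h(\beta A\alpha^j)$ directly and comparing them with the predicted reordering. The Python sketch below does this for one trinomial and one pentanomial; the two polynomials, the list encoding of coordinates, and the helper names are assumptions chosen for illustration. By linearity, Eq. (9.11) means that $h(\beta z)$ is the polynomial coordinate $z_{k-1}$, and Eq. (9.13) means that it is $z_0 + z_k$, which is what the `taps` argument encodes.

```python
def mul_alpha(p, f):
    """Multiply polynomial-basis coordinates p by alpha modulo f = [f_0, ..., f_m]."""
    m, top = len(p), p[-1]
    return [top & f[0]] + [p[i - 1] ^ (top & f[i]) for i in range(1, m)]

def dual_coords(p, f, taps):
    """d_j = h(beta * A * alpha^j), where z -> h(beta * z) is the XOR of the
    polynomial coordinates z_i with i in taps (Eqs. (9.11) and (9.13))."""
    d = []
    for _ in range(len(p)):
        d.append(sum(p[i] for i in taps) & 1)
        p = mul_alpha(p, f)
    return d

def bits(n, m):
    """Integer to coordinate list, least significant coefficient first."""
    return [(n >> i) & 1 for i in range(m)]

# Trinomial x^5 + x^2 + 1 (m = 5, k = 2): Eq. (9.12) gives the ordering
# alpha, 1, alpha^4, alpha^3, alpha^2, that is, d_j = p_PERM[j].
TRI = [1, 0, 1, 0, 0, 1]
PERM = [1, 0, 4, 3, 2]

# Pentanomial x^8 + x^4 + x^3 + x^2 + 1 (m = 8, k = 2): Eq. (9.14) gives
# alpha^2, alpha, 1 + alpha^2, alpha^3 + alpha^7, alpha^6, alpha^5, alpha^4, alpha^3.
PENTA = [1, 0, 1, 1, 1, 0, 0, 0, 1]

def penta_expected(p):
    """Polynomial to dual conversion implied by the basis of Eq. (9.14)."""
    return [p[0] ^ p[2], p[1], p[0], p[7], p[6], p[5], p[4], p[3] ^ p[7]]

if __name__ == "__main__":
    for a in range(1 << 5):
        p = bits(a, 5)
        assert dual_coords(p, TRI, [1]) == [p[i] for i in PERM]
    for a in range(1 << 8):
        p = bits(a, 8)
        assert dual_coords(p, PENTA, [0, 2]) == penta_expected(p)
    print("Eqs. (9.12) and (9.14) confirmed for the two example polynomials")
```

For the trinomial the conversion is a pure permutation; for the pentanomial exactly two coordinates need one XOR each, in line with the text.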

For example, let $f(x) = x^8 + x^4 + x^3 + x^2 + 1$ be the defining irreducible polynomial for $GF(2^8)$ and let $\beta = \alpha^{253}$. Then the optimal dual basis is $\{\alpha^2, \alpha, 1 + \alpha^2, \alpha^3 + \alpha^7, \alpha^6, \alpha^5, \alpha^4, \alpha^3\}$ in accordance with Eqs. (9.13) and (9.14), as given in [FBT96a]. In such a case, if $p_i$ and $d_i$ represent the coordinates of an element in the polynomial and dual bases, then the conversion from dual to polynomial basis can be performed as $p_0 = d_2$, $p_1 = d_1$, $p_2 = d_0 + d_2$, $p_3 = d_3 + d_7$, $p_4 = d_6$, $p_5 = d_5$, $p_6 = d_4$, $p_7 = d_3$. Conversely, the conversion from polynomial to dual basis can be performed as $d_0 = p_0 + p_2$, $d_1 = p_1$, $d_2 = p_0$, $d_3 = p_7$, $d_4 = p_6$, $d_5 = p_5$, $d_6 = p_4$, $d_7 = p_3 + p_7$. In the following algorithm, the dual basis multiplication over $GF(2^8)$ using $f(x) = x^8 + x^4 + x^3 + x^2 + 1$ with $\beta = \alpha^{253}$ is given. In Algorithm 9.2, conversions from dual to polynomial (and polynomial to dual) basis are performed in such a way that the product $C$ and the input elements $A$ and $B$ are represented in the polynomial basis.

Algorithm 9.2: Dual basis multiplication for GF(2^8) with f(x) = x^8 + x^4 + x^3 + x^2 + 1

   -- Basis conversion: polynomial (A) to dual (Ad)
   Ad(0) := m2xor(A(0),A(2)); Ad(1) := A(1); Ad(2) := A(0); Ad(3) := A(7);
   Ad(4) := A(6); Ad(5) := A(5); Ad(6) := A(4); Ad(7) := m2xor(A(3),A(7));
   for i in m .. 2*m-2 loop Ad(i) := 0; end loop;
   for i in 0 .. m-1 loop Cd(i) := 0; C(i) := 0; end loop;
   -- Extension terms Ad(m+k), Eq. (9.10)
   for k in 0 .. m-2 loop
      for i in 0 .. m-1 loop
         Ad(m + k) := m2xor(Ad(m + k),m2and(F(i),Ad(i + k)));
      end loop;
   end loop;
   -- Dual basis coordinates of the product, Eq. (9.8)
   for j in 0 .. m-1 loop
      for i in 0 .. m-1 loop
         Cd(j) := m2xor(Cd(j),m2and(Ad(j + i),B(i)));
      end loop;
   end loop;
   -- Basis conversion: dual (Cd) to polynomial (C)
   C(0) := Cd(2); C(1) := Cd(1); C(2) := m2xor(Cd(0),Cd(2)); C(3) := m2xor(Cd(3),Cd(7));
   C(4) := Cd(6); C(5) := Cd(5); C(6) := Cd(4); C(7) := Cd(3);

In Algorithm 9.2, Ad and Cd represent the elements A and C expressed in the dual basis, where Ad has been defined as a bit vector from 0 to 2m – 2. An executable Ada file dual_mult_conv_8.adb, including Algorithm 9.2, is available at www.arithmetic-circuits.org.

Squaring over the dual basis can also be computed using the above formulation as follows [FBT96b]. Let $X, Y \in GF(2^m)$ so that $Y = X^2$ and let $\{\mu_0, \mu_1, \ldots, \mu_{m-1}\}$ be a basis for $GF(2^m)$. Moreover, let $\alpha$ be a root of the defining irreducible polynomial $f(x)$ for the field, let $\beta \in GF(2^m)$, and let $h$ be a linear function from $GF(2^m)$ to $GF(2)$. If $X$ is represented over this basis by $X = \sum_{i=0}^{m-1} x_i\mu_i$, then $X^2 = \sum_{i=0}^{m-1} x_i\mu_i^2$ and [FBT96b]

$$h(\beta Y\alpha^j) = h(\beta X^2\alpha^j) = h\!\left(\beta\left(\sum_{i=0}^{m-1} x_i\mu_i^2\right)\alpha^j\right) = \sum_{i=0}^{m-1} x_i\, h(\beta\mu_i^2\alpha^j), \qquad j = 0, 1, \ldots, m-1 \tag{9.15}$$

If now the basis $\{\mu_0, \mu_1, \ldots, \mu_{m-1}\}$ is taken to be the dual basis to the polynomial basis with respect to $h$ and $\beta$, then $y_i = h(\beta Y\alpha^i)$, where $Y = \sum_{i=0}^{m-1} y_i\mu_i$. Therefore the dual basis representation of $Y$ is given as follows [FBT96b]:

$$\begin{pmatrix} h(\beta Y) \\ h(\beta Y\alpha) \\ \vdots \\ h(\beta Y\alpha^{m-1}) \end{pmatrix} = \begin{pmatrix} h(\beta\mu_0^2) & h(\beta\mu_1^2) & \cdots & h(\beta\mu_{m-1}^2) \\ h(\beta\mu_0^2\alpha) & h(\beta\mu_1^2\alpha) & \cdots & h(\beta\mu_{m-1}^2\alpha) \\ \vdots & \vdots & \ddots & \vdots \\ h(\beta\mu_0^2\alpha^{m-1}) & h(\beta\mu_1^2\alpha^{m-1}) & \cdots & h(\beta\mu_{m-1}^2\alpha^{m-1}) \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{m-1} \end{pmatrix} \tag{9.16}$$

For example, let $f(x) = x^4 + x + 1$ be the defining irreducible polynomial for $GF(2^4)$ and let $\alpha$ be a root of $f(x)$. If $h$ is the least significant polynomial coefficient and $\beta = 1$, then the optimal dual basis to the polynomial basis $\{1, \alpha, \alpha^2, \alpha^3\}$ is $\{1, \alpha^3, \alpha^2, \alpha\}$, in accordance with Eqs. (9.11) and (9.12) [FBT96a]. Therefore, using Eq. (9.16), the coordinates $y_i$ of the square $Y = X^2$ of $X$ can be given as follows [FBT96b]:

$$\begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{pmatrix} \tag{9.17}$$
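The matrix of Eq. (9.17) can be cross-checked by squaring every element of $GF(2^4)$ in the polynomial basis and converting the result to the dual basis $\{1, \alpha^3, \alpha^2, \alpha\}$. The Python sketch below does so; the integer encoding of field elements and the names `gf_mul`, `PERM`, `SQ`, and `dual_square` are illustrative assumptions.

```python
M, F = 4, 0b10011  # GF(2^4), f(x) = x^4 + x + 1

def gf_mul(a, b):
    """Product in GF(2^M); bit i of an integer is the coefficient of alpha^i."""
    r = 0
    for i in range(M):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * M - 2, M - 1, -1):  # reduce modulo f(x)
        if (r >> i) & 1:
            r ^= F << (i - M)
    return r

# Dual basis {1, alpha^3, alpha^2, alpha}: dual coordinate j is the
# polynomial coordinate of alpha^PERM[j].
PERM = [0, 3, 2, 1]

# Squaring matrix of Eq. (9.17), one row per output coordinate y_i.
SQ = [[1, 0, 1, 0],
      [0, 1, 0, 0],
      [0, 1, 0, 1],
      [0, 0, 1, 0]]

def dual_to_int(x):
    """Dual coordinates to the integer (polynomial basis) encoding."""
    return sum(xj << PERM[j] for j, xj in enumerate(x))

def int_to_dual(v):
    """Integer (polynomial basis) encoding to dual coordinates."""
    return [(v >> PERM[j]) & 1 for j in range(M)]

def dual_square(x):
    """y = SQ * x over GF(2), Eq. (9.17)."""
    return [sum(SQ[r][c] & x[c] for c in range(M)) & 1 for r in range(M)]

if __name__ == "__main__":
    for v in range(1 << M):
        x = int_to_dual(v)
        assert dual_square(x) == int_to_dual(gf_mul(v, v))
    print("Eq. (9.17) matches squaring in GF(2^4)")
```

Because squaring is linear over $GF(2)$, checking the matrix on all 16 elements is equivalent to checking it on the four basis elements.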

where $X$ and $Y$ are represented in the dual basis. Using Fermat’s theorem, the inverse of an element over the dual basis in $GF(2^m)$ can be found by successive squaring and multiplication. From Fermat’s theorem, that is, $X^{2^m - 1} = 1$, it follows that $X^{2^m} = X$. Therefore, the inversion can be carried out by computing $X^{-1} = X^{2^m - 2}$, for $X \neq 0 \in GF(2^m)$. Since $2^m - 2 = 2 + 2^2 + 2^3 + \cdots + 2^{m-1}$, $X^{-1}$ can be expressed as [FBT96b]

$$X^{-1} = X^{2^m - 2} = \left(X^{2^1}\right)\left(X^{2^2}\right)\left(X^{2^3}\right) \cdots \left(X^{2^{m-1}}\right) \tag{9.18}$$

Assume that the function

   function dual_sq_GF4(Ad: poly_vector) return poly_vector;

which performs the squaring over the dual basis in $GF(2^4)$ as given in Eq. (9.17), is available. Assume also that the function

   function dual_mult(Ad,B,F: poly_vector) return poly_vector;

which performs the dual basis multiplication as given in Algorithm 9.1, is also available, where the input operand B is represented in the polynomial basis, the input operand Ad and the product are represented in the dual basis, and F is the defining irreducible polynomial for $GF(2^m)$. Then the following algorithm implements the inversion $INV = A^{-1}$ given in Eq. (9.18) over the dual basis $\{1, \alpha^3, \alpha^2, \alpha\}$ for the field $GF(2^4)$ generated by $f(x) = x^4 + x + 1$.

Algorithm 9.3: Inversion in dual basis for GF(2^4) with f(x) = x^4 + x + 1

   -- Basis conversion: polynomial (a) to dual (ad)
   ad(0) := a(0); ad(1) := a(3); ad(2) := a(2); ad(3) := a(1);
   bd := dual_sq_GF4(ad);
   c(0) := 1;
   k := 0;
   while k < m-1 loop
      dd := dual_mult(bd,c,F);
      k := k + 1;
      if k = m-1 then
         invd := dd;
      end if;
      if k < m-1 then
         bd := dual_sq_GF4(bd);
         cd := dd;
         -- Basis conversion: dual (cd) to polynomial (c)
         c(0) := cd(0); c(1) := cd(3); c(2) := cd(2); c(3) := cd(1);
      end if;
   end loop;
   -- Basis conversion: dual (invd) to polynomial (inv)
   inv(0) := invd(0); inv(1) := invd(3); inv(2) := invd(2); inv(3) := invd(1);

In Algorithm 9.3, the input operand A is given in the polynomial basis. Therefore, a polynomial to dual basis conversion is first performed in order to obtain Ad. Furthermore, the function dual_mult performs the dual basis multiplication with one of its operands represented in the polynomial basis. Therefore, a dual to polynomial basis conversion (from Cd to C) must also be performed. Finally, the result is given in the dual basis (INVd) and a conversion from dual to polynomial basis is performed in order to obtain INV represented in the polynomial basis. It must be noted that Algorithm 9.3 is the dual basis version of the inversion algorithm in normal basis given in Algorithm 8.9. An executable Ada file dual_inversion.adb, including Algorithm 9.3, is available at www.arithmetic-circuits.org.
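The square-and-multiply chain of Eq. (9.18) can also be sketched independently of the basis. The Python fragment below works directly on a polynomial-basis integer encoding rather than on the dual basis used by Algorithm 9.3, so it illustrates only the exponentiation pattern; the field $GF(2^4)$ with $f(x) = x^4 + x + 1$ and the names `gf_mul` and `gf_inv` are assumptions.

```python
M, F = 4, 0b10011  # GF(2^4), f(x) = x^4 + x + 1

def gf_mul(a, b):
    """Product in GF(2^M); bit i of an integer is the coefficient of alpha^i."""
    r = 0
    for i in range(M):
        if (b >> i) & 1:
            r ^= a << i
    for i in range(2 * M - 2, M - 1, -1):  # reduce modulo f(x)
        if (r >> i) & 1:
            r ^= F << (i - M)
    return r

def gf_inv(x):
    """X^(-1) = X^2 * X^4 * ... * X^(2^(M-1)), Eq. (9.18); x must be nonzero."""
    sq = gf_mul(x, x)          # X^2, the first factor
    inv = sq
    for _ in range(M - 2):     # remaining factors X^4 .. X^(2^(M-1))
        sq = gf_mul(sq, sq)
        inv = gf_mul(inv, sq)
    return inv

if __name__ == "__main__":
    for x in range(1, 1 << M):
        assert gf_mul(x, gf_inv(x)) == 1
    print("all 15 nonzero elements inverted correctly")
```

The loop uses $m - 1$ squarings and $m - 2$ multiplications, the same operation counts as the `while` loop of Algorithm 9.3, whose first multiplication is by 1.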


9.2 Triangular Bases

Let $f(x) = \sum_{i=0}^{m} f_i x^i$ be an irreducible binary polynomial of degree $m$, with $f_m = f_0 = 1$ and $f_i \in \{0, 1\}$ for $0 < i < m$, and let $\alpha \in GF(2^m)$ be a root of $f(x)$. The set $\Lambda = \{\lambda_0, \lambda_1, \ldots, \lambda_{m-1}\}$ of $m$ elements is called the triangular basis of the polynomial basis $\Omega = \{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$ if

$$\lambda_i = \sum_{j=i}^{m-1} f_{j+1}\,\alpha^{j-i}, \qquad 0 \le i \le m-1 \tag{9.19}$$

where the $f_i$ are the coefficients of the irreducible polynomial $f(x)$ ([HB95], [Has98], [FBF97], [HW00]). For any element $A \in GF(2^m)$, its coordinates with respect to the polynomial basis $\Omega$ and the triangular basis $\Lambda$ can be denoted as $A_\Omega = (a_0, a_1, \ldots, a_{m-1})$ and $A_\Lambda = (\bar{a}_0, \bar{a}_1, \ldots, \bar{a}_{m-1})$. Since $A = \sum_{i=0}^{m-1} a_i\alpha^i = \sum_{i=0}^{m-1} \bar{a}_i\lambda_i$, the conversion of the $\Lambda$ coordinates to the corresponding $\Omega$ coordinates can be performed as follows [HW00]:

$$a_{m-1-i} = \begin{cases} \bar{a}_0, & i = 0 \\ \sum_{j=0}^{i} \bar{a}_{i-j}\, f_{m-j}, & 1 \le i \le m-1 \end{cases} \tag{9.20}$$

while the conversion of the $\Omega$ to $\Lambda$ coordinates can be performed as [HW00]

$$\bar{a}_i = \begin{cases} a_{m-1}, & i = 0 \\ a_{m-1-i} + \sum_{j=0}^{i-1} \bar{a}_{i-1-j}\, f_{m-1-j}, & 1 \le i \le m-1 \end{cases} \tag{9.21}$$

Equation (9.21) can also be expressed as $A_\Lambda^{T} = \mathbf{T} A_\Omega^{T}$, where the superscript $T$ denotes vector transposition and where the transformation matrix $\mathbf{T}$ is given as [Has98]

$$\mathbf{T} = \begin{pmatrix} 0 & 0 & 0 & \cdots & 0 & 0 & 1 \\ 0 & 0 & 0 & \cdots & 0 & 1 & t_1 \\ 0 & 0 & 0 & \cdots & 1 & t_1 & t_2 \\ \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\ 0 & 1 & t_1 & \cdots & t_{m-4} & t_{m-3} & t_{m-2} \\ 1 & t_1 & t_2 & \cdots & t_{m-3} & t_{m-2} & t_{m-1} \end{pmatrix} \tag{9.22}$$

with $t_j = \sum_{i=0}^{j-1} f_{m-j+i}\, t_i$, $1 \le j \le m-1$, and $t_0 = 1$ [Has98].
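The two conversions of Eqs. (9.20) and (9.21) can be exercised against the basis definition of Eq. (9.19). The Python sketch below assumes $GF(2^4)$ with $f(x) = x^4 + x + 1$ and stores coordinates as lists; the function names are hypothetical.

```python
M = 4
F = [1, 1, 0, 0, 1]  # f_0 .. f_m of the assumed polynomial f(x) = x^4 + x + 1

def poly_to_tri(a):
    """Polynomial to triangular coordinates, Eq. (9.21)."""
    t = [0] * M
    t[0] = a[M - 1]
    for i in range(1, M):
        acc = a[M - 1 - i]
        for j in range(i):
            acc ^= t[i - 1 - j] & F[M - 1 - j]
        t[i] = acc
    return t

def tri_to_poly(t):
    """Triangular to polynomial coordinates, Eq. (9.20)."""
    a = [0] * M
    for i in range(M):
        acc = 0
        for j in range(i + 1):
            acc ^= t[i - j] & F[M - j]
        a[M - 1 - i] = acc
    return a

# Basis elements of Eq. (9.19): LAM[i][k] is the coefficient of alpha^k in
# lambda_i, namely f_(k+i+1) whenever k + i <= m - 1.
LAM = [[F[i + k + 1] if i + k + 1 <= M else 0 for k in range(M)] for i in range(M)]

def expand(t):
    """Polynomial coordinates of sum t_i * lambda_i, straight from Eq. (9.19)."""
    a = [0] * M
    for ti, lam in zip(t, LAM):
        if ti:
            a = [x ^ y for x, y in zip(a, lam)]
    return a

def bits(n):
    """Integer to coordinate list, least significant coefficient first."""
    return [(n >> i) & 1 for i in range(M)]

if __name__ == "__main__":
    for n in range(1 << M):
        t = bits(n)
        assert tri_to_poly(t) == expand(t)
        assert poly_to_tri(tri_to_poly(t)) == t
    print("Eqs. (9.19) to (9.21) are mutually consistent for this field")
```

For this polynomial the triangular basis is $\{1 + \alpha^3, \alpha^2, \alpha, 1\}$, which can be read off the rows of `LAM`.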

Triangular bases have been used by several authors for the implementation of $GF(2^m)$ arithmetic operations ([Has95], [Has98], [FBF97], [HB95], [IHT06], [IST06]). For example, an efficient algorithm for the computation of the $GF(2^m)$ inversion over the polynomial basis was given in [HW00]. This algorithm uses the expressions given above and is based on solving simultaneous linear equations over $GF(2)$ ([HB92], [Has95]). The inversion algorithm is given as follows. In order to compute the inverse $B$ of an element $A \in GF(2^m)$, with $A$ and $B$ represented in the polynomial basis, assume that the terms $h_i$ in $GF(2)$ are defined as follows [HW00]:

$$h_i = \begin{cases} a_{m-1}, & i = 0 \\ a_{m-1-i} + \sum_{j=0}^{i-1} h_{i-1-j}\, f_{m-1-j}, & 1 \le i \le m-1 \\ \sum_{j=0}^{m-1} h_{i-1-j}\, f_{m-1-j}, & m \le i \le 2m-1 \end{cases} \tag{9.23}$$

Also assume that $\alpha \in GF(2^m)$ is a root of the irreducible polynomial $f(x) = \sum_{i=0}^{m} f_i x^i$, $B$ is the inverse of $A$, and $g_i$ is the $i$th coordinate of the element $G = B + \alpha^m$ with respect to the polynomial basis. Then the following equation holds ([Has95], [HB92], [HW00]):

$$\begin{pmatrix} s_0 & s_1 & \cdots & s_{m-1} \\ s_1 & s_2 & \cdots & s_m \\ \vdots & \vdots & \ddots & \vdots \\ s_{m-2} & s_{m-1} & \cdots & s_{2m-3} \\ s_{m-1} & s_m & \cdots & s_{2m-2} \end{pmatrix} \begin{pmatrix} g_0 \\ g_1 \\ \vdots \\ g_{m-2} \\ g_{m-1} \end{pmatrix} = \begin{pmatrix} s_m \\ s_{m+1} \\ \vdots \\ s_{2m-2} \\ s_{2m-1} \end{pmatrix} \tag{9.24}$$

where the terms $s_i$ in $GF(2)$ are defined as

$$s_i = \begin{cases} h_i, & 0 \le i \le 2m-2 \\ h_i + 1, & i = 2m-1 \end{cases} \tag{9.25}$$

Using the above equations, the following algorithm for the computation of $A^{-1} = B = G + \alpha^m$ was given in [HW00]:

Algorithm 9.4: Algorithm for inversion over GF(2^m)

Input: $A \in GF(2^m)$ in polynomial basis.
Output: $B = A^{-1}$ in polynomial basis.

1. $P(x) = \sum_{i=0}^{m} p_i x^i = 1$
2. $Q(x) = \sum_{i=0}^{m} q_i x^i = 1$, $L = 0$, $s_0 = a_{m-1}$
3. for $i = 1$ to $2m$ do
4.    $d = \sum_{j=0}^{L} p_j\, s_{i-1-j}$
5.    $e = d$ if $i - 1 - 2L \ge 0$, and $e = 0$ otherwise
6.    $\begin{pmatrix} P(x) \\ Q(x) \end{pmatrix} = \begin{pmatrix} 1 & d \\ e & 1 - e \end{pmatrix} \begin{pmatrix} P(x) \\ xQ(x) \end{pmatrix}$
7.    $L = e(i - L) + (1 - e)L$
8.    $s_i = \begin{cases} a_{m-1-i} + \sum_{j=0}^{i-1} s_{i-1-j}\, f_{m-1-j}, & 1 \le i \le m-1 \\ \sum_{j=0}^{m-1} s_{i-1-j}\, f_{m-1-j}, & m \le i \le 2m-2 \\ 1 + \sum_{j=0}^{m-1} s_{i-1-j}\, f_{m-1-j}, & i = 2m-1 \\ \text{don't care}, & i = 2m \end{cases}$
9. end for
10. $G = (p_1, p_2, \ldots, p_{m-1}, p_m)$
11. $B = (p_1 + f_{m-1}, p_2 + f_{m-2}, \ldots, p_{m-1} + f_1, p_m + f_0)$

Assume that the function bit2int, which converts a bit into its corresponding integer, is available. Assume also that the functions m2xvvm and m2abvm are available: m2xvvm performs the bit-wise XOR of two bit vectors x and y with m + 1 bits (x0 XOR y0, x1 XOR y1, . . . , xm XOR ym), and m2abvm returns the product of a bit x and a bit vector y with m + 1 bits (x AND y0, x AND y1, . . . , x AND ym). Furthermore, assume that the product xQ(x) given in step 6 of Algorithm 9.4 is implemented using the function rshiftm, which performs a 1-bit right shift of a bit vector with m + 1 bits. Then, Algorithm 9.4 can be implemented as follows:

Algorithm 9.5: Inversion over GF(2^m)

   -- Computation of h and s, Eqs. (9.23) and (9.25);
   -- h is assumed to be initialized to 0
   h(0) := a(m-1);
   for i in 1 .. m-1 loop
      for j in 0 .. i-1 loop
         h(i) := m2xor(h(i),m2and(h(i-1-j),f(m-1-j)));
      end loop;
      h(i) := m2xor(a(m-1-i),h(i));
   end loop;
   for i in m .. 2*m-1 loop
      for j in 0 .. m-1 loop
         h(i) := m2xor(h(i),m2and(h(i-1-j),f(m-1-j)));
      end loop;
   end loop;
   for i in 0 .. 2*m-2 loop s(i) := h(i); end loop;
   s(2*m-1) := m2xor(h(2*m-1),1);
   -- Algorithm 9.4
   p(0) := 1; q(0) := 1; L := 0; s(0) := a(m-1);
   for i in 1 .. 2*m loop
      -- Step 4
      d := 0;
      for j in 0 .. L loop
         d := m2xor(d,m2and(p(j),s(i-1-j)));
      end loop;
      -- Step 5
      if (i-1-2*L) >= 0 then e := d; else e := 0; end if;
      -- Step 6
      Paux := P; Qaux := Q;
      P := m2xvvm(Paux,m2abvm(d,rshiftm(Qaux)));
      Q := m2xvvm(m2abvm(e,Paux),m2abvm(m2xor(1,e),rshiftm(Qaux)));
      -- Step 7
      L := bit2int(e)*(i - L) + bit2int(m2xor(1,e))*L;
      -- Step 8
      if i in 1..m-1 then
         s(i) := 0;
         for j in 0 .. i-1 loop
            s(i) := m2xor(s(i),m2and(s(i-1-j),f(m-1-j)));
         end loop;
         s(i) := m2xor(a(m-1-i),s(i));
      elsif i in m..2*m-2 then
         s(i) := 0;
         for j in 0 .. m-1 loop
            s(i) := m2xor(s(i),m2and(s(i-1-j),f(m-1-j)));
         end loop;
      elsif i = 2*m-1 then
         s(i) := 0;
         for j in 0 .. m-1 loop
            s(i) := m2xor(s(i),m2and(s(i-1-j),f(m-1-j)));
         end loop;
         s(i) := m2xor(1,s(i));
      end if;
   end loop;
   -- Steps 10 and 11
   for i in 0 .. m-1 loop g(i) := p(m-i); end loop;
   for i in 0 .. m-1 loop b(i) := m2xor(g(i),f(i)); end loop;

where the first instructions in Algorithm 9.5 compute the $h_i$ and $s_i$ terms given in Eqs. (9.23) and (9.25), respectively. An executable Ada file poly_inv_triangular_inv.adb, including Algorithm 9.5, is available at www.arithmetic-circuits.org.

Algorithm 9.4 can be used for performing the inversion of an element $A \in GF(2^m)$, $A^{-1} = B$, with $A$ represented in the triangular basis and $B$ represented in the polynomial basis, as follows [HW00]. The first $m$ elements of the sequence of $s_i$ terms in Algorithm 9.4 are essentially the coordinates of $A$ in the triangular basis: $s_0 = a_{m-1}$ in step 2, and the assignment of $s_i$ for $1 \le i \le m-1$ in step 8 of Algorithm 9.4 corresponds to Eq. (9.21). Therefore, $s_i = \bar{a}_i$, $0 \le i \le m-1$, in Algorithm 9.4. If $A$ is represented in the triangular basis with coefficients $\bar{a}_i$, $0 \le i \le m-1$, then $B = A^{-1}$ can be computed by simply assigning $s_i = \bar{a}_i$, $0 \le i \le m-1$, in Algorithm 9.4. In any case, $B$ is represented in the polynomial basis. Hence, the following algorithm implements the inversion in triangular basis:

Algorithm 9.6: Inversion over triangular basis for GF(2^m)

   -- Computation of h and s; h is assumed to be initialized to 0
   h(0) := a(m-1);
   for i in 1 .. m-1 loop
      for j in 0 .. i-1 loop
         h(i) := m2xor(h(i),m2and(h(i-1-j),f(m-1-j)));
      end loop;
      h(i) := m2xor(a(m-1-i),h(i));
   end loop;
   for i in m .. 2*m-1 loop
      for j in 0 .. m-1 loop
         h(i) := m2xor(h(i),m2and(h(i-1-j),f(m-1-j)));
      end loop;
   end loop;
   for i in 0 .. 2*m-2 loop s(i) := h(i); end loop;
   s(2*m-1) := m2xor(h(2*m-1),1);
   -- Algorithm 9.4 with s(i) := a(i), 0 <= i <= m-1 (A in triangular basis)
   p(0) := 1; q(0) := 1; L := 0; s(0) := a(0);
   for i in 1 .. 2*m loop
      d := 0;
      for j in 0 .. L loop
         d := m2xor(d,m2and(p(j),s(i-1-j)));
      end loop;
      if (i-1-2*L) >= 0 then e := d; else e := 0; end if;
      Paux := P; Qaux := Q;
      P := m2xvvm(Paux,m2abvm(d,rshiftm(Qaux)));
      Q := m2xvvm(m2abvm(e,Paux),m2abvm(m2xor(1,e),rshiftm(Qaux)));
      L := bit2int(e)*(i - L) + bit2int(m2xor(1,e))*L;
      if i in 1..m-1 then
         s(i) := a(i);
      elsif i in m..2*m-2 then
         s(i) := 0;
         for j in 0 .. m-1 loop
            s(i) := m2xor(s(i),m2and(s(i-1-j),f(m-1-j)));
         end loop;
      elsif i = 2*m-1 then
         s(i) := 0;
         for j in 0 .. m-1 loop
            s(i) := m2xor(s(i),m2and(s(i-1-j),f(m-1-j)));
         end loop;
         s(i) := m2xor(1,s(i));
      end if;
   end loop;
   for i in 0 .. m-1 loop g(i) := p(m-i); end loop;
   for i in 0 .. m-1 loop b(i) := m2xor(g(i),f(i)); end loop;

In Algorithm 9.6, $A$ is in triangular basis and $B = A^{-1}$ is in polynomial basis. Assume that the conversion from polynomial to triangular basis is implemented using Eq. (9.21) as follows:

   for i in 0 .. m-1 loop a(i) := 0; end loop;
   a(0) := apol(m-1);
   for i in 1 .. m-1 loop
      for j in 0 .. i-1 loop
         a(i) := m2xor(a(i),m2and(a(i-1-j),f(m-1-j)));
      end loop;
      a(i) := m2xor(apol(m-1-i),a(i));
   end loop;

where Apol and A represent the element in the polynomial and triangular basis, respectively. Then an executable Ada file triangular_inv.adb, including an algorithm implementing Algorithm 9.6 and the above basis conversion (hence with the input A given in polynomial basis), is available at www.arithmetic-circuits.org.

Multiplication in triangular basis can be performed using the above equations. Let $C$ be the product of $A$ and $B$, $C = AB$, with $A, B, C \in GF(2^m)$ and where $A$ and $C$ are represented in the triangular basis. Applying Eqs. (9.23) and (9.25) to an algorithm for multiplication given in [HB95], the following expression was given in [HW00]:

$$\begin{pmatrix} s_0 & s_1 & \cdots & s_{m-1} \\ s_1 & s_2 & \cdots & s_m \\ \vdots & \vdots & \ddots & \vdots \\ s_{m-2} & s_{m-1} & \cdots & s_{2m-3} \\ s_{m-1} & s_m & \cdots & s_{2m-2} \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_{m-2} \\ b_{m-1} \end{pmatrix} = \begin{pmatrix} \bar{c}_0 \\ \bar{c}_1 \\ \vdots \\ \bar{c}_{m-2} \\ \bar{c}_{m-1} \end{pmatrix} \tag{9.26}$$

which leads to the following algorithm for multiplication [HW00]:

Algorithm 9.7: Algorithm for multiplication over triangular basis for GF(2^m)

Input: $A \in GF(2^m)$ in triangular basis, $B \in GF(2^m)$ in polynomial basis
Output: $C = AB$ in triangular basis

1. $s_j = \bar{a}_j$, $0 \le j \le m-1$
2. $\bar{c}_j = 0$, $0 \le j \le m-1$
3. for $i = 0$ to $m - 1$ do
4.    if $b_i \neq 0$ then
5.       $\bar{c}_j = \bar{c}_j + s_{i+j}$, $0 \le j \le m-1$
6.    end if
7.    $s_{i+m} = \sum_{j=0}^{m-1} s_{i+j}\, f_j$
8. end for

Assume that the conversion from polynomial to triangular basis is implemented using Eq. (9.21), and the conversion from triangular to polynomial basis is implemented using Eq. (9.20) as follows:

   -- c is assumed to be initialized to 0
   c(m-1) := ctr(0);
   for i in 1 .. m-1 loop
      for j in 0 .. i loop
         c(m-1-i) := m2xor(c(m-1-i),m2and(ctr(i-j),f(m-j)));
      end loop;
   end loop;

where Ctr and C represent the element in the triangular and polynomial basis, respectively. Then the following algorithm implements Algorithm 9.7, where A, B, and C are represented in the polynomial basis and where basis conversions are performed where required.

Algorithm 9.8: Multiplication over triangular basis for GF(2^m)

   -- Basis conversion: Polynomial (A) to Triangular (Atr);
   -- atr and s(m .. 2*m-1) are assumed to be initialized to 0
   atr(0) := a(m-1);
   for i in 1 .. m-1 loop
      for j in 0 .. i-1 loop
         atr(i) := m2xor(atr(i),m2and(atr(i-1-j),f(m-1-j)));
      end loop;
      atr(i) := m2xor(a(m-1-i),atr(i));
   end loop;
   -- Algorithm 9.7
   for i in 0 .. m-1 loop
      s(i) := atr(i); c(i) := 0; ctr(i) := 0;
   end loop;
   for i in 0 .. m-1 loop
      if b(i) /= 0 then
         for j in 0 .. m-1 loop
            ctr(j) := m2xor(ctr(j),s(i + j));
         end loop;
      end if;
      for j in 0 .. m-1 loop
         s(i + m) := m2xor(s(i + m),m2and(s(i + j),f(j)));
      end loop;
   end loop;
   -- Basis conversion: Triangular (Ctr) to Polynomial (C)
   c(m-1) := ctr(0);
   for i in 1 .. m-1 loop
      for j in 0 .. i loop
         c(m-1-i) := m2xor(c(m-1-i),m2and(ctr(i-j),f(m-j)));
      end loop;
   end loop;

An executable Ada file triangular_mult.adb, including Algorithm 9.8, is available at www.arithmetic-circuits.org.
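As a cross-check of Algorithm 9.7, the following Python sketch multiplies every pair of elements of a small field and compares the result with the polynomial-basis product converted to the triangular basis by Eq. (9.21). The field $GF(2^4)$ with $f(x) = x^4 + x + 1$, the list encoding, and the helper names are assumptions for illustration and do not reproduce triangular_mult.adb.

```python
M = 4
F = [1, 1, 0, 0, 1]  # f_0 .. f_m of the assumed polynomial f(x) = x^4 + x + 1

def mul_alpha(p):
    """Multiply a polynomial-basis coordinate list by alpha and reduce with f(x)."""
    top = p[M - 1]
    return [top & F[0]] + [p[i - 1] ^ (top & F[i]) for i in range(1, M)]

def poly_mult(a, b):
    """Reference product in the polynomial basis (shift and add)."""
    c = [0] * M
    for bit in b:
        if bit:
            c = [x ^ y for x, y in zip(c, a)]
        a = mul_alpha(a)
    return c

def poly_to_tri(a):
    """Polynomial to triangular coordinates, Eq. (9.21)."""
    t = [0] * M
    t[0] = a[M - 1]
    for i in range(1, M):
        acc = a[M - 1 - i]
        for j in range(i):
            acc ^= t[i - 1 - j] & F[M - 1 - j]
        t[i] = acc
    return t

def tri_mult(at, b):
    """Algorithm 9.7: at and the result in triangular basis, b in polynomial basis."""
    s = list(at) + [0] * M              # s(0 .. 2m-1)
    c = [0] * M
    for i in range(M):
        if b[i]:
            for j in range(M):          # step 5
                c[j] ^= s[i + j]
        for j in range(M):              # step 7
            s[i + M] ^= s[i + j] & F[j]
    return c

def bits(n):
    """Integer to coordinate list, least significant coefficient first."""
    return [(n >> i) & 1 for i in range(M)]

if __name__ == "__main__":
    for x in range(1 << M):
        for y in range(1 << M):
            expected = poly_to_tri(poly_mult(bits(x), bits(y)))
            assert tri_mult(poly_to_tri(bits(x)), bits(y)) == expected
    print("Algorithm 9.7 agrees with the polynomial-basis product")
```

The structure is the same Hankel-type product as in Algorithm 9.1; only the sequence that is extended (here $s_i$) and the coefficient indexing differ.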

9.3 References

[Ber82] E. R. Berlekamp. “Bit-Serial Reed-Solomon Encoders.” IEEE Transactions on Computers, vol. 82, pp. 869–874, November 1982.
[Fen93] S. T. J. Fenn. “Optimised Algorithms and Circuit Architectures for Performing Finite Field Arithmetic in Reed-Solomon Codecs.” PhD Thesis, University of Huddersfield, 1993.
[FBF97] R. Furness, M. Benaissa, and S. T. J. Fenn. “Generalized triangular basis multipliers for the design of Reed-Solomon codecs.” IEEE Workshop on Signal Processing Systems (SIPS 97), pp. 202–211, November 1997.
[FBT96a] S. T. J. Fenn, M. Benaissa, and D. Taylor. “GF(2^n) Multiplication and Division Over the Dual Basis.” IEEE Transactions on Computers, vol. 45, no. 3, pp. 319–327, March 1996.
[FBT96b] S. T. J. Fenn, M. Benaissa, and D. Taylor. “Finite Field Inversion over the Dual Basis.” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 4, no. 1, pp. 134–137, March 1996.
[Has95] M. A. Hasan. “Shift Register Synthesis for Multiplicative Inversion over GF(2^m).” Proceedings of International Symposium of Information Theory, p. 49, 1995.
[Has98] M. A. Hasan. “Double-Basis Multiplicative Inversion over GF(2^m).” IEEE Transactions on Computers, vol. 47, no. 9, pp. 960–970, September 1998.
[HB92] M. A. Hasan and V. K. Bhargava. “Bit-Serial Systolic Divider and Multiplier for Finite Fields GF(2^m).” IEEE Transactions on Computers, vol. 41, no. 8, pp. 972–980, August 1992.
[HB95] M. A. Hasan and V. K. Bhargava. “Architecture for a low complexity rate-adaptive Reed-Solomon encoder.” IEEE Transactions on Computers, vol. 44, no. 6, pp. 938–942, June 1995.
[HTDR88] I. S. Hsu, T. K. Truong, L. J. Deutsch, and I. S. Reed. “A Comparison of VLSI Architecture of Finite Field Multipliers Using Dual, Normal, or Standard Bases.” IEEE Transactions on Computers, vol. 37, no. 6, pp. 735–739, June 1988.
[HW00] M. A. Hasan and A. G. Wassal. “VLSI Algorithms, Architectures, and Implementation of a Versatile GF(2^m) Processor.” IEEE Transactions on Computers, vol. 49, no. 10, pp. 1064–1073, October 2000.
[IHT06] J. L. Imaña, R. Hermida, and F. Tirado. “Low Complexity Bit-Parallel Multipliers Based on a Class of Irreducible Pentanomials.” IEEE Transactions on VLSI Systems, vol. 14, no. 12, pp. 1388–1393, December 2006.
[IST06] J. L. Imaña, J. M. Sánchez, and F. Tirado. “Bit-Parallel Finite Field Multipliers for Irreducible Trinomials.” IEEE Transactions on Computers, vol. 55, no. 5, pp. 520–533, May 2006.
[KL99] B. S. Kaliski and M. Liskov. “Efficient Finite Field Basis Conversion Involving Dual Basis.” Proceedings of Cryptographic Hardware and Embedded Systems (CHES 1999), LNCS 1717, pp. 135–143, 1999.
[LN83] R. Lidl and H. Niederreiter. Finite Fields. Addison-Wesley, Reading, MA, 1983.
[MKW89] M. Morii, M. Kasahara, and D. L. Whiting. “Efficient Bit-Serial Multiplication and the Discrete-Time Wiener-Hopf Equation over Finite Fields.” IEEE Transactions on Information Theory, vol. 35, pp. 1177–1183, November 1989.
[WH01] H. Wu and M. A. Hasan. “Efficient Exponentiation Using Weakly Dual Basis.” IEEE Transactions on Computers, vol. 9, no. 6, pp. 874–879, December 2001.
[WH98] H. Wu and M. A. Hasan. “Low-Complexity Bit-Parallel Multipliers for a Class of Finite Fields.” IEEE Transactions on Computers, vol. 47, no. 8, pp. 883–887, August 1998.
[WHB98] H. Wu, M. A. Hasan, and I. F. Blake. “New Low-Complexity Bit-Parallel Finite Field Multipliers Using Weakly Dual Basis.” IEEE Transactions on Computers, vol. 47, no. 11, pp. 1223–1234, November 1998.

CHAPTER

10

An Example of Application—Elliptic Curve Cryptography

This final chapter gives an example of finite-field application, namely, the implementation of the scalar product (point multiplication) over an elliptic curve. It is the basic computation primitive of elliptic curve cryptography. The definition of the corresponding operations depends on the particular field, but they always amount to combinations of arithmetic operations (add, subtract, multiply, square, and divide) over the chosen field, so that their hardware implementation can be carried out using the VHDL models defined in the preceding chapters. The definition of the elliptic curve operations follows the presentation of [HMV04].

10.1

Public-Key Cryptography

Public-key cryptography is a ciphering/deciphering method that uses distinct keys for ciphering (public key) and deciphering (private key). Among the most widely used schemes are RSA and the discrete logarithm systems.

In the first case [ARS78], two primes $p$ and $q$ are chosen. The public key is a pair $(n, e)$ of naturals where $n = pq$, $0 < e < (p-1)(q-1)$, and $e$ is relatively prime to $(p-1)(q-1)$. The private key is $d = e^{-1} \bmod (p-1)(q-1)$. It can be shown that $x^{ed} \equiv x \bmod n$ for any natural $x$. The encryption/decryption algorithm follows: given a message $mes$ represented as a natural belonging to the interval $0 < mes < n$, compute the ciphered text $c = mes^{e} \bmod n$. In order to decrypt $c$, compute $c^{d} \bmod n$. Observe that, knowing the public key $(n, e)$, the computation of the private key amounts to factoring $n$ in the form $n = pq$ and then calculating $d = e^{-1} \bmod (p-1)(q-1)$. Nowadays, the factorization problem is intractable for key sizes greater than 1024 bits.

In the second case (discrete logarithm), a finite group $(G, *, 1)$ is defined and some element $g$ of $G$ is chosen. Let $n$ be the order of $g$. Thus, the set $\{1, g, g^2, \ldots, g^{n-1}\}$ is a cyclic subgroup of $G$. The private key is a natural $x$ belonging to the interval $0 < x < n$, and the public key is the element $y$ of the cyclic subgroup defined by $y = g^x$. The message $mes$ must be represented as an element of $G$. The encryption algorithm is the following: randomly choose a natural $k$ belonging to $0 < k < n$, and compute $c_1 = g^k$ and $c_2 = mes \cdot y^k$. The ciphered text is made up of $c_1$ and $c_2$. In order to decrypt the message, compute $c_2 \cdot (c_1^{x})^{-1}$. Observe that, knowing the public key $y$, the computation of the private key $x$ amounts to calculating $\log_g y$, presumably a very hard problem.

In the basic version of the discrete logarithm scheme [ElG85], $G$ is the set of naturals $\{1, 2, \ldots, p-1\}$, where $p$ is a prime, so that all operations are performed modulo $p$. Nevertheless, other groups can be used. Consider, for example, an elliptic curve $E$ over the binary extension field $GF(2^m)$, defined as the set of elements $(x, y)$ of $GF(2^m) \times GF(2^m)$ such that $y^2 + xy = x^3 + ax^2 + b$, where $a$ and $b$ are elements of $GF(2^m)$. It can be demonstrated that the set of points of $E$, plus the so-called point at infinity $\infty$, is a group $(E, +, \infty)$ whose basic operation (under additive notation with neutral element $\infty$) will be defined in Sec. 10.3. A discrete logarithm scheme can then be defined by choosing a point $P$ of $E$ whose order is equal to $n$, so that the set $\{\infty, P, 2P, \ldots, (n-1)P\}$ is a cyclic subgroup of $E$. The private key is a natural $d$ belonging to the interval $0 < d < n$, and the public key is the element $Q$ of the cyclic subgroup defined by $Q = dP$. A simple encryption/decryption algorithm would be the following: given a message $mes$ represented by a point $M$ of $E$, randomly choose a natural number $k$ belonging to $0 < k < n$, and compute $C_1 = kP$ and $C_2 = M + kQ$. The ciphered text is made up of $C_1$ and $C_2$. In order to decrypt the message, compute $C_2 - dC_1$.

Actually, other encryption/decryption schemes are used, avoiding among other things the embedding of $mes$ within $E$. Nevertheless, the operations to be performed are similar. Observe that, knowing the public key $Q$, the computation of the private key amounts to looking for a natural $d$ such that $dP = Q$, presumably a very hard problem. Nowadays, this problem is intractable for key sizes greater than 160 bits.
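As a small numerical illustration of the RSA relations above, the following Python sketch runs an encryption/decryption round trip. The primes p = 61 and q = 53, the exponent e = 17, and the helper names are illustrative assumptions, not values taken from the text; real keys are far larger.

```python
# Toy RSA round trip. The primes and the public exponent are assumptions
# chosen only to keep the numbers small.
p, q = 61, 53
n = p * q                    # public modulus n = pq
phi = (p - 1) * (q - 1)      # (p - 1)(q - 1)
e = 17                       # public exponent, relatively prime to phi
d = pow(e, -1, phi)          # private key d = e^-1 mod (p - 1)(q - 1), Python 3.8+

def encrypt(mes):
    """Ciphered text c = mes^e mod n."""
    return pow(mes, e, n)

def decrypt(c):
    """Recovered message c^d mod n."""
    return pow(c, d, n)

mes = 65
assert decrypt(encrypt(mes)) == mes
```

The modular inverse is the only step that needs the factorization of n, which is why knowledge of p and q is equivalent to knowledge of the private key.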

10.2

Elliptic Curve over a Finite Field

Given a finite field $K$, an elliptic curve $E$ over $K$ is defined by a Weierstrass equation ([HMV04], [BSS99])

$$y^2 + a_1xy + a_3y = x^3 + a_2x^2 + a_4x + a_6 \tag{10.1}$$

where $a_1$, $a_2$, $a_3$, $a_4$, and $a_6$ belong to $K$ and satisfy some additional conditions [HMV04, Chap. 3]. Given an extension field $L$ of $K$, the corresponding elliptic curve $E(L)$ is defined by the following relation:

$$E(L) = \{(x, y) \in L \times L : y^2 + a_1xy + a_3y = x^3 + a_2x^2 + a_4x + a_6\} \cup \{\infty\} \tag{10.2}$$

$\infty$ being an additional point called the point at infinity.

After some changes of variables, the elliptic curves can be classified into five classes:

1. If the characteristic of $K$ is not equal to 2 or 3, then the simplified Weierstrass equation is

$$y^2 = x^3 + ax + b \tag{10.3}$$

where $a$ and $b$ belong to $K$, and $4a^3 + 27b^2 \neq 0$.

2. If the characteristic of $K$ is equal to 2, then a first simplified Weierstrass equation is

$$y^2 + xy = x^3 + ax^2 + b \tag{10.4}$$

where $a$ and $b$ belong to $K$, and $b \neq 0$. Such a curve is said to be nonsupersingular.

3. If the characteristic of $K$ is equal to 2, another simplified Weierstrass equation is

$$y^2 + cy = x^3 + ax + b \tag{10.5}$$

where $a$, $b$, and $c$ belong to $K$, and $c \neq 0$. Such a curve is said to be supersingular.

4. If the characteristic of $K$ is equal to 3, then a first simplified Weierstrass equation is

$$y^2 = x^3 + ax^2 + b \tag{10.6}$$

where $a$ and $b$ belong to $K$, $a \neq 0$ and $b \neq 0$. Such a curve is said to be nonsupersingular.

5. If the characteristic of $K$ is equal to 3, another simplified Weierstrass equation is

$$y^2 = x^3 + ax + b \tag{10.7}$$

where $a$ and $b$ belong to $K$, and $a \neq 0$. Such a curve is said to be supersingular.

It has been demonstrated (Hasse theorem, [HMV04]) that the number of points of $E(L)$ belongs to the following interval:

$$q + 1 - 2q^{1/2} \le \#E(L) \le q + 1 + 2q^{1/2} \tag{10.8}$$

where $q$ is the number of elements of $L$. Thus, for large values of $q$, the number of points is approximately equal to the number of field elements:

$$\#E(L) \cong q \tag{10.9}$$

For practical applications the curves to be considered belong to types 1, 2, or 3, that is, $L = GF(p^m)$ with $p > 3$ or $L = GF(2^m)$.

10.3

Group Law

Given an elliptic curve $E(L)$, an addition operation can be defined so that $E(L)$ is an abelian group ([HMV04]). The addition definition depends on the type of curve. In all cases the point at infinity $\infty$ is the neutral element (or identity). Let $P$ and $Q$ be elements of $E(L)$.

1. If $L = GF(p^m)$ where $p > 3$ (equation $y^2 = x^3 + ax + b$), then

$$P + \infty = \infty + P = P \tag{10.10}$$

$$(x, y) + (x, -y) = \infty \tag{10.11}$$

if $P = (x_1, y_1)$, $Q = (x_2, y_2)$, $P \neq Q$ and $P \neq -Q$, then $P + Q = (x_3, y_3)$ where

$$x_3 = \left[\frac{y_2 - y_1}{x_2 - x_1}\right]^2 - x_1 - x_2, \quad y_3 = \left[\frac{y_2 - y_1}{x_2 - x_1}\right](x_1 - x_3) - y_1 \tag{10.12}$$

if $P = (x_1, y_1)$ and $P \neq -P$, that is, $y_1 \neq 0$, then $P + P = (x_3, y_3)$ where

$$x_3 = \left[\frac{3x_1^2 + a}{2y_1}\right]^2 - 2x_1, \quad y_3 = \left[\frac{3x_1^2 + a}{2y_1}\right](x_1 - x_3) - y_1 \tag{10.13}$$

2. If $L = GF(2^m)$, nonsupersingular case (equation $y^2 + xy = x^3 + ax^2 + b$), then

$$P + \infty = \infty + P = P \tag{10.14}$$

$$(x, y) + (x, x + y) = \infty \tag{10.15}$$

if $P = (x_1, y_1)$, $Q = (x_2, y_2)$, $P \neq Q$ and $P \neq -Q$, then $P + Q = (x_3, y_3)$ where

$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \quad y_3 = \lambda(x_1 + x_3) + x_3 + y_1, \quad \lambda = \frac{y_1 + y_2}{x_1 + x_2} \tag{10.16}$$

if $P = (x_1, y_1)$ and $P \neq -P$, that is, $x_1 \neq 0$, then $P + P = (x_3, y_3)$ where

$$x_3 = \lambda^2 + \lambda + a = x_1^2 + \frac{b}{x_1^2}, \quad y_3 = x_1^2 + \lambda x_3 + x_3, \quad \lambda = x_1 + \frac{y_1}{x_1} \tag{10.17}$$

3. If $L = GF(2^m)$, supersingular case (equation $y^2 + cy = x^3 + ax + b$), then

$$P + \infty = \infty + P = P \tag{10.18}$$

$$(x, y) + (x, y + c) = \infty \tag{10.19}$$

if $P = (x_1, y_1)$, $Q = (x_2, y_2)$, $P \neq Q$ and $P \neq -Q$, then $P + Q = (x_3, y_3)$ where

$$x_3 = \left[\frac{y_1 + y_2}{x_1 + x_2}\right]^2 + x_1 + x_2, \quad y_3 = \left[\frac{y_1 + y_2}{x_1 + x_2}\right](x_1 + x_3) + y_1 + c \tag{10.20}$$

if $P = (x_1, y_1)$ and $P \neq -P$, then $P + P = (x_3, y_3)$ where

$$x_3 = \left[\frac{x_1^2 + a}{c}\right]^2, \quad y_3 = \left[\frac{x_1^2 + a}{c}\right](x_1 + x_3) + y_1 + c \tag{10.21}$$

The following formal algorithm computes $P + Q$. The function adding computes Eq. (10.12), (10.16), or (10.20), while the function doubling computes Eq. (10.13), (10.17), or (10.21):

Algorithm 10.1—Computation of R = P + Q
    if P = point_at_infinity then R := Q;
    elsif Q = point_at_infinity then R := P;
    elsif P = -Q then R := point_at_infinity;
    elsif P = Q then R := doubling(P);
    else R := adding(P,Q);
    end if;

Example 10.1 Consider the case of the curve defined by the equation $y^2 = x^3 - x$, that is, Eq. (10.3) with $a = -1$ and $b = 0$, over $K = GF(5)$. According to Table 10.1, $E(GF(5))$ contains 7 points (0,0), (1,0), (4,0), (2,1), (2,4), (3,2), (3,3) plus the point at infinity.

    x    x^3 − x mod 5    y    y^2 mod 5
    0          0          0        0
    1          0          1        1
    2          1          2        4
    3          4          3        4
    4          0          4        1

TABLE 10.1 Solutions of $y^2 = x^3 - x$

Examples of additions (all divisions are performed modulo 5, with $2^{-1} = 3$ and $3^{-1} = 2$):

1. Compute (2,1) + (0,0):

$x_3 = [(0 - 1)/(0 - 2)]^2 - 2 - 0 = 4 - 2 = 2$,
$y_3 = [(0 - 1)/(0 - 2)](2 - 2) - 1 = 4$,

that is, (2,1) + (0,0) = (2,4).

2. Compute (2,4) − (2,1) = (2,4) + (2,−1) = (2,4) + (2,4):

$x_3 = [(3 \cdot 2^2 - 1)/(2 \cdot 4)]^2 - 2 \cdot 2 = 2^2 - 4 = 0$,
$y_3 = [(3 \cdot 2^2 - 1)/(2 \cdot 4)](2 - 0) - 4 = 2 \cdot 2 - 4 = 0$,

that is, (2,4) − (2,1) = (0,0).
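Example 10.1 can be checked with a few lines of Python. The sketch below enumerates the points of $y^2 = x^3 - x$ over GF(5) and implements Algorithm 10.1 with Eqs. (10.12) and (10.13). The names `INF`, `inv`, `neg`, and `add` are illustrative choices, and representing the point at infinity by `None` is an assumption of this sketch.

```python
P_MOD = 5            # the field GF(5)
A, B = -1, 0         # curve y^2 = x^3 - x (Example 10.1)
INF = None           # point at infinity (representation chosen for this sketch)

def inv(v):
    """Inverse modulo the prime P_MOD (Fermat's little theorem)."""
    return pow(v % P_MOD, P_MOD - 2, P_MOD)

def neg(p):
    """-(x, y) = (x, -y), see Eq. (10.11)."""
    return INF if p is INF else (p[0], (-p[1]) % P_MOD)

def add(p, q):
    """Algorithm 10.1 with Eqs. (10.12) (adding) and (10.13) (doubling)."""
    if p is INF:
        return q
    if q is INF:
        return p
    if p == neg(q):
        return INF
    x1, y1 = p
    x2, y2 = q
    if p == q:
        lam = (3 * x1 * x1 + A) * inv(2 * y1) % P_MOD
    else:
        lam = (y2 - y1) * inv(x2 - x1) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

# Affine points of the curve, as listed in Table 10.1.
points = [(x, y) for x in range(P_MOD) for y in range(P_MOD)
          if (y * y - (x**3 + A * x + B)) % P_MOD == 0]

print(points)                 # the 7 affine points
print(add((2, 1), (0, 0)))    # first addition of Example 10.1
print(add((2, 4), (2, 4)))    # second addition of Example 10.1
```

Together with the point at infinity, the group has 8 elements, which lies inside the Hasse interval of Eq. (10.8) for $q = 5$.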

10.4

Point Multiplication

10.4.1

Definition

Point multiplication is the basic operation of elliptic curve cryptography: given a natural $k$ and a point $P$ of $E(L)$,

$$kP = P + P + \ldots + P \ (k \text{ times}) \quad \forall k > 0 \quad \text{and} \quad 0P = \infty \tag{10.22}$$

Assume that the number of points $\#E(L)$ of the chosen elliptic curve can be factored in the form

$$\#E(L) = nh \tag{10.23}$$

where $n$ is a prime and $h$ (the cofactor) is small, so that $n \cong q$ [see Eq. (10.9)]. Because the order of an element divides the order of the group, the order of $P$ is at most $n$, and the values of $k$ should be limited to the set $\{0, 1, \ldots, n - 1\}$.

Example 10.2 Consider the same curve as in Example 10.1 and compute $k(2,4)$ for $k$ in $\{1, 2, \ldots, 7\}$.

    1(2,4) = (2,4)
    2(2,4) = (2,4) + (2,4) = (0,0)
    3(2,4) = (0,0) + (2,4) = −(0,0) − (2,1) = −((0,0) + (2,1)) = −(2,4) = (2,1)
    4(2,4) = 2(0,0) = ∞
    5(2,4) = (2,4)
    6(2,4) = (0,0)
    7(2,4) = (2,1)

Observe that the order of (2,4) is equal to 4, that is, a divisor of the order 8 of the group.

10.4.2

Basic Algorithms

Let $(k_{t-1}, k_{t-2}, \ldots, k_0)$ be the binary representation of $k$, that is,

$$k = k_{t-1} \cdot 2^{t-1} + k_{t-2} \cdot 2^{t-2} + \ldots + k_0 \cdot 2^0 \tag{10.24}$$

with

$$t \cong \log_2 n \cong \log_2 q \tag{10.25}$$

Then $kP$ can be computed according to the following scheme ([HMV04]):

$$kP = (\ldots 2(2(2\infty + k_{t-1}P) + k_{t-2}P) + \ldots) + k_0P \tag{10.26}$$

The corresponding formal algorithm follows:

Algorithm 10.2—Point multiplication (Q = kP), left to right
    Q := point_at_infinity;
    for i in 1 .. t loop
      Q := Q + Q;
      if k(t-i) = 1 then Q := Q + P; end if;
    end loop;

Comment 10.1 During the execution of step $i > 1$, and before the conditional computation of $Q + P$, the value of $Q$ is $k'P$ where $k' = k_{t-1} \cdot 2^{i-1} + k_{t-2} \cdot 2^{i-2} + \ldots + k_{t-(i-1)} \cdot 2^1$. If $k$ is smaller than the order of $P$, then $k'P = P$ would imply that $k' = 1$, which is impossible as $k'$ is even, and $k'P = -P$ would imply that $k' + 1 = 0$, which is impossible.

Another computation scheme is ([HMV04])

$$kP = k_0P + k_1(2P) + k_2(2^2P) + \ldots + k_{t-1}(2^{t-1}P) \tag{10.27}$$

to which corresponds another formal algorithm:

Algorithm 10.3—Point multiplication (Q = kP), right to left
    Q := point_at_infinity;
    for i in 0 .. t-1 loop
      if k(i) = 1 then Q := Q + P; end if;
      P := P + P;
    end loop;

Comment 10.2 During the execution of step $i > 0$, and before the conditional computation of $Q + P$, the value of $Q$ is $k'P$ where $k' = k_{i-1} \cdot 2^{i-1} + k_{i-2} \cdot 2^{i-2} + \ldots + k_0 \cdot 2^0$ and $P$ has been substituted by $2^iP$. If $k$ is smaller than the order of $P$, then $k'P = 2^iP$ would imply that $k' = 2^i$, which is impossible as $k'$ is smaller than $2^i$, and $k'P = -2^iP$ would imply that $k' + 2^i = 0$, which is impossible.

The basic Algorithms 10.2 and 10.3 consist of $t$ doubling function calls and at most $t$ adding function calls, where the order of magnitude [Eq. (10.25)] of $t$ is $\log_2 q = m \log_2 p$ if $L = GF(p^m)$. Both functions are relatively complex [Eqs. (10.12), (10.13), (10.16), (10.17), (10.20), (10.21)] and include divisions. For that reason numerous alternative algorithms have been proposed. Some of them are briefly introduced in the next section.
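Algorithms 10.2 and 10.3 can be sketched in Python on the small curve of Example 10.1. The helper `add` below is a compact restatement of Algorithm 10.1; the function names and the use of `None` for the point at infinity are assumptions of this sketch. Because `add` handles every special case, the sketch also works when $k$ exceeds the order of $P$.

```python
P_MOD, A = 5, -1     # curve y^2 = x^3 - x over GF(5), as in Example 10.1
INF = None           # point at infinity (representation chosen for this sketch)

def inv(v):
    return pow(v % P_MOD, P_MOD - 2, P_MOD)

def neg(p):
    return INF if p is INF else (p[0], (-p[1]) % P_MOD)

def add(p, q):
    """Algorithm 10.1 with Eqs. (10.12) and (10.13)."""
    if p is INF:
        return q
    if q is INF:
        return p
    if p == neg(q):
        return INF
    x1, y1 = p
    x2, y2 = q
    if p == q:
        lam = (3 * x1 * x1 + A) * inv(2 * y1) % P_MOD
    else:
        lam = (y2 - y1) * inv(x2 - x1) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def left_to_right(k, p, t):
    """Algorithm 10.2: scan the t bits of k from k(t-1) down to k(0)."""
    q = INF
    for i in range(1, t + 1):
        q = add(q, q)
        if (k >> (t - i)) & 1:
            q = add(q, p)
    return q

def right_to_left(k, p, t):
    """Algorithm 10.3: scan the t bits of k from k(0) up to k(t-1)."""
    q = INF
    for i in range(t):
        if (k >> i) & 1:
            q = add(q, p)
        p = add(p, p)
    return q

# Reproduces the multiples of (2,4) listed in Example 10.2.
for k in range(1, 8):
    print(k, left_to_right(k, (2, 4), 3), right_to_left(k, (2, 4), 3))
```

Both loops perform $t$ doublings; they differ only in whether the accumulator or the base point is doubled.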

10.4.3

Some Alternative Methods

10.4.3.1

Nonadjacent Forms

Several strategies have been proposed for reducing the point-multiplication computation time. One of them is based on the fact that subtracting is as efficient as adding, so that a signed-bit representation of $k$ can be considered. Then, among all the signed-bit representations of $k$, a so-called nonadjacent form (NAF) is chosen ([HMV04]), that is, an expression

$$k = k'_{t'-1} \cdot 2^{t'-1} + k'_{t'-2} \cdot 2^{t'-2} + \ldots + k'_0 \cdot 2^0 \tag{10.28}$$

where $k'_i \in \{-1, 0, 1\}$, the length $t'$ of the representation is at most one more than the length $t$ of the binary representation, and no two consecutive signed bits are nonzero. The maximum number of adding function calls will be about $t/2$ instead of $t$, and the average number of adding function calls can be proven to be equal to $t/3$.

A generalization of the preceding idea consists of expressing $k$ with a larger set of signed digits, for example,

$$k = k''_{t''-1} \cdot 2^{t''-1} + k''_{t''-2} \cdot 2^{t''-2} + \ldots + k''_0 \cdot 2^0$$

where every nonzero $k''_i$ is an odd integer belonging to the interval $-2^{w-1} < k''_i < 2^{w-1}$, the length $t''$ of the representation is at most one more than the length $t$ of the binary representation, and the average density of nonzero digits among all the representations of this type is $1/(w + 1)$. The set of values $\{iP \text{ for } i \text{ in } 3, 5, \ldots, 2^{w-1} - 1\}$ must be computed first. Then the average number of adding function calls is equal to $t/(w + 1)$. This is the so-called window NAF method.
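The NAF of Eq. (10.28) can be generated digit by digit from the least significant end. The following Python sketch uses the usual rule "if $k$ is odd, pick the digit $2 - (k \bmod 4)$"; the function name `naf` and the least-significant-first digit order are choices made for this illustration.

```python
def naf(k):
    """Nonadjacent form of a natural k, least significant digit first.

    Each digit is -1, 0, or 1 and no two consecutive digits are nonzero.
    """
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)   # +1 if k = 1 mod 4, -1 if k = 3 mod 4
            k -= d            # k is now a multiple of 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits

print(naf(7))    # 7 = 8 - 1
print(naf(29))   # 29 = 32 - 4 + 1
```

Choosing the digit so that the remaining value is divisible by 4 is what forces a zero after every nonzero digit.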

10.4.3.2

Projective Coordinates

The goal is to avoid divisions. Given two naturals $c$ and $d$, a subset of $L^3$ can be associated to every element $(x, y)$ of $L^2$:

$$(x, y) \to \{(X, Y, Z) \mid Z \neq 0,\ X = xZ^c,\ Y = yZ^d\} \tag{10.29}$$

Conversely, to an element $(X, Y, Z)$ of $L^3$, with $Z \neq 0$, corresponds the element $(x, y)$ of $L^2$ defined by

$$x = X/Z^c, \quad y = Y/Z^d \tag{10.30}$$

As a matter of fact, the preceding relation defines an equivalence relation over the set of elements of $L^3$ whose $Z$-coordinate is nonzero: two elements $(X_1, Y_1, Z_1)$ and $(X_2, Y_2, Z_2)$ are equivalent if they are mapped to the same element of $L^2$, that is,

$$X_1/Z_1^c = X_2/Z_2^c \quad \text{and} \quad Y_1/Z_1^d = Y_2/Z_2^d \tag{10.31}$$

Each equivalence class is called a projective point, while the elements $(x, y)$ of $L^2$ are called affine points.

Consider now an elliptic curve, for example, a nonsupersingular curve over $L = GF(2^m)$ defined by Eq. (10.4), and assume that $c = 1$ and $d = 2$ (López-Dahab projective coordinates, [LD99]). Substituting $x$ by $X/Z$ and $y$ by $Y/Z^2$ in Eq. (10.4), the following equation is obtained:

$$Y^2/Z^4 + XY/Z^3 = X^3/Z^3 + aX^2/Z^2 + b \tag{10.32}$$

equivalent to

$$Y^2 + XYZ = X^3Z + aX^2Z^2 + bZ^4 \tag{10.33}$$

It is the projective form of Eq. (10.4), and for every solution $(x, y)$ of Eq. (10.4), there corresponds an equivalence class of solutions $(X, Y, Z)$ of Eq. (10.33). All the elements of the form $(X, 0, 0)$ are also solutions of Eq. (10.33). They are associated to the point at infinity $\infty$ in the two-dimension domain.

The basic idea for avoiding divisions is to define the point-addition and point-doubling operations using the projective coordinates. Assume that the result of an operation involving two points $(x_1, y_1)$ and $(x_2, y_2)$ is $(x_3, y_3)$. In affine coordinates $x_3$ and $y_3$ are computed with some of the formulas presented in Sec. 10.3. Within those formulas substitute $x_1$, $y_1$, $x_2$, $y_2$ with $X_1/Z_1$, $Y_1/Z_1^2$, $X_2/Z_2$, $Y_2/Z_2^2$, so that $x_3$ and $y_3$ are now expressed in the form $x_3 = N_x/D_x$ and $y_3 = N_y/D_y$, where the numerators and denominators are functions of $X_1$, $Y_1$, $Z_1$, $X_2$, $Y_2$, and $Z_2$. The same point can be represented in the projective domain in the form $((N_x/D_x)Z_3, (N_y/D_y)Z_3^2, Z_3)$ where $Z_3$ can be any nonzero element. Then choose for $Z_3$ an element such that $D_x$ divides $Z_3$ and $D_y$ divides $Z_3^2$.

As an example compute the doubling formulas. According to Eq. (10.17),

$\lambda = x_1 + y_1/x_1 = X_1/Z_1 + (Y_1/Z_1^2)/(X_1/Z_1) = (X_1^3Z_1 + X_1Y_1Z_1)/(X_1^2Z_1^2)$

$x_3 = x_1^2 + b/x_1^2 = (X_1/Z_1)^2 + b/(X_1/Z_1)^2 = (X_1^4 + bZ_1^4)/(X_1^2Z_1^2)$

$y_3 = x_1^2 + \lambda x_3 + x_3 = (X_1/Z_1)^2 + (X_1^3Z_1 + X_1Y_1Z_1)(X_1^4 + bZ_1^4)/(X_1^2Z_1^2)^2 + (X_1^4 + bZ_1^4)/(X_1^2Z_1^2)$

Then choose

$$Z_3 = X_1^2Z_1^2 \tag{10.34}$$

so that

$$X_3 = x_3Z_3 = X_1^4 + bZ_1^4 \tag{10.35}$$

$Y_3 = y_3Z_3^2 = (X_1/Z_1)^2Z_3^2 + (X_1^3Z_1 + X_1Y_1Z_1)X_3 + X_3Z_3$
$\quad = X_1^4Z_3 + (X_1^3Z_1 + X_1Y_1Z_1)X_3 + X_3Z_3$
$\quad = (X_1^4 + X_3)Z_3 + (X_1^3Z_1 + X_1Y_1Z_1)X_3$
$\quad = bZ_1^4Z_3 + (X_1^3Z_1 + X_1Y_1Z_1)X_3$

As $(X_1, Y_1, Z_1)$ satisfies Eq. (10.33), $X_1^3Z_1 + X_1Y_1Z_1 = Y_1^2 + aX_1^2Z_1^2 + bZ_1^4$, so that

$$Y_3 = bZ_1^4Z_3 + (Y_1^2 + aZ_3 + bZ_1^4)X_3 \tag{10.36}$$

Thus, the point-doubling operation is executed with the formulas in Eqs. (10.34), (10.35), and (10.36). The corresponding computation primitives are finite-field addition, multiplication, and squaring.

The point-adding formulas are somewhat more complex. If $Z_2 = 1$ the final result is the following [HMV04]:

$$X_3 = A^2 + D + E, \quad Y_3 = (E + Z_3)F + G, \quad Z_3 = C^2 \tag{10.37}$$

where

$A = Y_2Z_1^2 + Y_1$, $B = X_2Z_1 + X_1$, $C = Z_1B$, $D = B^2(C + aZ_1^2)$,
$E = AC$, $F = X_3 + X_2Z_3$, $G = (X_2 + Y_2)Z_3^2$

Once again the corresponding computation primitives are finite-field addition, multiplication, and squaring. The negation is computed as follows ($Z \neq 0$):

$$-(X, Y, Z) = (X, XZ + Y, Z) \tag{10.38}$$

Indeed, the corresponding affine points are $(X/Z, Y/Z^2)$ and $(X/Z, (XZ + Y)/Z^2) = (X/Z, X/Z + Y/Z^2) = -(X/Z, Y/Z^2)$ [according to Eq. (10.15)].

To summarize, the elliptic curve operations are executed as follows: substitute every curve point $(x, y)$ by the projective point $(x, y, 1)$, substitute $\infty$ by $(1, 0, 0)$, and execute all the necessary operations within the projective domain. If the result of a sequence of operations is $(X, Y, Z)$ with $Z \neq 0$, the corresponding curve point is $(X/Z, Y/Z^2)$, and if $Z = 0$ the result is $\infty$. Thus, the only finite-field divisions are those corresponding to the final projective-to-affine decoding.

If the point multiplication is computed with Algorithm 10.2, the second point-addition operand is always $P$, so that its initial projective representation $(x_P, y_P, 1)$ is not modified during the algorithm execution. This justifies the fact that the point-addition formulas [Eq. (10.37)] have been defined with $Z_2 = 1$. This is an example of mixed coordinate representation: $Q$ is represented in projective coordinates, that is, in the form $(X_Q, Y_Q, Z_Q)$, while $P$ is represented in affine coordinates in the form $(x_P, y_P)$.
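The doubling formulas of Eqs. (10.34) to (10.36) can be cross-checked against the affine formula of Eq. (10.17) on a toy field. The Python sketch below assumes GF(2^4) with $f(z) = z^4 + z + 1$ and arbitrarily chosen curve coefficients; these parameters and all helper names are illustrative assumptions, not values from the text.

```python
M, F = 4, 0b10011      # toy field GF(2^4), f(z) = z^4 + z + 1 (assumption)
A, B = 1, 0b0110       # curve coefficients with b != 0 (arbitrary assumption)

def mul(a, b):
    """Shift-and-add product of two field elements modulo f(z)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F
    return r

def inv(a):
    """Inverse of a nonzero element, computed as a^(2^M - 2)."""
    r = 1
    for _ in range(2**M - 2):
        r = mul(r, a)
    return r

def affine_double(x1, y1):
    """Eq. (10.17), valid for x1 != 0; needs one field inversion."""
    lam = x1 ^ mul(y1, inv(x1))
    x3 = mul(lam, lam) ^ lam ^ A
    y3 = mul(x1, x1) ^ mul(lam, x3) ^ x3
    return x3, y3

def ld_double(X1, Y1, Z1):
    """Eqs. (10.34), (10.35), (10.36): additions, products, squarings only."""
    X1s, Z1s = mul(X1, X1), mul(Z1, Z1)
    bZ4 = mul(B, mul(Z1s, Z1s))
    Z3 = mul(X1s, Z1s)
    X3 = mul(X1s, X1s) ^ bZ4
    Y3 = mul(bZ4, Z3) ^ mul(mul(Y1, Y1) ^ mul(A, Z3) ^ bZ4, X3)
    return X3, Y3, Z3

def to_affine(X, Y, Z):
    """Projective-to-affine decoding of Eq. (10.30) with c = 1, d = 2."""
    zi = inv(Z)
    return mul(X, zi), mul(Y, mul(zi, zi))

# Affine points of y^2 + xy = x^3 + a x^2 + b over the toy field.
points = [(x, y) for x in range(2**M) for y in range(2**M)
          if mul(y, y) ^ mul(x, y) == mul(mul(x, x), x) ^ mul(A, mul(x, x)) ^ B]

for (x, y) in points:
    if x != 0:
        assert to_affine(*ld_double(x, y, 1)) == affine_double(x, y)
```

The single inversion in `to_affine` is the only division left, which mirrors the remark that divisions are confined to the final decoding step.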

10.4.3.3

Montgomery Algorithm

Assume again that $k = k_{t-1}2^{t-1} + k_{t-2}2^{t-2} + \ldots + k_12^1 + k_02^0$, and define the partial sums

$s_0 = 0$
$s_1 = k_{t-1}2^0$
$s_2 = k_{t-1}2^1 + k_{t-2}2^0$
$\ldots$
$s_t = k_{t-1}2^{t-1} + k_{t-2}2^{t-2} + \ldots + k_12^1 + k_02^0 = k$     (10.39)

Thus,

$$s_j = 2s_{j-1} + k_{t-j} \quad \forall j = 1, 2, \ldots, t \tag{10.40}$$

The algorithm consists of computing at each step $s_jP$ and $(s_j + 1)P$ as functions of $s_{j-1}P$ and $(s_{j-1} + 1)P$ ([HMV04]). If $k_{t-j} = 0$ then

$$s_jP = 2(s_{j-1}P), \quad (s_j + 1)P = (2s_{j-1} + 1)P = s_{j-1}P + (s_{j-1} + 1)P \tag{10.41}$$

and if $k_{t-j} = 1$ then

$$s_jP = (2s_{j-1} + 1)P = s_{j-1}P + (s_{j-1} + 1)P, \quad (s_j + 1)P = (2s_{j-1} + 2)P = 2(s_{j-1} + 1)P \tag{10.42}$$

Initially define

$$s_0P = \infty \quad \text{and} \quad (s_0 + 1)P = P \tag{10.43}$$

The corresponding formal algorithm follows. In the first branch $B$ is updated before $A$, so that the sum $A + B$ uses the previous value of $A$, as required by Eq. (10.41).

Algorithm 10.4—Montgomery point-multiplication algorithm
    A := point_at_infinity; B := P;
    for j in 1 .. t loop
      if k(t-j) = 0 then B := A + B; A := 2A;
      else A := A + B; B := 2B;
      end if;
    end loop;
    R := A;

Every step of Algorithm 10.4 includes one point-doubling and one point-addition. Therefore, the complexity is similar to that of the classical point-multiplication algorithms. Nevertheless, the fact that at each step the values of both $s_jP$ and $s_jP + P$ are known allows simplifying the computation in the case of the nonsupersingular curve $y^2 + xy = x^3 + ax^2 + b$ over $GF(2^m)$. The following property is used: if $A = (x_A, y_A) \neq \infty$ and $B = (x_B, y_B) \neq \infty$ are two different points of the curve and if $A \neq -B$, then the $x$-coordinates $x_{A+B}$ and $x_{A-B}$ of $A + B$ and $A - B$ are related by

$x_{A+B} = x_{A-B} + x_B(x_A + x_B)^{-1} + (x_B(x_A + x_B)^{-1})^2$.

If furthermore $A = s_jP$ and $B = (s_j + 1)P$ for some $j$, then $A - B = -P$, $x_{A-B} = x_P$, and

$$x_{A+B} = x_P + x_B(x_A + x_B)^{-1} + (x_B(x_A + x_B)^{-1})^2 \tag{10.44}$$

If $P$ is assumed to be different from $\infty$, then $A = s_jP$ and $B = (s_j + 1)P$ are always different. If at some step $x_A = x_B$, then $A = -B$ and $A + B = \infty$. Regarding the point-doubling, according to Eq. (10.17),

$$x_{A+A} = x_A^2 + b/x_A^2 \ \text{ if } x_A \neq 0 \quad \text{and} \quad (0, y_A) + (0, y_A) = (0, y_A) - (0, y_A) = \infty \tag{10.45}$$

and similar relations hold for computing $B + B$. Thus, Algorithm 10.4 can be executed with the $x$-coordinates of the successive $A$ and $B$ points. A final step will compute the missing $y$-coordinate of the result. It is based on the following property: if $P = (x_P, y_P)$, where $x_P \neq 0$, $kP = (x_A, y_A)$ and $(k + 1)P = (x_B, y_B)$, then

$$y_A = x_P^{-1}(x_A + x_P)[(x_A + x_P)(x_B + x_P) + x_P^2 + y_P] + y_P \tag{10.46}$$

Therefore, Montgomery Algorithm 10.4 consists of $t$ point-additions and $t$ point-doubling operations, where the order of magnitude [Eq. (10.25)] of $t$ is $m$. The number of operations is the same as in basic Algorithms 10.2 and 10.3. Nevertheless, in most point-operations the $y$-coordinate is not computed, so that the algorithm complexity is roughly halved.

The Montgomery method could also be used in projective or mixed coordinates. For that, Eqs. (10.44), (10.45), and (10.46) must be expressed using projective coordinates for $A$ and $B$. Assume that $c = d = 1$ (standard projective coordinates) so that $x_A = X_A/Z_A$ and $x_B = X_B/Z_B$. Then, according to Eq. (10.44),

$x_{A+B} = x_P + X_BZ_A/(X_AZ_B + X_BZ_A) + (X_BZ_A/(X_AZ_B + X_BZ_A))^2$

Let

$$Z_{A+B} = (X_AZ_B + X_BZ_A)^2 \tag{10.47}$$

so that

$$X_{A+B} = x_PZ_{A+B} + X_BZ_A(X_AZ_B + X_BZ_A) + (X_BZ_A)^2 = x_PZ_{A+B} + X_AX_BZ_AZ_B \tag{10.48}$$

According to Eq. (10.45),

$x_{A+A} = (X_A/Z_A)^2 + b(Z_A/X_A)^2 = (X_A^4 + bZ_A^4)/(X_A^2Z_A^2)$

Let

$$Z_{A+A} = X_A^2Z_A^2 \tag{10.49}$$

so that

$$X_{A+A} = x_{A+A}Z_{A+A} = X_A^4 + bZ_A^4 \tag{10.50}$$

Finally, according to Eq. (10.46),

$$y_A = (x_P + X_A/Z_A)[(X_A + x_PZ_A)(X_B + x_PZ_B) + (x_P^2 + y_P)Z_AZ_B](x_PZ_AZ_B)^{-1} + y_P \tag{10.51}$$
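The recurrences of Eqs. (10.41) and (10.42) can be sketched in Python with a generic point addition. For simplicity the sketch reuses the small curve of Example 10.1 over GF(5) with full affine additions, rather than the $x$-only formulas of Eq. (10.44), which apply to binary curves; the helper names are assumptions of this illustration.

```python
P_MOD, A = 5, -1     # curve y^2 = x^3 - x over GF(5), as in Example 10.1
INF = None           # point at infinity (representation chosen for this sketch)

def inv(v):
    return pow(v % P_MOD, P_MOD - 2, P_MOD)

def neg(p):
    return INF if p is INF else (p[0], (-p[1]) % P_MOD)

def add(p, q):
    """Algorithm 10.1 with Eqs. (10.12) and (10.13)."""
    if p is INF:
        return q
    if q is INF:
        return p
    if p == neg(q):
        return INF
    x1, y1 = p
    x2, y2 = q
    if p == q:
        lam = (3 * x1 * x1 + A) * inv(2 * y1) % P_MOD
    else:
        lam = (y2 - y1) * inv(x2 - x1) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def montgomery(k, p, t):
    """Montgomery ladder: after step j, a = s_j P and b = (s_j + 1) P."""
    a, b = INF, p
    for j in range(1, t + 1):
        if (k >> (t - j)) & 1 == 0:
            # Tuple assignment evaluates both sums with the old a and b.
            a, b = add(a, a), add(a, b)
        else:
            a, b = add(a, b), add(b, b)
        assert add(b, neg(a)) == p    # invariant B - A = P
    return a

# Reproduces the multiples of (2,4) listed in Example 10.2.
print([montgomery(k, (2, 4), 3) for k in range(1, 8)])
```

The invariant $B - A = P$ checked inside the loop is exactly the property that makes the $x$-only formula of Eq. (10.44) applicable at every step.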

10.4.3.4

Frobenius Map

The point-doubling operation can be avoided in the case of the two following Koblitz curves over $GF(2^m)$ ([Sol00], [HMV04]):

$$E_0: y^2 + xy = x^3 + 1 \tag{10.52}$$

$$E_1: y^2 + xy = x^3 + x^2 + 1 \tag{10.53}$$

For that define the Frobenius map $\tau$ from $E_c(GF(2^m))$ to $E_c(GF(2^m))$, with $c = 0$ or 1:

$$\tau(\infty) = \infty, \quad \tau(x, y) = (x^2, y^2) \tag{10.54}$$

It can be demonstrated that

$$2P = -\tau^2(P) + \mu\tau(P) \quad \text{with } \mu = 1 \text{ if } c = 1 \text{ and } \mu = -1 \text{ if } c = 0 \tag{10.55}$$

Thus the point-doubling operation amounts to squaring operations in $GF(2^m)$ for computing $\tau(P)$ and $\tau^2(P)$, and a point-addition.

In fact a generalized version of Eq. (10.55) can be defined. Given two integers $a$ and $b$, define an application $\alpha = a + b\tau$ from $E_c(GF(2^m))$ to $E_c(GF(2^m))$:

$$\alpha(P) = aP + b\tau(P) \tag{10.56}$$

Now, look for two integers $a'$ and $b'$ such that

$$\alpha(P) = \alpha'(\tau(P)) + rP \quad \text{with } \alpha' = a' + b'\tau \text{ and } r \in \{-1, 0, 1\} \tag{10.57}$$

According to Eqs. (10.55), (10.56), and (10.57),

$$aP + b\tau(P) = (a' + b'\tau)\tau(P) + rP = a'\tau(P) + b'\tau^2(P) + rP = a'\tau(P) + \mu b'\tau(P) - 2b'P + rP = (a' + \mu b')\tau(P) - (2b' - r)P \tag{10.58}$$

Hence, $a = r - 2b'$ and $b = a' + \mu b'$, that is,

$$b' = (r - a)/2 \quad \text{and} \quad a' = b - \mu b' = b + \mu(a - r)/2 \tag{10.59}$$

If $a$ is even, then choose $r = 0$ so that

$$b' = -a/2, \quad a' = b + \mu a/2 \tag{10.60}$$

If $a$ is odd, choose $r$ in such a way that $a'$ is even, that is, $2a' = 2b + \mu a - \mu r$ is a multiple of 4. For that, consider the binary representations $(\ldots a_1\ 1)$ and $(\ldots b_1\ b_0)$ of $a$ and $b$. Then

$2b + a = (\ldots b_0\ 0) + (\ldots a_1\ 1) = (\ldots a_1 \oplus b_0\ \ 1)$
$2b - a = (\ldots b_0\ 0) - (\ldots a_1\ 1) = (\ldots a_1 \oplus b_0 \oplus 1\ \ 1)$

where $\oplus$ stands for the modulo 2 sum.

If $a_1 \oplus b_0 = 0$, choose $r = 1$ so that

$2b + a - r = (\ldots 0\ 1) - 1 = (\ldots 0\ 0)$ if $\mu = 1$
and
$2b - a + r = (\ldots 1\ 1) + 1 = (\ldots 0\ 0)$ if $\mu = -1$

If $a_1 \oplus b_0 = 1$, choose $r = -1$ so that

$2b + a - r = (\ldots 1\ 1) + 1 = (\ldots 0\ 0)$ if $\mu = 1$
and
$2b - a + r = (\ldots 0\ 1) - 1 = (\ldots 0\ 0)$ if $\mu = -1$

To summarize:

1. If $a$ is even, then $r = 0$, $b' = -a/2$, and $a' = b - \mu b' = b + \mu a/2$.
2. If $a$ is odd and $a_1 \oplus b_0 = 0$, then $r = 1$, $b' = -(a - 1)/2$, and $a' = b - \mu b' = b + \mu(a - 1)/2$.
3. If $a$ is odd and $a_1 \oplus b_0 = 1$, then $r = -1$, $b' = -(a + 1)/2$, and $a' = b - \mu b' = b + \mu(a + 1)/2$.

Observe that if $a$ is odd, then $a - 2b = (\ldots a_1\ 1) - (\ldots b_0\ 0) = (\ldots a_1 \oplus b_0\ \ 1)$, so that $(a - 2b) \bmod 4$ is equal to 1 if $a_1 \oplus b_0 = 0$ and equal to 3 if $a_1 \oplus b_0 = 1$. So, an alternative definition of $r$, when $a$ is odd, is

$$r = 2 - ((a - 2b) \bmod 4) \tag{10.61}$$

Equation (10.57) defines a kind of integer division of $\alpha$ by $\tau$, that is,

$$\alpha = \alpha'\tau + r \quad \text{with } r \in \{-1, 0, 1\} \tag{10.62}$$

By repeatedly using the preceding relations, an expression of $\alpha$ can be computed:

$\alpha = \alpha_1\tau + r_0$
$\alpha_1 = \alpha_2\tau + r_1$
$\ldots$
$\alpha_{t-1} = \alpha_t\tau + r_{t-1}$     (10.63)

with $r_i \in \{-1, 0, 1\}$. Thus (multiply the second equation by $\tau$, the third one by $\tau^2$, and so on, and sum up the $t$ equations)

$$\alpha = r_0 + r_1\tau + \ldots + r_{t-1}\tau^{t-1} + \alpha_t\tau^t \tag{10.64}$$

It can be demonstrated that after a finite number of steps $t$, $\alpha_t = 0$. Consider the particular case where $a = k$ and $b = 0$, that is, $\alpha(P) = kP$. Then, according to Eq. (10.64) with $\alpha_t = 0$,

$$kP = r_{t-1}\tau^{t-1}(P) + r_{t-2}\tau^{t-2}(P) + \ldots + r_1\tau(P) + r_0P \quad \text{with } r_i \in \{-1, 0, 1\} \tag{10.65}$$

The following algorithm computes this τ-ary representation of $k$.

Algorithm 10.5—τ-ary representation of k
    a := k; b := 0; i := 0;
    while a /= 0 or b /= 0 loop
      if a mod 2 = 0 then r(i) := 0;
      else r(i) := 2 - ((a - 2*b) mod 4);
      end if;
      old_a := a;
      a := b + mu*(old_a - r(i))/2;
      b := (r(i) - old_a)/2;
      i := i+1;
    end loop;

Regarding the maximum value of $t$ in the particular case where $a = k$ and $b = 0$, it has been demonstrated that

$$t \approx 2\log_2 k \tag{10.66}$$
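Algorithm 10.5 translates almost literally into Python. The sketch below also evaluates the resulting digit string in the ring of elements $u + v\tau$, using the relation $\tau^2 = \mu\tau - 2$ that follows from Eq. (10.55), to confirm that it represents $k$. The function names `tau_adic` and `evaluate` are assumptions of this illustration.

```python
def tau_adic(k, mu):
    """Algorithm 10.5: digits r(0), r(1), ... of k in base tau."""
    a, b = k, 0
    digits = []
    while a != 0 or b != 0:
        if a % 2 == 0:
            r = 0
        else:
            r = 2 - ((a - 2 * b) % 4)     # Eq. (10.61); Python's % is nonnegative
        digits.append(r)
        # a - r is always even, so the integer divisions are exact.
        a, b = b + mu * (a - r) // 2, (r - a) // 2
    return digits

def evaluate(digits, mu):
    """Horner evaluation of sum r(i) tau^i as a pair (u, v) meaning u + v*tau."""
    u, v = 0, 0
    for r in reversed(digits):
        # (u + v*tau)*tau = -2v + (u + mu*v)*tau, because tau^2 = mu*tau - 2
        u, v = -2 * v + r, u + mu * v
    return u, v

print(tau_adic(17, 1))
print(evaluate(tau_adic(17, 1), 1))
```

For $k = 17$ the digits are the same for $\mu = 1$ and $\mu = -1$, and they evaluate back to $17 + 0\tau$.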

Example 10.3 Express $17P$ in the form of Eq. (10.65). Initially $\alpha = 17 + 0\tau$, that is, $a = 17$ and $b = 0$. Then

    a = 17, b = 0: r(0) = 2 − (17 mod 4) = 1
    a = 0 + μ(17 − 1)/2 = 8μ, b = (1 − 17)/2 = −8: r(1) = 0
    a = −8 + μ(8μ − 0)/2 = −4, b = (0 − 8μ)/2 = −4μ: r(2) = 0
    a = −4μ + μ(−4 − 0)/2 = −6μ, b = (0 + 4)/2 = 2: r(3) = 0
    a = 2 + μ(−6μ − 0)/2 = −1, b = (0 + 6μ)/2 = 3μ: r(4) = 2 − ((−1 − 6μ) mod 4) = 1
    a = 3μ + μ(−1 − 1)/2 = 2μ, b = (1 + 1)/2 = 1: r(5) = 0
    a = 1 + μ(2μ − 0)/2 = 2, b = (0 − 2μ)/2 = −μ: r(6) = 0
    a = −μ + μ(2 − 0)/2 = 0, b = (0 − 2)/2 = −1: r(7) = 0
    a = −1 + μ(0 − 0)/2 = −1, b = (0 − 0)/2 = 0: r(8) = 2 − (−1 mod 4) = −1
    a = 0 + μ(−1 + 1)/2 = 0, b = (−1 + 1)/2 = 0

Thus, $17P = P + \tau^4(P) - \tau^8(P)$.

The following formal point-multiplication algorithms, in which the function frobenius computes Eq. (10.54), are directly deduced from Eq. (10.65). In both cases it is assumed that the representation [Eq. (10.65)] of $kP$ has been previously computed.

Algorithm 10.6—Point multiplication (Q = kP), Koblitz curve, left to right
    Q := point_at_infinity;
    for i in 1 .. t loop
      Q := frobenius(Q);
      if r(t-i) = 1 then Q := Q+P;
      elsif r(t-i) = -1 then Q := Q-P;
      end if;
    end loop;

Algorithm 10.7—Point multiplication (Q = kP), Koblitz curve, right to left
    Q := point_at_infinity;
    for i in 0 .. t-1 loop
      if r(i) = 1 then Q := Q + P;
      elsif r(i) = -1 then Q := Q-P;
      end if;
      P := frobenius(P);
    end loop;

Comment 10.3 During the execution of step $i > 0$ of the preceding algorithm, and before the computation of $Q + P$ or $Q - P$, the value of $Q$ is $r_{i-1}\tau^{i-1}(P) + r_{i-2}\tau^{i-2}(P) + \ldots + r_1\tau(P) + r_0P$ and $P$ has been substituted by $\tau^i(P)$. If $r_{i-1}\tau^{i-1}(P) + r_{i-2}\tau^{i-2}(P) + \ldots + r_1\tau(P) + r_0P = r_i\tau^i(P)$, then

$$-r_i\tau^i(P) + r_{i-1}\tau^{i-1}(P) + r_{i-2}\tau^{i-2}(P) + \ldots + r_1\tau(P) + r_0P = \infty \tag{10.67}$$

and if $r_{i-1}\tau^{i-1}(P) + r_{i-2}\tau^{i-2}(P) + \ldots + r_1\tau(P) + r_0P = -r_i\tau^i(P)$, then

$$r_i\tau^i(P) + r_{i-1}\tau^{i-1}(P) + r_{i-2}\tau^{i-2}(P) + \ldots + r_1\tau(P) + r_0P = \infty \tag{10.68}$$

If $k$ is smaller than the order of $P$, it can be shown that the preceding Eqs. (10.67) and (10.68) are never satisfied unless $r_i = r_{i-1} = \ldots = r_1 = r_0 = 0$. Thus, if $Q \neq \infty$ the values of $Q + P$ and $Q - P$ are computed with Eqs. (10.15) and (10.16).

Assume that the following procedure, computing $x_3$ and $y_3$ according to Eq. (10.16), has been previously defined:

    procedure point_addition(x1, y1, x2, y2, f: in Polynomial; x3, y3: out Polynomial);

The following executable algorithm computes $kP$.

Algorithm 10.8—Point multiplication (Q = kP), Koblitz curve, right to left
    Q_infinity := true;
    for i in 0 .. t-1 loop
      if r(i) = 1 then
        if Q_infinity then
          xQ := xP; yQ := yP; Q_infinity := false;
        else
          point_addition(xP, yP, xQ, yQ, f, new_xQ, new_yQ);
          xQ := new_xQ; yQ := new_yQ;
        end if;
      elsif r(i) = -1 then
        if Q_infinity then
          xQ := xP; yQ := add(xP, yP); Q_infinity := false;
        else
          point_addition(xP, add(xP, yP), xQ, yQ, f, new_xQ, new_yQ);
          xQ := new_xQ; yQ := new_yQ;
        end if;
      end if;
      xP := product_mod_f(xP, xP, f);
      yP := product_mod_f(yP, yP, f);
    end loop;

The base-τ conversion Algorithm 10.5 successively computes r0, r1, . . . , rt − 1, and the right-to-left point-multiplication Algorithm 10.7 successively uses r0, r1, . . . , rt − 1. So, both algorithms can be executed in parallel.

Algorithm 10.9—Point multiplication (Q = kP), Koblitz curve, τ-ary representation of k
    Q_Infinity := True; A := K; B := 0;
    while ((A /= 0) or (B /= 0)) loop
      if A mod 2 = 0 then
        R_I := 0;
      elsif 2 - ((A - 2*B) mod 4) = 1 then
        R_I := 1;
        if Q_Infinity then
          Xq := Xp; Yq := Yp; Q_Infinity := False;
        else
          Point_Addition(Xp, Yp, Xq, Yq, F, New_Xq, New_Yq);
          Xq := New_Xq; Yq := New_Yq;
        end if;
      else
        R_I := -1;
        if Q_Infinity then
          Xq := Xp; Yq := Add(Xp, Yp); Q_Infinity := False; -- Q := Q-P = -P
        else
          Point_Addition(Xp, Add(Xp, Yp), Xq, Yq, F, New_Xq, New_Yq);
          Xq := New_Xq; Yq := New_Yq;
        end if;
      end if;
      Xp := Product_Mod_F(Xp, Xp, F);
      Yp := Product_Mod_F(Yp, Yp, F);
      --update a and b
      Old_A := A;
      A := B + Mu*(Old_A - R_I)/2;
      B := (R_I - Old_A)/2;
    end loop;
    Xr := Xq; Yr := Yq;

An executable Ada file frobenius_point_multiplication.adb, including Algorithm 10.9, is available at www.arithmetic-circuits.org.

To summarize, doubling has been substituted by squaring, a simple operation over a binary field. Furthermore, among two successive coefficients $r_i$, at least one is equal to 0. Thus, according to Eq. (10.66), an upper bound $s$ of the number of nonzero coefficients $r_i$ is given by

$$s \approx \log_2 k \approx m \tag{10.69}$$

Thus, the computation of $kP$ includes at most $m$ complex operations (adding or subtracting), and the total computation time should be roughly half that of the basic algorithms.
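The whole method can be exercised on a toy instance. The Python sketch below takes the Koblitz curve $E_1$ of Eq. (10.53) over GF(2^4) with $f(z) = z^4 + z + 1$ (a field chosen only for illustration, not the one used in the text), recodes $k$ with Algorithm 10.5, and multiplies with the right-to-left Algorithm 10.7. All helper names are assumptions, and the result is compared with plain repeated addition.

```python
M, F = 4, 0b10011     # toy field GF(2^4), f(z) = z^4 + z + 1 (assumption)
MU = 1                # curve E1: y^2 + xy = x^3 + x^2 + 1, so a = 1 and mu = 1
INF = None            # point at infinity (representation chosen for this sketch)

def mul(a, b):
    """Shift-and-add product modulo f(z)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= F
    return r

def inv(a):
    """Inverse of a nonzero element, computed as a^(2^M - 2)."""
    r = 1
    for _ in range(2**M - 2):
        r = mul(r, a)
    return r

def neg(p):
    """-(x, y) = (x, x + y), see Eq. (10.15)."""
    return INF if p is INF else (p[0], p[0] ^ p[1])

def add(p, q):
    """Algorithm 10.1 with Eqs. (10.16) and (10.17), a = 1."""
    if p is INF:
        return q
    if q is INF:
        return p
    if p == neg(q):
        return INF
    (x1, y1), (x2, y2) = p, q
    if p == q:
        lam = x1 ^ mul(y1, inv(x1))
        x3 = mul(lam, lam) ^ lam ^ 1
        return (x3, mul(x1, x1) ^ mul(lam, x3) ^ x3)
    lam = mul(y1 ^ y2, inv(x1 ^ x2))
    x3 = mul(lam, lam) ^ lam ^ x1 ^ x2 ^ 1
    return (x3, mul(lam, x1 ^ x3) ^ x3 ^ y1)

def frobenius(p):
    """Eq. (10.54): squaring of both coordinates."""
    return INF if p is INF else (mul(p[0], p[0]), mul(p[1], p[1]))

def tau_adic(k):
    """Algorithm 10.5 with mu = MU."""
    a, b, digits = k, 0, []
    while a != 0 or b != 0:
        r = 0 if a % 2 == 0 else 2 - ((a - 2 * b) % 4)
        digits.append(r)
        a, b = b + MU * (a - r) // 2, (r - a) // 2
    return digits

def koblitz_mul(k, p):
    """Algorithm 10.7: right-to-left multiplication without doublings."""
    q = INF
    for r in tau_adic(k):
        if r == 1:
            q = add(q, p)
        elif r == -1:
            q = add(q, neg(p))
        p = frobenius(p)
    return q

def repeated_add(k, p):
    """Reference computation kP = P + P + ... + P."""
    q = INF
    for _ in range(k):
        q = add(q, p)
    return q

points = [(x, y) for x in range(2**M) for y in range(2**M)
          if mul(y, y) ^ mul(x, y) == mul(mul(x, x), x) ^ mul(x, x) ^ 1]

for p in points:
    assert koblitz_mul(17, p) == repeated_add(17, p)
```

In this sketch the only point operations inside the main loop are additions, subtractions, and coordinate squarings, as the summary above states.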

10.5

Example of Implementation

As an example of application of the finite-field operations, point-multiplication Algorithm 10.9 over a particular elliptic curve will be implemented. Consider the Koblitz curve

$$y^2 + xy = x^3 + x^2 + 1 \tag{10.70}$$

over $GF(2)$ and the extension field $L = GF(2^{163})$. A polynomial representation based on the irreducible polynomial

$$f(z) = z^{163} + z^7 + z^6 + z^3 + 1 \tag{10.71}$$

will be used.

10.5.1

Computation Resources

The computation primitives for executing the elliptic-curve operations are addition, multiplication, division, and squaring over $GF(2^m)$. The first one amounts to the component-by-component addition of the corresponding polynomials. The corresponding circuit is made up of $m$ XOR gates, and its computation time is equal to 1 clock cycle. For multiplying, the generic interleaved_mult.vhd model of Chap. 7 (Sec. 7.1.2) can be used. For dividing, a simplified version of binary Algorithm 6.5, adapted to the case where $p = 2$, is described in App. C. The corresponding generic model is binary_algorithm_polynomials.vhd. Appendix C also includes a specific algorithm for squaring over $GF(2^{163})$. The corresponding model is square_163_7_6_3.vhd.
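A software reference model of these primitives is convenient for checking simulation results of the hardware models. The Python sketch below represents an element of $GF(2^{163})$ as an integer whose bit $i$ is the coefficient of $z^i$, and implements addition, an MSB-first interleaved multiplication, and squaring modulo the polynomial of Eq. (10.71). It is an illustrative model only, not a transcription of the VHDL files named above, and the function names are assumptions.

```python
M = 163
# f(z) = z^163 + z^7 + z^6 + z^3 + 1, Eq. (10.71)
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def add(a, b):
    """Component-by-component addition: m XOR operations."""
    return a ^ b

def mult(a, b):
    """Interleaved multiplication: c := c*z + b(i)*a mod f, for i = m-1 .. 0."""
    c = 0
    for i in range(M - 1, -1, -1):
        c <<= 1
        if c >> M:          # degree reached 163: subtract (XOR) f
            c ^= F
        if (b >> i) & 1:
            c ^= a
    return c

def square(a):
    """Squaring, here simply computed as a product."""
    return mult(a, a)

# z * z^162 = z^163 = z^7 + z^6 + z^3 + 1 modulo f
print(hex(mult(0b10, 1 << 162)))
```

A dedicated squarer such as the one described in App. C exploits the fact that squaring only spreads the coefficients before reduction; the sketch ignores that optimization.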

10.5.2

Point Addition

A datapath for computing Eq. (10.16)

$\lambda = (y_1 + y_2)/(x_1 + x_2)$
$x_3 = \lambda^2 + \lambda + x_1 + x_2 + 1$
$y_3 = \lambda(x_1 + x_3) + x_3 + y_1$

is shown in Fig. 10.1. According to Eq. (10.69) and to the structure of the datapath, the computation time is approximately equal to

$$T_{point\text{-}addition} \approx m(T_{mod\text{-}f\text{-}product} + T_{mod\text{-}f\text{-}division}) \tag{10.72}$$

FIGURE 10.1 Point addition. (Datapath made up of a mod f(x) divider computing lambda, a squarer computing lambda_square, and a mod f(x) multiplier, with outputs x3 and y3.)

A VHDL model has been generated. The complete VHDL file K163_addition.vhd is available at www.arithmetic-circuits.org. The entity declaration is

    entity K163_addition is
    port(
      x1, y1, x2, y2: in std_logic_vector(m-1 downto 0);
      clk, reset, start: in std_logic;
      x3: inout std_logic_vector(m-1 downto 0);
      y3: out std_logic_vector(m-1 downto 0);
      done: out std_logic
    );
    end K163_addition;

The VHDL architecture corresponding to the circuit of Fig. 10.1 is the following:

    divider_inputs: for i in 0 to m-1 generate
      div_in1(i) <= y1(i) xor y2(i);
      div_in2(i) <= x1(i) xor x2(i);
    end generate;

    divider: binary_algorithm_polynomials port map(
      g => div_in1, h => div_in2,
      clk => clk, reset => reset, start => start_div,
      z => lambda, done => div_done
    );

    lambda_square_computation: classic_squarer port map(
      a => lambda, c => lambda_square
    );

    x_output: for i in 1 to 162 generate
      x3(i) <= lambda_square(i) xor lambda(i) xor div_in2(i);
    end generate;
    x3(0) <= not(lambda_square(0) xor lambda(0) xor div_in2(0));

    multiplier_inputs: for i in 0 to 162 generate
      mult_in2(i) <= x1(i) xor x3(i);
    end generate;

    multiplier: interleaved_mult port map(
      a => lambda, b => mult_in2,
      clk => clk, reset => reset, start => start_mult,
      z => mult_out, done => mult_done
    );

    y_output: for i in 0 to 162 generate
      y3(i) <= mult_out(i) xor x3(i) xor y1(i);
    end generate;

The complete model additionally includes a control unit.

10.5.3

Point Multiplication

As has been seen in Sec. 10.4.3.4, the branching conditions of Algorithm 10.9 can be expressed as functions of the least significant bits of $a$ and $b$ (recall that $\mu = 1$ for the curve of Eq. (10.70)):

if $a_0 = 0$, then $a' = b + a/2 = b + \lfloor a/2 \rfloor$ and $b' = -\lfloor a/2 \rfloor$
if $a_0 = 1$ and $a_1 \oplus b_0 = 0$, then $a' = b + (a - 1)/2 = b + \lfloor a/2 \rfloor$ and $b' = -\lfloor a/2 \rfloor$
if $a_0 = 1$ and $a_1 \oplus b_0 = 1$, then $a' = b + (a + 1)/2 = b + \lfloor a/2 \rfloor + 1$ and $b' = -(\lfloor a/2 \rfloor + 1)$

The corresponding modified algorithm, in which div_2(a) computes $\lfloor a/2 \rfloor$, is the following. The bit $a_1$ is tested as a_div_2 mod 2, that is, on the floor of $a/2$, so that the test also holds for negative values of $a$.

Algorithm 10.10—Point multiplication (Q = kP), Koblitz curve, τ-ary representation of k
    Q_infinity := true; a := k; b := 0;
    while ((a /= 0) or (b /= 0)) loop
      a_div_2 := div_2(a);
      if a mod 2 = 0 then
        a := b + a_div_2; b := -a_div_2;
      elsif a_div_2 mod 2 = b mod 2 then
        if Q_infinity then
          xQ := xP; yQ := yP; Q_infinity := false;
        else
          point_addition(xP, yP, xQ, yQ, f, new_xQ, new_yQ);
          xQ := new_xQ; yQ := new_yQ;
        end if;
        a := b + a_div_2; b := -a_div_2;
      else
        if Q_infinity then
          xQ := xP; yQ := add(xP, yP); Q_infinity := false;
        else
          point_addition(xP, add(xP, yP), xQ, yQ, f, new_xQ, new_yQ);
          xQ := new_xQ; yQ := new_yQ;
        end if;
        a := b + a_div_2 + 1; b := -(a_div_2 + 1);
      end if;
      xP := product_mod_f(xP, xP, f);
      yP := product_mod_f(yP, yP, f);
    end loop;
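The equivalence between the bit-level update of Algorithm 10.10 and the arithmetic update of Algorithm 10.5 (with $\mu = 1$) can be checked numerically. The Python sketch below keeps only the recoding part of both algorithms and compares the digit sequences; the function names and the explicit `carry` variable (named after the signal of the datapath) are assumptions of this illustration.

```python
def digits_alg_10_5(k):
    """Digit generation of Algorithm 10.5 with mu = 1."""
    a, b, out = k, 0, []
    while a != 0 or b != 0:
        r = 0 if a % 2 == 0 else 2 - ((a - 2 * b) % 4)
        out.append(r)
        a, b = b + (a - r) // 2, (r - a) // 2
    return out

def digits_alg_10_10(k):
    """Digit generation of Algorithm 10.10: tests on a0, a1, b0 only."""
    a, b, out = k, 0, []
    while a != 0 or b != 0:
        a_div_2 = a >> 1                    # floor(a/2), also for negative a
        if a & 1 == 0:
            r, carry = 0, 0
        elif (a_div_2 & 1) == (b & 1):      # a1 xor b0 = 0
            r, carry = 1, 0
        else:                               # a1 xor b0 = 1
            r, carry = -1, 1
        out.append(r)
        a, b = b + a_div_2 + carry, -(a_div_2 + carry)
    return out

print(digits_alg_10_10(17))
```

Python's `>>` and `&` operate on the two's complement value of negative integers, which matches the 2s complement registers assumed for $a$ and $b$ in the datapath.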

An executable Ada file frobenius_point_multiplication3.adb, including Algorithm 10.10, is available at www.arithmetic-circuits.org.

In order to implement the preceding algorithm, an upper bound of $a$ and $b$ must be known. It can be demonstrated that

$$-2^m \le a < 2^m \quad \text{and} \quad -2^{m-1} \le b < 2^{m-1} \tag{10.73}$$

so that $a$ is an $(m + 1)$-bit 2s complement number and $b$ an $m$-bit 2s complement number. A datapath for executing Algorithm 10.10 is shown in Fig. 10.2.

FIGURE 10.2 Point multiplication. (Datapath made up of the registers a, b, xxP, yyP, xQ, yQ, and Q_infinity, a point-addition block, two squarers, and a flag-generation block, with next_a = b + a_div_2 + carry and next_b = −(a_div_2 + carry).)

According to Eqs. (10.69) and (10.72), the computation time is approximately equal to

$$T \approx mT_{point\text{-}addition} \approx m^2(T_{mod\text{-}f\text{-}product} + T_{mod\text{-}f\text{-}division}) \tag{10.74}$$

A VHDL model has been generated. The complete VHDL file K163_point_multiplication.vhd is available at www.arithmetic-circuits.org. The entity declaration is

  entity K163_point_multiplication is
  port (
    xP, yP, k: in std_logic_vector(m-1 downto 0);
    clk, reset, start: in std_logic;
    xQ, yQ: inout std_logic_vector(m-1 downto 0);
    done: out std_logic
  );
  end K163_point_multiplication;

The VHDL architecture corresponding to the circuit of Fig. 10.2 follows:

  xor_gates: for i in 0 to m-1 generate
    xxPxoryyP(i) <= xxP(i) xor yyP(i);
  end generate;
  with sel_1 select y1 <= yyP when '0', xxPxoryyP when others;
  with sel_2 select next_yQ <= y3 when "00", yyP when "01", xxPxoryyP when others;
  with sel_2 select next_xQ <= x3 when "00", xxP when others;
  first_component: K163_addition port map(
    x1 => xxP, y1 => y1, x2 => xQ, y2 => yQ, clk => clk, reset => reset,
    start => start_addition, x3 => x3, y3 => y3, done => addition_done );
  second_component: classic_squarer port map( a => xxP, c => square_xxP );
  third_component: classic_squarer port map( a => yyP, c => square_yyP );
  register_P: process(clk)
  begin
    if clk'event and clk = '1' then
      if load = '1' then xxP <= xP; yyP <= yP;
      elsif ce_P = '1' then xxP <= square_xxP; yyP <= square_yyP;
      end if;
    end if;
  end process;
  register_Q: process(clk)
  begin
    if clk'event and clk = '1' then
      if load = '1' then Q_infinity <= '1';
      elsif ce_Q = '1' then xQ <= next_xQ; yQ <= next_yQ; Q_infinity <= '0';
      end if;
    end if;
  end process;
  divide_by_2: for i in 0 to m-1 generate
    a_div_2(i) <= a(i + 1);
  end generate;
  a_div_2(m) <= a(m);
  next_a <= (b(m-1)&b) + a_div_2 + carry;
  next_b <= zero - (a_div_2(m-1 downto 0) + carry);
  register_ab: process(clk)
  begin
    if clk'event and clk = '1' then
      if load = '1' then a <= ('0'&k); b <= zero;
      elsif ce_ab = '1' then a <= next_a; b <= next_b;
      end if;
    end if;
  end process;
  aEqual0 <= '1' when a = 0 else '0';
  bEqual0 <= '1' when b = 0 else '0';
  a1xorb0 <= a(1) xor b(0);

The complete model additionally includes a control unit.

Comment 10.4 In order to minimize the computation time, the circuit should be slightly modified. The part of the circuit that computes next_a and next_b is the critical path that defines the clock period. If ripple-carry adders are used, the minimum clock period is of the order of $m T_{FA}$. Nevertheless, the updating of a and b is performed in parallel with the updating of xQ and yQ, and in many cases the computation of xQ and yQ is executed by the point-addition component. In those cases it is not necessary to compute next_a and next_b in one cycle: the updating of a and b (ce_ab = 1) can be done at the same time as the updating of xQ and yQ (ce_Q = 1). The problem arises when the updating of xQ and yQ is done in one cycle, that is, when a is even (xQ and yQ do not change) and when Q = ∞ (xQ = xP and yQ = yP or xP + yP). A simple solution consists of adding no-operation states to the control unit in such a way that every time a is even or Q = ∞, the updating of a and b (ce_ab = 1) is delayed by a number s of cycles, $s T_{CLK}$ being greater than the computation time of next_a and next_b.

10.6 FPGA Implementation

The complete point multiplication circuit of Fig. 10.2 has been implemented within a Spartan3 (speed-5) programmable device with P defined by the following hexadecimal coordinates:

  xP = 2fe13c0537bbc11acaa07d793de4e6d5e5c94eee8
  yP = 289070fb05d38ff58321f2e800536d538ccdaa3d9

The order of P is equal to

  n = 4000000000000000000020108a2e0cc0d99f8a5ef

The circuit computes kP for any k belonging to the interval 0 < k < n. Because the number of cycles depends on the value of k, average values have been computed. As before, the times (Period, AverTime) are expressed in ns. All the source files are available at www.arithmetic-circuits.org.

  FFs     LUTs    Slices   Period   AverCycles   AverTime
  2,170   3,514   2,062    7.9      54,422.8     429,940


APPENDIX A

$p = 2^{192} - 2^{64} - 1$

A.1 Hexadecimal Representation

$p = 2^{192} - 2^{64} - 1 = 16^{48} - 16^{16} - 1$, where

  $16^{48} = [100 \ldots 000 \ldots 000]_{16}$
  $16^{48} - 1 = [FF \ldots FFF \ldots FFF]_{16}$
  $p = (16^{48} - 1) - 16^{16} = [FF \ldots FEF \ldots FFF]_{16}$

that is, $p_i = F$ for i = 0 to 15 and i = 17 to 47, and $p_{16} = E$.
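The digit pattern can be confirmed with a few lines of Python. This is only a quick software check of the hexadecimal expansion; the variable names are illustrative.

```python
p = 2**192 - 2**64 - 1

# 48 hexadecimal digits, most significant first
digits = f"{p:048X}"

# p_i is the hexadecimal digit of weight 16**i
p_i = {i: digits[47 - i] for i in range(48)}

print(digits)
print(p_i[16], p_i[15], p_i[17])
```

The printed string is the one used for the constant M in the VHDL packages of this appendix.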

A.2 mod p Reduction

A.2.1 Generic Sequential Circuit

In order to avoid the use of long-operand multipliers, nonrestoring or SRT reducers should be used. Both types of reducers, for k = 192 and n = 384, have been implemented. The packages storing the parameter values are the following:

  package nr_reducer_parameters is
    constant N: natural := 384;
    constant K: natural := 192;
    --COUNTER_SIZE is the number of bits of N-K-1
    constant COUNTER_SIZE: natural := 8;
  end nr_reducer_parameters;

  package srt_reducer_parameters is
    constant N: natural := 384;
    constant K: natural := 192;
    --COUNTER_SIZE is the number of bits of N-K-1
    constant COUNTER_SIZE: natural := 8;
  end srt_reducer_parameters;

Recall that in the second case (SRT) the final steps (decoding from stored-carry form to normal form and correction if the obtained result is negative) are computed with carry-propagate adders and that the corresponding delays could be greater than the clock period. As mentioned in Comment 2.1, some kind of synchronization of the final operations should be introduced, for example, adding s clock periods with s such that $s T_{CLK} > T_{final\ steps}$.

A.2.2 Specific Combinational Circuit

Another option is the specific circuit described in Sec. 2.6.2.

A.2.3 FPGA Implementation

All three circuits have been implemented within Spartan3 (speed-5) programmable devices (Table A.1). The times (total time) are expressed in ns. The parameters FFs and LUTs represent the number of flip-flops and look-up tables, respectively. Every slice includes two flip-flops and two look-up tables. All the source files are available at www.arithmetic-circuits.org.

                 FFs    LUTs    Slices   Total time
  Nonrestoring   391    1,157   679      4166.4
  SRT            583    2,525   1,365    1574.4
  Specific       None   648     642      45

TABLE A.1 mod ($2^{192} - 2^{64} - 1$) Reducers

A.3 mod p Addition and Subtraction

The adder-subtractor of Fig. 3.3 has been implemented. The package storing the parameter values includes the following constant definitions:

  constant K: integer := 192;
  constant M: std_logic_vector(k-1 downto 0) := X"fffffffffffffffffffffffffffffffeffffffffffffffff";

The implementation results are the following (Spartan3, speed-5):

  LUTs   Slices   Total time
  25     13       9

All the source files are available at www.arithmetic-circuits.org.

A.4 mod p Multiplication

A.4.1 Generic Circuit

Three sequential generic circuits have been described in Chap. 3. The corresponding entities are csa_mod_multiplier, dar_mod_multiplier, and dar_csa_multiplier. The package storing the parameter values includes the following constant definitions:

  constant k: integer := 192;
  --logk is the number of bits of k-1
  constant logk: integer := 8;
  constant m: std_logic_vector(k+1 downto 0) := "00" & X"fffffffffffffffffffffffffffffffeffffffffffffffff";
  --minus_m = 2**(k+2) - m
  constant minus_m: std_logic_vector(k+1 downto 0) := "11" & X"000000000000000000000000000000010000000000000001";

The implementation results are the following (Spartan3, speed-5) (Table A.2):

            FFs     LUTs    Slices   Period   Cycles   Total time
  csa_mod   1,271   3,678   2,053    6.233    384      2393.5
  dar_mod   400     593     400      23.615   384      9068.2
  dar_csa   597     1,835   1,113    9.796    384      3761.7

TABLE A.2 Cost and Delay of mod $2^{192} - 2^{64} - 1$ Multipliers

All the source files are available at www.arithmetic-circuits.org.

A.4.2 Specific Circuit

Another method consists of multiplying x by y, and then reducing mod p with a specific combinational circuit. For that, the carry-save shift-and-add multiplier of Fig. 3.5 and the mod p reducer of Sec. 2.6.2 can be used. An additional ripple-carry adder is necessary for summing up the outputs pc and ps of the carry-save adder. A complete VHDL file csa_modp192_multiplier is available at www.arithmetic-circuits.org. The entity declaration is

  entity csa_modp192_multiplier is
  port (
    x, y: in std_logic_vector(191 downto 0);
    clk, reset, start: in std_logic;
    z: out std_logic_vector(191 downto 0);
    done: inout std_logic
  );
  end csa_modp192_multiplier;

and the corresponding architecture is

  first_step: modif_csa_multiplier port map(
    x => x, y => y, clk => clk, reset => reset, start => start1,
    ps => ps, pc => pc, p => p, done => done1 );
  sum <= ps + pc;
  mult <= sum & p;
  second_step: mod_p192_reducer port map(x => mult, z => z);

The complete model also includes a control unit in charge of the done flag generation. The implementation results are the following (Spartan3, speed-5):

  FFs   LUTs    Slices   Period   Cycles   Total time
  584   1,828   1,355    5.9      192      1132.8

A.5 mod p Exponentiation

The Montgomery method is used. The following values must be previously computed:

  exp_k = $2^k \bmod p = 2^{192} \bmod (2^{192} - 2^{64} - 1) = 2^{64} + 1 = 16^{16} + 1$
  exp_2k = $2^{2k} \bmod p = (16^{16} + 1)^2 = 16^{32} + 2 \cdot 16^{16} + 1$

The package storing the parameter values includes the following constant definitions:

  constant K: integer := 192;
  --logK is the number of bits of K
  constant logK: integer := 8;
  constant M: std_logic_vector(K-1 downto 0) := X"fffffffffffffffffffffffffffffffeffffffffffffffff";
  --minus_m = 2**k - m
  constant minus_M: std_logic_vector(K downto 0) := '0' & X"000000000000000000000000000000010000000000000001";
  constant one: std_logic_vector(K-1 downto 0) := conv_std_logic_vector(1, K);
  --exp_k = 2**k mod m
  constant exp_K: std_logic_vector(K-1 downto 0) := X"000000000000000000000000000000010000000000000001";
  --exp_2k = 2**(2*k) mod m
  constant exp_2K: std_logic_vector(K-1 downto 0) := X"000000000000000100000000000000020000000000000001";
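The precomputed Montgomery constants can be cross-checked in software. The short Python sketch below recomputes exp_k, exp_2k, and minus_m with the built-in modular exponentiation; it is a verification aid, not part of the VHDL model, and the lowercase variable names are illustrative.

```python
K = 192
p = 2**K - 2**64 - 1

exp_k = pow(2, K, p)          # 2**k mod p
exp_2k = pow(2, 2 * K, p)     # 2**(2k) mod p
minus_m = 2**K - p            # 2**k - m, as in the comment of the package

# Print the constants as 48-digit hexadecimal strings, as in the package
for name, value in (("exp_k", exp_k), ("exp_2k", exp_2k), ("minus_m", minus_m)):
    print(f"{name:8s} = {value:048x}")
```

Because $2^{192} = p + 2^{64} + 1$, the values exp_k and minus_m coincide, which is why the same hexadecimal string appears twice in the package.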

Both the MSB-first and LSB-first algorithms have been implemented (Spartan3, speed-5) (Table A.3).

              FFs     LUTs    Slices   Period   Cycles   Total time
  MSB-first   1,185   1,993   1,199    8.176    73,733   602,841
  LSB-first   1,779   3,554   1,983    8.871    36,869   327,065

TABLE A.3 Cost and Delay of mod $2^{192} - 2^{64} - 1$ Exponentiators

All the source files are available at www.arithmetic-circuits.org.

A.6 mod p Division

Four sequential generic circuits have been described in Chap. 4. The corresponding entities are Euclidean_divider, binary_algorithm, plus_minus, and Fermat_divider. All four circuits have been implemented within Spartan3 (speed-5) programmable devices (Table A.4).

               FFs     LUTs    Slices   Period   AverCycles   AverTime
  Euclidean    2,703   4,205   2,644    26.1     43,782.7     1,142,728
  Binary       771     3,401   2,091    19.9     404.3        8,046
  Plus-minus   798     2,016   1,103    26.2     260.7        6,831
  Fermat       1,143   2,012   1,460    19.4     113,483      2,201,570

TABLE A.4 Cost and Delay of mod $2^{192} - 2^{64} - 1$ Dividers

In terms of average computation time, the best option is the plus-minus algorithm. In this case p mod 4 = 3 (the least significant bits of p are 11) so that Eq. (4.31) is used for computing $w \cdot 4^{-1} \bmod p$. The package storing the parameter values includes the following constant definitions:

  constant K: natural := 192;
  constant P: std_logic_vector(K downto 0) := '0' & x"fffffffffffffffffffffffffffffffeffffffffffffffff";
  --LOGK+1 bits for representing integers between -k and k
  constant LOGK: natural := 9;
  --MINUS_P = -p
  constant MINUS_P: std_logic_vector(K+1 downto 0) := ('1' & not P) + '1';
  --TWO_P = 2.p
  constant TWO_P: std_logic_vector(K+1 downto 0) := P & '0';
  --if p mod 4 = 3:
  constant pp1: std_logic_vector (K+1 downto 0) := p(k)&p;
  constant pp3: std_logic_vector (K+1 downto 0) := MINUS_P;

All the source files are available at www.arithmetic-circuits.org.

APPENDIX B

Optimal Extension Fields

B.1 $GF(239^{17})$

B.1.1 VHDL Models and Constant Definitions

Several VHDL models have been generated for circuits executing operations over $GF(239^{17})$, that is, the set of polynomials of degree smaller than 17 over the field $Z_{239}$, modulo the irreducible polynomial $f(x) = x^{17} - 2$. The MSE_first_mod_f_multiplier.vhd file includes the following entities: mod_239_reducer, adder_subtractor, mod_239_multiplier, MSE_first_mod_f_multiplier, LSE_first_mod_f_multiplier. The parameter values are the following:

  constant K: natural := 8;
  constant P: std_logic_vector(k-1 downto 0) := conv_std_logic_vector(239, k);
  constant M: natural := 17;
  type long_polynomial is array(M downto 0) of std_logic_vector(K-1 downto 0);
  type polynomial is array(M-1 downto 0) of std_logic_vector(K-1 downto 0);
  --LOGM is the number of bits of m-1
  constant LOGM: natural := 5;
  constant f: long_polynomial := (
    "00000001", "00000000", "00000000", "00000000", "00000000", "00000000",
    "00000000", "00000000", "00000000", "00000000", "00000000", "00000000",
    "00000000", "00000000", "00000000", "00000000", "00000000", "11101101" );

The oef.vhd file includes the following additional entities: mod_239_inverter and oef. The latter is a divider over the particular optimal extension field under consideration. The mod 239 inversion is performed with a table storing the 238 inverses. The table is modeled using a block RAM component available within the unisim library of Xilinx. For another semiconductor or programmable device vendor, the table model should be modified. The package storing the parameter values includes the Frobenius coefficients:

  constant f1: polynomial := (
    x"43", x"bb", x"65", x"4b", x"06", x"a3", x"a6", x"80",
    x"d3", x"24", x"16", x"28", x"33", x"47", x"d8", x"84",
    x"01" );
  constant f2: polynomial := (
    x"bb", x"4b", x"a3", x"80", x"24", x"28", x"47", x"84",
    x"43", x"65", x"06", x"a6", x"d3", x"16", x"33", x"d8",
    x"01" );
  constant f4: polynomial := (
    x"4b", x"80", x"28", x"84", x"65", x"a6", x"16", x"d8",
    x"bb", x"a3", x"24", x"47", x"43", x"06", x"d3", x"33",
    x"01" );
  constant f8: polynomial := (
    x"80", x"84", x"a6", x"d8", x"a3", x"47", x"06", x"33",
    x"4b", x"28", x"65", x"16", x"bb", x"24", x"43", x"d3",
    x"01" );
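The Frobenius coefficients can be regenerated in software. The Python sketch below assumes that they follow the same rule as the one stated in Sec. B.2.1 for the other optimal extension field, namely coefficient j of constant f_i equal to $b^{ji} \bmod p$ with $b = c^{\lfloor p/m \rfloor} \bmod p$, here with p = 239, m = 17, and c = 2. The helper name frobenius_row is illustrative.

```python
P, M, C = 239, 17, 2                # field Z_239, f(x) = x**17 - 2

b = pow(C, P // M, P)               # c**floor(p/m) mod p

def frobenius_row(i):
    """Coefficients b**(j*i) mod p, listed from j = M-1 down to j = 0,
    that is, in the order of the VHDL array (M-1 downto 0)."""
    return [pow(b, j * i, P) for j in range(M - 1, -1, -1)]

for i in (1, 2, 4, 8):
    row = ", ".join(f'x"{v:02x}"' for v in frobenius_row(i))
    print(f"f{i}: {row}")
```

Under this assumption the four printed rows can be compared entry by entry with the constants f1, f2, f4, and f8 of the package.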

B.1.2 FPGA Implementations

Combinational circuits have been used for implementing the 16-bit to 8-bit mod 239 reducer, the mod 239 adder-subtractor, and the mod 239 multiplier (Table B.1):

  Operation          LUTs   Slices   Total time
  Reducer            63     37       17.1
  Adder-subtractor   25     13       9
  Multiplier         31     18       15

TABLE B.1 Cost and Delay of mod 239 Operators

As noted above, the mod 239 inverter is a table storing the 238 inverses.

As regards the mod f(x) operations, two serial multipliers have been implemented: MSE-first and LSE-first (Table B.2):

              FFs   LUTs    Slices   Mult   Period   Cycles   Total time
  MSE-first   303   1,690   937      17     26.1     34       888
  LSE-first   415   1,779   1,061    18     19.4     17       330

TABLE B.2 Cost and Delay of Serial mod f(x) Multipliers

Finally, several dividers have been implemented (Sec. 6.4) (Table B.3):

                                       FFs   LUTs    Slices   Mult   RAM   Period   Cycles   Total time
  Pseudo Euclidean                     871   3,923   2,272    39     1     36       147      5,292
  Binary                               623   3,235   2,001    37     −     56       37       2,072
  Reduction to multiplications (MSE)   562   2,607   1,594    34     1     25       7,602    190,050
  Reduction to multiplications (LSE)   672   2,794   1,754    35     1     19       4,202    79,838
  Optimal extension field (MSE)        603   2,873   1,609    34     1     25       235      5,875
  Optimal extension field (LSE)        715   3,268   1,894    35     1     19       133      2,527

TABLE B.3 Cost and Delay of mod f(x) Dividers

All the source files are available at www.arithmetic-circuits.org.

B.2 $GF((2^{32} - 387)^6)$

B.2.1 Constants

$GF((2^{32} - 387)^6)$ is the set of polynomials of degree smaller than 6 over the field $Z_p$, where $p = 2^{32} - 387$, modulo the irreducible polynomial $f(x) = x^6 - 2$. Thus,

  $p = 2^{32} - 387 = [FFFFFE7D]_{16}$, m = 6, c = 2

Other important values are the Frobenius constants (Sec. 6.4)

  $f_{ji} = c^{jt} \bmod p$ with $t = \lfloor p^i/m \rfloor$

It can be shown that

  $f_{ji} = b^{ji} \bmod p$ with $b = f_{11} = c^{\lfloor p/m \rfloor} \bmod p$

The following values have been computed:

  $\lfloor p/m \rfloor = \lfloor FFFFFE7D/6 \rfloor = 2AAAAA6A$
  $b = 2^{2AAAAA6A} \bmod FFFFFE7D = 9CD1D682$

so that

  $f_{ji} = (9CD1D682)^{ji} \bmod FFFFFE7D$, for all i, j in 1 to 5

and

  f11 = 9CD1D682   f12 = 9CD1D681   f13 = FFFFFE7C   f14 = 632E27FB   f15 = 632E27FC
  f21 = 9CD1D681   f22 = 632E27FB   f23 = 00000001   f24 = 9CD1D681   f25 = 632E27FB
  f31 = FFFFFE7C   f32 = 00000001   f33 = FFFFFE7C   f34 = 00000001   f35 = FFFFFE7C
  f41 = 632E27FB   f42 = 9CD1D681   f43 = 00000001   f44 = 632E27FB   f45 = 9CD1D681
  f51 = 632E27FC   f52 = 632E27FB   f53 = FFFFFE7C   f54 = 9CD1D681   f55 = 9CD1D682
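The table of constants can be regenerated from b alone. The Python sketch below takes the value b = 9CD1D682 quoted in the text as given (it does not recompute it from c), checks that it satisfies $b^2 - b + 1 \equiv 0 \pmod p$, the relation that makes the powers of b repeat with period 6, and then prints the 5 × 5 table. The function name f is illustrative.

```python
P = 2**32 - 387                 # [FFFFFE7D]_16
M = 6
B = 0x9CD1D682                  # b = f11, value quoted in the text

def f(j, i):
    """Frobenius constant f_ji = b**(j*i) mod p."""
    return pow(B, j * i, P)

# b**2 = b - 1 (mod p), hence b**3 = -1 and b**6 = 1 (mod p)
assert (B * B - B + 1) % P == 0

for j in range(1, M):
    print(" ".join(f"f{j}{i} = {f(j, i):08X}" for i in range(1, M)))
```

Only six distinct values appear, which explains the many repeated entries in the table.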

B.2.2 mod p Reduction

In order to avoid the use of long-operand multipliers, nonrestoring or SRT reducers should be used. Nonrestoring and SRT reducers for k = 32 and n = 64 have been implemented. The parameter values are the following:

  constant N: natural := 64;
  constant K: natural := 32;
  -- COUNTER_SIZE is the number of bits of N-K-1
  constant COUNTER_SIZE: natural := 5;

It is important to observe that the final steps (decoding from stored-carry form to normal form and correction if the obtained result is negative) are computed with carry-propagate adders and that the corresponding delays could be greater than the clock period. As mentioned in Comment 2.1, some kind of synchronization of the final operations should be introduced, for example, adding s clock periods with s such that $s T_{CLK} > T_{final\ steps}$. The implementation results are the following (Table B.4):

                 FFs   LUTs   Slices   Period   Cycles   Total time
  Nonrestoring   70    197    101      7.7      32       246.4
  SRT            100   425    216      6.2      32       198.4

TABLE B.4 Cost and Delay of 64-bit to 32-bit mod Reducers

B.2.3 mod p Addition and Subtraction

The adder-subtractor of Fig. 3.3 has been implemented. The package storing the parameter values is the following:

  package addsub_parameters is
    constant k: integer := 32;
    constant m: std_logic_vector(k-1 downto 0) := X"fffffe7d";
  end addsub_parameters;

  LUTs   Slices   Total time
  97     49       9

B.2.4 mod p Multiplication

The double, add, and reduce algorithm, with stored-carry encoding (Eq. 3.10), is used. The package storing the parameter values is the following:

  package dar_csa_multiplier_parameters is
    constant k: integer := 32;
    --logk is the number of bits of k-1
    constant logk: integer := 5;
    constant m: std_logic_vector(k+1 downto 0) := "00" & X"fffffe7d";
    --minus_m = 2**(k+2) - m
    constant minus_m: std_logic_vector(k+1 downto 0) := "11" & X"00000183";
  end dar_csa_multiplier_parameters;

Some kind of synchronization of the final operations should be introduced.

  FFs   LUTs   Slices   Period   Cycles   Total time
  109   322    167      6.1      64       390.4

B.2.5 mod p Division

The plus-minus algorithm is used. The following values are previously computed:

  minus_p = $2^{k+2} - p = 2^{34} - (2^{32} - 387) = 3 \cdot 2^{32} + 387 = [300000183]_{16}$
  two_p = $2p = 2 \cdot [FFFFFE7D]_{16} = [1FFFFFCFA]_{16}$

In this case, p mod 4 = 1 (the least significant bits of p are 01) so that Eq. (4.28) is used for computing $w \cdot 4^{-1} \bmod p$. The parameter values are the following:

  constant K: natural := 32;
  constant P: std_logic_vector(K downto 0) := '0' & X"fffffe7d";
  --LOGK+1 bits for representing integers between -k and k
  constant LOGK: natural := 6;
  constant MINUS_P: std_logic_vector(K+1 downto 0) := ('1' & not P) + '1';
  constant TWO_P: std_logic_vector(K+1 downto 0) := P & '0';
  constant MINUS_ONE: std_logic_vector(LOGK downto 0) := conv_std_logic_vector(-1, LOGK+1);
  constant MINUS_TWO: std_logic_vector(LOGK downto 0) := conv_std_logic_vector(-2, LOGK+1);
  --if p mod 4 = 1:
  constant pp1: std_logic_vector (K+1 downto 0) := MINUS_P;
  constant pp3: std_logic_vector (K+1 downto 0) := p(k)&p;
  end plus_minus_Parameters;
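The constants of the plus-minus divider can be checked with integer arithmetic. The Python sketch below also mimics, on plain integers, the VHDL expression ('1' & not P) + '1', under the reading that P is the 33-bit vector '0' & p and the result is a 34-bit vector; the variable names are illustrative.

```python
K = 32
p = 2**32 - 387

minus_p = 2**(K + 2) - p        # 2's complement of p on K+2 = 34 bits
two_p = 2 * p

# Integer model of ('1' & not P) + '1', with P = '0' & p on K+1 bits
not_P = (2**(K + 1) - 1) ^ p    # bitwise complement on K+1 bits
vhdl_minus_p = ((1 << (K + 1)) | not_P) + 1

print(f"minus_p = {minus_p:X}")
print(f"two_p   = {two_p:X}")
print(f"p mod 4 = {p % 4}")
```

The same model applies to the mod $(2^{192} - 2^{64} - 1)$ divider of Appendix A with K = 192, where p mod 4 = 3.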

The cost and delay of several mod ($2^{32} - 387$) dividers are shown in Table B.5.

               FFs   LUTs   Slices   Period   AverCycles   AverTime
  Euclidean    459   729    435      5.1      1,348.3      10,517
  Binary       130   510    285      8.9      65.3         582
  Plus-minus   151   402    206      13.9     46.5         647
  Fermat       228   445    267      8.8      3,457        30,422

TABLE B.5 Cost and Delay of mod ($2^{32} - 387$) Dividers

B.2.6 mod ($x^6 - 2$) Multiplication

The MSE-first serial multiplier of Chap. 5 is used. The package storing the parameter values is the following:

  package mod_f_multiplier_parameters_package is
    constant k: natural := 32;
    constant p: std_logic_vector(k-1 downto 0) := X"fffffe7d";
    constant m: natural := 6;
    type long_polynomial is array(m downto 0) of std_logic_vector(k-1 downto 0);
    type polynomial is array(m-1 downto 0) of std_logic_vector(k-1 downto 0);
    --logm is the number of bits of m-1
    constant logm: natural := 3;
    constant f: long_polynomial := (X"00000001", X"00000000", X"00000000",
      X"00000000", X"00000000", X"00000000", X"fffffe7b");
  end mod_f_multiplier_parameters_package;

  FFs   LUTs   Slices   Period   Cycles   Total time
  340   760    518      15.8     2,455    38,789

B.2.7 mod ($x^6 - 2$) Division

In this case, $r = 1 + p + p^2 + p^3 + p^4 + p^5$ so that $h(x)^{r-1}$ can be computed as follows:

  $d_0(x) = h(x)$
  $d_1(x) = d_0(x)^p = h(x)^p$
  $d_2(x) = d_0(x) d_1(x) = h(x)^{1+p}$
  $d_3(x) = d_2(x)^{p^2} = h(x)^{p^2+p^3}$
  $d_4(x) = d_2(x) d_3(x) = h(x)^{1+p+p^2+p^3}$
  $d_5(x) = d_4(x)^p = h(x)^{p+p^2+p^3+p^4}$
  $d_6(x) = h(x) d_5(x) = h(x)^{1+p+p^2+p^3+p^4}$
  $d_7(x) = d_6(x)^p = h(x)^{p+p^2+p^3+p^4+p^5} = h(x)^{r-1}$
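The chain can be verified by tracking exponents only. In the Python sketch below each $d_i$ is represented by the exponent of h(x) it stands for: a product of two terms adds their exponents, and raising to the power p or $p^2$ multiplies the exponent. The variable names mirror the equations; no field arithmetic is performed.

```python
p = 2**32 - 387
r = sum(p**i for i in range(6))     # 1 + p + p**2 + p**3 + p**4 + p**5

d0 = 1                  # h
d1 = d0 * p             # d0 ** p
d2 = d0 + d1            # d0 * d1
d3 = d2 * p**2          # d2 ** (p**2)
d4 = d2 + d3            # d2 * d3
d5 = d4 * p             # d4 ** p
d6 = 1 + d5             # h * d5
d7 = d6 * p             # d6 ** p

print(d7 == r - 1)
```

The chain uses three products and four Frobenius powers, which matches the sequence of steps of the division algorithm that follows, where two further products compute $h^r$ and the final quotient.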

The corresponding division algorithm, similar to Algorithm 6.8, is the following.

Algorithm B.1—mod f(x) division

  a := h;
  for j in 0 .. 5 loop e(j) := (a(j)*frobenius(j,1)) mod p; end loop;
  a := product_mod_f(e, a, f);
  for j in 0 .. 5 loop e(j) := (a(j)*frobenius(j,2)) mod p; end loop;
  a := product_mod_f(e, a, f);
  for j in 0 .. 5 loop e(j) := (a(j)*frobenius(j,1)) mod p; end loop;
  a := product_mod_f(e, h, f);
  for j in 0 .. 5 loop e(j) := (a(j)*frobenius(j,1)) mod p; end loop;
  -- e = h^(r-1)
  a := product_mod_f(e, h, f);
  -- a = h^r
  inv := invert(a(0)); a := e;
  -- inv = h^(-r), a = h^(r-1)
  for j in 0 .. 5 loop e(j) := (a(j)*inv) mod p; end loop;
  -- e = h^(r-1).h^(-r) = h^(-1)
  z := product_mod_f(e, g, f);
  -- z = h^(-1).g

An example of datapath corresponding to Algorithm B.1 is shown in Fig. B.1. The total computation time approximately amounts to five times the computation time of a mod f(x) multiplier.

FIGURE B.1 Datapath. (mod p multiplier with inputs a_j and either f_ji or inv, selected by sel_e, and output product_p; registers e5 to e0, loaded when j = 5 to j = 0; mod f(x) multiplier with inputs e(x) and one of a(x), h(x), g(x), selected by sel_ahg, and output product_f; mod p inverter operating on a0; register a(x), initially h(x); output z(x).)

A VHDL model has been generated. The complete VHDL file mod_f_divider.vhd is available at www.arithmetic-circuits.org. The entity declaration is

  entity mod_f_divider is
  port(
    g, h: in polynomial;
    clk, reset, start: in std_logic;
    z: out polynomial;
    done: out std_logic
  );
  end mod_f_divider;

The VHDL architecture corresponding to the circuit of Fig. B.1 is the following:

  with j select aj <= a(0) when "000", a(1) when "001", a(2) when "010",
    a(3) when "011", a(4) when "100", a(5) when others;
  with j select fji <= f12(0) when "000", f12(1) when "001", f12(2) when "010",
    f12(3) when "011", f12(4) when "100", f12(5) when others;
  fji_selection: for i in 1 to 5 generate
    with sel_f select f12(i) <= f1(i) when '0', f2(i) when others;
  end generate;
  f12(0) <= x"00000001";
  with sel_e select in2 <= fji when '0', inv when others;
  with sel_ahg select ahg <= a when "00", h when "01", g when others;
  with sel_a select next_a <= e when '1', product_f when others;
  first_component: dar_csa_multiplier port map(
    x => aj, y => in2, clk => clk, reset => reset,
    start => start_mult_p, z => product_p, done => mult_p_done );
  second_component: mod_f_multiplier port map(
    a => e, b => ahg, clk => clk, reset => reset,
    start => start_mult_f, z => product_f, done => mult_f_done );
  third_component: plus_minus port map(
    x => x"00000001", y => a(0), clk => clk, reset => reset,
    start => start_inv, z => next_inv, done => inv_done );
  register_a: process(clk)
  begin
    if clk'event and clk = '1' then
      if load = '1' then a <= h;
      elsif ce_a = '1' then a <= next_a;
      end if;
    end if;
  end process;
  iteration: for index in 0 to 5 generate
    registers_e: process(clk)
    begin
      if clk'event and clk = '1' then
        if ce_e = '1' and j = index then e(index) <= product_p; end if;
      end if;
    end process;
  end generate;
  register_inv: process(clk)
  begin
    if clk'event and clk = '1' then
      if ce_inv = '1' then inv <= next_inv; end if;
    end if;
  end process;
  counter: process(clk)
  begin
    if clk'event and clk = '1' then
      if load = '1' then j <= "101";
      elsif update = '1' then
        if j = 0 then j <= "101"; else j <= j-1; end if;
      end if;
    end if;
  end process;
  z <= product_f;

The complete model additionally includes a control unit. The package storing the parameter values follows:

  package mod_f_divider_parameters is
    constant k: natural := 32;
    constant p: std_logic_vector(k-1 downto 0) := x"fffffe7d";
    constant m: natural := 6;
    type long_polynomial is array(m downto 0) of std_logic_vector(k-1 downto 0);
    type polynomial is array(m-1 downto 0) of std_logic_vector(k-1 downto 0);
    --logm is the number of bits of m-1
    constant logm: natural := 3;
    constant f: long_polynomial := (x"00000001", x"00000000", x"00000000",
      x"00000000", x"00000000", x"00000000", x"fffffe7b");
    constant f1: polynomial := (x"632e27fc", x"632e27fb", x"fffffe7c",
      x"9cd1d681", x"9cd1d682", x"00000001");
    constant f2: polynomial := (x"632e27fb", x"9cd1d681", x"00000001",
      x"632e27fb", x"9cd1d681", x"00000001");
  end mod_f_divider_parameters;

  FFs     LUTs    Slices   Period   Cycles   Total time
  1,024   2,409   1,448    15       14,383   227,251.4

All the source files are available at www.arithmetic-circuits.org.

APPENDIX C

Binary Fields

C.1 $GF(2^{163})$

$GF(2^{163})$ is represented by the set of polynomials of degree smaller than 163 over GF(2), modulo the irreducible polynomial $f(x) = x^{163} + x^7 + x^6 + x^3 + 1$.

C.1.1 mod f(x) Multiplication

Several combinational (classic_multiplication, mastrovito_V2_multiplication) and sequential (interleaved_mult, montgomery_mult) entities have been described in Chap. 7. The parameter values are the following:

  constant M: integer := 163;
  constant F: std_logic_vector(M-1 downto 0) := "000" & x"00000000000000000000000000000000000000C9";

In the case of the interleaved multiplier, several implementation strategies, based on the number G of bits computed at each cycle, have been considered. The implementation results (Spartan3, speed-5) are given in Table C.1. All the source files are available at www.arithmetic-circuits.org.
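A bit-serial software model helps to see what the interleaved multiplier computes. The Python sketch below is a minimal MSB-first shift-and-add model handling one bit of b per iteration (the G = 1 case); it is not a transcription of the interleaved_mult entity. Polynomials over GF(2) are stored as integers, bit i being the coefficient of $x^i$, and the constant F here includes the $x^{163}$ term.

```python
M = 163
F = (1 << 163) | 0xC9       # x**163 + x**7 + x**6 + x**3 + 1

def interleaved_mult(a, b):
    """Return a(x) * b(x) mod f(x); a and b must have degree < M."""
    z = 0
    for i in range(M - 1, -1, -1):   # scan b from its most significant bit
        z <<= 1                      # z := z * x
        if z >> M:                   # degree reached M: reduce with f(x)
            z ^= F
        if (b >> i) & 1:             # add a(x) when bit i of b is 1
            z ^= a
    return z

# x**162 * x = x**163 = x**7 + x**6 + x**3 + 1 (mod f)
print(hex(interleaved_mult(1 << 162, 2)))
```

A multiplier processing G bits per cycle performs G of these iterations in one clock period, which trades a longer combinational path for fewer cycles.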

C.1.2 mod f(x) Division

Generic VHDL models of mod f(x) dividers (binary_algorithm_polynomials.vhd) have been defined in Sec. 7.4. In the case of $GF(2^{163})$ the parameter values are the following:

  constant M: integer := 163;
  --logM is the number of bits of M plus an additional sign bit:
  constant logM: integer := 9;
  constant F: std_logic_vector(M downto 0) := x"800000000000000000000000000000000000000C9";

                G    FFs   LUTs     Slices   Period   Cycles   Total time
  Classic       –    –     22,356   15,171   –        –        39
  Interleaved   1    509   511      271      4.5      163      815
  Interleaved   2    527   676      369      4.8      82       369
  Interleaved   4    531   849      463      4.8      41       197
  Interleaved   6    538   1,017    555      5.0      28       134
  Interleaved   8    555   1,356    745      5.2      21       105
  Interleaved   11   546   1,843    965      5.7      15       78
  Interleaved   13   515   1,884    975      5.7      13       74
  Interleaved   15   528   2,215    1,161    5.8      11       64
  Interleaved   16   534   2,237    1,190    5.9      11       65
  Interleaved   33   560   4,449    2,304    7.5      5        37
  Interleaved   55   589   6,956    3,588    9.7      3        29
  Mastrovito    –    –     22,347   15,201   –        –        36
  Montgomery    –    344   347      184      7.4      163      1,206

TABLE C.1 Cost and Delay of Multipliers over $GF(2^{163})$

The implementation results of the divider (Spartan3, speed-5) are the following:

  FFs   LUTs   Slices   Period   AverCycles   AverTime
  679   726    544      7.4      326          2412

The source file is available at www.arithmetic-circuits.org.

C.1.3 Squaring

According to the results of Chap. 7 the best solution is a combinational circuit modeled by the classic_squarer entity. The parameter values are the same as before (Sec. C.1.1), and the implementation results (Spartan3, speed-5) are the following:

  LUTs   Slices   Total time
  165    86       3

The source file is available at www.arithmetic-circuits.org.

C.1.4 Elliptic-Curve Operations

The VHDL models K163_addition.vhd and K163_point_multiplication.vhd have been described in Chap. 10 and are available at www.arithmetic-circuits.org. The implementation results (Spartan3, speed-5) are the following:

  FFs     LUTs    Slices   Period   AverCycles   AverTime
  2,170   3,514   2,062    7.9      54,422.8     429,940

C.2 $GF(2^{233})$

Another NIST-recommended finite field is $GF(2^{233})$, that is, the set of polynomials of degree smaller than 233 over GF(2), modulo the irreducible polynomial $f(x) = x^{233} + x^{74} + 1$.

C.2.1 mod f(x) Multiplication

Only sequential circuits are considered. The entities interleaved_mult and montgomery_mult are described in Chap. 7. The parameter values are the following:

  constant M: integer := 233;
  constant F: std_logic_vector(M-1 downto 0) := (0 => '1', 74 => '1', others => '0');

In the case of the interleaved multiplier, several implementation strategies, based on the number of bits G computed at each cycle, have been considered. The implementation results (Spartan3, speed-5) are given in Table C.2.

                G    FFs     LUTs     Slices   Period   Cycles   Total time
  Interleaved   1    763     723      417      6.4      223      1427
  Interleaved   2    769     957      541      5.6      112      627
  Interleaved   4    794     1,192    689      5.5      56       308
  Interleaved   8    780     1,919    1,045    5.7      28       160
  Interleaved   15   880     3,112    1,736    5.9      15       88
  Interleaved   16   879     3,130    1,743    5.9      14       83
  Interleaved   32   932     6,213    3,346    7.5      7        52
  Interleaved   56   1,321   10,112   5,718    11.5     4        46
  Montgomery    –    484     489      255      7.5      233      1,748

TABLE C.2 Cost and Delay of Multipliers over $GF(2^{233})$

All the source files are available at www.arithmetic-circuits.org.

C.2.2

mod f(x) Division

Generic VHDL models of mod f(x) dividers (binary_algorithm_ polynomials.vhd) have been defined in Sec. 7.4. In the case of GF(2233) the parameter values are the following: constant M: integer := 233; --logM is the number of bits of M plus an additional sign --bit: constant logM: integer := 9; constant F: std_logic_vector(M downto 0) := (0=> ‘1’, 74 => ‘1’, 233 => ‘1’,others => ‘0’);

The implementation results (Spartan3, speed-5) are the following:

FFs    LUTs     Slices   Period   AverCycles   AverTime
962    1,013    763      7.4      466          3,448

The source file is available at www.arithmetic-circuits.org.
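To illustrate what a binary division algorithm over GF($2^{233}$) computes, here is a minimal Python sketch of the standard textbook binary algorithm for g(x)/h(x) mod f(x). It is not derived from binary_algorithm_polynomials.vhd and may differ from the version of Sec. 7.4 in its details; the function name gf_div and the integer encoding of polynomials are assumptions for this example.

```python
M = 233
F = (1 << 233) | (1 << 74) | 1   # f(x) = x^233 + x^74 + 1 as a bit mask

def gf_div(g: int, h: int) -> int:
    """Return g(x)/h(x) mod f(x); h must be nonzero, both of degree < M."""
    if h == 0:
        raise ZeroDivisionError("h(x) must be nonzero")
    u, v = h, F
    g1, g2 = g, 0
    # Invariants (mod f): g1 * h = u * g  and  g2 * h = v * g.
    while u != 1 and v != 1:
        while (u & 1) == 0:          # x divides u: divide u and g1 by x
            u >>= 1
            g1 = g1 >> 1 if (g1 & 1) == 0 else (g1 ^ F) >> 1
        while (v & 1) == 0:          # x divides v: divide v and g2 by x
            v >>= 1
            g2 = g2 >> 1 if (g2 & 1) == 0 else (g2 ^ F) >> 1
        if u.bit_length() > v.bit_length():   # deg(u) > deg(v)
            u ^= v
            g1 ^= g2
        else:
            v ^= u
            g2 ^= g1
    return g1 if u == 1 else g2

# Dividing an element by itself must give 1.
print(gf_div(0x1db537dece819b7f, 0x1db537dece819b7f))
```

Only shifts, XORs, and degree comparisons are needed, which is why this family of algorithms maps well onto sequential hardware.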

C.2.3 Squaring

According to the results of Chap. 7, the best solution is a combinational circuit modeled by the classic_squarer entity. The parameter values are the same as before (Sec. C.2.1), and the implementation results (Spartan3, speed-5) are the following:

LUTs   Slices   Total time
153    99       3

The source file is available at www.arithmetic-circuits.org.
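Squaring over GF(2) is cheap because the square of a polynomial simply spreads its coefficients to the even positions; only the reduction modulo f(x) requires logic. The following Python sketch shows this two-step view. It is an illustration under the assumption that polynomials are stored as integers, and the helper names reduce_mod_f and classic_square are hypothetical, not taken from the classic_squarer entity.

```python
M = 233
F = (1 << 233) | (1 << 74) | 1   # f(x) = x^233 + x^74 + 1 as a bit mask

def reduce_mod_f(c: int) -> int:
    """Reduce c(x) modulo f(x) by cancelling bits from the top down."""
    for i in range(c.bit_length() - 1, M - 1, -1):
        if (c >> i) & 1:
            c ^= F << (i - M)    # cancels x^i, touches only lower positions
    return c

def classic_square(a: int) -> int:
    """Two-step squaring: spread the bits of a(x), then reduce modulo f(x)."""
    d = 0
    for i in range(M):
        if (a >> i) & 1:
            d |= 1 << (2 * i)    # coefficient of x^i moves to x^(2i)
    return reduce_mod_f(d)

# (x^117)^2 = x^234 = x * x^233, which reduces to x^75 + x.
print(hex(classic_square(1 << 117)))
```

Because both steps are fixed wirings and XORs that depend only on f(x), the whole operation can be realized as a purely combinational circuit.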

C.2.4 Elliptic-Curve Operations

Circuits for executing the elliptic-curve operations over K-233 have also been generated, namely K233_addition.vhd and K233_point_multiplication.vhd. They are available at www.arithmetic-circuits.org. The parameter definition packages and the entity declarations are

    package K233_addition_parameters is
      constant m: natural := 233;
      constant logm: natural := 8;
    end K233_addition_parameters;

    entity K233_addition is
    port(
      x1, y1, x2, y2: in std_logic_vector(m-1 downto 0);
      clk, reset, start: in std_logic;
      x3: inout std_logic_vector(m-1 downto 0);
      y3: out std_logic_vector(m-1 downto 0);
      done: out std_logic
    );
    end K233_addition;

    package K233_package is
      constant m: natural := 233;
    end K233_package;

    entity K233_point_multiplication is
    port (
      xP, yP, k: in std_logic_vector(m-1 downto 0);
      clk, reset, start: in std_logic;
      xQ, yQ: inout std_logic_vector(m-1 downto 0);
      done: out std_logic
    );
    end K233_point_multiplication;

The corresponding architectures are similar to those of the K163_addition and K163_point_multiplication entities. Apart from the operand length, the other difference is that K-163 is a Type-1 Koblitz curve [Eq. (10.53)] while K-233 is a Type-0 curve [Eq. (10.52)]. In the first case a = 1, b = 1, and μ = 1, and in the second case a = 0, b = 1, and μ = −1. The addition formulas [Eq. (10.16)] are slightly different (a = 1 or 0, depending on the curve type), and the same holds for the base-τ representation algorithms (μ = 1 or −1, depending on the curve type). In the case of K-233 the following rules are used:

if $a_0 = 0$ then $a' = b - a/2 = b - \lfloor a/2 \rfloor$ and $b' = -\lfloor a/2 \rfloor$;
if $a_0 = 1$ and $a_1 \oplus b_0 = 0$ then $a' = b - (a - 1)/2 = b - \lfloor a/2 \rfloor$ and $b' = -\lfloor a/2 \rfloor$;
if $a_0 = 1$ and $a_1 \oplus b_0 = 1$ then $a' = b - (a + 1)/2 = b - (\lfloor a/2 \rfloor + 1)$ and $b' = -(\lfloor a/2 \rfloor + 1)$.

The point multiplication circuit of Fig. 10.2 has been implemented within a Spartan3 (speed-5) programmable device with P defined by its coordinates:

    xP = 17232ba853a7e731af129f22ff4149563a419c26bf50a4c9d6eefad6126
    yP = 1db537dece819b7f70f555a67c427a8cd9bf18aeb9b56e0c11056fae6a3

The order of P is equal to

    n = 08000000000000000000000000000069d5bb915bcd46efb1ad5f173abdf

The circuit computes kP for any k belonging to the interval 0 < k < n. The implementation results are the following:

FFs      LUTs     Slices   Period   AverCycles   AverTime
3,080    4,640    2,888    8.1      110,051.3    891,416

All the source files are available at www.arithmetic-circuits.org.
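The three K-233 rules for the base-τ representation (μ = −1) can be exercised in software. The Python sketch below applies them to the pair (a, b) = (k, 0) and emits one digit per step: 0 when $a_0 = 0$, +1 when $a_1 \oplus b_0 = 0$, and −1 otherwise. That digit assignment follows the usual τ-adic nonadjacent form and is an assumption here, as are the function names; any preliminary reduction of k that the circuit may perform is not modeled.

```python
def tau_digits_k233(k: int) -> list[int]:
    """Base-tau digits of k (least significant first) for mu = -1."""
    a, b = k, 0
    digits = []
    while a != 0 or b != 0:
        half = a >> 1                        # floor(a/2), also for negative a
        if a % 2 == 0:                       # a0 = 0
            u = 0
            a, b = b - half, -half
        elif ((a >> 1) & 1) == (b & 1):      # a0 = 1 and a1 xor b0 = 0
            u = 1
            a, b = b - half, -half
        else:                                # a0 = 1 and a1 xor b0 = 1
            u = -1
            a, b = b - (half + 1), -(half + 1)
        digits.append(u)
    return digits

def evaluate(digits: list[int]) -> tuple[int, int]:
    """Horner evaluation of sum(u_i * tau^i) using tau^2 = -tau - 2.

    Returns (c, d), meaning the element c + d*tau.
    """
    c, d = 0, 0
    for u in reversed(digits):
        c, d = -2 * d + u, c - d             # (c + d*tau) * tau + u
    return c, d

digits = tau_digits_k233(2)
print(digits, evaluate(digits))
```

Evaluating the digit string back in Z[τ] should return (k, 0), which gives a simple way to validate the rules independently of any hardware model.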


APPENDIX D

Ada versus VHDL

The programming language Ada is used for describing most of the algorithms presented in this book. Ada was chosen instead of C because of the similarity between the Ada and VHDL syntaxes. The definition of VHDL was largely inspired by Ada: an electronic circuit is a real-time system whose components all work concurrently, and Ada was the language "par excellence" for programming concurrent systems. The reader of this book is assumed to be an electronic circuit designer with some experience in the use of hardware description languages, so understanding short and simple Ada procedures should require minimal effort. As a matter of fact, a simple Ada procedure is very similar to a VHDL procedure. This appendix presents some of the differences between Ada and VHDL procedures.

Consider the first algorithm of Chap. 2, that is, Algorithm 2.1, and define the function quotient as follows [Eq. (2.13)]:

    quotient(s, y) = −1 if s < 0,
    quotient(s, y) = 1 if s ≥ 0.

It corresponds to the nonrestoring algorithm of Sec. 2.1.2. The following VHDL procedure describes the corresponding reduction algorithm:

Algorithm D.1: VHDL version

    procedure nr_reducer(m, x: in integer; z: inout integer) is
      function quotient(s: in integer; y: in natural) return integer is
      begin
        if s < 0 then return -1; else return 1; end if;
      end quotient;
      variable y, s, r: integer;
    begin
      y := m*(2**(n-k));
      s := x;
      for i in 0 to n-k loop
        if quotient(s,y) = 1 then r := s - y;
        elsif quotient(s,y) = 0 then r := s;
        else r := s + y;
        end if;
        s := 2*r;
      end loop;
      z := r / (2**(n-k));
      if z < 0 then z := (z + m); end if;
    end nr_reducer;

In order to execute this procedure with actual values of m and x, the following test_reducer entity could be defined and simulated:

    --define the value of n and k:
    package test_reducer_parameters is
      constant n: integer := 20;
      constant k: integer := 8;
    end test_reducer_parameters;

    use work.test_reducer_parameters.all;
    entity test_reducer is
    end test_reducer;

    architecture Ada_style of test_reducer is
      --insert here Algorithm D.1
      signal m, x, z: integer;
    begin
      m <= 239;
      x <= 123456, 654321 after 100 ns, 555555 after 200 ns;
      process(m, x)
        variable var_z: integer;
      begin
        nr_reducer(m, x, var_z);
        z <= var_z;
      end process;
    end Ada_style;

The complete VHDL file test_reducer.vhd is available at www.arithmetic-circuits.org. The corresponding Ada version of the same procedure (Algorithm D.1) is the following:

Algorithm D.2: Ada version

    procedure nr_reducer(x, m: in integer; z: out integer) is
      function quotient(s: in integer; y: in natural) return integer is
      begin
        if s < 0 then return -1; else return 1; end if;
      end quotient;
      y, s, r: integer;
    begin
      y := m*(2**(n-k));
      s := x;
      for i in 0 .. n-k loop
        if quotient(s,y) = 1 then r := s - y;
        elsif quotient(s,y) = 0 then r := s;
        else r := s + y;
        end if;
        s := 2*r;
      end loop;
      z := r / (2**(n-k));
      if z < 0 then z := (z + m); end if;
    end nr_reducer;
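For readers who want to check the nonrestoring reduction numerically without a VHDL simulator or an Ada compiler, the same steps can be transcribed into Python. This is only a cross-checking sketch: the argument order follows the Ada version, n and k are passed as default arguments instead of package constants, and the stimulus values are the ones used in the VHDL test bench.

```python
N, K = 20, 8   # same values as the parameter packages: n = 20, k = 8

def quotient(s: int, y: int) -> int:
    """Nonrestoring quotient digit: -1 if s < 0, 1 otherwise."""
    return -1 if s < 0 else 1

def nr_reducer(x: int, m: int, n: int = N, k: int = K) -> int:
    """Return x mod m, assuming 0 <= x < 2**n and 2**(k-1) <= m < 2**k."""
    y = m * 2 ** (n - k)
    s = x
    r = 0
    for _ in range(n - k + 1):       # i = 0 .. n-k, bounds included
        q = quotient(s, y)
        if q == 1:
            r = s - y
        elif q == 0:                 # never taken with this quotient function
            r = s
        else:
            r = s + y
        s = 2 * r
    z = r // 2 ** (n - k)            # exact: r is a multiple of 2**(n-k)
    if z < 0:
        z += m
    return z

for x in (123456, 654321, 555555):
    print(x, "mod 239 =", nr_reducer(x, 239))
```

The results can be compared directly with Python's built-in remainder operator (x % 239), which gives an independent reference for the simulated values of z.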

The differences between Algorithms D.1 and D.2 are the following:

In VHDL, z must be declared as inout because the value of the output variable z is used internally (if z < 0 then z := (z + m); end if;); in Ada, z is an output parameter.

In VHDL, y, s, and r are variables (neither signals nor constants) and must be explicitly declared as variables; in Ada there are no signals, and the constant declaration is slightly different, for example:

    n: constant integer := 20;    --(Ada)
    constant n: integer := 20;    --(VHDL)

In VHDL the index ranges are defined with to or downto; in Ada with .. ; for example,

    for i in 0 to n-k             -- (VHDL)
    for i in 0 .. n-k             -- (Ada)

or

    for i in n-k downto 0         -- (VHDL)
    for i in reverse 0 .. n-k     -- (Ada)

There is another difference between Ada and VHDL regarding the way the source files are generated. In VHDL all the source units (entities, architectures, packages, package bodies) can be stored within the same file, or within separate files. In Ada, every unit is stored within a separate file: an .ads file storing the external view of the corresponding unit, and an .adb file storing the corresponding internal definition. In order to execute the Ada procedure (Algorithm D.2) with actual values of m and x, the values of the constants n and k are defined within reducer_parameters.ads

    package reducer_parameters is
      n: constant natural := 20;
      k: constant natural := 8;
    end reducer_parameters;

and a test_reducer.adb file is generated:

    with Gnat.Io; use Gnat.Io;
    with reducer_parameters; use reducer_parameters;
    procedure test_reducer is
      --insert here Algorithm D.2
      x, m, z: integer;
    begin
      loop
        Put("m = "); Get(m);
        Put("x = "); Get(x);
        nr_reducer(x, m, z);
        Put(x); Put(" mod "); Put(m); Put(" = "); Put(z);
        New_Line; New_Line;
      end loop;
    end test_reducer;

Observe that the way the package contents are made visible is slightly different:

    with reducer_parameters; use reducer_parameters;    -- (Ada)
    use work.test_reducer_parameters.all;                -- (VHDL)

The Gnat.Io package includes input-output subprograms such as Get, Put, and New_Line. They allow one to input values from the keyboard and to display the results. The complete Ada file test_reducer.adb is available at www.arithmetic-circuits.org.
