CBMS-NSF REGIONAL CONFERENCE SERIES IN APPLIED MATHEMATICS

A series of lectures on topics of current research interest in applied mathematics under the direction of the Conference Board of the Mathematical Sciences, supported by the National Science Foundation and published by SIAM.

GARRETT BIRKHOFF, The Numerical Solution of Elliptic Equations
D. V. LINDLEY, Bayesian Statistics, A Review
R. S. VARGA, Functional Analysis and Approximation Theory in Numerical Analysis
R. R. BAHADUR, Some Limit Theorems in Statistics
PATRICK BILLINGSLEY, Weak Convergence of Measures: Applications in Probability
J. L. LIONS, Some Aspects of the Optimal Control of Distributed Parameter Systems
ROGER PENROSE, Techniques of Differential Topology in Relativity
HERMAN CHERNOFF, Sequential Analysis and Optimal Design
J. DURBIN, Distribution Theory for Tests Based on the Sample Distribution Function
SOL I. RUBINOW, Mathematical Problems in the Biological Sciences
P. D. LAX, Hyperbolic Systems of Conservation Laws and the Mathematical Theory of Shock Waves
I. J. SCHOENBERG, Cardinal Spline Interpolation
IVAN SINGER, The Theory of Best Approximation and Functional Analysis
WERNER C. RHEINBOLDT, Methods of Solving Systems of Nonlinear Equations
HANS F. WEINBERGER, Variational Methods for Eigenvalue Approximation
R. TYRRELL ROCKAFELLAR, Conjugate Duality and Optimization
SIR JAMES LIGHTHILL, Mathematical Biofluiddynamics
GERARD SALTON, Theory of Indexing
CATHLEEN S. MORAWETZ, Notes on Time Decay and Scattering for Some Hyperbolic Problems
F. HOPPENSTEADT, Mathematical Theories of Populations: Demographics, Genetics and Epidemics
RICHARD ASKEY, Orthogonal Polynomials and Special Functions
L. E. PAYNE, Improperly Posed Problems in Partial Differential Equations
S. ROSEN, Lectures on the Measurement and Evaluation of the Performance of Computing Systems
HERBERT B. KELLER, Numerical Solution of Two Point Boundary Value Problems
J. P. LASALLE, The Stability of Dynamical Systems—Z. ARTSTEIN, Appendix A: Limiting Equations and Stability of Nonautonomous Ordinary Differential Equations
D. GOTTLIEB and S. A. ORSZAG, Numerical Analysis of Spectral Methods: Theory and Applications
PETER J. HUBER, Robust Statistical Procedures
HERBERT SOLOMON, Geometric Probability
FRED S. ROBERTS, Graph Theory and Its Applications to Problems of Society
JURIS HARTMANIS, Feasible Computations and Provable Complexity Properties
ZOHAR MANNA, Lectures on the Logic of Computer Programming
ELLIS L. JOHNSON, Integer Programming: Facets, Subadditivity, and Duality for Group and Semi-Group Problems
SHMUEL WINOGRAD, Arithmetic Complexity of Computations
J. F. C. KINGMAN, Mathematics of Genetic Diversity
SHMUEL WINOGRAD
IBM Thomas J. Watson Research Center
Arithmetic Complexity of Computations
SOCIETY FOR INDUSTRIAL AND APPLIED MATHEMATICS PHILADELPHIA, PENNSYLVANIA
1980
All rights reserved. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the Publisher. For information, write the Society for Industrial and Applied Mathematics, 1400 Architect's Building, 117 South 17th Street, Philadelphia, Pennsylvania 19103-5052.
Copyright© 1980 by the Society for Industrial and Applied Mathematics. Second printing 1986. Third printing 1990.
Library of Congress Catalog Card Number 79-93154. ISBN 0-89871-163-0.
Printed for the Society for Industrial and Applied Mathematics by J. W. Arrowsmith, Ltd., Bristol, England.
Contents

Chapter I  INTRODUCTION  1

Chapter II  THREE EXAMPLES
  IIa. Product of integers  3
  IIb. Discrete Fourier transform  4
  IIc. Matrix multiplication  5

Chapter III  GENERAL BACKGROUND
  IIIa. Definitions and basic results  7
  IIIb. Linear functions  11
  IIIc. Quadratic and bilinear forms  13

Chapter IV  PRODUCT OF POLYNOMIALS
  IVa. Minimal algorithms  25
  IVb. Classification of the algorithms  28
  IVc. Heuristic algorithms  32

Chapter V  FIR FILTERS
  Va. Filters and polynomials  39
  Vb. Filters with decimation  46
  Vc. Symmetric filters  49

Chapter VI  PRODUCT OF POLYNOMIALS MODULO A POLYNOMIAL
  VIa. An illustrative example  57
  VIb. Multiplication modulo an irreducible polynomial  60
  VIc. Multiplication modulo a general polynomial  62
  VId. Multiplication modulo several polynomials  63

Chapter VII  CYCLIC CONVOLUTION AND DISCRETE FOURIER TRANSFORM
  VIIa. Cyclic convolution  71
  VIIb. DFT(n), n prime  76
  VIIc. DFT(p^r), p odd prime  79
  VIId. DFT(2^r)  83
  VIIe. Multidimensional DFT  87

Bibliography  93
CHAPTER I
Introduction

The two major problem areas which are the concern of Arithmetic Complexity of Computations are:
1. What is the minimum number of arithmetic operations which are needed to perform the computation?
2. How can we obtain a better algorithm when improvement is possible?
These two questions are very large in scope, since they pertain to any computation which is arithmetic in nature. In these lectures we will not attempt to cover the whole theory, but concentrate on a narrower class of computational problems—that of computing a system of bilinear forms. That is, we aim at a better understanding of the amount of arithmetic needed to compute the quantities

  Σ_i Σ_j a_{ijk} x_i y_j,   k = 1, 2, · · · , t,
where the x_i's and y_j's denote the inputs to the algorithm, that is the data, and the a_{ijk}'s are constants independent of the data. As we will see later, there are many problems of practical interest which can be viewed as computing a system of bilinear forms. Yet finding the most efficient algorithm for computing a system of bilinear forms can be most vexing. In spite of the apparent simplicity of systems of bilinear forms, many of the problems connected with their computation have not been solved. In developing these lectures we will try to keep a balance between the mathematical aspects of the theory and the applicability of the results. We will emphasize the results which lead to applications in the area of signal processing. This choice was motivated by several reasons. First, many signal processing problems place a very heavy computational load on even the largest computers which are available, so even a modest reduction in their execution time may have practical significance. Secondly, the results which are applicable to signal processing are relatively new and consequently do not appear in any of the books on computational complexity which are available, but are scattered in journal articles. Last, and not least, the choice of the material which was made gives a good indication of the flavor of complexity of computation (admittedly at the expense of indicating its full scope). In the next section, we will describe three algorithms which were discovered in the last two decades and which were the motivation for much of the development in complexity of computations. The following section will provide a general background of the basic results, and the next three sections will deal with the
complexity of computing convolution, digital filtering, and the discrete Fourier transform. I would like to end this introductory section with another qualification. Even though the title speaks of arithmetic complexity, that is the consideration of both the number of additions and the number of multiplications, we will concentrate our attention on the number of multiplications. The main reason is that the theory concerning the number of multiplications is much more developed than that of the number of additions. However, in the discussion of some of the applications we will have to pay attention to the number of additions, and not just to the number of multiplications.
CHAPTER II
Three Examples

The systematic investigation of the number of arithmetic operations needed to perform a computation received its greatest impetus from the discovery, during the 1960's, of three very surprising algorithms—for the multiplication of two integers, for computing the discrete Fourier transform, and for the product of two matrices. These three computational problems have been with us for a very long time, and the fact that more efficient algorithms for their execution were discovered indicated that a study of the limits of the efficiency of algorithms may yield some more interesting results. We will devote this section to describing these algorithms. Their importance is not only historic; the new algorithm for computing the discrete Fourier transform has had a very profound impact on the way many computations are being performed.

IIa. Product of integers. In the early 1960's two Russian mathematicians, Karatsuba and Toom, discovered a new way of multiplying two large integers. The regular way, which we all learned in school, for obtaining the product of two n-digit numbers calls for performing n^2 multiplications of single digits and about the same number of single-digit additions. The new method enables us to obtain the product using many fewer operations. The specific method which will be described is different from the original one. The original paper of Karatsuba [3] and the follow-up by Toom [4] are more complicated (and yield sharper results). Let x and y be two n-digit numbers and assume that n = 2m is an even number. If b denotes the base then we can write x = x_0 + x_1 · b^m and y = y_0 + y_1 · b^m, where x_0, x_1, y_0, y_1 are m-digit numbers. The product z of x and y can then be written as

  z = x_0 y_0 + (x_0 y_1 + x_1 y_0) · b^m + x_1 y_1 · b^{2m}.

Thus the problem of computing the product x · y can be viewed as that of computing the three quantities x_0 y_0, x_0 y_1 + x_1 y_0, and x_1 y_1, and performing 2m = n single-digit additions. The key to the new algorithm is the way x_0 y_0, x_0 y_1 + x_1 y_0, and x_1 y_1 are computed. The computation of these quantities is based on the identity:

  x_0 y_1 + x_1 y_0 = x_0 y_0 + x_1 y_1 − (x_0 − x_1)(y_0 − y_1).
Thus one has to compute the three products x_0 y_0, x_1 y_1, and (x_0 − x_1)(y_0 − y_1), and perform two more additions (or subtractions) of 2m (= n)-digit numbers. The
computation of x_0 − x_1 and y_0 − y_1 necessitates two more subtractions of m-digit numbers. If we take as a unit of addition the addition (or subtraction) of two single-digit numbers with the possibility of a carry (or borrow), we obtain that the product of two n-digit numbers uses 3 multiplications of m-digit numbers and 4n units of addition. We can, of course, use the same scheme to obtain the products x_0 y_0, x_1 y_1, and (x_0 − x_1)(y_0 − y_1). Thus if n = 2^s is a power of 2, and using the initial condition that for m = 2^0 = 1 only one single-digit multiplication and no units of addition are needed, we obtain that 3^s multiplications of single digits and 8(3^s − 2^s) units of addition suffice. In the case that n is not a power of 2, we can "pad" the numbers by adding enough leading zeros. So the formulas are still valid if we take s = ⌈log_2 n⌉ for any n. To summarize, we have shown a method for computing the product of two n-digit numbers which uses at most 3 · 3^{log_2 n} = 3 · n^{log_2 3} single-digit multiplications and 8 · 3 · 3^{log_2 n} = 24 · n^{log_2 3} units of addition. A more refined analysis can reduce the constants 3 and 24, or their sum (the total number of operations), but the rate of growth of the number of operations will remain n^{log_2 3}. In his paper, Toom describes other methods which further reduce the exponent of n. We will not discuss them here. The main point we wish to emphasize is that even such a timeworn computational task as the product of two large integers may not be as simple and as straightforward as we had thought.
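To make the recursion concrete, here is a short sketch of the scheme just described, written for nonnegative integers and splitting at a power-of-two base; the function name and the base-case threshold are illustrative choices, not part of the text.

```python
def karatsuba(x, y):
    # Multiply integers with three half-size products, based on the identity
    #   x0*y1 + x1*y0 = x0*y0 + x1*y1 - (x0 - x1)*(y0 - y1).
    if x < 0 or y < 0:                      # reduce to the nonnegative case
        sign = -1 if (x < 0) ^ (y < 0) else 1
        return sign * karatsuba(abs(x), abs(y))
    if x < 16 or y < 16:                    # base case: single-"digit" operands
        return x * y
    m = min(x.bit_length(), y.bit_length()) // 2
    b = 1 << m                              # split at base b = 2**m
    x1, x0 = divmod(x, b)
    y1, y0 = divmod(y, b)
    p0 = karatsuba(x0, y0)                  # x0*y0
    p2 = karatsuba(x1, y1)                  # x1*y1
    p3 = karatsuba(x0 - x1, y0 - y1)        # (x0 - x1)*(y0 - y1)
    return p0 + (p0 + p2 - p3) * b + p2 * (b * b)

assert karatsuba(123456789, 987654321) == 123456789 * 987654321
```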
IIb. Discrete Fourier transform. The Fourier transform and harmonic analysis have been recognized for a long time as powerful tools in the study of functions and linear systems. This importance was not reflected in numerical computation because of the amount of computation which was needed to calculate the discrete version of the Fourier transform. In 1965 the situation changed radically. Cooley and Tukey published a new algorithm for computing the discrete Fourier transform (DFT) which substantially reduced the amount of computation [5]. Their algorithm, known as the FFT (fast Fourier transform), is now an indispensable ingredient in all aspects of signal processing. The discrete Fourier transform of n points is given by

  A_i = Σ_{j=0}^{n−1} ω^{ij} a_j,   i = 0, 1, · · · , n − 1,

where ω is the nth root of 1 (ω = e^{2πi/n}). Straightforward computation of each A_i uses n (complex) multiplications and n − 1 (complex) additions. So the straightforward computation of all the n A_i's uses n^2 (complex) multiplications and n^2 − n (complex) additions. In 1965 Cooley and Tukey observed that whenever n = n_1 · n_2 is a composite number a more efficient method of computation exists. Each i in the range 0 ≤ i ≤ n − 1 can be written uniquely as i = i_1 + i_2 · n_1 (0 ≤ i_1 < n_1; 0 ≤ i_2 < n_2), and each j in the range 0 ≤ j ≤ n − 1 can be written uniquely as j = j_1 · n_2 + j_2 (0 ≤ j_1 < n_1; 0 ≤ j_2 < n_2).
Since ω^{n_1 n_2} = ω^n = 1, we have ω^{i_2 j_1 n_1 n_2} = 1, and consequently, for every i = i_1 + i_2 · n_1 and j = j_1 · n_2 + j_2,

  ω^{ij} = ω^{i_1 j_1 n_2} · ω^{i_1 j_2} · ω^{i_2 j_2 n_1},

and

  A_{i_1 + i_2·n_1} = Σ_{j_2=0}^{n_2−1} ω^{i_2 j_2 n_1} ( ω^{i_1 j_2} Σ_{j_1=0}^{n_1−1} (ω^{n_2})^{i_1 j_1} a_{j_1·n_2 + j_2} ).
For each j_2 = 0, 1, · · · , n_2 − 1 we define b_{i_1,j_2} = Σ_{j_1=0}^{n_1−1} (ω^{n_2})^{i_1 j_1} a_{j_1·n_2 + j_2}; that is, for each j_2, b_{i_1,j_2} is the DFT of the n_1 points a_{j_1·n_2 + j_2}. (Note that ω^{n_2} = e^{2πi/n_1}.) Denoting ω^{i_1 j_2} · b_{i_1,j_2} by c_{i_1,j_2}, we see that for each i_1 = 0, 1, · · · , n_1 − 1, A_{i_1 + i_2·n_1} is the discrete Fourier transform of the n_2 points c_{i_1,j_2}. Thus, computing the DFT of n_1 · n_2 points can be done by computing the DFT of n_1 points n_2 times (once for each j_2), then performing the (n_1 − 1)(n_2 − 1) complex products ω^{i_1 j_2} · b_{i_1,j_2} (note that whenever i_1 = 0 or j_2 = 0 this multiplication does not have to be performed), and finally computing the DFT of n_2 points n_1 times (once for each i_1). In the case that n_1 or n_2 are themselves composite numbers we can use the same idea to compute the DFT of n_1 or n_2 points. In particular, when n = 2^s is a power of 2 this algorithm uses (s − 1) · 2^{s−1} = (n/2)(log_2 n − 1) (complex) multiplications and s · 2^s = n log_2 n (complex) additions. It is this tremendous reduction in the number of arithmetic operations which makes the FFT algorithm so useful for many practical applications.
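A direct transcription of this two-factor decomposition is sketched below: it computes the inner n_1-point DFTs, applies the products ω^{i_1 j_2}, and then computes the outer n_2-point DFTs. The helper names are illustrative, and the small transforms are done naively, since only the decomposition itself is being demonstrated.

```python
import cmath

def dft(a):
    """Naive DFT: A_i = sum_j w**(i*j) * a_j with w = exp(2*pi*i/n)."""
    n = len(a)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(w ** (i * j) * a[j] for j in range(n)) for i in range(n)]

def dft_cooley_tukey(a, n1, n2):
    """DFT of n = n1*n2 points via the decomposition in the text:
    n2 transforms of size n1, the products w**(i1*j2), and n1 transforms of size n2."""
    n = n1 * n2
    assert len(a) == n
    w = cmath.exp(2j * cmath.pi / n)
    # b[i1][j2] = DFT of the n1 points a[j1*n2 + j2], for each fixed j2
    b = [[0j] * n2 for _ in range(n1)]
    for j2 in range(n2):
        col = dft([a[j1 * n2 + j2] for j1 in range(n1)])
        for i1 in range(n1):
            b[i1][j2] = col[i1]
    # c[i1][j2] = w**(i1*j2) * b[i1][j2]
    c = [[w ** (i1 * j2) * b[i1][j2] for j2 in range(n2)] for i1 in range(n1)]
    # A[i1 + i2*n1] = DFT of the n2 points c[i1][j2], for each fixed i1
    A = [0j] * n
    for i1 in range(n1):
        row = dft(c[i1])
        for i2 in range(n2):
            A[i1 + i2 * n1] = row[i2]
    return A

a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
assert all(abs(u - v) < 1e-9 for u, v in zip(dft_cooley_tukey(a, 2, 3), dft(a)))
```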
IIc. Matrix multiplication. The third new algorithm, for the product of two matrices, was discovered by Strassen in 1969 [6]. The traditional way of computing the product of two n×n matrices uses n^3 multiplications and n^3 − n^2 additions. The new algorithm uses many fewer operations if n is large enough. As was the case for the FFT algorithm and for the new method for computing the product of two integers, this algorithm is also iterative in nature. That is, it reduces the problem of multiplying two n×n matrices to several instances of smaller problems. To start with, let us assume that n = 2m is an even number. We can partition an n×n matrix into four m×m matrices. Using that partition, we can write the fact that A × B = C as

  ( A_11  A_12 ) ( B_11  B_12 )   ( C_11  C_12 )
  ( A_21  A_22 ) ( B_21  B_22 ) = ( C_21  C_22 ),

where the A_ij's, B_ij's, and C_ij's are m×m matrices. The fact that C = A × B means that

  C_11 = A_11 B_11 + A_12 B_21,   C_12 = A_11 B_12 + A_12 B_22,
  C_21 = A_21 B_11 + A_22 B_21,   C_22 = A_21 B_12 + A_22 B_22.
If we compute the C_ij's in the straightforward way no savings will be obtained. Strassen, however, proposed a new way of computing the C_ij's. The algorithm first proceeds by computing the seven matrices:

  M_1 = (A_11 + A_22)(B_11 + B_22),
  M_2 = (A_21 + A_22) B_11,
  M_3 = A_11 (B_12 − B_22),
  M_4 = A_22 (B_21 − B_11),
  M_5 = (A_11 + A_12) B_22,
  M_6 = (A_21 − A_11)(B_11 + B_12),
  M_7 = (A_12 − A_22)(B_21 + B_22),
and then the algorithm computes the C_ij's using:

  C_11 = M_1 + M_4 − M_5 + M_7,
  C_12 = M_3 + M_5,
  C_21 = M_2 + M_4,
  C_22 = M_1 − M_2 + M_3 + M_6.
Altogether the algorithm calls for computing 18 additions of m×m matrices and 7 multiplications of m×m matrices. The key to the economy of this algorithm is that it calls for only 7 multiplications of m×m matrices rather than 8. Assume that the product of two m×m matrices uses m^3 multiplications and m^3 − m^2 additions. Since the sum of two m×m matrices uses m^2 additions, Strassen's algorithm for the product of two n×n matrices uses 7m^3 = (7/8)n^3 multiplications and 7(m^3 − m^2) + 18m^2 = 7m^3 + 11m^2 = (7/8)n^3 + (11/4)n^2 additions. Whenever n > 30 we have that (7/8)n^3 < n^3 and (7/8)n^3 + (11/4)n^2 < n^3 − n^2. Whenever m itself is an even number we can increase the number of arithmetic operations saved by using the same algorithm for computing each of the 7 products of m×m matrices. Taking n = 2^s to be a power of 2, and denoting by M(s) the number of multiplications needed to multiply two n×n matrices, and by A(s) the number of additions, we obtain

  M(s) = 7 · M(s − 1),   A(s) = 7 · A(s − 1) + 18 · 4^{s−1}.
Using the initial conditions M(0) = 1 and A(0) = 0 we obtain:

  M(s) = 7^s,   A(s) = 6 · (7^s − 4^s).
In the case n is not a power of 2, we can "pad" the two matrices by 0's so as to make their dimension n', a power of 2. Since n' < 2n we obtain, for every n,

  M(n) < 7 · n^{log_2 7}   and   A(n) < 42 · n^{log_2 7},

where M(n) denotes the number of multiplications and A(n) the number of additions needed to multiply two n×n matrices. A more careful "padding", as well as using the regular way of multiplying two matrices when n is small, shows that the total number of operations for multiplying two n×n matrices need not exceed 4.7 × n^{log_2 7}.
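The following sketch applies the seven-product scheme recursively to matrices whose dimension is a power of 2, falling back to the ordinary method at a small cutoff; the cutoff value and the helper names are arbitrary choices made for the illustration.

```python
def mat_add(A, B, sign=1):
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_mul_naive(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)] for i in range(n)]

def strassen(A, B, cutoff=2):
    """Multiply two n x n matrices (n a power of 2) with 7 recursive products."""
    n = len(A)
    if n <= cutoff:
        return mat_mul_naive(A, B)
    m = n // 2
    q = lambda M, r, c: [row[c * m:(c + 1) * m] for row in M[r * m:(r + 1) * m]]
    A11, A12, A21, A22 = q(A, 0, 0), q(A, 0, 1), q(A, 1, 0), q(A, 1, 1)
    B11, B12, B21, B22 = q(B, 0, 0), q(B, 0, 1), q(B, 1, 0), q(B, 1, 1)
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22), cutoff)
    M2 = strassen(mat_add(A21, A22), B11, cutoff)
    M3 = strassen(A11, mat_add(B12, B22, -1), cutoff)
    M4 = strassen(A22, mat_add(B21, B11, -1), cutoff)
    M5 = strassen(mat_add(A11, A12), B22, cutoff)
    M6 = strassen(mat_add(A21, A11, -1), mat_add(B11, B12), cutoff)
    M7 = strassen(mat_add(A12, A22, -1), mat_add(B21, B22), cutoff)
    C11 = mat_add(mat_add(M1, M4), mat_add(M7, M5, -1))
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(M1, M2, -1), mat_add(M3, M6))
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot

import random
A = [[random.randint(-9, 9) for _ in range(4)] for _ in range(4)]
B = [[random.randint(-9, 9) for _ in range(4)] for _ in range(4)]
assert strassen(A, B) == mat_mul_naive(A, B)
```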
CHAPTER III
General Background

The three examples of the previous section indicate that there is a need for a systematic study of the minimum number of multiplications and additions which are required to perform a computation. In this section we will describe some of the general results which have been obtained in the last decade concerning the minimum number of multiplications. Most of the results will be stated without proof. At the end of these notes we have a bibliography directing the interested reader to the research papers where the proofs are given.

IIIa. Definitions and basic results. Any systematic study of algorithms must begin by making precise the notion of the algorithm under investigation. Before stating the definition of an algorithm which we will use, we should explore our intuitive notion of an algorithm for computing arithmetic functions. The inputs to the algorithm are some real (or complex) numbers, that is, the data on which the algorithm operates. At each step of the algorithm it may add two previous results, subtract, multiply, or divide them. For a general computation, we also want to provide the algorithm with the ability to branch. Since we consider here algorithms which perform arithmetic, we assume that the branching is done on some arithmetic predicate. The ability to branch cannot reduce the amount of computation the algorithm performs "in general." Let us assume that the input to the algorithm is some point in n-dimensional (real or complex) space. Actually, the argument we are about to give is valid even if the inputs are restricted to some open neighborhood of n-dimensional space. As we execute the algorithm we arrive at the first branch instruction. At least one of the branches will be taken by all points in some open neighborhood of the space. We will take this branch. We continue this way, always taking the branch which is valid for a whole neighborhood. We thus see that the algorithm has a path which will be taken whenever the inputs are in some neighborhood. That means that the results computed are valid for this whole neighborhood. But the results computed are some arithmetic functions of the inputs, and are therefore valid for all points in the space except possibly for some algebraic variety. We will therefore assume that the algorithm we are dealing with has no branching. The second point we wish to make concerns the inputs to the algorithm. Our intuitive notion is that the inputs are some variables. Thus our first impulse is to assume that the inputs are a set {x_1, x_2, · · · , x_n} of variables. This is not sufficient, since the algorithm may multiply the variables by some constant independent of the x_i's so as to form, for example, 2x_1 or x_5 + 3. We therefore have to assume that we also have a collection of constants at our disposal. Even that is not always
enough. There are situations in which it is natural to assume that certain functions of the variables are also given at the outset; their availability may help to simplify the algorithm. We will therefore assume that in addition to the constants and the variables we may also have some functions of the variables as given. All that algorithms can do with the data is add, subtract, multiply, or divide so as to form new quantities. That is, all the algorithm can do is compute quantities which are in the field generated by its inputs. It is convenient, therefore, to assume that the inputs to the algorithm, as well as the quantities which the algorithm is to compute, are elements in some field H (which includes, of course, all the constants). The last point we wish to make before giving the definition concerns the constants. The operation of obtaining 2x from x can be done either by multiplying x by 2 or by adding x to itself. Since our aim is to minimize the number of multiplications, we will always choose the latter. Yet it is convenient to use 2x (which is the multiplicative form) rather than resort to x + x. We will therefore not count multiplication by a constant as a multiplication. This convention of not counting certain multiplications makes sense when the constant is an integer. Yet we may also want to compute x + 3 and therefore include 3 as a constant. As a matter of fact, we will assume that the collection of constants at our disposal is a field, and therefore we will always have 1/3 as a constant whenever we have 3 as a constant. It is for mathematical convenience that we will adopt the convention that multiplication by any constant is not counted. This convention means that we will not count computing x/3 from x and 3 as involving a multiplication (or division). The justification for such a counterintuitive convention is the mathematical ease which results from it. At the same time we also undertake to justify this convention whenever we apply the results. In every application we have to convince ourselves that this convention makes sense from the application's point of view.

DEFINITION 1. Let G be a field (the field of constants) and let H ⊇ G be a field. Let B ⊆ H be a subset of the elements of H (the set of inputs to the algorithm). An algorithm A (over the set B) is a finite sequence h_1, h_2, · · · , h_n of elements of H such that for each i either
(1) h_i ∈ B; or
(2) there exist j, k < i such that h_i = h_j ∘ h_k, where ∘ stands for +, −, ×, or ÷.

DEFINITION 2. Let f_1, f_2, · · · , f_t be elements of H. An algorithm A = (h_1, h_2, · · · , h_n) over B is said to compute f_1, f_2, · · · , f_t if for every i (1 ≤ i ≤ t) there exists j such that f_i = h_j. In other words, the set {h_1, h_2, · · · , h_n} includes the set {f_1, f_2, · · · , f_t}.

DEFINITION 3. Let A = (h_1, h_2, · · · , h_n) be an algorithm over B. An h_i is said to be a non-multiplication/division step (non-m/d step) if one of the following three conditions holds:
1. h_i ∈ B.
2. There exist j, k < i such that h_i = h_j + h_k or h_i = h_j − h_k.
3. There exist j < i and g ∈ G such that h_i = g × h_j.
Otherwise h_i is called an m/d step. We denote the number of m/d steps in an algorithm A by μ(A), and if we want to emphasize the dependence on B and G, by μ_B(A) or μ_B(A; G).

DEFINITION 4. Let G, H, and B be as in Definition 1, and let f_1, f_2, · · · , f_t be elements of H. We define the multiplicative complexity of f_1, · · · , f_t, denoted by μ(f_1, f_2, · · · , f_t), by

  μ(f_1, f_2, · · · , f_t) = min μ(A),

where A ranges over all algorithms over B (with G as the field of constants) which compute f_1, · · · , f_t. To emphasize the dependence of the multiplicative complexity on G and B we will at times denote it by μ_B(f_1, · · · , f_t; G).

Let us explore some immediate consequences of the definition. It is obvious that if f_1, f_2, · · · , f_t are all elements of B then μ_B(f_1, · · · , f_t) = 0. We can enlarge the class of elements of H whose multiplicative complexity is 0. Let us denote by L_G(X) the linear span (over G) of X; then it is obvious that if f_1, f_2, · · · , f_t are all elements of L_G(B) then μ_B(f_1, · · · , f_t; G) = 0. The clue to this observation is that we can always modify an algorithm by adding to it a step which is an element of B without changing the number of m/d steps of the algorithm. Similarly, we can modify an algorithm by inserting a step which is the sum or difference of two previous steps, or which is the product of a previous step by an element g ∈ G, without changing the number of m/d steps of the algorithm. The following propositions and their corollaries are immediate consequences of this ability to modify an algorithm, and will therefore be stated without proofs.

PROPOSITION 1. μ_B(f_1, · · · , f_t; G) = 0 if and only if f_i ∈ L_G(B) for all i = 1, 2, · · · , t.

PROPOSITION 2. Let B and B' be two subsets of H such that B' ⊆ L_G(B); then for every f_1, · · · , f_t

  μ_{B'}(f_1, · · · , f_t; G) ≥ μ_B(f_1, · · · , f_t; G).

COROLLARY. Let B and B' be two subsets of H such that L_G(B') = L_G(B); then for every f_1, · · · , f_t

  μ_{B'}(f_1, · · · , f_t; G) = μ_B(f_1, · · · , f_t; G).

PROPOSITION 3. Let {f_1, · · · , f_t} and {f'_1, f'_2, · · · , f'_{t'}} be two subsets of H such that f'_i ∈ L_G(f_1, · · · , f_t) for each i = 1, 2, · · · , t'; then for every B

  μ_B(f'_1, · · · , f'_{t'}; G) ≤ μ_B(f_1, · · · , f_t; G).

COROLLARY. Let {f_1, · · · , f_t} and {f'_1, · · · , f'_{t'}} be two subsets of H such that L_G(f_1, · · · , f_t) = L_G(f'_1, · · · , f'_{t'}); then for every B

  μ_B(f_1, · · · , f_t; G) = μ_B(f'_1, · · · , f'_{t'}; G).

PROPOSITION 4. Let f'_i = f_i + v_i, i = 1, 2, · · · , t, where v_i ∈ L_G(B); then

  μ_B(f'_1, · · · , f'_t; G) = μ_B(f_1, · · · , f_t; G).
Motivated by Proposition 4, we see that it is more natural to view the f_i's not as elements of the field H but rather as (representatives of) elements of the space H' = H/L_G(B). That is, we view H as a space (over G) and take the quotient of H with L_G(B). For the same reason we will view the steps of an algorithm A (over B) not as elements of H but as (representatives of) elements of H'. This convention will enable us to state the results in a more succinct, and less cumbersome, manner. We are now in a position to prove the first basic result, which provides us with a lower bound on the multiplicative complexity of f_1, · · · , f_t.

LEMMA. Let A = (h_1, h_2, · · · , h_n) be an algorithm over B, and let h(1), h(2), · · · , h(s) be the m/d steps of A; then for each i = 1, 2, · · · , n, h_i ∈ L_G(h(1), · · · , h(s)) (where, as stated above, we view the h_i's as elements of H').

The lemma can be proved easily by induction on n, and therefore the proof will not be given. Let A be an algorithm which computes f_1, f_2, · · · , f_t, and let h(1), h(2), · · · , h(s) be the m/d steps of A. It follows from the lemma (and Definition 2) that for each i = 1, 2, · · · , t, f_i ∈ L_G(h(1), · · · , h(s)), and therefore that L_G(f_1, · · · , f_t) ⊆ L_G(h(1), · · · , h(s)). Therefore we have

  dim L_G(f_1, · · · , f_t) ≤ dim L_G(h(1), · · · , h(s)) ≤ s.

(We use dim X to denote the dimension of a space X.) Since A is an arbitrary algorithm computing f_1, f_2, · · · , f_t we have just proved:

THEOREM 1. μ_B(f_1, · · · , f_t; G) ≥ dim L_G(f_1, · · · , f_t).

It should be emphasized that L_G(f_1, · · · , f_t), and therefore its dimension, depends on B as well, since by the convention mentioned earlier we view the f_i's as elements of H' = H/L_G(B). Theorem 1 enables us to settle one problem connected with the example described in § IIa. The key to the method described there was an algorithm for computing f_1 = x_0y_0, f_2 = x_0y_1 + x_1y_0, f_3 = x_1y_1 using only three multiplications. A better method could have been obtained if we could compute f_1, f_2, and f_3 using only two products. An immediate consequence of Theorem 1 is that no such algorithm is possible.

COROLLARY. Let B = G ∪ {x_0, x_1, y_0, y_1}; then μ_B(f_1, f_2, f_3; G) = 3, where f_1 = x_0y_0, f_2 = x_0y_1 + x_1y_0, f_3 = x_1y_1.
Proof. Since f_1, f_2, and f_3 are linearly independent, Theorem 1 states that μ(f_1, f_2, f_3) ≥ 3. The algorithm in § IIa shows that μ(f_1, f_2, f_3) ≤ 3. Hence the result follows.

Theorem 1 enabled us to show that the algorithm of § IIa is minimal in the number of multiplications. It does not indicate how to find the algorithm. The following theorem, combined with the theorem of the next section, will provide us with such a tool. If we examine the three elements f_1, f_2, f_3 of the corollary we see that f_1 and f_3 have the special property that each of them can be computed using only one product. The algorithm of § IIa has the special property that it computes f_1 and f_3 directly, and not as the sum of products. There are other algorithms for computing f_1, f_2 and f_3 in three multiplications which do not have this property. For example,
the algorithm

  f_1 = x_0·y_0,
  f_2 = (1/2)(x_0 + x_1)(y_0 + y_1) − (1/2)(x_0 − x_1)(y_0 − y_1),
  f_3 = (1/2)(x_0 + x_1)(y_0 + y_1) + (1/2)(x_0 − x_1)(y_0 − y_1) − x_0·y_0

does not compute f_3 directly. Theorem 2 guarantees that whenever some of the quantities to be computed can each be computed using one multiplication, there exists a minimal algorithm which computes each of these quantities directly.

THEOREM 2. Let {f_1, f_2, · · · , f_t} be a set of elements of H. If μ_B(f_i; G) = 1 for i = 1, 2, · · · , k, and if {f_1, f_2, · · · , f_k} are linearly independent (as elements of H'), then there exists an algorithm A = (h_1, · · · , h_n) (over B) with the following properties:
1. A computes f_1, f_2, · · · , f_t;
2. μ_B(A) = μ_B(f_1, f_2, · · · , f_t);
3. for each i = 1, 2, · · · , k, h(i) = f_i (as elements of H'), where h(i) is the ith m/d step of A.

Proof. Let A = (h_1, · · · , h_n) be a minimal algorithm for computing f_1, · · · , f_t. We will modify A so as to cause it to have the desired properties. We first construct A' from A by adding at the beginning of A an algorithm for computing f_k using one multiplication. This new algorithm satisfies μ(A') = μ(A) + 1, so we have to delete an m/d step from A' while guaranteeing that the resulting algorithm computes f_1, · · · , f_t. Let h(1), h(2), · · · , h(s) be the m/d steps of A. By the lemma, f_k ∈ L_G(h(1), · · · , h(s)). Let r be the smallest integer such that f_k ∈ L_G(h(1), · · · , h(r)); then f_k = Σ_{i=1}^{r} g_i h(i) and, by the minimality of r, g_r ≠ 0. Therefore h(r) = g_r^{−1} f_k − Σ_{i=1}^{r−1} g_i g_r^{−1} h(i). That means we can modify A' by replacing h(r) by a sequence of steps, none of which is an m/d step, computing g_r^{−1} f_k − Σ_{i=1}^{r−1} g_i g_r^{−1} h(i). This new algorithm, A'', satisfies μ(A'') = μ(A). We continue the process with f_{k−1}, f_{k−2}, · · · , f_1. The assumption that the f_i's (i = 1, 2, · · · , k) are linearly independent guarantees that the m/d step which was deleted (in constructing A'' from A') was an m/d step of the original algorithm. The final algorithm so obtained satisfies the conditions of the theorem.
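Theorem 1's lower bound can be checked mechanically for the system of § IIa: write each form as its vector of coefficients over the monomials x_i y_j and compute the rank of the resulting matrix, which equals dim L_G(f_1, f_2, f_3). The sketch below does this with exact rationals; it is only an illustration of the bound, and the helper name is an arbitrary choice.

```python
from fractions import Fraction

def rank(rows):
    """Row-reduce a matrix of exact rationals and return its rank."""
    M = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for c in range(len(M[0])):
        piv = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(len(M)):
            if i != r and M[i][c] != 0:
                f = M[i][c] / M[r][c]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

# Coefficients of f1 = x0*y0, f2 = x0*y1 + x1*y0, f3 = x1*y1
# over the monomial basis (x0*y0, x0*y1, x1*y0, x1*y1).
F = [[1, 0, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]]
print(rank(F))   # 3, so at least 3 multiplications are needed
```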
IIIb. Linear functions. The set-up of § IIIa was quite general. In this section we will specialize the field H and the elements f_1, f_2, · · · , f_t to be computed. Let F ⊇ G be a field which includes the field of constants G, and let y_1, y_2, · · · , y_n be a set of indeterminates. We will take H to be H = F(y_1, · · · , y_n), that is, the extension of F by the n indeterminates y_1, y_2, · · · , y_n. Throughout this section we will use B = F ∪ {y_1, · · · , y_n}, and will therefore not mention B explicitly. We will also consider a special class of f_1, · · · , f_t, namely those that are linear in the y_j's. More precisely, we will take f_i = Σ_{j=1}^{n} φ_{ij} y_j, i = 1, 2, · · · , t, where the φ_{ij}'s are elements of F. It is convenient to denote the (column) vector f = (f_1, f_2, · · · , f_t)^T as f = Φy, where Φ is a t×n matrix whose elements are the φ_{ij}'s and y is the (column) vector y = (y_1, y_2, · · · , y_n)^T.
Let V be the space over G whose elements are n-tuples of elements of F. Thus each row of Φ is an element of V. Let V̄ be the quotient space V̄ = V/G^n; then each row of Φ can be viewed as (a representative of) an element of V̄. We will denote by ρ_r(Φ) the dimension of the linear space (over G) generated by the rows of Φ (viewed as elements of V̄). With this terminology we can restate Theorem 1 of § IIIa as:
THEOREM 1'. μ(Φy) ≥ ρ_r(Φ).
What was done with the rows of Φ can also be done with the columns of Φ. We denote by W the space (over G) whose elements are t-tuples of elements of F, and by W̄ the space W/G^t. The columns of Φ can be viewed as (representatives of) elements of W̄. We will denote by ρ_c(Φ) the dimension of the linear space (over G) generated by the columns of Φ (viewed as elements of W̄). Using this formulation we can now state the theorem:

THEOREM 2. μ(Φy) ≥ ρ_c(Φ).

We will not prove this theorem here. (The proof is given in [7].) We should emphasize that in general, ρ_r(Φ) ≠ ρ_c(Φ). For example, if we choose F = G(x_1, · · · , x_n) to be the field G extended by the n indeterminates x_1, x_2, · · · , x_n, and Φ the 1×n matrix (x_1, x_2, · · · , x_n), then ρ_r(Φ) = 1 while ρ_c(Φ) = n. Theorem 2 of § IIIa together with Theorem 2 of this section enable us to derive the algorithm of § IIa. So we assume that we are not familiar with the algorithm of § IIa, and ask whether the elements f_1, f_2, and f_3 of the corollary of IIIa can be computed using only three products, and if they can, then show how to find the algorithm. By Theorem 2 of IIIa we know that if there exists an algorithm using 3 products, there must be one in which two of the products are x_0y_0 = f_1 and x_1y_1 = f_3. Let h be the third m/d step of the hypothetical algorithm. By the lemma of IIIa we know that f_2 ∈ L_G(f_1, f_3, h); that is, there exist g, g_1, g_2 ∈ G such that f_2 = g_1 f_1 + g_2 f_3 + g h. Since f_2 ∉ L_G(f_1, f_3) we know that g ≠ 0, and that therefore f_4 = f_2 − g_1 f_1 − g_2 f_3 can be computed using only one multiplication. Plugging in the expressions for f_1, f_2, f_3 we obtain

  f_4 = (x_1 − g_1 x_0) y_0 + (x_0 − g_2 x_1) y_1.

That means, by Theorem 2 of this section, that a necessary condition for μ(f_1, f_2, f_3) = 3 is that ρ_c(Φ) = 1, where Φ is the 1×2 matrix (x_1 − g_1 x_0, x_0 − g_2 x_1). (It should be noticed that we have already implicitly stated that in this case F = G(x_0, x_1).) Our problem is thus reduced to finding whether there exist g_1, g_2 ∈ G and α, β ∈ G, not both 0, such that:

  α(x_1 − g_1 x_0) + β(x_0 − g_2 x_1) = 0.

Equating coefficients we obtain:

  β − α g_1 = 0,   α − β g_2 = 0.

We now know that α ≠ 0, since if α = 0 then β = α g_1 = 0. Substituting β = α g_1 in α − β g_2 = 0 we obtain α − α g_1 g_2 = 0, which is the same as g_1 g_2 = 1. Choosing g_1 = r (r ≠ 0) and g_2 = r^{−1} we obtain for f_4

  f_4 = x_0y_1 + x_1y_0 − r·x_0y_0 − r^{−1}·x_1y_1,
and therefore

  f_4 = −r(x_0 − r^{−1}x_1)(y_0 − r^{−1}y_1).

Taking r = +1 yields the algorithm of § IIa. This example is but a small illustration of how the results on multiplicative complexity can guide us in finding new algorithms. We will meet more examples of this nature later on.
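A quick symbolic check of the family of algorithms just derived (one algorithm for each nonzero r in G) can be carried out as below; sympy is used only for the verification and is, of course, not part of the derivation.

```python
import sympy as sp

x0, x1, y0, y1, r = sp.symbols('x0 x1 y0 y1 r')

m1 = x0 * y0                                 # f1, computed directly
m2 = x1 * y1                                 # f3, computed directly
m3 = -r * (x0 - x1 / r) * (y0 - y1 / r)      # f4 = f2 - r*f1 - f3/r

f2 = m3 + r * m1 + m2 / r                    # recover f2 = x0*y1 + x1*y0
assert sp.simplify(f2 - (x0 * y1 + x1 * y0)) == 0

# r = 1 gives the algorithm of § IIa:
assert sp.expand(m3.subs(r, 1)) == sp.expand(-(x0 - x1) * (y0 - y1))
```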
IIIc. Quadratic and bilinear forms. Two of the three examples of § II, those of IIa and IIc, dealt with the efficiency of computing certain systems of bilinear forms. A general system of bilinear forms can be written as

  f_k = Σ_{i=1}^{r} Σ_{j=1}^{s} a_{ijk} x_i y_j,   k = 1, 2, · · · , t,

where {x_1, · · · , x_r} and {y_1, · · · , y_s} are two sets of indeterminates, and the a_{ijk}'s are elements of G. A system of bilinear forms is a special case of a system of quadratic forms, i.e., f_k = Σ_{i≤j} a_{ijk} x_i x_j, k = 1, 2, · · · , t, where the x_i's are indeterminates and the a_{ijk}'s are elements of G. In this section we will describe some special properties of minimal algorithms for such special cases. We will assume in this section (and in fact in the remainder of these notes) that B = G ∪ {x_1, · · · , x_r} (in the case of quadratic forms), or that B = G ∪ {x_1, · · · , x_r} ∪ {y_1, · · · , y_s} (in the case of bilinear forms). A system of quadratic forms can always be computed without using the division operation. But can the operation of division help in reducing the number of m/d steps? There are situations where it seems to help. For example, the most efficient method known for computing the determinant of a matrix uses some variation of Gaussian elimination, that is, it uses division, while the determinant can be computed without recourse to division. This only suggests, but does not prove, that division is needed to minimize the number of m/d steps for computing the determinant. In fact, we do not know what the multiplicative complexity of computing the determinant is. Another, more artificial, example in which division helps to reduce the number of m/d steps is that of computing x^31. It is easily verified that μ(x^31) = 6. One algorithm for computing x^31 using only 6 m/d steps is: compute x^2 = x · x, x^4 = x^2 · x^2, x^8 = x^4 · x^4, x^16 = x^8 · x^8, x^32 = x^16 · x^16, x^31 = x^32 / x. It can easily be demonstrated that every algorithm for computing x^31 which does not use division necessarily has at least 7 m/d steps. This cannot happen when we consider quadratic forms. The next theorem makes this assertion more precise.

THEOREM 1. Let S be a system of quadratic forms. If G has infinitely many elements then there exists an algorithm A' = (h'_1, · · · , h'_n) satisfying the following three conditions:
1. A' computes S;
2. μ(A') = μ(S);
3. every m/d step of A' is M_1(x) · M_2(x), where M_1(x) and M_2(x) are linear forms.
Proof. Let A = (h_1, h_2, · · · , h_m) be any algorithm computing S such that μ(A) = μ(S). We will construct from A another algorithm A' which satisfies the conditions of the theorem. Every step of A is an element of G(x_1, · · · , x_r), that is, a rational function with coefficients in G in the indeterminates x_1, · · · , x_r. If the polynomial in the denominator of h_i has a nonzero constant term then h_i can be represented as a power series in x_1, · · · , x_r. Since G has infinitely many elements we can always find g_i ∈ G, i = 1, 2, · · · , r, such that by substituting x_i − g_i for x_i in the algorithm A we assure that all denominators will have nonzero constant terms. The algorithm obtained after the substitution computes the system S' = Σ_{i≤j} a_{ijk}(x_i − g_i)(x_j − g_j), k = 1, 2, · · · , t. The system S' is the same as S (viewed as elements of H'), so we can assume that every step of A can be expressed as a power series. Let L_0, L_1, and L_2 be the linear operators on the space of power series which give, respectively, the constant, linear, and quadratic part of the power series. Let L = L_0 + L_1 + L_2, and L' = I − L. We will modify the algorithm A to an algorithm A' such that for every step h' of A', L(h') = h', and such that for every step h of A there exists a step h' of A' such that L(h) = L(h'). We will further ensure that μ(A') = μ(A). Since A computes S, every element f of S is a step of A; let h be that step. Let h' be the corresponding step in A'. Since f is a quadratic form we obtain that L(f) = f and therefore f = L(f) = L(h) = L(h') = h'. So we see that the algorithm A' will satisfy the conditions of the theorem. We will construct A' sequentially, starting from h_1. By definition h_1 ∈ B = G ∪ {x_1, · · · , x_r} and therefore satisfies the condition L(h_1) = h_1. Assume we have modified h_1, h_2, · · · , h_k; then either h_{k+1} ∈ B, in which case it is left unchanged, or h_{k+1} = g_1 h_i + g_2 h_j (i, j ≤ k), in which case it is replaced by h'_{k+1} = g_1 h'_i + g_2 h'_j, where h'_i and h'_j are the steps of A' corresponding to h_i and h_j, or h_{k+1} = h_i × h_j, or h_{k+1} = h_i : h_j. If h_{k+1} = h_i × h_j then

  L_0(h_{k+1}) = L_0(h_i) L_0(h_j),
  L_1(h_{k+1}) = L_0(h_i) L_1(h_j) + L_0(h_j) L_1(h_i),
  L_2(h_{k+1}) = L_0(h_i) L_2(h_j) + L_1(h_i) L_1(h_j) + L_0(h_j) L_2(h_i).

Since L_0(h) ∈ G we can compute L_0(h_{k+1}) and L_1(h_{k+1}) without an m/d step, and we can compute L_2(h_{k+1}) using one m/d step, namely that of L_1(h_i) L_1(h_j). We now replace h_{k+1} by a sequence of steps computing L_0(h_{k+1}), L_1(h_{k+1}), L_2(h_{k+1}) and finally L(h_{k+1}). To finish the proof we have to consider the case that h_{k+1} = h_i : h_j (i, j ≤ k). In this case we can assume that L_0(h_j) = 1, and we obtain that

  L_0(h_{k+1}) = L_0(h_i),
  L_1(h_{k+1}) = L_1(h_i) − L_0(h_i) L_1(h_j),
  L_2(h_{k+1}) = L_2(h_i) − L_0(h_i) L_2(h_j) − L_1(h_j)(L_1(h_i) − L_0(h_i) L_1(h_j)).

Again L_0(h_{k+1}), L_1(h_{k+1}), L_2(h_{k+1}) and L(h_{k+1}) can be computed by a sequence of steps involving only one m/d step, that of L_1(h_j)(L_0(h_i)L_1(h_j) − L_1(h_i)). That finishes the proof.

Theorem 1 asserts more than the existence of a minimal algorithm which does not use division. It guarantees the existence of a minimal algorithm which never computes any polynomial of degree higher than 2. This assertion means that the order in which we execute the m/d steps does not matter, since no multiplication depends on a previous multiplication. An algorithm A for computing a system of quadratic forms, all of whose steps are polynomials of at most second degree in the indeterminates, is called a quadratic algorithm. Theorem 1 asserts that among
all the minimal algorithms for computing a system of quadratic forms there must exist a quadratic algorithm. In some special cases we can guarantee that every minimal algorithm for computing a system of quadratic forms must be quadratic.

THEOREM 2. Let f_k = Σ_{i≤j} a_{ijk} x_i x_j, k = 1, 2, · · · , t, be a system of quadratic forms. If

  dim L_G(f_1, · · · , f_t) = t,

then every algorithm A computing f_1, · · · , f_t which satisfies μ(A) = t is a quadratic algorithm.

Proof. Let h(1), h(2), · · · , h(t) be the m/d steps of A, and denote by h the (column) vector h = (h(1), h(2), · · · , h(t))^T. Denote by f the column vector f = (f_1, f_2, · · · , f_t)^T. The lemma of IIIa guarantees the existence of a t×t matrix M with coefficients in G and a (column) vector l = (l_1, l_2, · · · , l_t)^T whose entries are linear polynomials such that f = Mh + l. Since dim L_G(f_1, · · · , f_t) = t the rank of M must be t and hence M is invertible. We thus obtain h = M^{−1}f − M^{−1}l. The entries of M^{−1}f are quadratic polynomials, and those of M^{−1}l are linear polynomials. Therefore the entries of h, i.e., the m/d steps of A, are quadratic polynomials. This proves the assertion.

Everything which was said about a system of quadratic forms holds a fortiori for a system of bilinear forms. There are other aspects of bilinear forms which make them a special object of study. We can understand the importance of these aspects by re-examining the example of IIc, that of matrix multiplication. The cornerstone of the method described there was the algorithm for multiplying two 2×2 matrices using 7 multiplications in such a way that the identities do not depend on the commutative law of multiplication. It was this independence of the commutative law which enabled us to apply the algorithm to matrices and not only to scalars. The assumption that the indeterminates do not commute will enable us to substitute matrices for the indeterminates, or even matrices for the x_i's and vectors for the y_j's, thus increasing the usefulness of the algorithm. We will call algorithms whose validity does not depend on the commutative law of multiplication noncommutative algorithms, and we will denote by μ̄(S) the minimum number of m/d steps needed by a noncommutative algorithm to compute the system S of bilinear forms. The following three theorems will be given without proof. Their proofs are similar to those of Theorems 1 and 2 of this subsection. (See [7].)

THEOREM 3. For every system S of bilinear forms

  μ(S) ≤ μ̄(S) ≤ 2μ(S).

THEOREM 4. Let S be a system of bilinear forms. There exists a noncommutative algorithm A computing S which satisfies the following two conditions:
1. μ(A) = μ̄(S);
2. every m/d step of A is of the form M_1(x) · M_2(y), where M_1(x) is a linear form of {x_1, · · · , x_r} and M_2(y) is a linear form of {y_1, · · · , y_s}.

An algorithm A all of whose m/d steps are of the form M_1(x) · M_2(y) is called a bilinear algorithm. Every bilinear algorithm is necessarily noncommutative.
THEOREM 5. Let f_k = Σ_{i=1}^{r} Σ_{j=1}^{s} a_{ijk} x_i y_j, k = 1, 2, · · · , t, be a system of bilinear forms. If

  dim L_G(f_1, · · · , f_t) = t,

then every algorithm A computing f_1, · · · , f_t which has t m/d steps is a bilinear algorithm. Consequently such systems of bilinear forms must satisfy μ̄(S) = μ(S).

Let f_k = Σ_{j=1}^{s} Σ_{i=1}^{r} a_{ijk} x_i y_j, k = 1, 2, · · · , t, be a system of bilinear forms, and let n be n = μ̄(f_1, · · · , f_t). Theorem 4 guarantees that there exist n multiplications

  m_l = (Σ_{i=1}^{r} α_{il} x_i)(Σ_{j=1}^{s} β_{jl} y_j),   l = 1, 2, · · · , n,

in a minimal noncommutative algorithm, where the α_{il}'s and β_{jl}'s are in G. The lemma of IIIa guarantees that there are constants γ_{kl} in G such that f_k = Σ_{l=1}^{n} γ_{kl} m_l, k = 1, 2, · · · , t. As will become apparent, it is often useful to consider not the system f_1, · · · , f_t of bilinear forms but the single trilinear form T = Σ_{k=1}^{t} f_k z_k, obtained by multiplying f_k by the indeterminate z_k and summing the results. The tensor of the trilinear form T is that of the coefficients a_{ijk} of the system of bilinear forms, and we will use T to denote both the trilinear form and the tensor of coefficients of the system of bilinear forms. Substituting Σ_{l=1}^{n} γ_{kl} m_l for f_k we obtain

  T = Σ_{k=1}^{t} f_k z_k = Σ_{l=1}^{n} (Σ_{i=1}^{r} α_{il} x_i)(Σ_{j=1}^{s} β_{jl} y_j)(Σ_{k=1}^{t} γ_{kl} z_k).

The last equation gives us an alternative characterization of μ̄(S): the quantity μ̄(S) is the smallest integer n such that there are n triplets of linear forms L_l(x), M_l(y), N_l(z), l = 1, 2, · · · , n, such that T = Σ_{l=1}^{n} L_l(x) M_l(y) N_l(z). Put differently, μ̄(S) is the rank of the tensor T. There are other conclusions we can draw from the tensor (or trilinear form) formulation. Let us start again with the identity

  Σ_{k=1}^{t} (Σ_{i=1}^{r} Σ_{j=1}^{s} a_{ijk} x_i y_j) z_k = Σ_{l=1}^{n} (Σ_{i=1}^{r} α_{il} x_i)(Σ_{j=1}^{s} β_{jl} y_j)(Σ_{k=1}^{t} γ_{kl} z_k).

Equating the coefficients of, say, the x_i's, we obtain

  Σ_{k=1}^{t} Σ_{j=1}^{s} a_{ijk} y_j z_k = Σ_{l=1}^{n} α_{il} (Σ_{j=1}^{s} β_{jl} y_j)(Σ_{k=1}^{t} γ_{kl} z_k),   i = 1, 2, · · · , r.

This last set of identities means that the system S' of bilinear forms (in the y_j's and z_k's), Σ_{k=1}^{t} Σ_{j=1}^{s} a_{ijk} y_j z_k, i = 1, 2, · · · , r, can be computed using only n multiplications, and one noncommutative algorithm with n m/d steps is given by computing the n products (Σ_{j=1}^{s} β_{jl} y_j)(Σ_{k=1}^{t} γ_{kl} z_k), l = 1, 2, · · · , n. But n was μ̄(S), where S is given by Σ_{i=1}^{r} Σ_{j=1}^{s} a_{ijk} x_i y_j, k = 1, 2, · · · , t. That means that μ̄(S') ≤ μ̄(S). Since the construction can be reversed we also have μ̄(S) ≤ μ̄(S'), which means that μ̄(S) = μ̄(S'). Instead of equating coefficients of the x_i's we could have equated coefficients of the y_j's and thus obtained a third system S'' with a noncommutative algorithm for computing it using n m/d steps, and S'' also satisfies μ̄(S'') = μ̄(S) = μ̄(S'). We have thus proved the following theorem:
THEOREM 6. Let S_1 be the system of bilinear forms Σ_{i=1}^{r} Σ_{j=1}^{s} a_{ijk} x_i y_j, k = 1, 2, · · · , t. Construct from S_1 the five systems of bilinear forms obtained by permuting the roles of the three sets of indeterminates {x_i}, {y_j}, {z_k} in the tensor a_{ijk} (for instance, the system Σ_{k=1}^{t} Σ_{j=1}^{s} a_{ijk} y_j z_k, i = 1, 2, · · · , r, constructed above); these are the transposes of S_1. Then all six systems have the same noncommutative complexity μ̄. Moreover, given a noncommutative algorithm with n m/d steps for one of these systems, we can construct a noncommutative algorithm for any of the other systems with n m/d steps.

Theorem 6 is very useful in several applications. We will illustrate the usefulness of the theorem by two examples. These examples are prototypes of the constructions we will use later in the sections dealing with applications.

Example 1. We will start with the algorithm of § IIa, that is

  x_0·y_0 = x_0·y_0,
  x_0·y_1 + x_1·y_0 = x_0·y_0 + x_1·y_1 − (x_0 − x_1)·(y_0 − y_1),
  x_1·y_1 = x_1·y_1.
Following the proof of Theorem 6 we will multiply (on the left) the first equation by z_0, the second by z_1, and the third by z_2, and then sum the three resulting equations to obtain:

  z_0·x_0y_0 + z_1·(x_0y_1 + x_1y_0) + z_2·x_1y_1 = (z_0 + z_1)·x_0y_0 + (z_1 + z_2)·x_1y_1 − z_1·(x_0 − x_1)(y_0 − y_1).

(Following the proof we grouped terms on the right so as to have only three summands—the number of m/d steps in the original algorithm of IIa.) Equating coefficients of the x_i's we obtain

  z_0y_0 + z_1y_1 = (z_0 + z_1)y_0 − z_1(y_0 − y_1),
  z_1y_0 + z_2y_1 = (z_1 + z_2)y_1 + z_1(y_0 − y_1).
Denoting the system of IIa by S, we obtain a new system S' which is readily
recognized as that of computing the entries of the vector

  ( z_0  z_1 ) ( y_0 )
  ( z_1  z_2 ) ( y_1 ).
That is, S' is the computation of the product of a 2×2 symmetric matrix by a vector. The construction of Theorem 6 yielded an algorithm for performing this computation using only 3 m/d steps, once we had the algorithm of IIa (which was derived in IIIb). Moreover, Theorem 6 together with the corollary of IIIa states that μ̄(S') = μ̄(S) = 3. (In fact, we can use Theorem 2 of IIIb to show that μ(S') = 3.)

Example 2. This example will treat the problem of computing the product of two complex numbers, (x_0 + ix_1)(y_0 + iy_1) = (x_0y_0 − x_1y_1) + i(x_0y_1 + x_1y_0). To tie this computation to the algorithm of IIa we recall that an alternative way of defining the field of complex numbers is to view x_0 + ix_1 as the polynomial (in u) x_0 + x_1u, and define multiplication of two such polynomials as (x_0 + x_1u)(y_0 + y_1u) mod (u^2 + 1), u^2 + 1 being the minimal degree polynomial which has i as a root. The problem of computing the product of two complex numbers is the same as that of computing the coefficients of the polynomial Q(u) = (x_0 + x_1u)(y_0 + y_1u) mod (u^2 + 1). This formulation of the problem enables us to consider the computation as if it were done in two stages. The first stage is that of computing the coefficients of the polynomial Q'(u) = (x_0 + x_1u)(y_0 + y_1u) (this is the problem which was treated in IIa!), and the second stage consists of reducing modulo u^2 + 1, that is, replacing u^2 by −1, which means subtracting the quadratic coefficient of Q'(u) from its constant term. It is important to note that the second stage does not involve m/d steps. Thus, using the algorithm of IIa we obtain the following algorithm for the system S of the product of two complex numbers:

  x_0y_0 − x_1y_1 = x_0·y_0 − x_1·y_1,
  x_0y_1 + x_1y_0 = x_0·y_0 + x_1·y_1 − (x_0 − x_1)·(y_0 − y_1).

Following the construction of Theorem 6 we multiply (on the right) the first equation by z_1 and the second by z_0. (Note that the first equation was multiplied by z_1—not z_0, and the second by z_0—not z_1. This was done to make the resulting system S' look more recognizable.) Summing the two equations we obtain

  (x_0y_0 − x_1y_1)z_1 + (x_0y_1 + x_1y_0)z_0 = x_0y_0(z_1 + z_0) + x_1y_1(z_0 − z_1) − (x_0 − x_1)(y_0 − y_1)z_0.

Equating the coefficients of the x_i's, we obtain the following system S' and algorithm for computing it using 3 m/d steps. (Note that in writing S' we write the coefficient of x_1 first and that of x_0 second):

  y_0z_0 − y_1z_1 = y_1(z_0 − z_1) + (y_0 − y_1)z_0,
  y_0z_1 + y_1z_0 = y_0(z_1 + z_0) − (y_0 − y_1)z_0.
The system S' which we obtained is the same as S (S' is the system of computing (y_0 + iy_1)(z_0 + iz_1)), but the algorithm is different! In some applications the second algorithm may be preferable to the first. For example, if we know beforehand the values of z_0 and z_1 we can precompute z_1 + z_0 and z_1 − z_0, so executing the second algorithm calls for only 3 additions; but whether we know beforehand z_0 and z_1, or y_0 and y_1, executing the first algorithm uses 4 additions. As an immediate consequence of Theorem 6 we obtain the following corollary.

COROLLARY. Let S_{mnp} denote the system of bilinear forms of the product of the m×n matrix X by the n×p matrix Y. Then:

  μ̄(S_{mnp}) = μ̄(S_{npm}) = μ̄(S_{pmn}) = μ̄(S_{pnm}) = μ̄(S_{nmp}) = μ̄(S_{mpn}).
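The trade-off just described is easy to make concrete. The sketch below implements the transposed algorithm for (y_0 + iy_1)(z_0 + iz_1): with z_0, z_1 fixed, the two sums of the z's are precomputed, and each subsequent product then costs 3 real multiplications and 3 real additions. The particular grouping of terms is one valid choice consistent with the reconstruction above, not necessarily the exact one printed in the original.

```python
def make_complex_multiplier(z0, z1):
    """Return a function computing (y0 + i*y1)*(z0 + i*z1) with 3 real
    multiplications and 3 real additions, after precomputing on z."""
    s = z0 + z1          # precomputed
    d = z0 - z1          # precomputed
    def multiply(y0, y1):
        m1 = y1 * d                    # y1*(z0 - z1)
        m2 = y0 * s                    # y0*(z0 + z1)
        m3 = (y0 - y1) * z0            # (y0 - y1)*z0   -- 1 addition
        real = m1 + m3                 # y0*z0 - y1*z1  -- 1 addition
        imag = m2 - m3                 # y0*z1 + y1*z0  -- 1 addition
        return real, imag
    return multiply

mul = make_complex_multiplier(3.0, 4.0)
re, im = mul(1.0, 2.0)
assert complex(re, im) == (1 + 2j) * (3 + 4j)   # (1+2i)(3+4i) = -5+10i
```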
We will end this section by introducing two ways of constructing a system S'' of bilinear forms from two other systems S and S' of bilinear forms. The first is the construction of the tensor product of the systems S and S', and the second way is their direct sum.

DEFINITION 1. Let S = Σ_{i=1}^{r} Σ_{j=1}^{s} a_{ijk} x_i y_j, k = 1, 2, · · · , t, and S' = Σ_{i=1}^{r'} Σ_{j=1}^{s'} b_{ijk} x'_i y'_j, k = 1, 2, · · · , t', be two systems of bilinear forms. The system S'' = S ⊗ S' is the tensor product of S and S' if S'' = Σ_{i''=1}^{r''} Σ_{j''=1}^{s''} c_{i''j''k''} x''_{i''} y''_{j''}, k'' = 1, 2, · · · , t'', where S'' satisfies the following conditions:
1. r'' = r·r', s'' = s·s', t'' = t·t'.
2. If i'' = (i−1)r' + i' (1 ≤ i ≤ r, 1 ≤ i' ≤ r'), j'' = (j−1)s' + j' (1 ≤ j ≤ s, 1 ≤ j' ≤ s'), and k'' = (k−1)t' + k' (1 ≤ k ≤ t, 1 ≤ k' ≤ t'), then c_{i''j''k''} = a_{ijk} · b_{i'j'k'}.
In other words, the tensor of coefficients of S'' is the tensor product of the tensor of coefficients of S and the tensor of coefficients of S'.

Example 3. This example will treat the tensor product of the system S of Example 1 by the system S' of Example 2. Consider the system S'' of bilinear forms given by

  ( x_0  −x_1  x_2  −x_3 ) ( y_0 )
  ( x_1   x_0  x_3   x_2 ) ( y_1 )
  ( x_2  −x_3  x_4  −x_5 ) ( y_2 )
  ( x_3   x_2  x_5   x_4 ) ( y_3 ).
If we partition the matrix Φ(x) into 2×2 blocks we obtain that S'' can be written as

  ( X_0  X_1 ) ( Y_0 )
  ( X_1  X_2 ) ( Y_1 ),

where X_0 = ((x_0, −x_1), (x_1, x_0)), X_1 = ((x_2, −x_3), (x_3, x_2)), X_2 = ((x_4, −x_5), (x_5, x_4)), Y_0 = (y_0, y_1)^T, and Y_1 = (y_2, y_3)^T. That is, we can "block" S'' such that the "block structure" is that of the system of Example 1, while the structure of each block is that of the system of Example 2. Moreover, different blocks (i.e., blocks denoted by different symbols) do not have any indeterminate in common. This is another way of characterizing the fact that S'' = S ⊗ S', where S is the system of Example 1 and S' is the system of Example 2. The algorithm which was developed in Example 1 is bilinear and therefore noncommutative. That means that we can substitute matrices and vectors for the
indeterminates without negating the validity of the identities. We thus obtain:

  (z_0, z_1)^T = M_1 − M_3,   (z_2, z_3)^T = M_2 + M_3,

where

  M_1 = (X_0 + X_1)·Y_0,   M_2 = (X_1 + X_2)·Y_1,   M_3 = X_1·(Y_0 − Y_1).

That is, M_1, M_2, and M_3 are three instances of the problem discussed in Example 2. Using the algorithm of Example 2, each of M_1, M_2, and M_3 can be computed using 3 m/d steps.
Combining these algorithms, by substituting the three scalar products of the algorithm of Example 2 for each of M_1, M_2, and M_3, we obtain an algorithm for computing z_0, z_1, z_2, and z_3 which uses 9 m/d steps. For obvious reasons we will call an algorithm which was constructed as above the tensor product of the two algorithms. It should be emphasized that even if the two algorithms are minimal, their tensor product is not necessarily minimal. In other words, if S'' = S ⊗ S' then μ̄(S'') is not always equal to μ̄(S) · μ̄(S'). The construction of this example did show that for every two systems S and S' of bilinear forms we have μ̄(S ⊗ S') ≤ μ̄(S) · μ̄(S'). In § VI we will meet examples where the inequality holds.
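Under the reconstruction of Example 3 given above, the tensor-product algorithm can be exercised numerically: the block-level identities of Example 1 are applied with complex-type 2×2 blocks, and each block product M_1, M_2, M_3 is computed with one of the three-multiplication complex-product algorithms of Example 2, for 9 real multiplications in all. The sketch below treats each block as a complex number, which is exactly what the block structure encodes; the function names are illustrative.

```python
def complex_mul_3(a, b):
    """(a0 + i*a1)*(b0 + i*b1) with 3 real multiplications (Example 2)."""
    a0, a1 = a
    b0, b1 = b
    m1, m2, m3 = a0 * b0, a1 * b1, (a0 - a1) * (b0 - b1)
    return (m1 - m2, m1 + m2 - m3)           # (real, imaginary)

def add(a, b):  return (a[0] + b[0], a[1] + b[1])
def sub(a, b):  return (a[0] - b[0], a[1] - b[1])

def example3(X0, X1, X2, Y0, Y1):
    """Compute ((z0,z1),(z2,z3)) = [[X0,X1],[X1,X2]] * (Y0,Y1)^T, where the
    X's and Y's are complex numbers given as pairs, using the 3 block
    products of Example 1 and 3 real multiplications per block product."""
    M1 = complex_mul_3(add(X0, X1), Y0)      # (X0 + X1)*Y0
    M2 = complex_mul_3(add(X1, X2), Y1)      # (X1 + X2)*Y1
    M3 = complex_mul_3(X1, sub(Y0, Y1))      # X1*(Y0 - Y1)
    return sub(M1, M3), add(M2, M3)          # 9 real multiplications in all

# check against the direct computation with Python complex numbers
X0, X1, X2, Y0, Y1 = (1, 2), (3, -1), (0, 5), (2, 2), (-1, 4)
c = lambda p: complex(*p)
z01, z23 = example3(X0, X1, X2, Y0, Y1)
assert c(z01) == c(X0) * c(Y0) + c(X1) * c(Y1)
assert c(z23) == c(X1) * c(Y0) + c(X2) * c(Y1)
```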
We can also rewrite the system of bilinear forms of Example 3 so that the "block structure" is that of the system of Example 2, while the structure of each block is that of the system of Example 1. That is, the system S'' of Example 3 can also be written as S' ⊗ S, where S and S' are as in the example. This equivalence of S ⊗ S' and S' ⊗ S is but a special case of a more general notion of equivalence of two systems of bilinear forms.

DEFINITION 2. Let S be the system of bilinear forms Σ_{i=1}^{r} Σ_{j=1}^{s} a_{ijk} x_i y_j, k = 1, 2, · · · , t, and S' be the system Σ_{i=1}^{r} Σ_{j=1}^{s} b_{ijk} x_i y_j, k = 1, 2, · · · , t. We say that S is equivalent to S' (denoted by S ≡ S') if there exist three nonsingular matrices U, V, and W (of dimensions r×r, s×s, and t×t, respectively) with entries in G such that S can be obtained from S' by substituting Σ_{i'=1}^{r} u_{i,i'} x_{i'} for x_i, substituting Σ_{j'=1}^{s} v_{j,j'} y_{j'} for y_j, and replacing the resulting forms by the linear combinations whose coefficients are the entries w_{k,k'} of W^{−1}; here u_{i,i'} is the (i, i') entry of U and v_{j,j'} is the (j, j') entry of V. Using tensorial notation, this means that the tensor (a_{ijk}) is obtained from (b_{ijk}) by multiplying it along its three coordinates by U, V, and W^{−1}. An immediate consequence of the definition of equivalence is that two equivalent systems have the same multiplicative complexity. That is:

THEOREM 7. If S ≡ S' then μ(S) = μ(S') and μ̄(S) = μ̄(S').

The second construction of a system of bilinear forms from two other systems is that of the direct sum.

DEFINITION 3. Let S = Σ_{i=1}^{r} Σ_{j=1}^{s} a_{ijk} x_i y_j, k = 1, 2, · · · , t, and S' = Σ_{i=1}^{r'} Σ_{j=1}^{s'} b_{ijk} x'_i y'_j, k = 1, 2, · · · , t', be two systems of bilinear forms on disjoint sets of indeterminates. The system S'' = S ⊕ S', the direct sum of S and S', is the system S'' = Σ_{i''=1}^{r''} Σ_{j''=1}^{s''} c_{i''j''k''} x''_{i''} y''_{j''}, k'' = 1, 2, · · · , t'', satisfying the following conditions:
1. r'' = r + r', s'' = s + s', t'' = t + t'.
2. For 1 ≤ i ≤ r, 1 ≤ j ≤ s, 1 ≤ k ≤ t, c_{ijk} = a_{ijk}.
3. For 1 ≤ i' ≤ r', 1 ≤ j' ≤ s', 1 ≤ k' ≤ t', c_{r+i', s+j', t+k'} = b_{i'j'k'}.
4. For all other triplets (i'', j'', k'') not covered in (2) or (3), c_{i''j''k''} = 0.

Example 4. Let S be the system of bilinear forms of Example 1, and S' the system of Example 2. The system S'' = S ⊕ S' is given by the four bilinear forms

  x_0y_0 + x_1y_1,   x_1y_0 + x_2y_1,   x_3y_2 − x_4y_3,   x_3y_3 + x_4y_2,

the first two forms constituting S and the last two S', on disjoint sets of indeterminates.
The algorithms of Example 1 (for computing S) and of Example 2 (for computing S') can be combined to yield an algorithm for computing S'' = S ⊕ S'
having 6 m/d steps, namely the three m/d steps of the algorithm of Example 1 together with the three m/d steps of the algorithm of Example 2. The algorithm thus constructed computes S and S' separately. We will call this algorithm the direct sum of the algorithm for S and the algorithm for S'. This construction of the direct sum algorithm shows that μ(S ⊕ S') ≤ μ(S) + μ(S'). No example of S and S' is known where the inequality holds. As a matter of fact, it is conjectured that μ(S ⊕ S') = μ(S) + μ(S') (and also that μ̄(S ⊕ S') = μ̄(S) + μ̄(S')). Furthermore, it is conjectured that every minimal algorithm for computing S ⊕ S' is the direct sum of a minimal algorithm for computing S and a minimal algorithm for computing S'. In § VI we will see a class of systems of bilinear forms for which the conjecture is proved. We will end the section by proving the conjecture in a special case.

DEFINITION 4. A system of bilinear forms z_k = Σ_{i=1}^{r} Σ_{j=1}^{s} a_{ijk} x_i y_j, k = 1, 2, · · · , t, is said to be minimal if

  μ(z_1, z_2, · · · , z_t) = dim L_G(z_1, z_2, · · · , z_t) = t.
A system of bilinear forms is said to be weakly minimal if one of its transposes is minimal. Using this definition we can now state a theorem describing some situations in which the direct sum conjecture holds.

THEOREM 8. If S is a minimal system of bilinear forms, then for every system S' of bilinear forms μ(S ⊕ S') = μ(S) + μ(S') and μ̄(S ⊕ S') = μ̄(S) + μ̄(S'). If S is only weakly minimal then μ̄(S ⊕ S') = μ̄(S) + μ̄(S').

To facilitate the proof of the theorem we will state and prove a lemma.

LEMMA. Let S be a minimal system of bilinear forms z_k = Σ_{i=1}^{r} Σ_{j=1}^{s} a_{ijk} x_i y_j, k = 1, 2, · · · , t. S is equivalent to the system S̄ given by z'_k = Σ_{i=1}^{r} Σ_{j=1}^{s} b_{ijk} x_i y_j, k = 1, 2, · · · , t, satisfying μ(z'_k) = 1 for all k = 1, 2, · · · , t.

Proof. The minimality of S guarantees, by Theorem 5 of this subsection, that every minimal algorithm for computing S is bilinear. Let m_1, m_2, · · · , m_t be the t m/d steps of such an algorithm. By the lemma of IIIa we have z = Wm, where z is the (column) vector z = (z_1, z_2, · · · , z_t)^T, m is the (column) vector m = (m_1, m_2, · · · , m_t)^T, and W is a t×t matrix with entries in G. The condition t = dim L_G(z_1, z_2, · · · , z_t) implies that the rank of W is t, and therefore that W is nonsingular. Define the system S̄ of z'_k = Σ_{i=1}^{r} Σ_{j=1}^{s} b_{ijk} x_i y_j, k = 1, 2, · · · , t, by z' = W^{−1}z, where z' is the column vector z' = (z'_1, z'_2, · · · , z'_t)^T. The system S̄ is equivalent to S, and z' = W^{−1}z = m shows that z'_k = m_k, k = 1, 2, · · · , t, which means that μ(z'_k) = 1 for k = 1, 2, · · · , t. This proves the lemma.

Proof of Theorem 8. Let S̄ be the system of bilinear forms satisfying the lemma; then S̄ ⊕ S' is equivalent to S ⊕ S'. By Theorem 2 of IIIa there exists a minimal
algorithm A = (h_1, h_2, · · · , h_n) for computing S̄ ⊕ S', such that if h(1), h(2), · · · , h(l) are its m/d steps then h(i) = z'_i, i = 1, 2, · · · , t. We will now modify the algorithm A to A' by substituting 0 for each occurrence of an x_i in A (i = 1, 2, · · · , r). The algorithm A' computes S' and has at most l − t m/d steps. Therefore l − t ≥ μ(S') (and if A is a bilinear algorithm, l − t ≥ μ̄(S')). That is,

  μ(S ⊕ S') = μ(S̄ ⊕ S') = μ(A) = l ≥ t + μ(S') = μ(S) + μ(S').

In case A is bilinear we obtain μ̄(S ⊕ S') ≥ μ̄(S) + μ̄(S'). But μ(S ⊕ S') ≤ μ(S) + μ(S') and μ̄(S ⊕ S') ≤ μ̄(S) + μ̄(S'). This proves the first half of the theorem. To prove the second half, let S̄ be a transpose of S such that S̄ is minimal, and let S̄' be the corresponding transpose of S'. We thus obtain that S̄ ⊕ S̄' is a transpose of S ⊕ S' and therefore μ̄(S ⊕ S') = μ̄(S̄ ⊕ S̄') = μ̄(S̄) + μ̄(S̄') = μ̄(S) + μ̄(S'). This proves the second half of the theorem.
CHAPTER IV
Product of Polynomials

Let x = (x_0, x_1, · · · , x_m) and y = (y_0, y_1, · · · , y_n) be two vectors. The convolution of x and y, written as z = x * y, is an (m + n + 1)-vector z = (z_0, z_1, · · · , z_{m+n}) given by z_i = Σ_{j=0}^{i} x_j y_{i−j}, i = 0, 1, · · · , m + n, where by convention x_j = 0 for j > m and y_j = 0 for j > n. The operation of convolution is a generalization of the computation of IIa. In fact, the basic algorithm of IIa is for computing the convolution of the two 2-vectors x = (x_0, x_1) and y = (y_0, y_1). As we shall see, algorithms for computing the convolution of two vectors will be the foundation of the algorithms we will derive in subsequent sections. We will therefore devote this section to exploring efficient algorithms for computing convolutions.

IVa. Minimal algorithms. It is easily established that no linear combination of the m + n + 1 quantities z_i = Σ_{j=0}^{i} x_j y_{i−j}, i = 0, 1, · · · , m + n, with coefficients in a field G yields an element of L_G(1, x_0, · · · , x_m, y_0, · · · , y_n). That means that the z_i's are linearly independent. (Recall that we use B = G ∪ {x_0, · · · , x_m} ∪ {y_0, · · · , y_n}, and that the linear independence is that of the z_i's as elements of H' = H/L_G(B).) It follows from Theorem 1 of § IIIa that μ(z_0, · · · , z_{m+n}) ≥ m + n + 1. We will exhibit an algorithm for computing the z_i's using m + n + 1 m/d steps in the case that G has at least m + n + 1 elements, thus showing that μ(z_0, · · · , z_{m+n}) = m + n + 1. (In case G is a finite field with fewer than m + n elements, μ(z_0, · · · , z_{m+n}) is indeed greater than m + n + 1.) We will assume in the rest of these notes that |G| is infinite, and even that G is of characteristic 0. To explain the algorithm it is useful to view the problem of computing the convolution z = x * y as a problem of multiplying two polynomials. Let R(u) = Σ_{i=0}^{m} x_i u^i and S(u) = Σ_{j=0}^{n} y_j u^j be two polynomials in u with the coefficients x_i's and y_j's respectively. The polynomial Q(u) = R(u) · S(u) is Q(u) = Σ_{k=0}^{m+n} z_k u^k, where the coefficients of Q(u) are the entries of z = x * y. Thus the problem of computing the convolution can be viewed as that of computing the coefficients of the (m+n)th degree polynomial Q(u) from the coefficients of the mth degree polynomial R(u) and the nth degree polynomial S(u). Let a_0, a_1, · · · , a_{m+n} be m + n + 1 distinct elements of G. By the Lagrange interpolation formula we have

  Q(u) = Σ_{i=0}^{m+n} Q(a_i) Q_i(u),
where $Q_i(u)$ is the polynomial in $u$
It is clear from the definition of $Q_i(u)$ that its coefficients $q_{i,k}$ are elements of $G$, which are determined once $a_0, a_1, \cdots, a_{m+n}$ are chosen. We thus obtain that $z_k = \sum_{i=0}^{m+n} Q(a_i)\, q_{i,k}$; that is, once we compute $Q(a_i)$, $i = 0, 1, \cdots, m+n$, we can compute the $z_k$'s with no additional m/d step. Since $Q(u) = R(u) \cdot S(u)$, we have $Q(a_i) = R(a_i)S(a_i)$, where $R(a_i) = \sum_{j=0}^{m} x_j a_i^j$ and $S(a_i) = \sum_{j=0}^{n} y_j a_i^j$, $i = 0, 1, \cdots, m+n$. That means that we can compute $R(a_i)$ and $S(a_i)$ without any m/d step. For each $i = 0, 1, \cdots, m+n$ we can, therefore, compute $Q(a_i) = R(a_i) \cdot S(a_i)$ using only one m/d step; and thus we can compute all the $Q(a_i)$'s using $m+n+1$ m/d steps. That means that we can compute all the $z_k$'s, $k = 0, 1, \cdots, m+n$, with an algorithm which has only $m+n+1$ m/d steps. We have thus shown that if $z = x * y$ then $\mu(z_0, z_1, \cdots, z_{m+n}) = m+n+1$.

This method of constructing the algorithm is due to Toom. The algorithm of § IIa cannot be obtained this way, while the algorithm of § IIIa can be obtained this way by choosing $a_0 = 0$, $a_1 = 1$, $a_2 = -1$. The algorithms constructed in this way pose certain problems when implemented in practice. To illustrate these problems we will look at an algorithm for convolving two three-dimensional vectors using 5 m/d steps, that is, an algorithm for computing $z_0, z_1, z_2, z_3, z_4$ where $\sum_{i=0}^{4} z_i u^i = (\sum_{i=0}^{2} x_i u^i)(\sum_{i=0}^{2} y_i u^i)$. We will assume that $G$ is the field of rational numbers. If we choose $a_0 = 0$, $a_1 = 1$, $a_2 = -1$, $a_3 = 2$, $a_4 = -2$, we obtain:
and
and finally:
The first problem this algorithm raises is a consequence of our assumption that multiplication by an element of $G$ (in our case by a fixed rational number) is not counted as an m/d step. In practice, we have to justify counting, say, $(x_0 + 2x_1 + 4x_2)(y_0 + 2y_1 + 4y_2)/24$ as only one multiplication and not two. In many applications it really should be counted as two multiplications, and our definition does not conform to the requirements of these applications. As we shall see in the next sections there are applications in which it is natural to assume that $y_0$, $y_1$, and $y_2$ are held fixed while the $x_i$'s vary. That is, there are applications in which we have to compute $z = x * y$ for many choices of $x$ but only one choice of $y$. In these cases we can compute $Y = (y_0 + 2y_1 + 4y_2)/24$ once and for all, and for every occurrence of $x_0$, $x_1$, and $x_2$ we compute $(x_0 + 2x_1 + 4x_2) \cdot Y$, which indeed requires only one product.

The second problem has to do with the number of additions used by the algorithm. It is possible to try to reduce the number of additions by choosing a different set of $a_0, a_1, a_2, a_3$, and $a_4$. Only a small improvement in the number of additions can be obtained in our present case by a different choice of $a_i$'s. Yet the problem of the number of additions is not always acute. In the examples of IIa and IIc we constructed algorithms for large problems (product of large matrices, product of large numbers) by iterating the basic algorithm. We noticed in these cases that the number which governed the rate of growth of the total number of operations was the number of m/d steps of the basic algorithm. This phenomenon is quite general. The rate of growth of the number of operations is governed by the
homogeneous part of a difference equation, and the coefficients of the homogeneous part depend only on the number of m/d steps. Thus this algorithm for computing the coefficients of the product of two quadratic polynomials using 5 m/d steps implies that it is possible to multiply two large numbers using at most $Cn^{\log_3 5}$ operations for some constant $C$.

The third problem is the numerical accuracy of the algorithm. Consider the way the algorithm computes $z_4 = x_2 y_2$. In order to compute $x_2 y_2$ the algorithm computes $4x_2$ (twice) and then multiplies this $4x_2$ by $4y_2$. This does not cause severe problems of accuracy as yet, but that is because we have chosen a small problem. Consider the problem of the product of two 20th degree polynomials. This problem necessitates the choice of 41 constants. If one of those constants is as small an integer as 4, the step of the algorithm which computes $R(a_i)$ will necessitate computing $4^{20}x_{20}$. This number is too large for any computer with fixed point arithmetic, and may cause severe roundoff problems in most floating point ones.

Algorithms for convolution will play a central role in the construction of algorithms for other applications. It is important, therefore, to obtain algorithms in which the constants are small integers. In the next subsection we will show that every algorithm for computing the convolution of two vectors of even moderate size using the minimum number of m/d steps necessarily needs large constants. In the following subsection we will consider some heuristic methods for obtaining algorithms which do not use the minimum number of m/d steps, but have only small constants.

Let $z = x * y$. We saw that $\mu(z_0, \cdots, z_{m+n}) = m+n+1$; that is, the multiplicative complexity of $z_0, z_1, \cdots, z_{m+n}$ is the same as $\dim L_G(z_0, \cdots, z_{m+n})$. In other words, $z_0, z_1, \cdots, z_{m+n}$ is a minimal system of bilinear forms. As we saw in Theorem 5 of IIIc, every minimal algorithm for computing $z_0, z_1, \cdots, z_{m+n}$ is bilinear. Our aim now is to classify all minimal algorithms for computing $z_0, z_1, \cdots, z_{m+n}$. To do that we need another result which is a slight strengthening of the lemma of IIIc. The proof of this theorem is almost identical with the proof of the lemma, and we will therefore state the theorem without a proof.

THEOREM 1. Let $M(x)y$ be a system of bilinear forms. This system is of minimal complexity if and only if the following two conditions hold:
1. No linear combination of rows of $M(x)$ (with coefficients in $G$) yields a row all of whose entries are 0.
2. There exists a nonsingular matrix $C^{-1}$ (with entries in $G$) such that for every row $c$ of $C^{-1}$, $\mu(cM(x)y) = 1$.
For every such $C^{-1}$ there exists an algorithm $A = A(C^{-1})$ computing $M(x)y$, satisfying $\mu(A) = \mu(M(x)y)$, such that its $k$th m/d step $m_k$ is given by $m_k = (\sum_i \alpha_{ik} x_i)(\sum_j \beta_{jk} y_j)$, where the $\alpha_{ik}$'s and $\beta_{jk}$'s are in $G$. Moreover, for every algorithm $A$ computing $M(x)y$ satisfying $\mu(A) = \mu(M(x)y)$ there exists a matrix $C^{-1}$ such that $A = A(C^{-1})$.

IVb. Classification of the algorithms. We will use Theorem 1 of the last subsection to exhibit all the algorithms for computing $z = x * y$ using $m+n+1$ m/d steps. We first write this system of bilinear forms as $M(x)y$ where $M(x)$ is
the $(m+n+1) \times (n+1)$ matrix whose $(i, j)$th entry is $x_{i-j}$ whenever $0 \le i - j \le m$, and 0 otherwise. Let $A$ be any algorithm computing $x * y$ using $m+n+1$ m/d steps, and let $C^{-1}$ be the matrix such that $A = A(C^{-1})$. Let $c = (c_0, c_1, \cdots, c_{m+n})$ be a row of $C^{-1}$; then $cM(x)$ is given by $cM(x) = (l_0(x), l_1(x), \cdots, l_n(x))$, where $l_j(x) = \sum_{i=0}^{m} c_{i+j} x_i$, $j = 0, 1, \cdots, n$. The requirement that $\mu(\sum_j l_j(x)y_j) = 1$ is equivalent to saying that the matrix $\gamma$ is of rank 1, where $\gamma$ is the $(m+1) \times (n+1)$ matrix given by $\gamma_{i,j} = c_{i+j}$, $0 \le i \le m$, $0 \le j \le n$. The next step in classifying the algorithms is determining the condition for $\gamma$ to be of rank 1. We have to consider two cases, depending on whether the first row of $\gamma$ (which is $c_0, c_1, \cdots, c_n$) is identically 0 or not.

Case 1. $c_0 = c_1 = \cdots = c_n = 0$.
Let $p$ be the smallest integer such that $c_p \ne 0$ (by assumption $p > n$). Let $p = m+n+1-q$ ($q \ge 1$); then the $q \times q$ submatrix consisting of the intersection of the last $q$ rows and $q$ columns of $\gamma$ has $c_p$ on its main antidiagonal, and 0 everywhere above it, so it is nonsingular, and therefore the rank of $\gamma$ is at least $q$. Since the rank of $\gamma$ is 1 we must have $q = 1$, and $c_0 = c_1 = \cdots = c_{m+n-1} = 0$, $c_{m+n} \ne 0$. (Since $C^{-1}$ is nonsingular, none of its rows can be identically 0.)

Case 2. $(c_0, \cdots, c_n)$ is not identically 0. Since the rank of $\gamma$ is 1, its second row $(c_1, c_2, \cdots, c_{n+1})$ must be a multiple of the first. That means that there exists a $g \in G$ such that $c_1 = gc_0$, $c_2 = gc_1 = g^2 c_0$, $\cdots$, $c_{n+1} = gc_n = g^{n+1} c_0$. (By the assumption of Case 2 we must also have $c_0 \ne 0$.) The third row of $\gamma$, namely $(c_2, c_3, \cdots, c_{n+2})$, must also be a multiple of the first, and since we already know that $c_2 = g^2 c_0$, the third row is $g^2$ times the first. It follows therefore that $c_{n+2} = g^2 c_n = g^{n+2} c_0$. Continuing the argument we see that $c_i = g^i c_0$ for all $i = 0, 1, \cdots, m+n$. So in this case we have $(c_0, c_1, \cdots, c_{m+n}) = c_0(1, g, g^2, \cdots, g^{m+n})$, $c_0 \ne 0$.

The matrix $C^{-1}$ is nonsingular, so at most one of its rows can be of the first case. Also, for every pair of rows which are of the second case, their $g$'s must be different. If we denote the $k$th m/d step of the algorithm by $m_k$, then $cM(x)y = m_k$, where $c$ is the $k$th row of $C^{-1}$. Therefore, if the $k$th row of $C^{-1}$ is of the first case then $m_k$, the $k$th m/d step of $A$, is $m_k = c_{m+n} x_m y_n$, and if it is of the second case then $m_k = c_0(\sum_{i=0}^{m} x_i g^i)(\sum_{i=0}^{n} y_i g^i)$.

We will end the analysis of the class of algorithms for computing convolutions by observing that whenever an algorithm is given by $M(x)y = Cm$, we can obtain other algorithms from it by choosing an arbitrary permutation matrix $\Pi$ and an arbitrary nonsingular diagonal matrix $D$, and defining the new algorithm by $M(x)y = (CD^{-1}\Pi^{-1})(\Pi D m)$. From now on we will identify two algorithms if they can be obtained one from the other by a choice of such a $\Pi$ and $D$. We will summarize the previous analysis by the following theorem:

THEOREM 2. Let $A$ be an algorithm for computing $z = x * y$ satisfying $\mu(A) = m+n+1$. Then the m/d steps of $A$ are given by either
1. $m_k = c_k(\sum_{i=0}^{m} x_i g_k^i)(\sum_{i=0}^{n} y_i g_k^i)$, $k = 0, 1, \cdots, m+n$, such that if $k \ne l$, $g_k \ne g_l$; or
2. $m_k = c_k(\sum_{i=0}^{m} x_i g_k^i)(\sum_{i=0}^{n} y_i g_k^i)$, $k = 0, 1, \cdots, m+n-1$, such that if $k \ne l$, $g_k \ne g_l$, and $m_{m+n} = c\, x_m y_n$.
It is clear that the constants $c_k$ (and $c$) play no essential role, and we will assume they are 1.
The algorithms of the first class of Theorem 2 are exactly the algorithms described in IVa. Thus Theorem 2 states that every minimal algorithm for computing $z = x * y$ is either obtained as described in IVa or by a small variation of it. It is easily verified that the algorithm of IIa is of the second kind. If we examine Theorem 2 we see that every algorithm of Case 1 calls for $m+n+1$ distinct elements of $G$, and every algorithm of Case 2 calls for $m+n$ distinct elements of $G$. But what if $G$ has fewer than $m+n$ elements? In this case we cannot construct either one of the two types of algorithms. But Theorem 2 asserts that every algorithm $A$ for computing $x * y$ which has $m+n+1$ m/d steps must be one of the two types. Therefore, an immediate consequence of the theorem is the following corollary:

COROLLARY. If $|G| < m+n$, then $\mu(x * y; G) > m+n+1$.

We will now give a somewhat different interpretation of these two classes of algorithms. This interpretation will help us in deriving heuristic algorithms for performing convolution, as well as algorithms for some other computations to be discussed later. The main tool we will require is the Chinese remainder theorem. What we will need is the computational aspect of the theorem. We will therefore state the theorem in the form we will need it, without a proof.

Let $G$ be a field, and let $p(u) \in G[u]$ be a polynomial with coefficients in $G$. If $p(u) = p_1(u) \cdot p_2(u)$ and $(p_1(u), p_2(u)) = 1$ (i.e., $p_1(u)$ and $p_2(u)$ are relatively prime) then by the Euclidean algorithm we can find two other polynomials $q_1(u), q_2(u) \in G[u]$ such that $p_1(u)q_1(u) + p_2(u)q_2(u) = 1$. Let $R$ be a commutative ring without zero divisors which includes $G$. We will denote by $R[u]$ the ring of polynomials with coefficients in $R$, and by $R[u]/(p(u))$ the ring of polynomials with coefficients in $R$ with addition and multiplication defined modulo $p(u)$.

CHINESE REMAINDER THEOREM. Let $G$ be a field, $R$ a commutative ring without zero divisors which includes $G$; let $p(u)$, $p_1(u)$, $p_2(u) \in G[u]$ be polynomials such that $(p_1(u), p_2(u)) = 1$ and $p(u) = p_1(u) \cdot p_2(u)$. Then $R[u]/(p(u)) \cong R[u]/(p_1(u)) \times R[u]/(p_2(u))$. The isomorphism $s$ is given by the following: for every $r(u) \in R[u]/(p(u))$, $s(r(u)) = (s_1(r(u)), s_2(r(u))) = (r(u) \bmod p_1(u),\ r(u) \bmod p_2(u))$, and for every $(r_1(u), r_2(u)) \in R[u]/(p_1(u)) \times R[u]/(p_2(u))$, $s^{-1}((r_1(u), r_2(u))) = (r_1(u)\,p_2(u)\,q_2(u) + r_2(u)\,p_1(u)\,q_1(u)) \bmod p(u)$, where $q_1(u)$ and $q_2(u)$ satisfy $p_1(u)q_1(u) + p_2(u)q_2(u) = 1$.

Let us see how we can use the Chinese remainder theorem to construct algorithms. Assume we have an algorithm $A_1$ which computes the coefficients of $(\sum_{i=0}^{n_1-1} x_i u^i)(\sum_{i=0}^{n_1-1} y_i u^i) \bmod p_1(u)$ (where $n_1 = \deg p_1(u)$), as well as an algorithm $A_2$ which computes the coefficients of $(\sum_{i=0}^{n_2-1} x_i u^i)(\sum_{i=0}^{n_2-1} y_i u^i) \bmod p_2(u)$ (where $n_2 = \deg p_2(u)$). Assume also that $(p_1(u), p_2(u)) = 1$. We will now construct an algorithm $A$ which computes the coefficients of $(\sum_{i=0}^{n-1} x_i u^i)(\sum_{i=0}^{n-1} y_i u^i) \bmod p(u)$, where $p(u) = p_1(u) \cdot p_2(u)$ and $n = \deg p(u) = n_1 + n_2$. The algorithm $A$ has three parts. The first part consists of computing the coefficients of $r_1(u) = \sum_{i=0}^{n-1} x_i u^i \bmod p_1(u)$,
$r_2(u) = \sum_{i=0}^{n-1} x_i u^i \bmod p_2(u)$, $s_1(u) = \sum_{i=0}^{n-1} y_i u^i \bmod p_1(u)$, and $s_2(u) = \sum_{i=0}^{n-1} y_i u^i \bmod p_2(u)$. Since the coefficients of $p_1(u)$ and $p_2(u)$ are in $G$, this part of $A$ uses no m/d steps. The second part of $A$ consists of using the algorithm $A_1$ to compute the coefficients of $t_1(u) = r_1(u) \cdot s_1(u) \bmod p_1(u)$ and the algorithm $A_2$ to compute the coefficients of $t_2(u) = r_2(u) \cdot s_2(u) \bmod p_2(u)$. The number of m/d steps in the second part of $A$ is $\mu(A_1) + \mu(A_2)$. The third part of $A$ consists of computing the coefficients of $t_1(u)p_2(u)q_2(u) + t_2(u)p_1(u)q_1(u) \bmod p(u)$. Since $p_2(u) \cdot q_2(u)$ and $p_1(u) \cdot q_1(u)$ have elements of $G$ as their coefficients, this part of $A$ uses no m/d steps. By the Chinese remainder theorem $(\sum_{i=0}^{n-1} x_i u^i)(\sum_{i=0}^{n-1} y_i u^i) \bmod p(u) = (t_1(u)p_2(u)q_2(u) + t_2(u)p_1(u)q_1(u)) \bmod p(u)$, and therefore $A$ is the desired algorithm. We also have, by construction, $\mu(A) \le \mu(A_1) + \mu(A_2)$.

We will now return to the classification of the algorithms for computing $z = x * y$, that is, computing the coefficients of $Q(u) = (\sum_{i=0}^{m} x_i u^i)(\sum_{i=0}^{n} y_i u^i)$. Let $g_0, g_1, \cdots, g_{m+n}$ be $m+n+1$ distinct elements of $G$, and let $p(u)$ be the polynomial $p(u) = \prod_{i=0}^{m+n} (u - g_i)$. The polynomial $p(u)$ is of degree $m+n+1$, while the polynomial $Q(u)$ is of degree $m+n$. So the three polynomials $\sum_{i=0}^{m} x_i u^i$, $\sum_{i=0}^{n} y_i u^i$, and $Q(u)$ can be viewed as being elements of $R[u]/(p(u))$, where $R = G[x_0, \cdots, x_m, y_0, \cdots, y_n]$. The problem of computing $z = x * y$ can therefore be viewed as the problem of computing the coefficients of the product of two polynomials in $R[u]/(p(u))$. Since $p(u)$ is factorable into the $m+n+1$ factors $(u - g_i)$, $i = 0, \cdots, m+n$, which are pairwise relatively prime, the construction of the preceding paragraph enables us to build up an algorithm for computing $z = x * y$ from algorithms for computing the coefficients of the product of polynomials modulo $u - g_i$. But $(\sum_{i=0}^{m} x_i u^i)(\sum_{i=0}^{n} y_i u^i) \bmod (u - g_i) = (\sum_{j=0}^{m} x_j g_i^j)(\sum_{j=0}^{n} y_j g_i^j)$, so it can be computed using only one m/d step. This algorithm is easily recognizable as that of Case 1 of Theorem 2 of this subsection, which is the same as Toom's construction of § IVa.

We obtain the algorithms of Case 2 of Theorem 2 by starting with the following identity: let $g_1, g_2, \cdots, g_{m+n}$ be $m+n$ distinct elements of $G$; then
Using the Chinese remainder theorem in the way explained earlier we can construct an algorithm for computing the coefficients of $Q'(u)$ using $m+n$ m/d steps. The coefficients of the polynomial $x_m y_n \prod_{i=1}^{m+n} (u - g_i)$ are all multiples of $x_m y_n$ and therefore can be computed using only one m/d step, namely $x_m y_n$. By adding the appropriate coefficients of $Q'(u)$ and $x_m y_n \prod_{i=1}^{m+n} (u - g_i)$ we obtain an
algorithm for computing the coefficients of $Q(u)$ using $m+n+1$ m/d steps. This is the algorithm of Case 2 of Theorem 2. For convenience of notation we will denote the identity underlying the algorithm of Case 2 as $Q(u) = Q(u) \bmod [(u - \infty)\prod_{i=1}^{m+n} (u - g_i)]$. This way we can unify Cases 1 and 2 of Theorem 2 by saying that they stem from the identity $Q(u) = Q(u) \bmod [\prod_{i=0}^{m+n} (u - a_i)]$, where at most one of the $a_i$'s is $\infty$ and all the rest are in $G$.

Example. We will illustrate the construction of the preceding discussion by deriving the algorithm used in § IIa. Let $R(u)$ be the polynomial $R(u) = x_0 + x_1 u$, and let $S(u)$ be the polynomial $S(u) = y_0 + y_1 u$. We wish to compute the coefficients of the polynomial $Q(u) = z_0 + z_1 u + z_2 u^2$ where $Q(u) = R(u) \cdot S(u)$. The polynomial $Q(u)$ is quadratic, so we have to choose three distinct constants in $G \cup \{\infty\}$ (we assume that $G$ is the field of rational numbers). Let these constants be $a_0 = 0$, $a_1 = -1$, $a_2 = \infty$. We thus obtain that
To compute Q'(u) we have to compute
By the Chinese remainder theorem:
And as
we have
Equating coefficients we obtain the algorithm of § IIa.

IVc. Heuristic algorithms. We saw in § IVa that the use of interpolation to obtain an algorithm for computing convolution leads to algorithms with large constants. In § IVb we proved that every minimal algorithm for computing convolution necessarily has large constants even when the size of the vectors to be convolved is moderate. In this subsection we will describe three methods for obtaining algorithms whose number of m/d steps is close enough to the minimum, yet which do not have large constants. We will illustrate these constructions using the computation of convolving two three-dimensional vectors, that is, computing the coefficients of $Q(u) = (x_0 + x_1 u + x_2 u^2)(y_0 + y_1 u + y_2 u^2)$. As we know, every minimal algorithm is based on the identity $Q(u) = Q(u) \bmod [\prod_{i=0}^{4} (u - a_i)]$, where at least four of the $a_i$'s are rational numbers (we take $G$ to be the field of rational numbers) and the fifth may be $\infty$.
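To make the interpolation construction of § IVa concrete (and to exhibit the large rational constants it produces), the following sketch computes the convolution of two 3-vectors by evaluating $R(u)$ and $S(u)$ at the five points $0, 1, -1, 2, -2$ and interpolating. The function name and the use of Python's fractions module are mine, not the book's; only the $m+n+1$ products $R(a_i)S(a_i)$ are m/d steps in the sense of the text.

```python
from fractions import Fraction

def toom_convolution(x, y, points):
    """Convolution of x and y (coefficient vectors of R(u) and S(u))
    by evaluation at m+n+1 distinct points followed by Lagrange
    interpolation.  Only the products R(a_i)*S(a_i) are m/d steps;
    everything else multiplies by the fixed constants a_i."""
    m, n = len(x) - 1, len(y) - 1
    assert len(points) == m + n + 1
    R = [sum(Fraction(xi) * a**i for i, xi in enumerate(x)) for a in points]
    S = [sum(Fraction(yi) * a**i for i, yi in enumerate(y)) for a in points]
    Q = [r * s for r, s in zip(R, S)]            # the m+n+1 m/d steps
    z = [Fraction(0)] * (m + n + 1)
    for i, ai in enumerate(points):
        # coefficients of the Lagrange basis polynomial Q_i(u)
        basis, denom = [Fraction(1)], Fraction(1)
        for j, aj in enumerate(points):
            if j != i:
                shifted = [Fraction(0)] + basis        # u * basis(u)
                basis = [shifted[k] - aj * (basis[k] if k < len(basis) else 0)
                         for k in range(len(shifted))]
                denom *= ai - aj
        for k, qik in enumerate(basis):
            z[k] += Q[i] * qik / denom
    return z

# (1 + 2u + 3u^2)(4 + 5u + 6u^2) = 4 + 13u + 28u^2 + 27u^3 + 18u^4
print(toom_convolution([1, 2, 3], [4, 5, 6], [0, 1, -1, 2, -2]))
```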
The field of rational numbers has three elements whose magnitude does not change when we take their powers; namely, 0, 1, and $-1$. Even if we take $a_0 = \infty$, $a_1 = 0$, $a_2 = 1$, and $a_3 = -1$, any choice of $a_4$ will cause us to have increasingly large coefficients. One way of computing the coefficients of $Q(u)$ is to start with the identity $Q(u) = Q(u) \bmod u(u-1)(u+1)(u^2+1)$. We will apply the Chinese remainder theorem construction using this identity. The construction described in the previous subsection utilizes an algorithm for computing the coefficients of a product of polynomials modulo $u$, modulo $u-1$, modulo $u+1$, and modulo $u^2+1$. The first three algorithms are quite obvious; namely,
The construction of the algorithm for computing
was discussed in the second example of § IIIc. We saw in § IIIc that the algorithm for computing the coefficients of $(r_0 + r_1 u)(s_0 + s_1 u)$ provides us with an algorithm for computing the coefficients of $(r_0 + r_1 u)(s_0 + s_1 u) \bmod (u^2 + 1)$. This construction is quite general. Every algorithm for computing the coefficients of $(\sum_i r_i u^i)(\sum_i s_i u^i)$ can be used to construct an algorithm for computing $(\sum_i r_i u^i)(\sum_i s_i u^i) \bmod p(u)$ for any $p(u) \in G[u]$. We will now resume the development of the algorithm for computing the coefficients of $Q(u)$. In the second example of § IIIc we developed the following identities: substituting $x_0 - x_2$ for $r_0$, $x_1$ for $r_1$, $y_0 - y_2$ for $s_0$, and $y_1$ for $s_1$, we see that the coefficients of $((x_0 - x_2) + x_1 u)((y_0 - y_2) + y_1 u) \bmod (u^2 + 1) = t_0 + t_1 u$ can be computed using the identities
We are ready to put all the pieces together. Using the Euclidean algorithm to obtain the $q_1$ and $q_2$ of the Chinese remainder theorem we obtain
where
One of the reasons for the choice of $u^2+1$ is that both its roots, $i$ and $-i$, lie on the unit circle and therefore their magnitude does not change as we raise them to higher powers. This ensures that when we compute $t_0$ and $t_1$ of $t_0 + t_1 u = (\sum_{i=0}^{2} x_i u^i)(\sum_{i=0}^{2} y_i u^i) \bmod (u^2+1)$, the coefficients of the different $x_i$'s in $t_0$ and $t_1$ will remain bounded. This property is enjoyed by all cyclotomic polynomials. We could have chosen $u^2+u+1$ or $u^2-u+1$ instead of $u^2+1$ and have obtained different algorithms. In addition to cyclotomic polynomials we can use the polynomial $u^r$ ($r > 1$) instead of just using $u$. Similarly we can use $(u - \infty)^r$ instead of just $(u - \infty)$. A point of explanation as to the meaning of $(u - \infty)^r$ may be in order. We used the term multiplication modulo $u - \infty$ to indicate the construction of Case 2 of Theorem 2 of IVb; that is, using the identity $Q(u) = Q(u) \bmod p(u) + x_m y_n p(u)$, where $\deg Q(u) = \deg p(u)$. That is, we computed $Q(u)$ modulo $p(u)$ and thereby modified some of the coefficients by multiples of $x_m y_n$, the highest coefficient of $Q(u)$, and then "corrected" this modification. We can extend this construction by the use of the identity $Q(u) = Q(u) \bmod p(u) + (x_{m-1} y_n + x_m y_{n-1})p(u) + x_m y_n p^*(u)$, where $\deg Q(u) = 1 + \deg p(u)$. In that case we have to "correct" the modifications caused by the two highest coefficients of $Q(u)$. By analogy we refer to this construction as multiplication modulo $(u - \infty)^2$. It should be emphasized that the construction given above does not exhaust all possible ways of obtaining sub-minimal algorithms. Consider the following identities:
These identities enable us to compute the coefficients of $Q(u) = (x_0 + x_1 u + x_2 u^2)(y_0 + y_1 u + y_2 u^2)$ using only 6 m/d steps. Yet these identities cannot be obtained by the use of the Chinese remainder theorem.

The second heuristic method was already sketched in IIb; that is, we construct algorithms for computing the coefficients of the product of larger polynomials by iterating the algorithm for computing the coefficients of the product of smaller ones. We will illustrate this construction by describing an algorithm for computing the coefficients of $Q(u) = (x_0 + x_1 u + x_2 u^2 + x_3 u^3)(y_0 + y_1 u + y_2 u^2 + y_3 u^3)$. We know that the minimum number of m/d steps is 7; but every such algorithm will have coefficients other than 0, 1, and $-1$. We will describe another algorithm using 9 m/d steps which has all its coefficients 0, 1, and $-1$. We will start from an algorithm similar to that of IIa, that is, the algorithm which is based on the identities
and then iterate it. We can write
where $X_0 = x_0 + x_1 u$, $X_1 = x_2 + x_3 u$, $Y_0 = y_0 + y_1 u$, and $Y_1 = y_2 + y_3 u$. The identities written above mean that
and
Each of the three products $X_0 Y_0 = (x_0 + x_1 u)(y_0 + y_1 u)$, $X_1 Y_1 = (x_2 + x_3 u)(y_2 + y_3 u)$, and $(X_0 + X_1)(Y_0 + Y_1) = ((x_0 + x_2) + (x_1 + x_3)u)((y_0 + y_2) + (y_1 + y_3)u)$ can also be performed using three m/d steps. More precisely, we have: where
, and
where
and
where $m_7 = x_2 y_2$, $m_8 = (x_2 + x_3)(y_2 + y_3)$, and $m_9 = x_3 y_3$. Substituting these expressions we obtain:
This identity enables us to compute the coefficients of $Q(u)$ using 9 m/d steps. Clearly we can iterate the algorithm once more and obtain an algorithm for computing the coefficients of the product of two 7th degree polynomials using 27 m/d steps. In general, if $n = 2^k$ we can compute the coefficients of the product of two $(n-1)$st degree polynomials using $3^k = n^{\log_2 3}$ m/d steps. Note that if we iterated the algorithm described earlier for computing the coefficients of the product of two quadratic polynomials using 6 m/d steps, we would have obtained algorithms for computing the coefficients of the product of two $(n-1)$st degree polynomials in $6^k = n^{\log_3 6}$ m/d steps for $n = 3^k$. Since $\log_3 6 > \log_2 3$, the algorithm of § IIa is better for iteration.

The third heuristic method combines some of the advantages of the first two. The idea behind this method is the use of fields of constants which are larger than the field of rational numbers. We will illustrate this method by taking for $G$ the field of rational numbers extended by $\alpha$, that is, the field whose elements are of the form $a + b\alpha$ with $a$, $b$ rational, where $\alpha^2 = -1$. (We use $\alpha$ instead of the customary notation $i$ in order to emphasize the generality of the idea. What is important is that we extend the field of rationals by a root of an irreducible polynomial, in our example $u^2+1$. We could have just as well taken another polynomial, say $u^2+u+1$.) We will illustrate the method by developing an algorithm for computing the coefficients of $Q(u) = (x_0 + x_1 u + x_2 u^2)(y_0 + y_1 u + y_2 u^2)$. We start with the identity
Using the Chinese remainder theorem we obtain the following identities:
where $m_1 = x_0 y_0$, $m_2 = (x_0 + x_1 + x_2)((y_0 + y_1 + y_2)/4)$, $m_3 = (x_0 - x_1 + x_2)((y_0 - y_1 + y_2)/4)$, $m_4 = ((x_0 - x_2) + \alpha x_1)(((y_0 - y_2) + \alpha y_1)/4)$, and $m_5 = ((x_0 - x_2) - \alpha x_1)(((y_0 - y_2) - \alpha y_1)/4)$. That means we have an algorithm using only 5 m/d steps. We have to be careful in our interpretation of an m/d step. An m/d step does not represent a multiplication of real numbers any more; an m/d step now represents a multiplication of two linear polynomials modulo $u^2+1$. (Again we prefer this terminology to the more customary one of saying that an m/d step represents the product of two complex numbers in order to emphasize the generality of the idea. If $\alpha$ had stood for a root of $u^2+u+1$ then an m/d step would have represented the product of two polynomials modulo $u^2+u+1$.) Thus an m/d step really stands for three (real) multiplications, and this algorithm, instead of being an improvement, requires 9 multiplications. The power of this algorithm becomes apparent when we iterate it.

Before examining the results of iterating the algorithm we will reinterpret the problem. Instead of viewing the coefficients of the polynomials as single entities we will consider them as pairs, or, what is equivalent, as linear polynomials in $\alpha$. Thus
Addition of two such pairs is the addition of two linear polynomials, and multiplication of two such pairs is the product of two polynomials modulo $u^2+1$. If we have a pair $(r_1, r_2)$ then $\alpha(r_1, r_2) = (-r_2, r_1)$, since $\alpha(r_1 + \alpha r_2) = -r_2 + \alpha r_1$. We see that with respect to these pairs the assumption that multiplication by any element of the field of constants is not counted remains valid. (The usefulness of this point of view becomes even more apparent if we take $\alpha$ to be a root of $u^2+u+1$, i.e., $\alpha = (-1+\sqrt{-3})/2$. In this case $\alpha(r_0, r_1) = (-r_1, r_0 - r_1)$, since $\alpha(r_0 + r_1\alpha) = \alpha r_0 + \alpha^2 r_1 = \alpha r_0 - (1+\alpha)r_1 = -r_1 + (r_0 - r_1)\alpha$. Thus we see that we can give an interpretation to the seemingly nonsensical statement that "multiplication by $(-1+\sqrt{-3})/2$ can be effected by one subtraction.") If we now iterate the algorithm we obtain an algorithm for computing the coefficients of the product of two $(n-1)$st degree polynomials (for $n = 3^k$) using $5^k = n^{\log_3 5}$ m/d steps. Since each m/d step (i.e., the product of two pairs) can be done in 3 (real) multiplications, the algorithm uses $3n^{\log_3 5}$ multiplications. It is easy to verify that $\log_3 5 < \log_2 3$, so for large enough $n$ this new algorithm is superior to the one described earlier.
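The claim that a product of two such pairs costs three real multiplications can be checked with a small sketch. The scheme below is one standard three-multiplication realization (a Karatsuba-style product of two linear polynomials reduced modulo $u^2+1$); the book's own scheme from the second example of § IIIc is not reproduced here and may differ in detail.

```python
def pair_product(r, s):
    """(r0 + r1*a)(s0 + s1*a) with a*a = -1, i.e. a product of two
    linear polynomials modulo u^2 + 1, using 3 real multiplications."""
    r0, r1 = r
    s0, s1 = s
    p1, p2, p3 = r0 * s0, r1 * s1, (r0 + r1) * (s0 + s1)
    return (p1 - p2, p3 - p1 - p2)     # (constant term, coefficient of a)

# check against the defining formula (r0*s0 - r1*s1, r0*s1 + r1*s0)
assert pair_product((2, 3), (5, 7)) == (2 * 5 - 3 * 7, 2 * 7 + 3 * 5)
```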
The number of multiplications can be further halved if we have to compute both $z_1 = x_1 * y$ and $z_2 = x_2 * y$. In this case we set up the computation as that of $z = (x_1 + \alpha x_2) * y = x_1 * y + \alpha\, x_2 * y$. That is, the $\alpha$-free parts of the results yield $x_1 * y$ while the $\alpha$ terms of the results yield $x_2 * y$. We will see in the next section that such a situation does arise in certain applications.
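The packing described in this paragraph is easy to verify numerically. In the sketch below (a naive convolution routine and sample data of my own choosing) the two sequences $x_1$ and $x_2$ are packed into one complex sequence, one convolution with $y$ is performed, and the two real convolutions are read off the real and imaginary parts.

```python
def convolve(x, y):
    """Naive convolution: z_k = sum over i+j = k of x_i * y_j."""
    z = [0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            z[i + j] += xi * yj
    return z

x1, x2, y = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]

packed = convolve([a + 1j * b for a, b in zip(x1, x2)], y)   # (x1 + a*x2) * y

assert [c.real for c in packed] == convolve(x1, y)   # the a-free parts
assert [c.imag for c in packed] == convolve(x2, y)   # the coefficients of a
```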
CHAPTER V
FIR Filters

One of the most frequent computations in signal processing is that of a finite impulse response (FIR) filter. Let $x_0, x_1, x_2, \cdots$ be the sequence obtained by sampling the signal at times $0, \Delta, 2\Delta, 3\Delta, \cdots$. Filtering the signal is the computation of $z_0, z_1, \cdots$, where $z_i = \sum_{j=0}^{n-1} x_{i+j} h_j$, $i = 0, 1, 2, \cdots$, and where $h_0, h_1, \cdots, h_{n-1}$ are known as the weights, or taps, of the filter. Such a filter is known as an $n$-tap filter.

Va. Filters and polynomials. The computation of the first $m$ outputs $z_0, z_1, \cdots, z_{m-1}$ of an $n$-tap filter can be written as
That is, it can be written as $Xh$ where $X$ is an $m \times n$ matrix such that $X_{i,j} = x_{i+j}$. One of the transposes of this system of bilinear forms (in the $x_i$'s and $h_j$'s) is obtained by multiplying the $k$th bilinear form by $z_k$ ($k = 0, 1, \cdots, m-1$) on the left, and then considering the bilinear forms (in the $z_k$'s and $h_j$'s) which are the coefficients of the $x_i$'s. More precisely, we first construct the trilinear form
(where it is understood that $z_k = 0$ for $k > m-1$ and $h_j = 0$ for $j > n-1$). The coefficients of the $x_i$'s are the elements of the convolution $z * h$. As a matter of fact we already saw, in the first example of § IIIc, the connection between the convolution of two 2-dimensional vectors and the computation of two outputs of a 2-tap filter. This connection between convolution and filtering means that we can use the algorithms developed in the previous section for computing convolutions to obtain algorithms for computing filtering. If we denote by $F(m, n)$ the computation of $m$ outputs of an $n$-tap filter, then we have just shown that $\bar\mu(F(m, n)) = m+n-1$. Before examining the algorithms for filtering which we obtain this way we will prove:

THEOREM 1. $\mu(F(m, n)) = m+n-1$.

Proof. It is convenient to write $F(m, n)$ in a slightly different form, i.e., $z_k = \sum_{i=0}^{m+n-2} h_{i-k} x_i$, where it is understood that $h_j = 0$ unless $0 \le j \le n-1$. That is,
$F(m, n)$ can be written as $Hx$ where $H$ is an $m \times (m+n-1)$ matrix such that $H_{i,j} = h_{j-i}$, and $x$ is the (column) vector $x = (x_0, x_1, \cdots, x_{m+n-2})^T$. It is easily verified that $\rho_c(H) = m+n-1$, and by Theorem 1 of IIIb we obtain $\mu(F(m, n)) \ge m+n-1$. But we have already seen that $\bar\mu(F(m, n)) = m+n-1$, so $\mu(F(m, n)) = m+n-1$.

In Example 1 of § IIIc we derived the following identities for computing $F(2, 2)$
and where The algorithm based on these identities uses three m/d steps. To understand the number of additions it uses we will first see how it can be used to compute the outputs of a 2-tap filter. Let $z_k = x_k h_0 + x_{k+1} h_1$, $k = 0, 1, \cdots$, be the outputs of a 2-tap filter. Using the algorithm just described we can first compute $z_0$ and $z_1$, then compute $z_2$ and $z_3$, followed by $z_4$ and $z_5$, and so on. So to compute $2s$ outputs we use $3s$ multiplications, that is, 3/2 multiplications per output. What about the number of additions? The algorithm calls for computing $h_0 + h_1$, $x_0 - x_1$ and $x_2 - x_1$, $m_1 + m_2$ and $m_2 - m_3$. It seems as if the algorithm calls for five additions. Closer inspection shows that only four additions are required. The quantity $h_0 + h_1$ is needed for computing $z_0$ and $z_1$, but also for computing $z_2$ and $z_3$, $z_4$ and $z_5$, and so on. We can compute this quantity only once and use it whenever needed. As a matter of fact, $h_0 + h_1$ can be computed when the program is written, and does not have to be computed when it is executed. We see that this algorithm enables us to compute $2s$ outputs of a 2-tap filter using $4s$ additions, i.e., two additions per output. We will therefore say that this algorithm uses (1 1/2 M, 2A) per output.

A 2-tap filter is not very common in practice. We can obtain an algorithm for $n$-tap filters with larger $n$ by iterating the algorithm for $F(2, 2)$. Of course, we could have iterated the algorithm for convolution as explained in IVc, but it is more convenient to count the number of additions if we iterate the algorithm for $F(2, 2)$ directly. We illustrate this iteration by deriving an algorithm for $F(4, 4)$. The problem of computing four outputs of a 4-tap filter can be written as:
If we partition the matrix into 2 x 2 blocks we obtain:
where:
Using the algorithm for F(2, 2) means that we have to compute
and then we have to perform $M_1 + M_2$ and $M_2 - M_3$, taking 4 additions. The computation of $X_0 - X_1$ requires three additions ($x_0 - x_2$, $x_1 - x_3$, and $x_2 - x_4$), and the computation of $X_1 - X_2$ two more additions ($x_3 - x_5$ and $x_4 - x_6$), since $x_2 - x_4$ has already been computed. We can compute $M_1$, $M_2$, and $M_3$ using the algorithm for $F(2, 2)$, so that each of them uses three multiplications and four additions. Altogether we have obtained an algorithm for $F(4, 4)$ which uses nine multiplications and $4 + 5 + 3 \times 4 = 21$ additions. Thus we obtain a (2 1/4 M, 5 1/4 A) algorithm for computing the outputs of a 4-tap filter. (A closer examination reveals that the computation of $x_4 - x_6$ for one set of four outputs is the same as that of $x_0 - x_2$ for the following set of four outputs, and therefore the algorithm really requires only 20 additions, i.e., it is a (2 1/4 M, 5A) algorithm.)

It should be clear that this iteration process is quite general. We can partition the matrix of the computation of $2s$ outputs of a $2s$-tap filter into four $s \times s$ blocks. When we apply the algorithm for $F(2, 2)$ we see that we obtain an algorithm with a number of m/d steps which is three times as many as needed to compute $F(s, s)$. The number of additions to compute $X_0 - X_1$, in this case, is $2s - 1$, and the number of additional additions to compute $X_1 - X_2$ is $s$. The parenthetical comment at the end of the preceding paragraph is valid as well, and we see therefore that the computation can be so arranged that $X_0 - X_1$ and $X_1 - X_2$ can be computed using only $s$ additions each. The computation of $M_1 + M_2$ and $M_2 - M_3$ uses $2s$ additional additions. To summarize, if $F(s, s)$ can be done using $m$ multiplications and $a$ additions, then, using the algorithm for $F(2, 2)$, we can compute $F(2s, 2s)$ using $4s + 3a$ additions and $3m$ multiplications. In this way we can construct a sequence of algorithms to compute $F(16, 16)$.

ALGORITHM 1. The outputs of the filter are computed in the straightforward way. This algorithm uses $16 \times 16 = 256$ multiplications and $16 \times 15 = 240$ additions, i.e., we have a (16M, 15A) algorithm.

ALGORITHM 2. We can use the algorithm for $F(2, 2)$ to obtain an algorithm which computes $F(8, 8)$ three times, and needs $4 \times 8 = 32$ additional additions. If
we compute $F(8, 8)$ in a straightforward way we need $8 \times 8 = 64$ multiplications and $8 \times 7 = 56$ additions. Altogether the algorithm uses $3 \times 64 = 192$ multiplications and $32 + 3 \times 56 = 200$ additions. That is, it is a (12M, 12 1/2 A) algorithm.

ALGORITHM 3. We can modify Algorithm 2 by using the $F(2, 2)$ algorithm to obtain a new algorithm for $F(8, 8)$. This algorithm uses $4 \times 4 = 16$ additions and computes $F(4, 4)$ three times. Altogether, then, the algorithm for $F(16, 16)$ uses $32 + 3 \times 16 = 80$ additions and nine times the computation of $F(4, 4)$. The straightforward way of computing $F(4, 4)$ uses 16 multiplications and 12 additions, so Algorithm 3 uses $80 + 9 \times 12 = 188$ additions and $9 \times 16 = 144$ multiplications. So Algorithm 3 is a (9M, 11 3/4 A) algorithm.

ALGORITHM 4. This is a modification of Algorithm 3 obtained by using for $F(4, 4)$ an algorithm which uses $4 \times 2 = 8$ additions and three times an algorithm for $F(2, 2)$. Thus this algorithm for $F(16, 16)$ uses $80 + 9 \times 8 = 152$ additions and 27 times the computation of $F(2, 2)$. Computing $F(2, 2)$ in a straightforward way uses two additions and four multiplications. So Algorithm 4 uses $152 + 27 \times 2 = 206$ additions and $27 \times 4 = 108$ multiplications; that means it is a (6 3/4 M, 12 7/8 A) algorithm.

ALGORITHM 5. Algorithm 5 is obtained by iterating the $F(2, 2)$ algorithm all the way. In this case we use the (3M, 4A) algorithm to compute $F(2, 2)$. (Alternatively, we can view Algorithm 5 as a modification of Algorithm 3 in which we use the (2 1/4 M, 5A) algorithm to compute $F(4, 4)$.) Simple calculations show that Algorithm 5 uses 260 additions and 81 multiplications, and that therefore it is a (5 1/16 M, 16 1/4 A) algorithm.

Which of these algorithms is preferable? We cannot answer this question. Only a detailed examination of the algorithm in conjunction with the performance of the computing system can decide whether, for example, the (6 3/4 M, 12 7/8 A) algorithm is preferable to the (9M, 11 3/4 A) one. All that we can do is provide the designer of the system with the various alternatives. In the following examples we will derive a few additional algorithms and examine their behavior under iteration.

Example 1. We will start with the convolution of a three-dimensional vector with a two-dimensional one. That is, we will compute the coefficients of $Q(u) = (x_0 + x_1 u + x_2 u^2)(y_0 + y_1 u)$. We will obtain the algorithm by the use of the identity:
Following the steps described in § IVb we obtain
where
and
The tensor of this system, and its decomposition is given by
Equating coefficients of the $y_i$'s we obtain the following transpose of the problem:
where $m_1 = (z_0 - z_2)x_0$, $m_2 = (z_1 + z_2)((x_0 + x_1 + x_2)/2)$, $m_3 = (z_2 - z_1)((x_0 - x_1 + x_2)/2)$, and $m_4 = (z_1 - z_3)x_2$. We have thus developed an algorithm for $F(2, 3)$ which uses four multiplications and eight additions. (By Theorem 1 of this section $\mu(F(2, 3)) = 4$.) Equating the coefficients of the $x_i$'s we obtain a second transpose:
where $m_1 = (z_0 - z_2)y_0$, $m_2 = (z_1 + z_2)((y_0 + y_1)/2)$, $m_3 = (z_2 - z_1)((y_0 - y_1)/2)$, and $m_4 = (z_1 - z_3)y_1$. That is an algorithm for $F(3, 2)$ using four multiplications and eight additions. (Again, by Theorem 1 of this section $\mu(F(3, 2)) = 4$.)

We will use the second algorithm, in conjunction with the algorithm for $F(2, 2)$, to obtain a new sequence of algorithms for a 16-tap filter. More precisely, we will obtain a sequence of algorithms for $F(24, 16)$.

ALGORITHM 1'. This is the straightforward algorithm, which is a (16M, 15A) algorithm.

ALGORITHM 2'. We partition the matrix of $F(24, 16)$ into $8 \times 8$ blocks. The addition corresponding to $(z_0 - z_2)$ requires 15 additions, that of $(z_1 - z_3)$ eight more additions, and those of $(z_1 + z_2)$ and $(z_2 - z_1)$ 15 additions each. Altogether we have 53 additions. The additions corresponding to $m_2 + m_3$, $m_1 + (m_2 + m_3)$, $m_2 - m_3$, and $(m_2 + m_3) - m_4$ require eight additions each. So altogether $F(24, 16)$ can be computed with 85 additions and four computations of $F(8, 8)$. Doing the $F(8, 8)$ in the straightforward way we obtain an algorithm for $F(24, 16)$ using 256 multiplications and 309 additions. That is, we have a (10 2/3 M, 12 7/8 A) algorithm.

We can obtain Algorithms 3', 4', and 5' as was done prior to the example, by using the algorithm for $F(2, 2)$ to compute $F(8, 8)$, $F(4, 4)$, and $F(2, 2)$. It is important to note that since the matrix of $F(3, 2)$ is rectangular and not square, we can use the algorithm for $F(3, 2)$ only once. We cannot iterate it.

Example 2. In § IVc we discussed the possibility of using some algebraic number field as the field of constants. FIR filters provide us with an application
where this concept may be advantageous. If we take for $G$ the field of rationals extended by $\alpha$, where $\alpha^2 = -1$, we can compute $z_0 + \alpha z_m$, $z_1 + \alpha z_{m+1}$, $\cdots$, $z_{m-1} + \alpha z_{2m-1}$, followed by the computation of $z_{2m} + \alpha z_{3m}$, $\cdots$, $z_{3m-1} + \alpha z_{4m-1}$, and so on. We will illustrate this by deriving algorithms for $F(3, 3)$ and $F(4, 3)$, and then using them to compute the outputs of an 81-tap filter. In § IVc we derived the following identity for computing the coefficients of
where
and
The tensor of this system and its decomposition is:
The transpose obtained by equating coefficients of the $x_i$'s is:
where
and
We now have an algorithm for $F(3, 3)$ using five m/d steps and 15 additions. Of course each addition is the addition of a pair of numbers, and each multiplication is the product of two polynomials modulo $u^2+1$. In case each $z_i$ stands for an $s \times s$ matrix (of the form of the matrix for FIR filters), the 15 additions become $22s - 7$ additions.

We will now derive an algorithm for $F(4, 3)$. We will start from an algorithm for computing $Q(u) = (x_0 + x_1 u + x_2 u^2 + x_3 u^3)(y_0 + y_1 u + y_2 u^2)$, and using the identity $Q(u) = Q(u) \bmod (u(u-1)(u+1)(u-\alpha)(u+\alpha)(u-\infty))$ we obtain
where The transpose of this computation, obtained by equating the coefficients of the $x_i$'s of the tensor, is:
where That is, we have an algorithm for $F(4, 3)$ using six m/d steps and 20 additions. If the $z_i$'s stand for $s \times s$ matrices (of the form of FIR filters) then these 20 additions stand for $27s - 7$ additions.

We will now combine these algorithms to obtain an algorithm for $F(216, 81)$. We can view this problem as that of $F(108, 81)$ where each of the entries is a pair. We will obtain the algorithm by combining the algorithm for $F(4, 3)$ with the algorithm for $F(3, 3)$ iterated three times. The number of m/d steps of this combined algorithm is $6 \times 5 \times 5 \times 5 = 750$. But since each step is a multiplication modulo $u^2+1$, we have $3 \times 750 = 2{,}250$ multiplications and $3 \times 750 = 2{,}250$ additions. (A little more careful analysis shows that of the 750 m/d steps, 108 are of the form $(t_0 + u t_1) \times t_3 \bmod (u^2+1)$, which requires only two multiplications and no additions, and only the remaining 642 m/d steps require three multiplications and three additions each. So altogether the 750 m/d steps represent $3 \times 642 + 2 \times 108 = 2{,}142$ multiplications and $3 \times 642 = 1{,}926$ additions.) The number of addition steps of the combined algorithm is $(27 \times 27 - 7) + 6((22 \times 9 - 7) + 5((22 \times 3 - 7) + 5 \times 15)) = 5{,}895$. Since each addition step is the sum of a pair of numbers (i.e., a sum of two linear polynomials), the total number of additions is $2 \times 5{,}895 + 1{,}926 = 13{,}716$. This algorithm computes 216 outputs, so it is a (9 11/12 M, 63 1/2 A) algorithm.
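As an aside, the operation counts of Algorithms 1-5 for $F(16, 16)$ given earlier in this subsection can be reproduced mechanically from the rule stated there: if $F(s, s)$ costs $m$ multiplications and $a$ additions, then $F(2s, 2s)$ costs $3m$ multiplications and $4s + 3a$ additions. The little bookkeeping sketch below (function names mine) does exactly that.

```python
def straightforward(s):
    """Counts (multiplications, additions) of the straightforward F(s, s)."""
    return s * s, s * (s - 1)

def iterate_f22(s, mults, adds):
    """One level of the F(2, 2) iteration: F(s, s) with the given counts
    yields F(2s, 2s) with 3*mults multiplications and 4*s + 3*adds additions."""
    return 2 * s, 3 * mults, 4 * s + 3 * adds

def algorithm(base, levels):
    """Start from the straightforward F(base, base) and iterate up to F(16, 16)."""
    s = base
    mults, adds = straightforward(s)
    for _ in range(levels):
        s, mults, adds = iterate_f22(s, mults, adds)
    return mults, adds

print(algorithm(16, 0))      # (256, 240)  -- Algorithm 1
print(algorithm(8, 1))       # (192, 200)  -- Algorithm 2
print(algorithm(4, 2))       # (144, 188)  -- Algorithm 3
print(algorithm(2, 3))       # (108, 206)  -- Algorithm 4

s, mults, adds = 2, 3, 4     # the (3M, 4A) algorithm for F(2, 2)
for _ in range(3):
    s, mults, adds = iterate_f22(s, mults, adds)
print((mults, adds))         # (81, 260)   -- Algorithm 5
```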
We could have, of course, derived another algorithm which uses more multiplications and fewer additions by not iterating the $F(3, 3)$ algorithm all the way.

Vb. Filters with decimation. In many applications we do not desire to compute every output of the filter but, say, every other output (2:1 decimation), or every third output (3:1 decimation), or even higher decimation. In this subsection we will develop algorithms for computing the decimated outputs of FIR filters. We will start with an example. Consider the computation of three outputs of a 5-tap filter with a 2:1 decimation. If we denote by $z_0, z_1, \cdots$ the outputs of the filter without decimation, then our task is to compute $z_0$, $z_2$, and $z_4$, followed by $z_6$, $z_8$, and $z_{10}$, and so on. Writing $z_0$, $z_2$, and $z_4$ as the product of a matrix by a vector we obtain,
That is, our task can be viewed as computing $F(3, 3)$ on one set of signals $(x_0, x_2, x_4, x_6, x_8)$ and one set of tap values $(h_0, h_2, h_4)$, and also $F(3, 2)$ on another set of signals $(x_1, x_3, x_5, x_7)$ and another set of tap values $(h_1, h_3)$, and then summing the results. If we denote by $F(3, 5; 2)$ the problem at hand, we can express the previous sentence symbolically as $F(3, 5; 2) = F(3, 3) + F(3, 2)$. In general, we will denote by $F(m, n; d)$ the computation of $m$ outputs of an $n$-tap filter with a $d$:1 decimation. The example can be easily generalized, and we summarize it in the following theorem:

THEOREM 1. Let $n = s \cdot d + r$; then $F(m, n; d) = rF(m, s+1) + (d-r)F(m, s)$.

Using Theorem 2 of § IIIb we obtain that $\mu(rF(m, s+1) + (d-r)F(m, s)) \ge r(m+s) + (d-r)(m+s-1) = (m-1)d + ds + r = n + (m-1)d$. In Theorem 1 of § Va we showed that $\mu(F(m, n)) = m+n-1$, and therefore $\mu(F(m, n; d)) \le r(m+s) + (d-r)(m+s-1) = n + (m-1)d$. Using these two inequalities we obtain:

THEOREM 2. Let $n = s \cdot d + r$; then $\mu(F(m, n; d)) = n + (m-1)d$.
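The decomposition behind Theorem 1 can be written out directly: the $d$:1-decimated outputs of an $n$-tap filter are the sum of $d$ undecimated sub-filters, one for each residue class of the tap index modulo $d$. The sketch below (function names and test data mine) checks this for the $F(3, 5; 2) = F(3, 3) + F(3, 2)$ example above.

```python
def fir_outputs(x, h, m):
    """m outputs of an n-tap filter: z_i = sum_j x_(i+j) * h_j."""
    return [sum(x[i + j] * h[j] for j in range(len(h))) for i in range(m)]

def decimated(x, h, m, d):
    """F(m, n; d): every d-th output of the n-tap filter, computed directly."""
    return [sum(x[i * d + j] * h[j] for j in range(len(h))) for i in range(m)]

def decimated_by_subfilters(x, h, m, d):
    """F(m, n; d) as the sum of d undecimated sub-filters (Theorem 1):
    residue class r contributes a filter with taps h_r, h_(r+d), ..."""
    outputs = [0] * m
    for r in range(d):
        sub = fir_outputs(x[r::d], h[r::d], m)
        outputs = [a + b for a, b in zip(outputs, sub)]
    return outputs

x = list(range(1, 14))                 # plenty of samples
h = [2, 3, 5, 7, 11]                   # a 5-tap filter, decimated 2:1
assert decimated(x, h, 3, 2) == decimated_by_subfilters(x, h, 3, 2)
```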
We see from Theorem 2 that computing $F(m, n; d)$ by performing $r$ separate computations of $F(m, s+1)$ and $(d-r)$ separate computations of $F(m, s)$ uses as few m/d steps as any algorithm for computing $F(m, n; d)$. Nonetheless it is advantageous to combine all those computations into one, because this way we can reduce the number of additions. In the rest of this subsection we will describe one way of achieving this reduction in the number of additions. For the sake of simplicity we will take the case, which is prevalent in applications, that $d$ divides $n$, i.e., that $r = 0$.

Every algorithm for computing a system of bilinear forms can be written as computing the product $Ap$ of the vector $p = (p_1, p_2, \cdots, p_l)^T$, whose entries are the $l$ products, by the matrix $A$ whose entries are in $G$. The addition steps of this algorithm can be partitioned into two parts: those additions which are required to
compute the $p_i$'s, to be called input additions, and those additions which are required by the multiplication by $A$, to be called output additions. Let $Ap$ be the algorithm for computing $F(m, s)$ (where $n = d \cdot s$). Assume this algorithm has $a_1$ input additions and $a_2$ output additions. If we compute $Ap$ $d$ times, i.e., we compute $Ap_1, Ap_2, \cdots, Ap_d$ separately, we need $da_1 + da_2$ additions. To compute $F(m, n; d)$ we still have to add all these results, which requires $m(d-1)$ additions. Altogether, this way uses $d(a_1 + a_2 + m) - m$ additions. Alternatively, we have $Ap_1 + Ap_2 + \cdots + Ap_d = A(p_1 + p_2 + \cdots + p_d)$. So we can compute $F(m, n; d)$ by first computing $p_1, p_2, \cdots, p_d$ (which uses $da_1$ additions), then summing these $d$ vectors (which uses $(d-1)l$ additions, where $l$ is the size of the $p_i$'s), and finally multiplying by $A$ (which uses $a_2$ additions). Altogether this algorithm uses $d(a_1 + l) + a_2 - l$ additions. Whenever $a_2 > l - m$ the second method uses $(d-1)(a_2 + m - l)$ fewer additions.

Example 1. We will derive an algorithm for computing $F(3, 6; 3)$. As we saw, $F(3, 6; 3) = 3F(3, 2)$.
where $m_1 = (z_0 - z_2)y_0$, $m_2 = (z_1 + z_2)((y_0 + y_1)/2)$, $m_3 = (z_2 - z_1)((y_0 - y_1)/2)$, and $m_4 = (z_1 - z_3)y_1$. This algorithm has four m/d steps, four input additions, and four output additions. We will use this algorithm to obtain an algorithm for $F(3, 6; 3)$. This algorithm for $F(3, 6; 3)$ will have $3 \times 4 = 12$ m/d steps and $3 \times 4 + 2 \times 4 + 4 = 24$ additions, i.e., it is a (4M, 8A) algorithm. (The straightforward algorithm is a (6M, 5A) algorithm.) To obtain the algorithm we first write $F(3, 6; 3)$ in full.
The first part of the algorithm is the computation of the input additions of the three subproblems; that is, we compute $(x_0 - x_6)$, $(x_3 + x_6)$, $(x_6 - x_3)$, $(x_3 - x_9)$; $(x_1 - x_7)$, $(x_4 + x_7)$, $(x_7 - x_4)$, $(x_4 - x_{10})$; $(x_2 - x_8)$, $(x_5 + x_8)$, $(x_8 - x_5)$, $(x_5 - x_{11})$. This part requires 12 additions.
The second part of the algorithm is performing the 12 m/d steps. These are:
This part uses 12 multiplications. The third part of the algorithm computes $m'_1 = m_1 + m_5 + m_9$, $m'_2 = m_2 + m_6 + m_{10}$, $m'_3 = m_3 + m_7 + m_{11}$, and $m'_4 = m_4 + m_8 + m_{12}$. This requires $2 \times 4 = 8$ additions. The final part of the algorithm computes
and uses four additions.

Example 2. In this example we will discuss a 32-tap filter with a 2:1 decimation. More specifically, we will derive an algorithm for $F(24, 32; 2) = 2F(24, 16)$. Algorithms for $F(24, 16)$ were derived in Example 1 of § Va; we will follow this derivation keeping track of the number of input additions and output additions. We partition the matrix of $F(24, 16)$ into $8 \times 8$ blocks, and denote them by $z_0, z_1, z_2, z_3$. Using the $F(3, 2)$ algorithm we have to compute $z_0 - z_2$, $z_1 - z_3$, $z_1 + z_2$, and $z_2 - z_1$. This requires $15 + 8 + 15 + 15 = 53$ additions. (Recall that once we compute $z_0 - z_2$, the computation of $z_1 - z_3$ uses only eight additions.) Partitioning each of these $8 \times 8$ matrices into blocks of $4 \times 4$ matrices, we can use the $F(2, 2)$ algorithm. If we denote the $4 \times 4$ blocks of each of the $8 \times 8$ blocks by $x_0, x_1, x_2$, we have to compute $x_0 - x_1$ and $x_1 - x_2$, which uses $7 + 4 = 11$ additions. This last statement is true for $z_0 - z_2$, $z_1 + z_2$, and $z_2 - z_1$; but the $x_0 - x_1$ of $z_1 - z_3$ uses only four additions once we have computed the $x_1 - x_2$ of $z_0 - z_2$. Thus the number of additions is $11 + 11 + 11 + 8 = 41$. We will compute the $F(4, 4)$'s in the regular way (which has no input additions). So the total number of input additions is
$53 + 41 = 94$. Each of the $4 \times 3 = 12$ $F(4, 4)$'s uses 16 multiplications, so the algorithm uses $16 \times 12 = 192$ multiplications. Now to the computation of the number of output additions. Each of the 12 $F(4, 4)$'s uses 12 output additions, for a total of 144. Each of the four $F(2, 2)$ algorithms (on $4 \times 4$ blocks) we used requires $2 \times 4 = 8$ output additions, for a total of 32 additions. Finally, the $F(3, 2)$ algorithm (on $8 \times 8$ blocks) uses $8 \times 4 = 32$ additions. So altogether this algorithm has $144 + 32 + 32 = 208$ output additions. To summarize, the $F(24, 16)$ algorithm we use has 192 multiplications, 94 input additions, and 208 output additions. Consequently the $F(24, 32; 2)$ algorithm has $2 \times 192 = 384$ multiplications and $2 \times 94 + 192 + 208 = 588$ additions. We therefore have a (16M, 24 1/2 A) algorithm. For the sake of comparison it should be noted that the straightforward algorithm is a (32M, 31A) algorithm.

Vc. Symmetric filters. In many applications we encounter symmetric filters. Let $h_0, h_1, \cdots, h_{n-1}$ be the values of the taps of an $n$-tap filter. The filter is called symmetric if $h_i = h_{n-1-i}$. From a computational point of view, the advantage of symmetric filters is that the straightforward algorithm for an $n$-tap symmetric filter uses $n-1$ additions but only $n/2$ multiplications (if $n$ is even) or $(n+1)/2$ multiplications (if $n$ is odd) for every output. We will denote by $F_s(m, n)$ the computation of $m$ outputs of an $n$-tap symmetric filter.

It is clear that we can always compute $F_s(m, n)$ using an algorithm for $F(m, n)$. That is, $\mu(F_s(m, n)) \le \mu(F(m, n)) = m+n-1$. Using Theorem 2 of § IIIb we obtain the following:

THEOREM 1. For $n = 2l+1$ an odd number, $\mu(F_s(m, n)) = m+n-1$. For $n = 2l$ an even number, $\mu(F_s(m, n)) = m+n-2$.

Proof. The first part of the theorem is immediate from Theorem 2 of § IIIb (which implies that $\mu(F_s(m, n)) \ge m+n-1$), and from the observation that $\mu(F_s(m, n)) \le \mu(F(m, n)) = m+n-1$. As for the second part of the theorem, Theorem 2 of § IIIb implies that $\mu(F_s(m, 2l)) \ge m+2l-2$. We will prove the second part of the theorem by proving that $\mu(F_s(m, 2l)) \le \mu(F_s(m, 2l-1)) = m+2l-2$. The construction involved in the proof of this last assertion is important for the construction of algorithms. We will therefore highlight this construction by stating the result as a separate theorem.

THEOREM 2. For every algorithm $A$ for computing $F_s(m, 2l-1)$ using $p$ multiplications and $q$ additions, there exists an algorithm $A'$ for computing $F_s(m, 2l)$ using $p$ multiplications and $q + m + 2l - 2$ additions. Moreover, if $A$ is an (rM, sA) algorithm, $A'$ is an (rM, (s+1)A) algorithm.

Proof. We will give two proofs of this result, since each of them illuminates another aspect of the computation. The first proof starts with the identity $z_k = \sum_{j=0}^{2l-1} x_{k+j}h_j = \sum_{j=0}^{2l-2}(x_{k+j} + x_{k+j+1})h'_j + x_{k+2l-1}h'_{2l-1}$, where $h'_0 = h_0$, $h'_1 = h_1 - h_0$, $h'_2 = h_2 - h_1 + h_0$, and in general $h'_i = h_i - h_{i-1} + h_{i-2} - \cdots \pm h_0$. In the special case where $h_i = h_{2l-1-i}$ we have $h'_{2l-1} = 0$. Using the fact that $h'_{2l-1} = 0$ we obtain for each $i = 0, 1, \cdots, l-2$ that $h'_{2l-2-i} = h'_i$.
Thus we first have to compute the $m+2l-2$ quantities $x'_i = x_i + x_{i+1}$, $i = 0, 1, \cdots, m+2l-3$, and in terms of these quantities we have an $F_s(m, 2l-1)$ computation. This proves the first half of the theorem. The second part of the theorem follows from the observation that of the terms $x'_i = x_i + x_{i+1}$, $i = 0, 1, \cdots, m+2l-3$, needed for the computation of the "second batch" of $F_s(m, 2l)$ (i.e., the computation of the $(m+1)$st, $(m+2)$nd, up to $(2m)$th outputs), all but the last $m$ have already been calculated.

The second proof of the theorem is not as straightforward, but may be more illuminating. Let $F_s(m, 2l)$ be the computation of $m$ outputs of a $2l$-tap symmetric filter. One of the transposes of this computation is that of computing the coefficients of the polynomial $Q(u) = R(u) \times S(u)$, where $R(u)$ is the polynomial $R(u) = \sum_{i=0}^{m-1} z_i u^i$ and $S(u)$ is the polynomial $S(u) = \sum_{j=0}^{2l-1} h_j u^j$. Since the filter is symmetric, i.e., $h_i = h_{2l-1-i}$, the polynomial $S(u)$ is symmetric as well, i.e., $S(u) = u^{2l-1}S(u^{-1})$. But every symmetric polynomial of odd degree is divisible by $u+1$, so $S(u) = (u+1)S'(u)$ and $Q(u) = (u+1)R(u) \times S'(u)$. Simple calculations show that $S'(u)$ is the polynomial $S'(u) = \sum_i h'_i u^i$, where the terms $h'_i$ are as defined in the first proof.

We now turn to an example for deriving an algorithm for computing the outputs of a symmetric filter. This example will illustrate how the extra symmetries of a symmetric filter aid us in obtaining an algorithm which uses few m/d steps and yet does not have large coefficients.

Example 1. In this example we will derive an algorithm for computing the outputs of a 3-tap symmetric filter. More specifically, we will derive an algorithm for $F_s(4, 3)$. The algorithm derived in this example will be a part of the algorithm which we will derive in the next example. Just as was done in § Va, we will start by deriving an algorithm for a transpose of $F_s(4, 3)$, and then transpose the algorithm to obtain an algorithm for $F_s(4, 3)$. A transpose of $F_s(4, 3)$ is the problem of computing the coefficients of $Q(u) = R(u) \times S(u)$ where $R(u)$ is the polynomial $R(u) = z_0 + z_1 u + z_2 u^2 + z_3 u^3$, and $S(u)$ is the symmetric polynomial $S(u) = h_0 + h_1 u + h_0 u^2$. We will compute the coefficients of $Q(u)$ using the identity $Q(u) = R(u) \times S(u) \bmod (u^4-1)u(u-\infty)$. If $S(u)$ had not been symmetric the algorithm would not be minimal, since $u^4-1$ has the nonlinear irreducible polynomial $u^2+1$ as a factor. Because $S(u)$ is a symmetric polynomial we will be able to obtain an algorithm having only six m/d steps. Using the Chinese remainder theorem we have to compute $Q(u)$ modulo $(u-1)$, $(u+1)$, $(u^2+1)$, and $u$.

$Q(u) \bmod (u-1)$: Since $Q(u) \bmod (u-1) = (z_0 + z_1 + z_2 + z_3)(2h_0 + h_1)$, this can be done using the one multiplication $m_1 = (z_0 + z_1 + z_2 + z_3)(2h_0 + h_1)$.

$Q(u) \bmod (u+1)$: This also can be computed using only one multiplication; namely $m_2 = (z_0 - z_1 + z_2 - z_3)(2h_0 - h_1)$.

$Q(u) \bmod (u^2+1)$: The polynomial $Q(u) \bmod (u^2+1)$ is $((z_0 - z_2) + (z_1 - z_3)u)(h_1 u) \bmod (u^2+1) = t_0 + t_1 u$. We can compute $t_0$ and $t_1$ using the two multiplications $m_3 = (z_3 - z_1)h_1 = t_0$, $m_4 = (z_0 - z_2)h_1 = t_1$.
$Q(u) \bmod u$: This step can be done using only one multiplication: $m_5 = z_0 h_0$. Putting it all together we obtain:
Therefore
Taking the transpose of the algorithm we obtain:
where $m_1 = (x_1 + x_2 + x_3 + x_4)((2h_0 + h_1)/4)$, $m_2 = (-x_1 + x_2 - x_3 + x_4)((2h_0 - h_1)/4)$, $m_3 = (-x_2 + x_4)h_1/2$, $m_4 = (x_1 - x_3)h_1/2$, $m_5 = (x_0 - x_4)h_0$, and $m_6 = (-x_1 + x_5)h_0$. We have derived an algorithm for $F_s(4, 3)$ which uses six multiplications and 16 additions.

Example 2. In this example we will derive an algorithm for computing the outputs of a 7-tap symmetric filter. More specifically, we will derive an algorithm for $F_s(8, 7)$. A transpose of $F_s(8, 7)$ is the computation of the coefficients of $Q(u) = R(u) \times S(u)$ where $R(u)$ is the polynomial $R(u) = z_0 + z_1 u + z_2 u^2 + z_3 u^3 + z_4 u^4 + z_5 u^5 + z_6 u^6 + z_7 u^7$, and $S(u)$ is the symmetric polynomial $S(u) = h_0 + h_1 u + h_2 u^2 + h_3 u^3 + h_2 u^4 + h_1 u^5 + h_0 u^6$. Using the heuristic approach described in § IVc, we start with the identity $Q(u) = R(u) \times S(u) \bmod (u^8-1)u^3(u-\infty)^3$.

$Q(u) \bmod (u-1)$: This part uses only one multiplication, $m_1 = (z_0 + z_1 + z_2 + z_3 + z_4 + z_5 + z_6 + z_7)(2h_0 + 2h_1 + 2h_2 + h_3)$.

$Q(u) \bmod (u+1)$: This part also uses only one multiplication, $m_2 = (z_0 - z_1 + z_2 - z_3 + z_4 - z_5 + z_6 - z_7)(2h_0 - 2h_1 + 2h_2 - h_3)$.

$Q(u) \bmod (u^2+1)$: Since $Q(u) \bmod (u^2+1) = t_0 + t_1 u = ((z_0 - z_2 + z_4 - z_6) + (z_1 - z_3 + z_5 - z_7)u)((2h_1 - h_3)u) \bmod (u^2+1) = (-z_1 + z_3 - z_5 + z_7)(2h_1 - h_3) + (z_0 - z_2 + z_4 - z_6)(2h_1 - h_3)u$, we can compute $t_0$ and $t_1$ using two multiplications:
$m_3 = (-z_1 + z_3 - z_5 + z_7)(2h_1 - h_3) = t_0$, $m_4 = (z_0 - z_2 + z_4 - z_6)(2h_1 - h_3) = t_1$.

$Q(u) \bmod (u^4+1)$: The polynomial $Q(u) \bmod (u^4+1)$ is $((z_0 - z_4) + (z_1 - z_5)u + (z_2 - z_6)u^2 + (z_3 - z_7)u^3)((h_0 - h_2) + (h_2 - h_0)u^2 + h_3 u^3) \bmod (u^4+1)$. The polynomial $(h_0 - h_2) + (h_2 - h_0)u^2 + h_3 u^3 \bmod (u^4+1)$ can be written as
$u^2[(h_2 - h_0) + h_3 u + (h_2 - h_0)u^2] \bmod (u^4+1)$. Thus we obtain that $Q(u) \bmod (u^4+1)$ is $u^2((z_0 - z_4) + (z_1 - z_5)u + (z_2 - z_6)u^2 + (z_3 - z_7)u^3)((h_2 - h_0) + h_3 u + (h_2 - h_0)u^2) \bmod (u^4+1) = ((z_6 - z_2) + (z_7 - z_3)u + (z_0 - z_4)u^2 + (z_1 - z_5)u^3)((h_2 - h_0) + h_3 u + (h_2 - h_0)u^2) \bmod (u^4+1) = R'(u) \times S'(u) \bmod (u^4+1)$. We will compute the coefficients of $Q(u) \bmod (u^4+1)$ by first computing the coefficients of $R'(u) \times S'(u)$, and then "reducing" modulo $u^4+1$. The polynomial $S'(u)$ is a quadratic symmetric polynomial. An algorithm for computing the coefficients of $R'(u) \times S'(u)$ using six multiplications was derived in the first example of this subsection. We will use this algorithm. Denoting the polynomial $R'(u) \times S'(u)$ by $R'(u) \times S'(u) = a_0 + a_1 u + a_2 u^2 + a_3 u^3 + a_4 u^4 + a_5 u^5$, we obtain from the first example that
where
We finally obtain that $R'(u) \times S'(u) \bmod (u^4+1) = (a_0 - a_4) + (a_1 - a_5)u + a_2 u^2 + a_3 u^3 = b_0 + b_1 u + b_2 u^2 + b_3 u^3$, where $b_0 = -4m_5 - 4m_6 - 2m_7 + 2m_9$, $b_1 = 4m_5 - 4m_6 + 2m_8 + 2m_{10}$, $b_2 = 4m_5 + 4m_6 - 2m_7$, and $b_3 = 4m_5 - 4m_6 - 2m_8$. It seems that we were lucky in being able to express $Q(u) \bmod (u^4+1)$ as $R'(u) \times S'(u) \bmod (u^4+1)$ where $S'(u)$ is a symmetric polynomial of degree two. We interrupt the example to state a theorem which shows those aspects of the problem that enabled us to do so. It also shows that this will always happen whenever we derive the algorithm for symmetric filters using the approach of this example.

THEOREM 3. Let $S(u) = \sum_{i=0}^{m} h_i u^i$ be a symmetric polynomial (i.e., $h_i = h_{m-i}$) of even degree $m$ with indeterminates as coefficients. Let $P(u) \in G[u]$ be an irreducible symmetric polynomial of even degree $n$ ($n \le m$) with coefficients in $G$. Then there exists a polynomial $q(u) \in G[u]$ such that $S(u) \bmod P(u) = q(u) \cdot S'(u) \bmod P(u)$, where $S'(u)$ is a symmetric polynomial of degree $n-2$.

Example 2 (continued). To continue the derivation of the algorithm we have to compute $Q(u) \bmod u^3$ and $Q(u) \bmod (u-\infty)^3$.

$Q(u) \bmod u^3$: We have to compute $z_0 h_0$, $z_0 h_1 + z_1 h_0$, and $z_0 h_2 + z_1 h_1 + z_2 h_0$. A slight modification of the algorithm of § IIa yields the following identities: $z_0 h_0 =$
$m_{11}$, $z_0 h_1 + z_1 h_0 = m_{12} - m_{11} - m_{13}$, and $z_0 h_2 + z_1 h_1 + z_2 h_0 = m_{13} + m_{14} + m_{15}$, where $m_{11} = z_0 h_0$, $m_{12} = (z_0 + z_1)(h_0 + h_1)$, $m_{13} = z_1 h_1$, $m_{14} = z_0 h_2$, and $m_{15} = z_2 h_0$.

$Q(u) \bmod (u-\infty)^3$: For that part we have to compute $z_7 h_0$, $z_7 h_1 + z_6 h_0$, and $z_7 h_2 + z_6 h_1 + z_5 h_0$. This is the same problem as the computation of $Q(u) \bmod u^3$, and we will use the identities $z_7 h_0 = m_{16}$, $z_7 h_1 + z_6 h_0 = m_{17} - m_{16} - m_{18}$, and $z_7 h_2 + z_6 h_1 + z_5 h_0 = m_{18} + m_{19} + m_{20}$, where $m_{16} = z_7 h_0$, $m_{17} = (z_6 + z_7)(h_0 + h_1)$, $m_{18} = z_6 h_1$, $m_{19} = z_7 h_2$, and $m_{20} = z_5 h_0$.

We are now in a position to use the Chinese remainder theorem in order to put all the pieces together to obtain an algorithm for computing the coefficients of $Q(u)$. We first obtain that
Collecting the terms by the powers of M, and using the identity Q(u) = Q(w)mod (w 11 — w 3 ) + (mi8 + mi9 + ra 20 )(« n — w 3 ) + (wi 7 —mi 6 —WisKw 1 2 —« 4 ) +
6(«13 — w5) we obtain that
54
CHAPTER V
If we denote the eight outputs of FS(S, 7) by z0, zi, • • • , z7, then by taking the transpose we obtain:
where:
FIR FILTERS
55
We have thus derived an algorithm for computing FS(S, 7) which uses 20 multiplications and 64 additions. That is, we have a (2|M, 8 A) algorithm (as compared with (4M, 6A) algorithm of performing the computation in a straightforward way). Of the 64 additions, 30 are input additions and 34 are output additions. Example 3. In this example we will derive an algorithm for computing the outputs of a 15-tap symmetric filter with a 2:1 decimation. More specifically, we will derive an algorithm for FS(S, 15; 2). The same argument used to prove Theorem 1 of § Vb shows that F x (8,15; 2) = Fs(8, 8) + Fs(8,7). Theorem 2 of this subsection shows that we can use the algorithm for Fs(8, 7) in order to compute Fs(8, 8). To do that we first have to perform 14 additions of which only eight are really needed, since the other six have already been previously calculated. The same procedure as in § Vb enables us to obtain an algorithm for computing Fs(8,15; 2) which uses 2 x 20 = 40 multiplications, and 2 x 30 + 20 + 34 + 8 = 122 additions. That is, we know how to derive a (5M, 15|A) algorithm for computing F5(8,15; 2). The regular algorithm is an (8M, 14A) algorithm. In order to continue our investigation of the algorithms for symmetric filters with decimation, we need two new concepts: that of a skew-symmetric filter, and that of a symmetric pair of filters. An n-tap filter with tap values of h0, hi,- - • ,hn-\ is said to be skew symmetric if for all i = 0,1, • • • , n — 1, hf = —hn-\-i. Of course if the number of taps n = 21 +1 is odd then hi = 0. We will denote the task of computing m outputs of a skew symmetric filter by Fs(m, n). We have results analogous to Theorems 1 and 2 of this subsection for skew symmetric filters. We will only sketch the proofs of those theorems, since they are similar to the proofs of Theorems 1 and 2. THEOREM 4. Let n=2l be an even number. Then /Lt(Fs(m, n)) = /u(Fs(m, n — 1)) = m + n — 2. More specifically, for every algorithm A for computing Fs(m, n — 1) there exists an algorithm for computing Fs(m, n) which has the same number of m/d steps as A and (m+n—2) additions more than A. Moreover if A is an (aM, bA) algorithm then the algorithm for Fs(m,n} is an (aM, (b + 1)A) algorithm. Proof. One of the transposes of Fs(m, n) is the computation of the coefficients of Q(u) = R(u)xS(u), where R(u) = l™Jo zv', and S(U)=^~Q htul —hn-\~i (i = 0,1, • • • , n — 1). Since S ( u ) is always divisible by (1 — u) we can write S(u) = (1 - u)S'(u) where S'(u) = TJill h\u. It is easily verified that h'0 = -h0, an for / = 1,2, • • • , n-2, h\ =h'i-i+hi. It is also immediate that h\ = h'n-2-i- Therefore Q(u) = (l-u)R(u)xS'(u). Taking the transpose of (l-u)R(u)*S'(u) we see that if we define x\ = Xi-xi+\, / = 0, 1, • • • , m + n -3, we have changed Fs(m, n) into an Fs(m, n — 1) in terms of x\ and x}. The rest of the theorem is immediate. THEOREM 5. Let n=2l + l be an odd number. Then IJL(Fs(m, n)) = fjL(Fs(m, n -2)) = m + n -3. More specifically, for every algorithm, A, for computing Fs(m, n—2) there exists an algorithm for computing (Fs(ra, «)) which has the same number of m/d steps and (m+n— 3) additions more than A. Moreover, if A is an
56
CHAPTER V
(aM,bA) algorithm then the algorithm for (Fs(m, n)) is an (aM, (b + l)A) algorithm. Proof. Again, we examine the transpose problem of computing the coefficients of Q(u) = R(u)x S(u). The coefficients of the polynomial S(u) = £"=0 /i,«' satisfy hi = —hn-i-i, hi = 0 so S(u) = (1 — u2)S'(u) where S'(«) is a symmetric polynomial of degree n— 3. The coefficients h\ of S"(w) are h'o = h0, h(=hi, h\ = h'i-2+hi(i^2). Therefore Q(u) = (i-u2)R(u)S'(u). Transposing the problem back we obtain Fs(ra, n — 2) in terms of tap values h\, and signal values x\ = Xi + Xt-2- The rest of the assertion of the theorem is immediate. The second concept we have to introduce is that of a symmetric pair of filters. Consider two n -tap filters z, = £"=o xi+jhj and z\ = Z"=o x'i+jh'j, where the jc, terms and x't terms stand for disjoint of indeterminates, and h\ =/i n _i-,. A symmetric pair of filters is the computation of z" = z,• + z \. We denote the computation of m outputs of a symmetric pair of filters by FF(m, n)_. THEOREM 6. /u(FF(w, n)) = n(Fs(m, n)) + /u,(Fs(m, n)). More specifically, for every algorithm, A, for computingFs(m, n)+Fs(m, n), there exists an algorithm for computingFF(m, n) having the same number ofm/d steps as A, and 2(m + n — 1 additions more than A. Moreover, if the algorithm A is an (aM, bA) algorithm then the algorithm for FF(M, n) is an (aM, (b + 2)A) algorithm. Proof. The statement of the theorem is most transparent when we look at the transpose of the problem. Nonetheless, we will give a direct proof. We start with the identity *,/!,i+xlh- = (*,- + *,;)((/z,• + h'i)/2) + (xi-jc«)((/i,-/i-)/2). Therefore zt+z't =1^=0 (x+i+x't+jKhi + h'iM + Z*:; Ot,.+y-*;+,.)(/*,.-/O/2. The first summation is the computation of a symmetric filter (with tap values (hj + h'j)/2, and signals y, = *,- + jc,'), and the second summation is the computation of a skew symmetric filter (with tap values (h,--h',-)/2, and signals y\ =• x,\—x\). The rest of the assertion is immediate. Example 4. Consider the computation of a 31-tap symmetric filter with 4:1 decimation. Let the values of taps be h0, hi, • • • , hi4, his, hi4, • • • , hi, h0- The same consideration as in § Vb shows that our problem is to compute the sum of four filters. The first filter has tap values (h0, h4, h&, hi2, hi4, hw, h6, hi), the second filter has tap values (hi, hs, h9, hu, hu, h9, hs, hi), the third filter has tap values (h2, h6, hio, hi4, hi2, h8, h4, ho), and the fourth (h3, h7, hu, his, hu, h7, /t3). The first and third filters form a symmetric pair of 8-tap filters, the second filter is an 8-tap symmetric filter, and the fourth filter is a 7-tap symmetric filter. We see that F,(8, 31; 4) = F,(8, 7) + Fs(8, 8) + FF(8, 8). Using the algorithm of Example 2 we obtain an algorithm for computing 4F5(8,7) which uses 4x20 = 80m/d step, and 4x30 + 3x20 + 34 = 214 additions. That is, it is a (10M, 26|A) algorithm. By Theorems 2 and 4 of this subsection we obtain an algorithm for computing Fs(8, 7) + 2Fs(8, 8)+F,(8, 8) which is a (10M, 29|A) algorithm. By Theorem 6 we obtain a (10M, 3l|A) algorithm for computing FS(S, 7)+Fs(8, 8)+FF(8, 8) =FS(8, 31; 4). For the sake of comparison, the straight-forward algorithm for computing Fs(8, 31; 4) is a (16M, 30A) algorithm.
CHAPTER VI
Product of Polynomials Modulo a Polynomial In § IV we derived algorithms for computing the coefficients of the polynomial Q(u) = R(u) • S(u) by viewing O(u) as a polynomial in the ring of polynomials modulo P(u) where P(u)e G[u] is a polynomial satisfying degP(w)>deg Q(«). In this section we will analyze the complexity of computing the coefficients of the product of two polynomials modulo a third one. In the next section we will apply these results to derive algorithms for cyclic convolutions and Fourier transforms. It may be useful to consider a simple example before we analyze the general case. Via. An illustrative example. In this subsection we will derive an algorithm for computing the cyclic convolution of two 3-dimensional vectors. As we shall see in subsequent sections, the method we will use is in a sense the most general one. Consider the following problem: We wish to compute the quantities z0, z\, z2 where
That is Zi = Z/=o Xt-iVi where / — / is computed modulo 3. (It is the stepping of the indices modulo 3 which motivates calling the convolution cyclic.) It is well known that we could have defined z0, z\, and z2 by the relation (z0 + ziu+z2u2) = (xo + xiu+x2u2)(y0 + yiu + y2u2) mod (w 3 -l). We will at first take G to be the field of rational numbers. (We still retain the convention that B - G U {*/} U {y/} and therefore suppress the mention of B.) The polynomial u3 — 1 factors into two irreducible factors u3-l = (u — V)(u2 + u +1). By the Chinese remainder theorem we can break the problem into two parts. The first part is that of computing (z0 + ZiU + z2u2) mod (M — 1) = (xo + xi + x2)(yo + yi + y2) = mi, and the second part is that of computing (ZQ + 2u2) mod («2 + w + l) = ((x0-x2) + (xi-x2)u)((yo-y2)
+ (yi-y2)M)
mod
(« 2 + « + l). The computation of the first part is obvious. To compute the second part we first compute the coefficients of ((x0 — x2) + (xi—x2)u)((yo — y2) + (yi — y2)u) and then reduce modulo u2 + u +1. Let 0 + tiu + t2u = ((x0 - x2) + Ui - x2)u)((y0 ~ y2) + (y i - y2)u) = ((x0 - x2) (xi — x2)u)((yo — y2) + (yi — y2)u) mod (u(u + l)(u —2)). As explained in § IVa we obtain the algorithm t0 = m2, t\ = m3 + m2 + m4, t2 = m4 where m2 = 0-x2)(yo-y2\
m3 = (x0-xi)(yi-y0),
and m4 = (xi-x2)(yi-y2). 57
We want to
58
CHAPTER VI
compute the coefficients of (to + tiu + t2u2} mod (u2 + u + 1) = (t0 —t2) + (ti — t^u = ciQ + a\u. We thus obtain ao = to — t2 = m 2 —m 4 , a\ = t\ —12 = m2 + mj. We can now put the pieces together and obtain
There are two questions which this algorithm raises: Is At(z 0 , Zi, za) = 4? Are there other algorithms for computing z0, Zi, and z3 which have four m/d steps? We will see in the next subsection that the answer to the first question is that fji (z0, Zi, 22) is indeed 4. As to the second question, we could have easily modified the part of the algorithm which computes t0, t\, and t2 by choosing any of the algorithms discussed in § IVa. Yet there are algorithms which cannot be obtained this way. We will now derive one such algorithm. (In the next subsection we will describe all possible algorithms.) We start with the tensor of this system of bilinear forms and its decomposition is given by the algorithm. The tensor is: Equating coefficients of the jc/'s we obtain
where 0 + 2zi - z2), and m4 = ?(yi - y2)(-2z0 + zx + z2). Replacing y0 by x0,
y2 by jci, yi by x2, and then ZQ by yo, z\ by yi, and z2 by y2 we obtain a new algorithm for our original problem. This algorithm cannot be obtained by making another choice for the algorithm computing to, t\, and ?2. In order to better understand this new algorithm we will recast the problem. Let A be the matrix: and therefore The matrix A was chosen so that Ax (where x = (jc0, *i, x2}T) has the coefficient of (XQ + XIU +JC2« 2 ) mod (u — 1) as its first component, and the coefficients of (x0 + Xi+x2u2) mod (u2 + u + 1) as its last two components. Designate by X the matrix
PRODUCT OF POLYNOMIALS MODULO A POLYNOMIAL
59
by y the vector y = (y0, yi, y 2 ) T and by z the vector z = (z0, Zi, z 2 ) T we see that
where and y 2 = yi - y2. That means that computing Az is the same as computing the two independent problems: JCQ • yo, and the coefficients of (x( + x 2 «)(yi + y'2u) mod (u + u +1). It should be emphasized that this transformation of X into block diagonal matrix is another way of formulating the Chinese remainder theorem. Let us examine the effect of the transformation A on the algorithm. The identity underlying the algorithm can be written as:
where and
Therefore:
We see that this algorithm (which we obtained by taking the transpose can also be viewed as using the Chinese remainder theorem, even though it was not constructed this way. The only difference between the two algorithms is the way they compute the coefficients of ((x n + l) = (jci +y 2 «)(yi +y2u)mod(u2 + u + l) = t'0 + t'iu. In terms of x'\, x'2, yi, and y2 we can write m2 = \(x'\ ~x'2)(y( + y 2 ) , m3 = 3*i(yi-2y2), and m4 = 3*2(2y{ -y 2 ), and to =2m2 + m3 + m4, t[ = m2-m3 + 2rri4. To sum up this example, we saw that the Chinese remainder theorem can be used to break up the problem of computing the coefficients of Q(u) = R(u) • S(u)modP(u) into its constituent part (one for each irreducible polynomial which divides P(u}). We saw also that we can interpret the Chinese remainder theorem as providing us with a nonsingular transformation of the coefficients of Q(u) which transforms the problem into the disjoint problems (in the sense that all the variables are independent) of computing the coefficients of a product of polynomials modulo a power of an irreducible polynomials. We also saw two different methods for computing the coefficients of the product of polynomials modulo an irreducible polynomial. In the next subsection we will discuss these kinds of constructions in general terms.
60
CHAPTER VI
VIb. Multiplication modulo irreducible polynomial. Let R(u) = Z"=o xtu'
S(w) = £"=o yiu> be two polynomials with indeterminate coefficients. Let P(u)e G[w] be P(u] = un +Z/TO giu> be a monic polynomial with coefficients in G. The set of coefficients of Q(u) = R(u) • S ( u ) mod P(u) will be denoted by CF(P); if we would want to emphasize that the coefficients of P(u) are in G we will denote it by CFCP; G). In this subsection we will consider the computation of CF(P; G) where P(u) = Q(u)1, O(u] an irreducible polynomial. We will first derive the multiplicative complexity of CF(Q!; G), and then describe all the bilinear algorithms for computing CF(P, G) (P(w) irreducible) which use the minimum number of m/d steps. THEOREM 1. Let Q(u] be an irreducible polynomial and let P(u] = Q(u) . If
kdklclnklnclnsdklnlokljlfjeijnthend;iudsej
We will not prove the theorem here (for a proof see [12]), but devote this subsection to classifying all minimal bilinear algorithms which compute CF(P) where P is an irreducible polynomial. Let R(U) = ^"~Q xtu\ S(u] = Z"=o y« M ' be two polynomials with indeterminate coefficients, and let P(u) = u" +Z"=0 &iUl be a monic polynomial with coefficients in G. In subsection Via we saw one method for generating algorithms for computing CF(P) which uses In — 1 m/d steps. By Theorem 1 these algorithms are minimal. Method 1. First compute the coefficients of R(u) • S(u) using 2n — 1 m/d steps. This can be done by any of the algorithms of Theorem 1 of § IVb. Next reduce the R(u) • S(u) modulo P(u). The second half uses no m/d steps. The algorithms of Method 1 are based on algorithms for computing the coefficients of the product of polynomials, and therefore the discussion of § IVa is applicable to these algorithms as well. In particular, if degP is of even moderate size the algorithms for computing CF(P) which are generated by Method 1 will have large coefficients. It is important, therefore, to find other ways for generating minimal algorithms. We will describe two other methods for generating algorithms. We will first illustrate each of these methods by an example, and then state the general method. Example 1. Let R(u) = x0 + xiu, S(u) = y0 + yiu, and P(u} — u2 + u + l. Using Method 1 we can obtain an algorithm for computing R (u) • S(u) which is based on the identity: and therefore we obtain the following identity: We will now modify this identity to obtain another algorithm. We start with the identity 3(1 - u)(l + u)(l - u) = 1 mod u2 + u + 1. We therefore obtain R(u) • S ( u ) mod P(u) = i(l - u)[R'(u) • S ' ( u ) mod P(u)] mod P(u\ where and £'(«) = (l-w)S(w) mod
PRODUCT OF POLYNOMIALS MODULO A POLYNOMIAL
61
We can use the algorithm of Method 1 to compute R'(u) • S'(u)modP(u). Performing the substitutions we obtain R'(u] • S'(u) modP(u) = (mi - ra2) + (m3 + m\)u, where
Therefore R(u) • S(u) mod P(u) = |(1 - u)[R'(u) • S'(u) mod P(u)] mod P(u) = The reader may easily verify that this algorithm is the second one used in § Via. The construction of the example is quite general. We can summarize it by: Method 2. Choose two polynomials pi(w), p 2 (w) with coefficients in G such that degpi, degp 2 <degP. (That is, pi(u), p 2 ( u ) E G [ u ] / ( P ( u ) ) . ) Choose the polynomial p 3 (w)e G\u]l(P(u)} such that Pi(«)p 2 («)p 3 (w) = 1 modP(w). The polynomial p 3 («) always exists and is unique, because P(u] is irreducible. Compute the coefficients of R'(u) = pl(u}R(u)modP(u), and S'(u) = p2(u}S(u) modP(u). The coefficients of R'(u) and S'(u) can be computed without using an m/d step. Compute the coefficients of T'(u) = R'(u)S'(u) modP(w) using an algorithm of Method 1 which takes In — 1 m/d steps. Finally compute the coefficients of T(u)=p3(u)T'(u) modP(u). This last part does not use any m/d step. Because Pi(u}p2(u}pz(u} = 1 modP(w) we have that T(u} — R(u}S(u] modP(w). Example 2. Let P(u) be P(u) = u2+l, and Q(v) = v2-2v+2. The roots of P(u) are / and — /, and those of Q(v) are 1 + / and 1 — /. If we choose G to be the field of rational numbers then the field G[u]/(P(u)} is isomorphic to the field G[v]/(O(v)). One such isomorphism is a: G[u]/(P(u))^G[v]/(Q(v})where cr(a +bu) = (a —b) + bv and o-~1(c+dv) = (c+d) + dv. We will now construct an algorithm for computing the coefficients of (XQ + o + y i « ) m o d u2 + l. Using the isomorphism a we have (x
0 + y\v) mod (v2-2v+2)). Let OT(XO + 0 + y i M ) = ( y 0 - y i ) + y i U = yo +y(v. 2If
we use Method 1 for computing the coefficients of (x'o +x(v)(y'o +y(v) mod v — 2v+2 we obtain: (x'o +x(v)(y'0 +y'\v} — mi + (mi + m 2 + m3)u + m^v 2 where Wi = x'oy'o = U 0 -xi)(y 0 - yi), m 2 = (XQ -x\)(y\-yo) = (x0-2*i)(2yi - y0), m3 = x{yi=xiyi.Reducing
modulo
v2 — 2v + 2,
we
obtainT'(V)-(XO+X\V)-
(yo + yi^)mod (v2-2v+2) = (rai-2ra 3 ) + (rai + m 2 + 3m 3 )y. We now apply cr^1 and obtain (x0 + x i u ) ( y 0 + yiu) mod (u2+ 1) = c r ~ l ( T ' ( v ) ) = (2mi + m2 + m 3 ) + (mi + m 2 + 3m3)«. Method 3. To compute the coefficients of R(u) • S(u) mod P(u), choose a polynomial Q(v) such that G[u]/(P(u)) is isomorphic to G[v]/(Q(v)). Let cr: G[u]/(P(u))-> G[u]/(Q(v)} be any isomorphism. Compute the coefficients of R'(v) = a(R(u)) and S'(v) = cr(S(u)). This part uses no m/d steps. Next compute the coefficients of T'(v) = R'(v) = R'(v)S'(v) mod Q(v) using an algorithm which has 2 n - l m / d steps. This algorithm may be derived by Method 1. Finally, compute the coefficients of o-~l(T'(v)). This last part does not use any m/d steps. The proof of the following theorem is given in [15].
62
CHAPTER VI
THEOREM 2. Let P be an irreducible polynomial. Any bilinear algorithm for computing CF(P) which uses 2 deg (P—l) m/d steps is derivable by one of the three methods described above. If we examine the three methods we can see that they rest ultimately on our ability to compute the coefficients of the product of two n-\ polynomials using 2n — 1 m/d steps. As an immediate consequence of this observation and of the corollary of § IVb we have: COROLLARY. Let P be an irreducible polynomial. If \G < 2 d e g ( P ) - 2 then /Z(CF(P);G)>2degP-l. Vic. Multiplication modulo a general polynomial. As we saw in § Via the computation of CF(u3— 1) is equivalent to the computation of CF(u — 1o CF(u2 + u + l). In general, if P(w) = Zjc=1 P,-(M)'-, where the P|'s are the distinct
irreducible polynomials which divide P(u}, then CF(P) =CF(P(l)®CF(Plf)
• • -®CF(Pl£}. As was mentioned in the end of § IIIc, we do not, in genera know whether (j,(S®S') = /u,(S) + /u(S') for arbitrary systems of bilinear forms 5 and S'. However, in the special case that the systems under consideration are CF(P)'s, the strong version of the direct sum conjecture can be proved. (See [12].) THEOREM 1. Let Pi(u) e G[u] be irreducible, i - 1, 2, • • • , k, and let Q,-(«) = Pi(u)\ If |G|^max{2degQ,}-2 then /a(CF(O1)©CF(O2)©- • -®CF(Qk)) = Ai(CF(Qi))©- • •®fji(CF(Qk))= (2^=ldeg(Qi))-k. Moreover, every minimal algorithm computes each of the CF(Q,)'s separately, that is, every minimal algorithm is a direct sum algorithm. The proof of the first half of Theorem 1, which will not be given here, proceeds by first showing that for every G^(CF(Ql}@- • -©CF(Q fc )) = (2 I*=1 deg Qi)-k, and then uses the fact that /i(CF(Oi)+ ©• • •®CF(Qk)}^ £; = i /Lt(CF(Oi)) to show that the inequality is indeed an equality. The second half of the theorem proceeds by showing that every algorithm which uses (2 £i=i deg Qi) — k m/d steps must be a direct sum algorithm. We therefore obtain the following corollary: COROLLARY 1. // |G|<2 max {deg (?,-}-2 then /x(CF«?i)©- • -©CF(O k ))> (2 If.! deg (?,-)-*. As was mentioned earlier, p(u) = l\i = 1Pi(u)li where CF(P} is equivalent to CF(P1J)®- • -®CF(Pl£). This equivalence is an immediate consequence of the Chinese remainder theorem, and was the basis of the algorithm derived in § Via. This equivalence enables us to use Theorem 1 to analyze the multiplicative complexity of CF(P) for any polynomial P(u) e G[u]. COROLLARY 2. LetP(u) = FI,=i Pf(u)li, where the P/'s are the distinct irreducible polynomials which divide P(u). If |G|^2 max {deg (P|;)} ~ 2 then n(CF(P}} = 2 deg P — k. Moreover, every minimal polynomial can be derived by the use of the Chinese remainder theorem. It should be emphasized that Theorem 1 of this subsection as well as Theorem 1 of § VIb were proved under the assumption that CF(P] is a system of bilinear forms. That is, they pertain to the computation of the coefficients of T ( u ) = R(u)-S_(u)modP(u), where R(u) = U"=o *«•"'', S(«) = ir=d y.-«'» **(") = w" +Z"=o a'u' e G[M], and the jc,'s as well as the y/'s are indeterminates. In some
PRODUCT OF POLYNOMIALS MODULO A POLYNOMIAL
63
of the applications of the next section we will have to consider the problem of computing the coefficients of (Z"=o /i w ')(Z"=o y< w ') m°d P(")> where the/ ; 's are in some field F which is an algebraic extension of G. (The y,'s are still indeterminates.) We will denote the set of coefficients of R(u] by f, and the set of coefficients of T(u) by CF(P, f). We clearly have /x(CF(P, f)) ^pi(CF(P)), as we can take any algorithm for computing CF(P) and substitute the appropriate elements of F for the jt,-'s. It is also clear that for certain choices of f we have M,(CF(P, f)) < /it (CFCP)). (For example, if all the coefficients o f R ( u ) are in G then /*(CF(P,f)) = 0.) If Pi(u) = un(i} + Z"L'olaiiuieG[uli = l,2,'--,k, are a set of polynomials, we will consider the computation of the set of all the coefficients of the polynomials TI(M), T2(u), • • • , T k ( u ) whereTj(u}=* (Z"=o /iX)(Z"=o ] yi/w') m°d P/(w). As before, we assume that the /,,'s are in F, and the y,/'s are all distinct indeterminates. By a slight abuse of notation we will denote the set of coefficients of all the 7}(«)'s by CF(Pi, fi)©CF(P 2 , t2)@- •-® CF(Pk, f fc ). The rest of this subsection is devoted to deriving a lower bound on the multiplicative complexity of this problem. Let R(u) = Z^o ffu' and R'(u) = 1^0 f-u satisfy R (u)-R'(u) £ G[u], that is, /,--/,'eG, / = 0, 1,- • - , « - ! . If P ( w ) e G [ M ] is of degree n, and 5(w) = I^dy.M 1 , then R(u)-S(u)modP(u) = R'(u}S(u) modP(u) + (R(u)-R'(u}) modP(u). We can therefore conclude that /u(CF(P,f)) = /Li(CF(P,f')). This observation motivates us to consider f as a vector space over G, and to denote by p: F-+F/Gthe natural homomorphism. Using this notation, the observation just made means that ya(CF(P, f)) depends on P and on the set p(f) (and not on f). We are now ready to state the main result. (See [17.) THEOREM 2. Let PI(M),- • • , P k (w) e G[u] be irreducible polynomials. Let d be rf=dimLG(p(fi)Up(f2)U-• • Up(ffc)). If for every], dim LG(p(f,)) ^1 (that is, no f/ consists of only elements of G) then p(CF(Pi,fi)©- • •©CF(P f c ,f k ))^rf + iJ^degP, Of course, as was remarked earlier, (ji(CF(Pi, fi) + - • - + CF(Pk, f fc )) = fjL(CF(Pi)@- • • ©CF(Pfc)) = 2 I;k=1 deg Py - k. Therefore, when d is as large as i can be, that is when d=^j=l deg P7- we have: COROLLARY 3. // dim LG (p (f x) U • • • U p (f fc)) = Iyk=: deg Py, r^en At(CF(P 1 ,f 1 )©---©CF(P k ,f f c )) = 2l* : =1 degP / --fc. This corollary is a generalization of the first part of Theorem 1 of this subsection. If all the coefficients of all the /? 7 (w)'s are distinct indeterminates, and if we take for F the field G extended by all these indeterminates, then CFCP!,!!)©- • •©CF(Pfc,!k) = CF(Pi)©- • -©CFCPk), and Corollary 3 is the statement of the first half of Theorem 1 . VId. Multiplication modulo several polynomials. Up to now we considered the problem of computing the coefficients of the product of two polynomials of one variable modulo a polynomial. In some of the applications of the next section we will need to know the multiplicative complexity of the computation of the coefficients of the product of two polynomials of several variables modulo several polynomials. As we will see, we already have all the necessary tools for
64
CHAPTER VI
determining the multiplicative complexity of this problem. More specifically, we will see that computing the coefficients of the product of two polynomials modulo several polynomials is equivalent to the direct sum of several CF(P)'s. We can therefore determine the multiplicative complexity by using Theorem 1 of § Vic. First, we will describe this equivalence by an example, and then give the general formulation. Example 1. Let R(u, v) = Xo + xiu+x2v+X3itv, and S(u, v) = yo + yiw + yau + yawu. We wish to compute the coefficients of R(u, v}S(u, v] mod ((u2 + u + 1), (v2 + v + l)). In other words, we view R(u, v} and S(u, v) as members of the ring G(x, y)[u, v ] / ( u 2 + u + l,v + u +1), and wish to compute the coefficients of this product as members of this ring. Direct calculations show that R(u, v)S(u, v) mod ((u2 + u + l), (v2 + v + 1)) =
The system of coefficients, z0, z\, z2, 23, can be written as:
If we partition the matrix into 2 x 2 blocks we see that this system can be written as
That is, this system of bilinear forms in the tensor product It is tempting to conclude that (L(zQ, z/, z 2 , z3) = f l ( C F ( u 2 + u -1-1)). p,(CF(u2 + u + 1)) = 3 • 3 = 9. (We assume that G is the field of rational numbers. But as was mentioned in § IHc, there are cases for which /Z(S®S") + !)• For this G we have n(CF(u2 + u + 1); G) = 2, and following the process described in § VIb we obtain the following
PRODUCT OF POLYNOMIALS MODULO A POLYNOMIAL
65
algorithm: where
and
We will use these two algorithms to construct the desired algorithm for the original z0, Zi, z 2 , and z3. To do that we define Z0 = zo + zi, Zi = z 2 + 0z3, Yo = yo + 2 + 4>y3» X0 = JCo + <^JCi, and X2 = jc2 + >jc3. In terms of the capital variables we can write our system of four bilinear forms as:
(The reader can easily verify that what was done is just a rewrite of the "partitioning" which was done earlier in the example. This is the reason that we used the same symbols.) If we now use the second algorithm, which has two m/d steps, we obtain where:
and
We can now use the first algorithm to compute MI = Mi,i + >Mlj2 and M2 = M2,i +<£M2,3. We thus obtain: where
Substituting these values for M\ and M2 we get:
66
CHAPTER VI
Equating coefficients of 4> we get:
This is the desired algorithm. The method used to derive the algorithm also enables us to prove that it is minimal, namely, that ^t(z0, Zi, 22, z3; O) = /Z(z0, ^i, 22, 23; O) = 6, where Q is the field of rational numbers. To see that we again start with the field Q(0), and the system CF(u2 + u + 1). In this field w 2 + w + l = («-0)(« + l + 0), and if we want to use the Chinese remainder theorem we take every polynomial a + ub into the pair of polynomials (a +ub] mod (u — 0) = a +4>b and (a + ub) mod (u + l + 0) = ( a — 6) — 0&. Let A be the matrix which describes this transformation of the coefficients of the polynomials, that is:
We can now write the system CF(u2 + u +1) as
We now return to the problem at hand. We wish to compute the system
To construct the new matrix A we substitute the companion matrix of w 2 + u +1 for an occurrence of 0 in A (and the 2 x 2 identity matrix for an occurrence of 1). We thus get
PRODUCT OF POLYNOMIALS MODULO A POLYNOMIAL
Using this A we have z' - Az = (AX(A)
l
67
)Ay = X'y' where
That means that the system z0, Zi, z2, z3 is equivalent to CF(u2 + u + \}@ CF(u + M + 1). Theorem 1 of § Vic guarantees that /u,(z 0 , Zi, z 2 , z 3 ; O) = /Z(z0, zi, z2, z3; O) = 6. The procedure used in the example is quite general, and we will outline the general results. We consider the problem of computing the coefficients of
where P(u)e G[u] is a monic polynomial of degree n, P'(v)e G[v] is a monic polynomial of degree m. By extension of notation we will denote this system by CF(P,P'}. Another way of describing this system is by CF(P,P') = CF(P) ® CF(P'). We will first consider the case that P(u} and P'(v) are irreducible polynomials (over G). In the following we will assume that G is a perfect field which has enough elements. In particular, if G is of characteristic 0 it satisfies our conditions. Let G' be the field G' = G[v]/(P'(v)), that is, G' is obtained by augmenting a root a of P'(v) to G. The polynomial P(u) is not necessarily irreducible in G', and we will assume that in G'P(u} = Y[i=lPi(u) where each polynomial P/(M) is irreducible (over G'). (Note that the assumption that G is a perfect field assures us that the P/(w)'s are distinct.) We now construct, for each /', the field G] = G'[«]/CP/(«)>, that is we augment G' by a root 0, of P/(M). The field G- is an algebraic extension of G, and therefore is a simple extension of G. That is, for every; = 1,2, • • • , k there exists an element y; such that G,' = G(y/). Let Qj(u) be the minimal polynomial of y7. We are now ready to state the general result. (See [16].) THEOREM 1. LetP(u] and P'(u) be two monic irreducible polynomials over the field G (which is of characteristic 0). Then CF(P,P') is equivalent to CF(Ql)@ • • -®CF(Qk), where the Q/'s and k are as described above. Therefore /Z(CF(P,P')) = M(CP(P,P')) = I/ k .i/*(CF(0 / )) = 2degP-degP'-A:. The equivalence of CF(P, P') to the direct sum of the CF(O;)'s means that Theorem 1 enables us to determine the multiplicative complexity of computing the coefficients of the two polynomials of several variables modulo several
68
CHAPTER VI
(irreducible) polynomials. We will illustrate this by analyzing the multiplicative complexity of CF(P, P', P"), i.e., of computing the coefficients of
where P(u) e G[w] is irreducible of degree m, and P"(w) e G[w] is irreducible of degree /. The system CF(P, P', P") can be written as CF(P, P')®CF(P"). By Theorem 1 we have that CF(P,P',P") = (CF(Qi)@CF(Q2)®- • -©CF(Q fc ))® CF(P") = CF(Q1)®CF(P")®(CF(Q2)®CF(P"))®---®(CF(Qk)®CF(P")) = CF(Qi,P")©CF(Q 2 , P")©- • -®CF(Qk,P"). Again, we can use Theorem 1 to replace each CF(Q/, P") by an equivalent system which is the direct sum of CF«?fc)'s. Altogether we see that CF(P, P', P") is equivalent to the direct sum of CF(Oi)'s. The multiplicative complexity of CF(P, P') for arbitrary P(u) and P'(v} can be determined, as in § Vic, by the use of the Chinese remainder theorem. This is summarized in the following theorem. (See [6].) THEOREM 2. Let P(u) = l^=i QM and P'(i;) = n,'=i Q'i(v), be the decomposition ofP(u) and Q(v) into powers of irreducible polynomials. Then CF(P, P') is equivalent to Z/=i Z,- = i CF(Qt, Q/) (where the summation indicates the direct sum). In the case that each Q,(«) and Qj(v] is an irreducible polynomial (rather than a power of irreducible polynomials], we can further replace each CF ((?,-, Q/) by the direct sum of CF(P)'s, as described in Theorem I of this subsection, and then evaluate the multiplicative complexity by Theorem 1 of § Vic. Example 2. To indicate the constructive nature of the theorem we will analyze CF(u3-l,v3-l). CF(u3—l, v3 — 1) is the system
which we designate by z = Xy. Let A be, as in § Via, the matrix
PRODUCT OF POLYNOMIALS MODULO A POLYNOMIAL
69
and let A = A ® I be the tensor product of A with the 3 x 3 identity matrix. The system z' = Az = (AX (A)~l)(Ay) = X'y' is equivalent to CF(w 3 -l, u 3 -!). Direct computation shows that X' is the matrix
where
That is, CF(«3-l,t>3-l) is equivalent to CF(u -1, v3-l)®CF(u2+u4-1,
v3 — l). Since « -1 is a first degree polynomial, CF(u -1, y 3 -1) is the same as CF(u 3 -l). Let A = I®A be the tensor product of the 3 x 3 identity matrix / with A. The
system 2," = Az' = (AX'(A)~l)(Ay')is equivalent to z' = X'y' and therefore also to
CF(u3 — 1, v3 — 1). Direct calculations show that this system is CF(u — 1, v -1)© CF(u-l, v2 + v + l)®CF(u2 + u + l, v-l) + CF(u2 + u + l, v2 + v + l). As we saw in Example 1 of this subsection, CF(u2 + ii + l, v2 + v + l ) is equivalent to 2CF(w 2 + « + l). So altogether CF(w 3 -l, v3-l) is equivalent to CF(w-l)© 4CF(w2 + w + l). That means, by Theorem 1 of § Vic, that ^(CF(u3-l, u 3 -l)) = 13. We will end this subsection by illustrating how the ideas presented so far can be used to derive an approximate algorithm. Example 3. Let P(u] = u8-u7 + u5 — u4 + u3 — u + 1, which is the minimal polynomial for e2™/15. By Theorem 1 of § VIb, ^(CF(P}\ O) = 15, where O is the field of rational numbers. By Theorem 2 of § VIb every minimal algorithm for this computation has large constants. We will show a way for deriving a suboptimal algorithm which exploits the structure of the field Q(e2m/l5}. As is well known, Q(e2m/l5) is the same as Q(e2m/3, e2™/5). Because the minimal polynomial for e2m/3 is u2 + u + 1, and the minimal polynomial for e2m/5 isv4 + v3 + v2 + v + 1, we
know that CF(P) is equivalent to CF(u2 + u + 1, v4 + v3 + v2 + v +1) CF(u2 + u + l)®CF(v4 + v3 + v2 + v + l}. That means that CF(u2 + u + 1, v4 + v3 + v2 + v + l] can be viewed as computing the system
where z, =z,-o + 02';i» *; = *jo + $*ii> and y, = y,o + <£yii» and if > satisfies 0 +(f> + 1=0. The field Q(<£) has 7 elements whose magnitude does not change when we
70
CHAPTER VI
raise them to higher and higher powers, namely, 0, 1, —1, <£, —0, <£2, — (j>2. Therefore, if we take G = Q(4>) we can compute z0, Zi, z 2 , 23, and z4 using 7 m/d steps without incurring large coefficients. Each m/d step is a product of two linear polynomials modulo « 2 + w + l, i.e., can be done in 3 m/d steps over Q. Altogether we have an algorithm over Q which has 7 - 3 = 21 m/d steps. (The best the author has succeeded in doing using the heuristic methods of § IVc was obtaining an algorithm which has 24 m/d steps.) The reader is advised to compare the construction of this example with the one of Example 1 of this subsection.
CHAPTER VII
Cyclic Convolution and Discrete Fourier Transform In this section we will describe two applications of the results of § VI. The first application is that of computing the cyclic convolution of two vectors, and the second is the computation of the discrete Fourier transform (DFT). Vila. Cyclic convolution. There are two ways of defining the cyclic convolution of the two vectors x = (x0, x\, • • • , xn-\)T and y = (y0, yi, • • • , yn-\)T, which yield the vector z = (z0, Zi, • • • , zn-i)T. The first way is z, = / = 0, 1, • • • , n — 1, where i —j is understood to mean (/ —/') mod n. It is readily seen that this is nothing but the computation CF(u" — 1). The second way is: z, = ]T"=o *i+/y/, / = 0,1, • • • , n — 1, where / +j is understood to mean (/ +;') mod n. If we write the second way as z = Xy, the (/, ;)th entry of X is Xi+j, where (/ +/) means (/ +/) mod n. (i, j = 0,1, • • • , n — 1.) Because the indices of the jc's in X form the table of "addition modulo n," we will denote the second system of bilinear forms by CF(Zn). If we denote y n _ 7 - by i7/(/' = 0, 1, • • • , n-1, and take yn = y0) we see that n -1) is equivalent to CF(Zn). These two systems are also related in another way, which is more important for deriving algorithms: Each is a transpose of the other. To take the transpose of £"=o *<-/y; we form the trilinear form and we see that the coefficients of y, are (/ = 0, 1, • • •, n -1), that is, CF(Z n). In many applications we can compute the cyclic convolution of many vectors x with the same vector y (the vector y may, for example, be the values of the taps of a filter). That means that computations which involve only the entries of y are not to be counted. In particular we may be able to tolerate larger constants as coefficients of the y/s. Because CF(un — 1) and CF(Zn) are a transpose of each other we are able to arrange it so that the "not so nice" coefficients will be those of the y/'s. Instead of describing the procedure in general we will give an example. Example 1. We will derive an algorithm for computing z, the cyclic convolution of x and y, where x = (x0t xit x2, x3, x4)\ y = (y0, y i, ya, y3, y4)T, and z = (z0, Zi, Z2, z3, z4)r. The algorithm will be over the field of rational numbers. By the Chinese remainder theorem the algorithm will be built up from two parts, the first being that of Oto + JCi + X2 + *3 + x 4 )(yo + yi + y2 + y3 + y4) m°d («~ l) = (*o + *i + *2 + *3 + *4)(yo + yi + y2 + y3 + y4) = m0. The second part is that of 0 -x4) + (xi- x4)u + (x2 -x4)u2 + (x3 ~x4)u3)((y0- y4) + (yi - y*}u + 2 - y4)«2 + (y3 - y4)w3) mod (u4 + u3 + u2 + u + 1) = R(u) x S(u) mod (u4 + u3
M 2 + W + 1).
71
72
CHAPTER VII
We will obtain the algorithm for the second part by first computing the coefficients of R(u)xS(u) and then "reducing" modulo u4 + u3 + u2 + u + l. The
algorithm for computing the coefficients of R(u)xS(u) will be obtained by iterating an algorithm for the product of two linear polynomials, as explained in § IVc. We start with the algorithm (a0 + aiu)((3o + (3iu) = (ao + aiM)(j8 0 + £i«) mod (u(u + !)(« - oo)) = a0/8o + ((«o- ai)(£i -0o) + "o^o + ai)8i)w +ai/Si«. Iterating this algorithm to obtain the coefficients of R(u)xS(u) and then reducing modulo u4 + u3 + u2 + u + l we obtain that t0 + tiu + t2u2 + 3u3 = R(u)xS(u)mod(u4 + u3 + u2 + u + l ) is given by:
where:
By the Chinese remainder theorem we obtain that
Taking the transpose by equating coefficients of the jc,-'s of the tensor and then replacing y, by xt and finally z, by y, we obtain that
CYCLIC CONVOLUTION AND DISCRETE FOURIER TRANSFORM
73
where
That is, we derived an algorithm having 10 m/d steps and 31 additions. Agarwal and Cooley [13] used heuristic methods to derive algorithms for computing the cyclic convolution of two n-dimensional vectors. Their algorithms are summarized in Table 1. TABLE 1 N
No. Mult.
No. Add.
2 3 4 5 7 9
2 4 5 10 19 22
4 11 15 31 58 71
We observe that the values of n in Table 1 are all powers of a prime number. The method used to derive these algorithms is not limited to these cases only. It can be employed also when n has more than one prime divisor. In this latter case there are other methods which enable us to derive heuristic algorithms. Let n=pxq where ( p , q ) = l. It can be easily verified that in this case CF(un — 1) =
74
CHAPTER VII
CF(up — 1, vq — 1), and that this equivalence is obtained by simply renaming the variables. This is an easy consequence of the Chinese remainder theorem for integers; or equivalently, of the fact that Zn = Zp®Zq. Assume we have an algorithm for computing CF(up — 1) which has Mp multiplications and Ap additions. Assume further that we have an algorithm for computing CF(uq — 1) which has Mq multiplications and Aq additions. By combining these two algorithms we can obtain an algorithm for computing pxMq multiplications and pxAq+ApxMq additions. We can, of course, choose which factor we name p and which q. This choice is q+ApxMq^qxAp+Aqx Mp, or equivalently (Mq - q}/Aq ^
p-p}/Ap. We can use this construction to obtain an algorithm for the cyclic convolution of two 15-dimensional vectors which has 4 x 10 = 40 multiplications (and 11x5 + 4x31 = 179 additions). This algorithm is by no means the best we can achieve. Using
the
fact
that
CF(w15-l) = CF(u -l)®CF(u4 + u3 + u2 + u + l)@
CF(u2 + u + l)®CF(us-u7 + u5-u4 + u3-u + l), we can compute 4
3
CF(u-l)
2
using one m/d step, CF(u + u + u + u + l ) using nine m/d steps (and small coefficients), CF(u2 + u + l ) using three m/d steps, and CF(u8 — u7 + us — u4 + u — M + l) using 21 m/d steps (as was explained in Example 3 of § VId). Altogether we can obtain an algorithm having 34 m/d steps. Example 2. Consider the computation of CF(Z6). This system can be written as:
By the Chinese remainder theorem for integers, the ring R6 of integers modulo 6 is isomorphic to R2xR3. The isomorphism is given by: 0 —(0, 0), 1 —(1, 1), 2-(0,2), 3-(1,0), 4-(0, 1), 5-(l, 2). If we arrange the z,'s and the y/s i lexicographical order, and rearrange the rows and columns of the matrix accordingly we see that CF(Z6) can also be written as:
That is CF(Z6) = CF(Z2) x CP(Z3
CYCLIC CONVOLUTION AND DISCRETE FOURIER TRANSFORM
75
We could have also arranged the indices in "right to left" lexicographical order, and would have obtained that CF(Z6) can be written as:
That is, CF(Z6) = CF(Z3) x CF(Z2). In § Via we derived the following algorithm for computing CF(Z3):
where m\ = 3(*o + *i+*2)(yo + yi + y2), m 2 = 3(*o-*2)(yo + yi-2y 2 ), m3 = |(*i - *o)(-yo + 2yi - y 2 ), and m4 = l(*i - x2)(-2y0 + yi + y2). This algorithm uses 4 multiplications and 11 additions. We can easily derive an algorithm for computing CF(Z2), namely:
where mi=5(*o + *i)(yo + yi), and m2 = 21(*o-*i)(yo-yi). If we use the relation CF(Z6) = CF(Z2) x CF(Z3) we obtain an algorithm for CF(Z6) by starting with the algorithm for CF(Z2) and replacing each addition by 3 additions and each multiplication by the computation CF(Z3). We thus obtain an algorithm which has 2 - 4 = 8 multiplications, and 4 - 3 + 2 - 11 = 34 additions. If, on the other hand, we use the relation that CF(Z6) = CF(Z3) x CF(Z2) we obtain an algorithm which has 4 • 2 multiplications and 11-2 + 4 - 4 = 38 additions. The first algorithm is, of course, the preferred one. We would have seen that the first algorithm is preferable because (M2 - 2)/A 2 = (2 - 2)/4 = 0, while (M3 3)/A3 = 4-3/ll = l/ll. The algorithm for CF(Z6) is derived by first computing
76
CHAPTER VII
then computing \X'y' + \X"y" and \X'y'-\X"f. algorithm for CF(Z6):
We obtain the following
where
Agarwal and Cooley used this method of combining algorithms to obtain algorithms for the cyclic convolution of two n -vectors. Some of their results are summarized in Table 2. TABLE 2 n
No. Mult.
12 20 30 60 84 120 180 210
20 50 80 200 380 560 1,110 1,520
No. Add.
100 230 418 1,120 1,860 3,096 5,620 7,566
Vllb. DFT(/i) — n prime. Let a= (a0, a\, • • •, an-i)T be an n -dimensional vector. The discrete Fourier transform of a is the vector A = (A0, A\, • • • , A n _i)
CYCLIC CONVOLUTION AND DISCRETE FOURIER TRANSFORM
77
given by
where o> = e2m/n. In § lib we described one algorithm for computing the discrete Fourier transform of n -dimensional vectors (denoted by DFT(n)) where n is a composite number. In this section we will derive other algorithms. We will start at the point not covered by § lib — treating the case that n is a prime number. We can write DFT(rc) as A = Wa where W is the nxn matrix given by ( W)u = a "(0 g /, / ^ n - 1). We denote by W the (n - 1) x (n - 1) matrix obtaine by striking out the first row and first column of W, that is (W\/ = W;,/(l ^i,j^n— 1). Since con = 1 we have (TV),-,/ = a)k where k = ij mod n. When n is a prime number, the set of positive integers smaller than n, with the operation of multiplication modulo n form a group. We will denote this group by Mn. As is well known Mn is isomorphic to Zn-\ — the cyclic group of n — 1 elements. Let l:Zn-i-*Mn be the isomorphism, then there exists an (n — l ) x ( n — 1) permutation matrix n with (11),-.,- = 1 if ; = /(/) (0 ^ / ^ n - 2, 1 ^/ ^ n - 1) and (II), otherwise, such that HA' = (II WTT^IIa'. (We use A^ to denote (A1? - •_ • , A n _!) T and_a' to denote (ai, a 2 , • • • , fl«-i)T). The matrix UWIT1 satisfies (UWlT\k = (nwrr1),-..*. whenever i + k =i' + k' mod (n -1), O^z, /', fc', fc'^n-1. Summing it up we see that computing Wa.' is equivalent to computing the cyclic convolution of a vector w (with entries powers of a] and the vector (Ha'), where II is some permutation matrix. With an eye to what is coming we will not compute the vector A' but the vector A = (Ai— A0, A 2 — A0, • • • , A n _ i — A0), where A = ( W — /)a' (/ denotes the matrix all of whose elements are 1). The same arguments show that A is equivalent to the cyclic convolution of a vector w' whose entries are powers of a) minus 1. We can now use the algorithms developed for cyclic convolutions to compute DFT(n). Example 1. Let n =5. We will denote the elements of MS by mi, W2, m^, m* where m, • m/ = mfc if and only if / • / mod 5 = k. Let 0, 1, 2, 3 denote the elements of Z* —the group of addition modulo 4. Then one isomorphism between the two groups is /(O) = mi, /(I) = m 2 , /(2) = m4, and /(3) = m3. We thus have:
where a) = e2m/5. That is (A\—AQ, A 2 -A 0 , A 4 -A 0 , A 3 -A 0 ) = (w -1, cu 2 -!, w 4 -!, w 3 -!)® (a 1? a 3 , a 4 , a2) (where 0 denotes the cyclic convolution). To derive the algorithm for DFT(5) we first derive an algorithm for computing the coefficients of £ / = O Z,-M' =(L/ > =o *i M ')(Z,-=o Vi"') m°d u"-\. Following the
78
CHAPTER VII
procedure of § Vic, we obtain
where rai = 4"Uo + *i + *2 + x 3 )(yo + yi + y2 + y3), w 2 = 4~(*o-*i+*2-* 3 )(yo-yi + y 2 -y 3 ), m3 = |(jco-x2)(yo + yi-y2-y3)» m4 = 2(*o + *i-*2-*3)(yi-y3), and m5 = (XQ - *i - x2 + * 3 )(yo - y 2 ). Substituting CD -1 for XQ, w 2 -! for jci, w 4 -! for jc2, w 3 -! for jc3, a\ for y0, a 3 for yi, a 4 for y2, and a 2 for y3, and recalling that oj+o» 4 = 2cos u, w +o> = 2cos2f, w-<w 4 = 2/ sin v, and w 2 - w 3 = 2/ sin (2-u), where v = 2m/5, we obtain
where
If we denote by m0 the "product" m0 = 1 • (a0 + ai + a 2 + a3 + a4) (the reason for calling m0 a product will become apparent when we discuss composite n), we obtain
We have thus derived an algorithm for DFT(5) which has 17 additions, and 6 "multiplications". Of the six multiplications two are by rational numbers; namely, m0, and mi. (Recall that l + w + o » 2 + <w 3 4-o> 4 = 0, and therefore (cosu + cos (2v})/2 — 1 = —5/4.) So the algorithm A we derived (over the field G of the rational numbers) satisfies /u,(A, G) = 4. Examining Example 1, we see that some of the products involve multiplication by a real number (mo, mi, and m2), and some of them multiplication by an 3, m4, and m5). None of the products involve multiplication
CYCLIC CONVOLUTION AND DISCRETE FOURIER TRANSFORM
79
by a general complex number. That means that each product uses only one multiplication (in case the a, 's are real numbers) or two multiplications (in case the a,'s are complex numbers). This phenomenon is quite general. As we saw, the computation of DFT(n), where n is a prime number can be viewed as computing the coefficients of R(u) • S(u) mod (u"~l-l), where R(u) = ^Io w^V and S(«) = Z"=o ard)u', where TT is a one-one mapping TT: {0, 1, • • • , n — 2}-» {1, 2, • • • , n — 1}, andr is a one-one mapping r: (0, 1, • • • , n -2}-> {1, 2, • • • , n — I}. (The mappings TT and r are determined by the isomorphism /: Mn -*Zn-i which is chosen.) Using this terminology, we have: THEOREM 1. For every odd prime n, the coefficients ofR(u) mod (un~l/2~1) are real numbers, and the coefficients of R(u] mod (w"~ 1 / 2 +l) are imaginary numbers. The other special feature of the algorithm we derived for DFT(5) is also general. We saw in Example 1 that mi was a multiplication by —5/4. The multiplication m\ corresponded to the multiplication modulo u — \. Since for any n £"=0 &>' =0 to =e2m/n we obtain that the multiplication corresponding to the product modulo (u — 1) will always be by a rational number. Thus for any prime number n we can derive an algorithm for computing DFT(n) which has 2(n -1) — 4>(n — 1) — 1 m/d steps, where (f>(k) is the number of divisors of k. It must be emphasized that we have not proved that /Lt(DFT(n); G) = 2n — 4>(n — 1) — 3, where n is a prime number, and G the field of rational numbers. The proof of the minimality of the algorithm for cyclic convolutions assumed that the entries of the two vectors to be convolved are indeterminates. In the application to computing DFT(n), one of the vectors has algebraic numbers as entries. Nonetheless this statement is true. A consequence of Theorem 2 of § Vic is the following theorem: THEOREM 2. Let G be the field of rational numbers, and let n be a prime number; then /Lt(DFT(n); G) = 2n—4>(n — 1) —3, where 4>(k) is the number of divisors o f k . In Table 1 we summarize the properties of several algorithms for computing DFT(n) where n is a prime number. TABLE 1 n
#Mult.
# Mult, by 1
#Add.
2
0 2 5 8
2 1 1 1
2 6 17 36
3
5 7
VIIc. DFT(pr) — p odd prime. In our study of the discrete Fourier transform of a prime number of points A = Wa we concentrated on the structure of the matrix W, which is the matrix W without the first row and column. Another way of saying it is that W is the restriction of W of those rows and columns whose index is relatively prime to n. (We start an enumeration of the rows and columns with 0.)
80
CHAPTER VII
To key to our derivation of the algorithm was the fact that the exponents of w in W formed the "multiplication table" of the group Mn, and that Mn is isomorphic to Zn-\ when n is a prime number. If we want to continue with this approach for n = pr a power of a prime, we have to consider first the submatrix W which is the restriction of W to those rows and columns whose index is relatively prime to p. (The enumeration of the rows and columns starts with 0.) The exponents of w in W form the "multiplication table" of the group Mp<- which is isomorphic to the group Zp<-\p-i). (It is this isomorphism which forces us to consider the case n — 2r separately, since M2r is not a cyclic group when r =s 3 but is isomorphic to Z2 x Z2--i.) We will refer to W as the core of DFT(pr). The part of W which is left can be partitioned into "blocks," each of which is the core of DFT(ps) for some s < r (including s = 0 in which case the core of DFT(l) is the 1x1 block whose entry isl). Having rearranged the rows and columns of each block to reflect the isomorphism of Mp* with Z(P-\)P^-^, we can use algorithms for CF(Zm) to derive an algorithm for DFT(p r ). A small example may be in order. Example 1. Let n=9. The group M9 is isomorphic to Z6. If we denote the elements of M9 by mi, m 2 , ra4, m5, m7, and m8, with the group operation m, • nij = mk if k = i • j mod 9, we obtain the isomorphism /: Z6->M9 given by: 1(0) = mi, 1(1) = m2, 1(2) = m4, 1(3) = mg, 1(4) = m7, and /(5) = ms. Of the three integers smaller than 9 which are divisible by 3, we have 3 and 6 which are not divisible by 32, and 0 which is divisible by 32. We will now write down DFT(9). In doing so we do not arrange the indices in their natural order, but rather start with 0, then go to 3 and 6, and finally to 1,2,4, 5, 7, 8. In the last group we arrange the indices so as to reflect the cyclic structure of the group. The DFT(9) can be written as:
where w = e2™'9. We partitioned the matrix so as to reflect its structure. The top row (and first column) consists of only 1 's. The next two rows (and next two columns) have some 1's and then three_copies of the same 2 x 2 block. This 2x2 block is readily recognized as the W matrix of DFT(3). The last block of the last six rows and sixth column representes the cyclic convolution of (a/, a>2, a>4, a>8, a>7,
CYCLIC CONVOLUTION AND DISCRETE FOURIER TRANSFORM
81
To obtain an algorithm, we can start with the algorithm for CF(Z6) derived in Example 2 of § Vila. In the example we "pushed" the worst coefficients on the y/'s. In considering the discrete Fourier transform the operations on the w"s (and not the a y 's) can be precomputed. But CF(ze) can also be viewed as the coefficients of (yo + y$u + y 4w2 + y3w3 + y2w4 + y\u5)(x0 + x\u + X2U2 + x3w3 + x4u4 + x5u5) mod (u6— 1). We can therefore use the algorithm of Example 2 for computing CF(u6— 1). (This is the reason we chose to write W as we did.) Performing the appropriate substitutions we get the following algorithm for CF(u6— 1).
where
Substituting the appropriate u> ' 's for the jc/'s and a* 's for the y/'s, we obtain the first part of our desired algorithm. In doing this substitution we observe that
and
(One easy way to see that is to observe that u>3 = e2m/3, and therefore 1 +&;3 + o>6 = 0. Consequently, o/ + w 4 + fu 7 = o>2 + w 5 + o»8 = 0.) If we denote (w 1 ,^ 2 , a/, w 8 , w 7 , a> 5 )® («i, a5, a7, a8, ^4, «a)by (t0, t\, t2, h, t4, t*) we have:
82
CHAPTER VII
where
and v = 27T/9. The next part of the derivation of the algorithm, treats the last six columns of the second and third rows. That is, a cyclic convolution of two 2-dimensional vectors. We denote (w 3 , w 6 )® (ai + a-j + a2, fl5 + a8 + a 4 ) by (t6,t7). Again, we use the method of § VIb and obtain:
where
We observe, again, that m? is multiplication by — \ (since l + o>3 + £u6 = 0), and therefore the sum (ai + a7 + a2 + a5 + a8 + a4) which is needed for the first row, is —2mj, i.e., it can be obtained using one addition. Turning our attention to the first three columns, we again have to compute the convolution of the two 2-dimensional vectors: (t8, tg) = (o>3, cu6) ® (a3, a6). For t8 and tg we have: t8 = m9 + mi 0 , tg = mg — mw, where m9 = cos (3u(a 3 + a6)) = — 2(^3 +ae), and mw = i sin (3v(a3 — a 6 ))- Finally we define m0 = 1 • a0. Putting it all together we get:
CYCLIC CONVOLUTION AND DISCRETE FOURIER TRANSFORM
83
We have thus derived an algorithm for computing DFT(9) which has 11 multiplications (including one by 1) and 44 additions. As was observed before, each of the products is either a multiplication by a real number (ra0, mi, m 2 , m3, m7, m9) or by an imaginary number (m4, m5, me, mg, mio). The example brought out two features of the algorithm for DFT(9). The first feature was that the substitution of the &/'s for the */'s in the algorithm for CF(u6 — 1) caused two of the products to vanish. The second feature was that each multiplication was the product of a real number, or an imaginary number, by a linear combination of the a,'s. Both of these features are general and occur whenever the procedure of Example 2 is followed in deriving the algorithm for DFT(p r )- (p is an odd prime, and r =£ 2.) We will denote_the core of the DFT(pr) by CFT(pr). That is CFT(pr] is the computation of A, =X je yw' ; a ; , / € / where I = J = {k\(k,p) = 1}. As was illustrated in the example CFT(pr)) can be viewed as computing R(u) • S(u) mod (ub -1), where b = (p - \}p"~\ R(u) = lf~o w"('V, and S(u) = X,=o <2 T (i)«', where TT: {0, 1, • • • , b -1}->{k\(k, p) = 1} and r: {0,1, • • • , b — 1}-* {k | (k, p} = 1} depend on the isomorphism between M (to is the primitive (pr)-th root of unity.) The following two theorems show that the features of the algorithm for DFT(9) hold in general for every pr. (See [17].) THEOREM 1. Let p be an odd prime, and r ^2 an integer. Let b = (p-l)pr~l R(u) be as above. Then the coefficients of R(u) mod (ub/2 — 1) are real numbers, and the coefficients of R(u) mod (ub/2+1) are imaginary numbers. THEOREM 2. Let Q be the field of rational numbers, p, r, and R(u) as in Theorem 1, and let c be c = (p —1)//~2. Then: 1. # ( « ) m o d ( M c - l ) = 0. 2. The coefficients ofR(u) mod ((ub - l)/(uc-1)) are linearly independent (as elements of Q(
2 are 2r x 2r circular matrices. As was done in § VIIc, we will first consider an example and then state the general results. Example 1. Consider the computation of CFT(16), that is A, = £,&/'«/ where / and / range over the odd integers less than 16, and 01 = e2m/l6. The group M\6 is isomorphic to Z 2 xZ 4 . One such isomorphism /: Z 2 xZ 4 -»Mi 6 is given by: /(O, 0) = 1, /(O,1) = 5,1(0, 2) = 9, /(O, 3) = 13, /(I, 0) = 15,1(1,1) = 11,1(1, 2) - 7,
84
CHAPTER VII
1(1, 3) = 3. Using this isomorphism we can write CFT(16) as
Using the algorithm for CF(u2 — 1) we replace CFT(16) by an equivalent system:
where v =277/16. That is, we got the direct sum of two instances of computing the coefficients of the product of two polynomials modulo u4 — 1. One algorithm for computing CF(u4— 1) is given by:
where
CYCLIC CONVOLUTION AND DISCRETE FOURIER TRANSFORM
85
Substituting the appropriate cos (kv) for the jc,'s and the appropriate combination of the a,'s for the y/s, we see that both mi and ra2 vanish. (Recall that cos v = —cos (9v) and cos (5v) — —cos (13v}.} Therefore, the computation of the first set of coefficients can be done using only three products, namely:
When we substitute for the second set of coefficients, both m\ and m 2 vanish again. So the second set of coefficients can also be computed using the three products:
Using these six products we obtain the following algorithm for computing CFT(16):
The example shows that deriving the algorithm for CFT(2r) we can also expect some of the product to vanish, and that the remaining ones are either products of real numbers by linear combinations of the a, 's or products of imaginary numbers by linear combinations of the a,'s. This is indeed the case. Let R(u, u) = L,--o w u +(w £i=o w u )y> and *(u> v) = Li=o a (Z"=o a2<4"-1>5("-°)f, where n = 2r~2, and the operations on the indices are understood to be modulo 2r. In terms of these R(u, v) and S(u, v) we can now state the following theorem: THEOREM 1. The elements of CFT(2r) are the same as the coefficients of R(u, v)S(u, v)mod (un — 1, v2 — 1). Moreover: 1. The coefficients of R(u, v) mod (v — 1) are real. 2. The coefficients of R(u, v) mod (v +1) are imaginary. For r ^ 3 we have: 3. R(u, t>)mod(w" / 2 -l) = 0. 4. The coefficients ofR (u, v) mod (u n/2 +1) are linearly independent as elements of Q(w)/Q, where Q is the field of rational numbers. As an immediate consequence of the last part of Theorem 1 and Theorem 2 of § Vic we obtain: THEOREM 2. Let Q denote the field of rational numbers; then (j,(CFT(2r); Q) = 2r"1-2. Just as for DFT(pr), we can write DFT(2r) as Wa where the matrix W can be partitioned in blocks, each of which represents CFT(2S) for some s^r. The next example illustrates this partitioning for DFT(8).
Example 2. Let n = 8. We will derive an algorithm for computing DFT(8). As was mentioned before, M_8 is isomorphic to Z_2 × Z_2. If we denote the elements of M_8 by m_1, m_3, m_5, and m_7, and those of Z_2 × Z_2 by (0, 0), (0, 1), (1, 0), (1, 1), we have the isomorphism I: Z_2 × Z_2 → M_8 given by: I(0, 0) = m_1, I(0, 1) = m_5, I(1, 0) = m_7, and I(1, 1) = m_3. We write DFT(8) so as to reflect this isomorphism, as well as the power of 2 which divides the index. Doing that we get:
where ω = e^{2πi/8}. We partitioned the matrix so as to highlight its structure. The special feature of DFT(2^r) is that ω^s = -1 where s = 2^{r-1}. In our case
The only part of the computation which is not straightforward is that of computing (t_1, t_2) = (ω^1, ω^7) ⊗ (a_1 - a_5, a_7 - a_3). This cyclic convolution of two 2-dimensional vectors can be accomplished using two multiplications: t_1 = m_6 + m_7, t_2 = m_6 - m_7, where m_6 = (1/2)(ω^1 + ω^7)(a_1 - a_3 - a_5 + a_7) = cos (v)(a_1 - a_3 - a_5 + a_7) and m_7 = (1/2)(ω^1 - ω^7)(a_1 + a_3 - a_5 - a_7) = i sin (v)(a_1 + a_3 - a_5 - a_7), where v = 2π/8. Using the fact that ω^2 = i we have the following identities for DFT(8):
where
and m_6, m_7 as defined before. We thus have an algorithm for computing DFT(8) which has 8 multiplications and 26 additions. Of these 8 products, 6 are multiplications by 1 (or i), and all of them are either multiplications by a real number (m_0, m_1, m_2, m_4, m_6) or by an imaginary number (m_3, m_5, m_7). In Table 1 we summarize some of the properties of algorithms for computing DFT(n) where n is a power of a prime number.

TABLE 1

      n    # Mult. (other than by 1)    # Mult. by 1    # Add.
      4               0                       4             8
      8               2                       6            26
      9              10                       1            44
     16              10                       8            74
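The two-multiplication convolution step used in Example 2 can be confirmed numerically. The Python sketch below is illustrative (random real data, identifiers ours); it compares t_1 = m_6 + m_7 and t_2 = m_6 - m_7 with the 2-point cyclic convolution computed directly.

    import cmath, math, random

    v = 2 * math.pi / 8
    omega = cmath.exp(1j * v)
    a = [random.random() for _ in range(8)]        # a_0, ..., a_7 (real data)

    x = (a[1] - a[5], a[7] - a[3])
    # direct cyclic convolution of (omega, omega^7) with (x_0, x_1)
    t1 = omega * x[0] + omega**7 * x[1]
    t2 = omega * x[1] + omega**7 * x[0]

    # the two products of Example 2
    m6 = math.cos(v) * (a[1] - a[3] - a[5] + a[7])
    m7 = 1j * math.sin(v) * (a[1] + a[3] - a[5] - a[7])

    assert abs((m6 + m7) - t1) < 1e-12
    assert abs((m6 - m7) - t2) < 1e-12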
VIIe. Multidimensional DFT. We will now consider the computation of DFT(n) when n has more than one prime divisor. Let n = n_1 · n_2 where (n_1, n_2) = 1. The Chinese remainder theorem, applied to the ring of integers, states that the mapping α which takes every integer i, 0 ≤ i < n, into the pair (i_1, i_2), where i_1 = i mod n_1 and i_2 = i mod n_2, is an isomorphism between the ring R_n of integers modulo n and R_{n_1} × R_{n_2}. The inverse mapping α^{-1} is given by α^{-1}(i_1, i_2) = a i_1 + b i_2 mod n, where a is a multiple of n_2 satisfying a ≡ 1 mod n_1, and b is a multiple of n_1 satisfying b ≡ 1 mod n_2. There exists another isomorphism between the additive group Z_n and Z_{n_1} × Z_{n_2}. For every j, 0 ≤ j < n, we define β(j) = [j_1, j_2] where j_1 and j_2 are the unique integers satisfying j_1 n_2 + j_2 n_1 ≡ j mod n, 0 ≤ j_1 < n_1, 0 ≤ j_2 < n_2. Let α(i) = (i_1, i_2) and β(j) = [j_1, j_2]; then i · j mod n = (a i_1 + b i_2)(j_1 n_2 + j_2 n_1) mod n = (a n_1 i_1 j_2 + b n_2 i_2 j_1) mod n + (i_1 j_1 a n_2) mod n + (i_2 j_2 b n_1) mod n. But a is a multiple of n_2 and b a multiple of n_1, so a n_1 ≡ b n_2 ≡ 0 mod n. Also, a ≡ 1 mod n_1 and b ≡ 1 mod n_2, and therefore (i_1 j_1 a n_2) mod n = (i_1 j_1 n_2) mod n and (i_2 j_2 b n_1) mod n = (i_2 j_2 n_1) mod n. That is, i · j mod n = ((i_1 j_1) n_2 + (i_2 j_2) n_1) mod n. The previous paragraph shows that
A_i = Σ_{j_1=0}^{n_1-1} Σ_{j_2=0}^{n_2-1} ω_1^{i_1 j_1} ω_2^{i_2 j_2} a_{[j_1, j_2]},
where ω_1 = ω^{n_2} = e^{2πi/n_1} and ω_2 = ω^{n_1} = e^{2πi/n_2}. Therefore, if we arrange the A_i's lexicographically according to α(i) = (i_1, i_2) and arrange the a_j's lexicographically according to β(j) = [j_1, j_2], we obtain A = Wa where the matrix W is the tensor product W = W_1 × W_2 of the DFT(n_1) matrix by the DFT(n_2) matrix. In other words, if (n_1, n_2) = 1 then DFT(n_1 · n_2) is equivalent to the two-dimensional DFT of n_1 × n_2 points. We will therefore consider the computation of multidimensional DFT, knowing that we also consider, at the same time, the one-dimensional DFT(n) where n has more than one prime divisor. We will denote the multidimensional transform Σ_{j_1, ⋯, j_k} ω_1^{i_1 j_1} ω_2^{i_2 j_2} ⋯ ω_k^{i_k j_k} a_{j_1, ⋯, j_k} by DFT(n_1, n_2, ⋯, n_k).
Example 1. Consider the DFT(15). The mapping α is given by α(i) = (i mod 3, i mod 5), i = 0, 1, ⋯, 14.
The mapping β is given by β(j) = [j_1, j_2], where j_1 and j_2 are the unique integers satisfying 5 j_1 + 3 j_2 ≡ j mod 15, 0 ≤ j_1 < 3, 0 ≤ j_2 < 5.
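The index identity i · j ≡ (i_1 j_1) n_2 + (i_2 j_2) n_1 (mod n) on which the rearrangement rests can be checked exhaustively for n = 15. The Python sketch below is illustrative; the helper names are ours.

    n1, n2 = 3, 5
    n = n1 * n2

    def alpha(i):                 # CRT splitting of the index i
        return (i % n1, i % n2)

    def beta(j):                  # j = 5*j1 + 3*j2 (mod 15), 0 <= j1 < 3, 0 <= j2 < 5
        for j1 in range(n1):
            for j2 in range(n2):
                if (j1 * n2 + j2 * n1) % n == j:
                    return (j1, j2)

    for i in range(n):
        for j in range(n):
            i1, i2 = alpha(i)
            j1, j2 = beta(j)
            assert (i * j) % n == (i1 * j1 * n2 + i2 * j2 * n1) % n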
We denote by A the (column) vector of the A_i's arranged lexicographically according to α(i), and by a the (column) vector of the a_j's arranged lexicographically according to β(j). The DFT(15) can then be written as A = Wa where W is the 15 × 15 matrix
where φ = e^{2πi/3}, and W_5 is the matrix of DFT(5). That is, W = W_3 × W_5. So, we can write DFT(15) as
where
Following the procedure of § VIIb we obtain an algorithm for computing DFT(3), namely:
where m_0 = 1 · (a_0 + a_1 + a_2), m_1 = (cos (2π/3) - 1)(a_1 + a_2), and m_2 = i sin (2π/3)(a_1 - a_2). We therefore get an algorithm for computing DFT(15), namely:
M_0 = W_5(a_0 + a_1 + a_2), M_1 = (cos (2π/3) - 1)W_5(a_1 + a_2), M_2 = i sin (2π/3)W_5(a_1 - a_2). That is, we can compute DFT(15) by computing 3 times DFT(5) and, in addition, using 6 · 5 = 30 additions. It should be clear now why we kept count of the multiplications by 1. The product m_0 = 1 · (a_0 + a_1 + a_2) of DFT(3) is replaced by W_5(a_0 + a_1 + a_2). The multiplications by 1 of DFT(5) are no longer multiplications by 1 when we compute M_2 = i sin (2π/3) W_5(a_1 - a_2). We can use the algorithm of Example 1 of § VIIb to compute M_0, M_1, M_2. Each of these computations uses 6 multiplications and 17 additions. Altogether we have thus derived an algorithm for DFT(15) which has 3 · 6 = 18 multiplications (including one multiplication by 1) and 3 · 17 + 30 = 81 additions. We could have arranged DFT(15) so as to be DFT(5, 3) rather than DFT(3, 5). The algorithm we would have obtained would have 6 · 3 = 18 multiplications and 17 · 3 + 6 · 6 = 87 additions. This is the same phenomenon we saw in § VIIa. Whether we arrange W as W_3 × W_5 or W_5 × W_3 does not affect the number of multiplications, but it may influence the number of additions.
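The DFT(3, 5) arrangement just described can be checked numerically. The Python sketch below is illustrative; it assumes numpy, uses helper names of our own choosing, and assumes the standard output combination A_0 = M_0, A_1 = M_0 + M_1 + M_2, A_2 = M_0 + M_1 - M_2 of the DFT(3) algorithm, since that display is not reproduced above.

    import numpy as np

    n1, n2, n = 3, 5, 15
    F15 = np.exp(2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)
    F5 = np.exp(2j * np.pi * np.outer(np.arange(n2), np.arange(n2)) / n2)

    a = np.random.rand(n)                           # the data a_0, ..., a_14

    # rearrange the input by beta: block j1 holds a[(5*j1 + 3*j2) % 15], j2 = 0, ..., 4
    blk = [np.array([a[(5 * j1 + 3 * j2) % n] for j2 in range(n2)]) for j1 in range(n1)]

    # the three products of the DFT(3) algorithm with W_5 folded in
    M0 = F5 @ (blk[0] + blk[1] + blk[2])
    M1 = (np.cos(2 * np.pi / 3) - 1) * (F5 @ (blk[1] + blk[2]))
    M2 = 1j * np.sin(2 * np.pi / 3) * (F5 @ (blk[1] - blk[2]))

    # assumed output combination of the DFT(3) algorithm
    out = [M0, M0 + M1 + M2, M0 + M1 - M2]

    # out[i1][i2] should equal A_i with alpha(i) = (i1, i2), i.e. i = (10*i1 + 6*i2) % 15
    A = F15 @ a
    for i1 in range(n1):
        for i2 in range(n2):
            assert abs(out[i1][i2] - A[(10 * i1 + 6 * i2) % n]) < 1e-9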
The construction of the algorithm for DFT(15) can be used to construct an algorithm for any DFT(n_1 · n_2) with (n_1, n_2) = 1, from algorithms for DFT(n_1) and DFT(n_2). In Table 1 below we summarize the number of multiplications and the number of additions for computing DFT(n) for several composite n's. We assume that the a_j's are complex numbers, and the table shows the number of multiplications and the number of additions of real numbers. The algorithms which are summarized in the table were built up, as in the example, from the algorithms summarized in Table 1 of § VIIb and Table 1 of § VIId. For the sake of comparison with FFT we can express the number of multiplications as a_M · n · log_2 n, and the number of additions as a_A · n · log_2 n. The table includes the values of a_M and a_A as well.

TABLE 1

        n    # Mult.     # Add.    a_M    a_A
       30        72        384     .49   2.61
       48       108        636     .40   2.37
       60       144        888     .41   2.51
       84       216      1,536     .40   2.85
      120       288      2,076     .35   2.50
      168       432      3,492     .35   2.81
      210       648      5,256     .40   3.24
      280       864      7,148     .38   3.14
      360     1,056      8,532     .35   2.79
      420     1,296     11,352     .35   3.10
      504     1,584     14,540     .35   3.21
      720     2,376     21,312     .35   3.12
      840     2,992     24,804     .37   3.04
    1,008     3,564     34,668     .35   3.45
    1,260     4,752     46,664     .37   3.60
    2,520     9,504     99,628     .33   3.50
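As an illustration of the normalization, the entries for n = 120 can be reproduced with a few lines of Python (identifiers ours):

    import math

    n, mults, adds = 120, 288, 2076
    a_M = mults / (n * math.log2(n))   # approximately .35
    a_A = adds / (n * math.log2(n))    # approximately 2.50
    print(round(a_M, 2), round(a_A, 2))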
The algorithms which are constructed as in Example 1 can be programmed in a modular fashion. In the example, we have one subroutine which computes DFT(5), and we call it three times. However, it should be emphasized that these algorithms are not minimal. We can use the algorithms developed in § VId to derive better algorithms for multidimensional DFT. The next example illustrates this point.
Example 2. We wish to compute DFT(5, 5). Using the construction of Example 1 of this subsection we obtain an algorithm which has 6 · 6 = 36 multiplications (one of which is by 1) and 17 · 5 + 6 · 17 = 187 additions. We will now construct a different algorithm which has only 33 multiplications (including a multiplication by 1). The discrete Fourier transform of 5 points can be written as
where ω = e^{2πi/5}. (The DFT(5) was analyzed in Example 1 of § VIIb.) As was mentioned earlier, the Chinese remainder theorem can be used to "block diagonalize" the matrix. Because the core of DFT(5) consists of the coefficients of
((ω - 1) + (ω^2 - 1)u + (ω^4 - 1)u^2 + (ω^3 - 1)u^3)(a_1 + a_3 u + a_4 u^2 + a_2 u^3) mod (u^4 - 1),
and (u^4 - 1) = (u - 1)(u + 1)(u^2 + 1), we can write DFT(5) as
where
(where v = 2π/5). If we write this system as A = Wa, we see that DFT(5, 5) is equivalent to the system B = (W × W)b. The matrix W × W has nine 1 × 1 blocks, six 2 × 2 blocks each corresponding to CF(u^2 + 1), and one 4 × 4 block corresponding to CF(u^2 + 1, v^2 + 1). That means that DFT(5, 5) can be computed using 9 · 1 + 6 · 3 + 6 = 33 multiplications. (The resulting algorithm has 172 additions.) The power of this construction increases as we increase the dimension. For example, DFT(5, 5, 5) can be computed using 174 multiplications (and 1,327 additions), while the construction of Example 1 of this subsection yields an algorithm with 216 multiplications (and 1,547 additions). The advantage of the construction of Example 2 comes from the fact that we can construct an algorithm for CF(u^2 + 1, v^2 + 1) having only 6 m/d steps rather than 9. Any time that the algorithm for DFT(n_1) utilizes an algorithm for CF(P), and an algorithm for DFT(n_2) utilizes an algorithm for CF(P'), such that P and P' have the property that μ(CF(P, P')) < μ(CF(P)) · μ(CF(P')), we can use an algorithm for CF(P, P') to reduce the number of multiplications of the resulting algorithm for the multidimensional DFT.
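The 6 m/d step algorithm for CF(u^2 + 1, v^2 + 1) is not written out above. One way to obtain such an algorithm (a sketch of our own construction, not necessarily the one intended in the text) is to substitute t = uv, so that t^2 = 1 and the algebra splits, modulo (t - 1) and (t + 1), into two copies of Q[u]/(u^2 + 1); each of the two resulting products can then be done with 3 multiplications. The Python sketch below, with identifiers of our own choosing, verifies this 6-multiplication scheme against a brute-force reduction modulo (u^2 + 1, v^2 + 1).

    import random

    def direct(a, b):
        # multiply a0 + a1*u + a2*v + a3*uv by b0 + b1*u + b2*v + b3*uv, reducing by u^2 = -1, v^2 = -1
        prod = [0.0] * 4
        # table[i][j] = (k, sign): basis_i * basis_j = sign * basis_k, basis order 1, u, v, uv
        table = [[(0, 1), (1, 1), (2, 1), (3, 1)],
                 [(1, 1), (0, -1), (3, 1), (2, -1)],
                 [(2, 1), (3, 1), (0, -1), (1, -1)],
                 [(3, 1), (2, -1), (1, -1), (0, 1)]]
        for i in range(4):
            for j in range(4):
                k, s = table[i][j]
                prod[k] += s * a[i] * b[j]
        return prod

    def cmul3(p, q, r, s):
        # (p + q*i)(r + s*i) with 3 multiplications
        m1, m2, m3 = p * r, q * s, (p + q) * (r + s)
        return m1 - m2, m3 - m1 - m2

    def six_mult(a, b):
        # CF(u^2+1, v^2+1) with 6 multiplications, via t = uv (so v = -u*t and t^2 = 1)
        ap, bp = (a[0] + a[3], a[1] - a[2]), (b[0] + b[3], b[1] - b[2])   # images mod (t - 1)
        am, bm = (a[0] - a[3], a[1] + a[2]), (b[0] - b[3], b[1] + b[2])   # images mod (t + 1)
        pp = cmul3(*ap, *bp)      # 3 multiplications
        pm = cmul3(*am, *bm)      # 3 multiplications
        # Chinese-remainder reconstruction: only additions and divisions by 2
        return [(pp[0] + pm[0]) / 2, (pp[1] + pm[1]) / 2,
                (pm[1] - pp[1]) / 2, (pp[0] - pm[0]) / 2]

    a = [random.random() for _ in range(4)]
    b = [random.random() for _ in range(4)]
    assert all(abs(x - y) < 1e-12 for x, y in zip(direct(a, b), six_mult(a, b)))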
Example 3. Let Q be the field of rational numbers, and let I satisfy I^2 + 1 = 0. Over the field G = Q(I) we have the following algorithm for computing DFT(5):
where
This algorithm has 5 m/d steps and 18 additions. Assume now that we have to compute the DFT(5, 5, 5) of two sets of data a and b. We can view this computation as that of computing the DFT(5, 5, 5) of the one set of data c = a + Ib. Iterating the algorithm, just as was done in Example 1 of this subsection, we obtain an algorithm having 125 m/d steps and 1,350 addition steps. But each m/d step can be simulated by 3 multiplications and 3 additions (over Q), and each addition step consists of two additions. So, over Q, we obtain an algorithm having 375 multiplications and 2,700 + 375 = 3,075 additions. This last algorithm computes two DFT(5, 5, 5)'s, and therefore we have only 187.5 multiplications and 1,532.5 additions per DFT(5, 5, 5).
We will end this section by comparing the three constructions in the case we wish to compute the three-dimensional discrete Fourier transform DFT(120, 120, 120). We will assume that the data is a set of complex numbers, and we will count the number of arithmetic operations on real numbers which have to be performed. The construction of Example 1 yields an algorithm having 5,971,968 = .17 × (120)^3 × log_2 (120)^3 multiplications and 97,217,280 = 2.72 × (120)^3 × log_2 (120)^3 additions. The construction of Example 2 yields an algorithm having 4,810,752 = .13 × (120)^3 × log_2 (120)^3 multiplications and 91,120,896 = 2.54 × (120)^3 × log_2 (120)^3 additions. The construction of Example 3 yields an algorithm having 5,184,000 = .14 × (120)^3 × log_2 (120)^3 multiplications and 96,940,800 = 2.71 × (120)^3 × log_2 (120)^3 additions.
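The count of 3 multiplications and 3 additions per m/d step in Example 3 refers to the standard scheme for multiplying a general element r + sI of Q(I) by a constant p + qI whose sums and differences can be precomputed. The Python sketch below (illustrative, identifiers ours) shows one such scheme and checks it against the direct product.

    def mul_by_constant(p, q, r, s):
        # (p + q*I)(r + s*I) with 3 real multiplications and 3 real additions;
        # p and q are constants, so q - p and q + p may be precomputed
        qm, qp = q - p, q + p        # precomputed once per constant, not counted
        k1 = p * (r + s)             # 1 multiplication, 1 addition
        k2 = r * qm                  # 1 multiplication
        k3 = s * qp                  # 1 multiplication
        return k1 - k3, k1 + k2      # 2 additions: real and imaginary parts

    p, q, r, s = 0.25, -1.5, 3.0, 7.0
    exact = (p * r - q * s, p * s + q * r)
    assert all(abs(x - y) < 1e-12 for x, y in zip(mul_by_constant(p, q, r, s), exact))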
Bibliography
We will not give an extensive bibliography. The short list given below should be used as a starting point; the reader interested in further examination of the material can use the references given in the various papers.
[1] A. BORODIN AND I. MUNRO, Computational Complexity of Algebraic and Numeric Problems, American Elsevier, New York, 1975.
[2] A. V. AHO, J. E. HOPCROFT AND J. D. ULLMAN, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA, 1976.
[3] A. KARATSUBA AND YU. OFMAN, Multiplication of multidigit numbers on automata, Soviet Physics Dokl., 7 (1963), pp. 595-596.
[4] A. L. TOOM, The complexity of a scheme of functional elements realizing the multiplication of integers, Soviet Math. Dokl., 4 (1963), pp. 714-716.
[5] J. W. COOLEY AND J. W. TUKEY, An algorithm for the machine calculation of complex Fourier series, Math. Comp., 19 (1965), pp. 297-301.
[6] V. STRASSEN, Gaussian elimination is not optimal, Numer. Math., 13 (1969), pp. 354-356.
[7] S. WINOGRAD, On the number of multiplications necessary to compute certain functions, Comm. Pure Appl. Math., 23 (1970), pp. 165-179.
[8] S. WINOGRAD, On multiplication of 2 × 2 matrices, Linear Algebra Appl., 4 (1971), pp. 381-388.
[9] V. STRASSEN, Evaluation of rational functions, Complexity of Computer Computations, R. E. Miller and J. W. Thatcher, eds., Plenum Press, New York, 1972, pp. 1-10.
[10] J. HOPCROFT AND J. MUSINSKI, Duality applied to the complexity of matrix multiplication and other bilinear forms, SIAM J. Comput., 2 (1973), pp. 159-173.
[11] R. W. BROCKETT AND D. DOBKIN, On the optimal evaluation of a set of bilinear forms, Proc. 5th Annual ACM Symp. on Theory of Computing, 1973, pp. 88-95.
[12] S. WINOGRAD, Some bilinear forms whose multiplicative complexity depends on the field of constants, Math. Systems Theory, 10 (1977), pp. 169-180.
[13] R. C. AGARWAL AND J. W. COOLEY, New algorithms for digital convolution, IEEE Trans. Acoustics, Speech and Signal Processing, 25 (1977), pp. 392-410.
[14] S. WINOGRAD, On computing the discrete Fourier transform, Math. Comp., 32 (1978), pp. 175-199.
[15] S. WINOGRAD, On multiplication in algebraic extension fields, Theoret. Comput. Sci., 8 (1979), pp. 359-377.
[16] S. WINOGRAD, On multiplication of a polynomial modulo a polynomial, IBM Report RC 6791, 1977.
[17] S. WINOGRAD, On the multiplicative complexity of the discrete Fourier transform, Advances in Math., 32 (1979), pp. 83-117.