PREFACE

This book aims to give an impression of the way current research in algorithms, architectures and compilation for parallel systems is evolving. It focuses especially on domains where embedded systems are required, oriented either to application-specific or to programmable realisations. These are crucial in domains such as audio, telecom, instrumentation, speech, robotics, medical and automotive processing, image and video processing, TV, multimedia, radar and sonar. The domain of scientific, numerical computing is also covered.

The material in the book is based on the author contributions presented at the 3rd International Workshop on Algorithms and Parallel VLSI Architectures, held in Leuven, August 29-31, 1994. This workshop was partly sponsored by EURASIP and the Belgian NFWO (National Fund for Scientific Research), and organized in co-operation with the IEEE Benelux Signal Processing Chapter, the IEEE Benelux Circuits and Systems Chapter, and INRIA, France. It was a continuation of two previous workshops of the same name, held in Pont-à-Mousson, France, June 1990 [1], and Bonas, France, June 1991 [2].

All of these workshops have been organized in the framework of the EC Basic Research Actions NANA and NANA2, Novel Parallel Algorithms for New Real-time Architectures, sponsored by the ESPRIT program of Directorate XIII of the European Commission. The NANA contractors are IMEC, Leuven, Belgium (F. Catthoor), K.U. Leuven, Leuven, Belgium (J. Vandewalle), ENSL, Lyon, France (Y. Robert), TU Delft, Delft, The Netherlands (P. Dewilde and E. Deprettere), and IRISA, Rennes, France (P. Quinton). The goal within these projects has been to contribute algorithms suited for parallel architecture realisation on the one hand, and on the other hand design methodologies and synthesis techniques which address the design trajectory from real behaviour down to the parallel architecture realisation of the system. As such, this clearly overlaps with the scope of the workshop and the book. An overview of the main results presented in the different chapters, combined with an attempt to structure all this information, is available in the introductory chapter.

We expect this book to be of interest in academia, both for the detailed descriptions of research results and for the overview of the field given here, with many important but less widely known issues which must be addressed to arrive at practically relevant results. In addition, many authors have considered applications, and the book is intended to reflect this fact. The real-life applications that have driven the research are described in several
contributions, and the impact of their characteristics on the methodologies is assessed. We therefore believe that the book will also be of interest to senior design engineers and CAD managers in industry, who wish either to anticipate the evolution of commercially available design tools over the next few years, or to make use of the concepts in their own research and development.

It has been a pleasure for us to organize the workshop and to work together with the authors to assemble this book. We feel amply rewarded by the result of this co-operation, and we want to thank all the authors here for their effort. We have spent significant effort on careful editing, in order to deliver material that is as consistent as possible. The international aspect has allowed us to group the results of many research groups with different backgrounds and "research cultures," which we feel is particularly enriching.

We would be remiss not to thank Prof. L. Thiele of Universität des Saarlandes, Saarbrücken, Germany, who was an additional member of the workshop's organizing committee, and F. Vanpoucke, who was a perfect workshop managing director and also did a great job in collecting and processing the contributions to this book.

We hope that the reader will find the book useful and enjoyable, and that the results presented will contribute to the continued progress of the field of parallel algorithms, architectures and compilation.
Leuven, October 1994, the editors
References

[1] E. Deprettere, A. van der Veen (eds.), "Algorithms and Parallel VLSI Architectures", Elsevier, Amsterdam, 1991.
[2] P. Quinton and Y. Robert (eds.), "Algorithms and Parallel VLSI Architectures II", Elsevier, Amsterdam, 1992.
ALGORITHMS AND PARALLEL VLSI ARCHITECTURES
F. CATTHOOR
IMEC, Kapeldreef 75, 3001 Leuven, Belgium
[email protected]

M. MOONEN
ESAT, Katholieke Universiteit Leuven, 3001 Leuven, Belgium
[email protected]
ABSTRACT. In this introductory chapter, we will summarize the main contributions of the chapters collected in this book. Moreover, the topics addressed in these chapters will be linked to the major research trends in the domain of parallel algorithms, architectures and compilation.
1 STRUCTURE OF THE BOOK
The contributions to the workshop and the book can be classified into three categories:

1. Parallel Algorithms: The emphasis lies on the search for more efficient and inherently parallelisable algorithms for particular computational kernels, mainly from linear algebra. The demand for fast matrix computations has arisen in a variety of fields, such as speech and image processing, telecommunication, radar and sonar, biomedical signal processing, and so on. The work is motivated by the belief that preliminary algorithmic manipulations largely determine the success of, e.g., a dedicated hardware design, because radical algorithmic manipulations and engineering techniques are not easily captured, e.g., in automatic synthesis tools. Most of the contributions here deal with real-time signal processing applications, and in many cases, the research on these algorithms is already tightly linked to the potential parallel realisation options to be exploited in the architecture phase.
2. Parallel Architectures: Starting from an already parallelized algorithm or a group of algorithms (a target application domain), the key issue here is to derive a particular architecture which efficiently realizes the intended behaviour for a specific technology. In this book, the target technology will be CMOS electronic circuitry. In order to achieve this architecture realisation, the detailed implementation characteristics of the building blocks - like registers/memories, arithmetic components, logic gates and connection networks - have to be incorporated. The end result is an optimized netlist/layout of either primitive custom components or of programmable building blocks. The trend of recent years is to mix both styles: more custom features are embedded in the massively parallel programmable machines, especially in the storage hierarchy and the network topologies, while (much) more flexibility is built into the custom architectures, sometimes leading to highly flexible, weakly parallel processors. The path followed to arrive at such architectures is the starting point for the formalisation into reusable compilation methodologies.

3. Parallel Compilation: Most designs in industry suffer from increasing time pressure. As a result, the methods to derive efficient architectures and implementations have to become more efficient and less error-prone. For this purpose, an increasing amount of research is spent on formalized methodologies to map specific classes of algorithms (application domain) to selected architectural templates (target style). In addition, some steps in these methodologies are becoming supported by interactive or automated design techniques (architectural synthesis or compilation). In this book, the emphasis will be on modular algorithms with much inherent parallelism, to be mapped on (regular) parallel array styles. Both custom (application-specific) and programmable (general-purpose) target styles will be considered.
These categories correspond to the different parts of the book. An outline of the main contributions in each part is given next, along with an attempt to capture the key features of the presented research.
2 PARALLEL ALGORITHMS
In recent years, it has become clear that for many advanced real-time signal processing and adaptive systems and control applications the required level of computing power is well beyond that available on present-day programmable signal processors. Linear algebra and matrix computations play an increasingly prominent role here, and the demand for fast matrix computations has arisen in a variety of fields, such as speech and image processing, telecommunication, radar and sonar, biomedical signal processing, and so on. Dedicated architectures then provide a means of achieving orders of magnitude improvement in performance, consistent with the requirements. However, past experience has shown that preliminary algorithmic manipulations largely determine the success of such a design. This has led to a new research activity, aimed at tailoring algorithmic design to architectural design and vice versa, or in other words deriving numerically stable algorithms which are suitable for parallel computation. At this stage, there is also interaction already with the parallel architecture designers who
have to evaluate the mapping possibilities onto parallel processing architectures, capable of performing the computation efficiently at the required throughput rate.

In the first keynote contribution, CHAPTER 1 (Regalia), a tutorial overview is given of so-called subspace methods, which have received increasing attention in signal processing and control in recent years. Common features are extracted for two particular applications, namely multivariable system identification and source localization. Although these application areas have different physical origins, the mathematical structure of the problems they aim to solve is laced with parallels, so that, e.g., parallel and adaptive algorithms in one area find an immediate range of applications in neighbouring areas. In particular, both problems are based on finding spanning vectors for the null space of a spectral density matrix characterizing the available data, which is usually expressed numerically in terms of extremal singular vectors of a data matrix. Algorithmic aspects of such computations are treated in subsequent chapters, namely CHAPTER 6 (Götze et al.) and CHAPTER 7 (Saxena et al.), see below.

Linear least squares minimisation is no doubt one of the most widely used techniques in digital signal processing. It finds applications in channel equalisation as well as system identification and adaptive antenna array beamforming. At the same time, it is one of the most intensively studied linear algebra techniques when it comes to parallel implementation. CHAPTERS 2 through 5 all deal with various aspects of this. Of the many alternative algorithms that have been proposed over the years, one of the most attractive is the algorithm based on QR decomposition. To circumvent pipelining problems with this algorithm, several alternative algorithms have been developed, of which the covariance-type algorithm with inverse updating is now receiving a lot of attention. In CHAPTER 2 (McWhirter et al.), a formal derivation is given of two previously developed systolized versions of this algorithm. The derivation of these arrays is highly non-trivial due to the presence of data contra-flow in the underlying signal flow graph, which would normally prohibit pipelined processing. Algorithmic engineering techniques are applied to overcome these problems. Similar algorithmic techniques are used in CHAPTER 3 (Brown et al.), which is focused on covariance-type algorithms for the more general Kalman filtering problem. Here also, algorithmic engineering techniques are used to generate two systolic architectures, put forward in earlier publications, from an initial three-dimensional hierarchical signal flow graph (or dependence graph). In CHAPTER 4 (Schier), it is shown how the inverse updates algorithm and systolic array treated in CHAPTER 2 may be equipped with a block-regularized exponential forgetting scheme. This makes it possible to overcome numerical problems when the input data is not sufficiently informative. Finally, in CHAPTER 5 (Kadlec) the information-type RLS algorithm based on QR decomposition is reconsidered. A normalized version of this algorithm is presented which has potential for efficient fixed-point implementation. The main contribution here is a global probability analysis which gives an understanding of the algorithm's numerical properties and allows probability statements to be formulated about the number of bits actually used in the fixed-point representation.
A second popular linear algebra tool is the singular value decomposition (and the related symmetric eigenvalue decomposition), which, e.g., finds applications in subspace techniques as outlined in CHAPTER 1. The next two chapters deal with the parallel implementation of
such orthogonal decompositions. In CHAPTER 6 (Götze et al.), it is explained how Jacobi-type methods may be speeded up through the use of so-called orthonormal μ-rotations. Such CORDIC-like rotations require a minimal number of shift-add operations, and can be executed on a floating-point CORDIC architecture. Various methods for the construction of such orthonormal μ-rotations of increasing complexity are presented and analysed. An alternative approach to developing parallel algorithms for the computation of eigenvalues and eigenvectors is presented in CHAPTER 7 (Saxena et al.). It is based on isospectral flows, that is, matrix flows in which the eigenvalues of the matrix are preserved. Very few researchers in the past have used the isospectral flow approach to implement the eigenvalue problem in VLSI, even though, as explained in this chapter, it has several advantages from the VLSI point of view, such as simplicity and scalability.

CHAPTER 8 (Arioli et al.) deals with block iterative methods for solving linear systems of equations in heterogeneous computing environments. Three different strategies are proposed for parallel distributed implementation of the Block Conjugate Gradient method, differing in the amount of computation performed in parallel, the communication scheme, and the distribution of tasks among processors. The best performing scheme is then used to accelerate the convergence of the Block Cimmino method.

Finally, CHAPTER 9 (Cardarilli et al.) deals with RNS-to-binary conversion. RNS (Residue Number System) arithmetic is based on the decomposition of a number - represented by a large number of bits - into reduced-wordlength residual numbers. It is a very useful technique to reduce carry propagation delays and hence speed up signal processing implementations. Here, a conversion method is presented which is based on a novel class of coprime moduli and which is easily extended to a large number of moduli. In this way the proposed method allows the implementation of very fast and low-complexity architectures. This paper already bridges the gap with the detailed architecture realisation, treated in the second category of contributions.
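As a small editorial illustration of the RNS idea itself (not of the conversion scheme of CHAPTER 9), the Python sketch below decomposes integers into short residues, multiplies them channel by channel without carries, and converts back to binary with the Chinese Remainder Theorem. The moduli are ordinary pairwise-coprime values chosen only for illustration; they are not the novel coprime moduli class proposed in that chapter.

```python
from math import prod

MODULI = (251, 253, 255, 256)            # pairwise coprime, roughly 8-bit channels

def to_rns(x, moduli=MODULI):
    """Decompose a wide integer into short, independent residues."""
    return tuple(x % m for m in moduli)

def rns_mul(a, b, moduli=MODULI):
    """Channel-wise multiplication: no carries propagate between channels."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

def from_rns(residues, moduli=MODULI):
    """RNS-to-binary conversion via the Chinese Remainder Theorem."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)     # pow(Mi, -1, m) is the modular inverse
    return x % M

a, b = 123456, 7890                      # product fits within the RNS dynamic range
assert from_rns(rns_mul(to_rns(a), to_rns(b))) == a * b
```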
3 PARALLEL ARCHITECTURES FOR HIGH-SPEED NUMERICAL AND SIGNAL PROCESSING
Within this research topic, we have contributions on both customized and programmable architectures. For the application-specific array architectures, the main trend is towards more flexibility. This is visible, for instance, in the high degree of scalability and the different modes/options offered by the various architectures. We can make a further subdivision between the more "conventional" regular arrays with only local communication, and the arrays which are combined with other communication support, like tree networks, to increase the speed of non-local dependencies.

In the first class, two representative designs are reported in this book. In CHAPTER 10 (Riem et al.), a custom array architecture for long integer arithmetic computations is presented. It makes use of redundant arithmetic for high speed and is very scalable in word-length. Moreover, several modes are available to perform various types of multiplication and division. The emphasis in this paper lies on the interaction with the algorithmic transformations which are needed to derive an optimized architecture, and also on the methodology which is used
throughout the design trajectory. Similarly, in CHAPTER 11 (Rosseel et al.), a regular array architecture for an image diffusion algorithm is derived. The resulting design is easily cascadable and scalable, and the data-path supports many different interpolation functions. The extended formal methodology used to arrive at the end result - oriented to fixed-throughput applications - forms a common thread throughout the paper.

Within the class of arrays extended with non-local communication, two representative designs are also reported, again including a high degree of scalability. The topic of CHAPTER 12 (Duboux et al.) is a parallel array augmented with a tree network for fast and efficient dictionary manipulations. The memory and network organisation for handling the key-record data are heavily tuned to obtain the final efficiency. Also in CHAPTER 13 (Archambaud et al.), a basic systolic array is extended with an arbitration tree to speed up the realisation of the application. In this case, it is oriented to genetic sequence comparison including the presence of "holes". In order to achieve even higher speed, a set-associative memory is included too.

For the class of programmable architectures, both massively and weakly parallel machines are available. Apparently, their use depends on the application domain which is targeted. For high-throughput real-time signal processing, e.g. in image and video processing, the main trend nowadays is towards lower degrees of parallelism (4 to 16 processor elements) and more customisation to support particular, frequently occurring operations and constructs. The latter is especially apparent in the storage and communication organisation. The reduced parallelism is motivated by the fact that the amount of available algorithmic parallelism is not necessarily that big, and that the speed of the basic processors has become high enough to reduce the parallelisation factor required to obtain the throughput.

Within the programmable class, the main emphasis in the book lies on the evolution of these novel, weakly parallel processor architectures for video and image processing type applications. In CHAPTER 14 (Vissers et al.), an overview is provided of the VSP2 architecture, which is mainly intended for video processing as in HDTV, video compression and the like. It supports a highly flexible connection network (cross-bar) and a very distributed memory organisation with dedicated register banks and FIFOs. In CHAPTER 15 (Roenner et al.), the emphasis lies on a programmable processor mainly targeted to image processing algorithms. Here, the communication network is more restricted but the storage organisation is more diversified, efficiently supporting in hardware both regular and data-dependent, and both local and neighbourhood operations. The two processor architectures are, however, also partly overlapping in target domain, and the future will have to show which of the options is best suited for a particular application.

Using such video or image signal processors, it is possible to construct flexible higher-level templates which are tuned to a particular class of applications. This has for instance been achieved in CHAPTER 16 (De Greef et al.), where motion-estimation-like algorithms are considered. A highly efficient communication and storage organisation is proposed which makes it possible to reduce these overheads considerably for the targeted applications. Real-time
execution with limited board-space is obtained in this way for emulation and prototyping purposes. In addition, higher efficiency in the parallel execution within the data-path can potentially be obtained by giving up the fully synchronous operation. This is demonstrated in CHAPTER 17 (Arvind et al.), where the interesting option of asynchronously communicating micro-agents is explored. It is shown that several alternative mechanisms to handle dependencies and to distribute the control of the instruction ordering are feasible. Some of these lead to a significant speed-up.

Finally, there is also a trend to simplify the processor data-path and to keep the instruction set as small as possible (RISC processor style). Within the class of weakly parallel processors for image and video processing, this was already reflected in the previously mentioned architectures. In CHAPTER 18 (Hall et al.), however, this is taken even further by considering bit-serial processing elements which communicate in an SIMD array. The use of special instructions and a custom memory organisation nevertheless makes global data-dependent operations possible. This parallel programmable image processor is mainly oriented to wood inspection applications.

Within the class of massively parallel machines, the main evolution is also towards more customisation. The majority of the applications targeted to such machines appears to come mainly from the scientific and numerical computing fields. In CHAPTER 19 (Vankats), a new shared-memory multi-processor based on hypercube connections is proposed. The dedicated memory organisation with a directory-based cache coherence scheme is the key to improved speed. An application of a fast DCT scheme mapped to such parallel machines is studied in CHAPTER 20 (Christopoulos et al.). Here, the emphasis lies on the influence of the algorithmic parameters and the load balancing on the efficiency of the parallel mapping. Efficient massive parallelism is only achievable for large system parameters. The power of a "general-purpose" array of processors realized on customizable field-programmable gate arrays (FPGAs) is demonstrated in CHAPTER 21 (Champeau et al.). This combination makes it possible to extend the customisation further without overly limiting the functionality. An efficient realisation of parallel text matching is used as a test-case to show the advantages of the approach.

Compiler support is a key issue for all of these parallel programmable machines, so all the novel architectures have been developed with this in mind. Hence, each of the contributions in CHAPTER 14 (Vissers et al.), CHAPTER 15 (Roenner et al.), CHAPTER 18 (Hall et al.), CHAPTER 21 (Champeau et al.) and CHAPTER 19 (Vankats) devotes a section to the compilation issues. Most of these compilers can, however, benefit from the novel insights and techniques which are emerging in the compilation field, as addressed in section 4.
4 PARALLEL COMPILATION FOR APPLICATION-SPECIFIC AND GENERAL-PURPOSE ARCHITECTURES
As already mentioned, the key driving force towards more automated and more effective methodologies comes from the reduced design time available to system designers. In order to obtain these characteristics, methodologies have to be targeted towards particular application domains and target architecture styles. This is also true for the domain of parallel architectures. Still, a number of basic steps do recur in the methodologies, and an overview of the major compilation steps in such a targeted methodology is provided in CHAPTER 22 (Feautrier). In that contribution, the emphasis lies on array data-flow analysis, scheduling of the parallel operations on the time axis, allocation to processors, and processor code generation including communication synthesis. Even though this survey is mainly oriented to compilation for programmable machines, most of the concepts recur in the field of custom array synthesis (see also CHAPTER 10 (Riem et al.) and CHAPTER 11 (Rosseel et al.)). Still, the detailed realisation of the algorithmic techniques used for the design automation typically differs depending on the specific characteristics of the domain (see also below).

The other papers in the compilation category address specific tasks in the global methodology. Representative work in each of the different stages is collected in this book. The order in which these tasks are addressed here is not fully fixed, but still most researchers converge on a methodology which is close to what is presented here.

The first step is of course the representation of the algorithm to be mapped in a formal model, suitable for manipulation by the design automation techniques. The limitation of this model to affine, manifest index functions has been partly removed in the past few years. Important in this process is that the resulting models should still be amenable to the vast amount of compilation/synthesis techniques which operate on the affine model. This also means that array data-flow analysis should remain feasible. Interesting extensions to this "conventional" model which meet these requirements are proposed in CHAPTER 23 (Held et al.) and CHAPTER 24 (Rapanotti et al.). The restriction to linear or affine index functions can be extended to piece-wise regular affine cases by a normal-form decomposition process. This makes it possible to convert integer division, modulo, ceiling and floor functions to the existing models, as illustrated in CHAPTER 23 (Held et al.). Moreover, so-called linearly bounded lattices can then also be handled. The restrictions can be removed even further by considering the class of "integral" index functions, as studied in CHAPTER 24 (Rapanotti et al.). This makes it possible to handle more complicated cases as well, occurring e.g. in the knapsack algorithm. Especially by extending the so-called uniformisation step in the design trajectory, it is still possible to arrive at synthesizable descriptions. There is also hope to deal with part of the data-dependent cases in this way. Finally, it is also possible to consider the problem of modelling from another point of view, namely as a matching between primitive operations for which efficient parallel implementations are known, and the algorithm to be mapped. This approach is taken in CHAPTER 25 (Rangaswami), where a functional programming style with recursion is advocated. By providing a library of mappable functions, it is then possible to derive different options for compiling higher-level functions and to characterize each of the alternatives in terms of cost.
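As a small editorial illustration of the kind of normal-form decomposition of modulo and integer-division index functions mentioned above, the Python sketch below rewrites a loop whose access uses `i % 2` as two residue-class loops with purely affine indices. The arrays and the computation are placeholders of our own choosing, not an example taken from CHAPTER 23.

```python
import numpy as np

N = 16
A = np.arange(2, dtype=float)            # two coefficients, accessed as A[i % 2]
x = np.arange(N, dtype=float)

# Original loop: the index function i % 2 is not affine in i.
y_ref = np.zeros(N)
for i in range(N):
    y_ref[i] = A[i % 2] * x[i]

# Normal-form decomposition: split the iteration space into the residue
# classes i = 2k and i = 2k + 1; within each class the array access is
# purely affine (A[0] and A[1] respectively), so affine analysis applies.
y = np.zeros(N)
for k in range(N // 2):
    y[2 * k] = A[0] * x[2 * k]           # class i = 2k
    y[2 * k + 1] = A[1] * x[2 * k + 1]   # class i = 2k + 1

assert np.allclose(y, y_ref)
```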
Once the initial algorithm has been brought into this manipulatable form, it is usually necessary
to apply a number of high-level algorithmic transformations to improve the efficiency of the eventual architecture realisations (see also CHAPTER 10 (Riem et al.) and CHAPTER 11 (Rosseel et al.)). Support for these is considered in CHAPTER 26 (Durrieu et al.), where provably correct small transformations allow the designer to interactively modify the original algorithm into the desired form. The uniformisation transformation addressed in CHAPTER 24 (Rapanotti et al.) also falls in principle under this stage, but for that purpose more automated techniques have also become available lately.

Once the algorithm has a suitable form for the final mapping stages, it is usually assumed that all index functions are uniform and manifest, and that the algorithm has been broken up into several pure loop nests. For each of these, the scheduling, allocation and code generation/communication synthesis steps then have to be performed. Within the target domain of massively parallel machines (either custom or programmable), the notion of affine mapping functions has been heavily exploited up to now (see also CHAPTER 22). For instance, the work in CHAPTER 27 (Bouchitté et al.) considers the mapping of evaluation trees onto a parallel machine where communication and computation can coincide. This assumption complicates the process considerably, and heuristics are needed and proposed to handle several practical cases within fine- and coarse-grain architectures.

It is however clear from several practical designs that purely affine mappings do not always lead to optimal designs. This is clearly illustrated in CHAPTER 28 (Werth et al.) for both scheduling and communication synthesis, for the test-case of the so-called Lamport loop. Therefore, several researchers have started looking at extensions to the conventional methods. A non-unimodular mapping technique including extended scheduling/allocation and especially communication synthesis is proposed in CHAPTER 29 (Reffay et al.). For the Cholesky factorisation kernel, it is shown that significantly increased efficiency can be obtained, while still providing automatable methods.

Up till now, we have however still restricted ourselves to mapping onto homogeneous, locally connected parallel machines. As already demonstrated in section 3, the use of weakly parallel and not necessarily homogeneous architectures is finding a large market in high-throughput signal processing, as in video and image applications. As a result, much research has lately been spent on improved compilation techniques for these architectures too. Most of this work originates from the vast amount of know-how which has been collected in the high-level synthesis community on mapping irregular algorithms onto heterogeneous single-processor architectures. Several representative contributions in this area are taken up in this book. In CHAPTER 30 (Schwiegershausen et al.), the scheduling problem of coarse-grain tasks onto a heterogeneous multi-processor is considered. The assumption is that several processor styles are available and that the mapping of the tasks onto these styles has already been characterized. Given that information, it is possible to formulate an integer programming problem which makes it possible to solve several practical applications in the image and video processing domain.
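To make the notion of an affine space-time mapping discussed above concrete, the short Python sketch below checks a hand-picked schedule and processor allocation for a uniform convolution recurrence. The dependence vectors, the timing vector tau = (1, 1) and the allocation sigma = (1, 0) are illustrative choices of ours, not a mapping taken from any of the chapters.

```python
import numpy as np
from itertools import product

# Uniform recurrence for y(i) = sum_j h(j) * x(i - j), written over the 2-D
# index space (i, j).  Its uniform dependence vectors are:
deps = [np.array(d) for d in [(0, 1),    # running-sum accumulation along j
                              (1, 1),    # propagation of x(i - j)
                              (1, 0)]]   # broadcast of h(j) along i

tau = np.array([1, 1])                   # affine schedule    t(i, j) = i + j
sigma = np.array([1, 0])                 # processor mapping  p(i, j) = i

# Causality: every dependence must take at least one time step.
assert all(tau @ d > 0 for d in deps)

# Conflict-freedom: the space-time transformation [tau; sigma] is non-singular.
T = np.vstack([tau, sigma])
assert round(np.linalg.det(T)) != 0

# Sanity check on a small domain: no two iterations share (time, processor).
pts = {(int(tau @ np.array(ij)), int(sigma @ np.array(ij)))
       for ij in product(range(6), range(4))}
assert len(pts) == 6 * 4
```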
When the granularity of the tasks is reduced, an ILP approach is no longer feasible, and other scheduling/allocation techniques then have to be considered. This is the case, for instance, in CHAPTER 31, where a static list scheduling technique is presented.
SUBSPACE METHODS IN SYSTEM IDENTIFICATION AND SOURCE LOCALIZATION

P.A. REGALIA
Département Signal et Image, Institut National des Télécommunications, 9, rue Charles Fourier, 91011 Evry cedex, France
[email protected]
ABSTRACT. Subspace methods have received increasing attention in signal processing and control in recent years, due to their successful application to the problems of multivariable system identification and source localization. This paper gives a tutorial overview of these two applications, in order to draw out features common to both problems. In particular, both problems are based on finding spanning vectors for the null space of a spectral density matrix characterizing the available data. This is expressed numerically in various formulations, usually in terms of extremal singular vectors of a data matrix, or in terms of orthogonal filters which achieve decorrelation properties of filtered data sequences. In view of this algebraic similarity, algorithms designed for one problem may be adapted to the other. In both cases, though, successful application of subspace methods depends on some knowledge of the required filter order of spanning vectors for the desired null space. Data encountered in real applications rarely give rise to finite order filters if theoretically "exact" subspace fits are desired. Accordingly, some observations on the performance of subspace methods in "reduced order" cases are developed.

KEY WORDS. Subspace estimation, autonomous model, system identification, source localization.

1 INTRODUCTION
Subspace methods have become an attractive numerical approach to practical problems of modern signal processing and control. The framework of subspace methods has evolved simultaneously in source localization [1], [2], [6], [8], and system identification [4], [10]. Although these application areas have different physical origins, the mathematical structure
of the problems they aim to solve is laced with parallels. The intent of this paper is to provide a brief overview of the structural similarities between system identification and source localization. To the extent that a common objective may be established for seemingly different application areas, numerical algorithms in one area find an immediate range of applications in neighboring areas. Our presentation is not oriented at the numerical algorithm level, but rather abstracted one level to the common algebraic framework which underlies subspace methods.

Section 2 reviews the underlying signal structure suited for subspace methods, in terms of an autonomous model plus white noise. Section 3 interprets the underlying signal structure in the context of multivariable system identification. Section 4 then shows how this same signal structure intervenes in the broadband source localization problem, and stresses similarities in objectives with the system identification problem. Section 5 then examines the approximation obtained in a particular system identification problem when the order chosen for the identifier is too small, as generically occurs in practice where real data may not admit a finite dimensional model. We shall see that subspace methods decompose the available data into an autonomous part plus white noise, even though this may not be the "true" signal structure.
2 BACKGROUND
Most subspace methods are designed for observed vector-valued signals (denoted by $\{y(\cdot)\}$) consisting of a usable signal $\{s(\cdot)\}$ and an additive disturbance term $\{b(\cdot)\}$, as in

$$y(n) = s(n) + b(n).$$

We assume that these (column) vectors consist of $p$ elements each. In most subspace applications, one assumes that the disturbance term is statistically independent of the usable signal, and that it is white:

$$E[b(n)\,b^*(m)] = \begin{cases} I_p, & m = n;\\ 0, & m \neq n.\end{cases}$$

(Here and in what follows, the superscript * will denote (conjugate) transposition.)
The usable signal is often assumed to satisfy an autonomous model of the form

$$B_0\,s(n) + B_1\,s(n-1) + \cdots + B_M\,s(n-M) = 0, \qquad\text{for all } n, \tag{1}$$
for some integer $M$. Here the matrices $B_k$ are "row matrices," i.e., a few row vectors stacked atop one another. Examples of this relation will be brought out in Sections 3 and 4. If we consider the covariance matrix of the usable signal, viz.

$$E\left\{\begin{bmatrix} s(n)\\ s(n-1)\\ \vdots\\ s(n-M)\end{bmatrix}[\,\cdot\,]^*\right\} = \begin{bmatrix} R_0 & R_1 & \cdots & R_M\\ R_1^* & R_0 & \ddots & \vdots\\ \vdots & \ddots & \ddots & R_1\\ R_M^* & \cdots & R_1^* & R_0\end{bmatrix} \triangleq \mathcal{R}_M, \qquad\text{where } R_k = E[s(n)\,s^*(n-k)] = R_{-k}^*,$$
then the assumption that $\{s(\cdot)\}$ satisfies an autonomous model implies that
$$\begin{bmatrix} R_0 & R_1 & \cdots & R_M\\ R_1^* & R_0 & \ddots & \vdots\\ \vdots & \ddots & \ddots & R_1\\ R_M^* & \cdots & R_1^* & R_0\end{bmatrix}\begin{bmatrix} B_0^*\\ B_1^*\\ \vdots\\ B_M^*\end{bmatrix} = \begin{bmatrix} 0\\ 0\\ \vdots\\ 0\end{bmatrix}. \tag{2}$$

This suggests that the matrix coefficients $B_k$ of the autonomous model could be found by identifying the null space of the matrix $\mathcal{R}_M$.
3 SYSTEM IDENTIFICATION

linear system concatenated into a single time-series vector:

$$s(n) = \begin{bmatrix} s_1(n)\\ s_2(n)\end{bmatrix}\quad\begin{matrix}\}\ p-r \text{ inputs}\\ \}\ r \text{ outputs}\end{matrix}$$

Suppose the inputs and outputs are related by a linear system with unknown transfer function $H(z)$:

$$s_2(n) = H(z)\,s_1(n). \tag{3}$$
(This notation means that $s_2(n)$ is the output sample at time $n$ from a linear system with transfer matrix $H(z)$ when driven by the sequence $\{s_1(\cdot)\}$, with $s_1(n)$ the most recent input sample.) Suppose that $H(z)$ is a rational function. This means that $H(z)$ can be written in terms of a matrix fraction description [3]

$$H(z) = [D(z)]^{-1}\,N(z) \tag{4}$$

for two (left coprime) matrix polynomials

$$N(z) = N_0 + N_1 z^{-1} + \cdots + N_M z^{-M} \qquad [r \times (p-r)]$$
$$D(z) = D_0 + D_1 z^{-1} + \cdots + D_M z^{-M} \qquad (r \times r).$$

The relations (3) and (4) combine as

$$D(z)\,s_2(n) = N(z)\,s_1(n),$$

which is to say

$$D_0\,s_2(n) + D_1\,s_2(n-1) + \cdots + D_M\,s_2(n-M) = N_0\,s_1(n) + N_1\,s_1(n-1) + \cdots + N_M\,s_1(n-M), \qquad\text{for all } n.$$

This in turn may be rearranged as

$$\underbrace{[\,N_0\;\; -D_0\;\;\; N_1\;\; -D_1\;\;\cdots\;\; N_M\;\; -D_M\,]}_{[\,B_0\;\;\;B_1\;\;\cdots\;\;B_M\,]}\begin{bmatrix} s(n)\\ s(n-1)\\ \vdots\\ s(n-M)\end{bmatrix} = 0, \qquad\text{for all } n,$$
which leads to a simple physical interpretation: the autonomous relation (1) holds if and only if the signal $\{s(\cdot)\}$ contains the inputs and outputs from a finite-dimensional linear system. We also see that the coefficients of a matrix fraction description may be concatenated into null vectors of the matrix $\mathcal{R}_M$. One subtle point does arise in this formulation: the dimension of the null space of $\mathcal{R}_M$ may exceed $r$ (the number of outputs) [2], in such a way that uniqueness of a determined matrix fraction description is not immediately clear.

Some greater insight may be obtained by writing the subspace equations in the frequency domain. To this end, consider the power spectral density matrix

$$\mathcal{S}_s(e^{j\omega}) = \sum_{k=-\infty}^{\infty} E[s(n)\,s^*(n-k)]\,e^{-jk\omega} \qquad (p\times p)$$

which is nonnegative definite for all $\omega$. At the same time, let

$$B(z) = B_0 + B_1 z^{-1} + \cdots + B_M z^{-M}, \qquad [(p-r)\times p]$$
where the matrix coefficients $B_k$ are associated with the autonomous signal model. By taking Fourier transforms of (2), one may verify

$$B(e^{j\omega})\,\mathcal{S}_s(e^{j\omega}) = 0, \qquad\text{for all } \omega. \tag{5}$$
The row vectors of $B(e^{j\omega})$ then span the null space of $\mathcal{S}_s(e^{j\omega})$ as a function of frequency. In case $B(z)$ consists of a single row vector, it is straightforward to verify that the smallest order $M$ for the vector polynomial $B(z)$ which may span this null space [as in (5)] is precisely the smallest integer $M$ for which the block Toeplitz matrix $\mathcal{R}_M$ becomes singular [as in (2)].

For the system identification context studied thus far, one may verify that the spectral density matrix $\mathcal{S}_s(e^{j\omega})$ may be decomposed as

$$\mathcal{S}_s(e^{j\omega}) = \begin{bmatrix} I_{p-r}\\ H(e^{j\omega})\end{bmatrix}\,\mathcal{S}_{s_1}(e^{j\omega})\,\begin{bmatrix} I_{p-r} & H^*(e^{j\omega})\end{bmatrix}, \qquad\text{where}$$

$$\mathcal{S}_{s_1}(e^{j\omega}) = \sum_{k=-\infty}^{\infty} E[s_1(n)\,s_1^*(n-k)]\,e^{-jk\omega} \qquad [(p-r)\times(p-r)]$$

is the power spectral density matrix of the input sequence $\{s_1(\cdot)\}$. This shows that the rank of $\mathcal{S}_s(e^{j\omega})$ is generically equal to the number of free inputs to the system ($= p-r$), assuming further dependencies do not connect the components of $\{s_1(\cdot)\}$ (persistent excitation). As the outputs $\{s_2(\cdot)\}$ are filtered versions of $\{s_1(\cdot)\}$, their inclusion does not alter the rank of $\mathcal{S}_s(e^{j\omega})$.
Next, we can observe that

$$[\,N(e^{j\omega})\;\; -D(e^{j\omega})\,]\;\mathcal{S}_s(e^{j\omega}) = 0, \qquad\text{for all } \omega.$$

With a little further work, one may show that provided the $r$ row vectors of the matrix $[\,N(e^{j\omega})\; -D(e^{j\omega})\,]$ are linearly independent for (almost) all $\omega$ (which amounts to saying that the normal rank of $[\,N(z)\; -D(z)\,]$ is full), then the ratio $[D(z)]^{-1}N(z)$ must furnish the system $H(z)$. Note that if $N(z)$ and $D(z)$ are both multiplied from the left by an invertible matrix (which may be a function of $z$), the ratio $[D(z)]^{-1}N(z)$ is left unaltered. As a particular case, consider a Gramian matrix decomposition

$$[\,N(e^{j\omega})\;\; -D(e^{j\omega})\,]\begin{bmatrix} N^*(e^{j\omega})\\ -D^*(e^{j\omega})\end{bmatrix} = F(e^{j\omega})\,F^*(e^{j\omega}),$$

with the $r\times r$ matrix $F(z)$ minimum phase (i.e., causal and causally invertible). It is then easy to verify that the row vectors of the matrix

$$[F(e^{j\omega})]^{-1}\,[\,N(e^{j\omega})\;\; -D(e^{j\omega})\,]$$
are orthonormal for all $\omega$, and thus yield orthonormal spanning vectors for the null space of $\mathcal{S}_s(e^{j\omega})$. The system identification problem is then algebraically equivalent to finding orthonormal spanning vectors for the null space of $\mathcal{S}_s(e^{j\omega})$.
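As a toy numerical counterpart of this discussion (our own sketch, not an algorithm from the chapter), the Python fragment below recovers a short FIR transfer function $H(z) = h_0 + h_1 z^{-1}$, i.e. $N(z) = h_0 + h_1 z^{-1}$ and $D(z) = 1$, from the null vector of the stacked input/output data matrix. The coefficients $h_0, h_1$ and the data length are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
h0, h1 = 0.8, -0.4                       # true system: s2(n) = h0*s1(n) + h1*s1(n-1)
s1 = rng.standard_normal(2000)
s2 = h0 * s1 + h1 * np.concatenate(([0.0], s1[:-1]))

# Stack s(n) = [s1(n), s2(n)] over lags 0..M with M = 1, giving rows
# [s1(n), s2(n), s1(n-1), s2(n-1)]; the null vector of this matrix is
# proportional to [N0, -D0, N1, -D1] = [h0, -1, h1, 0].
Z = np.column_stack([s1[1:], s2[1:], s1[:-1], s2[:-1]])
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
v = Vt[-1]
v = v / (-v[1])                          # normalise so that D0 = 1
print(v)                                 # ~ [h0, -1, h1, 0]
assert np.allclose(v, [h0, -1.0, h1, 0.0], atol=1e-6)
```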
4 BROADBAND SOURCE LOCALIZATION
Source localization algorithms aim to determine the direction of arrival of a set of waves impinging on a sensor array. We review the basic geometric structure of this problem, in order to obtain the same characterization exposed in the system identification context. The outputs of a p-sensor array are now modelled as

$$y(n) = \mathcal{A}(z)\,s(n) + b(n),$$

where the elements of $s(n)$ are mutually independent source signals, and where $\{b(\cdot)\}$ is an additive white noise vector. The columns of $\mathcal{A}(z)$ contain the successive transfer functions connecting a source at a given spatial location to the successive array outputs. Each column of $\mathcal{A}(z)$ is thus called a steering vector, which models spatial and frequential filtering effects proper to the transmission medium and array geometry. The problem is to deduce the spatial locations of the emitting sources, given the array snapshot sequence $\{y(n)\}$.

The spectral density matrix from the sensor outputs now becomes

$$\mathcal{S}_y(e^{j\omega}) = \sum_{k=-\infty}^{\infty} E[y(n)\,y^*(n-k)]\,e^{-jk\omega} = \mathcal{A}(e^{j\omega})\,\mathcal{S}_s(e^{j\omega})\,\mathcal{A}^*(e^{j\omega}) + I_p,$$

provided the noise term $\{b(\cdot)\}$ is indeed white. Here $\mathcal{S}_s(e^{j\omega})$ is the power spectral density matrix of the emitting sources. Provided the number of sources is strictly less than the number of sensors, the first term on the right-hand side (i.e., the signal-induced component) is rank deficient for all $\omega$. It turns out that its null space completely characterizes the solution of the problem [6], [8]. For if we find orthonormal spanning vectors for the null space of the signal-induced term, then we will have constructed the orthogonal complement space to that spanned by the columns of $\mathcal{A}(e^{j\omega})$. This, combined with knowledge of the array response pattern versus emitter localization, is sufficient to recover the information contained in $\mathcal{A}(z)$, namely the spatial locations of the sources [8]. More detail on constructing orthonormal spanning vectors for this null space, in the context of adaptive filtering, is developed in [2], [6] and the references therein.

We can observe some superficial similarities between the system identification problem and the source localization problem. In both cases, the usable signal component induces a rank-deficient power spectral density matrix, and in both cases, the information so sought (a linear system or spatial location parameters) is entirely characterized by the null space of the singular spectral density matrix in question. Accordingly, algorithms designed for subspace system identification can be used for subspace source localization, and vice-versa. See, e.g., [5], [9].
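For illustration, here is a minimal narrowband special case of this null-space idea for a uniform linear array, in the MUSIC style; the array geometry, source angles and powers are invented for the example, and the broadband treatment of the chapter reduces to this picture at a single frequency.

```python
import numpy as np

p, sources = 6, 2                            # sensors and emitting sources
angles_true = np.deg2rad([-20.0, 35.0])      # directions of arrival

def steer(theta):                            # ULA steering vector, half-wavelength spacing
    return np.exp(-1j * np.pi * np.sin(theta) * np.arange(p))

A = np.column_stack([steer(t) for t in angles_true])
Ry = A @ np.diag([2.0, 1.0]) @ A.conj().T + 0.1 * np.eye(p)   # signal + white noise

# Noise subspace = eigenvectors of the (p - sources) smallest eigenvalues;
# it is the orthogonal complement of span{steering vectors}.
eigval, eigvec = np.linalg.eigh(Ry)
En = eigvec[:, :p - sources]

grid = np.deg2rad(np.linspace(-90, 90, 721))
spectrum = np.array([1.0 / (np.linalg.norm(En.conj().T @ steer(t)) ** 2 + 1e-12)
                     for t in grid])

# The two largest local maxima sit at the true directions of arrival.
peaks = [i for i in range(1, len(grid) - 1)
         if spectrum[i] > spectrum[i - 1] and spectrum[i] > spectrum[i + 1]]
best = sorted(peaks, key=lambda i: spectrum[i])[-2:]
print(sorted(np.rad2deg(grid[best])))        # ~ [-20, 35]
```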
5 THE UNDERMODELLED CASE
The development thus far has, for convenience, assumed that the order $M$ of the autonomous signal model (1) was available. In practice, the required filter order $M$ is highly signal-dependent, posing the obvious dilemma of how to properly choose $M$ in cases where a priori information on the data is inadequate.
to show the result. Note also that, as expected, a vector in the null space of $\mathcal{R}_M$ yields the coefficients of the ARMA model in question.

Suppose that the actual sequences $\{s_1(\cdot)\}$ and $\{s_2(\cdot)\}$ are related as

$$s_2(n) = \sum_{k=0}^{\infty} h_k\, s_1(n-k). \tag{6}$$
k=O is infinite dimensional (i.e., not rational). In this case, the covariance matrix T~M will have full rank irrespective of what value we choose for the integer M. Consider then trying to find an ARMA signal model which is "best compatible" with RM. To this end, let two sequences {-~1(')} and {s2(')} be related as M M k=O k=O where the coefficients { a k ) and {bk) remain to be determined. We note here that, with fi(n) = [h(n)Jh(n)]and ...
~(n-1)
7?.M -- E
.
"1"
[
,
w we shall have bo --a0 "
= o,
(7)
so that $\tilde{\mathcal{R}}_M$ is always singular. Set now

$$\hat{y}(n) = \tilde{s}(n) + \begin{bmatrix}\hat{b}_1(n)\\ \hat{b}_2(n)\end{bmatrix},$$

where the disturbance terms $\{\hat{b}_1(\cdot)\}$ and $\{\hat{b}_2(\cdot)\}$ are chosen to render $\{\hat{y}(\cdot)\}$ compatible with the true data. In particular, the covariance matrix built from $\hat{y}(n), \ldots, \hat{y}(n-M)$ takes the form

$$\tilde{\mathcal{R}}_M + \mathcal{R}_b,$$

where $\mathcal{R}_b$ is the covariance matrix built from $\{\hat{b}_1(\cdot)\}$ and $\{\hat{b}_2(\cdot)\}$. This becomes compatible
with the true data provided we set

$$\mathcal{R}_b = \mathcal{R}_M - \tilde{\mathcal{R}}_M.$$

Given only that $\tilde{\mathcal{R}}_M$ is singular, though, a standard result in matrix approximation theory gives

$$\|\mathcal{R}_b\| = \|\mathcal{R}_M - \tilde{\mathcal{R}}_M\| \geq \lambda_{\min}(\mathcal{R}_M).$$

As a particular case, consider the choice

$$\tilde{\mathcal{R}}_M = \mathcal{R}_M - \lambda_{\min}\,I.$$

This retains a block Toeplitz structure as required, but is now positive semi-definite.
we have similarly a corresponding relation scaled by the factor $1/(1-\lambda_{\min})$, so that the matching of cross-correlation terms from (8) may be expressed as

$$\hat{h}_k = \begin{cases}\dfrac{h_k}{1-\lambda_{\min}}, & k = 0, 1, \ldots, M;\\[2mm] 0\ (= h_k), & k = -1, -2, \ldots, -M.\end{cases}$$
This shows that the first few terms of the impulse response of $\hat{H}(z)$ agree, to within a factor $1/(1-\lambda_{\min})$, with those produced by the true system $H(z)$. Similarly, we can also observe that

$$E[s_2(n)\,s_2(n-k)] = \sum_{i=0}^{\infty} h_i\,h_{i+k} \triangleq r_k,$$

if $\{s_1(\cdot)\}$ is unit-variance white noise. This gives the $k$th term of the autocorrelation sequence associated to $H(z)$. For the reconstructed model, we likewise have

$$E[\hat{s}_2(n)\,\hat{s}_2(n-k)] = (1-\lambda_{\min})\sum_{i=0}^{\infty} \hat{h}_i\,\hat{h}_{i+k} = (1-\lambda_{\min})\,\hat{r}_k.$$
The matching properties (10) then show that

$$\hat{r}_k = \begin{cases}\dfrac{r_0 - \lambda_{\min}}{1-\lambda_{\min}}, & k = 0;\\[2mm] \dfrac{r_k}{1-\lambda_{\min}}, & k = 1, 2, \ldots, M;\end{cases}$$

which reveals how the correlation sequences compare. A slightly different strategy is investigated in [7], which builds the function $\hat{H}(z)$ from an extremal eigenvector of a Schur complement of $\mathcal{R}_M$. This can improve the impulse and correlation matching properties considerably [7].
6 CONCLUDING REMARKS
We have shown how the system identification and source localization problems may be addressed in a common framework. In both cases, the desired information is characterized in terms of spanning vectors of the null space of a power spectral density matrix. Numerical methods for determining the null space have appeared in different state space formulations [2], [4], [6], [10], which are, for the most part, oriented around orthogonal transformations applied directly to the available data. We have also examined the influence of undermodelling. Some recent work in this direction [7] shows that subspace methods correspond to total least-squares equation error methods. This can yield weaker subspace fits in undermodelled cases compared to Hankel norm or $\mathcal{H}_\infty$ subspace fits. The reduced order system so constructed, however, is intimately connected to low rank matrix approximation, which in turn can be expressed in terms of
interpolation properties relating the impulse and correlation sequences between the true system and its reduced order approximant. More detail on these interpolation properties is available in [7].

References
[1] J. A. Cadzow, "Multiple source location - the signal subspace approach," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 38, pp. 1110-1125, July 1990.
[2] I. Fijalkow, Estimation de Sous-Espaces Rationnels, doctoral thesis, École Nationale Supérieure des Télécommunications, Paris, 1993.
[3] T. Kailath, Linear Systems, Prentice-Hall, Englewood Cliffs, NJ, 1980.
[4] M. Moonen, B. De Moor, L. Vandenberghe, and J. Vandewalle, "On- and off-line identification of linear state-space models," Int. J. Control, vol. 49, pp. 219-232, 1989.
[5] P. A. Regalia, "Adaptive IIR filtering using rational subspace methods," Proc. ICASSP, San Francisco, March 1992.
[6] P. A. Regalia and Ph. Loubaton, "Rational subspace estimation using adaptive lossless filters," IEEE Trans. Signal Processing, vol. 40, pp. 2392-2405, October 1992.
[7] P. A. Regalia, "An unbiased equation error identifier and reduced order approximations," IEEE Trans. Signal Processing, vol. 42, pp. 1397-1412, June 1994.
[8] G. Su and M. Morf, "The signal subspace approach for multiple wide-band emitter location," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 31, pp. 1502-1522, December 1983.
[9] F. Vanpoucke and M. Moonen, "A state space method for direction finding of wideband emitters," Proc. EUSIPCO-94, Edinburgh, Sept. 1994, pp. 780-783.
[10] M. Verhaegen and P. Dewilde, "Subspace model identification (Pts. 1 and 2)," Int. J. Control, vol. 56, pp. 1187-1241, 1992.
PIPELINING THE INVERSE UPDATES RLS ARRAY BY ALGORITHMIC ENGINEERING
J.G. McWhirter and I.K. Proudler
still uses orthogonal transformations but also produces the optimum coefficients every sample time. Verhaegen [13] has shown that, provided the input is persistently exciting (i.e. is sufficiently wideband), this algorithm has bounded errors and should therefore be numerically stable.

It is worth noting that the two algorithms discussed above can be classified in terms of the nomenclature of Kalman filtering. It is well known that RLS optimisation is equivalent to a Kalman filter in which the state transition matrix is unit diagonal. In these terms, the original QRD-based algorithm [1][4] constitutes a square-root information algorithm, whereas the inverse updates method [9] constitutes a square-root covariance algorithm. Viewed in this way the inverse updates algorithm is not new; indeed, Verhaegen's analysis of this algorithm [13] predates its publication in the signal processing literature.

In this paper we address the problem of pipelining the inverse updates algorithm. This is highly non-trivial since the basic algorithm requires a matrix-vector product to be completed before the same matrix can be updated. This limits the extent to which the algorithm can be pipelined and hence the effectiveness of any systolic implementation. In terms of the signal flow graph (SFG) representation used here, the algorithm exhibits a long feedback loop which defeats the usual methods for deriving a systolic array.

We begin, in section 2, by reviewing the inverse updates method. In section 3, the basic algorithm is transformed into a form which has no long feedback loops using the emerging technique of algorithmic engineering (McWhirter [5], Proudler and McWhirter [10]). The derivation of a systolic array is then reduced to a straightforward application of the cut theorem and retiming techniques (Megson [7]). Two alternative systolic arrays are derived; the first is identical to the one originally presented (without proof) by Moonen and McWhirter [8]; the other was first presented by McWhirter and Proudler in [6].

2 INVERSE UPDATES METHOD

Consider the least squares estimation of the scalar $y(n)$ by a linear combination of the $p$ components of the vector $\underline{x}_p(n)$. The (p-dimensional) vector of optimum coefficients at time $n$, $\hat{\underline{\omega}}_p(n)$, is determined by
$$\min_{\underline{\omega}_p(n)}\ \big\|\,\underline{y}(n) + X_p(n)\,\underline{\omega}_p(n)\,\big\|^2 \tag{1}$$

where

$$\underline{y}(n) = [y(1), \ldots, y(n)]^T \qquad\text{and}\qquad X_p(n) = [\underline{x}_p(1), \ldots, \underline{x}_p(n)]^T. \tag{2}$$
The solution to this problem using QR decomposition is well known [2]. The optimum coefficient vector $\hat{\underline{\omega}}_p(n)$ is given by

$$\hat{\underline{\omega}}_p(n) = -R_p^{-1}(n)\,\underline{u}_p(n) \tag{3}$$

where $R_p(n)$ is a $p\times p$ upper triangular matrix and $\underline{u}_p(n)$ is a p-dimensional vector. These quantities may be calculated recursively via the equation

$$\hat{Q}_p(n)\begin{bmatrix}\beta R_p(n-1) & \beta\underline{u}_p(n-1)\\ \underline{x}_p^T(n) & y(n)\end{bmatrix} = \begin{bmatrix} R_p(n) & \underline{u}_p(n)\\ \underline{0}^T & \alpha(n)\end{bmatrix}. \tag{4}$$
related to the Kalman gain vector). Secondly, the orthogonal matrix $\hat{Q}_y(n)$ can be generated from knowledge of the matrix $R_y^{-T}(n-1)$ and the new data vector. Specifically, $\hat{Q}_y(n)$ is the orthogonal matrix given by

$$\hat{Q}_y(n)\begin{bmatrix}\beta I\\ \hat{\underline{e}}_y^T(n)\end{bmatrix} = \begin{bmatrix} V\\ \underline{0}^T\end{bmatrix} \tag{8}$$

where

$$\hat{\underline{e}}_y(n) = R_y^{-T}(n-1)\begin{bmatrix}\underline{x}_p(n)\\ y(n)\end{bmatrix}. \tag{9}$$

This can easily be proved as follows. Let

$$\hat{Q}_y(n)\begin{bmatrix}\beta R_y(n-1)\\ [\,\underline{x}_p^T(n)\;\; y(n)\,]\end{bmatrix} = \begin{bmatrix} U\\ \underline{\xi}^T\end{bmatrix}. \tag{10}$$

From equation (9) it follows that

$$\begin{bmatrix}\beta R_y(n-1)\\ [\,\underline{x}_p^T(n)\;\; y(n)\,]\end{bmatrix} = \begin{bmatrix}\beta I\\ \hat{\underline{e}}_y^T(n)\end{bmatrix} R_y(n-1) \tag{11}$$
and hence $\underline{\xi} = \underline{0}$. If $\hat{Q}_y(n)$ is constructed as a sequence of Givens rotations which preserves the structure of the upper triangular matrix in equation (10), it follows that $U = R_y(n)$. Hence $\hat{Q}_y(n)$ is equivalent to the orthogonal matrix defined in equation (5).

The inverse updates algorithm can thus be summarised as follows: given the new data $\underline{x}_p(n), y(n)$, calculate $\hat{\underline{e}}_y(n)$ (equation (9)); using $\hat{\underline{e}}_y(n)$, calculate $\hat{Q}_y(n)$ (equation (8)); using $\hat{Q}_y(n)$, update $R_y^{-T}(n-1)$ (equation (7)); extract the least squares coefficients from $R_y^{-T}(n)$ (equation (6)). We will now show how a systolic array to implement this algorithm may be designed fairly simply by means of algorithmic engineering.
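The sketch below is a compact NumPy rendering of this covariance-type (inverse updating) recursion, written in the standard square-root inverse-QRD form rather than as the authors' systolic signal flow graph; the initialisation, the forgetting factor and the variable names (S for the propagated square-root factor, k for the gain) are our own illustrative choices, not notation from the chapter.

```python
import numpy as np

def inverse_updates_rls(X, d, beta=0.99, delta=1e2):
    """RLS by inverse (covariance-type) QR updating: a minimal sketch.

    A square root S(n) of the inverse covariance P(n) = S S^T is propagated
    together with the weight vector, so the least-squares coefficients are
    available every sample without back-substitution.
    """
    n_samples, p = X.shape
    S = np.sqrt(delta) * np.eye(p)          # S(0): large initial P = delta * I
    w = np.zeros(p)
    for n in range(n_samples):
        x, y = X[n], d[n]
        # Pre-array of the square-root covariance update.
        pre = np.zeros((p + 1, p + 1))
        pre[0, 0] = 1.0
        pre[0, 1:] = (x @ S) / beta         # beta^{-1} x^T S(n-1)
        pre[1:, 1:] = S / beta
        # An orthogonal rotation annihilates the top-right block; obtain the
        # lower-triangular post-array via a QR factorisation of pre^T.
        q, r = np.linalg.qr(pre.T)
        signs = np.sign(np.diag(r))
        signs[signs == 0] = 1.0
        post = (signs[:, None] * r).T       # lower triangular, positive diagonal
        inv_sqrt_gamma = post[0, 0]         # gamma^{-1/2}(n)
        g = post[1:, 0]                     # gamma^{-1/2}(n) * k(n)
        S = post[1:, 1:]                    # updated square root S(n)
        k = g / inv_sqrt_gamma              # Kalman-type gain vector
        w = w + k * (y - w @ x)             # a-priori error correction
    return w

# Quick check against the true weights (no forgetting, low noise).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
d = X @ w_true + 0.01 * rng.standard_normal(500)
print(inverse_updates_rls(X, d, beta=1.0))  # ~ w_true
```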
3 ALGORITHMIC TRANSFORMATIONS

Algorithmic engineering [5][10] is an emerging technique for representing and manipulating algorithms based on the SFG representation. The power of this technique, for algorithm development, is twofold: firstly, the cells of the SFG are given precise meanings as mathematical operators, thus endowing the SFG, and any SFG derived from it, with a rigorous mathematical interpretation; secondly, unnecessary complexity in the SFG can be removed by the formation of 'block' operators. This latter concept leads to what may be termed a hierarchical SFG (HSFG) and allows the SFG to be simplified so as to reveal any pertinent structure in the algorithm. Once a suitable SFG has been derived, creating a systolic array implementation is then straightforward by means of standard techniques such as the cut theorem [7].
Figure 1. SFG for the inverse updates algorithm
An SFG for the inverse updates algorithm is shown in figure 1 for the case p=3. This SFG is obtained by combining SFGs for the three basic operations involved in the inverse updates algorithm: a matrix-vector product operator [5]; a rotation operator to update $R_y^{-T}$ [12]; and the operator for the rotation calculation defined in equation (8). The first two operators are triangular in shape and can be conformally overlaid, combining the original SFGs into one. The mathematical definitions of the elementary operators (or cells) shown in figure 1 assume that the matrix $\hat{Q}_y(n)$ is to be constructed using Givens rotations. Note that the matrix $R_y^{-T}$ is stored in the cells of the triangular block. The elements of this matrix, which has the decomposition shown in equation (6), are explicitly shown and for notational convenience are denoted by $r_{ij}$. Furthermore, we define the energy normalised weight vector $\tilde{\underline{\omega}}_p$ by

$$\tilde{\underline{\omega}}_p(n) = e_p^{-1}(n)\,\hat{\underline{\omega}}_p(n). \tag{12}$$
The sequence of events depicted in figure 1 is as follows: at time $n$, the new data $[\underline{x}^T(n), y(n)]$ is input at the top of the triangular part of the SFG. It flows through the array, interacting with the stored matrix to form the vector $\hat{\underline{e}}_y(n)$. This vector is accumulated from right to left and emerges at the left hand side of the triangular array. Here, the rotation matrix $\hat{Q}_y(n)$ is calculated and fed back into the triangular array, where it serves to update the stored matrix $R_y^{-T}(n-1)$.

It is clear that the SFG can be pipelined in the vertical direction by making horizontal cuts (e.g. cut AB in figure 1). However, it cannot be pipelined in the horizontal direction due to the contraflowing data paths. Any vertical cut (e.g. cut CD) through the SFG will cut these lines in the opposite sense, and so a delay applied to one path would necessitate an unrealisable 'anti-delay' on the other. The algorithm must be transformed so as to avoid this problem, e.g. by creating a delay on one of the contraflowing lines that can be paired with the anti-delay introduced by the action of cutting the SFG.
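The following toy Python sketch (our own illustration, not the formal cut theorem of [7]) makes the delay/anti-delay argument concrete: retiming all nodes on one side of a cut adds a delay to every edge crossing the cut in one direction and removes one from every edge crossing it in the other, so a zero-delay contraflow edge is driven to a negative, unrealisable delay.

```python
def retime(edges, lag):
    """Cut-set retiming: each edge (u, v, d) becomes d + lag[v] - lag[u].

    Returns the retimed edge list, or raises ValueError if some edge would
    need a negative ('anti-') delay, i.e. the cut is not realisable.
    """
    out = []
    for u, v, d in edges:
        d_new = d + lag.get(v, 0) - lag.get(u, 0)
        if d_new < 0:
            raise ValueError(f"edge {u}->{v} would need {d_new} delays")
        out.append((u, v, d_new))
    return out

# Feed-forward SFG: all edges cross the cut {A} | {B, C} in the same
# direction, so delaying the right-hand side by one clock cycle is fine.
feed_forward = [("A", "B", 0), ("A", "C", 0), ("B", "C", 1)]
print(retime(feed_forward, {"B": 1, "C": 1}))

# Contraflow, as in the inverse updates SFG: a zero-delay edge crosses the
# same cut in the opposite sense, so the pipeline cut demands an anti-delay.
contraflow = [("A", "B", 0), ("B", "A", 0)]
try:
    retime(contraflow, {"B": 1})
except ValueError as err:
    print("unrealisable:", err)
```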
Figure 2. HSFG for the inverse updates algorithm

The structure shown in the SFG of figure 1 is too detailed for our purposes. In what follows we will only need to consider the structure of the first column of the triangular array. Figure 2 constitutes a HSFG based on figure 1 and shows this first column explicitly (labelled $R_{y,1}^{-T}$). The left hand block represents the operator that calculates the rotation parameters, whilst the right hand block represents a $p\times p$ block of multiply/rotate cells which stores the matrix $R_{y,2}^{-T}$.
Note that the $R_{y,2}^{-T}$ triangular block consists of $p$ rows of multiply/rotate cells, whereas both the rotation calculator and the first column contain $(p+1)$ cells. As such, each of the latter two operators has been split conformally into a column of dimension $p$ and a single cell (which corresponds to the top row of the SFG in figure 1). Again for the sake of clarity, the only outputs shown in figure 2 are the normalised weight vector $\tilde{\underline{\omega}}_p^T = [\tilde{\omega}_{p,1},\ \tilde{\underline{\omega}}_{p,2}^T]$ and the normalisation factor $e_p^{-1}$. This HSFG, although visually different to the SFG in figure 1, does not represent a change in the algorithm and, accordingly, the data contraflow is still evident.

Consider the function of the $R_{y,2}^{-T}$ triangular operator. From figure 1 it is easy to see that this operator performs two tasks:

1. matrix-vector product:
$$\hat{\underline{e}}_2(n) = R_{y,2}^{-T}(n-1)\begin{bmatrix}\tilde{\underline{x}}_2(n)\\ y(n)\end{bmatrix} \tag{14}$$

2. matrix update:

$$\hat{Q}_{y,2}(n)\begin{bmatrix}\beta^{-1}R_{y,2}^{-T}(n-1)\\ \underline{0}^T\end{bmatrix} = \begin{bmatrix} R_{y,2}^{-T}(n)\\ \hat{\underline{k}}_{y,2}^T(n)\end{bmatrix} \tag{15}$$
where the subscript '2' signifies that the quantity corresponds to the reduced-order problem (i.e. without the first column). The problem with pipelining the algorithm is that the matrix $R_{y,2}^{-T}(n-1)$ cannot be updated in time (equation (15)) until $\hat{Q}_{y,2}(n)$ is known, but the latter matrix depends on the vector $\hat{\underline{e}}_2(n)$. In order to pipeline the algorithm this dependency can be broken as follows. Using equations (14) and (15) and defining

$$\tilde{\underline{e}}_2(n) = R_{y,2}^{-T}(n-2)\begin{bmatrix}\tilde{\underline{x}}_2(n)\\ y(n)\end{bmatrix} \tag{16}$$
it can be shown that
= 0y,2(n-1)I~-lRyY2(n-2)l[~T(n I T 0 Ly(n)J
Lr_~,2(n-1
L y(n)j
Lrly, 2(n)/
(17)
32
J.G. McWhirter and I.K. Proudler
where the term fly, 2(n) is defined by this operation. Equation (17) indicates that it is possible to calculate the matrix-vector product ~2(n) with an out-of-date matrix and still obtain the correct product ~2(n) by means of an extra rotation step. Figure 3 shows the SFG for this rotation operator. xl(n) ;1' el
A
J,~in
0
~
[0
~KT(n)'y(n)]
#"4
....--,
;2' C2
S , ~ ~out
~4 Ry~z(n -
;3' C3 Lni.J
~4
fop, l(n - 1)
~T [__p, 2( n - 1), ep- 1(n - 1)_']
;4' C4
Figure 3. SFG for rotation operator
Figure 4. HSFG after 1st algorithmic transformation. Rotation operator ~ is defined in figure 3.
The small circular symbols will be explained later and should be ignored for the moment. The utility of the above observation is that the out-of-date matrix (RyT2(n - 2)) does not require knowledge of r
2(n) in order to be updated; in fact it is thematrix (~y,2(n- 1) that is required. However,
because e2 (n) can still be calculated using R -T y, 2 (n - 2) the HSFG of figure 2can be transformed into that shown in figure 4 which has a delay in the rotation parameter data line. As the right hand -T block now stores Ry, 2 (n - 2 ) , its output is ~p, 2 (n - 1) .The output top, 1 (n) from the left hand column has been aligned in time by delaying it accordingly. In order to create a fully systolic implementation of the inverse updates algorithm is necessary to create a delay on both of the horizontal data paths. One approach to introducing the missing delay is to invoke the "k-slowing lemma'[7] with k=2. This amounts to replacing each delay in figure 4 with two delays and reducing the input data rate by a factor of two (i.e. inputting zero data every second clock cycle). It is then possible to move one delay from the rotation parameter line to the other horizontal data path by applying the type of pipeline cut labelled AB in figure 4. The left hand -T -T column Ry, 1 and the triangular array Ry, 2 then constitute independent pipeline processing stages. If the algorithmic transformation is repeated to create a delay between every pair of adjacent columns in the original SFG (before 2-slowing), a complete set of pipeline cuts (equivalent to the one labelled AB in figure 4) may be applied to produce the systolic array defined in figures 5 and 6. This
Pipellning the Inverse Updates RIs Array
0
0, X2,0
%, ~
0, x3, 0
4, %, ~
33
0, y, 0
4, %, 3
Figure 5. Systolic array for RLS by inverse updates I
y~nI, x, Kin
e ~ e + rinX
S) C r, n
"~?n1
C == - Yoult
S ==
I
l]in, x, Kin
I~-1~.
eo= S) C
~
"/;I t
--" ein + rinX
ei,,
S, C rin
~d
Lni,,J
~'.1
L '~;- J
r,lr+ ; v;~, ,,, ,,:~,, ,'~,
L":+d
L % J
',o,,,, x, ,,:~,, %
Figure 6. Definition of processing cells in figure 5 the one presented recently by Moonen et al.[8]. An alternative approach to creating the extra delay required in figure 4 is to apply the algorithmic transformation again thereby generating the HSFG in figure 7. The rotation parameters are now delayed twice between the first column operator and the R -T y, 2 operator whilst the matrix-vector product term is rotated twice in compensation. 'Ikvo delays are now required on the output from the left hand column R -T y, 1 to align it in time with the output from the triangular array. The HSFG in figure 8 is obtained by applying the pipeline cut AB to the HSFG in figure 7 and then moving the delay on the rotation data path for the rotation operator ~ from input to output. This move is valid pro-
34
J.G. McWhirter and L K. Proudler
x,(n)
o
[~T(n), Y(n)]
R y,~(n B
b ~,b . , = . . ~ . ~ .=b .sb ~ , ~ ~r
.,. ~ , . . ~
.~ -,..,,..,.
|
E~pT 2(n- 2); epl(n- 2)]
~p,l(n-2)
Figure 7. HSFG after 2nd algorithmic transformation xl(n)
0
0
,
y(.)] 9
~p,l(n-2)
~,
~,
1
E~pT2(n - 2), epl(n - 2)]
Figure 8. HSFG of figure 7 after pipeline cut and retiming vided that the rotation operator is modified to include extra delays in place o[ the small circular symbols in figure 3. The resulting "delayed" rotation operator is simply denoted by z-lO. In order to derive a fully pipelined systolic array the entire procedure applied to transform the HSFG in figure 2 to that in figure 8 must be repeated to create delays between each pair of adjacent columns in the original SFG. The resulting systolic array, which is identical to the one proposed by McWhirter and Proudler[6], is defined in figures 9 and 10. The different processing cells on the diagonal boundary of the array have not been defined explicitly since they are just special cases of the internal cells and can easily be deduced from them.
Pipelintng the Inverse Updates Rts Array
35
4 CONCLUSIONS By means of algorithmic engineering, we have shown how to derive a fully pipelined systolic array for the inverse updates RLS algorithm. This proves to be a non-trivial task since the inverse updates algorithm involves a major computational feedback loop. Two distinct systolic array designs have been presented. The original one, defined in figures 5 and 6, was derived using a 2-slowing procedure and can only process one new data vector every two clock cycles. The other one, defined in figures 9 and 10, can process a new data vector every clock cycle but requires an extra rotation to
36
J.G. McWhirter and L K. Proudler
be performed in every cell. It has been estimated that the cells in figure 9 would require ~ 60% more silicon area than their counterparts in figure 5. Since both arrays require the same number of cells, it would appear that the one in figure 9 is more efficient (~ twice the throughput for only 60% extra circuitry). However, since adjacent cells of the array in figure 5 are idle (i.e. processing zero data) on alternate clock cycles, it is possible in the normal way to combine them in pairs with little additional overhead and so reduce the hardware requirement by almost a factor of two. The array in figure 9 would then be less efficient requiring almost twice as many cells, each - 60% bigger, in order to double the maximum throughput rate. Both arrays have been derived in this paper to illustrate how easily the different designs are obtained using the techniques of algorithmic engineering. References
[1] W. M. Gentleman and H. T. Kung, "Matrix Triangularisation by Systolic Arrays", Proc. SPIE Real Time Signal Processing IV, Vol 298, pp 19-26, 1981. [2] G. H. Golub and C. F. Van Loan, "Matrix Computations", North Oxford Academic Publishing CO., Johns Hopkins Press, 1988. [3] S. Haykin, "Adaptive Filter Theory", 2nd Edition, Prentice-Hall, Englewood Cliffs, NJ, USA, 1991. [4] J. G. McWhirter, "Recursive Least Squares Minimisation using a Systolic Array", Proc. SPIE Real Time Signal Processing IV, Vol 431, pp 105-112, 1983. [5] J. G. McWhirter, "Algorithmic Engineering in Adaptive Signal Processing", IEE Proc., Pt F, Vol 139, pp 226-232, 1992. [6] J. G. McWhirter and I. K. Proudler, "A Systolic Array for Recursive Least Squares Estimation by Inverse Updates", Proc. lEE Int. Conf. on Control, Warwick (Mar 1994) [7] G. M. Megson, "An Introduction to Systolic Algorithm Design", Oxford University Press, 1992. [8] M. Moonen and J. G. McWhirter, "Systolic Array for Recursive Least Squares by Inverse Updating", Electronics Letters, Vol 29, No 13, 1993. [9] C-T Pan and R. J. Plemmons, "Least Squares Modifications with Inverse Factorisation: Parallel Implications", J. Comput. and Applied Maths., Vol 27, pp 109-127. 1989. [10] I. K. Proudler and J. G. McWhirter, "Algorithmic Engineering in Adaptive Signal Processing: Worked Examples", lEE Proc., -Vis. Image Signal Proc.., Vol 141, pp 19-26, 1994 [11] R. Schreiber, "Implementation of Adaptive Array Algorithms", IEEE Trans. ASSP, Vol 34, pp 1038-45, 1986. [12] T. J. Shepherd, J. G. McWhirter and J. E. Hudson, "Parallel Weight Extraction from a Systolic Adaptive Beamformer" in "Mathematics in Signal Processinglr', J. G. McWhirter (Ed), Clarendon Press, Oxford, pp 775-790, 1990. [13] M. H. Verhaegen, "Round-off Error Propagation in Four Generally Applicable Recursive Least Squares Estimation Schemes", Automatica, Vol 25, pp 437-444,1989. 9 British Crown Copyright 1994
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 1995 Elsevier Science B.V.
HIERARCHICAL SIGNAL FLOW GRAPH REPRESENTATION OF THE SQUARE-ROOT COVARIANCE KALMAN FILTER
D.W. BROWN, F.M.F. GASTON
Control Engineering Research Centre Department of Electrical and Electronic Engineering The Queen's University of Belfast Ashby Buildino, StranmiUis Road, Belfast BT9 5AH
37
D.W. Brown and F.M.F Gaston
38
robust than non square-root forms, as they are less susceptible to rounding errors and prevent the error covariance matrices from becoming negative definite. A number of different architectures have been proposed in the literature and most have been outlined in a survey paper by Gaston and Irwin, [5]. Algorithmic engineering has grown up out of paraUd processing techniques used in designing systolic arrays and sees the resulting diagrams as illustrations of the algorithm itself and not just a possible systolic architecture. It shows the data tiow and computational requirements of a particular algorithm. However, the parallel algorithms are not necessarily unique and therefore one should be able to transform one parallel form to another for the same algorithm using simple graphical techniques. McWhirter and Proudler have illustrated this in [7]. In this paper, we will demonstrate that all systolic square-root covariance Kalman filter architectures can be obtained from the corresponding hierarchical signal flow graph. In particular, the architectures proposed by Gaston, Irwin and McWhirter [8] and by Brown and Gaston [6], will be verified, using algorithmic engineering methodology, from the overall hierarchical signal flow graph. In the next section, the notation and defining equations for the square-root covariance Kalman filter are given, followed by a section illustrating hierarchical signal flow graphs. Section 4 develops the full hierarchical signal flow graph for the square-root covariance Kalman filter. Sections 5 and 6 illustrate the systolic architectures, [6] and [8], formed by considering different projections of this hierarchical signal flow graph. 2
SQUARE-ROOT COVARIANCE KALMAN FILTERING
The general Kalman filtering algorithm can be numerically unstable in some applications and for this reason several square-root algorithms have been proposed. The square root covariance algorithm is summarised as follows [8]'
Q(k)
P'/'(klk-
0
=
WT/2(~) pre- array
(1)
[ V~re/2(k) Ve-1/'(k)C(k)PT(k]k-1)A:r(k) PT/'(k "Fllk) 0
0
~.
post -*-array where
=
I)cT(k)+ V(k)
(2)
and
~(k + ilk) = A(k)~_(klk- 1)+ (3) A(k)P(klk- 1)CT(k)g,-l(k) ~(k)- C(k)fc_(klk- 1)] where ~_(klk- 1) is the (n x 1) predicted state estimate vector at time k given measurements
HSFG Representationof the SRCKF
39
up to time k - 1, z_(b) is the ( m x 1) measurement vector, A(k) is the (n x n) state matrix, C(k) is the ( m x n) measurement matrix, WT/2(k) and VT/2(k) are the square-roots, or Cholesky factors, of the state and measurement noise covarlance matrices, and P(k[k - 1) is the (n x n) predicted state error covariance matrix. The Cholesky factors are usually taken to be positive definite and can be either upper or lower triangular, i.e.
V(k) -" V1/~(k)VT/2(k)
(4)
From now on, all timescripts have been removed, and for simplicity ~ ( k ) - C ( k ) ~ ( k l k - 1)] r may be referred to as z'.
3
H I E R A R C H I C A L S I G N A L F L O W G R A P H S (HSFGS)
For the purposes of algorithmic engineering a hierarchical signal flow graph (HSFG) may be regarded as an instantaneous input-output processor array. For example, consider the case of matrix-matrlx multiplication in equation 5. c =
(5)
The corresponding HSFG is shown in figure 1. The input matrices, A & B, flow in the i ~ j directions respectively, while the product, C, is propagated in the k direction. The value of considering the 3-D HSFG is seen in figure 2. Figure 2 shows the projected HSFG obtained by projecting figure 1 along the k-axis, i.e. along the direction in which the C matrix is propagated. This results in this product matrix being stored in memory with matrices A & B being passed through the array. This is illustrated diagrammatically by shading the "stationary" data. Figure 2 is not an HSFG in the strictest sense of the meaning, due to the fact that the data has to be fed in sequentially thus not making it an "instantaneous input-output processor". Despite this fact, these type of projections are valuable in determining the architecture and cell descriptions of the resulting systolic array. These cell operations depend, not only on the actual function of the HSFG but also, on the chosen projection, which explains why different systolic architectures can be generated from the same HSFG to produce the same overall mode of operation.
4
HSFG FOR THE SQUARE-ROOT COVARIANCE KALMAN FILTER
A full HSFG for the square-root covarlance Kalman filter can be built up by considering the following steps: (i) Formation of the pre-array in equation 1. (il) Error Covarlance Update : PT/2(k). (iii) State Update : ~(k) .
40
D.W. Brown and F.M.F Gaston
4.1
Formation Of The P r e - A r r a y
The various product terms included in the pre-array fall into two categories:
(i)
Post-multiplication by C r.
(ii) Post-multiplication by Ar. 1
The computation of PT/=CT and ~. - C ~ T can be described by using the HSFGs shown in figure 3. ~1.I.1
C T Products
In the left-hand HSFG, pT/2 and C T are passed in together with the null matrix from above and the product PT/2CT emerges from the bottom of the flow graph. Note that both pT/~ and C T pass through unchanged in their respective directions. In the right-hand HSFG a similar calculation takes place, C T and z T are multiplied together and combined with z T which is fed in from above to produce [ z - Cz] T which emerges from the bottom of the HSFG. These two diagrams can be combined into one HSFG by joining the flow graphs along identical data flow directions, i.e. letting C T flow uninterrupted from the left hand to the right hand HSFG, forming the HSFG shown in figure 4. It can be seen that the two product terms are generated from the bottom of the HSFG ( - j direction). The input matrices are fed in along the i and - k directions and pass through unchanged. From now on, all outputs of unchanged data have been removed to clarify the diagrams. ~.I.2 A T Products The products _PT/2AT and z T A T are generated in much the same way as before with pT/2 and z T being multiplied by A T with the products emerging from the bottom of the flow graph, as shown in figure 5.
Figures 4 and 5 can also be combined by "glueing" the HSFGs together along common data flow directions, i.e. letting pT/2 and z T flow directly from figure 4 to figure 5, producing an HSFG for the generation of all the product terms in the pre-array of equation 1, shown in figure 6. As can be seen, the product terms are produced from the bottom of the array and any null inputs are removed for clarity.
4.2
Generating The Error Covariance Update
-
pT/2
Having generated all the product terms in the pre-array, the post-array can now be formed to update pT/2,/ equation (1), and a by applying a set of orthogonal (Givens) rotations ' ' Schur complement calculation to update ~T, equation (3). The generation of the updated error covariance matrix, pT/~, can be described by the HSFG in figure 7. Rotating PT/2CT into VT/2 and passing the resulting Givens rotations across PT/~AT before it is rotated into W T/~ produces the updated error covariance matrix pr/2. Also produced as by-products are the terms yT/2 and ~-I/2CpTAT which are needed in the calculation of the updated state estimate, ~T.
HSFG Representationof the SRCKF 4.3
41
Generating The State Update ~T(k)
The state update is performed by taking the Schur complement of the compound matrix in equation (6).
[
veTl'(k) [ ~ ( k ) - C ( k ) ~ ( k - 1)]T
Ve-ll'(k)C(k)Pr(k-1)AT(k,)] [A B] ~..T(k- 1)AT(k) = C D
(6)
If the sub-matrix C is zeroed by computing the Schur complement of the compound matr|~ /3. (7. 4-1R |~ nrnttur~.rl Thg.r~.fnr^ 1~]T ~. . . . rn^rl h,, rnt-t~n~ it
[~,(t,~_~(t,~,(t, ....
42
D.W. Brown and F.M.F Gaston
inserted resulting in an identical architecture to that shown in figure 10. This demonstrates that the ad-hoc methods of systolic design can be replaced by a formal design methodology via the use of signal flow graphs and algorithmic engineering. While the above architecture is very efficient and fast, O(2n) timesteps per iteration, it does require feedback loops to produce the pre-array from the new pT/~ located in the triangular part of the array. In the next section, an architecture will be described briefly which does not have feedback loops but which has the same iteration time and higher cell efficiency. 6
T H E S t t C K F S Y S T O L I C A R C H I T E C T U R E OF B R O W N A N D G A S T O N
The architecture given in figure 11 is not unique as the next example demonstrates. TO obtain the systolic architecture documented in [6] is a more complex task than that given in the previous section. Three different projections are needed: (i) i-axis projection of the multiplication layer (ii) j-axis projection of the state update layer (iii) k-axis projection of the error covariance update layer These projections are shown separately in figure 12. Note that the k-axis projection has been flipped upside down. These projections in themselves are valid systolic architectures but can be combined on top of one another in the following way to produce a more efficient array. The shaded area of projection (ii) is identical to the results produced from projection (i) and can be combined as illustrated in figure 13. The products PT/~AT and x T A T overwrite pT/2 and x T respectively, followed by the Schur complement calculation to update the state estimate, x ~, which in turn overwrites x T A M in memory. Finally, appending projection (iii) to figure 13 by storing the lower triangular W T/2 in a secondary memory under the existing lower triangular pT/2 will result in the updated error covariance matrix being formed in the correct position for calculations in the next iteration. It should also be noted that the measurement vector, z, which has been replaced by a unity matrix in memory, is now fed into the array with the C T matrix, producing the architecture given in figure 14. This architecture is identical to that described at length in [6], again showing that a formal design method exists for the generation of systolic square-root Kalman filters. 7
CONCLUSIONS
To conclude, we have demonstrated that: 1. The SRCKF algorithm can be represented as an HSFG. 2. Using algorithmic engineering techniques, numerous systolic architectures can be obtained by projecting the 3-D HSFG in various planes.
HSFG Representation of the SRCKF
43
3. A formal design method for systolic architectures has been shown using HSFGs.
Acknowledgements The authors gratefullyacknowledge the support of the Defence Research Agency, M~Ivern and the financial assistance given by the Department of Education for Northern Ireland.
References [1] J.M. Jover and T. Kailath, "A Parallel Architecture for Kalman Filter Measurement Update and Parameter Estimation.", Automatica, 1986, Vol. 22, No.l, pp. 43-57. [2] M.J. Chen, K. Yao, "On Realizations of Least-Squares Estimation and Kalman Filtering by Systolic arrays.", Proc. 1st Int. Workshop on Systolic Arrays, Oxford, 1986, pp. 161-170. I3] P. Gosling, J.E. Hudson, J.G. McWhirter and T.J. Shepherd, "Direct Extraction of the State Vector from Systolic Implementations of the Kalman Filter.", Proc. Int. Conf. on Systolic Arrays, Killarney, Ireland, May 1989, pp. 42-51. [4] H.T. Kung and C.E. Leiserson, "Introduction to VLSI systems", edited by C.A.Mead and L. Conway, Addison-Wesley, 1980. [5] F.M.F. Gaston, G.W. Irwin, "Systolic Kalman filtering: an overview.", IEE Proceedings-D Control theory and applications, Vol. 137, No. 4, pp. 235-244, 1990. [6] D.W. Brown and F.M.F Gaston, "Systolic Square-root Kalman Filtering without Feedback Loops.", to be presented at IEEE European Workshop on Computer-Intensive Methods in Control and Signal Processing, Prague, September 1994. [7] I.K. Proudler, J.G.McWhirter, "Algorithmic Engineering in Adaptive Signal Processing II - Worked Examples.", to appear in IEE Proc. VIS. [8] F.M.F. Gaston~ G.W. Irwin, J.G.McWhirter, "Systolic Square Root Covariance Kalman Filtering.", Journal of VLSI Signal Processing II, pp. 37-49, 1990. [9] G.M. Megson, "An Introduction to Systolic Algorithm Design.", Clarendon Press Oxford, 1992. [10] M. Moonen and J.G. McWhirter, "Systolic Array for Recursive Least Squares by Inverse Updating.", Electronic Letters 29, No. 13, pp.1217-18, 1993.
44
D.W. Brown and F.M.F Gaston
Figure 1 9 HSFG for Matrix-Matrix Multiplication
Figure 2" Projection along the k-axis
Figure 3" HSFGs for pT/2cT and I _z- c $ l r
HSFG Representation of the SRCKF
zT CT
~T~ 1L\ 9
V_
t i
1,/"
pT/2cT i
[z_Cx]T
Figure 4" HSFG for C T Products ,
,
,
0 /lk
xT~ IIV 9
i
! V
xTATPT/2AT
i
Figure 5 9HSFG for AT Products
zT
k
i
pT/2cT [z-Cx] T
cT
Ir pT/2AT xTAT
'
,,,,,
Figure 6" HSFG for all Products
45
D.W. Brown and F.M.F Gaston
46
pT/2cT
pT/'2AT
..........
............../ o
Figure 7" HSFG for the Gen'cration Of the Updated p T / 2
[z.Cx]T V:/2
xTAT ]
Ve'I/2cpTAT
I,_ _f/.
Complement
l
v:: i
NV/
Transformations
, V
VrI/2CpTAT xT
Figure 8 9HSFG for the Generation of the Updated State Estimate
zT
' cT ....
I
/
/NI
I
AT. . . . . . . .
/
V.....///I~ ./~ PTt2.._L.-~\~I ! xXP! i_ N~ xT ,
LI
~N~
VT/2 9 -e
/ ,,
A
~
! I
A
[ ~
~ [/-/N~ "K-
i/~
P" I N ~ Vr I/2CpTAT ' T Y [ / . . . . .
pT~
.
.
MULTIPLICATION LAYEI~
.
[
W2STATEUPDATELAYER
[
ERROR COVARIANCE
UPDATELAYER "
.
Figure 9 9HSFG for the SRCKF
,
HSFG Representation of the SRCKF
Figure 10 : Existing SRCKF Architecture
lz-Cx] r PT/2cT
[
xTAT pT~ AT
[
Figure 11 : Systolic Architecture obtained by k-axis Projection
47
48
D.W. Brown and F.M.F Gaston
Figure 12" Projections of the three layers of the HSFG in figure 9
Figure 14" Systolic Architecture given by Brown and Gaston, [6]
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
A SYSTOLIC ALGORITHM
FOR BLOCK-REGULARIZED
49
J. Schier
50
In [3], a block-regularized parameter estimator has been presented, compatible with the requirements of implementation on a pipelined systolic architecture. The throughput that it achieves is an order of magnitude higher in comparison with the general framework presented in [5] and half as fast as that of the standard ttLS systolic array [11]. In this paper, we shall apply the concept of the block-regularization to the ItLS array with inverse updates [12]. As a result, we shall get a systolic array, with increased robustness to weakly exciting data, completely pipelined and using only nearest neighbour cell connections, which has the additional advantage of explicitly produced transversal filter weights. 2
I D E N T I F I C A T I O N OF T H E S Y S T E M M O D E L
2.1
Linear Regression Model
The system is modeled by the linear regression
y= e'~+e,
(I)
where the scalar output measurement, y is assumed to be related through an unknown parameter vector 0 to a known n-dimensional vector 7~, which is composed of the recent inputs, outputs and measurable disturbances contributing to the model output, and e is a scalar Gaussian white noise with zero mean. 2.2
P a r a m e t e r Estimation
~.~.1
Minimization of Cost Function. In system identification, we choose an estimate
which minimizes the cost function Je
where P is the covariance matrix, V denotes the extended information matrix and A coincides with the remainder after the least squares estimation.
2.2.2 Notational Conventions. Since we work only with estimates in the following text, we shall not refer to them explicitly by the 'hat' symbol (()). Instead, we introduce the following notation where suitable: the 'tilde' symbol above the variable denotes the value before the data update (~5), stacked 'bar' and 'tilde' symbols refer to the value after the data update, but before the time update (~), the 'breve' symbol denotes the value after exponential forgetting (/3) and the 'bar' symbol refers to the value after the time update 2.2.3 Data Update. For the data update, we use the well-known formulae of the recursive least squares (RLS) identification ~" = ~'.P~, (3) Ir
=
(1-t-~'),~/3~o,
(4)
e
=
y-
(5)
Block-regularized R ~ Identification O
=
(6) (7)
O+~e,
P = P - (1 + r which is equivalent to v=~7+
[ ][ ] y
51
y
.
(8)
We do not have to compute A, since the estimates are independent of it. To compute the parameter estimates from the extended information matrix V, we divide it into the following submatrices
l'v
,,.-. I
52
J. Schier
~.~.5 Block Regularization. Regularized exponential forgetting in the standard version is not suitable for a systolic implementation. To preserve pipelining of the systolic estimator, the block regularization was proposed in [4, 2]. The idea is to keep the alternative parameters V~ and @* constant over N >_ n periods of identification, where n is the dimension of O
9 = v$(k) = v;(k +
~) =...= v$(k + i v - ~),
(19)
{9* = e*(k) = @*(k + 1) = . . . = e*(Ir + IV - 1).
and include the addition of the alternative parameters defined in the time update (15, 16), in accumulated form only after every N periods of identification 1. Standard exponential update over N periods := V(l[0), A(1,N):= 0 fo_s i := 1 to N
a)
:= ~(i)(V- + [~,(i) v(i)][~,(/) ~(i)1)
b)
A(1, N ) : = A(i)A(1, N)-I-- (1 - A(i))
(20)
c)
en,d 2. Accumulated regularization in the N-th period
~,~y := vse* v~(;v + ~llv):= ~ + ;~(~,N)V;
a) b)
(oCN + 1IN):= Vj~(N + llN)v~(/V + IIN))
d)
(21)
where X(1, N ) > 0 is an accumulated forgetting factor. 2.3
S q u a r e - R o o t I m p l e m e n t a t i o n of t h e RLS A l g o r i t h m
Usually, we use the square-root version of the estimator, because it guarantees symmetry and positive definiteness of the covariance/information matrix, and also it can be implemented on a systolic array. 2.3.1 Square.root Decomposition of the R L S Algorithm. Let us introduce the triangular square-root decomposition of the cowriance matrix by the formula
(22)
P = R R I,
where R is an upper triangular matrix. Using this decomposition, the formulae of the RLS identification (3) - (7) transform to a matrix-vector multiplication 0
lI-:l
(23)
and to an inverse update
I] ' 1 [1 0
R
o
~'
= GflQ
Q _ ,~....,~2,~1,
a ~' ~'
e
,
G=
[1 ] I
-els~
a = diag{1,(l/~)Xr
(24)
1
1),
(25)
Block-regularized RLS Identification
53
where Q is an orthogonal matrix given as a product of elementary rotations ~ 1 . . . ~n with rotation ~i zeroing the i-th element of vector a with respect to the first element of the same column in the composed matrix; f~ is a weighting matrix and G is a non-orthogonal transformation used to update the parameters.
~.g.~ Regularization as Input of Alternative Data. We can consider the regularization process to be an input of some alternative data. To show this, let us discuss the time update of the information matrix V. Using the partitioning (9), we shall introduce a square root decomposition of matrix V* (** represents a scalar don't care term) and express V* as a sum of data dyads
[" "1 V~ )' v~,
=
"
*
i----1
1
(26)
where
EUi*l u~
--" i-th row of the matrix
["" 0
.*
(27)
'
An analogous square-root decomposition may be used for V. We can write the addition of the regularizing matrix V* to the information matrix 1~ in a recursive form
a)
u* := U*O* fo....rri = 1 to n .-
§ =
[
1'[
~ + ,/~(i,s)[ v,. ~r ]',/~(i,s)[ v,.
ena v(k + ~lk):= ~:
b)
(28)
c)
The first formula (28 a) results from (26) and (10). The relation for the recursive regularization (28 b) has the same form as the formula of exponential update (20 b). The only difference is that forgetting is not applied to the information matrix, but to the input data. We can conclude that the process of evolution of the covariance matrix P and of the information matrix V~ must be equivalent given the same input data. Hence, we can use the same regularizing data no matter which matrix we work with.
3
SYSTOLIC IMPLEMENTATION
In this section, we shah describe the systolic implementation of the regularization, which is the main contribution of this paper. To be able to do that, let us remind the reader of the systolic algorithm for the inverse-updated RLS identification [8, 9, 12].
J. Schier
54
3.1
Systolic A l g o r i t h m for RLS Identification with Inverse U p d a t e s
The square-root RLS algorithm (23, 24) is implemented on a lower triangular systolic array. The transposed factor of the covariance matrix/V and the vector of parameter estimates 0 reside in the cells of the array, as shown in Fig. 1. The input vector/3 with initial value/3 = [0 ... 0] accumulates the expression V ~ (4). The input vector a, initiated by a = [1 0 ... 0] is necessary for proper pipelining [12] and its first element accumulates v ~ while being passed through the array.
Figure 1: Mapping of R' and |
to the systolic array, input and output of the array
3.1.1 Function of the Cells. The function of the cells in the RLS array is described in Fig. 2, but the forgetting factor is not included for simplicity. The notation used in the figure refers to the data update, nonetheless, the same formulae are also used for regularization. 3.1.~ Propagation of Forgetting. If we assume A to be time variable, we have to synchronize its changes in the array with the propagation of the rotations (25). For this reason, it is entered in the upper left cell and propagated through the array as shown in Fig. 3. Because A(k) is used to compute the accumulated forgetting coefficient A(1, N) (20 c), it cannot be entered in the square-rooted and inverted form. 3.2
I m p l e m e n t a t i o n of the Block-Regularization
Implementation of the block-regularized forgetting in the systolic array for RLS identification with inverse updates involves implementation of the foUowing mechanisms: , Multiplication of n
rows
of matrix U* (26, 27) with O* (28 a)
9 Switching between the identified and the regularizing data 9 Computation of the accumulated forgetting factor A(1, N) (20 c) 9 Switching of the exponential forgetting (when processing the regularizing data, the exponential forgetting is not used - - cf. (20 b) and (21 b, c).
Block-regularized RIs Identification
55
~I ~I 0~I ,.,!
ai -- Ril ~ l
@i := arctan ~1
C~
"t -z! a l l := [ cos ~bi sin~~.~,i ] [ ~I 0 - sin &i cos ~il
ai
al"l ai - ~il~Pl
Left column cell
[
~
aj
COS~i :=
.R~j ai
- sin @~ cos ~'
k~#
Internal cell Upper part of array
e - 01~i W := -
-
Oll 01
,~==E
Ol
0
:=
~ C~1
~
1
(~I
g -- E)I~j
Left column cell
Oj E
Oj
e
a~ 1
Oj e-Oj~pj
4.... e
Internal cell Bottom row of array Figure 2: Function of cells in the R L S array
a~ - R~#~#
]
J. Schier
56
IHE3 i !
I
r E3 ...... E] IJE] ...... UE] Figure 3: Movement of ~ through the RLS array 9 Storing of e* during multiplication with U*, writing of new O*.
3.~.1 Selection of O* for the Block Regularization. For implementation reasons (simplification of control mechanisms), we choose O, computed n steps before the end of the data block, for e*. 3.2.2 Loading of U*. The parameter estimates e , that we Use as the regularizing parameters O* for multiplication with the matrix U*, are computed in the bottom row of the array. Hence, a straightforward choice is to load U* into the array from the bottom, skewed in time, and to perform the multiplication also in the bottom row, as shown in Fig. 4.
Figure 4: Loading of matrix U* into the array and its multiplication with parameters Since the parameter estimates change with every new data input, while all rows of U* must be multiplied with the same e*, it is necessary to store e* = e(k), k being the time before the start of the multiplication, in registers. Prom the bottom row, the elements of U* are shifted up through the array to the diagonal, where they are entered as the alternative data.
Block-regularized RLS Identification
57
The movement of U* has to be synchronized with the input of the data samples, so that U~ arrives at the diagonal just after the last sample of the data block has been processed. To ensure this, U~I has to be entered the bottom row of the array n steps before the end of the data block.
8.2.3 Control Signal. To switch on and off writing the O estimates to the storage of O*, to switch between the real data samples and the regularizing data and to switch on and off the forgetting, a control signal is used. This signal, aligned with U*, is first propagated upwards through the array. After entering the array, it controls writing of the estimates O to the O* storage in the bottom row. In the diagonal cells, it controls switching of the data entry (Fig. 5). After it has reached the diagonal, it is sent back from the upper left cell, in the same way as we propagate )~ (Fig. 3), to switch on and off the forgetting.
Figure 5' Propagation of the control signal
3.2.~ Buffering of Input Data. Since the processing of the input data is interrupted by regularization for n periods every N periods, it is necessary to buffer the input data and to sample the identified system at a slower rate than the systolic estimator runs. This slow-down is equal to 1/2 in the worst case (for N = n). 8.2.5 Multiplication of U* with ,~(1, N). As mentioned previously, ,~ is entered through the upper left cell of the array (Fig. 3). There we shall also compute the accumulated forgetting coefficient ,~(1, N). To implement the product ~/~(1,N)U* (28 b), we use the methods of algorithmic engineering [10]: instead of computing the product before loading U* to the array, we do that in the left column cells of the upper part of the array, before using vector a (23) to compute the rotations ~ (25). We have yet to implement the product ~/,~(1,N)u*. This product is entered during
58
J. $chier
regularization instead of y, and through the computation of the prediction error e (23), it is used to compute the transformation w (Fig. 2): e
X/~(1,N)u~' - ~'~/A(I,N)U* _ VA(1,N) = ..... V~" .... Vfs ( u 7 - ~'U~*).
= -~
(29)
The fraction x/~)~N )~r is computed in the cen storing R~,~ (the bottom left cell of the upper part of the array). y
4
-
SIMULATION EXAMPLE
The influence of the block-accumulated regularization on the robustness of the estimator is shown on the graphs in Fig. 6. The following simple system was identified: y(k) = alYCk - 1) - a2yCk - 2) + bou(k) + blu(k - 1) + ce(k),
(30)
where al = -0.05, a2 = 0.2, b0 = 0.2, bl = 0.3, c = 0.02 and e(k) and u(k) is a white A / ' ( 0 , 1 ) noise. Matrix U* was set to U* - 0.07I, weighting factor ~ = 0.8.
The oscillations of Rll for zero input are due to the accumulated data updates. 5
CONCLUSIONS
In this paper, we have implemented the block-regularized exponential forgetting in the square-root RLS algorithm with inverse updates [12]. The principle of regularization consists of weighting the RLS cost function (2) with an alternative function, specified by the user. This weighting prevents the parameters of the cost function from numerical instability in the case of non-informative data, because they converge to the regularizing values in this case. The block-accumulated regularization accumulates the regularization step over several steps of identification. This is necessary for pipelined systolic implementation. Unlike the standard regularization [5, 7], which is not suitable for systolic implementation, the block-regularization reduces the throughput of the systolic algorithm by only 1/2 in the worst case, compared with the exponentially weighted RLS. Other advantages are that the implementation preserves the compactness of the original array and that it directly provides the transversal filter weights. Acknowledgements This research was supported by Research Grant Nr. 102/93/0897 of the Grant Agency of the Czech Republic. References [1] L. D. J. Eggermont et al., editors. VLSI Signal Processing VI, New York, 1993. IEEE Signal Processing Society, IEEE Press. Proceedings of the IEEE Signal Processing
Block-regularized RLS Identification
Figure 6: Comparison of exponential and regularized forgetting
59
60
J. Schier Society Workshop, held October 20-22, 1993, in Veldhoven, The Netherlands.
[2] J. Kadlec. The ceU-level description of systolic block regularised QR filter. In Eggermont et al. [1], pages 298-306. Proceedings of the IEEE Signal Processing Society Workshop, held October 20-22, 1993, in Veldhoven, The Netherlands. [3] J. Kadlec, F. M. F. Gaston, and G. W. Irwin. Parallel implementation of restricted parameter tracking. In J. G. McWhirter, editor, Third IMA International Conference on Mathematics in Signal Processing, Mathematics in Signal Processing, University of Warwick, December 15-17 1992. Oxford Publishers Press. [4] J. Kadlec, F. M. F. Gaston, and G. W. Irwin. Systolic implementation of the regularised parameter estimator. In K. Yao et al., editors, VLSI Signal Processing V, pages 520529, New York, 1992. IEEE Signal Processing Society, IEEE Press. Proceedings of the IEEE Signal Processing Society Workshop, held October 28-30, 1992, in Napa, CA. [5] R. Kulhav~. Restricted exponential forgetting in real-time identification. A utomatiea, 23:589-600, 1987. [6] L. Ljung and S. Gunnarsson. Adaption and tracking in system identification - - a survey. A utomatica, 26:7-21, 1990. [7] L. Ljung and T. S6derstrfm. Theory and Practice of Recursive Identification. MIT Press, Cambridge, MA, 1983. [8] J. G. McWhirter. Systolic array for reeursive least squares by inverse iterations. In Eggermont et al. [1], pages 435-443. Proceedings of the IEEE Signal Processing Society Workshop, held October 20-22, 1993, in Veldhoven, The Netherlands. [9] J. G. McWhirter. A systolic array for recursive least squares estimation by inverse updates. In International Conference on Control '9~, University of Warwick, London, March 21-24 1994. IEE. [10] J. G. McWhirter. Algorithmic engineering in adaptive signal processing, lEE Proc., Pt. F, 139(3), June 1992. [11] J. G. McWhirter and I. K. Proudler. The QR Family, chapter 7, pages 260-321. Prentice Hall International Series in Acouetics, Speech and Signal Processing. Prentice Hall International Ltd., 1993. [12] M. Moonen and J. G. McWhirter. A systolic array for recursive least squares by inverse updating. Electronics Letters, 29( 13):1217-1218, 1993. [13] J. Schier. Parallel algorithms for robust adaptive identification and square-root LQG control. PhD thesis, Inst. of Information Theory and Automation, Academy of Sciences of the Czech Republic, Prague, 1994. [14] J. Schier. A systolic algorithm for the block-regularized rls identification. Res. report 1807, Inst. of Information Theory and Automation, Prague, 1994. Also accepted for publication in Kybernetika (Prague).
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) © 1995 Elsevier Science B.V. All rights reserved.
NUMERICAL ANALYSIS OF A NORMALIZED RLS FILTER USING A PROBABILITY DESCRIPTION OF PROPAGATED
61
DATA
J. KADLEC
Control Engineering Research Centre Department of Electrical and Electronic Engineering The Queen's University of Belfast Ashby Building, Stranmillis Road, Belfast BTg 5Att Northern Ireland eegOO~O@vZ,qub.ac. uk ABSTRACT. The normalized version of the Qtt algorithm for recursive least squares estimation and filtering is presented. An understanding of the numerical properties of a normalized ttLS algorithm is attempted using a global probability analysis. KEYWOttDS. Systolic array, normalization, fixed point, probability.
1
INTRODUCTION
A normalized version of the QR algorithm [3], [7] for recursive least squares estimation and filtering is presented. All data and parameters of the main triangular section of the array have the guaranteed range of values [-1,1]. The array has the potential for a minimallatency-implementatlon, because the normalized section can use DSP, or VLSI, fixed point hardware (look-up-tables) [2]. An understanding of the numerical properties of the normalized ttLS algorithm is attempted using a global probability numerical analysis. This approach derives the analytic formulas for the probability density functions (distributions) describing the data (normalized innovations) propagated in the normalized filter. This approach is used to formulate probability statements about the number of bits used in the fixed point representation of propagated data. The derived analytic formulas for the probability distributions are verified by the comparison with the data histograms measured on a fixed point normalized array. The array was simulated by C-coded functions under Matlab.
J. Kadlec
62 2
NESTED RLS IDENTIFICATION PROBLEMS
We consider the recursive least squares (RLS) identification of a single-output system described by the regression model
=
(1)
+ e(")
where n is discrete time, the p-vector, ~(n), is the data regressor, y(n) is the output signal and e(n) represents (in the ILLS con+,ext) the equation error. The unknown p-vector, 0, of the regression parameters is estimated by the p-vector 8(n). To prepare the ground for the numerical analysis of the algorithms, we will operate with p(p "1- 1)/2 different regression models (indexed by i = 1 , 2 , . . . m - 1;m = 2,...p+ 1). Let us denote 7h:p+l(n), the vector of data measurements as ~Ol:p+l (R) =
~ 1...i
i+l...m-1
. m
m+l...p
p+l
The set of RLS models is given by
(3)
~m(") = ~Ti[m(R)~Ol:iCrt) Jr ei[m(n). The standard, maximal order RLS model (1) is part of the set (3) for m = p + 1; i = p. The estimates
01ql,n(n) minimize
the sum of weighted squares
j~--'TI,
J(Ol:i[m(")) -- E
j=l
j~2(n-j) ((Pro(j) - ~Ti[m(")~l:i(J)) 2
(4)
where 0 ~2 < 1 denotes the exponential weighting factor. The sum (4) can be rearranged
as
J(Rl.il,n(n))-
[,
[
Vx:ilm(n) -;Ox'il'n(n)l
1
]
(5)
where
V1:i[m (") -=~E ~2(n-j)
L
kO)
(6)
,,
and this can be computed recursively by
Vl:i[m(. ) -- fl2Vl:i[m(.- i ) +
~om(n) ' L ~ m ( " ) '
'
(7)
In (7) the two indices i, m indicate the type of data forming the (i+ 1, i-F 1) matrix Vx:i[m(n). The elements of the maximal dimension (p + l,p + I) matrix V(n) = Vl:ivlp+l(n ) will be denoted vi,j(t); i , j - I, 2, ...p+ 1. 2.1
Nesting in Q R Algorithms
We shall deal with algorithms implementing the recursion (7) by a numerically robust factorization of the matrix V(n). The QR algorithm recursively updates the matrix R(n)
Numerical Analysis of a Normalised RLS Filter
63
defined by
V(.)-'-Rr(-)R(-).
(s)
In (8), R(n) is a (p+ 1,p+ 1) upper triangular matrix with elements ri,j(n). The a posteriori version of the standard a priori QR update of elements ri,j(n) is well known [3]. The array implementation is summarized in Fig. 1 including the cell descriptions.
~'0(")
~1(") -
r
-
~03(-)
~/(") ~'0(")
-
: 7~:-],
",2
II, ...... ,
II
'
I
, .......
~01(B) ~2(")
~3(B)
~(B)
I
,, !,,_
:r314(n) zi-lli(n)
7,-~(.) ~
c~(.); 8~(n) I
~(2);~'(n) ~,(~)
t i
i'iii(B);:(~2rili(B--
1)+ 2T,i_l'i(R)) 1/2
si(n) := Zi-lli/ri,i(n) ~,(-) := ~i.i(--1)/~.~(.) ~(~) := ~,(~)7~-~(-)
~(n); s~(n) 9 ,,mc,,) .:
Figure 1. Standard array for RLS identification. Left: Dashed blocks mark the variables related to 01:3{4(B)filter. Right: Dashed blocks mark 2 RLS subproblems (related to 01.113(n) and 01:213(n) ) nested in maximal dimension array. Our numerical analysis will be based on the probability description of the data propagated in the algorithm. To do this we need to understand the relationship of the maximal order QR algorithm (Fig. 1. left) and the nested QR algorithms (Fig. 1. right) related to the estimation of Ol:ilm(n ) by recursively updating the square root factor of Vl:ilm(n ). This is explained now. Let us consider the QR decompositions of the matrices Vl:ilm(n); m = 2, 3, ...p + 1; i = 1, 2, ...m - 1 defined by
Vl:i]~ ( n )~ RTl:ilm ( n )l~l:ilm ( n ). In (9), R1..il,~(n) is a (i + 1, i + 1) upper triangular matrix.
(9)
J. Kadlec
64
The following relations (10)-(14) hold as the consequence of the nesting of V~.ilm(n) in
V(n) and of the selected RTR type of factorization: ~1,1(-) ... ~,1(.) ~ , . ( - ) R.1:ilm(n) =
....
ri,i(n)
ri,m(n)
(10)
d~l.,(n) :
=
"..
~i,~(-)
: ,,,~(-)
~:~lm(n)
di~irn(n) = vm,m(n) - [rl,m(n)...ri,mCn)][ra,,n(n)...ri,mCn)] T = dLl[m(n ) - r~,m(n ) din-aim(n) = rm,,n(n)
d0~l~(.) = ,,,,,,,.(.).
(11)
(12) (13)
(~4)
Therefore the recursions of all the matrices Rl.i[m(n) is already present in the maximal order recursion of elements rl,j(n) forming the matrix R(n). See the example in Fig. 1. Data xi[m(n) propagated in the array (Fig. 1) are the 7i(n) weighted prediction errors
Xilm(n ) -~ ~i[m(n)TiCn)
(15)
where the prediction errors eilm(n) are defined for m = 1, 2, ...p+ 1; i = 1, ..., m - 1 by (16) and the angle variables 7i(n) are defined for i = 1,2, ...,p by
~,(.)~ (~ + ~T,c.)~-~V~(.- 1)~1.,(-)) -1/~ ; ~ o ( " ) ~ .
(17)
The matrix Vl:i(n- 1) denotes the (i, i) square matrix nested in the uppr left part of the matrix V(n - 1). The scalars dilm(n) are not directly present in the maximal order QR algorithm. Their time recursion is given by =
(18)
1)+
with Zolm(n) = 7~,n(n). The variables dilm(n) will be used for the definition of the normalized version of the QK algorithm in the next section.
3
THE NORMALIZED
QR ALGORITHM
The coefficients ri,m(n) can be normalized into the range [-1,1] by suitable scaling factors. This approach was first used by [5] for the normalized RLS Lattice. Normalized coefficients are defined as
~,,.~(.) =
~,,~(.)/d~_,l~(.)
; ~,~ = 1.
(19)
Numerical Analysis of a Normalised RLS Filter
65
The scaling factors ddlrn(n) are used in the definition of the normalized variables: Xdlm = ~.il,r,(n)Ti(n)/dil,n(n) The transformation of the Mgorithm from Fig. lowed using the relation (19.). It expresses the tion by the normalized parameters, present in dilm(n)/di-llm(n) = ( 1 - ~,m(n))1/l. The final Fig. 2.
~(.)
~(n) i'm
' m
~(n)
(9.0) 1. to the normalized algorithm is alfractions resulting from the transformathe processors of the normalized array: normalized systolic array is specified by
y(n)
cpl(n) (p2(n) (p3(n)
I /~z~(.) I 7 : ~ - - 1 - ' ~ ~'~' 2 .
m
' "~--' ,l(~,,14), II
,,
...,r I ,
I.~I.
",,'
'
~ , *IIr~-~II ~,.
1....~,1~411
i~,~-' ~
I
i L"
I
,- . . . .
,,, 'I, I I
,,
I I I I I
II
I
II
v
.I
I,.,.
I
=..i
m.l
I...
I/(n)
,(~~,~, ~,,
I t"
~m(~)
9oi.(n) = ,;.(,~)
~oi,~(-) := (n~a~o,~(n-~)+~o~,m(-))'/~ ~ol,~(n) := zolm(n)/dolm(n)
eol,~(n) e~-~li(n) ~_~(~) ~ ~ i ( n ) , ~(n) 9 ~(~) ~i_11.~(n) ci(n); si(n) i.. ! i ci(n); si(n)
8i(n) :---'-~'i_lli(n) ~(n) := ( i - "~("))~/~ "Yi(") := Ci(")"[i-l(n) ~i,,~( n ) : = c~(n)fi,,n(n - 1) (1 - ~Lllr~(n))112 + si(n)~i-ll,~(n)
1
~il,n(n) Figure 2. Normalized array for RLS identification. All variables of the triangular section of the normalized array are in the range of values [-1, 1]. Dashed boxes present the examples of the RLS subproblems nested in the maximal order array. The array propagates data xilm(n) = ~ilm(n)7i(n)/dilm(n). With the exception of the
J. Kadlec
66
input data normalization in the row of processors updating d01m(n), all the parameters and data propagated in the normalized algorithm have the guaranteed range of values [-1,1]. 4
PROBABILITY IDENTIFICATION
Our objective is to use the concepts of probability identification to analyze the quality of the data representation in the normalized algorithm (Fig. 2.) operating in fixed-point arithmetic with finite precision. The probability identification [8] works by the recursively updated probability density functions of unknown parameters. To link the general theory with the normalized algorithm and the RLS systolic array implementations we have to make the following assumptions: 4.1
Assumptions
~1.1.1 Assumption A Assume that the stochastic part e~l,~(n) of models (3) is a zero mean normal white noise, with unknown variance a~,n 7~m(n) = OTl:i[m(n)~ol:i(n)+ eilm(n) ; ei[rn(n) "~ N(O, a~m).
(21)
Therefore, the identified systems can be described by Pilm (~m(n)ln - 1, Ol:ilm(n), O'~m(n)) = p-1 (2-)-1/2crilm(n)ex
I.~- 1 -2
(7~m(n)
:i(n)) }2
(22)
where Pilm([) denotes the conditioned probability density function of the random variable
~m at time (n) conditioned by all the observed data up to the time n - 1 and by the parameters 01.ilm(n),
a~,,~(n).
4.1.2 Assumption B The (prior) probability distribution of unknown parameters at time n - 1 is assumed to have the Gauss-Wishart distribution ~. -(,,(,,-~)+i+2)~. Pilm (O,:ilm(n- 1),a~,(n 1)In 1) = k,(n- 1,..ilm ,,o- 1). 9
_
_
1 - , m [-Oz:il 'n(1 n - l ) ] .exp { -~ail
T Vl.ilm(n- 1) [ - 01.ilm(n- 1) ]} 1
(23)
~.1.8 Assumption C The parameter tracking is described by the generalized exponential weighting [4]:
p,l (0
-
1)
=
kt(n) {Pilm(Ol.ilm(n- 1),a~lm(n- 1)In- 1)} #' {qi,m(Ol:i}m(n),a~m(n)]n- 1)} 1-'' (24) where
qilm(Ol:ilm(n),a~m(n)ln- 1)= ~qail-(v,~+i+2) m (n). .
.exp {-2Cr~m2(n)[--01:i~ re(n) ]Tyq [.'01:i'lm(n)]}
(25)
Numerical Analysis of a Normalised RIS Filter
67
is specified by a scalar v~ > 0 and (i + 1, i + 1) matrix V~ > 0 selected as uq ~ O; ~ --, 0 in order to characterize a flat, noninformative alternative distribution. The weights k/(n) and k~ are constants weighting the functions to have the integral equal to 1.
4.2
Bayes Formulas
The posterior density of the parameters the Bayes formulas p~l~(o~:~l~(~), ~ ( ~ ) I ~ )
pq,,(ex.q.,(n),~q,,(n){n)
can be computed from
=
pqrn(~(n)[n- 1,Ol:ilm(n!, o'~m(n)) p~',.,(~(n~ln,..., p~].,(e,:~l.,Cn),,~,.(n)ln-- I)
(26)
Pqm (~/(n)[n-1)=/Pilrn (~/(n)[n-I,ex=qm(n),cram(n)). .~l,~(orq,~(n), ~,~(n)ln - :)dO~.~lm(n)d~,~(n )
(2:)
. . . . . .
-
: ) ....
where
4.3
Link to RLS
Under the above assumptions the posterior density Pilm(Ol:ilm(n), aq,,(n)ln) has the reproducing Gauss-Wishart form: . . . .
.exp
2 '[
(v(,,)+++2)..
i
"'
with the recursion of the parameters Vl.qm(n) and u(n) given by:
v~'~l'~(")-
~2v~'~l'~("-:)+
~,,~(n)
[~,.~(.)
'
(29)
v(n) - ~ v ( n - 1 ) + l (30) This can be derived by substitution of (22) and (23) into (26), (27). See [8] for details. The time recursion of Vl.q,~(n) (29) is identical with the RLS recursion (7). Therefore, both filters from Fig. 1. and Fig. 2. implement the recursion of the factorized matrices, characterizing the probability densities related to the probability identification for all above defined (nested) models. This is the base for the analysis of the data propagated in the normalized array. By integration we can derive the mean values of the estimated parameters and dispersions
E
:} =
E (~rn(n)In _ i}
=
:) ~l.~(n)=/~ 'd~l~ ' (n -
1)l(/~2v(n-
I) _ 2).
(31)
The distribution of the prediction error ~ilm(n) can be computed from (22) and (23) by integration (27). It is equal to
p(~i[rn(n)Ini) - ke(n)Ti(n)~-Id~l(nI).
(32)
.{1 "I"e1:ilm(n)72(n)l(/~'d21m(n1))}-(~'~cn-1)+')12 -2
(33)
J. Kadlec
68 where the normalizing constant
ke(n)
is given by
k.(n) = r-'/'l" ((ft'v(n - 1)+ 1)/2)/1" (,6'v(n - 1)/2)
(34)
and I' is the Gamma function. The distribution (33) is known as the t-distribution with Notice, how the uncertainty of the regression coefficients is projected by 7~'2(n) > 1 (for i > 0) in the increased variance of the prediction error.
E{~.ilra(n)ln-1} - O and E{~.~lm(n)ln-1} - 7["(n)5~lra(n ).
Ii
P R O B A B I L I T Y ANALYSIS OF T H E N O R M A L I Z E D A L G O R I T H M
5.1
Analysis of Normalized D a t a
The conditioned probability density function p(s 1) will be derived for the norrealized data ~ilm(n) in this section. The variables ~ilm(n) are defined as _
Xilrn(n) = h (eilm(n)) - eil,(n)Ti(n) {fl dilm(n- 1)+ eil.
.
.
(35)
The conditional density p(~il,r,(n)ln- 1) specified in (33) is now transformed to the conditional density p(~.il,n(n)ln- 1) using the transformation (35). In(35) , scalars fl 2dilm(n2 1) and 72(n) are deterministic functions of the data in the condition and can therefore be considered as constants in this transformation. From the standard lemma about the transformation of random variables it follows that
- d~.il,n(n)-
.
(36)
The inverse transformation for - 1 < Zilm(n) < 1 is equal to
and the Jacobian for - 1 <
d~l,,,(n)
s
< 1 is equal to
= "rE~(n)Zd~l,,(n-
1)
-
.
(3S)
Finally, by substitution into (36), the conditioned density can be found:
p(~,il,n(n)lnwhere
ke(n)is
1)-
ke(n)(1-
9~lm(n))(B2v(n-1)-u)/2
(39)
defined by (34).
To get a clear understanding of the derived result it will be convenient to plot the conditioned densities parametrized only by the exponential weighting factor/~2. This can be done, because the parameter v ( n - 1) converges for a constant/32 and n > > 1 to the values v ( n - 1) = 1 / ( 1 - B2) as follows from the time recursion (30). Therefore the final
Numerical Analysis of a Normalised RtS Filter
69
conditioned density parametrized by ~2 is given by
p~(~,i~(n)ln
-
1)
=
~-,/2r . ((n21(1 - n 2) + 1)12) (1 r (. ~ 2 /. ( 1 _. .~2))/2)
-2 (n))(~'lO-~')-2)l 2 . x~i~
(40)
The conditioned density pz(s 1) is plotted in the left part of Fig. 3. Notice, that for ~2 = 0.666 the density is uniform. Notice also, that for g2 < 0.666; the data close to +1 and - 1 are more and more probable.
5.2
Analysis of Unused Leading Zero Bits
The density functions of the unused bits of the fixed point representation is now derived. Let us analyze the absolute value of [X~lm(n)] only. The density function is given for o < I~l,,(n)l < 1 by P ~ C l ~ t , , ( n ) l l n - 1) = 2p~(~il,,,(n)ln- 1). Let us denote the number of unused leading zero bits in the fixed point representation by a continuous variable b. The continuous variable is used to keep the formulas simple. The number of unused bits is for 0 < I~,lm(n)l < 1 given by b= -log 2 (]~.,im(n)l). The inverse transformation for b > 0 is given by I~ilm(n)l = 2 -b. The Jacobian of the transformation is given by dl~lm(n)l - ln(2)2 -b. Therefore the d conditional density of the number of unused leading zero %its is given by %
p~(bln- 1)= 2 lr-i/iln(2) I~ ((I-'~ + 1)/2) 2- b (1-
2-ib) <~-'~-i,/i
(41)
The density parametrized by the ~2 is shown in the right side of Fig. 3.
5.3
Comments
The area below the parametrized density (Fig. 3 right) gives the probability of the numbers of unused leading zero bits in the fixed point data representation. The derived conditioned density p~(s identical for all variables s 2, 3, ...,p + 1, i = 1, 2, . . . m - 1 propagated in the normalized identification array.
m=
The result describes for i = m - 1 the conditional density of the sine variables si(n), too (because si(n) = ~ i _ l l i ( n ) ) . Densities are parametrized by the exponential weighting factor f12 only. Therefore the value of ~2 is critical for the control of the distribution of data within the range [-1, 1] in the normalized algorithm. 6
COMPUTER
SIMULATION
A computer simulation was designed to compare the results of the probability analysis of the normalized algorithms with measured histograms of data, propagated in the fixed point
70
J. Kadlec
implementation of the normalized algorithm. A 3-rd order time variable FIR regression model ~1:314(n); n • 1 , . . . , 1500 is simulated with the first 2 parameters time variable. The data regressor 7~1:4(n) is formed by T ~a.4(n) = [u(n- 1), u ( n - 2), u ( n - 3), y(n)] with white noise input u(n) ~ N(O, cry); ~ 0.5. The normal white noise (added to the simulated system output) is characterized by N(0, ~r2); a 2 = 0.005. This data is quantized by 12 bit A/D converters. The system is identified by the normalized systolic array Fig. 2. for values of exponential weighting equal to ~ 2 _ 0.99; ~2 = 0.9 and ~a = 0.5.
F i g u r e 3. Densities of the normalized data s Left - probability densities of values. Right - densities of the numbers of lost leading zero bits of the fixed point representation. Lines are parametrized by the exponential weighting factor ~2. Fig. 4 (left) shows the first 2 elements of the identified vector of regression coefficients 01:314(n) and the prediction errors ~314(n) reconstructed from the normalized data used in the normalized array. All the normalized variables, normalized parameters and all the intermediate results are computed and represented by 14-bit fixed point arithmetic, in the range [-1,1]. The elementary operations have built in tests to substitute the limit values -1 or 1 if the result is out of the theoretically guaranteed range [-1, 1].
Numerical Analysis of a Normalised RI~ Filter
71
Fig. 4. (right) presents the comparison of the derived analytical distributions (first line) with the histograms (lines 2 - 11) of normalized data s measured on a 14-bit fixed-point arithmetic implementation of the normalized array.
#2 =
0.5:01:s14(n);01.s14(n);esl4(n) Analyt. density (1) & real histograms (2-11) of X'~lm(n).
Figure 4. Experimental results from the fixed point identification for ~2 = 0.99;/~2 = 0.9 and/~2 = 0.5. The histograms of s are compared with the analyticaly derived
J. Kadlec
72
densities (the first lines in the right side graphs). The lines 2,..., 11 are the histograms of :~3[4; X214;"~'1[4;~0[4, "~2[3;~I[3; "~OJ3, XI[2; "~012, and ~011 measured for n = I , . . . , 1500 on the 14-bit fixed point implementation of the normalized array.
7
CONCLUSIONS
We have shown how the probability identification can produce information about the quality of the fixed-point data-representation for the normalized systolic identification array. The analytic formulas for the probability of the number of leading zero bits lost in the fixed point representation of the data have been derived. The computer simulation shows how it compares with the data histograms measured from 14-bit fixed-point implementation of the array. The Bayesian dimension of the standard RLS identification framework'provides important additional information to DSP and VLSI designers, requiring the optimal word-length of fixed-point-arithmetic for normalized RLS identification filters.
Acknowledgment This work has been partiallysupported by the GA(~R, Grant No. 102/93/0897 (CZ) and by the SERC Grant "Algorithm Engineering for Concurrent Control" (UK). References [1] Kadlec, J. The Cell level Description of Systolic Block Regularized Q R Filter. 1993 IEEE Workshop on VLSI Signal Processing VI, Veldhoven, The Netherlands. pp. 298306., 1993. [21 Kadlec, J. Systolic Arrays for Identification of Systems with Variable Structure. Preprits of 199~ IEEE C M P Workskop, Prague., Sept. 94.,pp. 123-132., 1994. [3] Kalouptsidis,N., S. Theodoridis. Adaptive system identification and signal processing algorithms. Prentice Hall International (UK) Limited, 1993. [4] Kulhav~, R. and M. B. Zarrop. On a general concept of forgetting. Int. J. Control, 58, no. 4., pp. 905-924., 1993. [5] Lee, D., M. Morf and B. Friedlander. Recursive Least Squares Ladder Estimation Algorithms. IEEE Trans., ASSP-~9, pp. 627-641., 1981. [6] Ling, F., D. Manolakis, and J.G. Proakis. A recursive modified Gram-Schmidt algorithm for least squares estimation. IEEE Trans., ASSP-gJ, pp. 829-835., 1986. [7] McWhirter, J.G. Recursive Least Squares Minimisation Using a Systolic Array, Proc. SPIE, Vol. 431, Real Time Signal Processing 171,pp. 105-109., 1983. [8] Peterka, V. Bayesian approach to system identification In P. Eykhoff (Ed.) Trends and Progress in System Identification, Pergamon Press, Eindhoven, Netherlands, 1981.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 1995 Elsevier Science B.V.
ADAPTIVE
APPROXIMATE
ROTATIONS
73
FOR
COMPUTING
THE
J. C~tze and G.J. Hekstra
74 1
INTRODUCTION
Computing the EVD of a n • n symmetric matrix A, i.e., A = QrAQ,
where Q is an orthogonal matrix and A is a diagonal matrix, is a frequently encountered problem in a great number of scientific applications, e.g. signal processing. In order to meet the real time requirements present in many of these applications, it is often necessary to compute the EVD on a parallel architecture. The method of choice for the fast parallel computation of the EVD is the Jacobi method, since it offers a significantly higher degree of parallelism than the respective QR-method [1, 3]. One sweep of a cyclic Jacobi method (consisting of n(n - 1)/2 rotation evaluations and their application to the matrix) can be implemented on an upper triangular array of processors with nearest neighbour interconnections in O(n) time [1, 14]. Usually, O(log 2 n) sweeps are required. The complexity of the parallel or sequential implementations is mainly determined by the complexity of the rotation evaluation (angle computation). Therefore, different strategies for modifying the rotations have been presented: - M I : approximate rotations [14, 2], M2: factorized rotations [5, 15, 12] , -
-M3: CORDIC [3,4], -
M4: Combination of the modifications M1-M3: M I + M 2 : factorized approximate rotations [10, 7] M1-}-M3: CORDIC-based approximate rotations (p-rotations) [11, 7] -
-
Which of these modified schemes yields the most efficient parallel implementation depends on the particular parallel architecture. With respect to an efficient VLSI-architecture a CORDIC-based method is favourable, whereby using CORDIC-based approximate rotations is advantageous compared to the use of the original (exact) CORDIC [11]. In this paper an algorithm for the computation of the EVD by a Jacobi-type method is presented, which can be mapped to an efficient, parallel VLSI architecture. The Jacobi-type method is implemented using different types of orthonormal p-rotations. These orthonorreal p-rotations realize approximate rotations (i.e. the off-diagonal element is only reduced instead of annihilated as by an exact rotation). They can be executed on the floating point CORDIC architecture presented in [13]. The cost and complexity (number of required shift-add operations) of the different p-rotations decreases as the rotation angle decreases. Since the required rotation angle decreases during the course of the Jacobi method, the type of p-rotation used during the course of the algorithm becomes less complex, i.e., the type of p-rotation is adapted to the stage of the diagonalization. Furthermore, it is also possible to adapt the accuracy of the rotation by increasing the number of p-rotations used in the approximation of the angle in order to regain the quadratic convergence of the exact Jacobi method [11, 9]. In section 2 we review the cyclic Jacobi algorithm for computing the EVD of a symmetric matrix and describe the idea of using approximate rotations. Section 3 gives the basic
Computing the Symmetric EVD
75
definition of fast, orthonormal/~-rotations and presents several methods for unscaled orthonormal p-rotations (method I, II, III), or for p-rotations (method IV) requiring extra scaling iterations to make them orthonormal. It is also shown how to construct a set .A of approximation angles using the above methods, and how to select the optimal approximation angle from this set, i.e. that one which effects in the greatest reduction of the off-diagonal norm. Section 4 demonstrates how the convergence speed of the algorithm can be steered by adaptively varying the number of p-rotations used in the approximation of the rotation. Section 5 conchdes the paper.
2
JACOBPS METHOD
With respect to a parallel implementation of Jacobi's method for computing the E V D a cyclic Jacobi method is used [I].
2.1
Cyclic Jacobi M e t h o d
The cyclic Jacobi method computes the E V D of a n • n symmetric matrix by applying a sequence of orthonormal rotations to the left and right of A, as shown in the following algorithm: h := 1 ; AO):= A for 8 := 1 to number of sweeps for p := 1 to n - 1 for q := p + 1 to n A(h+ 1) :_- Jpq(Oh)A(h)jTq(oh) h:=h+l end end end where Jpq(~O) is an orthonormal plane rotation over the angle ~o in the (p, q) plane, and defined by (cos 7~,- sin ~, sin 7~, cos 7~) in the (pp, pq, qp, qq) positions of the n • n identity matrix. The index pairs are choosen in the cycllc-by-row manner (p,q) =
(1,2)(1,3)...(1,n)(2,3)...(2,n)...(n- 1,n).
(1)
The execution of all N -- n ( n - 1)/2 index pairs (p, q) according to (1) is called a sweep. Since the matrices A (h) remain symmetric for all h, it is sufficient to work with the upper triangular part of the matrix throughout the algorithm. Defining the off-diagonal quantity S(h) by
J. Gatze and G.J. Hekstra
76
where [1" UF denotes the Frobenius norm, the execution of a similarity transformation A (h+l) yields:
Jpq(Oh)A(h)JTq(Oh)
..(h+l) Obviously the maximal reduction of S (h) is obtained if .pq = O. Therefore,
lim S (h) --. 0 ~ n m A (h) --* diag(Al,...,An). (4) h--.oo h--,oo Without loss of generality we will drop the index h and only consider the 2 x 2 symmetric EVD subproblem app apq
apq a~ qq
_
sin 0
-
cos 0
app apq
apq aqq
sin 0
cos 0
(5)
in the sequel. Rewriting apq gives us' , = 1 [(2am) cos 20 - (aqq - %p) sin 20] apq
(6)
Solving apq - 0 for the above gives us the optimal angle of rotation 0opt, for which maximal reduction is achieved.
(7)
Oopt = 1 arctan(l")
where 1" = = 2a~q , and where the range of 0opt is limited t o ]0optl _< ~.. aqq--app
2.2
Approximate Rotations
Using approximate rotations enables the reduction of the complexity of the vectorization and the rotation mode [14, 2, 10]. For a reduction of the off-diagonal quantity S it is not necessary to meet a m - O, but it is sufficient, when using an approximate angle Oapp, that ['apql- Id[" lapq[ with 0 _< Idl < 1,
(8)
where the reduction d is a function of the approximative angle 0app and the matrix data, given by' d(0app, 1-) - cos 20ap p
_1sin 20app . r This is the basis for the design of approximate rotations, i.e. rotations that meet (8). _
(9)
The approach used is to construct a set ~" - {F0, F - l , . . . } of fast orthonormal p rotations, where Fk implements the rotation over an angle ak in a minimal number of shift-add operations. The angles ak form the ordered set of approximation angles A {a0, c~-1,...}. Both angle a~ and orthonormal p-rotation are selected by the integer angle index k, satisfying k _< 0. In section 3, we will present methods to construct such fast orthonormal p-rotations and how to determine the optimal angle index k.
Computing the Symmetric EVD
77
Using approximate rotations must be paid by an increase in the number of required sweeps. It has been shown in [2, 10, 11], however, that the overall cost, in terms of the number of shift-add operations, obtained by using approximate rotations is significantly lower as compared to the exact Jacobi method. 3
METHODS
FOR
ORTHONORMAL
ROTATIONS
We define the close-to-orthonormal #-rotation F as given by the 2 • 2 rotation matrix
F =
a~
~
= ~' 3(0)
with rh being the scaling factor of the matrix, given by ~h =
v/~ 2
4- ,~2 = 1 + ~.
(11)
The ~, ~ are pairwise approximations of a sine / cosine pair, satisfying 0 _< ~, ~ _< 1, and chosen such that 1. the multiplication with ~ and ~ is cheap to compute, or that the combined evaluation of the rotation has a cheap implementation, basically only a small number of shift and add operations. 2. the error in scaling ~ is smaller than the required accuracy. When this is the case, the effect of the error in the scaling (orthonormality) is overshadowed by the rounding error in the computation. Hence the term "close-to-orthonormal" applies for this type of rotations. The angle of rotation 8 = a . a is determined by the direction of rotation a E {-1, +1}, signifying clockwise or counterclockwise rotation, and by the absolute angle of rotation c~, with ~ < ~r/2, which is fixed through the choice of the ~, .~ pair as: c~ = arctan(~)
(12)
We have at our disposal a number of methods to systematically arrive at ~, } pairs of varying accuracy and for varying angles of rotation (= varying angle index k). We will present two classes of methods for orthonormal p-rotations, and select four methods from these, taking into account the cost involved for performing such rotations. 3.1
Unscaled O r t h o n o r m a l p-Rotations
The simplest of the methods is the method I rotation, which is the same as the CORDIC g-rotation, with = =
1 2k
=
(I + 22k)89
where k, with k _< 0 and integer, is the angle index. The smaller the value of k, the smaller the angle, and the more the scaling factor rh approaches unity.
J. GOtze and G.J. Hekstra
78
For (floating-point) computation, with the mantissa size nm, the maximum round-off error is equal to 89 = 2-("'n+l). Hence we define the working limit GI for the angle index such that if k < GI then rh satisfies: (14)
1 - 2 -(""+I) < rh < 1 + 2 -(""+x)
and the scaling effect of the rotation can be neglected. Solving (14) for the working limit G I by substituting the actual value of rh gives us:
For example, for nm = 32, we can use this method I for k = - 1 6 , . . . , - 3 2
without scaling.
The more accurate method H rotations are given by = 1 -- 22k-I ~ = 2k -- (I + 24k-2) ~
(16)
while the even more accurate m e t h o d III rotations are given by
1 - 2 2k-I
= =
2 k -- 2 3k-3
=
(1 + 26k-8)89
(17)
These methods have working limits G II and G III respectively, derived in a similar manner to G i, given by GII
=
4
(18) GIII
=
{-nrn+6]6
As each of the previous three methods is limited to a certain working limit, new methods must be developed for more and more accurate rotations. An alternative is to use a fixed method of given accuracy and apply additional scaling iterations to achieve the required accuracy.
3.2
Orthonormal
p - R o t a t i o n s w i t h E x t r a Scaling I t e r a t i o n s
The rotation used is a double rotation of method I, rotating twice over an angle defined by the angle index of k - 1, resulting in: = (i)2-(2k-I)2 = I - 2 2~-2 = 2(2 ~-I) = 2k (19) rh = x/l + 22(k-I)2 - 1 -[- 22(k-l) Note that the scaling is no longer a square root function. following fast converging scaling sequence.
We exploit this fact in the
Computing the Symmetric EVD
79
~t
W e define the additional scaling K m = FI vi, where m >_ 0 is the number of scaling steps i-0 vi, with vo Vl
=
1
=
(I- 2 2(k-1))
v~ =
(20) for i > 2
(I+22~(k-I))
These scaling steps appear in the right-hand side of the well-known factorization
1 - 9 3m = ( 1 - ~)(1 + ~)(1 + ~2)(1 + ~4)...(1 + ~c--,~)
(21)
after substitution of z = 22(~-I). It is then trivial to show that the overall scaling, K m . rh, satisfies
Km. rh = 1 :k 22(m+')(k-1) .
(22)
Hence the method IV" scaled orthonormal #-rotation is given by ~, ~ of (19) and the 11.,
,
,1,,
1~
+
(nn'~
J. GOtze and G.J. Hekstra
80
The orthonormal p-rotations are chosen such that they satisfy the accuracy condition (14) using the cheapest possible method.
a~gie in'dex
method
k
cosi;
angle
i
C~k
rot.
scl.
arctan (2 k)
1
0
-oo < k < Gz
I
Gt < k <_ G H
II
arctan ( ~ )
2
0
III
arctan ~ i_22~_i-)
3
0
IV
arctan ( t _ 2 , , , , )
2
M
Gil
< k <
Gm
Gol
< k <
0
2k
2k
"
Table 1' The selection of p-rotations methods depending on the angle index k, also showing the angle and overall cost
3.4
D e t e r m i n i n g the Optimal A p p r o x i m a t e R o t a t i o n
The crucial point for approximate rotations is to find the approximative angle 0 a p p ----- O" O~k, where ~ E {-1, +1} is the direction of the rotation and the angle ~k E ~4, with the angle index k, is chosen such that Id(0app, r)l is minimal. The direction of rotation ~ follows from the sign of the optimal angle 0opt, and is given by: = ,ign(~) = ,ign(%~). , i g n ( ~
- %~)
(27)
In order to determine the angle index k, we introduce the working domain limits gk, and define the working condition for the angle ak as being gk-z < I~'1-< gk
(28)
where gk is determined such that, when r satisfies the above then Id(r, ~. ~k)l is minimal over the set of angles .A. The limit gk follows from the solution of d(gk, ~k) = -d(g~, ak+l)
(29)
which results in gk = tan(ak -I- c~k+l) = tan 7k
(30)
where 7k = c~k + ~k+l is the working limit in the angle domain. In [11], for this particular set of approximative angles, and through the use of floatingpoint arithmetic the search is narrowed down to checking which of three consecutive angles gives us the maximal reduction (minimal Idl). Here we show that the selection between two consecutive angles can be done using p-rotations as follows. Construct the vector v given by
Computing the Symmetric EVD
81
The angle between v and the x-axis is equal to the required arctan(r) = 20opt. To test whether [r[ is greater or smaller than the limit gk is equivalent to checking whether this angle is resp. greater or smaller in absolute value to the limit angle 7k. Hence we rotate v over the angle - a ' T k ---- - a ( ' a k + ak+l), to obtain v~ = [v~ v~]T. When the y-component v~ is of the same sign as v~ then the angle between v and the x-axis is larger than 7k, and we will select ak+l for the approximation angle. If, however, v~ is of opposite sign to v~, then the angle is smaller, and we sehct ak. The rotation over - ~ ' T k is performed as two consecutive rotations over ~k and ~k+l, implemented as fast orthonormal p-rotations v'
=
Fk+1(-a)
" Fk(-O')
"V
(32)
The same method can be adapted to testing which one of three consecutive angles to use. For this we compute vt and v" according to
"0"
=
~k-I
('-if)' Fk('-ff ) "
(33)
and choose the correct angle of rotation, using the selection tree, based on the resulting t signs %, v~t!
ak+l ak-1 ak
if sign (v~) = sign (vzl,) if sign(w) ~ sign(v~') otherwise
(34)
If the intermediate results of the rotations are re-used, this selection mechanism costs at most three orthonormal #-rotations. The resulting total reduction Id(r, 0app)l, using the above method for determining the angle index k is shown if figure 1 for an example set of angles .,4 with 32-bit accuracy.
Figure 1: The total reduction as a function of r, for the set of angles with 32-bit accuracy
82
4
J. Gdtze and G.J. Hekstra
A D A P T A T I O N OF T H E R O T A T I O N M E T H O D S
The nature of the Jacobi method guarantees that the less complex p-rotation methods are used as the matrix becomes more and more diagonal dominant during the course of the algorithm, i.e. ultimately only method I (the least complex method) is required. Therefore, a Jacobi method realized as a wavefront processor array automatically speeds up the computations if the #-rotation method is adapted according to table 1 as the number of sweeps increases. This behaviour is illustrated for a random matrix of dimension n = 6. The values of k computed during the algorithm are shown in figure 2. Each of the 11 required sweeps contains N = 15 rotations (i.e. N = 15 values of k). Obviously, the optimal angle index k decreases (i.e. absolute value increases) during the algorithm. Only the first 4 sweeps require method IV. The last 3 sweeps only require method I, i.e the least complex orthonormal p-rotation. In figure 3 (r = 1, solid line) the reduction of the off-diagonal norm is shown vs. the sweep number.
Figure 2: Values of optimal angle index k during the course (sweeps) of the Jacobi method (random matrix n = 6, i.e. each sweep consists of N = 15 rotations/angle indices) Obviously, using only one #-rotation per approximate rotation shows no ultimate quadratic convergence (see figure 3, solid line) as does the Jacobi method with exact rotations. As shown in [11] the quadratic convergence of the Jacobi method can be regained by adapting (increasing) the number of orthonormal #-rotations executed per plane rotation Jpq. Increasing the number of orthonormal #-rotations executed, i.e., executing r #-rotations with optimal angle index per plane rotation, increases the accuracy of the approximate rotation. Therefore, the conditions for quadratic convergence [10, 11] are met as the diagonalization advances. Figure 3 shows the off-diagonal norm vs. sweeps. The Jacobi method is executed
Computing the Symmetric EVD
83
with exact rotations (dashed line),one ~-rotation per simUarity transformation (solid line) and an adaptive number r of #-rotations per similarity transformation (dash-dotted line). The adaptation scheme used for this figure was r - [Ikmeanl/10J, where kmea~ is the mean value of the optimal angle indices in the respective sweep. Note that the increase of r comprises a decreasing cost for the #-rotations as the simpler rotation methods are used for smaller k.
Figure 3: Off-diagonal norm vs. sweeps
5
CONCLUSIONS
In this paper a Jacobi-type algorithm for computing the EVD of a symmetric matrix was presented. It uses different types of orthonormal #-rotations as approximate rotations. These orthonormal #-rotations are distinguished by decreasing complexity for decreasing angles of rotation. The evaluation of the angle index k, which determines the approximate rotation angle and the used type of orthonormal #-rotation, can also be executed by at most three (unscaled) #-rotations. Therefore, the entire Jacobi-type method can be performed by the execution of #-rotations (pairs of shift-add operations, respectively), i.e., it can completely be executed on a floating point CORDIC architecture. Thus, it is highly suitable for a VLSI implementation. Furthermore, the nature of the Jacobi method (i.e. increasing diagonaUzation -~ decreasing rotation angles) supports the use of different types of orthonormal #-rotations as well as an adaptation of the accuracy of the approximate rotation, since in both cases the simpler types of #-rotations are used as the required rotation angle (angle index k) decreases.
84
J. GOtze and G.J. Hekstra
Finally, note that the presented methods also apply to the SVD of a rectangular matrix if the SVD problem is mapped to a symmetric EVD problem with increased dimension [6, 8]. Acknowledgements The work of the first author was performed while on leave at the Delft University of Technology. The research fellowship of the Delft University of Technology is acknowledged. Thanks are also to the Network Theory Group (Prof. P. DewUde, Prof. E. Deprettere) of the Delft University of Technology for the hospitality and the great time in Delft. References [1] R.P. Brent and F.T. Luk. The solution of singular value and symmetric eigenvalue problems o n multiprocessor arrays. SIAM J. Sci. Star. Comput., 6:69-84, 1985. [2] J.P. Charlier, M. Vanbegin, and P. van Dooren. On efficient implementations of Kogbetliantz's algorithm for computing the singular value decomposition. Numer. Math., 52:279-300, 1988. [3] J.-M. Delosme. Bit-level systolic algorithm for the symmetric eigenvalue problem. In Proc. Int. Conf. on Application Specific Array Processors, pages 771-781, Princeton (USA), 1990. [4] M.D. Ercegovac and T. Lang. Redundant and on-line CORDIC: Application to matrix triangularization and SVD. IEEE Trans. on Computers, 39:725-740, 1990. [5] W.M. Gentleman. Least squares computations by Givens rotations without square roots. J. Inst. Maths Applies, 12:329-336, 1973. _ .
[6] G.H. Golub and C.F. van Loan. Matrix Computations. The John Hopkins University Press, second edition, 1989. [7] J. G6tze. Parallel methods for iterative matrix computations. In Proc. IEEE Int. Syrup. Circuits and Systems, pages 233-236, Singapore, 1991.
on
[8] J. G6tze. CORDIC-based approximate rotations for SVD and QRD. In Proc. European Signal Processing Conference, Edinburgh (Scotland), 1994. [9] J. G~tze. Monitoring the stage of diagonalization in Jacobi-type methods. In Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, page to appear, Adelaide (Australia), 1994. [10] J. G6tze. On the parallel implementation of Jacobi's and Kogbetliantz's algorithm, to appear SIAM J. Sci. ~ Star. Comput., 1994. [11] J. GStze, S. Paul, and M. Sauer. An efficient Jacobi-like algorithm for parallel eigenvalue computation, to appear IEEE ~rans. on Computers, 1993. [12] J. GStze and U. Schwiegelshohn. A square root and divis_ion free Givens rotation for solving least squares problems on systolic arrays. SIAM J. Sci Star. Comput., 12:800-807, 1991. [13] G.J. Hekstra and E.F. Deprettere. Floating point CORDIC. In 11th Symp. on Computer Arithmetic, Winsor (Canada), 1993. [14] J.J. Modi and J.D. Pryce. Efficient implementation of Jacobi's diagonalization method on the DAP. Numer. Math., 46:443-454, 1985. [15] W. Rath. Fast Givens rotations for orthogonal similarity transformations. Numer. Math., 40:47-56, 1982.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
85
PARALLEL IMPLEMENTATION OF THE DOUBLE BRACKET MATRIX FLOW FOR EIGENVALUE-EIGENVECTOR COMPUTATION AND SORTING'
N. SAXENA and J.J. CLARK Harvard University Division of Applied Sciences Cambridge, MA 02138 USA sa~:ena @hrl. ha rva rd. edu
ABSTRACT. In this paper we present a new systolic algorithm based on isospectral flows for the computation of eigenvalues and eigenvectors of a symmetric matrix and for sorting lists. This algorithm can be implemented in the form of both a linear array and a two-dimensional array. We describe the implementation details of both the linear and two-dimensional arrays. We analyze the implementations from the viewpoint of ease of implementation in VLSI. A major advantage our implementation has over other VLSI implementations is that it is easily scalable, meaning that several small chips can be put together to form a large system. This is quite important from the point of view of cost-effectiveness. Each processing node in our implementation is very simple, and is consequently smaller in area than most other implementations. Our implementation also gives sorted eigenvalues, while other systems do not sort. An implementation of the linear version of the array is being fabricated in VLSI. KEYWORDS. Systolic arrays, isospectral flows, eigenvalues, eigenvectors, sorting, VLSI.
1
INTRODUCTION
One of the most important developments in the field of parallel algorithms for eigenvalue computation has been the design of special-purpose systolic arrays [1]. The demand for such high performance algorithms is driven by the real-time constraints of digital signal processing applications. Unfortunately, the practical impact of these algorithms has been limited because of the difficulties encountered in implementing these algorithms in VLSI. Most of the parallel implementations to date are based on the Jacobi algorithm, in which
85
IV. Saxena and J.J. Clark
the off-diagonal elements are systematically reduced to zero by applying a sequence of plane rotations to the original matrix. The calculation of rotation parameters can be made faster and simpler by applying CORDIC (Coordinate Rotation Digital Computer) algorithms [3, 6]. However, this algorithm is essentially two-dimensional and, as will be explained later, its VLSI implementation is limited to small matrices [7] or to wafer scale integration [10]. The latter is, of course, too expensive for many potential applications. Thus there is a need for an algorithm that can be applied to large systems and yet does not require wafer scale technology. In this paper we present an implementation which reaches this goal through the use of a scalable design, whereby one can use several chips together to make a fast system for large problems. Our computation algorithm is based on isospectral flows, that is matrix flows in which the eigenvalues of the matrix are preserved. Isospectral flows have been studied extensively by the numerical analysis community for several reasons, one of them being that they can be used to calculate eigenvalues of matrices, and also that they can be related to well-known matrix factorizations [4, 8, 13]. The double bracket matrix flow, as studied by Brockett[2], is one such isospectral flow, that can be used to calculate eigenvalues and eigenvectors of a symmetric matrix, sort lists, and solve linear programming problems. Very few researchers in the past have used the isospectral flow approach to implement the eigenvalue problem in VLSI, probably because of the large number of steps it requires for convergence compared to other competing algorithms. However, as will be seen later in this paper, the isospectral flow can be used to design a systolic scheme, which has several advantages from the VLSI point of view. The systolic scheme can be implemented in the form of both linear and two-dimensional toroidal systolic arrays. In this paper, we firstdescribe the double bracket matrix flow, and how it can be used for finding the eigenvalues and eigenvectors of symmetric matrices, and for sorting. Next, we describe the two-dimensional systolic implementation and analyze the design in terms of communication complexity, area occupied, and speed. This is followed by a description of the linear systolic implementation. We then discuss the scalability of our implementations. The implementations are then compared to existing schemes for eigenvalue and eigenvector computation. We conclude with a summary of the paper and a description of our current work.
T H E D O U B L E B R A C K E T M A T R I X F L O W A P P L I E D TO E I G E N V A L U E EIGENVECTOR COMPUTATION AND SORTING In [2] Brockett investigated the double bracket matrix flow, which is described by the following matrix differential equation :
= IX, IX, where H , N ~. R "• are symmetric matrices, and [A,B] = A B - B A denotes the Lie bracket. Given appropriate choices of matrices H(0) and N, one can solve many standard optimization problems using the double bracket matrix flow. We will limit our concern to two specific applications: determining eigenvalues and eigenvectors of real symmetric matrices, and sorting lists of real numbers. In addition to the fact that the flow is isospectral,
Parallel Implementation of the Double Bracket Matrix Flow
87
it can also be shown that the gradient flow equation operating on a symmetric matrix causes the diagonalization of the matrix. The diagonal elements of the resulting matrix are equal to the eigenvalues of H(O), due to the isospectral property of the flow. Thus the system performs eigenvalue computation. We can find the eigenvectors of H(O) by using the following matrix differential equation coupled to the double bracket equation : b =
where B E R nxn. If B(0) is the identity matrix, then on reaching equilibrium the columns of B give the eigenvectors of H (0). Brockett also shows in [2] that if N is diagonal with distinct eigenvahes, then the only stable equilibria of the flow are those in which the diagonal dements of H and N are similarly ordered. If the matrix N is constructed so that/Vil < N2~ < ...< Nn, then, once H is diagonalized, its diagonal entries will be ordered so that ff,1 < H22 < ...< Hnn. Or equivalently, A1 < A2 < ...< An. Thus, the system not only finds the eigenvalues of H(0). it sorts them as well. This system can be used for sorting lists by choosing H(0) to be diagonal and setting the (diagonal) elements to the unsorted list. W h e n the system reaches equiUbrium, H becomes diagonal again with sorted eigenvahes. Since the eigenvalues are preserved, the diagonal elements give the sorted version of the list. Clearly, this approach can be used to find order statistics,such as minimum, median, and m a x i m u m value of a list of real numbers, just by picking off elements HII,H(n+I)/2,(,~+I)/2,and ft,,, once the system reaches equilibrium. If we have an n • n real symmetric matrix H, with elements hij (where i,j = I, 2, ...n), that is to be diagonalized, and we have a fixed diagonal n x n real matrix N, with diagonal dements nl = i (where i = 1, 2, ...n),then the discretized form of the double bracket flow equation can be written as"
6h~# = ~ h~khkj(i + j - 2k)~t k where ~t is the step size. The corresponding equation for the eigenvectors is given by : J where bi is the ith column of B. W e implement the above two equations using twodimensional and linear systolicarrays. The next two sections describe the implementations. 3
TWO-DIMENSIONAL
SYSTOLIC
IMPLEMENTATION
For a systolicimplementation of the discretized equation for eigenvahe solving, we propose a two-dimensional toroidal array of processors as shown in Fig. 1(a). A similar array can be employed for the implementation of the equation for eigenvector solving, but we will concentrate only on the implementation of the eigenvalue solver. Each element in the processor array calculates the flow for the corresponding element in the n • n matrix g, i.e.processing element Pij calculates the flow for hij. The schematic diagram of each processing element is shown in Fig. 1(b). The multiplier-accumulator, indicated by the left dashed box in Fig. l(b), calculates the
~!~,~m~q:) 8 (q) X~a~ ~go~,sXs I~p!oao~, l~UO!SUatu!p-o~, to tu~a~!p ~!~,~tu~q~ S (~) :I ~:m~!~I
(q) nl:POl= II=l|aeA
ll:POi s IIIIlJeA
I v. J/- !
~L 119019 IIIlUOXllOH
.
.
ir"
.
l,~
.?,
---|
| ue0.u
,.+-,
~ - ' . .
dL .
.
"--"
_~
Jo|
Jo un ~.- ~ - ' . , i l l , u
9
-!+ ~
Jt -/
J O l l I l U l l lll=l :I v
i
+
~-~
I
.,..d.,..
,,
""
I[ Ioppy
---
"~
~ ~
II
,,
-
~-~~ ll~Ol:l llllllOll
JO H
l
'
"
,
,
,,,I.
,
-
010o1~ Ouleee:)e,I d
(+1
<- ~.
...
+++~!i+-+++~+,+~++~!++++~+:+!:+
. - -
T
+ +m
:
..:
:~:::~:ii!:" iii:Ni
: '+
.............................................
+++i +,+++i,i+
- ~,~+N~++-- ... . - - ii!~~i - - i!ii~ii@i~+ ++
,....
+++++++++++++++++++:.+.++~
~++
...
WvlD "f" f p u v vudxvs "N
88
Parallel Implementation of the Double Bracket Matrix Flow
89
fh; "t--'
uh,"bL (a)
1r
(b)
(d)
Figure 2' Example illustrating the flow of data in a 3 • 3 processor array. The hq's in each box represent the data that are in the horizontal and vertical shift registers of the processing element represented by the box after each cycle; the top hij's shift horizontally, while the bottom hlj's shift vertically. In the 3 • 3 case, we have one shift cycle (a), and 3 shift and process cycles (b, c and d). As you can see, the data in the respective boxes are aligned correctly for the final 3 cycles. At the end of 4 cycles, the final result is obtained,
product hikhkj(i + j - 2k), multiplies it by the step size, and adds it to hij stored in the accumulator at each clock cycle. The right dashed box in Fig. l(b) represents the scale calculator, which calculates (i + j - 2k) as the sum of the distance between hq and hik, (j - k), and the distance between hij and hkj, ( i - k). To obtain the distance of hij from a multiplicand, say hik, we start with distance as zero and increment it by one, every time the multiplicand is shifted- except when hik is shifted from Pin to Pil. In this case, we subtract n from the distance. The registers are used to shift the hij's between adjacent nodes. To ensure that hik and hkj are shifted into Pij simultaneously for i,j,k = 1, 2, ...n, we device the following scheme. We first load the array with the entries for H(0). This will take n cycles. Next, each node does the computation of ~ k hikhkj(i + j - 2k). To do this computation, we need ( 2 n - 2 ) cycles. The first ( n - 2 ) cycles are shift cycles only. The next
90
IV. Saxena and J.J. Clark
Figure 3: Simulation to verify concurrency of data
n cycles are shift and computation cycles. In the firstof the shift cycles, we clock only the first row and column of the processor array. In the next cycle, we clock the first two rows and columns of the processor array. W e continue in this way, and by the (n- 1)th cycle, we start shifting the first(n- 1) rows and columns of the processor array. At the (n- 1)th cycle, we also start doing the computation at all the processing dements. For the next ( n - 1) cycles, shifting and computation is done at all rows and columns. For the cycles in which computation is done, the hijsare updated in the accumulator. At the end of the (2n- 1)th cycle, hijsare fully updated in the accumulator, meaning that ~ k hikhkj(i+ j- 2k)~t will be added to hij and stored in the accumulator. Fig. 2 shows an example of the scheme for the 3 • 3 case. Once the update is done in the accumulator, the two registers associated with the node are set with the updated values of hij and the two registers in the scale calculator module are reset to zero using the switches shown in Fig. 1(b). This shifting, computation, and update is repeated until the flow reaches equilibrium. Once the flow reaches equilibrium, the values of hij are docked out, which takes n cycles. Thus, the total number of cycles required to compute the flow is approximately 2nN, where N is the number of iterations required for the flow to reach equilibrium. N depends on several factors including expected accuracy, step size, condition of matrix, etc. In order to verify the concurrency of data in our scheme, we simulated the shifting, computation, and update cycles for a 3 x 3 system. The hijswere chosen to be between 0 and 1, and 6t was chosen to be 0.1. Fig. 3 shows one of the results of the simulation. W e start with off-diagonal
Parallel Implementation of the Double Bracket Matrix Flow
91
values close to zero and unsorted diagonal values. As is expected, after the flow reached equilibrium, the diagonal values were sorted and the off-diagonal values had returned to approximately zero. Care must be taken in choosing the step size, St. If it is chosen too small the number of steps required for convergence will be large. If the step size is too large the errors may accumulate, the isospectrality of the matrix will be lost, and the flow will diverge. Basically, there is a race between the accumulation of errors and the convergence of the matrix. If the time for convergence is short, the flow converges before the errors become too large. However, if the matrix takes a long time to converge the accumulated errors may cause the matrix to lose isospectrality. This occurs in cases when some eigenvalues of the matrix are very close to each other. In this case the step size will have to be small. The choice of the step size also depends on the range of the matrix entries and the size of the matrix. By simulating the system we have empirically found that for moderately well spaced eigenvalues and for a 5 • 5 system, the step size should be such that the update is less than 0.1 times the value to be updated. For hijs between 0 and 1, that is achieved by choosing St to be 0.1. For hijs between 0 and 10, 6t should be 0.001. Everything else remaining the same, for larger matrices the step size has to be proportionally smaller. Since the array is symmetric, we need to do the processing at only the upper triangular part of the array. So the rest of the elements in the array do not need to have the module which multiplies and accumulates. Thus, for the two-dimensional implementation of the n • n systolic array, the main components required are 89 2 + n) multiplier-accumulators, n 2 integer adders, and 4n 2 registers. As with most systolic arrays, we have nearest neighbor communications. 4
LINEAR SYSTOLIC IMPLEMENTATION
In this section, we will describe the linear version of the systolic array that computes the double bracket matrix flow. However, we will refrain from providing all the details since the implementation is similar to the two-dimensional case. In the linear systolic array, we calculate the hij's one row at a time. The linear version of the systolic array has the same basic hardware as the two-dimensional version, except that the multiplieraccumulator structure is present in only one row of the processing array (see Fig. 4). The top row of the processing array also has the horizontally shifting registers, which provides the communication between the processors. The shift registers of each column are stacked to form two memory units, M1 and M2, as shown in Fig. 4(b). The accumulators in the multiplier-accumulator module are also stacked vertically to form M3. We align the data as necessary by hardwiring the shifts. To calculate the hij updates for the first row, we load the hlj'S into the accumulator of the corresponding processing element. At the same time, we load the first row of M2 into the horizontally shifting registers. The other input for each multiplier-accumulator comes from M1. We then update the hlj'S in the accumulators as we rotate the horizontally shifting register and the Mls in each processing element. After n cycles, we will have added ~ k hlkhkj(1% j - 2k)St to the hlj'S in the accumulators. We shift out the hu's and shift in the h2j's into the accumulators. We also load the second row of M1 into the horizontally shifting registers. We proceed with the computation of h2j's as with the
92
N. Saxena and J.J. Clark
Figure 4: (a) Schematic diagram of the linear systolic array (b) Schematic diagram of each processing cell
Parallel Implementation of the Double Bracket Matrix Flow
93
first row. We repeat this procedure for all the rows. After n 2 cycles of computation for an n × n matrix, we get the updated values for all h~j's. We shift the updated values of h~j into M1 and M2 as is hardwired. This takes just one clock cycle. As mentioned earlier, the hardwiring ensures alignment of data. We repeat this computation and shifting until the flow reaches equilibrium. Thus, the total number of cycles required to do the computation is approximately n2N, where N is the number of iterations required for the fiow to reach equilibrium. For the linear version of the array, the main components required are n multiplier-accumulators, n integer adders, and approximately 3n 2 registers. As with the two-dimensional case, we have nearest neighbor communications. 5
S C A L A B I L I T Y OF D E S I G N
One of our major goals in our project has been to design an elgenvalue solver that could be used for large matrices. If the implementation of an eigenvalue solver for a large matrix is fabricated on a single chip, the area becomes so large that technologies llke wafer scale integration have to be employed, which could be prohibitively expensive. However, if the design is scalable, one can combine several small chips to implement a large system. Because of limitations on the number of pin-outs of chips, a linear array has much better scaling properties than a two-dlmensional array. For example suppose one wants to construct a 36 × 36 system. Assuming that nine processors fit on a chip and that there are fifteen communication wires between two adjacent nodes, the numbers of pln-outs required for each chip just for interconnectivlty is 180 for a two-dlmensional array, but is only 30 for a linear array. Also, the number of chips required is 144 for a two-dimenslonal array, but is only 4 for a linear array. So, for a large system, a linear array will provide a more practical solution in most cases. In our case, the linear array is easily scalable but we have to keep a few points in mind. First, the size of the register stacks M1, M2, and M3 in each node of the linear array has to increase with the size of the eigenvatue solver. Hence, the area for each processor increases with the size of the system. Second, when we use several small chips to make a larger linear array, the size of the final array has to be predetermined. This is because the register stacks are connected in the form of a ring structure, and also the alignment of data is achieved through hardwlred connections between the register stacks. However, if the connections in the register stacks are programmable, the size of the final array need not be known beforehand. 6
COMPARISONS
We first compare our implementation with other architectures for finding the elgenvalues and eigenvectors of symmetric matrices. We then compare our implementation with other suggested implementations of the double bracket matrix flow. Researchers have been investigating parallel architectures for eigenvalue-eigenvector computation for more than a decade. As is mentioned in the introduction, most of the parallel implementations of elgenvalue solvers in VLSI are based on the Jacobi algorithm. The
N. Saxena and J.J. Clark
94
Jacobi algorithm is faster than the double bracket matrix flow due to the large number of sweeps or iterations required for the flow to reach equilibrium. However, each processing dement in the double bracket matrix flow implementation is much simpler than the corresponding element in the Jacobi algorithm implementation. This leads to smaller area and shorter design time for the double bracket flow implementations, both of which are very important issues in VLSI. Another important consideration in VLSI is communication complexity. While both implementations have nearest neighbor communications, the control of data flow required for the Jacobi implementation is more complex than that for our implementation. This results in further savings in terms of area and design time for our implementation. Moreover, as has been mentioned before, the linear array in our implementation is scalable, and a large eigenvahe solver can be constructed by combining several smaller chips. Jacobi algorithm based implementations are all implemented in two-dimensional arrays and do not have suitable scaling properties. Another advantage the double bracket matrix flow implementation has over Jacobi algorithm implementations is that if we use the coupled equation for the implementation of eigenvectors, we get sorted eigenvalues and eigenvectors when the flow reaches equilibrium. With the Jacobi method, we would need to calculate the eigenvalues, calculate the eigenvectors, and then sort, all in serial order. So, for applications which need sorted eigenvalues and eigenvectors, the double bracket matrix flow could be more efficient in terms of speed, area and design complexity. In [5] and [9], implementations of various forms of the double bracket flow have been discussed. However, both the implementations are in the analog domain. As the authors in [5] themselves have pointed out, analog circuit implementations have severe limitations on accuracy. This, in effect, drastically limits the applications for the implementations. As an example, we simulated the sorter implementation in [9] using analog multipliers with less than 1% total harmonic distortion, real operational amplifiers, and the circuit simulation program SPICE. The analog multiplier used in the simulation is described in [11], while the analog system and the various simulation details are described in [12]. We found that while the sorter did sort the analog values, the inaccuracy in the final values was at least 10%. This severely limits the applications the sorter could be used for. So, a more effective way to implement the double bracket equations would be to employ digital techniques, which can be arbitrarily accurate. However, digital integrated circuits occupy more space, consume more power and because of architecture limitations, require more time to do the computation. However, the advantages of higher accuracy and more flexibility in applications, far outweigh the disadvantages.
7
SUMMARY
AND
CURRENT
WORK
In this paper, we have described the details of the two-dimensional and linear versions of a systolic array based on the double bracket matrix flow. This array can be used both for calculating eigenvahes and eigenvectors of real symmetric matrices, and for sorting positive real numbers. The implementation has several attractive features from the VLSI point of view. The individual processors are simple and are easy to design. The communication between the processors is also very simple. The design can be Used to implement larger systems by combining several small chips. This makes for a cost-effective implementation
Parallel Implementation of the Double Bracket Matrix Flow
95
of large systems. In this paper, we also compare our implementation to other parallel implementations of double bracket matrix flow, and find our implementation to be more practical from a VLSI implementation standpoint. We are at present building the linear version of the systolic array in VLSI. A floating point system is being used so as to achieve the large dynamic range required for the problem. We have designed, fabricated, and tested the components required to design the processors. We have also put together the components to design each processor. The final design consists of three such processors in an area of 4.2ram • 6.3ram using a standard 2/~ CMO$ process. This design has been sent to MOSIS for fabrication. Once this chip is built, we will connect several of these chips to build larger systems.
96
N. Saxena and J.J. Clark
[10] C. M. Rader. Wafer-scale integration of a large systolic array for adaptive nuUing. The Lincoln Laboratory Journal, 4(1):3-29, 1991. [11] N. Saxena and J. J. Clark. A four-quadrant CMOS analog multiplier for analog neural networks. IEEE Journal of Solid-State Circuits, 29:746-749, June 1994. [12] N. Saxena and J. J. Clark. On the analog VLSI implementation of the h-dot equations. Harvard Robotics Lab Technical Report, in preparation, 1994. [13] D. S. Watkins. Isospectral flows. SIAM Review, 26(3):379-391, July 1984.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
PARALLEL BLOCK ITERATIVE COMPUTING ENVIRONMENTS
97
SOLVERS FOR HETEROGENEOUS
M. AttIOLI
A. DRUMMOND
Istituto di Analisi Numerica, CNR via A bbiategrasso 209 ~7100 Pavia ITALY arioli @bond.ian.pv, cnr. it
Centre Europgen de Recherche et de Formation A vancge en Calcul Scientifique (CERFA CS) 4Z Ave. G. Coriolis 31057 Toulouse, FRANCE drum mond @cerfacs.fr
I.S. DUFF
D. RUIZ
CERFA CS, and also Rutherford Appleton Laboratory Chilton, Didcot, Ozon 0Xll OQX ENGLAND
[email protected]
Ecole Nationale Supgrieure d 'Electrotechnique, d 'Electronique, d'Informatique, et d'Hydraulique de Toulouse (ENSEEIHT) 31000 Toulouse, FRANCE
[email protected]
ABSTRACT. We study the parallel implementations of a block iterative method in heterogeneous computing environments for solving linear systems of equations. The method is a generalization of the row-projection Cimmino method where blocks are obtained by partitioning the original linear system of equations. The method is referred to as the Block Cimmino method. In addition, we accelerate the Block Cimmino convergence rate by the Block Conjugate Gradient method (Block-CG). Firstly, we present three different implementations of the Block-CG to compare different paraUelization strategies of the method. Secondly, we present a scheduler to balance the computational load of the parallel distributed implementation of the Block Cimmino iteration. The computational resources used in these experiments are a BBN TC2000, and a network of Sun Sparc 10 and IBM RS-6000 workstations. We use PVM 3.3 to handle the interprocessor heterogeneous communication.
KEYWORDS. Block iterative methods, Cimmino method, conjugate gradient method, heterogeneous environments, parallel algorithms, P VM.
98 1
M. Arioli et al.
INTRODUCTION
The Block Conjugate Gradient method (Block-CG)([13, 3]) is an extension of the classical conjugate gradient method. The Block-CG algorithm simultaneously searches the solution to a linear system of equations in different Krylov subspaces. Moreover, the Block-CG method can also be used to simultaneously find more than one solution to linear systems of equations with multiple right-hand sides. Given the nature of the Block-CG method, we propose three different strategies for its parallel distributed implementation. The main differences between the three Block-CG implementations are the amount of computation performed in parallel,the communication scheme, and distribution of tasks among processors. The Cimmino method ([7])is a row projection method in which the solution of the linear systems of equations is obtained through row-projections of the original matrix in given subspaces. In the Block Cimmino method ([2, 5]), the blocks are obtained by dividing the linear system of equations into subsystems. At every iteration, it computes one projection per subsystem and uses these projections to construct an approximation to the solution of the linear system. The Block-CG method can also be used to accelerate the Block Cimmino convergence rate ([11,3]). Therefore, we present an implementation of a parallel distributed Block Cimmino method where the Cimmino iteration matrix is used as a preconditioner for the Block-CG algorithm. In the parallel distributed Block Cimmino implementation, we have developed a scheduler that assigns units of work of different sizes to a set of heterogeneous processors. With the scheduler, we generate heuristics to balance the work load among processors taking into consideration the size of the units of work and the potential communication between these units. Processors can be clustered to take advantage of particular network interconnections. Thus, for the scheduler, we classifythe processors into single processor nodes, shared memory clusters, and distributed memory clusters. In Section 2, we present the three parallel Block-CG implementations with the results obtained from these implementations. Later, in Section 3 we present a scheduler for the parallel distributed implementation of the Block Cimmino method. Lastly, we present conclusions and general observations in Section 4.
2
PARALLEL
DISTRIBUTED
BLOCK-CG
The Block Conjugate Gradient algorithm (Block-CG) is an extension of the classical conjugate gradient algorithm from [12], and in turn it belongs to a broad class of conjugate direction methods for solving linear systems of equations of the form: H x = k,
(2.1)
where H is an n x n symmetric positive definite matrix (SPD). These methods guarantee, in absence of roundoff errors, to reach a solution to (2.1) in a finite number of steps. In the Block-CG method one considers more than one right hand side vector, leading to the linear systems: H X = K,
(2.2)
Parallel Block Iterative Solvers
99
in which X and K are now matrices of order n x s, and s is the block size. For this reason the method is called Block-CG. Algorithm 2.1 is a stabilized Block-CG [3, 14], and from this algorithm we have developed three different parallel distributed implementations. A l g o r i t h m 2.1 (Stabilized B l o c k C o n j u g a t e G r a d i e n t )
Begin X (~ is arbitrary, tt (~ - K - H X (~ 1~(~ = It(~
such that (R--'(~176
I)
~-(o) = R---(o)/~ol such that (P-'(~176 for j - 0,1,..., until convergence do Aj = X(J+l) -_ xcJ) +
/•T
End
R--(J+1)
=
aj
=
l~(J +~)
=
I)
(rI,%
enddo
Figure 1: Example of a linear system of equations with a block slze of 3 partitioned into 4 subsystems (e.g., I = 4). For the parallel distributed implementations of Block-CG, we first partition the matrix H from (2.2) into I submatrices H 1 , H 2 , . . . , H I , in a way that every submatrix H~ has dimensions ni • ki. The k~ dimension of every submatrix Hi comes from a column partition
1O0
M. Arioli et al.
of the original matrix H, and the ni dimensions are determined by the number of rows with non-zero elements inside a column partition. Similarly, we partition the X(~ and K matrices into l submatrices with dimensions ki • s each. Afterwards, we distribute the l sets of < Hi, X~~ Ki > among p processors. An example of a matrix partitioning strategy is shown in Figure 1. In Algorithm 2.1 the parallel computations of the matrices" HX (~ H P (j), ~j, and 7j required to build in full the P(J), and It(J) matrices. The places in the algorithm where the matrices H X (~ H P (j), ~j, and 7j are computed require interproeess communication and synchronization, and these places can penalize the efficiency of the parallel implementation of the Block-CG method. In our first implementation, we minimize the number of required communications us-
Figure 2: Master-Slave : centralized Block-CG implementation. ing a master-slave computational approach in which the master performs the Block-CG
algorithm 2.1 with the help of p slaves to perform the H P (j) products. In Block-CG algorithm, the most expensive part in term of computations is the calculation of the H P (j) products and in this implementation we only parallelized these products. We refer to this implementation as "Master-Slave: centralized Block-CG." Figure 2 illustrates the flow of computations for the Master-Slave: centralized Block-CG implementation. As a second implementation, we consider a master-slave computing approach in which each of the p slaves perform iterations of the Block-CG Algorithm 2.1 in a set of matrices < Hi, X~~ Ki >. The role of the master in this case is to gather partial results from the
Parallel B l o c k lterative Solvers
101
R!J)rRI j) and P!J)rHP!J) products in order to build the 7j and ~j matrices respectively. At the same time, each slave has information about other slaves with whom it needs to exchange information to build locally a part of the full HP~ j) matrix. The implementation is illustrated in Figure 3. We will refer to this implementation as "Master-Slave: distributed Block-CG." Lastly, we develop an implementation based on an all - to - all computing model. This
Figure 3: Master-Slave: Distributed Block-CG implementation. time the motivation is to reduce the communication bottlenecks created by having a processor that acts as master and needs to receive messages from p slaves and broadcast the results back. In this implementation, we have an a l l - t o - a l l communication for computing the 7j and/~j matrices which means that, after the communication, the same 7/ and tgj information is local to every processor. To compute the full HP! j) matrix each processor communicates with only the processors that have information relevant to its computations. Figure 4 is an illustration of this implementation of the Block-CG algorithm. We refer to this implementation as the "AU-to-AU Block-CG". In all three implementations the interprocessor communication has an impact on performance. Therefore, we analyse the amount of information that needs to be communicated in each implementation. Let mk be the number of processors with whom the k-th processor must communicate in order to compute the full HPI j) or HX! ~ matrix. Notice that each processor only needs to communicate a part of its local information with its mk neighbour processors. In the
102
M. Arioli et al.
Master-Slave: centralized Block-CG implementation, these products are handled differently and each processor sends results back to the processors executing the master role. Table 1 summarizes the number of messages sent per iteration of the Block-CG. We observe in Table 1 that the Master-Slave: centralized Block-CG implementation needs
Table 1: Number of messages sent at every iteration the least amount of messages per iteration. However, in the Master-Slave: centralized Block-CG the length of every message is nl • s ( master processor sends P(J) and each slave processor sends back H P (j)). As stated before, in the AU-to-AU Block-CG and Master-Slave distributed Block-CG implementations, the processors communicate directly with their neighbours to compute the H P (j) products. These messages are in almost all
Parallel Block lterative Solvers
103
cases smaller than n~ x s except when the matrix H is a full matrix. The length of the messages used to exchange the inner product is s x s. In the Master-Slave: centralized Block-CG, the master processor assembles the full HP(j) from partial results sent by slave processors. In the other two implementations, the assembly of the full matrices happens in parallel because each slave processor builds the part of the full matrix it needs. Furthermore, the overhead of assembling the H P (j) matrix in a centralized way increases as the number of subproblems and degree of parallelism increase. The results shown in Tables 2 and 3 were run on a BBN TC2000 computer. We ran the
'Nu'mber of PE's ~ i 2 4
8 12 16
Laplace Matrix 4096X 4096 . . . . . . . (Block size = 4, 171 iterations) Elapsed Time of sequential version - 279142 All-to-All Mstr-slv: disi~ibuted Mstr-Slv: centralized Elps. Timt [Speed-uP ' Elps. Time Speed-up Elps.....Time {Speed?up .$' 278827 1.001 279436 0.999 1.951 143419 143083 1.946 301884 0.925 3.910 71244 3.918 278184 71393 1.003 40755 6.849 38798 7.195 273320 1.021 40668 6.864 29747 9.384 279414 0.999 57759 25452 4.833 lO.967 283649 0.984 .,
.
.
.
.
.
.
.
.
.
.
. . . . . . . . .
.
.
.
.
.
....
Table 2: Test matrix generated from a discretizationon a 64 • 64 grid: Laplace's equation. Times shown in table are in microseconds.
Number of PE's 1 2 4 8
12 16
L A N i ~ R O Matrix 960 x 960 (Block size = 4, 138 iterations) Elp. Time of sequential version = 64869 All-to-All ....Mstr-Slv: distributed Mstr-Slv: centralized
Elps. Time I Speed-up 64980 0.998 34063 1.904 19531 3.321 14108 4.598 20943 3.097 48054 1.350 ,,
. . . .
Elps. Time ] Speed-up 65455 0.991 34347 1.889 19451 3.335 12667 5.121 11319 5.730 11874 5.463 ,,.
Elps. Time I Speed-up . $ .
61942 53964 52175 53566 58400
1.047 1.202 1.243 1.211 1.110
Table 3: This matrix comes from The Harwell-Boeing Sparse Matrix Collection, and it is obtained from a biharmonic operator on a rectangular plate with one side fixed and the others free. Times shown in table are in microseconds. experiments with 1, 2, 4, 8, and 16 processors. We used two SPD matrices for running the experiments. The first matrix is the result of a discretization on a 64 • 64 grid: Laplace's equation. The matrix is sparse of order 4096, and has 20224 nonzero entries. The second matrix, L A N P R O of order 960 with 8402 nonzero entries, comes from the Harwell-Boeing Sparse Matrix Collection [9]. The sequential time reported in Tables 2 and 3 is the result of running a sequential implementation of Block-CG without any routines for handling parallelism. The sequential implementation uses the same BLAS and LAPACK [1] routines as the parallel Block-CG
104
M. Arioli et al.
implementations. We can see in Tables 2 and 3 that the larger the problem size is, the better the speedups we get with the Master-Slave: distributed Block-CG implementation. This is not the case for the Master-slave: centralized Block-CG implementation, for which the performance decreases as we increase the size of the problem, and the overhead from monitoring the parallelism by the master processor negates all the benefits from performing the HP(J)products in parallel. In the All-to-All Block-CG implementation, we have chosen to perform redundant computations in parallel instead of waiting for a master processor that gathers, computes and broadcasts the results from computations. As can be seen in Tables 2 and 3, an increase in the degree of parallelism penalizes the performance of the implementation due to the accompanying increase in interprocessor communication. We conclude from these experiments that the Master-Slave: distributed Block-CG implementation performs better than the other two implementations because the amount of work performed in parallel justifies better the expense of communication. Furthermore, we use this implementation to accelerate the rate of convergence of the Block Cimmino iterative solver to be presented in the next section.
3
PARALLEL DISTRIBUTED
BLOCK CIMMINO
The Block Cimmino method is a generalization of the Cimmino method [7]. Basically, we partition the linear system of equations:
Ax=b,
(3.1)
where A is a ~ x n matrix, into ! subsystems, with I _~ m, such that: A1 b1 A 2 9
x =
A'
b2 .
(3.2)
b'
The block method ([5, 2]) computes a set of I row projections, and a combination of these projections is used to build the next approximation to the solution of the linear system. Now, we formulate the Block Cimmino iteration as: i(k) =
=
Ai+b i - PR(A~T)X(k)
(3.3)
Ai+ (bi - Aix(k)) l
x(k+1) =
x (k) + v ~ 6i(k) i-1
(3.4)
In Equation (3.3), the matrix A i+ refers to the Moore-Penrose pseudoinverse of A i defined
as: A i+ = A ir (AiAiT) -1. However, the Block Cimmino method will converge for any other pseudoinverse of A i and in our parallel implementation we use a generalized pseudo-
Parallel Block Iterative Solvers
105
inverse [6], AG_ai- = G-1Ai r ""[AiG-1Air)-l, where G is an eUipsoidal norm matrix. The Plc(Air) is an orthogonal projector onto the range of A it. We use the augmented systems approach, [4] and [10], for solving the subsystems (3.3) G Ai
Ai~] [u i
0
with solution: v' = _ ( A ' G - 1 A ' T ) - l r
'
, and
u i = AG_li-(b i - Aiz) = 6i
(3.5)
The Block Cimmino method is a linear stationary iterative method, with a symmetrizable iteration matrix [11]. The use of eUipsoidal norms ensures the positive definiteness of the Block Cimmino iteration. An SPD Block Cimmino iteration matrix can be used as a preconditioning matrix for the Block-CG method. The use of Block-CG in this case accelerates the convergence rate of the Block Cimmino method. We recall that Block-CG will simultaneously search the next approximation to the system's solution in s-Krylov subspaces and, in the absence of roundoff errors, will converge to the system's solution in a finite number of steps. We use the MasterSlave Distributed Block-CG implementation presented in the previous section to develop a parallel block iterative solver based on the Cimmino iteration. At first, we solve the system (3.5) using the sparse symmetric linear solver M A 2 7 from the HarweU Subroutine Library [8]. The MA27 solver is a frontal method which computes the LDL r decomposition. The MA27 solver has three main phases: Analyse, Factorize, and Solve. These MA27 phases are called from the parallel Block Cimmino solver. First of all, the parallel Block Cimmino solver builds the partition of the linear system of equations (3.1) into (3.2) and generates the augmented subsystems. The solver then examines the augmented subsystems to count the number of nonzero elements inside each of them and identifies the column overlaps between the different subsystems. The number of nonzero elements per subsystem gives a rough estimation of the amount of work that will be performed on the subsystem. The column overlaps determine the amount of communication between the subsystems. In addition, the solver gathers information from the computing environment either supplied by the user or acquired from the message passing programming tool. Processors are classified into single processor, shared memory clusters, and distributed memory clusters. We assume that the purpose of clustering a group of processors is to take advantage of a specific communication network between the processors. The information from the processors is sorted in a tree structure where the root node represents the startup processor, intermediate level nodes represent shared or distributed clusters, and the leaf nodes represent the processors. Processors in the tree are sorted from left to right by their computer power. The tree of processors, the augmented subsystems, the number of nonzero elements per subsystem and the information from the column overlaps are passed to a static scheduler. The scheduler first sorts all the subsystems by their number of nonzero elements. Later, subsystems are assigned to processors following a postorder traversal visit of the tree of processors (e.g., first visit leaf nodes then the parent node in the tree). A cluster node receives
106
M. Arioli et al.
a number of subsystems to solve equal to the number of processors it has. In this case, a first subsystem is assigned to the cluster and the remaining ones are chosen from a pool of not-yet-assigned subsystems. To choose amongst the candidate subsystems, we consider the amount of column overlaps between them and the subsystems already assigned to the duster and, then, we select the candidate subsystem with the highest factor of overlapping. This choice aims to concentrate the communications between subsystems inside a cluster. Every time a subsystem is assigned to a processor or cluster, we update a workload factor per processor. This workload factor is useful in the event that there are more subsystems than processors. The subsystems that remain in the not-yet-assigned pool after the first round of work distribution are assigned to the least loaded processor or cluster one at the time. Every time the least loaded processor is determined from the workload factors. After assigning all the subsystems to processors, these subsystems are sent through messages to the heterogeneous network of processors. Each processor calls the MA27 Analyse and Factorize routines on the set of subsystems it has been assigned. Afterwards, it performs the Block Cimmino iteration on these subsystems checking the convergence conditions at the end of every iteration. The same parallel computational flow from Figure 3 is used in the parallel Block Cimmino solver. The only difference is a call to the MA27 Solve subroutine to solve the augmented subsystems and compute a set of projections 6i. The scheduler may redistribute subsystems to improve the current workload distribution. This redistribution may take place after the MA27 Analyse phase, MA27 Factorize phase, or during the Block Cimmino iterations. Moreover, the user specifies to the scheduler the different stages of the parallel solver where redistribution is allowed. Given the high expense of moving a subsystem between processors (move all the data structures involved in the solution of a subsystem, and update the neighbourhood information), we recommend allowing redistribution only before the MA27 Factorizatlon phase started because there are many data structures that are created per subsystem during the solve phase and sometimes the time to relocate these data structures across the network is more expensive than letting the unbalanced parallel solver finish its execution. In Table 4, we present some preliminary results of the Block Cimmino solver. We ran in a heterogenous environment of five SUN Sparc 10, and three IBM 1%$600 workstations. We used one of the IBM workstations to monitor the executions (master processor). The first test matrix is Gl%El107 from the HarweU-Boeing Sparse matrix collection [9]. This matrix is partitioned into 7 blocks (6 block of 159 rows and one of 153 rows) using a block size of 8 for Block-CG. As a second test matrix, we consider a problem that comes from a two dimensional wing profile at transonic flow (without chemistry effects). The problem is discretized using a mesh of 80 by 32 points. This leads to an unsymmetric, diagonal dominant, block tridiagonal matrix of order 80 • 32 • 3. In this case we test with three different partitionings. We use block size of 4 for the Block-CG algorithm only to increase the problem granularity since the problem converges very fast even with a block size of one for Block-CG. 
The numbers inside parenthesis in Table 4 show the relations between an execution time with a given number of slave processors and the execution time of the same problem with a single slave processor. We do not anticipate speed-ups in a network of workstations and we expect the Parallel Block Cimmino solver to perform better in parallel heterogeneous environments where we can take advantage of clusters of processors, and very different pro-
Parallel Block herative Solvers
107
cessing capabilities. Besides, we conclude that the Parallel Block Cimmlno will provide in a "reasonable" time a solution to a problem that cannot be solved in a single processor. N.Slaves IBM [sUN 1 1
0 2
0
3
1
4
GRE1107 205449 (1.0) 161969 (1.3) 201517 (1.0) 249802 (0.8)
5
2s6320
o
(0.9)
........ ~ransonic Flow 10 Blks 16 Blks
78079 (1.0i 53352 (1.5) 47479 (1.6)
33916 (2.3)
40966
77757 (1.0) 75297 (1.0) 64349 (1.2)
11 Blks 77579 (i.0)
4849~ 4122~ (1.9) 44352 (1.8) 50895
36200 (2.1)
(i.~) (1.9)
(1.5)
42721 (1.8)
Table 4: Preliminary results of the parallel Block Cimmino solver. Times shown in table are in mUlseconds.
108
M. Arioli et al.
References
[1] E. Anderson, Z. Bai, C. Bischof,J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. L A P A C K User's Guide. SIAM, Philadelphia, 1992. [2] M. Arioli,I. S. Duff',J. No~iUes, and D. Ruiz. A block projection method for sparse matrices", S I A M J. Scientificand StatisticalComputing 1992, 13, pp 47-70. [3] M. Arioli,I. S. Duff',D. Ruiz, and M. Sadkane. Block Lanczos techniques for accelerating the block Cimmino method CERFA CS TR//PA//92//70,Toulouse, France, 1992. [4] R.H. Bartels, G.H. Golub, and M.A. Saunders. Numerical techniques in mathematical programming. In Nonlinear programming J. B. Rosen, O.L. Manga.sarian, and K. Ritter, eds.,Accademic Press, New York, 1970. [5] R. Bramley and A. Sameh. Row projection methods for large nonsymmetric linear systems. S I A M J. Scientificand StatisticalComputing 1992, 13, pp 168-193. [6] S.L. Campbell and C.D. Meyer, Jr. Generalized inverses of linear transformations. Pitman, London, 1979. [7] G. Cimmino. Calcolo approssimato per le soluzioni dei sistemi di equazioni lineari. Ricerca Sci. II, 9, I, pp 326-333, 1938. [8] I.S. Duff and J.K. Reid. The multifrontal solution of indefinite sparse linear systems. A CM Trans. Math. Softw. 9, pp 302-325, 1983. [9] I.S. Duff, R.G. Grimes and J.G. Lewis. Users' guide for the Harwell-Boeing sparse matrix collection (Release 1). RAL 92-086 Central Computing Department, Atlas Centre, Rutherford Appleton Laboratory, Oxon OXll 0QX, 1992 [10] G.D. Hachtel. Extended applications of the sparse tableau approach- finite elements and least squares. In Basic question of design theory W.R. Spillers, ed., North Holland, Amsterdam, 1974. [11] L.A. Hageman and D. M. Young. Applied Iterative Methods. Academic Press, London, 1981. [12] M. It. Hestenes and E. L. Stiefel. Methods of conjugate gradient for solving linear systems. Nat. Bur. Std. J. Res. 49, pp 409-436, 1952. [13] D. P. O'Leary. The block conjugate gradient algorithm and related methods. Linear Algebra and its Applications 1980,29, pp 293-322. [14] D. Ruiz. Solution of large sparse unsymmetric linear systems with a block iterative method in a multiprocessor environment. CERFA CS TH/PA/9~/6. Toulouse, France, 1992.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
109
EFFICIENT VLSI ARCHITECTURE FOR RESIDUE TO BINARY CONVERTER
G.C.CARDARILLI, R . L O J A C O N O ,
M.RE, M . S A L E R N O
Dept. of Electronic Engineering University of Rome. Tot Vergata Via della Ricerca Scientifica, I Rome Italy cardarilli @utovrm. it
A B S T R A C T . The Residue Number System (RNS) to binary conversion is a criticaloperation for the implementation of modular processors. The choice of moduli is strictlyrelated to the performance of this converter and affectsthe processor complexity. In this paper, we present a conversion method based on a class of coprime moduli, defined as (n - 2 k, n + 2k). The method and the related architecture can be easily extended to a large number of moduli. In this way the magnitude of the modular arithmetics used in the R N S system can be reduced. The proposed method allows the implementation of very fast and low complexity architectures. KEYWORDS. set.
1
Parallel architectures, RNS, arithmetic representation conversion, moduli
INTRODUCTION
Residue number system (RNS) is a very useful technique to improve speed and arithmetic accuracy of digital signal processing implementations. It is based on the decomposition of a number represented by a large number of bits into reduced wordlength residual numbers. These residual arithmetic blocks are independent each other. Consequently, this approach reduces the carry propagation delay speeding up the overallsystem. This fact makes R N S an interesting method for a low level parallelization.In addition, modular operation with high computational cost, as for example multiplication, can be speeded-up by using suitable isomorphisms stored in look-up tables. The main drawback for the use of R N S in high speed D S P is related to the conversion between the internal and the external number
I I0
G.C. Cardarilli et al.
representations. This conversion requires the translation from binary to RNS and viceversa and uses two different types of converters. The input converter transforms binary numbers into a set of numbers corresponding to the RNS representation. The output converter is used for the inverse conversion and it transforms the numbers from RNS to binary. In general, both the converters are critical in terms of speed and complexity but the second one is more important for the definition of the overall system performance. This second conversion can be performed using different approaches based on two fundamental techniques: the Mixed Radix Notation (MRN) and the Chinese Remainder Theorem (CttT). While the first approach is intrinsically serial, the second one can be easily made parallel but it requires a large dynamic range in order to represent the intermediate results [1]. Recently several authors have developed a number of alternative methods to overcome the CRT problems. In particular, in [2] Premkumar proposed a method derived from the CRT for a particular choice of RNS moduli. He considered an RNS system defined by the three different moduli ( 2 n - 1,2n,2n + 1). With these moduli he reduced the internal dynamic range of the converter and simplified the final modular operation. Different solutions based on other choices for the moduli set as for example (2 n - 1, 2'~ + 1) were also proposed. In this case, it is possible to use the elementary converter for defining an ItNS system composed by a large number of moduli. The disadvantage of this method is the exponential growth of the moduli magnitudes that makes the arithmetics of the RNS system complex and slow. In this paper, we present a conversion based on a class of coprime moduli, defined as ( n - 2 k, n+2k). This system can be easily extended to a large number of moduli limiting the magnitude of the modular arithmetics. In addition, this method allows the implementation of very fast architectures based on simple operations. 2
BINARY TO RNS CONVERSION
Let us consider an RNS arithmetic based on two moduli rnl and m2. For this choice, the number X can be obtained from its residues rl and r2 by using the classical CRT approach
where (X}M represents the result of modular operation X modulo M, with M
-- 7711 * 7712 =
=
rh, = __.M rh2 = __M 1711
(2)
1712
The two quantities rhl 1 and ~h~1 are such that <~hl~hll>,n, = 1;
m' = 1, and using (2)we can write
ma= <m~l>,~ ; <~'I>m~= <m~'l>m2.
The application of the CP~T shown in [1] to the ltNS to binary conversion leads to complex architectures. These architectures require a large wordlength and a complex modular operation rood M. The wordlength is related to the dynamic range of partial results of equation (1). In our approach these problems are avoided by using a particular form of (1)
Efficient VLSI Architecture for Residue to Binary Converter
111
and considering a particular choice for the moduli mt and m2 9 If the left side and the right side of (1) are multiplied by (m2 - ml), taking into account equation (2), we obtain
Regarding the definition of
~;1.
we can write (m21 m2)
quently we obtain ~r~2
~rtl
-
(klr/~ 1
+ 1)m 1 and conse-
ml
There are an infinite number of values for kl that make the second member of equation (4) an integer number. It can be easily proved that among these values there exists a particular value k~ such that k~ ml + 1 < ml m2. Using this value, equation (4) can be written without the modular operator, i.e.
('/Tl'2"l)'m,:l
"- k t f T ' t l -
+1
m2
(5)
A similar procedure leads to the expression
=
,.,---.-.--~-
(6)
Substituting the above values in (3), we obtain
(x (-~2 - ..~)). = (-~2 (ki-.. + 1) r~ -
~1
(k~.~2 + 1)r2)u
(7)
The modular operation present in equation (7) can be removed by introducing an additional term a M . Finally we obtain X = m2rl - mlr2 + a M
' (m2'-
.',1)
(8)
If the difference ( m 2 - ml) is a power of two, namely 2 h, the equation (8) can be easily evaluated. In fact, in this case the division is reduced to a right shift of h positions. In this work we consider a class of moduli defined as ml = n - 2k, m2 = n + 2 k being n an odd number. For this choice, the two moduli (m2, ml) are coprime (see Appendix A) and the difference is (m2 - ml) = 2k+l. This method can be extended to a large number of moduli. In particular for a RNS representation using four moduli the use of equation (8) requires (ml, m2, m3, m4) = (n - 2/r n -F 2 k , m - 2 j, m + 2 j)
(9)
and m4m3 - m2ml ----2/t
(10)
where ml, m2, mz and m4 are coprime moduli. Equation (10) allows the recursive application of equation (8) on the three pairs (ml, m2), (m3,rn4) and (ml * m2, m z , m4) as shown in Fig.1. In order to prove the usefulness of the above procedure, we searched for four moduli according to (9) and (10). In particular, we searched for moduli with k = j , for which the equation (10) becomes m4m3 - m2ml = m 2 - n 2 = 2 R
(11)
112
G.C. Cardarilli et al.
With such a choice, it can be proved that (ml, m2, ms, m4) is a set of coprime moduli, as shown in Appendix B. The solutions of this search is shown in Tab.1. In this table, some solutions corresponding to the wordlength normally used in D S P applications are presented. It is worth nothing that the moduli corresponding to these solutions are very similar in magnitude. This means that the four moduli R N S is implemented by four arithmetic blocks with similar complexity. 3
HARDWARE
IMPLEMENTATION
In order to point out the requirements of a hardware implementation for the proposed method, it is necessary to evaluate the dynamic range of the different terms of (8). In our approach with respect to the classical approach we need a reduced dynamic range. In fact, we have
(m2- I) < m2r -
_<
i)
0 < a <_. 2 k+1
We can note that the result of (8), X, is an integer number. This implies that in (8) the value of a must be such that the numerator of (8) becomes a multiple of 2~+1. Consequently the k + 1 less significant bits of the numerator are equal to zero allowing a suitable simplification for the hardware implementation. In fact, the presence of the k+ 1 zero bits in the numerator permits a fast search of the correct multiple a M among all multiples stored in a ROM. The search of the correct multiple is followed by an addition of some bits of the selected c~M to the quantity m2rl - mlr2 in order to obtain the correct value of the numerator. The architecture of this block is shown in Fig.2. In Fig.3 the control logic circuit is presented. In Fig.1 a possible solution is presented for the extension of our approach to a four moduli RNS. 4
CONCLUSIONS
In this paper an RNS to Binary conversion method based on the moduli set ( n - 2 4 , n + 24) has been presented. The choice of this particular class of moduli allows the simplification of the converter architecture eliminating the division by (m2 - ml). The proposed architecture can be extended to a larger number of moduli and can be easily speeded-up by using a pipeline techique. Acknowledgements The authors wish to acknowledge the reviewers for the constructive suggestions. This work was partially supported by the Italian Research National Council (CNR), contract number N.94.01811.CT07.
Efficient VLSI Architecture for Residue to Binary Convener
113
.~ppendix A
T~o~m
~ ~,
~o~,~
o~ , ~ , ~
~ ~
~ , ~ ,~o ~ ~
~
-
(~-~)
m2
is odd. Proof: If n is even also the two numbers m l , m2 axe even and then divisible by two. Now, let us consider the case of n odd. If we suppose that ml, m2 are not coprime, there should exist a c o m m o n divisor p such that m2 - pm~, ml = pm~ Then the difference between ml and m2 should be m2
-- m l
= n + 2 k -- n-{- 2 k = 2 k+l = p ( m ~
-- rnl)
From the last equivalence of the above equation, we obtain that p must be a divisor of 2 k. Consequently, from the hypothesis of p odd derives that p is equal to 1.
114
(7. C. Cardarilli et al.
References
[1] M. Sodestrand W.Jenkins et al. Residue Number system arithmetic: Modern applications in digital signal processing, IEEE Press, 1986. [2] A.Benjamin Premkumar. An RNS to binary Converter in $n-1, $n, $n+l Moduli Set. IEEE Trans. on CAS I I , July 1992.
DR l -
Im
4 5 6 7
3 7 15 31
5 9 17 33'
8
63
65
I '~,
Ira,
im~
1 ...... 5 5 9 13 17 29" 33
.... 3 7 15 31"
.... 6~
.... 65
9 127 129 125 129 .... 10 255 257 253 ' 257 11 511 '513' 509 "'513 12 1 0 2 3 1 0 2 5 1 0 2 1 i025 18 65535 65537 65533 65537
63
Ira,
I 1r o! b~t, D .
7 11 19 35 67 131 259
,
127 255 515 511 1023' 1027 65535 65539
6 ,
15
31
63
Tab.1 Some solutions for equation (9) and (10) for j=k
.
.
115
Efficient VLSI Architecture for Residue to Binary Converter
,J,m' ,,,I, m~ xl
rl~
mlm2
m3m4
,L
,L
CONV2M
,J
>x
"1
IcoNi2M_ I
CONV2M
x2
q'm3 q'm,
Fig. 1: Architecture of a four moduli converter.
Cin aM rl . . . =
m2
ml Fig.2: Two moduli converter architecture
_I- ' ~n~ Io,I T ...
I_~ i ~
Cln (to adder)
To Mux
m2r I-m Ir2
) R o m address
J f
k+l
....
N-(k+l)
Fig. 3: Control logic circuit
j--~To adder
Algorithms and Parallel VLSI Architectures Ill M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
I19
A CASE STUDY IN ALGORITHM-ARCHITECTURE CODESIGN: HARDWARE ACCELERATOR FOR LONG INTEGER ARITHMETIC
C. RIEM, J. KONIG, L. THIELE
C. Riem, J. KOnig and L. TMele
120
methods will be treated in some detail. Furthermore, the particular application leads to a concentration on the class of massive parallel architectures with some degree of regularity. An overview of the different mathematical approaches to the design of (piecewise) regular architectures is given in e.g. [23, 24]. Despite of many generalizations, the computational model of Karp, Miller and Winograd [9] may still be considered to be the most important contribution. Recently, methods which aim at mechanical, provably correct synthesis are receiving more and more attention, see e.g. [12, 18, 25]. The overall process of mapping consists of a sequence of program transformations that is applied to an initial behavioral specification. The basis of the design process is the methodology adopted in CoMPAR[1]. The main properties with respect to the used models and methods are: 9 Algorithms and architectures are described using LIRAN (Linear Recursive Algorithm Notation), a functional subset of UNITY [2] which has been extended to allow for functional hierarchy and type refinements. 9 The basic model of algorithms can be described by a set of equations of the form V X~Iv: v[f(O ] =
Yv(w[g(O],...)
(1)
The three dots in the list of arguments of the function ~'v represent similar arguments. The index set I is privat to each equation. The index functions f ( l ) and g(I) are af[ine. 9 The index sets of piecewise linear algorithms are linearly bounded lattices ( L B L ) , see e.g. [21, 22]: I={I"
I=A,•+b
^ C , a > _ d ^ a E Z t}
where A E Z "xt, b E Z', C E Z mxt and d E Z m. This particular algorithm model is closed under operations like localization of data dependencies, partitioning, multiprojection, scheduling and control generation. This property is not satisfied by many other models. Even after a first level of partitioning or local broadcasting, the representation can be processed further, e.g. for increasing the levels of hierarchy, to perform scheduling and allocation or to generate control signals. These methods will be explained on the design of an arithmetic processor CoSEMI, a Coprocessor for SEMlnumerical algorithms, which was developped at the University of Saarland to help in computer algebra computations and cryptoanalysis such as RSA cryptography and factorization of large integers [26]. The term 'seminumeric' is borrowed from D.E. Knuth [10] "because they [the algorithms] lie on the borderline between numeric and
symbolic calculation". An overview on computer arithmetic is given in [7], an in depth discussion of digit recurrence algorithms for division can be found in [6]. The algorithms used in the coprocessor CoSEMI are based on [15]. The architecture developed using CoMPAR is characterized by
A Case Study in Algorithm-architecture Codesign
121
9 computation of multiplication, Z-division, modular multiplication and GCD on long integers using a piecewise regular architecture, 9 a scalable architecture, 9 different nested levels of partitioning for load balancing and hardware matching, see e.g. [201, 9 local broadcast for fast on-chip and local off-chip communication, see e.g. [3], and 9 local control flow for operation switching, see e.g. [19]. COSEMI has been fabricated and embedded into a SPARC Workstation. The host interface is described in [11]. 2
OPERATIONS
ON LONG INTEGERS
In order to achieve high performance in digital arithmetic, the design of efficient division algorithms is of central importance. Several systolic approaches have been proposed, combining it with multiplication [26], multiplication and square root computation [5] or multiplication and computation of the greatest common divisor [15]. We are now going to concentrate on division algorithms, as it is an essential part of both modular multiplication and computation of the greatest common divisor. On the other hand, multiplication can be adopted as a simplified case of division so that the major issue of integrating the required operations can be satisfied. Moreover digit recurrence algorithms working in a MSD (Most Significant Digit) first fashion are chosen. 2.1
Division
Division with remainder of N by D is defined as follows: Find integers Q and R so that the following equation holds: N =Q.D+R (2) In order to uniquely determine Q and R, a further condition is necessary. This paper concentrates on division by smallest remainder, called Z-division (N-divison with positive remainder is quite similar)' D D ---
2.1.1 Traditional division algorithm A simplification is to split the determination of Q into more steps which can be carried out easier. This concept is used by the traditional division algorithm: numbers are represented by digits in a base (radix) b and only one digit of the quotient is determined in each step, beginning with the most significant digit (MSD).
122
C. Riem, J. Kdnig and L. Thiele
More formally: Let b be the base used for the representation of numbers and let N, D, Q, R be as follows: x--1
X = X O X l . . . x x - 1 = Z bx-J-lxJ j=o
xj E [0, b - 1].
(4)
Let X (i) = x(i)x[i)...x(')_ 1 denote the value of X in computation step i. The algorithm can then be described as follows: 1. Normalization: shift dividend N by n - d digits to the right: N(~ = N , b -(n-d). 2. Quotient determination: for i = 0 to n - d do: (a) Quotient digit determination: qi = N (i) § D
(b) Subtraction: R (i) = N (i) - q i * D
(c) Determine new Dividend by
shifting:
N (i+1) = R (i) , b
3. Result determination: q = q o . . . q n - d , R = R (n-d)
Although the determination of Q is now broken up into several steps, the algorithm still has two crucial drawbacks for a parallel implementation. 9 Determination of the quotient digit: The (exact) determination of the quotient digit qi requires a look at the whole length of the divisor and at the same number of the
MSDs of the current remainder. A parallel implementation would be complex due to the large number of digits considered. One solution is to look only at the first k digits to find an estimate on qi. Redundant number representation of the quotient is an often used technique to accomodate for later adjustment of the quotient due to wrong quotient digit selection in earlier steps, see e.g. [14] for a general discussion. 9 Subtraction step: Another drawback of the algorithm for a parallel implemetation is
the subtraction step. A carry could be propagated from LSD (Least Significand Digit) to MSD. As for the quotient, redundant number representation of the residual avoids this drawback. As the divisor does not change during division, not much can be gained for the division steps by using an other representation for it. In fact the division step gets more complicated in redundant representation. Because of that, a unification of the operand representation has not been given much attention in the literature. For the computation of the greatest common divisor (GCD) with Euclid's algorithm [8] however, remainder and divisor have
A Case Study tn Algorithm-architecture Codesign
123
to be exchanged so that the same representation for both is useful to avoid a difficult to parallelize conversion steps. It should be realized, that quotient and remainder have to be converted to nonredunant representation where a carry propagation from LSD to MSD cannot be avoided. It can be done on the fly starting at the MSD in a digit by digit manner with a step delay time roughly equivalent to the delay time of a carry save adder [4];
2.1.2 Transformed division algorithm For our further discussion, we use signed digit representation for the opera nds and the results (quotient and remainder), where a digit is extended by a sign (see Preparata and Vuillemin [15] for an in depth discussion of the mathematical aspects). The choice of the radix b allows for a tradeoff in cycle time versus reduction in the number of cycles. For the new transformed algorithm, we change the representation of the numbers N and D to fractional form in order to distinct the k digits considered for quotient digit estimation from the rest of digits (the fractional part). The non fractional part is simply written as do and no respectively in an unspecified base (as integers): d-k
D = do.d1...dd-k = do + ~ b-Jdj,
n'k
N = n o . n 1 . . . n , - k = no + ~ b-Jnj,
j=l
do, hoE [-b k,b k - l ] ,
j=l
dj, nt E [ - b , b - 1 ] , l < j < d - k, l < l < n - k
The new algorithm is then:
1. Normalization: N(~ = N , b-(n-k), D = D , b-(d-k). 2. Quotient determination" for i = 0 to n - d do: (a) Quotient digit selection" qi = n(oi) + do (b) Subtraction: for j = 1 to d - k do in parallel , Subtraction" r~/) = n~i)- qi* dj
r(i) = n(i) -- qi * do (c) Shift" for j = 1 to d - k do in parallel
n(i+l) = s(i) j j+~ + c(i) j+2 n (i+1) = r(i)b -I- s[ i) -t- c[i)b + c~i) 3. Result determination: Q = q o . . . q , - d , R = N(n-a) , ba-k
The results are in redundant number representation. The conversion to nonredundant representation may result in a carry propagation over the full length of the results. Nonetheless, we have done n - d subtractions without carry propagation whereas only one subtraction is necessary for the conversion. Note that do must be > coast(b) for the algorithm to work properly. E.g. for base b = 2, do > 8 (that is k > 4), do > 32 for b = 4 (k > 3) and do > b2 for b > 8 (see [15] for further details).
124
2.2
C. Riem, J. Kdnig and L. Thiele
Multiplication and Modular Multiplication
Multiplication can be easily adopted to our division algorithm by replacing the quotient digits by the multiplier digits (MSD first), the divisor by the multiplicand and by initializing the partial remainder to 0, as it will be the result. The Subtraction and split step 2(b) are extended to start at j = - 1 and two new index points j = - 2 and j = - 3 are inserted which perform simple addition. Let A = a o a l . . . a n - l , B = bob1...bn-1 and P = A , B = POP1...P2,~-x. We initialize n = 0 f o r j _ _ _ - 3 a n d d j = b n _ j for l _ < j _ < n a n d d o = d - 1 = d - 2 = d _ 3 = 0. The quotient digit qi is then set for 0 < t < 2n + 2 according to
9 O
-ai 0
qi=
(5)
_(i+,) 2(b) and 2(c) are extended to j _ - 2 . n_ 3 is set to .(_/~+ c(._i~.[15] gives an inductive _(i+1) proof for b > 4 whereas [16] does the same for b = 2 in order to show that ,,j E [-b, b - 1]. Modular multiplication P = A 9 B (mod M) is the computation of P = A 9 B with an additional division by a modulus M where the final remainder defines the result. Instead of first computing P = A , B and then P (rood M) with an intermediate result of length 2n + 2, it is possible to interleave the computations by appropriately shifting the modulus and the multiplier by a positions. So instead of P = A , B (mod M) we compute Pb" = (Abe'), B (mod M b~ ) The algorithm uses variables u (i), p(i), u(i) together with qi as follows during 0 < i < n + a : u(i) = p(i) + ai , b (6) qi
U(Oi) + mo
(7) %
at
p(i+l)
= b (u (i) - qi* m) (s) This translates to one step of multiplication without shifting followed by a division step where the modulus is the divisor. An inductive proof of the correctness can be found in [15]. ~t
2.3
G r e a t e s t C o m m o n Divisor ( G C D )
The greatest common divisor of two integers a and b is the largest integer that divides both of them: gcd(x,y) = max {k I k \ x and k \ y } (9) It is easily computed using Euclid's algorithm [10] which uses the following recurrence: gcd(O, y) = y
gcd(x,y)
=
g c d ( y m o d x , x),
forx > 0
(10)
This suggests the following implementation: Initialize N <~ = n(~
~
"n(n~ to x and D (~ = d(~
1. if D (i) = 0 then gcd(x, y ) = N (i)
)'" "d~~ k to y.
A Case Study in Algorithm-architecture Codesign
125
2. shift D (i) until d (i) >_ const(b) 3. compute R(i) = N (i) rood D (i) by performing n - d + 1 steps of Z-Division. 4. set N (i+1) = D (i) and D(i + 1) = R (i), goto 1; const(b) is the accuracy needed in do for the algorithm to work correctly as described in section 2.1. The next step is to construct a unified dependence graph for all these operations. Because the number of shifting steps of D(i) is dependend on the value of D (i), the computation of the greatest common divisor cannot simply be intergrated. Instead, an external control is responsible for initiating the exchange of the divisor and the residual upon detection of d (i) <_ const(b). 2.4 Divir
Unified D e p e n d e n c e G r a p h multiplication and modular multiplication are now condensed into one dependence
126
C. Riem, J. K~nig and L. Thiele
Figure 1: Unified Modules Mo and Mj
9 Unification of Algorithms: The processor should be able to perform multiplication, modular multiplication, division and GCD for long integer numbers. To this end, a (re)design of existing algorithms had been necessary. Finally, the corresponding dependence graphs have been combined into a piecewise regular form. The algorithms are based on redundant number systems to avoid carry propagation, see [15]. 9 Regularization: The algorithm as designed so far is piecewise regular. In order to account for the necessary operation switching of the processing elements, the technique of control generation has been applied, see e.g. [19] and the references therein. Some problems still remain. They are summarized in the following together with a solution by further transformations.
9 Load Balancing: The estimation of the quotient digit has shown to be about three times as complex as the computations performed by module Mj. To increase the load of these modules and to reduce silicon each module Mj computes three digits while module M0 estimates one quotient digit. This has been implemented by tripling the register set of Mj. If I modules Mj exist, then each of them works on digit j, l + j and 21 + j in turn. To this end, local modifications of elementary operations and a new indexing scheme have been applied. For this purpose of load balancing, a 'locally parallel- globally sequential partitioning' has been applied to each chip. 9 Global Signal Propagation: To achieve high data throughput, the quotient digits have to be broadcasted to all cells Mj. This contradicts the cascading of chips for building larger computation fields. The solution to this problem is local on-chip broadcasting and systolic off-chip communication, which was possible due to the technique used for load balancing. The corresponding technique, see [3], can be shown to be a special case of the partitioning methodology proposed in [20].
A Case Study in Algorithm-architecture Codesign
127
Figure 2: Unified Dependence Graph 9 Data 1/(9: The data flow between the host and the individual processing elements has been designed using localization techniques, see e.g. [17, 19] and the references therein. Also the operands are input serialized, one digit at a time. This transformation is achieved by introduction of an additional load phase prior to the computation and a final read-out phase for parts of the result still inside the field. 9 Interleaving of Operations: Because of the digit serialization for input/output of operands/results, additional time steps are required which reduce the efficiency of the computations. By overlapping input of a new operation with an active computation the penalty is nearly 0, a constant gap of 2 to 5 steps depending on the performed operation is necessary. Operations are separated by marking the beginning of a new operand. 9 Control Generation: The selection of the operation to be performed has been moved to the outside by the insertion of some multiplexers. The control signals have been determined automatically, see e.g. [19]. These control signals are responsible for the correct dataflow between the host, the processors and external memory. This control information is global inside a chip but systolic between two chips, running in parallel with the quotient digits. 16 different control instructions have shown to be sufficient for the design. It is purely combinational inside a chip so that it can be regarded as stateless. The relativly simple control requires a more complex stateful control from an interface that bridges the processor and a host system. 9 Fixed Size Hardware: As the size of the hardware is fixed, the architecture must be partitioned, see e.g. [20] and the references therein. Note that three levels of partitioning are used after all (local broadcast, load balancing, fixed size mapping). It is thus possible for one of the operands to be larger than the field length available
C. Riem, J. KOnig and L. Thiele
128
in hardware. E.g. the dividend may be arbitrary large because computations are only performed on digits in the length of the divisor. 9 Scalable Architecture: As the data and control flow between chips is local in time (systolic), the system is scalable. The configuration is a bidirectional linear array of processing elements. As module M0 is only needed once, it can be deactivated if a chip is not the front node.
These transformations can be visualized in a timing diagram as shown in Figure 3. Note that each module Mj is displayed three times, although they only exist once but with a triple register set.
Figure 3: Timing Diagram with on-chip off-chip commtmication
4
IMPLEMENTATION
A chip has been designed for radix-2 as this leads to simple cells at the cost of higher communication and thus a higher clock rate. Due to the internal partitioning, off-chip communication only occurs every third cycle. The design has been implemented using Cadence Edge and a 1.5p silicon process from ES2 (European Semiconductors). It has been fabricated via EUROCHIP and has a capacity of 17000 gates. The design includes 44 cells Mj which allows for 132 binary digits to be processed concurrently. A test board with 8 cascaded chips has been fully integrated into a Sun Workstation with a free programmable interface based on FPGAs (Field Programmable Gate Arrays) and
A Case Study in Algorithm-architecture Codesign
129
DVMA (see [11] for further details). The communication between the host and the application is asynchronous allowing both parts to run at their optimum speed. The necessary control is implemented on the interface by simple finite state machines which control the four operations. The non redundant representation is fully hidden from the host so that normal 2n-complement representation on the host is sufficient. On the fly conversion from redundant to nonredundent representation as described in [4] is performed as a final step on the result in the interface. The test board can process numbers of up to 8,132 = 1056 bits for the smaller operand. The dividend in case of division or the multiplier in case of multiplication may be arbitrary long. A speedup of up to 10 has been achieved compared to the GNU multiprecision library. A C-library for the operations is currently beeing developed for easy integration into software environments with the need for high speed long integer arithmetic, such as cryptographic applications, e.g. the quadratic sieve algorithm [13]. Further investigations currently concentrate on simplifying module M0, supporting square root computations and extending the concept to polynomial arithmetic. 5
CONCLUDING REMARKS
In the present paper, no new methods or models are introduced. But it is the opinion of the authors that at the present state of the design technology it is of importance to show how to combine the different methods in a systematic Algorithm-Architecture Codesign. References
[1] U. Arzt. COMPAR: Compiler .for massive-parallel architectures. Dissertation, University of Saarland, 1994 [2] K.M. Chandy and J. Misra. Parallel Program Design. Addison-Wesley Publ. Comp., Reading, Mass., 1988. [3] Ed. F. Deprettere. Cellular broadcast in regular array design. In Proc. VLSI Signal Processing Workshop, pages 319-331. Computer Society Press, 1992. [4] M.D. Ercegovac and T. Lang On-the-fly conversion of redundant into conventional representations IEEE Trans. Computers, C-36:123-125, 1987. [5] M.D. Ercegovac and T. Lang Module to perform multiplication, division, and square root in systolic arrays for matrix computations. Journal of Parallel and Distributed Computing, 11:212-221, 1991. [6] M.D. Ercegovac and T. Lang Division and Square Root, Digit Recuurence Algorithms and Implementations. Kluwer Academic Publishers, 1993. [7] K. Hwang Computer Arithmetic. John Wiley & Sons, 1979. [8] L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics Addison-Wesley Publ. Comp., Reading, Mass., 1989. [9] R. M. Karp, R. E. Miller, and S. Winograd. The organization of computations for uniform recurrence equations. Journal of the ACM, 14:563-590, 1967.
130
C. Riem, J. KOnig and L. Thiele
(10] D. E. Knuth. The art of computer programming Volume 2 Addison-Wesley Publ. Comp., Reading, Mass., 1968. [11] J. Kibnig and L. Thiele. A High Speed FPGA-Based Interface to High Bandwidth Hardware. In W.R. Moore and W. Luk, editors, FPGAsl Abingdon, England, 1994. Abingdon EE/CS Books. [12] C. Lengauer, M. Barnett, and D.G. Huds6n. Towards systolizing compilation. Distributed Computing, 5:7-24, 1991.
[13]
C. Pomerance The quadratic sieve factoring algorithm. In T. Beth, N. Cot and I. Ingemarsson, Eds., Lecture Notes in Computer Science ~09; Advances in Cryptology: Proc. Eurocrypt'8~, pp. 169-182, Berlin: Springer Verlag, 1985.
[14] B. Parhami Generalized signed-digit number system: A unifying framework for redundant number representation. IEEE Trans. Computers, C-39:89-98, 1990.
[1 1 [16] [17]
I[18] (19]
[20] [211
[22] [23] [24]
[2e]
F.P. Preparata and J.E. Vuillemin. Practical cellular dividers. IEEE Trans. Computers, C-39:605-614, 1990. Riem, C., "Entwicklung eines Coprozessors fiir seminumerische Algorithmen", Master's thesis, Lehrstuhl fiir Mikroelektronik, Universit~t, des Saarlandes, Saarbriicken, Germany, 1993. V. Roychowdhury, L. Thiele, S. K. Rao, and T. Kailath. On the localization of algorithms for VLSI processor arrays, in: VLSI Signal Processing III, IEEE Press, New York', pages 459-470, 1989. J. Snepscheut and J. Swenker. On the design of some systolic algorithms. Journal of the A CM, pages 826-840, 1989. J. Teich and L. Thiele. Control generation in the design of processor arrays. Int. Journal on VLSI and Signal Processing, 3(2):77-92, 1991. J. Teich and L. Thiele. Partitioning of processor arrays: A piecewise regular approach. INTEGRATION: The VLSI Journal, 14(3):297-332, 1993. L. Thiele. On the design of piecewise regular processor arrays. In Proc. IEEE Symp. on Circuits and Systems, pages 2239-2242, Portland, 1989. L. Thiele. Compiler techniques for massive parallel architectures. In P. Dewilde, editor, The State of the Art in Computer Systems and Software Engineering, pages 101-151. Kluwer Academic Publishers, Boston, 1992. L. Thiele. Mapping algorithms onto VLSI architectures. In P. Pirsch, editor, VLSI Implementations for Image Communications, pages 69-116. ELSEVIER Publishers, 1993. L. Thiele and U. Arzt. On the synthesis of massively parallel architectures. International Journal of High Speed Electronics, 3(4), 1993. J. Allen Yang and Y. Choo. Parallel program transformations using a metalanguage. In Proc. ACM Conf. on Principles of Programming Languages, pages 11-20, 1991. A. Vandemeulebroecke, E. Vanzieleghem, T. Denayer and P. Jespers. A new carry-free division algorithm and its application to a single-chip 1024-b RSA processor IEEE J. of Solid State Circuits, Vol 25:748-756, 1990.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
131
AN OPTIMISATION METHODOLOGY FOR MAPPING A DIFFUSION ALGORITHM FOR VISION INTO A MODULAR FLEXIBLE ARRAY ARCHITECTURE
AND
J. ROSSEEL, F. CATTHOOR, T. GIJBELS, P. SIX, L. VAN GOOL, H. DE MAN IME C Kapeldreef 75 8001 Leuven, Belgium rosseel @@imec. be
E~A T Katholieke Universiteit Leuven 3001 Leuven, Belgium gijbels @@esat. kule u ve n. ac. be
ABSTRACT. This paper addresses the architecture mapping of a complex non-linear diffusion algorithm for use in vision applications. The algorithm exhibits several parameters and has to be partly programmable. Real-time execution for a particular parameter set is however required. Starting point for this design was a set of affine recurrence equations. For this abstract specification, we tackle the difficult task of finding a globally optimised architecture with fully matched throughput while avoiding an explosion of the search space. KEYWORDS. Regular array architectures, vision algorithms, image processing, memory size optimization, architectural mapping. 1
INTRODUCTION
Real-time signal processing (RSP) applications as in video, speech and image processing require a mapping methodology which is optimised to throughput and not to latency. This demand contrasts with most other current array synthesis approaches [6, 3, 11, 12, 2, 1, 14, 8] that allow to transform regular algorithms from affine (ARE) or uniform recurrence equation (URE) form into efficient regular array architectures (RAA's). Moreover, to allow pipelined processing elements (PEs) and multi-dimensional algorithms, changes are needed to the uniform mapping and projection (scheduling) approach. This paper discusses our manually applied formal design methodology [9] resulting in a modular and programmable regular architecture implementing an image diffusion algorithm. It involves initial algorithm transformations, 1D or 2D placement mapping, multidimensional scheduling, local PE design and scheduling and low-level scheduling. All these steps are embedded in an iterative decision process with pruning due to ordering of the decisions and gradual refinement of the cost functions used to differentiate alternatives.
132
J. Rosseel et al.
Two main aspects of this design will be highlighted. The first aspect involves the application of algorithm transformations that allow the design of modular architectures with evenly spread and reduced I/O, suitable for different sets of algorithm parameters and compatible with the design method of [9]1. The second item is the design and scheduling of a very complex PE with the method presented in [9, 10], using a tuned ILP formulation. It will be shown that a carefully optimised array architecture can be obtained with several novel features. The architecture exhibits maximal use of the data-path hardware but is also heavily optimized in terms of the complex and crucial foreground and background memory storage.
2
A DIFFUSION IMAGE PROCESSING ALGORITHM
The general goal of diffusion algorithms in image processing is to filter out noise (smoothing the image) in even portions of the image, while preserving sharp edges, resulting in a better image. Smoothing can be realised by (iteratively) adding a correction term to each pixel that is proportional to the gradients of the image at the position of that pixel. However, to preserve sharp edges, this correction term must be zero for those parts of the image where large gradients occur. Also, if this smoothing is applied iteratively in order to smooth over larger regions, the final image may be too far off the original image [7]. Therefore, the following generic equation can be used to calculate a pixel in the kth iteration of a diffusion algorithm: ia~ = p~y +
(1)
- K x (l%y -P~y)
where p~y is the value of the pixel at position (~, y) in iteration k, p0y represents the pixel values of the original image and F(Vp~ "1) is a non-linear function of the gradient of the image at pixel position (x, y) in iteration k - 1. The original formula used for calculating a new pixel in an iteration is given by the following equation [4]: fnew
--
fc
+ (~ • ~ ) •
] f s - fc f w - fe fn - fe fE-fc ~ + (f~ - fc)" + ~2 + ( f w - fc)~ + r2 + ( f n - fe)~ + r2 + (fs - fc)2 J - ~-~ • ( f c - gl
(21
where fnew represents the pixel being calculated; fc, fN, fE, f s and fw represent respectively the pixel at the same position, at the north, east, south and west of the previously iterated image; g represents the pixel at the same position of the original image. This equation can only be implemented in hardware at a high cost. An alternative function that requires less resources in hardware, yet provides more flexibility and programmability will be presented in section 5. 1Part of the latter method has been submitted to another conference
Mapping a Diffusion Algorithmfor Vision 3
ALGORITHM
3.1
133
TRANSFORMATIONS
The Original Algorithm
The algorithm in its initial Aftine Recurrence Equation (ARE) form, ~ using a simplified form of equation 2, can be written as follows:
f ( x , y , k ) - F ( g ( x , y ) , f ( x , y , k - 1), f(x 4- 1, y , k - 1), f ( x - 1, y , k - 1),f(x, y4- 1, k - 1 ) , f ( x , y - 1, k - 1)) V(x,y,k) E { (x,y,k)]k >_1,1 < x < imagewidth, 1 < y g imageheight } (3) A number of algorithm transformations were applied to the original diffusion algorithm. These transformations had the following goal: 9 Making the algorithm (and the architecture) modular. 9 Minimising I/O. 9 Moving control logic to border PE's.
3.2
Towards a Modular Algorithm
The main transformation that will be discussed here was introduced to obtain a modular algorithm that results in a programmable architecture which can be used for different image sizes. In the original ARE (see equation 3), the bounds for the image sizes are parameterisable, a while the upper bound for the iteration index k has to be determined at run-time in a number of cases. The resulting index space thus has variable bounds in all directions. This makes it impossible to design a fixed-size architecture that is usable in all circumstances. Indeed, in the transformational design method, the bounds of the algorithm are translated in bounds on the array size by the placement mapping process. If all algorithm bounds are variable, a variable sized array will be tlle mandatory result. To overcome this dilemma, one can split the images into vertical bands 4 with fixed width N. This "bandwidth" must thus be equal to the lowest common denominator of all image widths that have to be processed. In practice, N will equal 8 or 16. The algorithm transformation from original to banded indices is straightforward. The relation between a pixel in the new, banded coordinate system and the original coordinate system can be expressed as follows: (new:)
f(B,x~, y, k)= f ( N . B + xB, y,k) (old)
(4)
The index B indicates the band-index, whereas the index xB indicates the horizontal pixel position within one band. This transformation introduces a new index-space that has fixed borders in the direction of the xB axis. This will be exploited in the placement mapping. 2For this case, the ARE was also a Uniform RE because no operations had to be localised. 3The architecture must be usable in multiple applications with different image sizes. 4Horizontal bands could also be used with identical results.
134
3.3
J. Rosseel et al.
I/O Reducing Transformations
Communication bandwidth reducing transformations are crucial for this array architecture, as a large amount of background memory is needed to store the intermediate iterated images. Minimising the I/O bandwidth to these memories can greatly reduce total system cost. The transformations are related to the reindexing transformations proposed in [13]. The transformation leading to the communication bandwidth reduction starts from the observation that in order to calculate one pixel, five pixels of the previous iteration are accessed each time. This can be reduced to one as the signal f ( x , y , k - 1) used in the calculation of f(x, y, k) is also used in the calculation of f ( x + 1, y, k), f ( x - 1, y, k), f(x, y + 1,k), f(x, y - 1 , k ) . Therefore, this signal can be accessed only once, and then be propagated to the other index-points. This introduces the intermediate signals fc, fN, rE, fs, and fw. Care must be taken when applying these transformation so that the resulting dependencies will form a pointed cone that is "as pointed as possible" [13] in order not to restrict the mapping process too much.
3.4
Regularising Transformations
A last set of transformations introduces more regularity in the I/O by introducing extra row and columns in the algorithm. These effects are illustrated in figure 1. This transformation reduce the communication bandwidth peaks required at the borders of the architecture, or at the beginning of a new iteration pass. Another goal of this regularising transformation is to separate control flow steering operations from operations performing "real" optimisations. This is accomplished by moving these control equations to new index-points at the borders of the original index-space, as illustrated in figure 2. Such a transformation must be applied with care however. Creating extra index-points can slow down the execution of an algorithm, but it can also simplify
Figure 1: Removing peaked I/O requirements (in time) by introducing extra "rows" in the algorithm
Mapping a Dtffusion Algorithm for Vision
135
Figure 2: Removing peaked I / 0 requirements (in space) and isolating control operations in the border PE's by introducing extra "columns" in the algorithm the design of the PE's. Both counteracting effects are observed in this design, but the overall throughput decrease of 1% is accepted because of the hardware simplifications that are possible (see section 7).
3.5
The Final Set of Equations
The final result of applying all transformations is the set of conditional URE's (partly) in equation 5. This result is much more complicated than the original equation, but exhibits a number of advantages that more than compensate the apparent complexity of the description. The abundance of conditional equations implies some control logic, but the carefully selected transformations placed the conditional operations at the borders so that overall control overhead is minimised. Moreover, this CURE is very modular, so that one architecture can be designed that can be used for multiple application parameters. 1 < t < #iterations 0 < B < imagewidth m
--
N
F~(B, x s , y , t ) E n = { (B, xB,y,t) l O < xo < N + l
1 }
0 < y < imageheight
f(B, as, y, t) =F(go(B, as, y, t), fc(B, xs, y, t), rE(B, xn, y, t),
/w(B. x.. v, t)./~(n. ~8. v, t). fN(B. ~B. v, t)) ify>lAl<xs
go(B, z s , y , t )
1
f g(N . B + xn, y) i f y > l A l < x s < N A t > l Lfc(B, z s , y , t ) if y > 1 ^ 1 < xn < N A t = 1
I N ( B , ~ , , V, t) = !r l f c ( B ' ~ ' ' y - 1,t)
( fc(B, zn,y,t)
ify_> 2 ^ 1 _<xB _
(5)
136
4
J. Rosseel et al.
ARCHITECTURE MAPPING: AND PLACEMENT
MULTI-DIMENSIONAL
SCHEDULING
The mapping of the transformed diffusion algorithm to an architecture will only be briefly discussed, without providing the details of the analysis and comparison of a number of alternatives. In this summary, only the end result is provided. The transformation mapping the algorithm indx to time and PE-coordinates that was used, is written in a general form as follows:
t._2
=
=Tx
i+7
=
7r(._2)~
...
~r(,,_~).| x
|s(._,),
...
s(._,).|
[. s,l
""
snn J
+
(6)
i
The complete optimisation method for finding optimal mapping parameters as described in [9] has been used for this design. Because no localisation was needed (URE's were available as a starting point), the optimisation process was fast and simple. More in particular, finding the best matched placement mapping was extremely simple, as most of the coefficients of the 1 x n placement matrix can be fixed. Indeed, because only one index has fixed bounds, only one index of the algorithm can be mapped to a P E index so that only a onedimensional RAA is possible. Moreover, to make the size of the architecture independent on variable bounds in the t, B and y directions, the coefficients in the t, B and y columns of the placement matrix must be zero. Hence, the only placement matrix that can be used (except for multiplication with a factor) is: /3'
S=[
0
:rs
y
t
1 0 0 ]
,(7)
After applying this placement mapping, three PE-types can be distinguished: the hftmost and rightmost P E calculate nothing but have to take care of the parameter dependent control and are required to input the border pixels of the adjacent bands of the previous iteration, whereas the other PE's perform the diffusion calculation. The border PE's can be implemented easily using off-the-shelf programmable components like FPGA's, so that only the body P E type will be considered. The specific example that was studied processes a 256 x 256 image with a frame rate of 25 Hz. It is estimated that 50 to 100 iterations are needed for each image. So, 256 x 256 x 25 x 50 = 81920000 to 163840000 index-points have to be calculated per second. If we assume that 16 PE's will be used 5, then each PE must process 5.12 to 10.24 million index-points per second. If a clock-frequency of ~ 20 MHz can be reached, then the PE must have an operation interval equal to 2 or 4. This is a requirement for the P E-designs that will be discussed in section 5. 5This is the case if a bandwidth of 16 is used which can be deduced from memory bandwidth considerations.
Mapping a Diffusion Algorithmfor Vision
137
The choice of a 1-dimensional architecture implies that 3 indices must be mapped to a multi-dimensional time. Therefore, in the initial mapping phase, two multi-dimensional scheduling vectors must be found. The vectors that were chosen have been derived from the given timing constraints and they optimize the memory requirements according to the methodology of [9].
H, = [ II2=[ 5
B 0 1
xB y t 0 0 1 ] 0 0 0 ]
(8)
LOCAL PE DESIGN AND SCHEDULING FOR (VERY) C O M P L E X KERNELS
This section will discuss the design of a number of alternative PE's for the diffusion application. First, two designs for the original diffusion equation which required costly divisions will be presented. These designs are for a 4 and 2 cycle PE pipeline budget. Second, an alternative 2-cycle P E design will be proposed for a piecewise linear approximation of the original equation. This alternative design addresses the too large area needed for the first designs. For this design, a more efficient approach for scheduling the operations on the PE is proposed, based on bad experiences with the first designs.
5.1
PE's for the Original Equation
A first design was made for a P E with a 4 cycle pipeline budget. This design uses 2 adders, 2 subtractors, 2 multipliers and 1 completely unrolled, parallel divider. The operations were scheduled onto the operators using a manually formulated ILP problem, based on the equations and constraints given in [9]. This scheduling problem was solved by the LINDO ILP solver on a DECstation 5000 in approximately 45 minutes of time. The resulting schedule shows that the execution of 2 consecutive index-points must be interleaved in order to obtain the required 4-cycle pipeline. A second design was made that tried to achieve a 2-cycle pipeline period. This highperformance des!gn uses 3 adders, 3 subtractors, 4 multipliers and parallel dividers. The hardware requirements for this design are thus quite enormous. Trying to find a schedule with a 2-cycle operation interval failed in a first phase, because the interleaved execution of 2 consecutive index-points wa.s allowed. Therefore, a new ILP formulation of the scheduling problem was made, allowing 3 consecutive points to be interleaved. The ILP solver then found a schedule that reached the 2 cycle goal and requires indeed the execution 3 consecutive index-points to be interleaved. 5.2
A m o r e Efficient and Flexible P E
The original diffusion equation uses a very smooth function of the gradient in tile correction term part of the equation. However, its implementation requires too costly hardware. The
J. Rosseel et al.
138
function shape given in figure 3 realises a non-linear correction term in function of the gradient, using a piece-wise linear (PWL) approximation of the original equation.
Figure 3: A suitable non-linear correction function and its piecewise linear approximation The approximating function F~i,,(X ) can be written as F~i,,(X ) = ai • X + bi if boundi < X < boundi+l. Determining in which interval i X can be found can be done in log n steps, with n the number of intervals, using a binary search. The computational requirements for this approach are much smaller than in the original equation. Instead of completely unrolled dividers, we need only four adders, a multiplier and some RAM or ROM to store the a[i] and b[i]. A side effect of this approach is that the kernel of the algorithm becomes partly programmable. Indeed, changing the constants a[i] and b[i] will result in a completely different shape of the divergence function. A PE-design for the PWL approach was made with a desired operation interval equal to two. This PE-design, with some of the routing is given in figure 4. Registers required to correctly delay signals are left out, as the exact number can only be determined after local scheduling. Scheduling the operations on this P E yields very similar results to the ones for the initial hardwired design using 2 cycles, also necessitating 3 index-points to be interleaved for optimal results. To calculate this schedule, a manually formulated ILP problem was used, based on the equations and constraints given in [9]. However, to reduce the complexity of the ILP formulation, all the hardware blocks inside the dashed lines in figure 4 were considered as one large, pipelined operator with latency 7 and operation interval 1. The result is that only 9 operators are to be taken into account. As a result, the ILP solver LINDO was able to schedule the operations in only a few minutes. O
LOW-LEVEL SCHEDULING
Now the final scheduling vector can be determined. The nil'space vector U for the proposed multi-dimensional scheduling vectors and placement vector is: B
v=[o
za
y
t
o 10]
(9)
Mapping a Diffusion Algorithmfor Vision
Figure 4:2 cycle PE design for the PWL approximated diffusion equation
139
140
J. Rosseel et al.
The constraint for realising an operation interval of n then becomes: 1II3 • Ur[ = n. The vector Ha must also satisfy the "standard" timing constraint resulting from dependencies and local scheduling. The following solution was found to be optimal for the given local schedule: Ha=[
B 0
xs -1
y t n 0 ]
(10)
This solution is however not the globally optimal one. Another solution for II3, with a 0 instead of the -1 coefficient for x s, leads to a reduced internal register cost between the PEs. A different local schedule is needed for this which exhibits the same cost. For the local scheduling stage, these two best solutions were equivalent, and one of them was chosen randomly by the ILP solver. Linearisation of the multi-dimensional time can in general only be done after fixing the application parameters. However, the very regularly shaped index-space allows at this point to derive a parametrised overall scheduling vector: II = #iterations • (imageheight + 2) x I I l + (imageheight + 2) • II2 + II3
(11)
This scheduling function is thus clearly dependent on the application parameters. This is requires only a global controller to be designed for this programmability. The behaviour of the local P E-controllers do not depend directly on the application parameters. Their behaviour is controlled with control bits issued by the global controller.
7
EVALUATION OF THE ARCHITECTURE
The global architecture is a linear array of processors. The number of PE's is equal to the chosen bandwidth+2. The resulting architecture exhibits the following properties: 9 100% efficient usage for the computational PEs. 9 border PE's with control can be easily implemented with small standard components, such as FPGA's. 9 Reduced storage requirements and small bandwidth to memory as a result of the optimisation. This depends on the low-level scheduling vector choice though and for II3 in (11) there is some overhead at the borders. . Programmability both for image and iteration parameters, as well as for the shape of the diffusion equation kernel. The details of the complete architecture with special emphasis on the board realisation including the external memory communication and on the building block design for the data-path are described in [5].
Mapping a Diffusion Algorithm for Vision
8
141
CONCLUSION
This paper deals with tile design of a modular, optimised array architecture for an image diffusion application, using a space-time mapping design methodology. It is shown how manually applied transformations allow the final architecture to be modular and have a low communication bandwidth. An overal optimised result is found using the tuned design script as described in [9]. Special attention is given to the design and scheduling of the very complex PE's that are necessary for this algorithm. References
[1] D.Baltus, J.Allen, "Efficient exploration of nonuniform space-time transformations for optimal systolic array synthesis", Proceedings of the IEEE International Conference on Application. Spececific Array Processors 9P. [2] A.Darte, T.Risset, Y.Robert, "Loop nest scheduling and transformations", in Environments and Tools for Parallel Scientific Computing, J.J.Dongarra et al. (eds.), Advances in Parallel Computing 6, North Holland, Amsterdam, pp.309-332, 1993. [3] K.Ganapathy and B.Wah. "Optimal design of lower dimensional processor arrays for uniform recurrences", Proceedings of the IEEE btternationai Conference on Application Specific Array Processors 9•, pages 637-648. [4] T.Gijbels, "The architectural specification of the non-linear diffusion equation", Technical report, ESAT, Katholieke Universiteit Leuven, Belgium, 1994. [5] T.Gijbels, P.Six, L.Van Gool, F.Catthoor, H.De Man, and A.Oosterlinck, "A VLSI architecture for parallel non-linear diffusion with applications in vision", IEEE Workshop on VLSI Signal Processing, 1994. [6] D.Moldovan. "Advis: a software package for the design of systolic arrays", Proc. IEEE Int. Conf. on Computer Design, Port Chester NY, pp.158-164, 1984. [7] N.Nordstrom, "Biased anisotropic diffusion: A unified regularization and diffusion approach to edge detection", Image and Vision Computing, 8(4), 1990. [8] P.Quinton and Y.Robert (eds.), "Algorithms and parallel VLSI architectures II", Elsevier, Amsterdam, 1992. [9] J.Rosseel, "Synthesis of Application Specific Architectures for Real-Time Regular Algorithms", PhD thesis, Katholieke Universiteit Leuven, 1994. [10] J.Rosseel, F.Catthoor, and H.De Man, "Extensions to linear mapping for regular arrays with complex processing elements", Proceedings of the 1EEE International Conference on Applica. tion Specific Array Processors 90, pages 156-167. [11] W.Shang and J. B. Fortes, "Time optimal linear schedules for algorithms with uniform dependencies", Proceedings of the IEEE International Conference on Application Specific Array Processors 88, pages 393-402. [12] V.van Dongen, "Presage, a tool for the design of low-cost systolic arrays", Proceedings of the btternational Symposium on Circuits and Systems, pages 2765-2768, 1988. [13] M.van Swaaij, F.Catthoor, H.De Man, "Non-linear transformations for high-level regular array ASIC synthesis", Journal of VLSI signal processing, pp.259-268, Dec. 1992. [14] Y.Wong and J-M.Delosme. "Optimal systolic implementations of n-dimensional recurrences", Proceedings of ICCD, pages 618-621, 1985.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
A SCALABLE
DESIGN FOR DICTIONARY
143
MACHINES 1
T. DUBOUX, A. FERREIRA, M. G A S T A L D O LIP, URA 1398 du C N R S
Ecole Normale Supdrieure de Lyon 69364 Lyon Cedez 07, France { dubouz,ferreira, gastaldo} @lip.ens-lyon.fr ABSTRACT. So far, VLSI dictionary machines have not been used in real time systems because they do not scale well. Indeed, dictionary machines were designed to fit in one chip only. If the number of acquired elements is larger than that of VLSI cells, another chip has to be designed and manufactured to take a larger dictionary into account. In this paper, we propose a new design for dictionary machines that assembles blocks of standard existing dictionary machines. The performances of our machine are as good as the best performances studied in the literature, with the enormous advantage of scaling quite easily, with no degradation of its performances, by simply adding more and more standard blocks. KEYWORDS. Dictionary machine, VLSI, load balancing, parallel data structure.
1
INTRODUCTION
The dictionary is an important data structure used in applications such as sorting and searching, symbol-table and index-table implementations. It can be seen as a special purpose system capable of storing records and performing update and retrieval operations on these records. All applications with data acquisition can take advantage of such a dictionary machine, with an on-Une treatment and efficient input/output. It means that the machine has to be very simple, scalable and adapted to be integrated in systems in robotics or image recognition. Several especially designed parallel architectures have been proposed in the literature for the implementation of dictionary machines in VLSI. Some papers report on novel interconnection networks [8, 6], while others use well studied topologies, like hypercubes [3, 7]. ' ISupport'edb'Y Stratag~me project and the DRET
144
T. Duboux, A. Ferreira and M. Gastaldo
On the other hand, some dictionary machines have been implemented on existing parallel architectures. Namely, an implementation on the Maspar MP1, an SIMD architecture, has been proposed [5] and results obtained on the Volvox i860, an MIMD machine, have been presented [4]. From these experiments we were able to review several concepts, that could be used to propose a new powerful and realistic dictionary machine, that is simple and scalable. After some preliminaries on dictionary machines and existing VLSI solutions, we present our new design in Section 3 and we detail its behavior in Section 4. Then, in Section 5, properties and performances are studied. We close the paper with some concluding remarks and directions for further research. 2
2.1
PRELIMINARIES
Definition of a Dictionary Machine
The dictionary task can be loosely defined as the problem of maintaining a set of data elements each of which is composed of a key-record pair (K,R), where K is the search key and a the associated record. For simplicity, the record whose associated key is K will be denoted record K. The dictionary supports a set of instructions on its entries, like $ INSERT(K): inserts record K in the dictionary; 9 DELETE(K): deletes record K from the dictionary; 9 FIND(K): retrieves record K if currently stored, does nothing otherwise; 9 EXTRACT MIN: returns the current minimum record and delete it.
Insert and Delete can be redundant. An insertion is redundant when the key being inserted already exists in the dictionary; a deletion is redundant when the key being deleted does not exist. Performances of dictionary machines can be measured in terms of the following parameters.
Capacity: The maximum number of elements that may be stored in a dictionary machine. R e s p o n s e time: The elapsed time between initiation and completion of an instruction. P i p e l i n e interval: The minimum elapsed time between the initiation of two distinct instructions. 2.2
E x i s t i n g V L S I Solutions
As reported in the introduction, many papers deal with the implementation of a dictionary machine on special purpose architectures. Among them, the machine proposed in [8] was one of the first proposed in the literature, with the advantage of being powerful and relatively simple. In such a machine, key-record pairs are stored in a linear array of N Storage Elements (SE's). In addition, the machine has a balanced binary tree of processing elements, with the SE's appearing as leaves of the tree. Instructions enter the machine at the root of
A Scalable Designfor Dictionary Machines
145
the tree, and answers are extracted from the root. The purpose of the tree is simply to broadcast a given instruction to all the SE's and to transmit answers back to the root. When instructions are pipelined, information will be flowing up and down the tree simultaneously. A given instruction reaches all SE's at the same instant. The key-record pairs are stored in the SE's from left to right in increasing key order. When the SE's receive an INSERT(K,R) instruction, the pair (K,R) is inserted at the proper place in the array, and all key-record pairs with keys greater than K are shifted one position to the right. When a DELETE(K) instruction is received, the SE with key U simply marks its content as deleted. To achieve a good response time in the presence of redundant instructions, the authors in [8] allow holes in the structure. A hole results from the insertion of an already existing element or the deletion of an element. To prevent a large number of holes, these holes are removed by a special instruction, called COMPRESS. It can be shown (see [8]) that if one COMPRESS follows each insertion (or EXTRACT-MIN)and two COMPRESS follow each deletion, then the jth smallest key can be found at position p <_ 2 • j - 1. In particular, the smallest key is always in the first node. Because of the holes, the capacity of such a dictionary machine is N/2, where N is the number of leaves of the tree. The instructions are pipelined so that the interval is O(1) and the response time 2 is log N. The theoretical performances of this machine are very good. However, no realization has followed such a study. Actually, as a node handles a single element, building a machine with large capacity leads to a huge tree not implementable in one chip only. Therefore, the corresponding architecture would not be scalable, as adding some nodes would be impossible. Our goal in this paper is to take advantage of the existing solutions on the VLSI model to get a powerful, scalable and realistic dictionary machine by adding several new concepts derived from our experiments on SIMD and MIMD models [4, 5]. 3
A SCALABLE SOLUTION
In the following, we present the architecture of the new machine along with the ideas of the algorithms to execute the dictionary operations. In order to make our solution scalable, we have also to solve a crucial load balancing problem. 3.1
Proposed Architecture
The new architecture is composed of an Input/Output device (noted ~" in the remaining), a linear array of P elementary processors ( P 0 , ' P l , . . . , PP-1) working synchronously, and a VLSI tree like the one in [8],7", under each T) (see Figure 1). The size of a 7" is noted N (corresponding to a capacity of ~). N 2An extension was proposed in [8] that improves the response time to log n where n is tile actual number of elements ill the structure. Under the constraints of real time computing and scalability, however, such an extension is of little help.
T. Duboux, A. Ferreira and M. Gastaldo
146
The communication capabilities in our architecture are as follows. Jr can broadcast instructions to all the P using a broadcast network. In the other way, only a single P can send information across this network to ~ in one step. Furthermore, in a given time step, each P can communicate with its two neighbors, and with the root of its corresponding T. As described in [8], within these trees each inner cell can send data to its two children or receive and combine data from these children. Obviously, it can also send data to its parent. Finally, the leaves are connected in a linear array. The P ' s are also connected to a prefix tree that is able to compute total sums, partial sums, and maximum of the leaves in logarithm time (such a tree can be found in [1]). ~" is connected to the root of this tree. This prefix tree will be of particular use for load balancing, as we shall see later. Notice that the broadcast network and the prefix tree may be implemented by one specific network. Figure 1 shows all these connections on a small example, with P = 6 and N = 4. One can see that the VLSI trees are not connected to each other. This is very important because it means that such trees can be implemented on independent chips, guaranteeing the scalability of our design.
Figure 1: General design of our dictionary machine
G l o b a l Behavior
3.2
3.2.1 Partitioning the Space of the Keys. The idea to get a distributed data structure is to split S, the space of the keys. We sort and make a partition of this space. Each Pi will handle a working domain Di, with U Di = S and ~ D i = q), such that if ki E Di and kj E Dj then ki < ki if and only if i < j. One can notice that assuming a balanced structure, with the same amount of data in each T, means that the distribution law of the keys is known, since it would be easy to split S in sections that have the same probability concerning insertions. But a bad distribution of the keys may cause a memory overflow in some T. 3.2.2
Broadcasting a Request and Getting the Answer.
tion in our machine can be sketched as follows. 1. The instruction is broadcast from ~" to all P's;
The treatment of a given instruc-
A Scalable Designfor Dictionary Machines
147
2. Each P checks if it is concerned by the instruction; the concerned one, say Pj, sends the instruction to the root of Yj; 3. The instruction is executed inside Tj as described in section 2.2; 4. The root of Ti sends the answer to Pj; 5. Y'j sends the answer to .%'. Note that the instructions are pipelined, i.e., we do not wait for an answer before sending another instruction. In a tree 7", because of redundant instructions, we have seen in section 2.2 that two executions of COMPRESS may be necessary between each instruction. Also, communications in 7- are very simple, which is not the case in the broadcast of the instructions from ~" to every T~. According to this, we assume hereafter that the frequency of the cells in the trees 7- is at least three times that of the P's. In other words, two COMPI~ESS are inserted between each query sent from P to its corresponding 7-.
3.2.3 Memory Use. Considering the memory space, a P only needs a constant size memory to handle two record-instruction pairs and a very smaJ1 number of integers. It means that each P is very simple, which allows large values for P. In the 7-'s, each node may have to store two record-instruction pairs. The hardware complexity of a P is not greater than the complexity of any cell in 7". In a VLSI implementation, P and the root of 7" could be merged into one single component. 3.3
Balancing the Structure
In order to obtain a powerful machine, one has to guarantee that the whole structure is balanced. This problem has well known solutions in sequential algorithms, performing tree rotations for example. Unfortunately, load balancing is much more difficult to implement in parallel (see for example [2, 5]), and even worse in real time systems. As seen before, the key distribution among the P's is not known beforehand. Moreover, the load of a 7- directly depends on the size of its data structure. Then, to achieve good performances, we have to propose a dynamic load balancing strategy. The first idea to balance the structure is to use a strategy close to the one proposed in [8], but it is not efficient here, since it can be shown that ~(log N) time would be necessary between two instructions. As suggested in [5] for a SIMD machine, we balance the structure after a given number of instructions. The balancing strategy is based on global knowledge, but with only local data exchanges, as follows. Using the prefix tree, each/)i calculates the total number of elements handled by the structure (TS), and the number of elements in the structure, at its left (PSi). nj is the number of elements handled by ~ : j
TS = ~ n~ j=o
,j
PSi = ~ nl j=o
148
7". Duboux, A. Ferreira and M. Gastaldo
These calculations can be done in O(log P) time (see [1]). Thus, Pi determines where data should be sent by computing the left and right imbalances:
Actually, the behavior of each Pi takes into account the states of all the other/)'s, but communications are only performed between neighbor P's. To compute these sums, it is necessary to know in a constant time the local sizes (hi's). Because of redundant instructions, this counting cannot be exactly done when a 7) receives an instruction from .T. Hence, a P cannot know if an INSERT Will increment its number of elements, nor if a DELETE will be executed. This problem can be solved considering the answers, and adjusting ni when necessary: a 7) increases its number of handled elements when a DELETE has not been executed, and decreases it for a redundant INSERT. 4
D E T A I L S OF O P E R A T I O N
As shown in the GLOBAL ALGORITHM, two phases are alternativelyexecuted in our machine: the instruction processing phase and the load balancing phase. The firstalgorithm, corresponding to the processing phase, is executed I times (to be precised later). Then, steps (2), (3) and (4) are executed, corresponding to the balancing phase. After that, the machine is ready to start again a processing phase. This can be repeated. GLOBAL ALGORITHM: Repeat indefinitely (1) Repeat I times INSTRUCTION PROCESSING ALGORITHM
(2) BALANCING ALGORITH., PART 1 (3) BALANCING ALGORITHM, PART 2
(4) BALANCING ALGORITHN, PART 3
4.1
Processing Phase
Let us describe now the INSTRUCTION PROCESSING ALGORITHM. INSTRUCTION PROCESSING ALGORITHM: Do in parallel (a) ~ broadcasts R(t + I) to every 73 (b) Each 73 compares R(t) .ith its .orking domain: If 73j is concerned by R(t) Ij ~- R(t) Update nj Else
Ij ~" Hop S.ap (Ij,Aj(t)) between 7)] and
(c) If Aj(t- 1) ~ Void Update, i f necessary, nj 73j sends t j ( t - 1 ) to
A Scalable Designfor Dictionary Machines
149
Case A
DL <0
n
>o
c
;-o
>o
.
_
.
.
D~, >0 .
.
Ways ~P~--~ .
.
.
.
pj
,
Figure 3' Left and right ways of balancing Figure 2: Swap(I~,A~(t))
Figure 2 illustrates the instruction pipeline during the INSTRUCTION PROCESSING ALGORITHM: each P is able to simultaneously (in one step) receive a request from t', send an answer to jr, send an instruction to its corresponding 7" and receive an answer from it. However, as shown by the parameter t, data in these communications are different. Furthermore, at step t, A(t) is the answer to the instruction I(t - 2log N), as instructions are also pipelined into 7" (see section 2.2). In step (b), only one P is concerned by the request R(t), because of the partition defined by the working domains. Suap is done by every P, because an answer should arrive from a previously sent instruction, but only one P can receive a no void answer during this step (the P that sent the instruction 2log N step before). So, there is no conflict when this 79 sends the answer to .T (in (r
4.2
Balancing P h a s e
The BALANCING ALGORITHM can be split in three parts. During PART l, the tota.l sum and partial sums are globally computed, and then DL and D~ are calculated for each Pj. According to these values, each P is in one of the four situations described in Figure 3. After that, the balancing instructions are initiated but corresponding elements are not yet ready to be sent. Indeed, answers from the last instructions are extracted from Tj. Due to the response time of 7-, the first element to be sent for balancing is extracted after 2 log N step. PART 2 corresponds to the first message sent to the right and to the left. Now, at PART 3, each P sends or receives data only corresponding to the balancing. During this part, messages are scheduled as shown in Figure 4, according to the ways of balancing defined in Figure 3.
7". Duboux, A. Ferreira and M. Gastaldo
150
BALANCING ALGORITHM, PART I: (1) (2) (3) (4)
Each ~9j sends nj to the prefix tree Each Pj waits and receives PSi and TS Each Pj computes DL and Vf Each Pj sends IDa] to the prefix tree so that M = max/]D~] is computed (5) Do ]ogN times the next two steps (6) Do in p a r a l l e l (a) If Df > 0
(7) Do in parallel (a) I f D f > O Ij ~-- Eztract-Haz
Ij ~- Emtract-Mi.
Decrease D~ Else lj ~- Hop Swap (Ij, Aj(t)) (b) I f Aj(t- 1) # Void Update nj
?# sends ,#(t- I) to 5
Decrease Df Else Ij ~- Hop
Swap (Ij, Aj(t)) (b) I~ A # ( t - 1 ) # Vo~ Update nj 7~j sends Aj(t-1) to jr
BALANCING ALGORITHM, PART 2: (1) Do in p a r a l l e l (a) If D~ > 0 Ij 4-- Effitr~ct-Min
(b)
Decrease D~ Else Ij ~- fop S,ap (Ij, Aj(t)) If Aj(t -- I) = .am-Arts Decrease nj Pj s ends Aj(t-- I) in R~+ I
(2) Do in parallel (a) If Df > 0 Ij ,-- Eztract-Maz
Decrease Df Else lj ~- bop Swap (Ij ,Aj(t)) (b) If Aj(t- I) = Hin-Ans Decrease nj Pj sends Aj (t-- 1) in RR_I
BALANCING ALGORITHM, PART 3: (i) Do M times the next two steps (2) Do in parallel (a) If D ~ - 0 Ij ~- Hop
If D ~ > O Ij ~- E~tract-Min
Decrease D~ If D ~ < O Ij ~- Insert (R~) Increase D~ S.ap (Ij ,Aj(t)) (b) If Aj(t- 1)= Maz-An.s Decrease nj
(3) Do in parallel (a) If D ~ = O Nop
If D ~ > O Ij ~- Eztract-Maz
Decrease DR If D ~ < O Ij ~- l.sert(Rf) Increase DR S.ap (Ij,Aj(t))
(b) I~ A~(t- I)= m.-A.s Decrease nj 7)j s ends Aj(t- 1) in. Rf_ I (4) Update working domains
A Scalable Design for Dictionary Machines
151
Figure 4: Message scheduling (during PART 3) corresponding to the ways of balancing. 5
MACHINE PROPERTIES
AND PERFORMANCES
The first point to show is that the machine works, i.e. it is able to store the data structure and to maintain it, using the balancing technique. However, proving that the machine has a correct behavior leads to a very in-depth study of the balancing strategy, with the timing of each stage. In the remaining, we introduce several results allowing to justify the correctness of the behavior of our machine. T h e o r e m 1 The execution of Z instructions can increase the imbalance by at most I . P r o o f The proof is based on the fact that the execution of a single instruction increases the imbalance by at most one. As seen previously, the imbalance is defined by D~ = ,rTSX~p-- PS~] . Let us consider D~ before a given instruction, Instr, and after the execution of this instruction. Suppose that Instr is a DELETE. Clearly, T S decreases by one, and PSi stays constant or decreases by one (depending on j). Hence, D~ either decreases by one, or remains the same. The case of the instruction INSERT is analogous, an insertion increasing D~ by 0 or 1. Therefore, I instructions can modify D~ by at most I , for all j.
!-i
In order to study the behavior of BALANCING ALGORITHM, we first need to introduce the following definition. Definition 1 The data structure is said to be perfectly balanced if and only if D L = O, for all j, and Din = O, for all i. T h e o r e m 2 Assuming that Vi, ni > 2 + log N, a balancing phase (parts 1,2,3)perfectly balances the structure.
152
T. Duboux, A. Ferrelra and M. Gastaldo
P r o o f According to the balancing algorithm in Section 4.2, each P decreases ID.LI and IDa[ by one at each iteration until they reach 0. Furthermore, the loop is executed M times, where M = maxi [DiLl was computed before the actual balancing started. Then, if no problem arises in the loop, the structure will be balanced. The point to be considered concerns the communications. We have to verify that they can be executed, i.e., the data to be sent exist at the correct time instance in the corresponding T. Let us consider the four possible situations for DJ~ and D L in Pi, as illustrated by figure 3. Clearly, in situation D, P/ only receives elements and in situation C, Pj handles enough elements to execute all the sendings. Problems could arise in situations A and B, which are symmetric. If Pj handles at least 2 + log N elements, it can execute the parts 1 and 2 of BALANCING ALGORITHM without problem as at most 1 + log N extractions may be necessary. In the beginning of part 3, each extraction is followed by an insertion, meaning that there are enough data to allow all the instructions to be performed. Therefore, the M iterations are executed without problem, leading to a perfectly balanced structure. El In the remaining, we assume that we start our application in a balanced state, with at least 2 + log N elements in each T. Furthermore, the BALANCING ALGORITHM is applied every 2" instructions, as described in GLOBAL ALGOPdTHM. C o r o l l a r y 1 Under the hypotheses above, the structure is balanced after each execution of BALANCING ALGORITHM. P r o o f Direct from theorems 1 and 2.
El
T h e o r e m 3 The time of a balancing phase is Ts = 2 x (log P + log N + M) + 4. It is upper bounded by
Ts=,~ = 2 • (log P + log N + I ) + 4. P r o o f Let us analyze BALANCING ALGORITHM. In PART 1, steps 1 and 2 take 2 • log P, corresponding to partial and total sum calculations in the prefix tree (see section 3.3). Steps 3 and 4 consist in a calculation and a communication in parallel, executable in a single time unit. The remaining of PAnT 1 is executed in 2 • log N times, as steps 6 and 7 are executed in one time unit each. In the same way, PART 2 is executed in two time units. PART 3 is a loop executed M times, where one loop execution requires two time units. Step 4 is executed in one time unit. According to theorem 2 and corollary 1, M = maxi IDOl is bounded by Z, and the result follows. El R e m a r k : Instruction processing can restart immediately after BALANCING ALGORITHM, as ~" handles M and can thus anticipate the end of the balancing phase, broadcasting new instructions so that the P's do not stop processing. T h e o r e m 4 The capacity of our machine is K = P x ( 2v _ Z). 2 P r o o f Each T can handle ~ elements (see section 3.1), but as we balance the structure after the execution of 2" instructions, Z leaves have to be free on each T. I-I
A Scalable Design for Dictionary Machines
153
The goal of our application is to treat a flow of instructions coming from the outside world, in real time. We call Intent the constant pipeline interval between two arriving instructions. Because of our balancing strategy, we clearly need a FIFO queue, storing the requests during the balancing phases. If we note Intin, the interval between the broadcast of two instructions from the queue to the P's, it is clear that Inti,t = To during a balancing phase and Inti., = 1 otherwise. Furthermore, if Intl., = 1 then IntoRs = 1/3, where Intons is the pipeline interval in 7" (because each instruction is followed by two COMPRZSS). T h e o r e m 5 Let SQ be the size of the FIFO queue. Assuming that To is bounded and that the queue is empty in the beginning, S o is bounded by SQ"`.~ = C2 • (log P + log N + 2). In the end of the first balancing phase, we have S o <_ 5'Q"`.. = TB...~I,,,.~," If we want to empty the queue before the next balancing phase, we have to broadcast all the arriving z - I . Therefore: instructions plus the ones in the queue. So, SQ"`.. + tat..----'7 Proof
TB.~ + Z = Intext I and thus
Intent=
TB"`.~ + 1 = 2
Z
I •176176
Clearly, if I = C1 • (log P + log N + 2), where C1 is any small constant, we have
Int~t = 3 +
2
and
SQ...~ = C2 • (log P + log N + 2). One can note that the constants are small. Indeed, C2 can be explicited easily:
C,+I 2 C2=2xC, x 3xCI+2 < ~x(C,+I).
vl
6 Let P, N and I n t ~ t be given. Then, our dictionary machine is able to maintain a dictionary with response time TR < Ca x (log P + log N + 2), where Ca is a small constant.
Theorem
We can determine Z using the property that S O is bounded. Then the structure is balanced and handles at most P x ( ~ - - Z) elements. Tile response time, TR is given by twice the traversal time (down and up the trees), plus the idle time, as follows. Proof
Tn < 2 x (log P + log N) + max(So"`~ TB"`~ ). Since SQ..~ -- TB.... l,,.~, ; if Inter, > 1, then To.,.. = max(SQ.,..,TB.,~ Tn < 2 • Tn < 2 x (C~ + 2) x (IogP + Iog N + 2).
Therefore,
El
It is important to notice that for a capacity K = P • (~- - C, • (log P + log N + 2)), the response time of the machine is Tn < Ca x (log P + log N + 2), yielding that, for large enough values of P and N, Tn corresponds to log K. Another point to note is that all the proofs above are based on the hypothesis that ni >_ 2 + log N, for all i. This allows us to ensure that the balancing phase can be executed.
T. Duboux, A. Ferreira and M. Gastaldo
154
This restriction is not severe as it just means that the data structure is not completely empty. However, even if this condition is not verified, the balancing strategy can be applied. It has been shown that the resulting structure is not balanced immediately, but becomes more and more balanced in time. A very detailed study of the balancing algorithm when the structure is empty, in the case of SIMD implementations, with in-depth analysis, can be found ill [5]. 6
CONCLUSION
The main features of the dictionary machine proposed in this paper are its scalability and on-line characteristics. Indeed, such an architecture can be used as an embedded system coupled to a high speed I/O device for real time information processing. It has the same performance characteristics as the ones of [8], providing now a feasible and scalable archia response time corresponding tecture. In other words, we have a capacity close to T N•, to the logarithm of the capacity, and a constant pipeline interval. A further question concerns the interval between two balancing phases. We proposed to perform a balancing every 2" steps, what is not necessarily optimal. To improve this, one could study the impact of an adaptive frequency for balancing, that would depend on the amount of data exchanged during the previous balancing phase. References [1] S. G. Akl. Design and Analysis of Parallel Algorithms. Prentice-Hall International, Inc., 1989. [2] F. Dehne and M. Gastaldo. A note on the load balancing problem for coarse grained hypercube dictionary machines. Parallel Computing, 16:75-79, 1990. [3] F. Dehne and N. Santoro. An optimal VLSI dictionary machine for hypercube architectures. In M. Cosnard, editor, Parallel and Distributed Algorithms, pages 137-144. North Holland, 1989. [4] T. Duboux, A. Ferreira, and M. Gastaldo. MIMD dictionary machines : from theory to practice. In Bouge et al., editor, Parallel Processing: CONPAR 92 - VAPP V, number 634 in LNCS, pages 545-550. Springer-Verlag, 1992. [5] M. Gastaldo. Dictionary Machine on SIMD Architectures. Technical Report RR 93-19, LIP ENS-Lyon, July 1993. Submitted to Publication. [6] H.F. Li and D.K. Probst. Optimal VLSI dictionary machines without compress instructions. IEEE Trans. on Computers, 39:332-340, 1990. [7] A.R. Omondi and J. D. Brock. Implementing a dictionary on hypercube machines. In Int. Conf. on Parallel Processing, pages 707-709, 1987. [8] T. Ottmann, A. Rosenberg, and L. Stockmeyer. A dictionary machine (for VLSI). IEEE Trans. on Computers, c-31(9):892-897, sep 1982.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
155
S Y S T O L I C I M P L E M E N T A T I O N OF S M I T H A N D W A T E R M A N A L G O R I T H M ON A S I M D C O P R O C E S S O R
D. ARCHAMBAUD, I. SARAIVA SILVA, J. PENNI~
MASI Laboratory, University Paris 6 4 place Jussieu, 75252 Paris Cedex 05, France eraail : { archambaud, dasilva} @masi. ibp.fr ABSTRACT. We present a parallel algorithm that performs genetic sequence comparison in a systolic architecture. The Smith and Waterman algorithm provides an alignment between two sequences, and calculates accurate similarities between all possible subsequences - however it requires huge amounts of calculation. Software tools used so far to compare genetic sequences are approximations of the Smith and Waterman algorithm, running reasonably fast but with reduced accuracy. We aim at implementing the exact algorithm in our architecture, to provide a fast and accurate sequence comparison tool. We describe the implementation of the main dynamic programming formula (easily paraUelizable since the data dependence is topologically limited) and also the way to manage the gap penalties and the substitution matrix weights. The coprocessor board we are developing is a paginated set-associative memory with one-dimensional systolic capabilities. It is intended to be connected to a host machine via a standard PC bus. The board is made of multiple VLSI circuits, each one containing some processing elements. The I/O throughput between the host and the systolic row does not limit the board speed with regards to this application. KEYWORDS. Similarity, Smith and Waterman, Proteins, DNA sequences, VLSI implementation, SIMD coprocessor, Systolic architecture.
1
INTRODUCTION
Genome alignment is an important task in molecular biology. It consists in comparing every substring of two strings associated to proteins or DNA sequences. A basic algorithm has been described by Smith and Waterman [18] and went under several variations [1],[8]. The genome database is already huge and is still increasing. A prohibitive computing time has been reached on standard workstations, up to 400 days of CPU time is necessary for the
156
D. Archambaud, L Saraiva Silva and J. Pennd
exhaustive matching of an entire protein sequence database [7]. The time complexity of such an application is O(M~N), M and N being the sequence lengths. Gotoh reduced this complexity to O(MN) under some limitations [S]. A systolic implementation runs in time O(M+N). There are basically two hardware approaches to increase the processing power for a given problem. One can either design/use a dedicated computer with its own operating system and application softwares [5],[4],[10], or add an accelerator to an already existing machine [14]. The second solution has many advantages: it is a fast solution since the device to design is reduced to a specialized coprocessor with an interface to the host computer, it is a cheap solution since only one board is necessary, and it is a standard solution as the accelerator can adapt to different architectures. The SIMD organization (single instruction stream, multiple data stream) is a low-cost way to implement a massively parallel architecture. It only requires a unique sequencer for all the execution units. Our accelerator board is composed of a control circuit (containing the sequencer and the interface with the host computer) and several cascadable execution circuits containing the processing elements [2]. We first present the Smith and Waterman algorithm and how it can be adapted to a linear systolic architecture. We then briefly describe our own system - - the Rapid-2 coprocessor - - and show the way we implement the algorithm. As a conclusion, some performance evaluations are given. We wrote a simulation program which exactly simulates what is executed in the coprocessor at each clock cycle, and measures the time elapsed in each data transfer between the host and the board. Such a program helped us to make architecture choices and to write optimized and valid implementations of the algorithm. 2
SMITH AND WATERMAN ALGORITHM
The use of similarity measurements to compare two proteins or DNA sequences was first proposed by Needleman and Wunsch [15]. Mathematically the problem was presented as a two-dimensional array, where the value of all elements Hi,j is a function of the comparison between the ith and jth amino acids in both sequences. Pathways of non-zero values through the array correspond to the possible alignments of the two proteins. The basic problem of the comparison of biological sequences is searching the alignment with the highest similarity. An alignment is a subset of the cartesian product of two sequences. Similarity measurements are based on two weight functions. The first, s(ai, bj), measures the weight of a substitution between ai and bj. The other function, wk _ 0 is the weight of a gap (insertions or deletions) of length k. Gaps take place when one or more amino acids of one of the sequences are not present on the subset of the cartesian product A (a, b). The similarity of an alignment between two sequences "a" and "b" is the sum of the weights of all the substitutions minus the sum of the weights of all the gaps. Smith and Waterman [18] presented a new algorithm to compare biological sequences. Their approach gives a way to find the pair of segments of two sequences with the maximum similarity. A segment is defined as being a subsequence of a biological sequence; for instance, the subsequence aiai+l...aj is one segment of length (j - i + 1) of a = ala2...an, where
Systolic Implementation of Smith and Waterman Algorithm
157
1 _< i _<j _< n . Let Hi,j be the maximum similarity of two segments that end in ai and bj; or zero when the calculated value for the similarity is negative. In its most general form, the similarity measurement in the Smith and Waterman algorithm is defined as: Hij = max {Hi-l,j-1 + s(ai,bj), maxl
The pathways of similarity progress diagonally when the elements of segments match (Hi-l,j-1 + s(ai, bj)) and horizontally or vertically when a gap is encountered respectively in "a" or "b". In a sequential implementation of this algorithm, the pair of segments with maximum similarity is found by first locating the maximum element of H. The other elements of the pathways are determined with a traceback procedure ending with an element of H equal to zero. The next best similar segments are found with the second largest element of H not associated with the first pair of segments. Figure 1 presents a very simple matrix and the deduced alignment. The matrix elements were calculated with parameters s(a,b) = 3 i f a = b, - 1 else; and wk = 3 + k . The algorithm can be executed in M2N steps (M and N being tile sizes of both sequences). Gotoh showed that when the gap penalty wk is linear, the complexity can be reduced to M . N steps [8]. Such a dynamic programming algorithm has been applied in other contexts such as text processing [20] or speech recognition [17]. The acoustic similarity calculation provides a time alignment so that the phoneme duration can be different between the reference and the word to be recognized.
M
D
S
F
D
R
Y
G
M
D
D
0
0
0
0
0
3
0
A
0
0
0
0
0
0
0
0
0
0
0
2
0
0
0
0
0
0
0
3.
0
0
0
0
1
0
0
0
0
0
0
0
2
0
0
0
0
0
0
0
0
0
0
0
3,
1
0
0
0
0
0
0
0
0
3
0
0
"6,
3
2
1
0
0
3
3
Corresponding Alignment"
, R K
0
0
2
0
3
"SI
2
1
0
0
0
2
0
0
0
1
2
6,
4
2
1
0
0
0
o
o
o
o
1
3
"9, ~
s
4
3
2
o
o
o
o
o
a
6i~
9
e
7
~
0
0
0
0
0
3
5
9 11
8
7
6
0
0
0
0
0
0
4i
8
8 10
7
6
~
E F
D K R
Y
R
K
- -
The alignment score is 12. It begins at letter S and ends at letter A Dashes symbolize the gaps
Figure 1" Principle of alignment by similarity calculation
158
3
D. Archambaud, 1. Saraiva Silva and J. Pennd SYSTOLIC APPROACH
Systolic systems are a relatively simple and inexpensive approach to achieve high computation throughput with only modest I/O bandwidth [11]. A systolic system consists in an array of cells performing some simple operations. Data flow from the host in a pipelined fashion, passing through the cells where the computation requirements are executed. A high computation throughput is achieved since cells in the path of the data-flow can read data and perform computation simultaneously. Using a linear Systolic array, the algorithm is executed in M+N steps instead of M.N. The dynamic time wrapping algorithm for speech recognition has been implemented in 2-D systolic systems [6] and Wood showed that a 1-D systolic scheme was more efficient since it reduces computation and data transfers [19]. As described previously, the Smith and Waterman algorithm uses a two-dimensional array of similarity values to search the maximum H. A parallel implementation of the algorithm is possible, due to the particular value dependences in the matrix. Each value Hi,j results from values Hx,y in the matrix such as (x < i), (y < j). Moreover a recursive method can be found, in which Hij can be deduced from Hi-l,j-1; Hi-l,j and Hij-1 only. Thus the matrix could be mapped into a 2D-systolic array. Such architectures are under study [13] but are drastically limited by the integration possibilities - - small number of elements in a single circuit, and unability to interconnect multiple circuits since this would involve too large interfaces. We can then remark that at a given time t, the activity is limited to the diagonal such as t = i + j - 1. At time t, values Hi,j in the diagonal i + j = t + 1 are computed according to the values in diagonals i + j = t and i + j = t - 1. Hence, all values in the diagonal can be calculated simultaneously by a systolic linear scheme. Figure 2 illustrates this reasoning, the arrows ~symbolize the data dependence: Hij is deduced from its three neighbors by a function which is the adaptation of formula (1) in a recursive fashion.
al
al a2
al a2 a3
al a2 a3 a4
al a2 a3 a4 a5
bib2
0
bl b2 b3
0
bl b2 b3 b4
0
/
bl b2 b3 b4 b5- ~
0
0
0
Figure 2: Diagonal at time t=4 H4,1=a1; H1,4=bt H3,2 depends on H3,1; H~,2; H2,1 H2,3 depends on H2,2; H1,3; H1,2
0
Systolic Implementation of Smith and Waterman Algorithm
159
In a linear systolic architecture, the systolic row maps the diagonal. In each cell Ci (a cell is a processing element) we store value ai. Values bj are successively shifted from a cell to the next. We adapted the 2-D algorithm in a 1-D system by keeping one dimension (the one noted i) and by transforming the other one into time, with formula t=i+j.1. Thus Hi,j which uses 2-D space coordinates, becomes Hi,t-i+1 with 1-D space plus time coordinates. In the 2-D algorithm, the data dependence is as follows : Hi,j only depends on Hi-l,j; Hi-l,j; and Hi-l,j-1. We can deduce the relative positions in space and time into the row by a mere variable transformation:
Hi-l,j = Hi',t'-i'+l if and only if i'=i-1 and t'=t-I this means value Hi-15 is accessible in the systolic row at time t-1 in the left neighboring cell of the cell containing Hi,j. tIi,j-1 = tIi,,t,-i,+l if and only if i'=i and t'=t-1 this means value tti,j-1 is accessible in the systolic row at time t-I in the very cell that contains Hi,j.
Hi-l,j-1 = Hi',t'-i'+l if and only if i'=i-1 and t'=t-2 this means value Hi-l,j-1 is accessible in the systolic row at time t-2 in the left neighbor of the cell that contains Hi,j. As a conclusion, the space and time dependence in the systolic row is reduced to: two calculation steps in time and one cell neighborhood in spac e . Time dependence is solved by storing results in registers. Space dependence is solved by a 1-D mesh. Figure 3 shows the transformation from a 2-D network (spatial dependence only) to a 1-D network with space and time dependence. The arrows on the 2-D matrix show the data dependence for calculating one diagonal of elements (symbolized by circles). In the 1-D scheme, all these elements are being calculated at the same time (time t). The arrows on the right part of the figure show the new data dependence on a 1-D systolic row: ea.ch cell needs to be able to read three values from its left neighbor: the equivalent of value Hi-l,j, the equivalent of value Hi-l,j-1 and the bj character. This access to neighboring values is performed by globally shifting the relevant values of M1 the cells to the right. Multiple meshes should be implemented to perform the transfers simultaneously; however, serializing the transfers through a unique mesh considerably reduces the interface and makes cascadability possible. 4
4 ARCHITECTURE
The Rapid architecture is being developed in the MASI laboratory. It aims at a massively parallel, SIMD, paginated, set-associative hardware accelerator mounted on a PC board. The coprocessor contains a variable number of execution units called cells. Each cell comprises registers and an ALU (arithmetic and logic unit). A page includes one register of each cell; processing is usually performed upon a page, i.e. in all the cells in parallel. The ALU can perform basic operations (add, sub, and, or, xor, not, shift) in both 32-bit
Figure 3: Data dependence from the 2D to the 1D computation
and 8-bit mode, the second mode being executed with a parallelism multiplied by a factor of four, as the 32-bit ALU is then reconfigured into four 8-bit parallel ALUs. All the cells receive the same micro-instruction from a unique sequencer and are connected to a shared data bus. We added to this common design a mesh that links each cell to its neighbors, so that it can be used as a systolic architecture. It is a bi-directional ring, so that each word can be sent either to the left or the right neighbor, in parallel in all the cells. Considering a page, this corresponds to globally shifting a whole page to the right or to the left. This shift operation can be performed in both 32-bit and 8-bit modes. The mesh is designed in such a way that a faulty cell can be by-passed. This ensures a relative fault tolerance (a faulty cell is deactivated, and its emitted value is ignored). Figure 4 shows the global board architecture. The physical hierarchy will be as follows: one circuit (execution circuit) will instantiate some cells; another one (controller circuit) will contain the board interface and the sequencer with its micro-RAM. The board will interconnect several execution circuits and one controller circuit. The number of cells in an execution circuit is only limited by the fabrication technology; the number of execution circuits on the board is only limited by its size. Using a SIMD architecture, the speed-up may be limited by the I/O bottleneck [11]. However, our architecture is not limited by the I/O throughput as long as the data to be processed are stored in the cells, so that data access can be executed in parallel in all the elements [3]. Therefore, the memory must be wide enough to avoid too frequent I/O communication between the coprocessor memory and the main memory of the host computer,
and the process must be run over local data, i.e. data that can be contained in the cells. In order to avoid too frequent communications between the board and the host computer, the following features have been chosen:
- The cell memory is organized into 64 words of 38 bits, plus two auxiliary registers. This decision is based upon several trade-offs. The memory vs. logic space ratio must favor the memory (the purpose is to provide "intelligent memory" rather than "complex processors in parallel"). The memory must be wide enough to run the Smith and Waterman algorithm; for instance, the best way to calculate the similarities is to store the substitution weight matrix values s(ai, bj) in every cell.
- The control-store RAM is organized into 2K words of 64 bits. The micro-program is loaded into the board control circuit during initialization. This writable control store makes the coprocessor a field-programmable device that can be adapted to a given application. Thus, the host does not have to send instructions to the board during computation steps. This particularly reduces communications through the interface.
- A complex token unit makes specific processing over particular data possible without any cell-addressing mechanism. The token is a flag that indicates in each cell whether the instruction broadcast from the sequencer has to be executed or not. The token can be set according to complex assertions (comparisons, combinations of comparisons, ...). This can be used to perform conditional operations and to emit one particular register to the data bus. In a standard systolic row, the only way to read out a value in the row is to wait for it to be output at the end of the row. In our system, a particular value can be read out immediately by emitting it to the data bus. The token is used to select the emitting cell. When several tokens are set, an arbitration tree provides a priority so that only one cell emits at a time.
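The behaviour of the token unit can be pictured with the C sketch below. The assertion used to set the tokens and the priority order of the arbitration tree are assumptions for illustration only; the loop over cells stands in for what the hardware does in parallel.

#include <stdio.h>

#define NCELLS 8

/* Token-based read-out (sketch): every cell evaluates an assertion on its own
 * data; the arbitration tree then lets exactly one flagged cell emit its
 * register to the data bus.  Priority order is an assumption (lowest index).  */
int main(void)
{
    int H[NCELLS]   = { 3, 12, 7, 12, 0, 5, 9, 1 };
    int token[NCELLS];
    int threshold = 10, emitted = -1;

    for (int i = 0; i < NCELLS; i++)          /* set tokens in parallel        */
        token[i] = (H[i] >= threshold);

    for (int i = 0; i < NCELLS; i++)          /* arbitration: one emitter only */
        if (token[i]) { emitted = i; break; }

    if (emitted >= 0)
        printf("cell %d emits value %d to the data bus\n", emitted, H[emitted]);
    else
        printf("no token set, nothing emitted\n");
    return 0;
}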
5 ALGORITHM IMPLEMENTATION
In section 4, we show that a systolic scheme would need to shift at least three values at each step: Hi-1,j-1, Hi-1,j, and bj. In our architecture we use the mesh to do so. As only one mesh exists, values are shifted sequentially. We associated pages with particular values and behaviors: page A contains values ai, with no modification during the whole process; page B contains values bj, filled in a systolic fashion from the host; page D contains values Hi-1,j-1 of diagonal v (cf. figure 3); pages FF and EE contain values Hi,j-1 and Hi-1,j of the vertical and horizontal accesses (w and u in figure 3); twenty pages S are used to store a relevant part of the s(ai, bj) array: in a cell that contains an we only store the subarray s(an, b1), s(an, b2), ..., s(an, b20), since there are typically 20 kinds of amino acids in a protein. Formula (1) is applied in parallel in all the cells according to the time and space dependences described in figure 3. Assignments noted u, v and w are performed as follows:
Figure 4: Board structure
Figure 5: Transfer diagram between the pages for a systolic step
- Transfer u is performed by shifting page EE. Actually, a maximum is calculated for the EE evaluation (max1 in figure 5) to perform the sub-maximum max1
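One systolic step can be pictured with the following C sketch, in which the pages listed above are modelled as per-cell arrays. The gap penalty G, the toy substitution weight s(), the row length and the exact update order are illustrative assumptions (the description of the transfers u, v and w is only partly reproduced above); this is a sketch of the principle, not the Rapid-2 micro-program.

#include <stdio.h>
#include <string.h>

#define NCELLS 8                      /* cells in the row (assumption)      */
#define G      4                      /* hypothetical linear gap penalty    */

static int max4(int w, int x, int y, int z)
{ int m = w; if (x > m) m = x; if (y > m) m = y; if (z > m) m = z; return m; }

static int s(char a, char b) { return (a && b && a == b) ? 5 : -3; }  /* toy s(ai,bj) */

int main(void)
{
    const char *a = "AGCTGAAT";       /* stored in page A, one char per cell */
    const char *b = "ACCTGAT";        /* streamed into page B from the host  */
    int A[NCELLS], B[NCELLS] = {0}, D[NCELLS] = {0},
        EE[NCELLS] = {0}, FF[NCELLS] = {0}, H[NCELLS] = {0};
    int best = 0, steps = NCELLS + (int)strlen(b);

    for (int i = 0; i < NCELLS; i++) A[i] = a[i];

    for (int t = 0; t < steps; t++) {
        /* page B: shift one character to the right (systolic input)        */
        for (int i = NCELLS - 1; i > 0; i--) B[i] = B[i - 1];
        B[0] = (t < (int)strlen(b)) ? b[t] : 0;

        /* formula (1), simplified: applied in parallel in all the cells    */
        for (int i = 0; i < NCELLS; i++) {
            H[i] = max4(0, D[i] + s((char)A[i], (char)B[i]),
                           EE[i] - G, FF[i] - G);
            if (H[i] > best) best = H[i];
        }
        /* serialized transfers over the single mesh, preparing step t+1    */
        for (int i = 0; i < NCELLS; i++) D[i] = EE[i];      /* new diagonal */
        for (int i = NCELLS - 1; i > 0; i--) EE[i] = H[i - 1];
        EE[0] = 0;
        for (int i = 0; i < NCELLS; i++) FF[i] = H[i];
    }
    printf("maximum similarity = %d\n", best);
    return 0;
}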
6 PROGRAMMING AND PERFORMANCE EVALUATION
The implementation is composed of two programs: the driving program executed by the host, and the micro-program executed by the Rapid-2 coprocessing board. Both programs run simultaneously, synchronized by the data exchange. This section describes the micro-program stages and presents evaluations of the execution time. The programs have been validated and evaluated by using a very accurate simulator written in C. It takes into account the Rapid-2 architecture and physical delays, the data transfer delays through a common PC machine, and the i486 assembler instructions used to manage the bus in the driving program. Our implementation yields results exactly equal to those obtained by using ssearch. Ssearch is an exact implementation of the Smith and Waterman algorithm which is part of the FASTA 1.7 package [16]; it is used by geneticists to get accurate measurements. Validation and simulations have been made using real proteins from the tRNA synthetases [12]. The micro-program is composed of several stages:
- Loading the micro-program: The host sends the micro-program to the board sequencer (cf. figure 4). This is not a computing stage but it must be taken into account when comparing our implementation with software tools.
- Initializing the board memory: The host sends sequence a = a0a1...ai, which is stored in page A. The substitution matrix s(ai, bj) is also loaded and stored within 20 pages.
- Systolic steps: This stage is divided into two parts. During the first part, sequence b = b0b1...bj is systolically introduced into page B while computing, one character
per step. The second part starts when sequence b has been entirely entered; no communication from the host is necessary any more. The step calculation is detailed in section 5.
- Produce result: The value of the maximum similarity is read out by the host, together with its position in the matrix (cell index and diagonal time stamp for the 1-D systolic row). This maximum is obtained by running a dichotomic comparison between all the sub-maxima in the cells, using the arbitration tree.
Figure 6 shows the execution time evaluation for a similarity calculation between two tRNA synthetase proteins; sequence a is 951 amino acids long, and sequence b is 876 amino acids long. The simulated Rapid-2 board is configured with 256 32-bit ALUs (1024 8-bit ALUs). For each stage, we present the number of cycles needed by the stage, the number of cycles really spent in cell computation, the ratio between those two values, and the percentage of time elapsed in this stage. The overall execution takes 393,606 cycles, i.e. 13 ms with a 30 MHz clock. The parallel similarity calculation accounts for 80.6% of this time, during which the board is used at 99.95% of its pure computational speed. Running the same comparison on a Sparc Station 2 requires 4200 ms (using ssearch). Hence, Rapid-2 is 320 times faster than the Sparc Station. Figure 6 also shows that stages 1 and 2 are I/O-intensive stages that have low calculation rates. They are executed relatively quickly, since they represent less than 20% of the overall execution time.
Stage               Nb cycles   Nb cycles calculation   % calculation   % total time
Micro-prg loading       19682                       0               0              5
Initialization          56417                    4683             8.3           14.3
Systolic steps         317101                  316943           99.95           80.6
Result output             406                     375            92.4            0.1

Figure 6: Time evaluation of the Smith and Waterman algorithm implementation
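The quoted execution time and speed-up can be cross-checked directly from the cycle counts in Figure 6; the short C snippet below does nothing more than that (the 30 MHz clock and the 4200 ms ssearch time are taken from the text above).

#include <stdio.h>

int main(void)
{
    /* cycle counts per stage, as listed in Figure 6 */
    long cycles[4] = { 19682, 56417, 317101, 406 };
    long total = 0;
    for (int i = 0; i < 4; i++) total += cycles[i];

    double clock_hz   = 30e6;                     /* 30 MHz board clock       */
    double time_ms    = 1e3 * total / clock_hz;   /* about 13.1 ms            */
    double ssearch_ms = 4200.0;                   /* Sparc Station 2, ssearch */

    printf("total cycles   : %ld\n", total);      /* 393606                   */
    printf("execution time : %.1f ms\n", time_ms);
    printf("speed-up       : %.0f x\n", ssearch_ms / time_ms);  /* about 320  */
    return 0;
}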
7 CONCLUSION
Today's integration scales make it possible to build systolic rows large enough to accelerate compute-intensive tasks. Our architecture aims at massive parallelism (256 parallel 32-bit processors reconfigurable into 1024 8-bit processors) and combines a set-associative design with a 1-D systolic mesh. We implemented a genetic sequence comparison program based on the very popular Smith and Waterman algorithm on our architecture. We validated the implementation by simulation and measured its performance. Adding a Rapid-2 board to a computer accelerates these tasks by up to a factor of 320. Our system is easily adaptable to an existing environment and could be used to accelerate genetic databank comparisons.
Since the board features a reconfigurable sequencer (micro-program RAM), it can easily be programmed for other applications requiring massive processing and light data transfers. Among the possible applications for Rapid-2, image processing and image recognition are under study.
Acknowledgements
The authors are very grateful to all the people involved in developing the Rapid-2 architecture and particularly to P. Faudemay (project manager) and A. Greiner (laboratory director). The project is partially funded by GdR ANM (Novel Machine Architecture), GdR "Informatique et Génome" (Computer Science and Genomics), and GIP GREG (Groupement de Recherches et d'Études sur les Génomes).
References
[1] Altschul S. F. et al., "Basic Local Alignment Search Tool", J. Mol. Biol., no. 215, pp 403-410, 1990
[2] Archambaud D., Faudemay P., Greiner A., "Rapid-2: An Object Oriented Associative Memory Applicable to Genome Data Processing", Proc. of the 27th Hawaii International Conference on System Sciences, vol. 5, pp 150-199, 1994
[3] Archambaud D., Saraiva Silva I., Faudemay P., "Communication and Performance Trade-offs in a Systolic Machine", Proc. of the SAC-PAD / 6th Brazilian Symposium of Computer Architecture and High Performance Computing, August 1994
[4] Brutlag D. et al., "BLAZE: An Implementation of the Smith and Waterman Sequence Comparison Algorithm on a Massively Parallel Computer", Computer Chemistry, vol. 17, no. 2, pp 203-207, 1993
[5] Coulson A. F. W. et al., "Protein and nucleic acid sequence database searching: a suitable case for parallel processing", Computer Journal, vol. 30, no. 5, pp 420-424, June 1987
[6] Frisson P., Quinton P., "Systolic Architectures for Connected Speech Recognition", Proc. of the NATO advanced study institute on new systems and architectures for automatic speech recognition and synthesis, pp 146-167, July 1984
[7] Gonnet G. H., Cohen M. A., Benner S. A., "Exhaustive matching of the entire protein sequence database", Science, vol. 256, 5 June 1992
[8] Gotoh O., "An Improved Algorithm for Matching Biological Sequences", J. Mol. Biol., no. 162, pp 705-708, 1982
[9] Greiner A., Pécheux F., "Alliance: A complete set of tools for teaching VLSI design", Proceedings 2nd Eurochip Conference, Grenoble, September/October 1992
[10] Jones R., "Sequence Pattern Matching on a Massively Parallel Computer", CABIOS, vol. 8, no. 4, pp 377-383, 1992
[11] Kung H. T., "Why systolic architectures?", Computer, vol. 15, pp 37-46, Jan. 1982
[12] Landès C., Hénaut A., Risler J. L., "A comparison of several similarity indices used in the classification of protein sequences: a multivariable analysis", Nucleic Acids Research, vol. 20, no. 14, pp 3631-3637, Oxford University Press, 1992
[13] Lavenier D., "An Integrated 2D Systolic Array for String Comparison", Proc. of the 1st South American Workshop on String Processing, pp 117-122, September 1993
[14] Lopresti D. P., "P-NAC: A Systolic Array for Comparing Nucleic Acid Sequences", Computer, vol. 20, pp 98-99, July 1987
[15] Needleman S. B., Wunsch C. D., "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins", J. Mol. Biol., vol. 48, pp 443, 1970
[16] Pearson W. R., Lipman D. J., "Improved tools for biological sequence comparison", Proc. of the National Acad. Sci. U.S.A. 85, pp 2444-2448, 1988
[17] Smith A. R., Sambur M. R., "Hypothesizing and Verifying Words of Speech Recognition", Trends in Speech Recognition, Lea W. A. (ed.), Prentice-Hall, Englewood Cliffs, N.J., pp 139-149, 1980
[18] Smith T. F., Waterman M. S., "Comparison of Biosequences", Advances in Applied Mathematics, vol. 2, pp 482-489, 1981
[19] Wood D., "A Survey of Algorithms and Architectures for Connected Speech Recognition", Proc. of the NATO advanced study institute on new systems and architectures for automatic speech recognition and synthesis, pp 233-248, July 1984
[20] Wu S., Manber U., "Fast Text Searching Allowing Errors", Communications of the ACM, vol. 35, no. 10, October 1992
Algorithms and Parallel VLSI Architectures III, M. Moonen and F. Catthoor (Editors) © 1995 Elsevier Science B.V. All rights reserved.
ARCHITECTURE AND PROGRAMMING OF PARALLEL VIDEO SIGNAL PROCESSORS
K.A. VISSERS, G. ESSINK, P.H.J. VAN GERWEN, P.J.M. JANSSEN, O. POPP, E. RIDDERSMA, H.J.M. VEENDRICK
Philips Research, Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands. vissers@prl.philips.nl
ABSTRACT. Programmable Video Signal Processor ICs (VSPs) have been developed for the real-time processing of digital video signals. These processors are supported by dedicated programming tools. A large number of applications have been developed with boards containing several of these processors. Currently two implementations of the general architecture exist: VSP1 and VSP2. A single chip contains several Arithmetic and Logic Elements (ALEs) and Memory Elements. A complete switch matrix implements the unconstrained communication between all elements in a single cycle. The programming of these processors is done with Signal Flow Graphs (SFGs). These SFGs can conveniently express multi-rate algorithms. These algorithms are mapped onto a network of processors. Applications with these processors have been made for a number of industrially relevant video algorithms, including the complete processing of next generation fully digital studio TV cameras and several image improvement algorithms in medical applications.
KEYWORDS. Real-time video processing, fine-grain parallelism, programmable processors.
1 INTRODUCTION
The development of new algorithms for video processing is done in several steps. Initially off-line simulations are used, followed by detailed analysis of the algorithm on systems that allow real-time execution. We developed systems consisting of high-speed programmable digital video signal processors (VSPs), programming tools and several boards containing a number of VSPs. These systems enable a flexible and rapid evaluation in real-time of video algorithms.
There are many advantages to having programmable hardware to evaluate the algorithms. The flexibility allows the mapping of many different algorithms on the same hardware. Furthermore, detailed fine-tuning of algorithms can be done without any hardware modification. A previously developed algorithm can easily be reused by mapping that algorithm onto the same or similar programmable hardware systems. Valuable development time is saved by a high-level description of an algorithm followed by the computer-assisted mapping of that algorithm on programmable hardware. These steps abstract from unnecessary hardware details. In the next section the architecture of two generations of our VSPs will be presented. Next the concepts and implementation of the dedicated programming tools will be illustrated, followed by the description of some applications and results. Finally some conclusions and suggestions for future research will be formulated.
2 ARCHITECTURE
The implementation of digital video processing requires a high computing power and a high communication bandwidth since the sampling rates for video signals are in the order of a few to several tens of MHz. Processing of TV images in real time can be done in several ways. One possibility is to implement a parallel computer with sufficient memory and computing power to store several images and allocate a number of processors to parts of the image. This requires large memories and, even for small computational tasks, a relatively large computer. Another option is to use high-speed processors that can handle the required speeds of streams of samples and perform the required processing on the streams [1] [2]. This approach requires the integration of high-speed communication and sufficient processing power in a single chip. This approach is used in the VSPs. Video signal processing can be done with a limited resolution in the number of bits. We have developed two generations of VSPs with a 12-bit architecture, shown in Figure 1, which allows multiple precision computations. The first generation, called VSP1 [3] [4] [5], has already been available for several years. The second generation, called VSP2, has recently become available [6]. The architecture of these VSPs consists of a number of Processing Elements (PEs) connected to a switch matrix. A Processing Element (PE) can be an Arithmetic and Logic Element (ALE), a Memory Element (ME), a Buffer Element (BE) or an Output Element (OE). All these Processing Elements are pipelined and are active in parallel. The switch matrix realizes the unconstrained programmable communication between all elements in a single cycle. For each PE the instructions are stored in a local program memory. Initially the program memories are loaded with instructions via a serial download. After a well-defined initialization, the PEs execute their instructions cyclically [7] [8]. Conditions are handled as data-dependent selections. Therefore instructions can be executed unconditionally, avoiding branch hazards, which are one of the limitations in using parallelism in architectures [9]. Every clock cycle each PE can take one or more values from the switch matrix and produce, after a few cycles, a result that is available to all PEs. The silos in a PE store intermediate values for at most 31 clock cycles. The switch matrix and silos support multi-rate processing. The characteristics of the VSP1 and VSP2 are given in Table 1. The Arithmetic and Logic Element contains silos, a program memory and an
Figure 1: Architecture of VSPs.
ALE core, as shown in Figure 2. The ALE core contains barrel shifters, multiplexers for constant input, and an ALU. The instruction set of the ALU is given in Table 2. The ALU implements multiplication by Booth-encoded multiplication steps. The data-dependent result of an instruction is illustrated in columns 4 and 5 of the instruction set. For instance, an add1 will either implement the addition of the first two inputs (P+Q) or produce a zero (clear), depending on the last bit (r0) of the third input (R) of the ALU. The Memory Element contains silos, program memory, data memory and logic for the address calculations. The ME of the VSP2 is given in Figure 2. This ME can implement the storage of several video lines. The memory can be filled with initial values during program download. This also facilitates the usage of the memory as a Look Up Table (LUT): an implementation of any non-linear fixed transfer function at video resolutions. The Buffer Element contains a silo that can be used for additional storage of intermediate results. The Output Element provides buffering and the output to the chip. An arbitrary number of these Video Signal Processors can be used in parallel in a network of processors to implement arbitrarily complex video algorithms. Field memories are added to the network of processors only when required by the function of the algorithm. The processing power and flexibility of the VSP2 at least equal the processing power and memory of any network of eight VSP1s. This is illustrated as follows. The ALEs of the VSP1 and VSP2 are identical in instruction set, the VSP2 contains four times the number of ALEs in a VSP1, and the VSP2 runs at 54 MHz, twice the frequency of a VSP1. Therefore the ALEs of the VSP2 are equivalent to the sum of the ALEs of eight VSP1s. The 2 MEs on the VSP1 implement 2 x 512 words of storage with one read or one write per clock cycle of 27 MHz. The four MEs on the VSP2 implement 4 x 2k words of storage with a read and a write per clock cycle of 54 MHz. So the total memory capacity and bandwidth of the VSP2 is 8 times that of a VSP1. On top of this, the switch matrix of a VSP2 allows communication between all PEs. Furthermore the VSP2 allows additional flexibility in the
schedule due to the six BEs, which are absent in the VSP1. In conclusion, a VSP2 is at least equal to any network of eight VSP1s.

                     VSP1                  VSP2
technology           1.2 um CMOS           0.8 um CMOS
chip size            90 mm2                156 mm2
transistors          206,000               1,150,000
package              QFP 160, PGA 176      QFP 208
dissipation          1 W                   <5 W
clock                27 MHz                54 MHz
word size            12 bit                12 bit
ALEs                 3                     12
MEs                  2                     4
size ME memory       512 x 12              2k x 12
memory style         single port           dual port
BEs                  0                     6
inputs               5                     6
outputs (=OEs)       5                     6

Table 1: Characteristics of VSPs.
3 IC DESIGN
Because of the compatibility with the first generation (VSP1) we adopted a comparable design approach for the VSP2. In a 0.8 um CMOS process, with this highly parallel architecture, where all PEs operate at the same speed, the 54 MHz clock frequency was a real challenge. Therefore, many of the blocks are custom designs. The ALU, silo, different instances of the program memory, switch matrix and barrel shifters are all created from procedural layout descriptions, called generators. These custom generators are written in the L language of the GDT IC Design Package. Especially the timing of the switch matrix was a challenge. Each of the 60 inputs of the 28 processing elements can communicate with every PE output. To minimize the area of the switch matrix, the switches are implemented as minimum-size transistors, which are selected by means of a 5-bit decoder. 60 of these decoders are situated beneath the wide data bus (312 data wires). This also limited the sizes of the decoder transistors, and hence several actions (reducing RC times inside the decoder and inside the switch matrix) had to be taken to have the switch matrix operate at 54 MHz. The dual-port data RAM in the ME is a full custom design, optimized for this chip in this technology. The address calculation logic in the ME is realized with standard cells, as is the initialization and test control logic. The special high-speed, low-voltage-swing option of the I/O circuits also required a dedicated design. Block and chip assembly were done by channel routing. Over-the-cell routing was used to save chip area. As will be discussed further on, application of this chip is supported by a set of tools that map algorithms onto the hardware. For design verification these tools have been used to map several algorithms onto the design. These algorithms have been chosen such that the response of every Processing Element can be simulated and verified in GDT. An in-house tool was used to compare the final netlist, extracted from the layout, with the GDT netlist. Figure 3 shows the layouts
Figure 2: Arithmetic and Logic Element and Memory Element.
of the VSP1 and the VSP2. Compared to the VSP1, the VSP2 chip has a much more regular structure. This is done to achieve proper data bus, clock and supply routing all over the chip. Because of the combination of high complexity, high power dissipation and high performance, special clock and power routing strategies have been adopted during the design. Testability has been ensured by the inclusion of eight scan chains. Reduction of the number of test vectors is achieved by the implementation of Multiple Input Signature Registers (MISRs), which allow compression of data over a number of clock cycles. The final signature is then scanned out. The VSP2 chip is packaged in a ceramic AINi2 208-pin QFP.
4 PROGRAMMING TOOLS
Programming of networks of VSPs is by far too complicated and cumbersome to be done manually at the assembler level. Therefore, dedicated programming tools were developed. These tools take all characteristics of the processors into account. The programs for the PEs of the VSPs in the network are loaded into the processors, and next the PEs execute the individual programs cyclically. Fixed static rates are mapped onto a network of processors with a fixed static schedule [10]. The programming of these processors is done with Signal Flow Graphs (SFGs). The SFGs can conveniently express multi-rate algorithms. The SFGs consist of operations and data dependencies. The execution rate of an operation is fixed and given during the specification of the Signal Flow Graph, similar to the concept of Synchronous Data Flow [11] as used in the Ptolemy environment [12]. A more detailed transformation of Synchronous Data Flow into SFGs can be found in [13]. With the aid of the programming tools, these SFGs are then mapped onto networks of processors. The inherent fine-grain parallelism in the SFG
Table 2: The ALU instruction set. The table lists, for each instruction, the opcode i4..0 and the two data-dependent results selected by bits of the third ALU input R. The instruction classes comprise additions (add0-add4), subtractions (sub0-sub5), logical operations (log0-log4: AND, OR, XOR, etc.), compares (cmp0-cmp2, returning true = -2048 and false = 0) and multiplication steps (bm: 2-bit Booth cell; um: 3-bit unsigned multiply, range 0:7; sm: 3-bit signed multiply, range -4:3). For example, add1 yields P+Q when r0 = 0 and clear when r0 = 1.
is exploited during the mapping onto a number of parallel processors. The programming tools support the entry of SFGs, networks of processors, simulation of SFGs, display of simulation results, the mapping of an SFG onto a given network, and microcode generation for the processors. The VSP programming trajectory with the programming tools is given in Figure 4. The programming tools have a graphical interface and support the entry and manipulation of the SFGs and VSP networks. The top part of the graphical interface in Figure 4 shows the SFGs of two filters. The small number below an operation denotes its period. This period indicates how often an operation is executed; e.g. a period of 4 denotes that an operation needs to be executed once in every 4 clock cycles. This corresponds, for instance, to a rate of 13.5 MHz if the clock frequency is 54 MHz. These filters are a band-pass filter and a low-pass filter as used for separating respectively the chrominance and luminance components out of a "Digital Composite
Figure 3: Layouts of VSP1 in 1.2 um CMOS and VSP2 in 0.8 um CMOS.
Video and Blanking and Synchronization Signal" (DCVBS), as used in TVs. These are simple, yet realistic examples containing only 10 and 7 operations. In the bottom right-hand part of Figure 4 the results of simulation are displayed in an oscilloscope-like form. The top signal, a sweep, is applied to the inputs of the filters, resulting in the responses shown underneath. The top SFG contains 8 ALE operations, all with period 4. These 8 operations can be mapped onto 2 ALEs, where each ALE will have a cycle of four instructions. To execute an SFG on the VSPs in real time, first a mapping of the SFG onto a network of VSPs needs to be made. A correct mapping of an SFG onto a network of processors consists of an assignment of a PE and a start time (in clock cycles) for each operation in the SFG, and must satisfy the following constraints (a small illustrative check is sketched after the list):
- Type constraints: An operation must be mapped on a PE that can execute the specified function.
- Communication constraints: Two operations that are connected by a data precedence must be mapped on PEs that can communicate, either via the switch matrix or via a channel.
- Time constraints: The samples of a data precedence can be stored in a silo for at least 1 and at most 31 clock cycles.
- PE constraints: Operations that are executed at the same time cannot be executed on the same PE.
- Silo constraints: The samples of two data precedences cannot be written into the same silo at the same time.
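The data structures and the example mapping in the sketch below are illustrative assumptions, not the tool's internal representation, and the type, communication and silo-write checks are omitted for brevity. The sketch only shows how the time and PE constraints interact with the cyclic schedule, in which start times repeat modulo the operation period.

#include <stdio.h>

/* Illustrative representation of a candidate mapping: each operation has a
 * period, an assigned PE and a start cycle; each data precedence links a
 * producer to a consumer.                                                   */
typedef struct { int period; int pe; int start; } Op;
typedef struct { int from; int to; } Dep;

enum { N_OPS = 4, N_DEPS = 3, SILO_MIN = 1, SILO_MAX = 31 };

static const Op  op[N_OPS]   = { {4, 0, 0}, {4, 0, 1}, {4, 1, 1}, {4, 1, 2} };
static const Dep dep[N_DEPS] = { {0, 1}, {0, 2}, {2, 3} };

int main(void)
{
    int ok = 1;

    /* Time constraint: a sample must stay in a silo between 1 and 31 cycles. */
    for (int d = 0; d < N_DEPS; d++) {
        int delay = op[dep[d].to].start - op[dep[d].from].start;
        if (delay < SILO_MIN || delay > SILO_MAX) {
            printf("time constraint violated on dependence %d\n", d);
            ok = 0;
        }
    }
    /* PE constraint: two operations on the same PE may not share a cycle of
     * the cyclic schedule (start times compared modulo the period).          */
    for (int i = 0; i < N_OPS; i++)
        for (int j = i + 1; j < N_OPS; j++)
            if (op[i].pe == op[j].pe &&
                op[i].start % op[i].period == op[j].start % op[j].period) {
                printf("PE conflict between operations %d and %d\n", i, j);
                ok = 0;
            }
    printf(ok ? "mapping is feasible\n" : "mapping violates constraints\n");
    return 0;
}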
Figure 4: VSP programming trajectory and graphical interface of the programming tools.
The mapping of an SFG onto a network of processors can be split into two steps. First a partitioning of the SFG is made for the network of VSPs. During partitioning, operations are assigned to a VSP, under the constraints given by the communication constraints between processors and by the processing and memory capacity within a processor. Next a scheduling step is done where each operation is assigned to a PE and a start time [14]. The mapping, consisting of partitioning and scheduling, is supported by tools. A graphical manipulation of the mapping is also supported, with direct interactive feedback using different colors to indicate violations of constraints. The tools take care of all pipelining effects. When a correct mapping is obtained for the SFG onto the network of processors, the microcode for all processors can be generated automatically, and the code can be loaded into the processors.
5 APPLICATIONS
Using the tools, the initial entry and mapping of a new algorithm is typically done in a few hours, depending on the complexity of the algorithm. Fine-tuning of those algorithms, with immediate real-time display of the results, is done in a matter of minutes. The complete impact of an algorithm can often only be judged on realistic video material, like the normal TV signal as received with consumer-quality equipment from a cable company. Using several of these development systems, a number of groups inside and outside Philips Research have successfully applied this method to the development and fine-tuning of their algorithms. Using complex systems, the complete processing of a next generation fully
digital studio TV camera and several image improvement algorithms in medical applications have been implemented with VSPs. Furthermore, a large variety of video processing algorithms in the field of TV processing have been successfully implemented with VSPs. These algorithms can roughly be divided into four areas, which are listed below. The names of the algorithms are indicated between parentheses.
- Picture quality improvement
  - contour enhancement (contour)
  - adaptive noise reduction (limeric)
  - adaptive luminance and gamma correction (histmod, ijntema, gamma)
  - improved luminance/chrominance separation
- Special effects
  - picture expansion and scan correction (vidiwall4x, vidicorr)
  - picture-in-picture and multi-window TV [15]
  - test pattern generation (testgen)
  - fading, wipes
- Standards conversion
  - 50-100 Hz conversion, interlace-progressive scan conversion
  - 4:3-16:9 aspect ratio conversion (horcompr, panorama)
  - color space conversion (haighton)
  - PAL-NTSC conversion
Some characteristics of the applications are given in Table 3. The columns indicate the name of the algorithm, the number of operations in the SFG, the periods of the operations, and the ALE, ME, ME storage and OE utilization respectively. The VSPl-flexboard, a standard board with 8 VSPls and 4 inputs and 4 outputs, each 12 bit, is in use in several setups. The flexibility of our approach is illustrated by the fact that all algorithms given in Table 3 except "limeric" have been mapped onto this board. The network of VSPls that is used on the boards for the above-mentioned mappings is shown in Figure 5. "limeric" has been mapped on three of these boards, and "motionest" has been mapped on a board containing a different configuration with 8 VSPls. As was illustrated in section 2, a single VSP2 is equivalent to any network of 8 VSPls. Therefore a
Table 3: Applications and the VSP utilization of 8 VSP1s.
Figure 5: Network of 8 VSP1 processors.
single VSP2 can implement all algorithms given in Table 3, except in the case of "limeric", which requires at most 3 VSP2s. The compatibility of the mapping of an SFG is supported at the software level. Therefore all algorithms described in the form of an SFG and mapped onto networks of VSP1s can also be mapped onto VSP2s. This is illustrated by the contour algorithm given in Figure 6. This algorithm results in a perceivable improvement of sharpness in the horizontal and vertical directions. This algorithm is part of the processing in studio TV cameras. The control settings are an essential part of the algorithm. The algorithm was first mapped onto 8 VSP1s and next onto a single VSP2. The numbers in the figure indicate the periods. For a mapping onto a VSP2, the period of an operation will be doubled, since the frequency of the VSP2 is twice the frequency of a VSP1.
6 CONCLUSIONS AND FUTURE DIRECTION
It is illustrated that an environment based on dedicated programmable processors, dedicated programming tools and several boards, provides a unique and extremely powerful tool for the development of real-time video algorithms. This environment is successfully being
Figure 6: Contour algorithm that can be mapped on 8 VSP1s or on a single VSP2.
used in several industrial projects. The impact and results of the algorithms are essential for the performance of the final product and need to be judged visually. Currently boards with 6 VSP2s that can be used in a VME environment are becoming available. Applications in the field of high-speed medical imaging are under development with a number of these boards. Several applications in the field of standard-definition TV processing and HDTV processing will also be done with these new boards. The further integration of the tools deserves attention. A single, integrated tool that can program systems of VSPs in general is under development. The more powerful boards with VSP2s will increase the size of the algorithms under study. Therefore the automated mapping quality will continue to deserve attention.
Acknowledgements
We gratefully acknowledge the many colleagues who have helped in the successful development and application of several environments.
References
[1] T. Kopet, "Programmable Architectures for Real-Time Video Compression", Proc. ICSPAT '93, September 1993, Santa Clara, CA.
[2] P.E.R. Lippens et al., "Phideo: A Silicon Compiler for High-Speed Algorithms", Proc. European Design Automation Conference, CS Press, 1991.
[3] C.M. Huizer et al., "A Programmable 1400 Mops Video Signal Processor", Proc. CICC, 24.3.1-24.3.4, May 1989, San Diego, CA.
[4] G. Essink et al., "Architecture and Programming of a VLIW Style Video Signal Processor", Proc. MICRO-24, November 1991, Albuquerque, NM.
[5] A.H.M. van Roermund et al., "A General-Purpose Programmable Video Signal Processor", IEEE Transactions on Consumer Electronics, August 1989.
[6] H.J.M. Veendrick et al., "A 1.5 Gips video signal processor (VSP)", Proc. CICC, 6.2, May 1994, San Diego, CA.
[7] D.A. Schwartz and T.P. Barnwell III, "Cyclo-static Multiprocessor Scheduling for the Optimal Realization of Shift-Invariant Flow Graphs", Proc. ICASSP, 1985, Tampa, FL.
[8] B.G. Chatterjee, "The polycyclic processor", Proc. of the ICCD, Oct. 1983, Port Chester, NY.
[9] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers Inc., 1990.
[10] E.A. Lee and D.G. Messerschmitt, "Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing", IEEE Transactions on Computers, January 1987.
[11] E.A. Lee and D.G. Messerschmitt, "Synchronous Data Flow", IEEE Proceedings, September 1987.
[12] J.T. Buck, "The Ptolemy Kernel", Technical Report, Memorandum UCB/ERL M93/8, University of California at Berkeley, January 1993.
[13] Sun-Inn Shih, "Code generation for VSP Software Tool in Ptolemy", Master Thesis, Report UCB/ERL M94/41, University of California at Berkeley, May 1994.
[14] G. Essink et al., "Scheduling in Programmable Video Signal Processors", Proc. ICCAD-91, November 1991, Santa Clara, CA.
[15] A.A.J. de Lange and G.D. La Hei, "Low-cost Display Memory Architectures for Full-motion Video and Graphics", SPIE Volume 2188, Proceedings on High-Speed Networking and Multimedia Computing, Editors: Arturo A. Rodriguez, Mon-Song Chen, Jacek Maitan, February 8-10, 1994, San Jose, California, USA, ISBN 0-8194-1483-2.
[16] G. de Haan, P.W.A.C. Biezen, H. Huijgen, and O.A. Ojo, "True-Motion Estimation with 3-D Recursive Search Block Matching", IEEE Trans. on Circuits and Systems for Video Technology, October 1993.
Algorithms and Parallel VLSI Architectures III, M. Moonen and F. Catthoor (Editors) © 1995 Elsevier Science B.V. All rights reserved.
A HIGHLY PARALLEL SINGLE-CHIP VIDEO SIGNAL PROCESSOR
K. RÖNNER, J. KNEIP, P. PIRSCH
Laboratorium für Informationstechnologie, Universität Hannover, Schneiderberg 32, D-30167 Hannover, Germany. roenner@mst.uni-hannover.de
ABSTRACT. A highly parallel single-chip video signal processor architecture has been inferred by analysis of image processing algorithms. Available levels of parallelism and their associated demands on data access, control and complexity of operations were taken into account. The architecture consists of a RISC-style control unit with separate instruction cache, 16 parallel data paths with local data caches, a shared memory with matrix-type data access and a powerful DMA unit. Multi-processor systems are supported by four hypercube links with auto-routing capability. A C++ compiler and an optimizing assembler have been implemented, fully supporting all levels of concurrency. The processor achieves a very high sustained performance for a broad spectrum of image processing algorithms. For example, the processor with 16 data paths performs a 1024-sample complex Fourier transform in 33.3 us, histogramming of a 512 x 512 image with 256 grey levels takes 0.83 ms, and the Hough transform of a 512 x 512 image (30% black pixels, 1.4 degrees angle quantization) can be calculated in 66 ms. An MPEG2 (MP@ML) encoder can be implemented with one processor plus an external motion estimation processor. All examples are based on a 100 MHz clock frequency and correspond to a sustained performance of over 2 billion arithmetic operations per second. First prototypes of the processor with four parallel data paths implemented in 0.8 um CMOS will be available in the first quarter of 1995.
KEYWORDS. Parallel VLSI RISC processor, shared memory architecture, VLIW, autonomous SIMD controlling, video signal processor, image processing
1 INTRODUCTION
The demands on processing power and data throughput of image processing algorithms call for highly parallel architectures. The term image processing refers to a very wide application field ranging from image restoration, image analysis and image coding to synthesis
of images from non-visual signals (e.g. synthetic aperture radar (SAR) spectrum, nuclear magnetic resonance signals, X-ray tomogram signals). The processing demands in terms of processing power, data throughput and types of utilized algorithms of these applications differ significantly. Because this great variety of requirements could not be covered by single processors in the past, two major approaches to image processing architectures were taken. For the purpose of compact hardware realization, architecture development focused on dedicated implementation of single, especially demanding algorithms (e.g. convolution, DCT etc.) or single applications (e.g. image en-/decoders [1], recognition of predefined objects [2] etc.). This led to small systems, yet inflexible to changes in algorithms. Fully programmable systems have been designed from standard micro- or signal processors, leading to large volume and high power dissipation. Both approaches were dictated by available VLSI technology and the demand for solutions to special applications. Steady progress in semiconductor technology now enables monolithic realization of parallel image processing architectures. According to the broad range of processing demands found in the outlined applications, a software-oriented approach is mandatory, i.e. programmable video signal processors (VSPs) must be developed. Though the majority of proposed VSPs [3] - [14] are capable of processing various algorithms, most architectures are optimized for special tasks like image coding according to the MPEG standards [15] or low-level image preprocessing. This optimization leads to restrictions in the instruction set, in the types of data access that are supported efficiently and in the complexity of control. Therefore these architectures lack much of the flexibility required to perform complex image processing tasks, which typically employ a broad range of operations and data access patterns and complex decision making. The MVP proposed by Texas Instruments [16] offers very much flexibility in terms of control of parallel processors and access capabilities to the large array of 25 memory blocks. Due to the complex control processor, separate controllers, addressing units and instruction caches for each co-processor and a large full crossbar memory access network, only four parallel co-processors could be integrated, although an advanced technology and a large die size have been employed. Therefore, flexibility and degree of parallelism are unbalanced. Additionally, the processor's instruction rate is comparatively low. These disadvantages reduce the effective processing power below what could be achieved using the given resources. Software implementation of complete image processing tasks is quite demanding. Yet, none of the VSPs proposed so far is supported by a compiler enabling parallel programming in a high-level language. The envisaged image processing system has small volume and low power dissipation, i.e. is based on VLSI video signal processors. These processors must be capable of achieving a high sustained performance over a wide range of image processing algorithms. Because image processing applications are growing in complexity, the processor must be programmable in high-level languages supporting parallelization. This paper presents a VSP that fulfills the named goals. This has been achieved by adaptation of the type of control, memory architecture and parallelization strategies to the requirements of a wide variety of image processing algorithms with respect to data access
patterns, arithmetic operations and available levels of parallelism. The architecture uses parallelization strategies suitable for integration into state-of-the-art compilers. Further, full advantage is taken of the high clock speeds offered by modern semiconductor technology. In the following section the requirements of image processing algorithms and the available levels of parallelism are outlined. This builds the foundation for the VSP architecture presented in section 3. Section 4 deals with the processor's programming model and some details of the compiler and assembler implementation. Finally, the conclusions that can be drawn from the discussions in this paper are summarized.
2 AVAILABLE PARALLELISM AND REQUIRED ARCHITECTURAL PROPERTIES
2.1 Levels of Parallelization and Average Parallelism
Image processing applications offer several levels of inherent parallelism, i.e. concurrency of some type of processing. Concurrency can be achieved by two means: parallel processing and pipelining. Both methods will be outlined briefly. For a comprehensive overview see for example [17]. Parallel processing denotes the concurrent calculation of multiple independent results from a single instance or several instances of data. For the majority of algorithms, the set of input data can be partitioned prior to processing and assigned to different processing units, though for the final calculation of results a merge step might be necessary, requiring efficient means for communication between parallel processors. Data-parallel processing achieves a maximum speedup equal to the number of instances of data available concurrently. Therefore, for typical image sizes, parallel processing of segments or single pixels offers the opportunity for a large degree of concurrency, of order 1000. However, for exploitation of this attractive chance to speed up computation of image processing algorithms, architectures must cover several algorithmic requirements outlined in the following subsection. Data-parallel processing can also make use of concurrently computable intermediate variables. But their number is typically low. Yet, as this level of concurrency makes use of instruction-level parallelism, as does software pipelining of operations [21], it comes at no extra overhead. Pipelining achieves concurrency by overlapping subsequent computational steps in time: the next input to a pipeline stage is calculated while the stage itself calculates the input for its successor. Typically four levels of pipelining can be found in image processing applications: algorithms, functions, operations and instructions. The maximum speedup of pipelining is given by the number of pipeline stages. On the algorithmic level the number of possible stages is in the order of 10. Calculation of each algorithm typically employs three to four functions. The number of operations of arithmetic functions performed per input data item seldom exceeds four. Instruction pipelines speed up computation with up to 6-8 stages ([18], p. 336). Therefore the maximum parallelism of pipelining is less than 10 at any level. However, the actual speedup that can be achieved by pipelining is even smaller, due to pipeline hazards reducing hardware utilization. Most pipeline hazards
are due to unavailability of data caused by pipeline delays in data-dependent processing and differences in execution speed of pipelined units. Most data hazards can be reduced by properly sized buffers. At the algorithmic and functional level, however, these buffers might get very large, exceeding sizes feasible for monolithic integration. Therefore, in monolithic architectures the effective speedup on these levels is much less than the theoretical maximum. Operation and instruction pipelines suffer from the same type of hazards, but the required buffer sizes are small. In RISC-style load/store architectures a large register file is used for this purpose. Additionally, data forwarding ([18], p. 261ff) reduces data hazards and further increases pipeline speedup close to the theoretical maximum.
2.2 Hardware Requirements
As mentioned previously, the speedup actually achieved by parallelization in real applications is much smaller than the theoretical maxima. In this subsection we will look deeper into the causes. First, losses in speedup by parallelization due to algorithm properties and those due to hardware properties must be distinguished. Except in rare cases, applications contain inherently sequential parts that cannot be accelerated by parallelization. Therefore, the speedup that can be achieved by parallelization asymptotically approaches an upper limit set by the relation of the number of sequential operations to the total number of operations. This is known as Amdahl's law [19]. In order to keep losses in execution speed as low as possible, a parallel architecture must also achieve a high sequential performance.
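As a hedged illustration of this limit (the numbers below are made up for the example, not measured on the processor), the following C fragment evaluates Amdahl's law for a hypothetical task in which 5% of the operations are inherently sequential.

#include <stdio.h>

/* Amdahl's law: speedup = 1 / (s + (1 - s) / n), where s is the sequential
 * fraction of the work and n the number of parallel processing units.     */
static double amdahl(double s, int n) { return 1.0 / (s + (1.0 - s) / n); }

int main(void)
{
    double s = 0.05;                       /* assumed sequential fraction */
    int units[] = { 4, 16, 1024 };

    for (int i = 0; i < 3; i++)
        printf("n = %4d  ->  speedup = %.1f\n", units[i], amdahl(s, units[i]));
    /* even with 1024 units the speedup stays below 1/s = 20 */
    return 0;
}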
Additionally to the limits in speedup set by algorithms, losses arise because architectures are not able to fully exploit available parallelism due to hardware limitations. However, trying to exploit each possible opportunity for parallelization overcomplicates architecture design and as a consequence actually slows down all types of algorithms. Therefore typical, widely available properties must be identified and incorporated, whereas it is sufficient to compute rare cases with less speed without significant loss of total performance. Data-level parallelization often encounters the situation that the operations applied to the input data may depend on the data itself, i.e. each parallel processing unit must be capable of performing different operations. An example is the following code sequence for binarization of image data.
1. if (pixel < THRESHOLD) {
2.     pixel = WHITE;
3. } else {
4.     pixel = BLACK;
5. }
A Highly Parallel Single-chip VSP
183
the processing units of a parallel image processor must be capable of selecting instructions independently, dependant on their current data. It should be noted however, that typically very few alternative instructions have to be carried out, i.e. too much control overhead should be avoided. Concurrent data access of all parallel processing units is viable for a significant speedup by data-level paralleUzation. Memories supporting multiple concurrent accesses must consist of multiple memory blocks, since multi-port memories are getting to large and slow if more than approximately four read and write ports are required. Concurrent access to data distributed among several blocks is not feasible without access conflicts in many situations. Conflicting accesses must be sequentialized with significant impact on performance. To avoid this situation, accesses should be restricted to access patterns fitting both, mapping strategies of data onto multiple memory blocks and algorithmic properties. Most algorithms can be formulated such that they operate on separate, quadratic segments or at least access single pixels regularly in their input data space. Segments may overlap or be positioned adjacently. This is shown in figure 1. Access to a single pixel corresponds to a segment size of one.
Figure 1: Segment oriented data access of image processing algorithms. However, not all data accesses into segments are regular. For example histogramming or Hough transform access their output data space irregularly. But separate processing units still can be assigned different segments of the data space, although these segments may get very large (e.g. Hough transform) or must be summed up to form the final result (e.g. histogramming). The latter corresponds to a split and merge parallelization strategy, requiring communication among processing units. Communication can be mapped easily to regular concurrent write and read operations to (communication) segments and therefore is covered by the types of accesses shown in figure 1. The common data acces patterns of a wide variety of algorithms outlined above can be summarized into three requirements on concurrent data access facilities. To achieve a large concurrency, they should support
184
K. Rt)nner, .L Kneip and P. Pirsch predetermined, regular parallel accesses to different segments, predeterminded, regular parallel accesses to data inside a single segment, and non-determined, irregular parallel access with dynamically calculated addresses.
Finally the question arises on the relation between flexibility of hardware to cover the outlined algorithmic requirements and the degree of parallelism dedicated to each level of parallelization. The reduction in speedup caused by sequential portions of processing can be expressed as loss of average parallelism, i.e. loss of utilization of parallel processing units. Eager et. al. show [20], that the efficiency increases with parallelism, as long as the number of parallel units is less or equal to the average parallelism of their associated level of parallelization. The average parallelism increases with hardware flexibility. Therefore, an architecture must offer more fiembility when parallelism is increased. Put in another way, there is no gain from simply increasing parallelism beyond average parallelism, which is low, if due to u lack of flexibility many sequential processing parts are caused. On the other hand, as long as the number of processing units is far less than average parallelism, flexibility can be reduced to basic requirements without significant loss of performance. With respect to architecture design this means, that
the number of parallel processing units and their associated flexibility should be balanced such that the number of parallel units is of the order of the average parallelism of their associated level of parallelization. 3
A V L I W R I S C W I T H P A R A L L E L DATA P A T H A R R A Y
T h e proposed very long instruction word (VLIW) architecture uses a RISC style control, where the same instruction is executed in several para,llel data paths (SIMD style of control). All arithmetic operations are performed on local registers. The architecture is scalable, a first prototype with four parallel data paths is currently being reMized in 0.8ttm CMOS technology and will be available at first quarter 199.5. A version with 16 parallel data paths will be implemented in 0.5#m CMOS. The processor consists of a control-unit with instruction cache, a data path array, a separate cache per data path, a shared memory with matrix type access formats [22], a DMA unit, a JTAG (IEEE Standard 1149.1-1990) compatible interface for test and debug purposes and four hypercube links with autorouting capabilities supporting multiprocessor systems. Figure 2 gives an overview of the processor architecture. The control unit fetches instructions from the instruction cache and initiates up to three concurrent operations. One load or store operation, two arithmetic operations or one arithmetic and one control operation. The operations and their associated units will be explained in the remainder of this section. Load and store operations access either the matrix memory or the data caches. If the block of data associated to the current address is not available, it is fetched from externM memory by the DMA controller. Additionally, the DMA unit can pre-fetch data on demand
A Highly Parallel Single-chip VSP
i I INSTRUCTION REGISTER
- :
-! O,CH, Ii
GLOBAL
,=::Z-
185
ADDRESS
CONTROL
ADDRESS
II
DATA CACHE
....__k I< 0
MEMORY
9
O
9
O
9 9
e o
9
o o
9 9
I
"1
L. O~CHE i
,
J
"
~.
I
I ,,~T. ~
..~
.--:~
__ J . . . . .
CACHE DATA BUS
.....
L........J MATRIX DATA BUS
,
~,.~..oc.,o. = t- -,~- I c.c..
I I-
i
BLocKs i
I INSTRUCTION BUS
(..,)'
HYPERCUBE LINKS _
DMA CONTROL I II II
I
....
.
,
INTERFACE
I
II
Figure 2: Overview of the proposed architecture. N² is the number of parallel data paths.
of the programmer prior to use, thus avoiding losses in speed due to cache misses. Addresses for matrix-memory accesses are computed by a dedicated address calculation unit. The caches can be accessed either with addresses individually calculated by the data paths or by a central address calculation unit (e.g. for stack operations). The caches are accessed with physical addresses. The matrix memory uses virtual 2D addresses composed of the position of the upper left matrix element and a vertical and horizontal spacing between adjacent matrix elements. Four types of matrix access patterns with different spacings are shown in figure 3. All spacings except multiples of three (four data paths) respectively five (sixteen data paths) are legal. This is due to the way the mapping of memory blocks to accessing data paths is calculated from logical two-dimensional matrix addresses. For this purpose the modulo of the square root of the number of memory blocks is taken, which is zero for zero and for all multiples of three respectively five, which would cause all accesses to go to memory block zero. In practical applications it can be found that typical distances
are described by 2^n × 2^n, n ∈ ℕ ∪ {0}. The combination of matrix memory and individual caches enables all types of data accesses that were identified in the previous section as being required by a wide class of image processing algorithms.
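The exact assignment of memory blocks to a two-dimensional access is not spelled out above; the C sketch below is only an illustration of the modulo mapping just described, under the assumption of a 3 × 3 arrangement of memory blocks for the four-data-path version (so that the modulus is three). It shows why spacings that are multiples of three would direct all elements of an access to the same block.

    #include <stdio.h>

    /* Illustrative only: a 3x3 arrangement of memory blocks (four data paths)
       is assumed here, so the block index is the 2D address taken modulo 3 in
       each direction. Spacings that are multiples of 3 then map every accessed
       element to the same block, which is why such spacings are not legal.    */
    static int block_of(int x, int y)
    {
        return (y % 3) * 3 + (x % 3);
    }

    int main(void)
    {
        for (int spacing = 1; spacing <= 4; spacing++) {
            printf("spacing %d:", spacing);
            for (int i = 0; i < 4; i++)            /* four elements of a vector access */
                printf(" block %d", block_of(i * spacing, 0));
            printf("\n");
        }
        return 0;
    }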
Figure 3: Four different matrix type data accesses to an image stored in matrix memory, demonstrated for four parallel data paths. The first example shows access to four overlapping segments, the second to four non-overlapping segments. A vector and a scalar type of access are also presented.
Arithmetic operations are performed by the parallel data paths. Each consists of an ALU, a 16 bit × 16 bit multiplier with a 40 bit accumulator and a 32/40 bit shift-and-round unit. The ALU can perform either a single 32 bit or two 16 bit operations. All arithmetic instructions are performed on data read from and written to a register file with sixteen registers (i.e. 3-operand instructions). The data paths are capable of performing two operations out of the three classes ALU, MUL/MAC and SHIFT/ROUND concurrently. All data paths perform the same operations, though operations may be skipped conditionally by each data path individually. Control operations can be divided into four types:
- Program control (branch conditional, jump, trap).
- Register transfer (set, read and write status, data and control registers).
- Arithmetic/logical.
- IF/ELSE/ELSIF/ENDIF (start/alter condition/end conditional execution of operations in the data paths).
The first three operations are executed by the central control unit, while the fourth is used for autonomous selection of instructions by the data paths, depending on the status flags of previous operations (e.g. negative, overflow etc.). This extension to simple SIMD control, plus the previously mentioned addressing autonomy of the data paths, enables parallel processing of data dependent operations, as required to perform general image processing tasks. Additionally, the central controller can branch on individual or groups of conditions computed by the data paths. This further increases the flexibility of control. A single central controller reduces the overhead for control of the parallel data paths, thereby enabling a high degree of parallelism on the data level. This strategy exploits the large average parallelism offered at this level. The integrated extensions to simple SIMD control balance parallelism and flexibility such that performance is maximized, in contrast to either more flexible but less parallel architectures or architectures that over-emphasize parallelism at the expense of flexibility, defeating utilization of the offered resources. All operations are executed in a six-stage instruction pipeline to achieve the high clock speed of 100 MHz, thus covering the outlined requirement of a high sequential performance. Because pipelining introduces delays for accesses to previously calculated results, all writes as well as register loads are forwarded. This measure, together with the properly sized register file and software pipelining by the parallelizing assembler (see next section), serves to keep the pipeline filled, especially during the inner loops frequently encountered in image processing algorithms. Performance data underline that a high utilization of processing elements is achieved in image processing tasks. The processor with 16 data paths performs a 1024-sample complex Fourier transform in 33.3 µs, histogramming of a 512 × 512 image with 256 grey levels takes 0.83 ms and the Hough transform of a 512 × 512 image (30% black pixels, 1.4 degrees angle quantization) can be calculated in 66 ms. An MPEG2 encoder was implemented on the processor. Together with an external motion estimation processor, a single 16-data-path processor suffices for the realization of an MP@ML (CCIR 601, 4:2:0) encoder. All examples correspond to a sustained performance of over 2 billion arithmetic operations per second.
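As an illustration of the kind of data-dependent operation that the IF/ELSE/ELSIF/ENDIF mechanism supports, the C fragment below (our own example, not code for the processor) performs a per-pixel conditional clipping; under the SIMD scheme each data path would evaluate the condition on its own pixel and skip the dependent operation individually, while the central controller issues a single instruction stream.

    /* Conditional clipping of pixels against a threshold: each data path
       executes or skips the dependent operation based on its own status
       flags, while one central instruction stream is issued.            */
    void clip(short *pix, int n, short threshold)
    {
        for (int i = 0; i < n; i++) {      /* i ranges over the pixels of one data path */
            if (pix[i] > threshold)        /* per-data-path condition                    */
                pix[i] = threshold;        /* executed only where the condition holds    */
        }
    }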
4 THE PROGRAMMING MODEL
The proposed architecture offers support for high-level languages, namely a stack mechanism for allocation of intermediate variables and branch operations that save their return address. Of the many levels of parallelization employed by the architecture, only data-level parallelization and pre-loading of data concurrently with processing must be handled by the programmer. The latter is done by writing addresses to the DMA controller's control register. The DMA controller schedules reads and writes into no-operation (nop) slots of load/store operations. Data-level parallelization is mapped to concurrent operations on multiple instances of basic data items, collected into a new data type, called matrix. A matrix may be a collection
of arbitrarily accessed data items, if access is performed via individually computed addresses to the external memory, or it may be a real matrix of image pixels if access goes to the matrix memory. We extended the GNU C++ compiler to handle the new data type matrix. The programmer declares variables concurrently accessed by all data paths to be of type matrix (e.g. matrix[4][4] int var;). This enables the compiler to handle parallel data using the same register allocation and code optimization strategies as for other RISC processors. The compiler creates sequential code. The sequential operations are mapped to the VLIW by the assembler. This eases assembler programming because parallel operations and all associated effects (e.g. pipeline delays etc.) are hidden from the programmer. The assembler also performs software pipelining and loop unrolling to ensure that the pipeline is filled as much as possible, especially in compute-intensive inner loops. Mapping of do-loops onto the hardware loop counter by the compiler further increases the speed of loop processing.
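To make the data-level mapping concrete, the sketch below is a hypothetical example built around the declaration form shown above (the statement syntax on matrix operands is our assumption): one statement on matrix operands conceptually stands for the same arithmetic applied to every element, one element per data path, as the scalar C loop makes explicit.

    /* With the compiler extension, a programmer would write something like
       (illustrative syntax; only the declaration form is documented above):

           matrix[4][4] int a, b, c;
           c = a + b;          // one instruction stream, 16 element-wise adds

       which corresponds to the following scalar C semantics:               */
    void matrix_add(int a[4][4], int b[4][4], int c[4][4])
    {
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                c[i][j] = a[i][j] + b[i][j];   /* each element on its own data path */
    }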
5 CONCLUSIONS
Since a parallel VSP's processing power is determined by the number of operations executed concurrently and the cycle time per operation, an SIMD architecture is proposed that avoids the large overhead introduced by separate control units per processing unit and consequently achieves a high degree of parallelism. The drawbacks often associated with highly parallel SIMD processors are inflexibility and a large communication overhead, restricting the usable parallelism. The proposed architecture demonstrates that the usable parallelism can be increased drastically with little hardware expense, even for data dependent processing tasks. SIMD control and the types of supported data access enabled porting of the Free Software Foundation's GNU C++ compiler to the architecture with few extensions required. Therefore the advanced code optimization strategies of the compiler remain unaltered. This approach led to a high-performance, high-level-language programmable architecture for general image processing applications.
References
[1] P.A. Ruetz et al.: "A High-Performance Full Motion Video Compression Chip Set", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 111-122, June 1992
[2] J. Schönfeld, P. Pirsch: "Compact Hardware Realization For Hough Based Extraction of Line Segments in Image Sequences For Vehicle Guidance", Proc. ICASSP93, April 25-30, Minneapolis, USA, 1993
[3] S.-I. Nakagawa et al.: "A 24-b 50-ns Digital Image Signal Processor", IEEE Journal of Solid State Circuits, Vol. 25, No. 6, pp. 1484-1493, Dec. 1990
[4] T. Nishitani et al.: "Parallel Video Signal Processor Configuration based on Overlap-Save Technique and its LSI Processor Element: VISP", Journal of VLSI Signal Processing, Vol. 1, No. 1, pp. 25-34, Aug. 1989
[5] H. Nakahira et al.: "An Image Processing System Using Image Signal Multiprocessors (ISMPs)", Journal of VLSI Signal Processing, Vol. 5, No. 2/3, pp. 133-140, April 1993
[6] S. Evans et al.: "A 1.2 GIP General Purpose Digital Image Processor", Proc. IEEE 1994 Custom Integrated Circuits Conference, IEEE Press, Los Alamitos, pp. 99-102, 1994
[7] K. Gaedke, H. Jeschke, P. Pirsch: "A VLSI based MIMD Architecture of a Multiprocessor System for Real-Time Video Processing Applications", Journal of VLSI Signal Processing, Vol. 5, No. 2/3, pp. 159-170, April 1993
[8] T. Inoue et al.: "300-MHz 16-b BiCMOS Video Signal Processor", IEEE Journal of Solid State Circuits, Vol. 28, No. 12, pp. 1321-1328, Dec. 1993
[9] K. Aono et al.: "A Video Digital Signal Processor with a Vector-Pipeline Architecture", IEEE Journal of Solid State Circuits, Vol. 27, No. 12, pp. 1886-1894, Dec. 1992
[10] H. Fujii et al.: "A Floating-Point Cell Library and a 100-MFLOPS Image Signal Processor", IEEE Journal of Solid State Circuits, Vol. 27, No. 7, pp. 1080-1088, Jul. 1992
[11] J. Gosch: "Video Array Processor Breaks Speed Record", Electronic Design, pp. 115-116, July 1990
[12] H. Veendrick et al.: "A 1.5 GIPS video signal processor (VSP)", Proc. IEEE 1994 Custom Integrated Circuits Conference, IEEE Press, Los Alamitos, pp. 95-98, 1994
[13] H. Miyaguchi et al.: "Digital TV with serial video processor", IEEE Trans. on Consumer Electronics, Vol. 36, No. 3, August 1990, pp. 318-326
[14] T. Minami et al.: "A 300 MOPS Video Signal Processor with Parallel Architecture", IEEE Journal of Solid State Circuits, Vol. 26, No. 12, pp. 1868-1875, Dec. 1991
[15] D.J. LeGall: "The MPEG video compression algorithm", Signal Processing: Image Communications, No. 4, pp. 129-140, 1992
[16] K. Balmer et al.: "A Single Chip Multimedia Video Processor", Proc. IEEE 1994 Custom Integrated Circuits Conference, IEEE Press, Los Alamitos, pp. 91-94, 1994
[17] D.I. Moldovan: Parallel Processing: From Applications to Systems, Morgan Kaufmann Publishers, San Mateo, CA, USA, 1993
[18] J.L. Hennessy, D.A. Patterson: Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Mateo, CA, USA, 1990
[19] G.M. Amdahl: "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities", Proc. AFIPS, Vol. 30, pp. 483-485, 1967
[20] D.L. Eager, J. Zahorjan, E.D. Lazowska: "Speedup Versus Efficiency in Parallel Systems", IEEE Trans. on Computers, Vol. 38, No. 3, pp. 408-423, Mar. 1989
[21] B.R. Rau, J.A. Fisher: "Instruction-Level Parallel Processing: History, Overview and Perspective", in B.R. Rau, J.A. Fisher (Eds.): Instruction-Level Parallelism, Kluwer Academic Publishers, Boston, MA, 1993
[22] H. Volkers: Ein Beitrag zu Speicherarchitekturen programmierbarer Multiprozessoren der Bildverarbeitung, PhD thesis, Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung, University of Hannover, Germany, 1992
A MEMORY EFFICIENT, PROGRAMMABLE MULTI-PROCESSOR ARCHITECTURE FOR REAL-TIME MOTION ESTIMATION TYPE ALGORITHMS
E. DE GREEF, F. CATTHOOR, H. DE MAN
IMEC, Kapeldreef 75, 3001 Leuven, Belgium, [email protected]
ABSTRACT. In this paper, an architectural template is presented which is able to execute the full search motion estimation algorithm or other similar video or image processing algorithms in real time. The architecture is based on a set of programmable video signal processors (VSP's). It is also possible to integrate everything on a chip set using VSP cores. Due to the programmability, the system is very flexible and can be used for emulation of other similar block-oriented local-neighborhood algorithms. The architecture can be easily divided into several partitions, without data exchange between partitions. Special attention is paid to memory size and transfer optimization, which are dominant factors for both area and power cost.
KEYWORDS. Architectural template, block-oriented video algorithms, programmable VSP's, interconnection network, memory size and transfer optimization.
1 INTRODUCTION
In video applications, several block-oriented local-neighborhood-dependent algorithms with simple control flow are used. A well known example is the full search motion estimation (ME) algorithm [6, 11, 19, 12]. The goal of this paper is to present a flexible, programmable and parallel architectural template for which the memory cost and the number of transfers are minimized. The template is able to execute this type of algorithm in real time, making use of existing video signal processors (VSP's) [18, 14, 15, 9]. It is also possible to integrate everything on a (set of) custom chip(s) with stripped VSP cores though. The algorithms considered here are characterized by the fact that a frame of W pixels in width and H pixels in height is divided into several small blocks, called current blocks
(CB's). The processing for each of these blocks is assumed to be dependent on a limited region in the previous frame, located in the same neighborhood. This region is called the reference window (RW). The generic loop structure, which is repeated for every block, looks as follows:

    for (i = -m/2 .. m/2-1)      [horizontal traversing of RW]
      for (j = -m/2 .. m/2-1)    [vertical traversing of RW]
        for (k = 1 .. n)         [horizontal traversing of CB]
          for (l = 1 .. n)       [vertical traversing of CB]
            [basic operation 1 (BO1) on 1 pixel of CB and RW]
          end (l)
        end (k)
        [basic operation 2 (BO2) on result of BO1]
      end (j)
    end (i)

The CB has a size of n² pixels, while the corresponding RW has a size of (n + m - 1)² pixels. This is shown in the lower right corner of Fig. 2. Only in the inner loop are pixels of the CB and of a region of the RW, of the same size as the CB, needed. Even this could be extended though. For the ME algorithm, BO1 consists of a subtraction, an absolute value and an accumulation, while BO2 consists of a comparison and a selection operation. For other variants of the algorithm, these operations can be different, but this will only affect the required functionality of the processing elements or some of the loop parameters. For instance, it is possible that only one out of two pixels is really processed instead of all of them.
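For the ME algorithm the generic loop specialises to the sum-of-absolute-differences search sketched below in C (our own rendering of the loop structure above; the array layout and the handling of frame borders are simplified).

    #include <stdlib.h>
    #include <limits.h>

    /* Full-search motion estimation for one current block (CB) of n x n pixels
       against its (n + m - 1) x (n + m - 1) reference window (RW).
       BO1 = subtract, absolute value, accumulate; BO2 = compare and select.   */
    void full_search(const unsigned char *cb, int n,
                     const unsigned char *rw, int rw_stride, int m,
                     int *best_i, int *best_j)
    {
        long best = LONG_MAX;
        for (int i = -m / 2; i <= m / 2 - 1; i++) {      /* horizontal RW traversal */
            for (int j = -m / 2; j <= m / 2 - 1; j++) {  /* vertical RW traversal   */
                long sad = 0;
                for (int k = 0; k < n; k++)              /* CB traversal            */
                    for (int l = 0; l < n; l++)
                        sad += labs((long)cb[l * n + k] -
                                    rw[(l + j + m / 2) * rw_stride +
                                       (k + i + m / 2)]);          /* BO1           */
                if (sad < best) {                        /* BO2: compare and select */
                    best = sad; *best_i = i; *best_j = j;
                }
            }
        }
    }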
2 STATE OF THE ART
Block-oriented video algorithms typically have a lot of inherent parallelism, and are therefore well suited for mapping on parallel architectures. In the past, several designs for ME have been published which make use of the massive application of pipelining and parallel processing provided by systolic or linear arrays, or tree-like architectures [6, 11, 19, 12]. These architectures are well suited for mass production, but they generally lack flexibility and/or programmability and they are not extendible with off-the-shelf components. Although the architecture presented here has been designed for ME, it is fully programmable and parameterizable and can be used for emulation of similar algorithms. In contrast, several general-purpose MIMD architectures have been designed [16], but these architectures are too general and not tuned enough to this type of video algorithm to achieve real-time operation at acceptable cost. The problem of mapping algorithms on programmable weakly parallel architectures with distributed memories has been studied extensively already. However, most of these techniques are oriented to medium throughputs and/or algorithms with a limited amount of shared data. They generally concentrate on equal distribution of the work load on the processing elements (PE's) to achieve maximum speed [10, 13, 17, 8, 2, 4, 3], without enough
consideration of the memory cost and/or the required number of memory transfers. Similar work has also been done on the mapping of algorithms onto massively parallel architectures with distributed memories (e.g. [5]), but these techniques also fail to take the memory cost sufficiently into account. This is unacceptable for video algorithms. Others concentrate on the minimization of communication buffers [1] or on the optimization of cache memories in shared memory multi-processors [7, 3], but these techniques are not well suited for video or image processing applications with high speeds and large amounts of shared data. In our design, explicit provisions have been made to keep the memory sizes and the number of transfers as small as possible, while maintaining the high throughputs that are required for video applications.
3 MAPPING ON MULTIPLE VSP'S
3.1 Typical Parameters, Assumptions
Typical parameters (standard television) for the ME algorithm are: W=720, H=576, m=16, n=8 (72 rows of 90 CB's). For a typical frame rate of 25 Hz, this results in a pixel rate of 10.4 MHz. These pixels are usually 8-bit shades of gray. In the sequel, numeric values referring to these parameters are indicated as in [⇒ numeric value]. For the time being, it is assumed that the pixel frequency is also the clock frequency of the complete system. The system consists of a large frame memory and a signal processing part, as indicated in Fig. 1. It is assumed that the pixels arrive in a row-wise manner, e.g. from a camera.
3.2 Required Number of Processing Elements
For every CB, m²n² BO1's have to be executed. If we assume a throughput of 1 CB pixel per clock cycle, the system should be able to perform m² BO1's per clock cycle. Therefore, at least m² PE's are needed [⇒ 256 PE's], each capable of performing one BO1 per cycle, possibly pipelined. There is also a need for a minimum of 2 parallel ports to the frame memory in order to provide sufficient bandwidth (one CB pixel and one RW pixel per cycle). For most VSP's, this is no problem if BO1 is simple enough (as in ME).
3.3 Reduction of Main Memory Bandwidth
A line buffer is needed to limit the necessary bandwidth between the frame memory and the PE's, and to store the data required for the processing of a row of CB's. This buffer can be divided into two parts: one for the CB's and one for the RW's. Since the different CB's do not overlap, it should be large enough to store a complete row of CB's. Neighboring RW's, however, do overlap. Of every RW, m - 1 out of m + n - 1 columns can be shared with each of its right and left neighbors, such that only n additional columns have to be stored for every RW. This is indicated in Fig. 2 for the shaded RW's in the top row.
Figure 1" System setup
Figure 2: Overlap of neighboring RW's
In addition, the RW line buffer doesn't have to contain every line of the RW's. Assuming that each RW is processed from top to bottom, the upper lines can be overwritten as soon as they are no longer needed. On the other hand, part of the RW's can be reused for the processing of the next row of CB's, since the vertical overlap is similar to the horizontal one (Fig. 2). In order to avoid rereading the overlapping parts from the frame memory, these parts should be stored long enough in the line buffer. The sequence of reading and overwriting of a RW in the line buffer is depicted in Fig. 3 (when m = 8, n = 4) for the part of the RW corresponding to a certain CB. The current and next sequences correspond to the RW rows in Fig. 2. In general, only (m - 1)n memory locations [⇒ 120 bytes] are needed per RW instead of (m + n - 1)² [⇒ 529 bytes]. For the pixels of the CB's, no such overwriting is possible since they are used too intensively. Therefore, the line buffer size for the CB's has to be doubled in order to be able to read in the pixels of the next row of CB's while the current row is being processed. This results in 2n² memory locations per CB [⇒ 128 bytes]. This way, every pixel of the previous and current frame needs to be read only once. This means that in every cycle, one pixel of the current frame and one pixel of the previous frame are both read row-wise.
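The buffer sizes follow directly from the parameters; the short C program below (our own check, using the standard-television parameters of section 3.1) reproduces the bracketed numbers and sums them over one row of CB's.

    #include <stdio.h>

    int main(void)
    {
        const int W = 720, m = 16, n = 8;       /* parameters of section 3.1 */
        int rw_per_block = (m - 1) * n;         /* RW locations kept per CB  */
        int cb_per_block = 2 * n * n;           /* double-buffered CB pixels */
        int blocks_per_row = W / n;             /* 90 CB's in one row        */
        printf("RW buffer per CB : %d bytes\n", rw_per_block);    /* 120 */
        printf("CB buffer per CB : %d bytes\n", cb_per_block);    /* 128 */
        printf("line buffer, one row of CB's: %d bytes\n",
               blocks_per_row * (rw_per_block + cb_per_block));
        return 0;
    }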
4 THE DOMAIN-SPECIFIC ARCHITECTURAL TEMPLATE
The full template proposed here is shown in Fig. 4. It consists of 3 frame memory banks, a number of columns of PE's (with different VSP functions) and a tuned interconnection
scheme. The detailed operation is explained in the subsequent sections.
Figure 3: Read/write sequence of RW line buffer
Figure 4: Proposed architecture
4.1 PE Organization
It is desirable that the number of PE's is a multiple of the number of CB's in one row, such that an equal number of PE's can be assigned to every CB [⇒ 270 PE's, divided into 90 columns]. Each PE in a column can then perform part of the processing of the corresponding CB (Fig. 4).
4.2 Frame Memory
For the frame memory, we used a circular buffering strategy similar to the one for the RW line buffer. This is shown in Fig. 5. The pixels of the current frame gradually replace the pixels of the previous frame that are no longer needed. At every moment in time, pixels of a certain region of the current frame and of the previous frame are being read by the signal processing part of the system, while another region is being overwritten with new pixels.
By configuring the frame memory as 3 interleaved banks using identical circular addressing, the number of read/write accesses per clock cycle to every bank can be kept as low as one. This allows the use of cheap off-the-shelf single-ported memory chips.
Figure 5: Circular buffering strategy for the frame memory
Using this strategy, only 1 complete frame and m + 2n lines need to be stored, which results in a frame memory size of W(H + m + 2n) locations [⇒ 434 kB]. In every cycle, one new pixel is written into one of the banks, while one pixel is read from each of the other two banks. A multiplexer always connects two of the banks to the signal processing part of the system. This is shown at the left side of Fig. 4.
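A sketch of this organisation is given below in C. The frame memory size follows directly from the formula; the interleaving of the circular address over the three banks by a simple modulo is an assumption made here for illustration, since the exact interleaving function is not specified.

    #include <stdio.h>

    int main(void)
    {
        const int W = 720, H = 576, m = 16, n = 8;
        long size = (long)W * (H + m + 2 * n);      /* W(H + m + 2n) locations */
        printf("frame memory: %ld locations\n", size);

        /* Assumed circular addressing: a linear pixel address wraps around the
           buffer, and banks are interleaved modulo 3 so that consecutive
           accesses hit different banks (illustration only).                  */
        long wrapped = ((long)1000 * W + 500) % size;   /* example pixel      */
        int bank = (int)(wrapped % 3);                  /* assumed interleave */
        printf("address %ld -> bank %d\n", wrapped, bank);
        return 0;
    }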
4.3 Distribution of the Line Buffers
In every cycle, a large number of transfers is necessary from the line buffers to the PE's. Since every PE only needs pixels of a small region of the frames, both the CB and RW line buffers can easily be divided into small pieces (1 piece for each column of PE's). In order to provide just enough local storage capacity and bandwidth, we have decided that every column of PE's stores its own CB and the middle part (n columns) of its RW. Every PE column also needs access to the RW's of its neighboring columns. In case m ≤ 2n, there is only communication with the nearest neighbors. In practice, the condition m ≤ 2n is almost always fulfilled. In the sequel, we will assume that this is the case whenever the concepts are illustrated. A structural view of this architecture and the data flow between line buffers and PE's are shown in Fig. 6.
4.4 Sharing of Transfers
In Fig. 7a, the sequence is shown in which the regions (of size n²) of the RW's are read from the RW line buffers, for the case that m = 2n. The dotted vertical lines denote the boundaries of the distributed RW buffer memories. The vertical shift of the regions represents the time axis. The different shades of the RW regions indicate that they are used by different PE columns. Now, if every column of PE's uses the pixels of its RW in the same sequence, then there are never two PE columns reading from the same part of the RW line buffer. However, Fig. 7a shows that the first half of the RW regions used by a certain PE column are also used by its left neighbor, but at a later moment in time. Now, by altering the order of execution of the odd and the even PE columns, one can make sure
Figure 6: Structural view of the proposed architecture
that neighboring columns need these pixels at the same time. This is indicated in Fig. 7b. The two-shaded regions correspond to the RW data simultaneously accessed by two neighboring PE columns. This leads to a reduction of the number of transfers by a factor of two, compared to Fig. 7a, which also significantly reduces the power consumption. In general, if cn ≤ m < (c + 1)n, the number of transfers can be reduced by a factor of c. In practice, c is typically 2, as in Fig. 7.
Figure 7: Sharing of RW pixel transfers between PE columns
4.5 Bus Structure and Occupancy
So far, nothing has been said about the spreading of the load over the different PE's in one column. Each PE should execute part of the BO1's of its corresponding CB. If this spreading were done randomly, the different PE's would require different pixels at the same time. As a result, the number of buses between the PE's and the distributed line buffers would have to be increased. However, we propose a scheme where they can share almost every pixel transfer. If the needed regions of the RW's are read in a row-wise manner, then the first row should be assigned to the first PE, the second row to the second PE, and so on. In that way, the RW regions used at a certain time by the different PE's overlap (except for a 1 row shift). This is indicated in Fig. 8.
Figure 8: Overlap of the RW regions used by different PE's in the same column
If the pixels of the RW regions are read column-wise, most of them can be used by each PE, except for the first few and last few pixels of each column, which are needed only by some of the PE's. In order to make sure that the pixels are needed at the same time by every PE, the different PE's need to be shifted 1 cycle in time relative to each other. In that case, a CB pixel cannot be read in simultaneously by all PE's though. This can be easily solved by putting the CB pixels in a delay line with a number of taps equal to the number of PE's in one column. As a consequence, some of the PE's become inactive during certain cycles. In order to provide p PE's with the pixels of one pixel column, n + p - 1 cycles are needed (n pixels in one column, for p overlapping columns) instead of n. Because of this, the necessary throughput may not be achieved any longer. If so, one could try to avoid this by adding extra buses, but then the number of buses and ports would become too large for off-the-shelf VSP's. A better solution is to add an extra PE in every column. In this way, the throughput can be increased [⇒ one extra PE per column, each PE active during 5696 out of 5760 cycles].
5 DETAILED OPTIMIZATION OF THE TEMPLATE
5.1 Delay Line Distribution
If one were to implement this architecture using discrete chips, it would be a large disadvantage if the delay lines for the CB pixels had to be implemented as separate chips. However, a more elegant solution exists. It is better to store a copy of the CB on every VSP in each column, since most of the commercially available VSP's have some on-chip RAM. This also reduces the number of off-chip transfers for the CB's drastically (by a factor of m²).
For the RW's, this solution would be less elegant because there would still be communication with the neighboring columns, such that the number of transfers is not decreased in spite of the additional memory. Instead of simple RAM's for the RW line buffers, an extra VSP with on-chip RAM can be used for every column. This VSP can handle the addressing of its own RAM; moreover, it can postprocess the results of its corresponding column, i.e. execute BO2, such that the other VSP's in that column are relieved of this job. So, each column would consist of a number of VSP's executing BO1's, and one VSP which provides them with the necessary pixels and takes care of the BO2's (Fig. 4). The VSP's are labeled VSP1 and VSP2 in order to indicate their functional difference. Every VSP is connected to exactly two buses, which are shared with its neighbors. In general, this is not a problem for the commercially available VSP's.
5.2 Pairwise Bus Merging
In case m ≥ 2n, transfers can be shared between neighboring columns, and the buses are only used during at most one half of the time on average. The load on the odd and even buses as a function of time is indicated in Fig. 9, for the parameters of section 3.1. In that case, it is possible to reduce the number of buses by another factor of two. However, a large number of multiplexers and switches would then be necessary to connect each VSP to the right bus at different moments in time. If the system is realized with discrete components, it is certainly better not to merge the buses, because of the extra chips. In case it would be integrated on a chip, this merging can be used, although some flexibility would be lost.
Figure 9: Bus occupancy of the odd and even buses
5.3 Partitioning over Custom Chips
An important advantage of this architecture is that it can be easily partitioned for potential custom integration without inter-partition data flow. If the system were cut along one of the vertical buses, it is sufficient to add an additional VSP2 at both sides of the cut,
which contains a copy of the part of the RW line buffer located at the opposite side of the cut. This way, data exchange between partitions is avoided.
6 DISTRIBUTION OF INCOMING PIXELS, CB PIXEL TRANSFERS AND OUTGOING RESULTS
Another problem to be tackled is the transfer of pixels from the frame memory to the distributed line buffers. As already mentioned, the pixels arrive row-wise via two input buses. In case m ≥ 2n, there are plenty of cycles available on the buses, and a demultiplexer can send the pixels to the VSP2's at the right moment, both for the pixels of the current frame and the previous frame. However, the pixels arrive synchronously and this will result in bus conflicts. Therefore, one of the pixel streams has to be delayed by 2n cycles. That way there are never two pixels being sent to the same VSP2. The reason for the delay to be 2n cycles is that in that case, the pixels are always sent to two odd or two even buses, which have free cycles at the same time. This is also indicated in Fig. 4. The output wires of one of the demultiplexers are shifted by two positions relative to the other demultiplexer. If m < 2n, the buses are no longer free during half of the time. In that case, there are two possibilities. One can increase the number of input buses of the VSP2's or one can add a large number of demultiplexers and/or switches. Both solutions, however, would be rather inefficient. In practice, however, m is usually equal to 2n, such that this problem doesn't occur. A similar problem occurs when the results (e.g. motion vectors) are sent to the output bus. However, the output rate is typically 1 or 2 orders of magnitude less than the pixel rate, which makes this problem less critical. A multiplexer, connected in a similar way as the input demultiplexers, can be used for this. This is indicated in Fig. 4. Note that the multiplexers and demultiplexers in Fig. 4 can be realized with a few commercial single chip routing FPGA's. The last problem to be tackled is the transfer of the remaining data between the VSP1's and the VSP2's, i.e. the CB pixels and the results of the BO1's. During the line period W, every VSP1 should receive n new pixels for its CB. On top of that, every VSP1 should send its results to the corresponding VSP2 from time to time. In case m ≥ 2n, this can be done easily because there are still plenty of free cycles on every bus. In case m < 2n, the problem is somewhat different. Generally, there are (slightly) more cycles available than necessary for the execution of the BO1's, which means that there are extra free cycles on the buses (apart from the possible free cycles due to the sharing of transfers). If the number of free cycles is large enough, they can be used for the CB pixel and result transfers. If not, then the number of VSP1 rows has to be increased, such that more cycles are available.
7 USING MORE POWERFUL VSP'S
Up till now, the proposed architecture contains a number of VSP's which are assumed to operate at the pixel frequency. In practice this number can become rather large [⇒ ~50
VSP's]. However, several high-performance VSP's have been published [14, 15, 9] which operate at very high clock rates (up to 250 MHz). They also contain large on-chip memories and are capable of executing several operations in parallel. The architecture proposed here can be easily mapped onto a realistic number of these processors by folding rows and/or columns of VSP's onto each other, while still maintaining its flexibility [⇒ e.g. eighteen 250 MIPS processors]. The global memory and interconnect requirements would not be changed, except for the CB line buffer, since it would be duplicated fewer times.
8 CONCLUSION
In this paper, a flexible programmable architectural template was presented, which is able to execute regular block-oriented video algorithms in real time. It is fully compatible with off-the-shelf VSP's and could be integrated on a board with very few routing chips or on a few custom chips. It has been shown that a detailed analysis of the data flow and a novel tuned choice of the bus architecture can lead to a significant reduction of the memory requirements and the number of transfers.
Acknowledgments
This research was sponsored by Siemens, München, Germany.
References
[1] M. Adé, R. Lauwereins, J.A. Peperstraete, "Buffer Memory Requirements in DSP Applications", Proceedings of the 5th IEEE International Workshop on Rapid System Prototyping, Grenoble, France, pp. 108-123, June 21-23, 1994.
[2] A.A. Argyros and S.C. Orphanoudakis, "Load Redistribution Algorithms for Parallel Implementations of Intermediate Level Vision Tasks", Proceedings of Dartmouth Advanced Graduate Studies Symposium on Parallel Computing (DAGS/PC), Dartmouth College, Hanover, New Hampshire, pp. 162-175, June 1992.
[3] U. Banerjee, D. Gelernter, A. Nicolau, D. Padua, editors, Languages and Compilers for Parallel Computing, Fourth International Workshop, Santa Clara, California, USA, August 7-9, 1991, Proceedings.
[4] K.M. Baumgartner, B.W. Wah, "Load Balancing Protocols on a Local Computer System with a Multiaccess Network", International Conference on Parallel Processing, St. Charles, Illinois, pp. 851-858, August 1987.
[5] A. Darte, T. Risset, and Y. Robert, "Loop nest scheduling and transformations", J.J. Dongarra and B. Tourancheau, editors, Environments and Tools for Parallel Scientific Computing, North Holland, 1993.
[6] L. De Vos, M. Stegherr, "Parameterizable VLSI Architectures for the full-search block-matching algorithm", IEEE Transactions on Circuits and Systems, Vol. 36, pp. 1309-1316, October 1989.
[7] M. Dubois, C. Scheurich, and F. Briggs, "Synchronization, Coherence, and Event Ordering in Multiprocessors", IEEE Computer, Vol. 21, no. 2, pp. 9-21, February 1988.
[8] M. Engels, R. Lauwereins, J. Peperstraete, "Design of a processing board for a programmable multi-VSP system", Special Issue of Journal of VLSI Signal Processing, Vol. 5, pp. 171-184, 1993.
[9] J. Goto et al., "A 250 MHz 16b 1-million transistor BiCMOS super-high-speed video signal processor", IEEE ISSCC91, pp. 254-255, 1991.
[10] P. Hoang, J. Rabaey, "Scheduling of DSP programs onto multiprocessors for maximum throughput", IEEE Transactions on Signal Processing, Vol. 41, no. 6, June 1993.
[11] Y. Jehng, L. Chen, T. Chiueh, "An efficient and simple VLSI tree architecture for motion estimation algorithms", IEEE Transactions on Signal Processing, Vol. 41, pp. 889-900, February 1993.
[12] T. Komarek, P. Pirsch, "Array Architectures for Block Matching Algorithms", IEEE Transactions on Circuits and Systems, Vol. 36, October 1989.
[13] K. Konstantinides, R. Kaneshiro, J. Tani, "Task Allocation and scheduling models for multiprocessor digital signal processing", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 38, no. 12, December 1990.
[14] M. Maruyama et al., "A 200 MIPS image signal multiprocessor on a single chip", ISSCC90, pp. 122-123, February 1990.
[15] T. Minami et al., "A 300-MOPS video signal processor with a parallel architecture", IEEE JSSC, Vol. 26, no. 12, December 1991.
[16] P. Quinton and Y. Robert (eds.), "Algorithms and parallel VLSI architectures II", Elsevier, 1992.
[17] Q. Stout, "Mapping vision algorithms to parallel architectures", Proceedings of the IEEE, Vol. 76, no. 8, Aug. 1988.
[18] A. van Roermond et al., "A general-purpose programmable video signal processor", IEEE Transactions on Consumer Electronics, pp. 249-257, Aug. 1989.
[19] K. Yang, M. Sun, L. Wu, "A Family of VLSI designs for the motion compensation block-matching algorithm", IEEE Transactions on Circuits and Systems, Vol. 36, pp. 1317-1325, Oct. 1989.
INSTRUCTION-LEVEL PARALLELISM IN ASYNCHRONOUS PROCESSOR ARCHITECTURES
D. K. ARVIND and V. E. F. REBELLO
Department of Computer Science, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland, U.K. {dka, vefr}@dcs.ed.ac.uk
ABSTRACT. The Micronet-based Asynchronous Processor (MAP) is a family of processor architectures based on the micronet model of asynchronous control. Micronets distribute the control amongst the functional units, which enables the exploitation of fine-grained concurrency, both between and within program instructions. This paper introduces the micronet model and evaluates the performance of micronet-based datapaths using behavioural simulations.
KEYWORDS. Instruction-level parallelism (ILP), asynchronous processor architecture, self-timed design.
1 INTRODUCTION
Centralised controls have been traditionally used to correctly sequence information within processor architectures. However, the ability to sustain this design style is under pressure from a number of directions [6]. This paper examines the effect of relaxing this strict synchrony on the design and performance of processor architectures. The reasons are the following. First, the clock frequency of a synchronous processor is determined a priori by the speed of its slowest component (which takes into account worst-case timings for execution and propagation for pessimistic operating conditions). In contrast, the performance of an asynchronous processor is determined by the actual operational timing characteristics of individual components (effectively the average delays), and the overheads due to asynchronous controls. Secondly, an important consequence of asynchronous controls is the ability to exploit fine-grained Instruction-level Parallelism (ILP), and this is explored in greater detail in the rest of this paper. ILP can be achieved either by issuing several independent instructions per cycle as in superscalar or VLIW architectures, or by issuing an instruction every cycle as in a pipelined
architecture where the cycle time is shorter than the critical path of the individual operations [5]. This work concentrates on the design and evaluation of asynchronous pipelines for exploiting ILP, as a number of control issues resulting from data and structural dependencies between instructions have to be resolved efficiently. A few asynchronous processors have recently been proposed [3, 8, 9]. These designs are based on a single micropipeline datapath [10]. One disadvantage of viewing a datapath as a linear sequence of stages is that, in general, only one of the functional units will be active in any cycle. Pipelining the functional units themselves is expensive both in terms of additional hardware and the resulting increase in latency. We introduce an alternative model for an asynchronous datapath called a micronet. This is a network of elastic pipelines in which individual stages of the pipelines have concurrent operations, and stages of different pipelines can communicate with each other asynchronously. This allows for a greater degree of fine-grained concurrency to be exploited, which would otherwise be quite expensive to achieve in an equivalent synchronous datapath.
2 MICRONETS AND ASYNCHRONOUS ARCHITECTURES
Micronets are a generalisation of Sutherland's micropipeline [10], which dynamically control which stages communicate with each other. Thus micronets can be viewed not just as a pipeline but rather as a network of communicating stages. The operations of each of the stages are further exposed in the form of microagents which operate concurrently and communicate asynchronously with microagents in other stages. Each program instruction spends time only in the relevant stages and for just as long as is necessary. This is in contrast with synchronous datapaths in which the centralised control forces each instruction to go through all the stages, regardless of the need to do so (in effect a single pipeline). Furthermore, the microagents within a stage might operate on different program instructions concurrently. Micronets are controlled at two levels: the data transfer between microagents is controlled locally, whereas the type of operation carried out by a microagent (called a microoperation) and the destination of its result is controlled by the sequencer or by other microagents. Microagents can communicate either across dedicated lines or via shared buses where arbitration is provided either by the sequencer or some other decentralised mechanism such as a token ring. Data dependencies in synchronous pipelines are resolved by using either hardware or software interlocks [4], which increases the complexity of the controls. Micronets use their handshaking mechanisms together with simple register locking to achieve the same effect, but with trivial hardware overheads. In synchronous designs the structural hazards are normally avoided in hardware by using a scoreboarding mechanism. In micronets this is provided by existing handshaking protocols. Out-of-order instruction completion can be supported in synchronous designs, but at a non-trivial cost. Micronets are able to relax the strict ordering of instruction completions and thereby further exploit ILP. The result is to effectively increase the utilisation of the functional units by reducing their idle times or stalls. Better program performances can be achieved by exploiting both ILP and actual
instruction execution times.
2.1 Asynchronous Architectures
Figures 1-3 illustrate micronet models of a generic asynchronous RISC datapath. The intention is not to focus on the functional units themselves but rather on their asynchronous control and to investigate their effect on the performance. The number of units and their functionality may be changed without side-effects. The architecture can be described as a network of microagents (denoted by solid boxes) which are connected via ports. The microagents which are labelled in the figures, called Functional Microagents (FMs), perform microoperations which are typical of a datapath. On each of their ports are Communicating Microagents (CMs) which are responsible for asynchronous communications between FMs and the rest of the micronet. The FMs are effectively isolated and only communicate through their CMs, and can therefore be modified without affecting the rest of the micronet.
2.2 Measuring Performance
We next introduce a few metrics for measuring improvements due to the distribution of control. There are two principal characteristics which affect performance: the microoperation latency (the time between initiating the operation and the result being available) and the microoperation cycle time (the minimum time between successive initiations of the same operation, i.e. the throughput). The metrics defined for MAP are as follows:
Minimum Datapath Latency (MDL) - The time between asserting the control signals (i.e. initiating instruction issue) and receiving the final acknowledgement of the instruction's completion.
Instruction Cycle Time (ICT) - The time between two identical instruction issues once that instruction's pipeline is full. In asynchronous pipelines, which usually have non-uniform stage delays, the time between successive instruction issues is influenced by the slowest stage currently active in the pipe.
Program Execution Time (PET) - The actual execution time of the program.
A more detailed exposition of performance-related issues is presented in [1]. To study the effectiveness of the micronets, it is sufficient to focus on the LD, ST, and ALU instructions. Five simple test programs were devised to exercise the design. The Alu, Load and Store test programs measure the maximum attainable utilisation of their respective FMs. Each of these programs contains a number of identical instructions, such that only structural dependencies exist between instructions (in effect setting up a static pipeline or a fixed path through a network of components). The number of instructions in the test programs is sufficient to fill the pipeline, i.e. enough instructions exist for the Control Unit (CU) to achieve a steady issue rate. The Hennessy Test (HT1) consists of a mix of the previously-mentioned instructions but without any data dependencies, which
exercises the spatial concurrency and out-of-order completion, both of which are provided by the micronet, for a particular schedule devised by the compiler. HT2 is a variant of HT1, with data dependencies, which exercises the data forwarding mechanism as well. This program represents a "typical" basic block of compiled code (actually a line of code in C from [4]). To facilitate the simulation of instruction sequences within reasonable run-times and without sacrificing accuracy, the timing characteristics of the architecture (in 1.5 µm CMOS) were extracted from a post-layout simulation tool within a commercial VLSI design package called SOLO 1400 [2] and incorporated into a mixed-level (mainly register-transfer level) model. The processor was described in Occam2 and simulated on a parallel asynchronous event-driven simulation platform, on a transputer-based MEiKO Computing Surface.
3 REFINEMENTS
The following sections discuss a number of refinements which were made in three stages to the base design as shown in Figure 1. This highlights the ease with which the micronet model can efficiently exploit ILP, without the difficulties normally encountered in synchronous datapath design, such as implementing hazard avoidance, data-forwarding or balanced pipeline-stage design. The processor design as illustrated in Figure 1 only exploits the actual execution timings of microoperations (Stage 1), whereas later designs exploit both this property and the available concurrency between the microoperations of different instructions. The execution of each instruction requires a predetermined set of microoperations, each initiated by signals from the CU. These are four-phased controls whose acknowledgement signals are used as status flags for mimicking a scoreboarding mechanism. In general, the microoperations for an instruction are initiated as soon as possible by asserting the necessary control signals. The receipt of an acknowledgement confirms that the associated microoperation has begun and the initiating control signal is de-asserted. The instruction is said to be issued once all the asserted control signals have been acknowledged, which then allows the next instruction issue to begin.
3.1 Stage 1
Figure 1 illustrates a naive implementation of the datapath of an asynchronous processor, which does not as yet exploit the full repertoire of micronets. The control signals generated by the CU for Stage 1 are described in greater detail below:
Rx (Ry) - This signal identifies the source register for the X (Y) Bus. The corresponding acknowledgement is asserted once the register has been accessed, and cleared once the data has been transferred to the operand fetch CM.
Rz - Same as above. The ST microoperation obtains the third operand over the Z Bus.
Rof - Same as above, but the value in the offset register is output onto the X bus.
Figure 1: The micronet model of Stages 1 & 2
AUs - This signal identifies the next operation of the ALU. The corresponding acknowledgement is asserted when the interface is ready to fetch the ALU's operands from the registers and is cleared when it initiates the write-back handshake.
MUs - This signal identifies a load instruction to the MU and is asserted and cleared in the same manner as above. (Control signals for the other MU microoperations have been omitted for the sake of clarity.)
ZMs - This signal identifies the destination register for data write-backs from the ALU or MU via the Z bus. The corresponding acknowledgement signal is asserted when the register is ready to receive data and cleared once the data has been written back.
In Stage 1, all the microoperations for a particular instruction are initiated together, and the next set cannot be initiated until the completion of the set of microoperations of the previous one. This effectively serialises the instruction execution, as illustrated in the timing diagram in Figure 1. In successive refinements the rôle of the CU is diminished by distributing the control of the micronet to local interfaces, and microoperations are individually initiated as early as possible.
Instruction    Inst. Cycle Time (ICT)    Datapath Latency (MDL)
ALU            24 ns                     24 ns
LD             43 ns                     43 ns
ST             23 ns                     21 ns
Table 1: Instruction Execution on Stage 1
In the base stage, the ICT is determined by the slowest control signal handshake, since the next instruction issue cannot begin until all the previous handshakes have been completed. The results in Table 1 show that the ICT is equal to the MDL (except for the ST instruction), which is not surprising as instructions execute sequentially but only take as long as is necessary. The higher value for the ST instruction is due to a handshake delay, which in the LD instruction is hidden by the write-back stage. Although there is no explicit pipelining of the datapath, different phases of the handshaking may occur at the same time, e.g. a CM may initiate a handshake with another CM while completing one with its FM. As expected, the execution times of the test programs (Table 5) are the sum of their individual instruction execution times together with startup overheads.
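This observation amounts to a simple cost model for the test programs: the execution time of k identical instructions is roughly a start-up overhead plus the MDL of the first instruction plus (k - 1) further issues at the ICT. The C sketch below (our own approximation, not the authors' simulator; the overhead value is assumed) applies this with the Stage 1 ALU figures of Table 1.

    #include <stdio.h>

    /* Rough estimate: start-up overhead + first-instruction latency (MDL)
       + (k - 1) further issues at the instruction cycle time (ICT).
       The overhead value is an assumption for illustration only.          */
    static double pet_estimate(int k, double mdl_ns, double ict_ns, double overhead_ns)
    {
        return overhead_ns + mdl_ns + (k - 1) * ict_ns;
    }

    int main(void)
    {
        /* Stage 1 ALU figures from Table 1: ICT = MDL = 24 ns. */
        printf("Alu test, 10 instructions: ~%.0f ns\n",
               pet_estimate(10, 24.0, 24.0, 10.0));
        return 0;
    }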
3.2 Stage 2
The strict condition which was employed in Stage 1 for initiating a set of microoperations after the completion of the previous set is now relaxed. Furthermore, the CU can now assert any of the individual microoperations for an instruction asynchronously, where previously the set of microoperations for an instruction was initiated in unison. This allows microoperations relating to different instructions to overlap (Stage 2 in Figure 1). Note that a control signal which is related to an instruction can only be de-asserted once all of the relevant control signals have been acknowledged. The effect of relaxing this constraint is to introduce possible hazards, and efficient mechanisms have been devised to avoid them. Fortunately, these hazard avoidance mechanisms are implicit in the orderings of the assertions of the control signals, known as the pre-issue conditions, and these are discussed below:
Read-after-Write (RAW) - A register locking mechanism is implemented in the register bank without the CU having to keep track of the "locked" registers. The acknowledgement signal ZMs is asserted after the locking operation, and is de-asserted once the data is written back (signaling the unlocking of the register). By definition an instruction is issued once all the acknowledgements of the relevant microoperations have been received. This implies that the destination register of the previous instruction will have been locked before the CU initiates any of the current instruction's microoperations.
Write-after-Read (WAR) - This hazard is avoided without additional hardware overheads. When a register is used as both source and destination within the same instruction, it is necessary to ensure that the source data is obtained before the register is locked, otherwise deadlock will occur. The CU stalls the assertion of ZMs until the source operand control signals Rx and Ry have been asserted.
Write-after-Write (WAW) - Although concurrent instruction execution can now take place, write-backs are still enforced in order. It is necessary to ensure that the destination register has been locked, and that data is then written to its correct location. These conditions are met by simply preventing a functional unit (FU) from writing data back until the control signal from the CU has been de-asserted (an implicit go-write signal). This is sufficient since the control signals cannot be de-asserted before ZMs is asserted (see Figure 1). Note that if the CU attempts to lock a register which is already locked, then the acknowledgement signal cannot be asserted and the current request will stall. This mechanism guarantees that write-backs to the same register occur in the correct order without stalling the instruction issue, thereby allowing the instructions to execute concurrently with only the write-backs being sequential. The CDC6600 [11] used a similar go-write signal which sequentialised the execution of the offending instructions.
Operand Fetch - Simultaneous operand requests by FUs to the same Register Bank CM microoperation can lead to one of them acquiring the wrong operand. This can be avoided by the CU by delaying the assertion of the control signal to a FU until the previous FU has made its operand request(s) to the registers, i.e. until the acknowledgement signals of the "operand fetch" microoperations have been de-asserted.
Bus Contention - Due to the mechanism to avoid WAW hazards, only the Register Bank and either the ALU or the Memory Unit can write onto the Z Bus simultaneously. Thus bus access is arbitrated by the CU through mutually-exclusive assertions of Rz and ZMs.
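The pre-issue conditions can be pictured with a toy software model of the register locking (a deliberately simplified, sequential sketch, not the hardware handshake protocol): a write locks its destination register when the instruction is issued, a later lock request on the same register stalls until the pending write-back clears it, and reads of a locked register must likewise wait.

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy model of the register locking used for hazard avoidance:
       simplified and sequential, only meant to show the stall rules.        */
    enum { NREGS = 16 };
    static bool locked[NREGS];

    static bool try_read(int r)   { return !locked[r]; }   /* RAW: wait while locked */
    static bool try_lock(int r)                             /* WAW: wait while locked */
    {
        if (locked[r]) return false;
        locked[r] = true;                                   /* destination locked at issue */
        return true;
    }
    static void write_back(int r) { locked[r] = false; }    /* unlock on write-back */

    int main(void)
    {
        try_lock(3);                        /* instruction A locks r3 as destination   */
        printf("read r3 ok? %d\n", try_read(3));   /* 0: a dependent read must stall   */
        write_back(3);                      /* A's result written back, lock released  */
        printf("read r3 ok? %d\n", try_read(3));   /* 1: the dependent read proceeds   */
        return 0;
    }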
Instruction    Inst. Cycle Time (ICT)    Datapath Latency (MDL)
ALU            21 ns                     24 ns
LD             42 ns                     43 ns
ST             23 ns                     21 ns
Table 2: Instruction Execution on Stage 2 The improvements in the instruction cycle times, as shown in Table 2, are small. This can be explained by the limited overlap between the operand access of the current instruction and the write-back of the previous one. In the design under consideration there can only be two program instructions active in the datapath simultaneously. 3.3
Stage 3
In Stage 3, the r61e of the CU is diminished further by distributing the control of the micronet to individual CMs. The CMs have been enhanced to more than just controlling local communications between FMs. They effectively buffer the initiations of the microoperations from the CU until their respective FMs are ready to perform. Also, the write-back to the register bank is no longer controlled by the CU, but directly by the CMs of the FMs which require the service, i.e. the write-back microoperation is initiated by the microoperations in the previous stage.
210
D.K. Arvind and V.E.F. Rebello
Figure 2: The micronet model of Stage 3 Enforcing write-backs in order restricts the degree of concurrency which can be exploited, especially when the FU executions times vary significantly. However supporting out-oforder completion of instructions in an asynchronous environment is more difficult than under synchronous control. Determining the precise order in which results will be available is virtually impossible since microoperation delays vary. Out-of-order instruction completion is supported by tagging the write-back data with the address of the destination register. The CU cannot predict the write-back order, therefore a decentralised bus arbitration scheme as in a token ring is employed. The ring is distributed amongst the CMs and is very simple to implement in VLSI. However, the ring's cycle time will increase with the number of FMs, and might be infeasible for larger numbers. With data transfer on the Z bus being tagged, CMs can identify and intercept operands for which it may be waiting. This mechanism is reminiscent of the IBM 360/91 common bus architecture [12]. Data-forwarding has been implemented by exploiting the feedback loops of the micronet. In the event of data forwarding, where data is routed directly to the CM of a waiting FM, the CM's previous request for that operand is in effect cancelled by initiating a separate handshake. This frees the corresponding "operand fetch" CM to service its next request. An alternative approach would be to implement operand bypassing, where the operand is fed back to the "operand fetch" microoperation. This avoids the need for data forwarding CMs and the cancel handshake. The dual rSle of the Z bus can no longer be supported due to the data-forwarding mechanism. A separate operand fetch bus (W bus) is used, thereby making the Z bus purely a write-back one (see Figure 2). As a result of these modifications, the acknowledgements to the control signals and the pre-issue conditions have to be revised as shown below: Rx, (Ry, R w ) - The acknowledgement is asserted by the CM of the register bank when the X (Y, W) bus operand fetch microoperation is ready, and de-asserted once the operand fetch handshake is in progress. R o f - Same as above. Note that both the control signals Rx and Rof cannot be active
ILP in Asynchronous Processor Architectures
211
simultaneously. AUs,
M U s - The acknowledgement is now asserted when the corresponding CMs are ready to fetch the operands from the registers and is cleared once the FM microoperation has completed.
ZMs
- The acknowledgement signal is asserted when the CM is ready and de-asserted once the operation has been completed (i.e. the register has been locked).
RAW - The CU delays the assertion of the operand fetch control signals Rx, Ry and Rw until the previous ZMs control acknowledgement signal has been de-asserted, which indicates the locking of the previous destination register.

WAW - The mechanism is unchanged except that the go-write signal originates from the register interface and not the CU (i.e. the mechanism has now been decentralised).

Write-back Contention - This is prevented by the use of a token ring to arbitrate accesses to the write-back (Z) bus. Of course, this problem could be obviated by using dedicated buses for a small number of FMs, but this is impractical for designs with larger numbers.

Further concurrency is achieved by applying these pre-issue conditions only when necessary, by explicitly checking register addresses for dependencies between successive instructions.
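As a concrete illustration of these checks, the sketch below models them in software by comparing the register addresses of two successive instructions. The struct and function names are invented for this example; in the micronet the conditions are enforced by the handshake signals described above, not by code.

/* Illustrative model of the pre-issue dependency checks.  The hardware
 * enforces these conditions through the Rx/Ry/Rw and go-write
 * handshakes rather than through explicit tests like these. */
typedef struct { int dest, src1, src2; } Inst;

/* RAW: the current instruction reads a register the previous one writes. */
static int raw_hazard(const Inst *prev, const Inst *cur)
{
    return cur->src1 == prev->dest || cur->src2 == prev->dest;
}

/* WAW: both instructions write the same destination register. */
static int waw_hazard(const Inst *prev, const Inst *cur)
{
    return cur->dest == prev->dest;
}

/* The pre-issue conditions need only be applied when a check succeeds. */
static int needs_pre_issue_delay(const Inst *prev, const Inst *cur)
{
    return raw_hazard(prev, cur) || waw_hazard(prev, cur);
}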
Instruction                ALU     LD      ST
Inst. Cycle Time (ICT)     15nS    38nS    23nS
Datapath Latency (MDL)     24nS    43nS    21nS

Table 3: Instruction Execution on Stage 3

We observe an improvement in the cycle times of instructions which need to write data back to the registers, such as the LD and ALU instructions, as shown in Table 3. This is due to the decentralisation of the write-back control to the relevant CMs. These improvements are reflected in the shorter PETs for the Load, Alu and HT1 test programs, as shown in Table 5. Columns "HT2" and "HT2(DF)" refer to the cases without and with data-forwarding, respectively.

3.4 Stage 4
In this final stage, both the assertion and de-assertion of the control signals occur independently of each other. The states of the FU acknowledgements no longer represent the activity of their FMs, but rather that of their operand-fetch CMs. All of this further increases the concurrency between microoperations, which makes possible the exploitation of fine-grained concurrency between instructions. The ICT value for the LD instruction in Table 4 is the best attainable, as it represents the MU delay for the operation.
Figure 3: The micronet model of Stage 4
Table 4: Instruction Execution on Stage 4
These figures show that the micronet can exploit the actual operational cost and effectively hide the overheads of self-timed design. The ICTs for the ALU and ST instructions are limited by their operand fetch cycle times. The overall improvements in the program execution times in Stage 4 over Stage 1 for the first three test programs (shown in Table 5 and Figure 4) are due to improvements in temporal concurrency arising from the pipelining of the datapath. The actual speed-up which is achieved is less than the maximum attainable improvement (the ratio of the ICTs in Tables 1 and 4), due to the MDL and the start-up overheads (for longer test programs the speed-up will approach this maximum value). The speed-up for HT1 is due in part to the pipelining of the instructions, as observed in the other test programs, but also to additional spatial concurrency from the overlapping of different instructions in the same stage of the micronet. This further improvement is still significant (approximately 17% in this example) given that both successive instruction operand fetches and write-backs are effectively forced to take place sequentially due to resource constraints. (In fact, since these delays are larger than the FM delays for the Store and ALU operations, the scope for spatial concurrency in this particular example is quite small.)
PET                  Alu Test   Load Test   Store Test   HT1      HT2      HT2(DF)
Stage 1              175nS      308nS       164nS        143nS    143nS    -
Stage 2              157nS      302nS       165nS        119nS    119nS    -
Stage 3              121nS      280nS       165nS        83nS     97nS     91nS
Stage 4              103nS      188nS       98nS         79nS     -        91nS
Effective Speed Up   1.75       1.66        1.71         1.89     -        1.62

Table 5: Execution Times of the Test Programs
Figure 4: Comparison of Execution Times of the Test Programs

As the number of microagents in each stage is increased, the spatial concurrency effect will be more pronounced. The speed-up for HT2, as expected, reflects the reduced concurrency which can be exploited, due to the data dependencies in the program. In summary, the rôle of the CU in an asynchronous processor has been considerably simplified, to just initiating individual microoperations as early as possible. The control of the datapath is distributed to local interfaces, courtesy of the micronet.
4 CONCLUSIONS
This work has investigated the influence of an asynchronous control paradigm on the design and performance of processor architectures. By viewing the datapath as a network of microagents which communicate asynchronously, one can extract fine-grain concurrency between and within instructions. The micronet can be easily implemented using simple self-timed elements such as Muller C-elements [7] and conventional gates. Future work will investigate the suitability of asynchronous processors as targets for optimising compilers.
Acknowledgements

V. Rebello was supported by the U.K. Engineering and Physical Sciences Research Council (EPSRC) through a postgraduate studentship. This work was partially supported by a grant from EPSRC entitled Formal Infusion of Communication and Concurrency into Programs and Systems (Grant Number GR/G55457).

References
[1] D. K. Arvind and V. E. F. Rebello. On the performance evaluation of asynchronous processor architectures. In International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'95), Durham, NC, USA, January 1995. IEEE Press.
[2] European Silicon Structures Limited. Solo 1400 Reference Manual. ES2 Publications Unit, Bracknell, U.K., 1990.
[3] S. B. Furber, P. Day, J. D. Garside, N. C. Paver, and J. V. Woods. A micropipelined ARM. In T. Yanagawa and P. A. Ivey, editors, IFIP TC 10/WG 10.5 International Conference on Very Large Scale Integration (VLSI'93), pages 5.4.1-5.4.10, Grenoble, France, September 1993.
[4] J. Hennessy and T. Gross. Postpass code optimisation of pipeline constraints. ACM Transactions on Programming Languages and Systems, 5(3):422-448, July 1983.
[5] N. P. Jouppi. Available instruction-level parallelism for superscalar and superpipelined machines. In ASPLOS III, pages 272-282, April 1989.
[6] C. Mead and L. Conway. Introduction to VLSI Systems. Addison-Wesley, Reading, Mass., 1980.
[7] R. E. Miller. Switching Theory. Volume II: Sequential Circuits and Machines. John Wiley and Sons, 1965.
[8] W. F. Richardson and E. L. Brunvand. The NSR processor prototype. Technical Report UUCS-92-029, Department of Computer Science, University of Utah, USA, 1992.
[9] R. F. Sproull, I. E. Sutherland, and C. E. Molnar. Counterflow pipeline processor architecture. Technical Report SMLI TR-94-25, Sun Microsystems Laboratories Inc., April 1994.
[10] I. E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720-738, June 1989.
[11] J. E. Thornton. Design of a Computer: The Control Data 6600. Scott Foresman and Company, 1970.
[12] R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 11(1):25-33, January 1967.
Algorithms and Parallel VLSI Architectures III, M. Moonen and F. Catthoor (Editors) © 1995 Elsevier Science B.V. All rights reserved.
HIGH SPEED WOOD INSPECTION USING A PARALLEL VLSI ARCHITECTURE

M. HALL, A. ÅSTRÖM
Image Processing Group, Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden
[email protected], [email protected]

ABSTRACT. Image processing techniques have found use in the forest products industry to detect various defects on wood surfaces. Normally, the applications have been targeted towards industries that use length transport of wooden boards (approx. 1 m/s). Where crosswise transport is used (corresponding approximately to length transport at 6 m/s), the computational demands on the systems are very high, too high for today's inspection systems. In this paper we present a parallel computer architecture that is targeted towards automatic inspection in the forest products industry. We implement an algorithm that is able to detect various surface defects in wooden boards. The implemented algorithm resembles an algorithm that is used in the existing wood industries. We show that the new architecture permits scanning of boards fed lengthwise at a speed of 100 m/s. This indicates that we can use the architecture in applications where the boards are fed crosswise, and that we can use more sophisticated algorithms for higher accuracy of the inspection.

KEYWORDS. SIMD, automatic wood inspection, real time image processing

1 INTRODUCTION
Image processing algorithms often demand very high execution speed if they are to run in real time, which is required in industry. The wood industry is one of the fields where image processing has found practical use. There exist a number of systems, commercial as well as research systems [5][6][15], for automatic wood inspection and quality sorting. The basic task for such a system is to detect surface defects, e.g. knots, wane edges, resin pockets etc. These systems need to do this at high speed. Using length transport, the feeding speed of a piece of wood is typically as high as 1-2 m/s. If the boards are fed crosswise, the computational demands are even higher since cross transport corresponds to
length transport with a feeding speed of 6 m/s (Figure 1). Let us calculate the number of image pixels per second. We assume that the resolution of the pixels is 0.1x1.0 mm (typically it is 0.1x1.0 mm to 0.2x2.0 mm). If we are scanning a board that has a width of 200 mm, the system needs to handle more than 2 million pixels per second. In an environment where the boards are fed crosswise, the boards can have a width of 6 m and the number of pixels per second will then be 6 million.
Figure 1: Length transport and cross transport of wooden boards.

A standard sequential computer often fails to deliver the desired performance. However, in most cases we can divide the algorithm into smaller parts, letting different processors solve different parts of the algorithm, i.e. use a parallel computer architecture. There exist a number of possible parallel architectures to choose from. We can for instance use digital signal processors, pipelined processors, MIMD or SIMD architectures [1][14]. Two different solutions to the partitioning problem are a data-serial pipeline of processors and a data-parallel pipeline of processors, respectively; see Figure 2. This paper describes a parallel architecture, WVIP, which belongs to the category of SIMD architectures. SIMD architectures are very effective for low-level signal processing. They also have the property of being easier to control and synchronize compared to the other parallel architecture class, MIMD. The processing elements in WVIP are bit-serial, which makes it possible to integrate many processors on one chip. When comparing integer multiplication/(area*time) [9], parallel bit-serial processors are more effective to implement than sequential bit-parallel processors.
2 VIP, VIDEO IMAGE PROCESSOR

2.1 History
VIP, Video Image Processor, was originally designed as a general purpose video processing array, see [18] and [3]. VIP was designed to be a modular SIMD-architecture, if necessary it should be easy to modify the architecture to suit some special needs. In this sense VIP can be said to be a flexible SIMD-concept. VIP was originally designed to consist of a micro-controller and a number of processing elements, PE's. Each PE consists of a number of units, e.g. arithmetic unit, memory etc. The idea was that it should be possible to remove and/or resize the units depending on the applications. A modified version of VIP was designed in 1993, called RVIP (Radar VIP) which was
Figure 2: The difference between a data-serial and a data-parallel pipeline processor array. aimed at radar signal processing [13][12]. RVIP was designed to perform static signal processing, i.e. the number of operations are constant from line to line. In the case that the operations are data dependent, for instance as in segmentation, the current RVIP and VIP design has some drawbacks. The main problem when mapping a data dependent algorithm to a SIMD-architecture is that we need to perform different parts of the algorithm depending on the current data [4]. In a data dependent algorithm we need to be able to extract information from the PE's and use this information to decide which part of the algorithm to execute. This cannot be done in the VIP or RVIP. This new design of VIP is targeted towards industrial applications in general and wood inspection in particular. This design is called WVIP, Wood VIP. We are currently investigating four slightly different version of VIP, FVIP (airborne radar signal processing), WVIP, IVIP (infra red signal processing), UVIP (Ultrasound signal processing) [8].
2.2 General Layout of the WVIP
The WVIP consists of a linear SIMD array of 64 bit-serial processing elements. Each processing element, PE, consists of one bit-serial ALU with a 32-bit accumulator, a 10-bit serial-parallel multiplier, 2048 bits of static RAM, a 16-bit internal bidirectional shift register and four two-ported 32-bit IO registers. These units are the same as in RVIP [13][12]. WVIP also has a new unit called the GLU (section 2.4), which is used to extract data-dependent information from the PE's. The chip is designed to run at a clock frequency of 50 MHz.
To avoid clock skew problems, the communication between different chips is limited to a clock frequency of 20 MHz. In Figure 3, the floor plan of a single WVIP-chip is drawn. A special 2x2" Multi Chip Module, MCM, with eight WVIP chips, a micro controller, and a program memory has been designed. In the MCM, the micro controller sends the program instruction to the processing elements in WVIP.
Figure 3: Floor plan of a single WVIP chip. The processing elements are connected to each other via the shift registers.

2.3 Layout of the Processing Elements
The layout of a single PE is shown in Figure 4. All communication between the different units within each PE is performed via the one-bit PE-bus, except communication between the ALU and the accumulator. The ALU and the accumulator communicate via two one-directional buses. The output from the accumulator can be connected to the PE-bus by closing a bus switch. This makes it possible to write data from the accumulator to other units connected to the PE-bus. Data from other units to the accumulator must be written through the ALU. The PE-bus has broadcast facilities, so that one bit of data can be loaded from one unit and stored in several other units during one clock cycle.
Figure 4: Layout of a single processing element in WVIP.
2.4 GLU, Global Logical Unit
WVIP includes a unit called the Global Logical Unit, GLU. The GLU performs global operations on the whole PE array. The functions implemented are MARK, LMARK, RMARK, LRMARK, FILL, LFILL, RFILL, and LRFILL. These functions are the same as those found in the GLU in LAPP 1100 and MAPP 2200 [10][7]. LAPP 1100 and MAPP 2200 are two commercially available smart image sensor arrays with an internal SIMD processor architecture. LAPP 1100 has 128 sensor and processing elements (1D). MAPP 2200 has 256x256 sensors and 256 processing elements. Neither of these chips can be cascaded. Each GLU function takes two binary operands (a, b). We denote the functions MARK(a, b), FILL(a, b) and so on. Operand a is either an operation-specific constant or the content of the ALU register A. Operand b is the current data on the PE-bus. For the description it is convenient to visualize the operands as two registers, with length equal to the number of processing elements in the array. The effect of the operation MARK is that those objects in operand b are kept which are vertically connected to an object in operand a, see Figure 5. FILL is complementary to MARK: holes in operand b which are connected to objects in operand a are kept; all other bits are set.
Figure 5: Description of the GLU instructions MARK and FILL.

Two more global functions, GOR and COUNT, are also included. GOR, Global OR, has one operand, a. Operand a is the content of the GLU register G. The result of applying GOR(a) is true if any G register holds the binary value one; otherwise the result is zero. COUNT is a fast ripple counter, which counts the number of ones in the G registers. The results of GOR and COUNT are stored in a status register, placed in the external micro controller.
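To make the MARK semantics concrete, the following sequential C sketch models the operation on plain integer arrays, one element per PE. It reflects one plausible reading of the description above, in which an object is a maximal run of ones in operand b and is kept only if it touches a one in operand a; the real GLU evaluates this globally across the PE array in hardware, and FILL is handled analogously on the runs of zeros.

/* Software model of the GLU MARK operation on a line of PE bits.
 * An "object" is a maximal run of 1s in b; MARK keeps exactly those
 * runs of b that touch a 1 in a.  Illustration only. */
void glu_mark(const int *a, const int *b, int *out, int n)
{
    int i = 0;
    while (i < n) {
        if (!b[i]) { out[i++] = 0; continue; }
        int start = i;                      /* locate the run of 1s in b */
        while (i < n && b[i]) i++;
        int keep = 0;
        for (int j = start; j < i; j++)     /* does the run touch a? */
            if (a[j]) { keep = 1; break; }
        for (int j = start; j < i; j++)
            out[j] = keep;
    }
}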
2.5 Control Unit

The control unit consists of three parts: a controller, a microcode sequencer and a status register. The controller receives the instructions from the host computer. These instructions can be of two kinds: instructions that are executed within the controller, and instructions that are executed by the PE's. Instructions that are executed by the PE's (called PE-instructions) are sent to the sequencer (Figure 6), which is the same unit as the controller in RVIP. The sequencer translates these instructions to microcode instructions; it also unrolls loops. Since the PE's are bit-serial, and we almost always work with eight bits or more, the same instruction in the algorithm can be executed within a loop where the
loop ranges over the size of the data [11].
Figure 6: The connection between the control unit and the PE-array.
The status register holds the results from the COUNT and GOR operations. The controller can execute instructions by itself, without the use of the PE-array. For instance, it can repeat a number of PE-instructions depending on the result from a COUNT or GOR operation. The controller, together with the COUNT and GOR information, makes it possible to implement data-dependent algorithms on a SIMD architecture while keeping the execution speed of the algorithm high. A similar approach to the control structure is described in [4].
3 EXAMPLE, WOOD INSPECTION WITH WVIP

3.1 Problem

An automated wood inspection system consists of several parts, see [15]. One part is image processing. Given an image, the problem is to detect defects, e.g. knots and checks (see Figure 7), to determine the structure of the wood, etc. In this study we concentrate on the problem of detecting defects. For the image processing systems of today it is common that the algorithm for detecting defects takes approximately 50% of the total time [16]. The other main parts in the system, classification of defects and calculating the surface structure, take approximately the rest of the time [16]. We assume that we have a system which delivers one line of pixel data at a time to the VIP chip. There exist a number of algorithms for detecting defects, see for example [15] and [2]. The algorithm we chose to implement is very similar to an algorithm used in many wood processing companies, in a system called WoodEye. By using this algorithm, we wanted to see if our proposed extension of VIP could solve the same problem and at a greater speed. The algorithm assumes that the images are already digitized and fed into the WVIP. The classification is neglected, in the sense that it is assumed to be performed by the host computer.
Figure 7: Example of some surface defects in wood. The defects in this image are checks.
3.2 Algorithm
The algorithm can be divided into three parts:

• thresholding the pixel data, which gives objects;
• extracting objects on the same line;
• connecting objects on neighboring lines.

Each part is executed in WVIP and the result, the list of objects, is sent to the host computer for classification. In general a possible defect is darker than the surrounding wood. Of all the detected objects, a few will remain as defects after the classification. Input to WVIP is a line of pixel data. Each processing element is assigned to one input pixel. We use a fixed threshold value to get a line of binary data. A pixel with a value below the threshold may belong to a defect. In one line of data we can have several defects, i.e. several segments. As in all segmentation with a threshold, the actual value of the threshold is essential. In the current implementation we assume a fixed threshold. However, an adaptive threshold can easily be implemented. The global operation GOR, performed on the result of the thresholding, tells us if we have any objects in the current line. If we do, we use COUNT to get the number of defects. For every defect we extract some features of the connected object, e.g. the position and the number of pixels in the current line that belong to the connected object. We also use two more threshold values to calculate how many of the pixels are darker than the first threshold value. These features are used in the classification. A defect may be spread out over several lines. To get the global defect we need to connect segments on neighboring lines. This step in the algorithm is the most complicated one. It is also in this step that the new GLU functions are most useful.
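The first two parts can be pictured with the sequential C sketch below, which thresholds one line of pixels and collects the objects (runs of below-threshold pixels) together with a few per-object features. The struct layout, feature set and names are illustrative only; in WVIP the same work is done in parallel across the PE array using the GOR, COUNT, MARK and FILL primitives.

/* Sequential sketch of thresholding one image line and extracting its
 * objects with a few features.  Thresholds t1 > t2 > t3; names and the
 * feature set are illustrative, not the WVIP implementation. */
#define MAX_OBJECTS 64

typedef struct {
    int start, length;      /* position and size in the current line      */
    int dark2, dark3;       /* pixels darker than the 2nd/3rd thresholds  */
} Object;

int extract_objects(const unsigned char *line, int width,
                    int t1, int t2, int t3, Object *obj)
{
    int count = 0;
    for (int x = 0; x < width; ) {
        if (line[x] >= t1) { x++; continue; }   /* background pixel */
        Object o = { x, 0, 0, 0 };
        while (x < width && line[x] < t1) {     /* grow the object  */
            o.length++;
            if (line[x] < t2) o.dark2++;
            if (line[x] < t3) o.dark3++;
            x++;
        }
        if (count < MAX_OBJECTS) obj[count++] = o;
    }
    return count;            /* what COUNT would report for this line */
}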
The basic outline of this step in the algorithm is as follows:

1. For each object in the current line:
   (a) Check if the object is connected to an object in the previous line (basically MARK operations).
   (b) If so, update the feature database for the connected object with the new data (to get the features, we use FILL operations and COUNT operations).
   (c) If not, create a new entry in the feature database.
2. For each old object, check if it has disappeared (end of defect).
   (a) If so, move the feature database entry to a new entry in the database of found objects.
3. Update the old line with the new line: remove objects that have disappeared and insert objects that have appeared (MARK operations).

In the algorithm we have to check for some special events. One example is that two old objects can be connected to the same object in the new line, i.e. they are actually the same defect. In Figure 8, a part of the wooden beam from Figure 7 is shown together with two lines, representing the old line, accumulated so far, and the current line.
Figure 8: Example of an algorithm for detecting surface defects in wood.
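A sequential C sketch of this line-to-line linking step is given below; it reuses the Object type from the previous sketch and treats two segments as connected when their column ranges overlap. The Defect structure, the feature set and the handling of the special events (e.g. two old defects merging) are simplified and purely illustrative of what the MARK/FILL/COUNT based steps above accomplish.

/* Illustrative sequential version of the linking step: attach each
 * object of the current line to an open defect from the previous line,
 * or start a new one, and close defects that have ended.  The merging
 * of two old defects into one (the special event mentioned above) is
 * not handled in this sketch. */
typedef struct {
    int start, length;     /* extent of the most recent segment   */
    int area;              /* accumulated pixel count             */
    int last_line;         /* last image line the defect was seen */
} Defect;

static int overlap(int s1, int l1, int s2, int l2)
{
    return s1 < s2 + l2 && s2 < s1 + l1;
}

void link_lines(const Object *cur, int ncur, int line_no,
                Defect *open, int *nopen, Defect *done, int *ndone)
{
    for (int i = 0; i < ncur; i++) {             /* step 1              */
        int hit = -1;
        for (int j = 0; j < *nopen; j++)
            if (open[j].last_line == line_no - 1 &&
                overlap(open[j].start, open[j].length,
                        cur[i].start, cur[i].length)) { hit = j; break; }
        if (hit >= 0) {                          /* 1(b): update entry  */
            open[hit].start = cur[i].start;
            open[hit].length = cur[i].length;
            open[hit].area += cur[i].length;
            open[hit].last_line = line_no;
        } else {                                 /* 1(c): new entry     */
            Defect d = { cur[i].start, cur[i].length, cur[i].length, line_no };
            open[(*nopen)++] = d;
        }
    }
    for (int j = 0; j < *nopen; ) {              /* step 2: defect ended? */
        if (open[j].last_line < line_no) {
            done[(*ndone)++] = open[j];
            open[j] = open[--(*nopen)];
        } else j++;
    }
}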
4 PERFORMANCE
The algorithm for detecting surface defects in wood has been designed and written in the VIP assembly language, [13][11], with the new GLU operations added. As customary, [1],
we have made a theoretical analysis¹ of the algorithm and the time it requires to solve the problem. Since the segmentation operations are very data dependent, the results from the theoretical analysis were two numerical expressions, one upper limit and one lower limit on the number of cycles needed to execute one line of data. The evaluation of the algorithm has been simulated using these equations. The numerical expressions were implemented in a computer program. The input data to the program were 17 real wood images, of size approximately 200x1000 pixels, obtained at the ALA sawmill, Söderhamn, Sweden, by an existing wood inspection system called WoodEye. The images were chosen so that they each had some typical defect, i.e. the images have more defects than normal. The preprocessing consisted of removing those objects that consisted of only one pixel, since they would be removed in the classification anyway. The test program thresholded the input data and counted the number of objects in each line. These numbers were then used as input to the theoretical expressions, giving the number of cycles spent on each image line. For each image we translated these results to the mean time per row, see Figure 9. The results show that in most cases we can expect the algorithm to execute one image line in 10 µs. This corresponds to 100 000 image lines per second, which corresponds to a feeding speed of 100 to 200 m/s given the pixel sizes above. Typical feeding speeds in industry today are 1 to 6 m/s. Our aim is not to have a system capable of scanning wooden boards at a speed of 200 m/s. Since the defect-detection algorithm in a totally automatic wood inspection system takes approximately 50% of the total time, we also need time to do the other parts of the inspection system (classification, calculating surface structure etc.). This decreases the maximum scanning speed to approximately 70 m/s. This scanning speed, 70 m/s, shows that there is enough computation capacity to implement more sophisticated algorithms. An example of such an algorithm is described in [17].

¹The new simulator wasn't ready at the time of this workshop.
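The translation from the simulated per-line time to a feeding speed is simple enough to check directly; the short C snippet below reproduces the figures quoted above for a 10 µs line time and the 1.0-2.0 mm along-feed pixel sizes.

#include <stdio.h>

/* 10 us per image line gives 100 000 lines/s; with 1.0-2.0 mm of board
 * passing per line this corresponds to 100-200 m/s, as stated above. */
int main(void)
{
    double line_time_s = 10e-6;
    double lines_per_s = 1.0 / line_time_s;          /* 100 000 */
    double pitch_mm[2] = { 1.0, 2.0 };
    for (int i = 0; i < 2; i++)
        printf("pitch %.1f mm -> %.0f m/s\n",
               pitch_mm[i], lines_per_s * pitch_mm[i] / 1000.0);
    return 0;
}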
5 CONCLUSIONS AND FUTURE WORK
We have shown that the global logical unit in Wood VIP allows us to implement data-dependent algorithms easily on a SIMD array. We have also shown that we can implement a wood inspection algorithm with very high performance. We believe that the Wood VIP can be used as a very powerful general signal processor. In the future we will investigate different designs of the global logical unit and the control unit in order to make them as close to optimal as possible. We will also investigate other applications to verify that the functions in the global logical unit are correctly chosen. Some interesting questions regarding the control unit are its minimum data size (64 bits?) and the delay time between a COUNT operation and the decision to act on it. The current status of the project is that Radar VIP has been evaluated [11] and compared to other digital signal processors by Ericsson Radar Electronics, Sweden. Ericsson found that the Radar VIP was the only processor that could deliver real-time performance in their application. Ericsson is currently designing and manufacturing the Radar VIP and a prototype is scheduled to be available early in 1995.
Figure 9: Result from evaluation of the algorithm.
References

[1] S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall Inc., 1989.
[2] D. A. Butler, C. C. Brunner, and J. W. Funck. A dual-threshold image sweep-and-mark algorithm for defect detection in veneer. Forest Products Journal, 39(5):25-28, May 1989.
[3] K. Chen and C. Svensson. A 512-processor array for video/image processing. In From Pixel to Features II, Bonas, France, 1990.
[4] T. S. Cinotti, R. Cucchiara, L. D. Stefano, and G. Neri. Driving a SIMD array with a RISC control unit: a modular and flexible approach to computer vision applications. In Proc. of InfoJapan'90, pages 217-224, Tokyo, Japan, 1990.
[5] R. Conners. Machine vision technology in the forest products industry. Industrial Metrology, special issue, 2(3-4), May 1992.
[6] R. W. Conners, C. W. McMillin, and R. Vasquez-Espinoza. A prototype software system for locating and identifying surface defects in wood. In Proc. 7th Int. Conference on Pattern Recognition, volume 1, pages 416-429, 1984.
[7] R. Forchheimer, P. Ingelhag, and C. Jansson. MAPP2200 - a second generation smart optical sensor. In Proc. of the SPIE - The International Society for Optical Engineering, volume 1659, pages 2-11, 1992.
[8] M. Hall, M. Johannesson, and A. Åström. The RVIP image processor array. In Proc. of Symposium on Image Analysis, pages 119-124, Halmstad, Sweden, Mar. 1994.
[9] P. Ingelhag. A contribution to high performance CMOS circuit techniques. Lic Thesis LiU-TEK-LIC-1992:35, Department of Physics, Linköping, Sweden, 1992.
[10] Integrated Vision Products AB, Sweden. LAPP1100 Picture Processor, Instruction Set.
[11] M. Johannesson, A. Åström, and P. Ingelhag. RVIP, The final report. Technical Report LiTH-ISY-R-1595, Department of Electrical Engineering, Linköping, Sweden, Nov. 1992.
[12] M. Johannesson, A. Åström, and P. Ingelhag. The RVIP image processor array. In Proc. of CAMP'93 Workshop, New Orleans, USA, Dec. 1993.
[13] M. Johannesson, K. Jonasson, and A. Åström. PASIC/VIP system design and evaluation. Technical Report LITH-ISY-I-1332, Department of Electrical Engineering, Linköping, Sweden, 1992.
[14] H. S. Stone. High Performance Computer Architecture. Addison Wesley Publishing Company, 2nd edition, 1990.
[15] E. Åstrand. Detection and classification of surface defects in wood, a preliminary study. Technical Report LiTH-ISY-I-1437, Department of Electrical Engineering, Linköping, Sweden, 1992.
[16] E. Åstrand. Personal communication, 1994.
[17] E. Åstrand and A. Åström. An intelligent single chip multisensor approach for wood defect detection. In Proc. of Symposium on Image Analysis, pages 115-118, Halmstad, Sweden, Mar. 1994.
[18] A. Åström. Smart Image Sensors. PhD thesis, Linköpings Tekniska Högskola, 1993.
Algorithms and Parallel VLSI Architectures III, M. Moonen and F. Catthoor (Editors) © 1995 Elsevier Science B.V. All rights reserved.
CONVEX EXEMPLAR SYSTEMS: SCALABLE PARALLEL PROCESSING
J. VAN KATS
CONVEX Computer, Europalaan 514, 3526 KS Utrecht
[email protected]
ABSTRACT. The Convex Exemplar is the first of a family of production-oriented multipurpose parallel computer (MPC) systems. The Exemplar systems make available a spectrum of applications, from thousands of non-parallel applications to a growing number of highly parallel, production-oriented applications. The systems' hardware, programming environment and development toolset deliver a high-productivity platform for porting and developing parallel, scalable codes.

KEYWORDS. Parallel, multipurpose parallel computer systems, scalable parallel processing, RISC, metacomputing, NUMA, computer architecture.

1 INTRODUCTION
In the last couple of years, parallel processing architectures have been introduced at a rapid rate, in many different application areas. About ten years ago, commercial multiprocessor development was only experimental and, in a limited number of cases, a serious alternative to more conventional approaches to high performance computing. Today, Multipurpose Parallel Computing (MPC) employing a large number of microprocessors is challenging the preeminence of mainframes and of vector-based supercomputers. One of the most recently introduced MPC systems is the Convex Exemplar Series. The design of such an MPC system is complicated. One of the important dimensions in comparing the diverse vendor offerings is the way in which data is shared among the processing elements and their associated memory modules comprising the distributed resources. Clustered workstations (such as the IBM SP1, SP2 series [3]) represent one end of the spectrum, exchanging information packets over LANs through high-level operating system and network protocols. Distributed memory multiprocessors such as the Intel Paragon [4] and the Thinking Machines TMC CM-5 [5] provide high bandwidth interconnects between
processor and memory nodes, but reflect a fragmented memory name space by maintaining logical independence between node memories. Exchanging data between nodes is controlled by service routines at both ends of the data transfer. This is in contrast with shared memory systems, which permit direct access to all system memory by any of their processors. The shared memory node supports a common global logical reference space, enabling hardware mechanisms to replace software service routines for exchanging data. The Kendall Square KSR [6], the Cray Research CRI T3D [7] and the Convex Exemplar [2] are implementations of this type of system. Modern microprocessors require sophisticated hierarchical memory systems to minimize access latency through advanced cache mechanisms; see e.g. [3], [10], [11]. For a detailed discussion of cache architectures one is referred to [12]. Without caches, today's high-end microprocessors would operate an order of magnitude slower. Combining multiple microprocessors into a single shared memory system adds the complexity of maintaining consistency of shared variable values across separate caches. Appropriate measures are required to make it impossible for separate processors to see different values for the same variable at the same time. One approach is to avoid caching shared variables. This is implemented on the CRI T3D [7] and e.g. the Butterfly Parallel Processor BBN TC2000 [8]. An alternative is to provide hardware mechanisms to ensure such consistency. The Sequent Balance [9], as an example, employs snooping mechanisms by which every processor/cache node monitors a single shared bus and responds to all transactions that involve variables stored in the node's local cache. But bus-based systems are limited in scaling, with upper bounds of, say, between 8 and 16 processors. The KSR systems [6] provide a ring-based strategy to replace the bus, with the objective of being scalable. The Exemplar SPP1000 is the first multipurpose parallel computer system to incorporate directory-based concepts to maintain cache coherence in support of a scalable configuration. For a complete overview of available parallel systems, including a short description, one is referred to [13].
2 THE DETAILED EXEMPLAR ARCHITECTURE
The Convex Exemplar SPP-1000 is the first of a new family of computer systems that have the software environment and ease-of-use of a general-purpose computer combined with the scalability, architecture and performance of a massively parallel processing (MPP) system. By combining the best features of both worlds, Exemplar becomes a scalable, high-performance supercomputer system that has the look and feel of a desktop system. Good parallel performance starts with fast single processor performance. There is little use in applying massive numbers of processors in parallel if one fast single processor can achieve a similar level of performance. Convex has chosen to use the HP PA-RISC superscalar architecture [10] as the processor cornerstone of the Exemplar architecture. The first generation SPP-1000 systems incorporate the PA-RISC 7100 processor running at 100 MHz (10 ns), delivering a peak performance of 200 Mflop/s in 64-bit arithmetic.
Figure 1: An overview of the Exemplar System
This implies that even non-parallelizable applications will run at considerable speed. Directly coupled to the fast RISC processor are an instruction cache and a data cache, containing 1 MByte of instructions and data respectively. In total, up to 8 of these processors, plus an optional I/O port, are connected with the memory through a crossbar. This can be considered a complete SMP (Symmetric Multi-Processing) subsystem. Within the Exemplar architecture it is referred to as a "hypernode"; see also Figure 1 for a schematic representation. The hypernodes can be connected together to form a fully homogeneous multi-node Exemplar system. The hypernodes are connected through the CTI ring system, consisting of four independent Coherent Toroidal Interconnect (CTI) rings. This design allows for fault tolerance and provides higher aggregate data-transfer bandwidth between the hypernodes. The SPP-1000 supports a two-level hierarchy of memory, each level of which is optimized for a particular class of data sharing. The first level consists of a single-dimensional torus of identical processor/memory hypernodes, as shown in Figure 1. Convex expands this
concept to two levels by making each hypernode a fully functional symmetric multiprocessor. Convex also expands the normal idea of a toroidal interconnect by splitting each link of the torus into four interleaved links, for additional bandwidth and fault resilience. This hierarchical memory subsystem was chosen for several reasons:

1. The low-latency shared memory of a hypernode can effectively support fine-grained parallelism within applications, thus improving performance. This is often accomplished by simply recompiling the program with the automatic parallelizing compilers.
2. The two levels of memory latency will be the model of the future: as semiconductors become more dense, multiple CPUs will likely be placed on a single die. Thus, the first level of the hierarchy will be multiple CPUs sharing a common memory; the second level of the memory hierarchy will be "off-chip" (possibly to a nearby multiprocessor chip).
3. This system organization is a superset of a cluster of workstations, traditional experimental MPP systems and SMP systems.

Processors within a hypernode are tightly coupled to support fine-grained parallelism. Hypernodes implement coarser-grained parallelism, with communication through shared memory and/or message-passing mechanisms. Despite what might be suggested by Figure 1, the CTI system is truly a ring, i.e. hypernode 1 and hypernode 16 are connected. The ring controllers can be sending and receiving at the same time. From Figure 1 it is also apparent that the total memory is physically distributed over the hypernodes. However, through the four CTI rings any processor can reference any memory location, providing a true shared memory system. Clearly the memory access time depends on whether the requested data is local to the hypernode the processor is part of or not. Such a memory access hierarchy is often referred to as a Non-Uniform Memory Access (NUMA) architecture. Note that this is not part of the programming model, as it still offers one single memory view. The access latencies are part of the architecture and can be seen as a natural extension to the well-known hierarchy of memory/cache/register that can be found in many current computer systems. By examining the memory address, the system knows whether data is located in the memory physically located on the hypernode of the requesting CPU or not. In the latter case the CTI rings will transport the data from the source hypernode to a reserved part of the memory in the destination hypernode. This reserved part is called the CTI-cache and is configurable by the system manager. Another important item that needs to be addressed in an efficient way is so-called cache coherency. It is beyond the scope of this paper to elaborate upon it, but it is a critical component for good parallel performance. Basically it amounts to the necessity that, when executing a load instruction, any processor must receive the most recent value of a data
element, even if it has just been updated by another processor in a different hypernode and is stored in the local processor data cache. The Exemplar architecture is fully cache coherent. All coherency has been implemented in hardware to allow for optimal parallel performance. Efficient and fast I/O performance has also received considerable attention. First-generation MPP systems often have a (slow) front-end system which is the controlling system for the whole configuration. Typically all I/O is done through the front-end, creating a serious performance bottleneck. In contrast, the Exemplar architecture is a stand-alone system with I/O implemented at the hypernode level. The crossbar on each hypernode may be equipped with an I/O port that is directly attached to the peripherals. On a multi-hypernode system, several hypernodes can have such an I/O port, depending on the needs. This reduces I/O constraints. For a more detailed description one is referred to [1] and [2].
3 THE EXEMPLAR SOFTWARE ENVIRONMENT

3.1 The SPP-UX Operating System
The Exemplar operating system is called SPP-UX. It is based on the OSF/1 Advanced Development (AD) Mach 3.0 microkernel. A microkernel-based operating system consists of a lightweight microkernel plus a collection of so-called servers. These servers provide the functionality found in more traditional monolithic systems. By design, SPP-UX provides scalable OS performance. One copy of the microkernel runs on each hypernode. In addition to this, servers run on each hypernode and, if desired, multiple identical servers may be operational to provide additional performance for the desired functionality, like file handling. Within the complete system, the servers communicate over the CTI rings. Aside from tuning for the distributed architecture, several important features have been added to SPP-UX:

• Binary compatibility with HP-UX through support of the HP-UX ABI (Application Binary Interface). This means that the Exemplar system is able to run all middleware, tools and applications running under HP-UX.
• Support for subcomplexes. The Exemplar system can be subdivided into several independent subcomplexes consisting of groups of CPUs and memory.
• Support for parallelism. SPP-UX fully supports parallel processing on the Exemplar system. For example, the operating system takes care of task and thread management.
3.2 The Compilers
Convex has developed automatically optimizing and parallelizing Fortran, C and C++ compilers for the Exemplar. In addition to this, the powerful Convex Application Compiler (APC) is available. The APC compiles and links Fortran and C programs, or a mixture of both languages. The APC makes use of interprocedural analysis to find more opportunities for optimization, especially parallelization. Key in the interprocedural analysis is the data-dependency part, where the APC analyzes the data usage throughout the complete application, as opposed to the more conventional procedural compilers that only analyze individual functions and subroutines one at a time. The interprocedural data-dependency analysis makes it possible to parallelize at a much higher level, like performing automatic parallelization of loops containing calls without the need for inlining. We would like to point out that the above described shared memory programming model with automatic parallelization is very similar to the familiar C-series environment. Loops will be examined for parallelization and the appropriate parallel code will be generated accordingly. The programmer can declare and use variables and arrays in the traditional way. The underlying hardware mechanism will take care of data distribution and retrieval. In addition to the shared memory type of parallelization, the programming model supports explicit message passing such as implemented with PVM. A combination of both approaches is also possible. The PVM implementation on Exemplar exploits the shared memory model. Instead of the traditional approach of sending packets of data over a network using sockets, the Convex PVM implementation performs reads and writes directly into the shared memory. This not only gives much higher bandwidth but, even more important, a low latency. To the programmer this is transparent, as he or she will link the PVM library as usual. Similar approaches will hold for PARMACS and MPI.
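For readers unfamiliar with PVM, the fragment below shows the kind of pack/send/receive sequence involved; it is generic PVM 3 usage, not Convex-specific code, and the spawned executable name "worker" is an assumption. On the Exemplar these calls go through the shared-memory implementation described above rather than through sockets.

#include <stdio.h>
#include "pvm3.h"

/* Generic PVM 3 ping from a parent task: spawn one worker, send it an
 * integer with tag 1 and wait for the echoed value with tag 2. */
int main(void)
{
    int tid, n = 42, reply;

    pvm_mytid();                                   /* enrol in PVM */
    if (pvm_spawn("worker", NULL, PvmTaskDefault, "", 1, &tid) == 1) {
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&n, 1, 1);
        pvm_send(tid, 1);

        pvm_recv(tid, 2);
        pvm_upkint(&reply, 1, 1);
        printf("received %d from task %x\n", reply, tid);
    }
    pvm_exit();
    return 0;
}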
3.3 Program Development Tools
Convex has developed three important tools to assist program development and tuning of Fortran, C and C++ programs on the Exemplar system: a debugger and two profilers.

• The CXdb debugger is fully graphically oriented and allows the user to debug optimized code, including parallel sections. CXdb gives access to information from the high-level source code down to the lowest level, like disassembled code, registers and memory locations. It is possible to step asynchronously through the individual threads in the code and monitor variables and arrays.
• The CXpa profiler is also fully graphically oriented and gives detailed insight into the performance of an application. Good performance depends largely on optimal data access, and therefore the Exemplar system contains hardware event counters
to monitor this. CXpa gives access to this information, like cache usage (latency and misses), for all threads participating in the (parallel) computation. The analysis information can be presented in a variety of graphical displays, 2D and 3D.
• The CXtrace profiler is meant to give insight into the distributed-memory characteristics of a PVM application. Through visual presentation of the messages being sent and received, processor activity and idle time, possible performance bottlenecks are exposed.
4 CONCLUSIONS
In this paper, the new Convex Exemplar architecture has been presented and compared to other state-of-the-art parallel machines. Highlights of the Exemplar Series include:

• A family of computer systems based on a large number of Hewlett-Packard's Precision Architecture (PA) line of RISC processors, one of the highest performance RISC processors available [10].
• Binary compatibility with the HP-UX operating system, a high-performance production version of standard AT&T UNIX™. This compatibility ensures that thousands of commercially available applications execute on Exemplar systems. It also provides system administration tools, development tools and a production environment that conform to the proposed COSE (Common Open Software Environment) standards.
• A scalable operating system kernel that distributes operating system functionality across the system. This microkernel approach results in an operating system that is easier to maintain, more extensible and more scalable than monolithic designs.
• A sophisticated assortment of software tools designed specifically for porting and tuning of highly parallel applications. These tools provide high productivity and feature GUI user interfaces, automatic performance analysis and source-level debugging tools.

References
[1] "Convex Exemplar System Overview", First Edition, March 1994.
[2] "Exemplar Architecture", First Edition, November 1993, Convex Press, DHW-014.
[3] "IBM 9076 Scalable POWERparallel Systems" (Guide SH26-7221-00), IBM, 1993.
[4] "Paragon User's Guide" (312489-002), Intel Corporation, 1993.
[5] "Connection Machine CM-5 Technical Summary", Thinking Machines Corporation, 1992.
[6] "KSR Technical Summary", Kendall Square Research Corporation, 1992.
[7] "Cray T3D System Architecture Overview" (HR-04033), Cray Research Inc.
[8] "Butterfly Parallel Processor Overview" (BBN Report 6148), BBN Laboratories, 1986.
[9] "Balance: a shared memory multiprocessor", in Proceedings of the Second International Conference on Supercomputing, Santa Clara, May 1987.
[10] "PA-RISC 1.1 Architecture and Instruction Set Reference Manual", Hewlett Packard, 1992.
[11] "TFP Microprocessor Chip Set", MIPS Technologies, 1993.
[12] "High Performance Computer Architecture", H. S. Stone, Addison Wesley, New York, 1990.
[13] "An overview of (almost) available parallel systems", A. J. van der Steen, publication of NCF, the Netherlands, 1993.
Algorithms and Parallel VLSI Architectures III, M. Moonen and F. Catthoor (Editors) © 1995 Elsevier Science B.V. All rights reserved.
MODELLING THE 2-D FCT ON A MULTIPROCESSOR SYSTEM

C. A. CHRISTOPOULOS†, A. N. SKODRAS‡, J. CORNELIS†
† Vrije Universiteit Brussel, VUB-ETRO (IRIS), 1050 Brussels, Belgium.
‡ University of Patras, Electronics Laboratory, Patras 26110, Greece.
E-mail: [email protected]

ABSTRACT. This paper describes the parallel implementation of the 2-D Fast Cosine Transform (FCT). The examined algorithms are: Row-Column (RC) FCT and Vector-Radix (VR) FCT. A SUN SPARC 10 with 4 processors and shared memory is considered. It is observed that the parallel approach provides considerable gain for block sizes of more than 128x128 pixels. For smaller sizes there is no speed-up: by using more than one processor, the execution times are larger than on one processor, due to the time required for the creation and synchronization of the threads. In addition, it is observed that the usual measure of the performance of an algorithm in terms of the number of additions and multiplications is not adequate, and other operations such as memory accesses and data transfers have to be taken into account.
KEYWORDS. Transforms, Fast Discrete Cosine Transform, Multithreads.

1 INTRODUCTION
The design of efficient algorithms for the computation of the 2-D Discrete Cosine Transform (DCT) has been studied for many years and a variety of algorithms have been proposed [2,4,5,8,10,11]. The algorithms in [2,4,8,11] are based on a Decimation-In-Frequency (DIF) radix-2 or split-radix approach; they have the same computational complexity but are not in-place, due to an input-mapping stage required before the computation of the butterflies. The in-place 1-D radix-2 FCT is presented in [10]. This algorithm has the same computational complexity as other existing 1-D FCTs and also a pruning property. An in-place version of the 2-D FCT given in [4] is presented in [5]. The algorithm also has a pruning property, allowing the computation of any number of DCT coefficients in a zone of any shape and size, resulting in computational savings in many image processing applications. In [6], the computation of the 2-D FCT through the Row-Column (RC) approach using various 1-D FCTs and through the Vector-Radix (VR) FCT is studied. It is shown that the VR FCT outperforms the RC approach due to its lower computational complexity.
Although the computational complexity for the calculation of NxN DCT points can be reduced by using the VR FCT instead of the RC approach, a further decrease in the computation time is important in areas like image compression and image sequence coding. In these fields, a large amount of data must be handled fast. Furthermore, low latency is required in many real-time processing applications. In these areas, conventional general-purpose computers are insufficient and parallel computers are an interesting alternative. This paper describes our first results obtained from the parallelization of the VR FCT [5] and the RC approach using the 1-D FCT [10]. The algorithms are implemented on a SUN SPARC system with 4 processors using the Solaris 2.3 multithread approach. The paper gives an indication of how much improvement can be achieved by using a general-purpose and cost-effective parallel machine, before we try to implement these algorithms on more complex MIMD architectures, as has been done for the FFT [1]. It also provides insight into the design of parallel VLSI chips for faster coding.
2 COMPUTATION OF THE 2-D FCT
The 2-D DCT of an NxN-point digital signal x(n1, n2), 0 ≤ n1, n2 ≤ N-1, is defined by the equation:

X(k_1,k_2) = k(k_1,k_2) \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} x(n_1,n_2) \cos\left[\frac{\pi(2n_1+1)k_1}{2N}\right] \cos\left[\frac{\pi(2n_2+1)k_2}{2N}\right]   (1)

where k1, k2 = 0, 1, ..., N-1 and k(k1,k2) = 4ε_{k1}ε_{k2}, with ε_{ki} = 1/√2 for ki = 0 and ε_{ki} = 1 otherwise, i = 1, 2. In the following, the scaling factors will not be considered. It is also assumed that N = 2^m.

The 2-D DCT of the NxN-point sequence defined in equation (1) can be computed in the two successive steps of the Row-Column (RC) FCT:

X'(n_1,k_2) = \sum_{n_2=0}^{N-1} x(n_1,n_2) \cos\left[\frac{\pi(2n_2+1)k_2}{2N}\right]   (2)

X(k_1,k_2) = \sum_{n_1=0}^{N-1} X'(n_1,k_2) \cos\left[\frac{\pi(2n_1+1)k_1}{2N}\right]   (3)
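As a point of reference, a direct (non-fast) C implementation of the row-column computation of equations (2)-(3) is sketched below. It uses O(N^3) operations per pass and is intended only for checking results; the in-place FCTs of [5,10] used in this paper produce the same coefficients with far fewer operations. The function and argument names are ours.

#include <math.h>

#define PI 3.14159265358979323846

/* Direct (slow) row-column evaluation of the unscaled 2-D DCT of
 * equations (2)-(3).  in, tmp and out are N*N arrays in row-major
 * order.  Reference only; the fast in-place algorithms compute the
 * same coefficients with far fewer operations. */
void dct2_rc_reference(const double *in, double *tmp, double *out, int N)
{
    for (int n1 = 0; n1 < N; n1++)              /* equation (2): rows    */
        for (int k2 = 0; k2 < N; k2++) {
            double s = 0.0;
            for (int n2 = 0; n2 < N; n2++)
                s += in[n1*N + n2] * cos(PI*(2*n2 + 1)*k2 / (2.0*N));
            tmp[n1*N + k2] = s;
        }
    for (int k2 = 0; k2 < N; k2++)              /* equation (3): columns */
        for (int k1 = 0; k1 < N; k1++) {
            double s = 0.0;
            for (int n1 = 0; n1 < N; n1++)
                s += tmp[n1*N + k2] * cos(PI*(2*n1 + 1)*k1 / (2.0*N));
            out[k1*N + k2] = s;
        }
}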
Alternatively, the 2-D DCT can be calculated by working along columns and rows simultaneously. The Vector-Radix FCT is derived by expressing the NxN-point DCT in terms of four N/2xN/2-point DCTs [4,5]. This procedure can be repeated until we reach a transform of size 2x2. The resulting VR FCT algorithm requires 25% fewer multiplications than the RC FCT, while the number of additions remains the same [4,5]. After this brief introduction to the 2-D FCT algorithms, we proceed to their parallel implementation on a SUN SPARC with 4 processors using the Solaris 2.3 multithread approach. The 1-D FCT algorithm used in the RC approach is an in-place version of the 1-D DIF algorithm [4], and the VR FCT is an in-place version of the DIF VR FCT [4]. The in-place algorithms are described in [5,10]. These algorithms are used due to their advantages, i.e. they can be computed in place, have attractive pruning properties, and their regular structure permits easy DCT computation of large block sizes. The parallelization
capabilities of the algorithms are evaluated with respect to their performance in terms of arithmetic operations and the speed-up obtained for different block sizes and for different numbers of processors.
3 THE COMPUTER ARCHITECTURE
The computer platform used for our experiments is a SUN SPARC 10 with 4 processors using the Solaris 2.3 multithread approach. The Sun multiprocessor implementation is based on a tightly coupled, shared-memory architecture. Processors are tightly coupled by a high-speed interconnection bus, the MBus, and share the same image of memory, although each processor has a cache where recently accessed data and instructions are stored [3]. In this implementation, the Solaris operating environment co-ordinates the activities of all processors, scheduling jobs and co-ordinating access to common resources. All CPUs have equal access to system services, such as I/O and networking. The overall objective is a general-purpose system with sufficient computing and I/O resources to improve performance for a wide range of applications. Inevitably one must focus on the software that must run on the system. The most important feature is the SunOS multithread architecture. A thread is a single sequence of execution steps performed by a program or a series of programs. Multiple threads are an efficient way for the application developer to utilize the parallelism of the hardware. In Solaris, the threads library uses execution resources (called lightweight processes) that are scheduled on processors by the operating system. A lightweight process (LWP) can be thought of as a virtual CPU that executes code or system calls. All the LWPs are independently dispatched by the kernel, perform independent system calls, and run in parallel on a multiprocessor system [3]. Programs with coarse-grain parallelism can benefit from multithreading when run on multiprocessor hardware. However, programs in which each thread does not execute enough code between synchronisations can have performance problems, because of the overhead involved in the creation and synchronization of threads. Synchronization of threads can be achieved through mutual exclusion locks, condition variables, semaphores or the multiple-reader, single-writer lock [3]. Once a thread is created, it executes the function specified by the call.
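The fork/join pattern used throughout the rest of this paper looks roughly as follows with the Solaris threads library; this is only a minimal sketch, and the worker function and NCPUS value are placeholders.

#include <thread.h>     /* Solaris 2.x threads library */
#include <stdio.h>

#define NCPUS 4

/* Minimal fork/join sketch: create one thread per processor, let each
 * run work() on its own slice, and join them before the next phase.
 * The body of work() is a placeholder. */
void *work(void *arg)
{
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void)
{
    thread_t tid[NCPUS];
    int      id[NCPUS];

    for (int i = 0; i < NCPUS; i++) {
        id[i] = i;
        thr_create(NULL, 0, work, &id[i], 0, &tid[i]);
    }
    for (int i = 0; i < NCPUS; i++)
        thr_join(tid[i], NULL, NULL);   /* synchronize before continuing */
    return 0;
}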
4 PARALLEL 2-D FCT THROUGH THE RC FCT
An input-reordering stage (merging of an input-mapping and a bit-reversal stage) [12], the computation of the butterflies and a number of post-additions are required for the computation of N DCT points in the 1-D case using the in-place algorithm of [10]. Therefore,
the computation of the NxN DCT points can be done following the next steps:

step 1: Create threads t_i, i = 0, 1, 2, ..., (NCPUS-1), that will run in parallel in a multiprocessor system with NCPUS processors.
step 2: Thread t_i will compute, in parallel with the other threads, the DCT in the rows i*N/NCPUS to {[(i+1)*N/NCPUS]-1}, i.e.
step 3: For each thread t_i, i = 0, 1, 2, ..., (NCPUS-1)
            For each row in i*N/NCPUS to {[(i+1)*N/NCPUS]-1}
                Apply the 1-D FCT of [10]
Figure 1: Computation of the row DCTs in the 8x8 case with 2 CPUs.

Figure 1 shows how the computation of the DCT for the rows is done when two CPUs are available (I denotes the rows allocated to the first CPU and II the rows allocated to the second one). When all threads have finished the execution of their job, the above steps can be applied in a similar manner to the columns. Threads must be synchronized before the computation of the DCT for the columns starts, so only when the created threads have finished the DCT computation of the rows are new threads created to calculate the DCT of the columns. At the end of the whole computation, synchronization of the threads is required in order to save the results in the right order.
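In code, the row pass of steps 1-3 amounts to a worker of the following shape; fct_1d() stands in for the in-place 1-D FCT of [10] and is not shown, the names are illustrative rather than the authors' source, and the column pass is identical with rows and columns exchanged.

#define NCPUS 4

extern void fct_1d(double *data, int n);   /* stand-in for the 1-D FCT of [10] */

typedef struct { double *x; int N; int id; } RowJob;

/* Worker executed by thread t_i: apply the 1-D FCT to the block of rows
 * i*N/NCPUS .. (i+1)*N/NCPUS - 1 of the N x N block x (row-major). */
void *dct_rows(void *arg)
{
    RowJob *j = (RowJob *)arg;
    int first = j->id * j->N / NCPUS;
    int last  = (j->id + 1) * j->N / NCPUS;     /* exclusive */
    for (int r = first; r < last; r++)
        fct_1d(j->x + (long)r * j->N, j->N);
    return NULL;
}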
P A R A L L E L V R FCT
2-D input-reordering is required before the computation of the radix 2x2 butterflies. This can be performed in the row-column approach [5], using the method described in the previous section. The same technique can be applied for the computation of the post-additions. The computation of the butterflies (radix-2x2 butterflies) is done in two steps. In the first step, the first stage (stage 0) of the computation is performed in the following manner: step I: Create threads t i, i=0,I,2 .....(NCPUS-I) that will run in parallel in a multiprocessor system with NCPUS processors step 2: Thread t i will compute in parallel with the other threads the radix-2x2 butterflies that exist in the rows i*N/NCPUS to {[(i+I)*N/NCPUS]-I }, i.e. step 3: for (rows=i*N/NCPUS; rows<0+ I)*N/NCPUS; rows+=2) for (cols---O;cols
i = 0, 1, ..., (NCPUS-1), that will run in parallel. The computation of the butterflies is based on the following procedure:

  int max, step = 2;
  for (stage = 1; stage < m; stage++) {
    max = step;
    step = 2*max;
    create threads t_i to run in parallel the following code:
      for (k1 = i*max/NCPUS; k1 < (i+1)*max/NCPUS; k1++)
        for (k2 = 0; k2 < max; k2++)
          for (y = k1; y ...
Threads need to be synchronized to ensure that each thread has finished its job at each stage before the next stage is started.
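Read literally, the per-stage thread creation in this procedure can be realized by re-creating and joining the workers inside the stage loop, so that the join acts as the required barrier. A schematic sketch under the same assumptions as above (butterfly_band() is a hypothetical worker covering one thread's share of one stage):

  #include <thread.h>

  extern void butterfly_band(int thread_id, int stage);   /* hypothetical */
  extern int NCPUS, m;                                     /* m = log2(N) stages */

  struct job { int id, stage; };

  static void *stage_worker(void *arg)
  {
      struct job *j = (struct job *)arg;
      butterfly_band(j->id, j->stage);
      return NULL;
  }

  void vr_butterflies(void)
  {
      thread_t tid[16];            /* assumes NCPUS <= 16 */
      struct job job[16];
      int stage, i;

      for (stage = 1; stage < m; stage++) {
          for (i = 0; i < NCPUS; i++) {
              job[i].id = i;
              job[i].stage = stage;
              thr_create(NULL, 0, stage_worker, &job[i], 0, &tid[i]);
          }
          for (i = 0; i < NCPUS; i++)
              thr_join(tid[i], NULL, NULL);   /* barrier: the stage must complete */
      }
  }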
Figure 2: Diagram of the 4x4 VR FCT

Figure 2 shows the diagram of the 4x4 VR FCT (x'(n1, n2) denote the input-mapped data and broken lines represent computations that can be avoided when pruning is applied, see [5]). Figure 3 shows how the computation of the butterflies is allocated to two CPU's for the 8x8 case and for the first two stages. The representation Xy defines the points in butterfly y allocated to CPU X. Notice that in the VR FCT a butterfly consists of 4 points (see Figure 2, stage 0, in which there exist 4 butterflies).
Figure 3: Allocation of the 8x8 VR FCT butterflies on two CPU's

6 SIMULATION RESULTS AND DISCUSSION
The total number of additions (a1) and multiplications (m1) required to compute the NxN DCT points using the RC approach is a1 = 3mN^2 - 2N^2 + 2N and m1 = mN^2 [4,5]. The corresponding figures for the VR FCT are [4,5] a2 = 3mN^2 - 2N^2 + 2N and m2 = 3mN^2/4, i.e. 25% fewer multiplications are required in comparison to the direct RC FCT.

Execution time (in msec) versus the number of operations (floating point additions and multiplications) for the RC and VR FCT (for 1 CPU) for blocks of size NxN:

  N      Time (RC)   Operations (RC)   Time (VR)   Operations (VR)
  16     8           3616              5           3360
  32     25          18496             20          17216
  64     93          90240             77          84096
  128    431         426240            315         397568
  256    2705        1966592           2183        1835520
  512    14982       8913920           10991       8324096
  1024   72922       39847936          51760       37226496
The table above gives the execution times obtained by running the two algorithms on one processor, together with the number of arithmetic operations (floating point additions and multiplications) for various block sizes. The following can be noticed from this table: (a) When one CPU is used, the computation of the 2-D DCT through the VR FCT is faster than through the row-column approach. This is because the VR FCT has a lower computational complexity than the RC FCT. (b) No simple relation between the execution time and the number of operations can be established. When running the algorithms on a workstation, the number of operations cannot be well correlated with the experimental execution times, because of the various optimizations performed by the compiler, the OS, the pipelined architecture used, etc. The correlation coefficient between the number of operations and the experimental execution time both for
the RC and the VR is 0.73, indicating a low correlation between these measures. However, for a VLSI implementation, the number of operations can still be a useful performance measure for the algorithm. (c) Although the VR FCT has 6-7% lower total arithmetic complexity (in terms of the number of additions and multiplications), it is 17-37 percent faster than the RC FCT. In our approach, the VR FCT requires 50% fewer data transfers during the computation of the butterflies compared to the RC FCT. This is because a VR butterfly processes 4 points at a time while the radix-2 butterfly processes 2 points. Since a VR butterfly is equivalent to 4 1-D FCT butterflies, the amount of memory accesses and data transfers for the VR FCT is 50% less. However, if the 1-D FCT is programmed to process two stages at a time with a 4-point kernel, the number of memory accesses and data transfers will be the same for both approaches (when the number of stages is an odd number, the RC approach will require slightly more memory accesses than the VR approach). The execution times for the algorithms described in the previous sections for up to 4 processors are given in the tables below (NCPUS denotes the number of processors). As expected, the speed-up increases with the image size NxN, because in this case the computational load is larger than the overhead involved in creating and synchronizing threads.

Total time in msec for the NxN RC FCT:

  NCPUS   N=16   N=32   N=64   N=128   N=256   N=512   N=1024
  1       8      25     93     431     2705    14982   72922
  2       10     21     62     244     1380    7921    38433
  3       17     27     58     191     1059    5593    27235
  4       19     27     52     163     883     4641    22509

Total time in msec for the NxN VR FCT:

  NCPUS   N=16   N=32   N=64   N=128   N=256   N=512   N=1024
  1       5      20     77     315     2183    10991   51760
  2       14     24     63     204     1101    5655    26294
  3       32     46     77     201     854     3891    18397
  4       54     68     98     212     756     3690    16799
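For orientation (our own arithmetic on the measured times, not a claim made in the paper), the speed-up on 4 processors follows directly from these tables; for example, for the VR FCT:

  speed-up(N=1024, 4 CPU's) = 51760 / 16799 = 3.1
  speed-up(N=64,   4 CPU's) = 77 / 98       = 0.8   (a slowdown, matching observation (d) below)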
(d) There is no significant improvement in speed from using more than one processor for small blocks of size up to 64x64. For the RC approach, a small speed-up is achieved by utilizing 2 CPU's, while for the VR FCT the results indicate that for such block sizes a serial algorithm is better. This is because the VR FCT has a lower computational complexity than the RC FCT (and therefore the processors have less computational load than in the RC FCT) and it also requires the creation and synchronization of more threads. As an example, for the computation of NxN DCT points in the RC approach, NCPUS threads will be created to compute the rows and NCPUS to compute the columns. Two synchronizations are required: one after the computation of the rows (to guarantee that all threads have finished their job before the new threads are created to compute the columns) and one after the whole computation finishes. In the VR algorithm, we need to create and synchronize threads during the RC computation of the input-reordering, the RC computation of the post-additions and in all stages of the computation of the butterflies. Suppose that the number of running threads at each moment is equal to the number of
CPU's. Then for the computation of the DCT based on the RC approach, the total number of created threads will be Nth_RC = 2*NCPUS and the number of synchronizations will be Nsyn_RC = 2. In the VR FCT, the corresponding values are Nth_VR = (4 + m)*NCPUS and Nsyn_VR = 4 + m. It is clear that the increased number of threads and synchronizations required in the VR approach reduces its advantage over the RC approach and, for small block sizes, affects the performance of the parallel approach. For large block sizes, however, the computational load of each CPU is high and these factors do not significantly affect the performance. The performance of a VLSI implementation will probably be higher for the VR FCT, but at the cost of more hardware due to the irregular communication. If the performance were compared for the same area, it is not clear whether there would still be a gain, so this remains a topic for future research. The speed-up achieved by the use of more than one processor for various block sizes is shown in Figure 4 for the VR FCT. It is observed that the speed-up achieved is more important in processing large blocks and in full frame image compression as required in medical image compression applications [7].
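To make these counts concrete (our own arithmetic, assuming m = log2 N as in the operation counts above): for N = 512 (m = 9) and NCPUS = 4,

  Nth_RC = 2 x 4 = 8,           Nsyn_RC = 2,
  Nth_VR = (4 + 9) x 4 = 52,    Nsyn_VR = 4 + 9 = 13,

so the VR version creates more than six times as many threads and synchronizes more than six times as often per block.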
Figure 4: Speed-up versus the number of processors for the NxN VR FCT algorithm

(e) In the case of the VR FCT, there are some stages in which some of the processors are idle while others are working. For example, this is the case in stage 1 when NCPUS = 4: only two of the processors are working. Therefore some slight improvements can still be obtained after a suitable load balancing of the job on the available processors. (f) The FCT application, although it is only one image processing kernel, gives an indication of what can be achieved in a complete image processing application by using a general purpose parallel architecture. The computational load of each CPU in this application was known a priori. In some applications, like Segmented Image Coding (SIC) [9], the
computational complexity of the problem is not fixed, but depends on the shape and size of the region as well as on the required compression factor. Allocating tasks to the CPU's becomes a difficult problem and efficient load balancing is essential, since the parallelization should now be data dependent. Research is therefore being done to evaluate the multithread approach in such applications as SIC and to develop efficient ways of distributing the work among the available processors. It is expected that in such applications, under certain conditions, significant speed-up can be achieved due to the large computational complexity encountered.

7 CONCLUSION
This paper investigated the row-column and the vector-radix approaches for the computation of the 2-D FCT on a general purpose multiprocessor system. It was found that for large block sizes (> 128x128), a significant speed-up can be achieved by using more than one processor, while for small block sizes no significant improvement is achieved. The number of arithmetic operations can only be used as a qualitative descriptor for the execution time of an algorithm on single or multiple processor workstations.

Acknowledgements
This work was supported by the EC contracts ERBCHBGCT930260 and ERBCHRXCT930382. The authors would like to thank Dr. Wilfried Philips, Dr. Adam Damianakis and Fubo Zhang for useful discussions on the subject.
References
[1] G. Angelopoulos and I. Pitas, "Two-dimensional FFT algorithms on hypercube and mesh machines", Signal Processing, Vol.30, No.3, pp.355-371, February 1993.
[2] F. Arguello and E.L. Zapata, "Fast cosine transform based on the successive doubling method", Electronics Letters, Vol.26, No.1, pp.1616-1618, September 1990.
[3] B. Catanzaro, "Multiprocessor system architectures", Sun Microsystems, Inc., USA, 1994.
[4] S.C. Chan and K.L. Ho, "A new two-dimensional fast cosine transform", IEEE Trans. on Signal Processing, Vol.39, No.2, pp.481-485, February 1991.
[5] C.A. Christopoulos and A.N. Skodras, "Pruning the two-dimensional fast cosine transform algorithm", Proceedings of the European Signal Processing Conference (EUSIPCO 94), Edinburgh, Scotland, pp.569-599, 13-16 September 1994.
[6] C.A. Christopoulos, A.N. Skodras and J. Cornelis, "Comparative performance evaluation of algorithms for fast computation of the two-dimensional DCT", IEEE Benelux & ProRISC Workshop on Circuits, Systems and Signal Processing, Papendal, Arnhem (The Netherlands), pp.75-79, March 24, 1994.
[7] B.K.T. Ho and H.K. Huang, "Specialized module for full-frame radiological image compression", Optical Engineering, Vol.30, No.5, pp.544-550, October 1991.
[8] H.S. Hou, "A fast recursive algorithm for computing the discrete cosine transform", IEEE Trans. Acoustics, Speech and Signal Processing, Vol.35, No.10, pp.1455-1461, October 1987.
[9] M. Kunt, M. Benard, and R. Leonardi, "Recent results in high-compression image coding", IEEE Trans. on Circuits and Systems, Vol.34, pp.1306-1336, November 1987.
[10] A.N. Skodras, "Fast discrete cosine transform pruning", IEEE Trans. on Signal Processing, Vol.42, No.7, July 1994.
[11] A.N. Skodras and C.A. Christopoulos, "Split-radix fast cosine transform", Int. J. Electronics, Vol.74, No.4, pp.513-522, 1993.
[12] A.N. Skodras and A.G. Constantinides, "Efficient input-reordering for fast DCT", Electronics Letters, Vol.27, No.21, pp.1973-1975, October 1991.
PARALLEL GREP
J. CHAMPEAU, L. LE PAPE†, B. POTTIER

Laboratoire d'Informatique de Brest, Université de Bretagne Occidentale, BP 809, 29285 Brest cedex, France, {champeau,pottier}@univ-brest.fr

†Projet API, IRISA, Campus de Beaulieu, 35042 Rennes cedex, France, [email protected]
ABSTRACT. ArMen is a parallel MIMD machine with a global linear FPGA network that can be used as a local or global accelerator. agrep is a program to search for patterns in texts. It uses a fast sequential algorithm based on bit registers and a few simple operations such as or, shift and and. This algorithm has been proposed by Wu and Manber. Parallel agrep is an implementation of agrep on the ArMen machine, where the global linear FPGA network is configured as a shared operator used in an SPMD mode. The recurrent solution from the original agrep has been transformed into a cellular automata program for the FPGA ring. The use of a cellular automata compiler (CCEL) allows a quick synthesis of the FPGA configuration files for a direct test on ArMen. This solution remains scalable and could be used in new architectures. The development methodology is suitable for fast application prototyping.

KEYWORDS. Pattern matching, FPGA synthesis, regular array architecture, FPGA integrated in MIMD machine.

1 INTRODUCTION
Hardware and software co-design is currently investigated as a method for fast application prototyping. An important characteristic of co-design is the presence of a fixed framework where specific hardware components and software functions are embedded. As it is proposed in Gupta, Coelho and De Micheli's model [7], as well as in the PRISM approach [8], the basic architecture has a standard processor, memory, and a coprocessor part that can be statically or dynamically adapted to the application. The software part can be a standard compiler able to generate hardware descriptions for the components of the coprocessor, calls to these components, and maybe reconfiguration of the coprocessor.
An important motivation for co-design methodologies is the availability of huge integration resources that will allow low cost production of circuits having a standard processor and a variable part: specific or reconfigurable unit. Applications are intensive computations or controllers. As these machines will take their control flow from a sequential program they can be described as sequential (SISD) with an adaptable part. Mixing MIMD and adaptable hardware is attractive for at least two reasons. First, the use of adaptability mixed with SISD is interesting as a basic component into high speed parallel architectures. Second, the presence of a global adaptable part inside the MIMD framework has a very important impact on the architectural model. The ArMen Project has been started in 1990 by the computer science laboratory (LIBr) at Brest to address this question. The architectural framework is a general purpose multiprocessor that is programmed under the control of an operating system. This does not exclude the use of the machine under a data-parallel compiler. The adaptable part of the machine is a ring associating local adaptable parts (FPGAs) that will implement global coprocessors. To preserve MIMD extensibility and scalability the ring has internal asynchronous communications on a parallel data path. These coprocessors can be used to implement local accelerators as in the sequential codesign model. They can also help to build additional shared functions that can be classified in three categories:
• global controllers that have their own control system and are loosely coupled to the nodes. Distributed programs communicate with these functions by writing local values and reading back global results.
• global operators that are used sporadically by the whole machine acting as an SPMD computer. The main data flow comes from the local memories, but there exist local inter-node communications at the operator level.
• specific communication schemes where the ring implements specific control systems and data paths. Examples are support of systolic and all-to-one communication.

Usually several of these models coexist in the same coprocessor design. As an example, the SPMD phases that are necessary for systolic communications or global operators are obtained after a synchronization barrier computed by a global controller. The aim of this paper is to describe the methodology used in the implementation of a global operator for a specific function. The whole program is made of a software part computed on the array of processors, and a hardware part computed on a global operator. The software part is an array of processes synchronized during their access to the global operator. The C program describing these processes addresses the hardware using memory-mapped FPGA-based registers. The hardware part is described under the cellular automata model: small specific processors computing on a set of neighbour values. Several algorithms have been implemented using this method, among which we notice erosion and dilatation for image processing, lattice gas simulation and string recognition. To address the string recognition problem, we have deliberately started from a very efficient sequential algorithm (agrep) described by Wu and Manber in [9]. This algorithm has been used to implement the agrep command that allows search operations with a fixed number of errors, wildcards, exact
matching on fixed parts of a pattern, ... Here the purpose is to show that specific global operators are effectively faster than the sequential use of an ALU, and that they allow the co-designed solution to scale in the MIMD framework. The paper is divided in two parts. First, there is a short description of the experimental ArMen computer and explanations about the global operator behaviour. The second part consists of a review of the agrep algorithm. Then it is shown that a transformation of the basic recurrence formula provides a cellular automata definition suitable for compilation onto the ArMen FPGAs.

2 ArMen: A PARALLEL FPGA-BASED ARCHITECTURE
ArMen is a modular distributed memory architecture where processors are tightly connected to an FPGA ring. This first section gives details on the implementation, and on the internal architectures of the synthesized coprocessors.

2.1 Machine Presentation
Each ArMen node has a processor connected to a bank of memory and a Xilinx FPGA (see Figure 1). Current hardware uses transputers and 3090/3195 FPGAs on VME-like boards. The processor loads configuration data into the FPGA within a 100 ms delay using memory-mapped registers. The FPGA topology is an array of cells with four 32-bit ports. Its north ports are connected to the local address/data multiplexed system bus. West and east ports are externally accessible on front edge connectors. Ribbon cables plugged into these connectors allow nodes to be assembled into rings. South ports are available to implement input/output, or to connect to another ArMen node. Figure 1 shows a ring with two nodes. The ArMen FPGAs are local coprocessors for each node, or parts of a global shared coprocessor for the processor array.
Figure 1: Two node ring architecture

The transputer/FPGA interface module provides: (1) execution of master/slave handshaked or buffered exchanges, (2) generation of interrupts to the processor, (3) DMA cycles, (4) coprocessor spying of the system bus.
The ArMen machine is a set of such nodes interfaced to a workstation hosting a transputer board. The machine executes the TROLLIUS [4] operating system to ease application development. TROLLIUS provides concurrent input/output, routing facilities and support to load and observe processes on the nodes.

2.2 Shared Coprocessors
The logic layer (ring of interconnected FPGAs) of the parallel machine provides support for a large diversity of global coprocessors. They combine operational and control functions for specific tasks.
Global Controllers: An implicit model for global computations is based on pipelines with one stage within each node. In the pipeline mode, data is encapsulated into tokens to ensure the synchronism of their collection in the node. An automaton in node 0 co-ordinates the pipelines [6]. The node processor interacts with its pipeline stage by reading or writing FPGA-based registers. Generally, local programs simply drop significant information into FPGA-based double buffered channels. Global predicates like termination detection or synchronization conditions can be computed with these global controllers. In the current implementation, tokens circulate faster than 100 ns per node with inter-FPGA asynchronous handshaking and a 20 MHz local clock for the FPGAs. A dedicated software tool based on the Unity formalism allows these circuits to be synthesized [5].

Global Operators: A global operator uses the set of local memories as a large contiguous data store, and combines the actions of the processor array with FPGA synchronous processing and micro-grain communications. A global operation occurs in one machine cycle. During this cycle the processor array pushes data in memory to a global operative unit. A move from the FPGA array to the local memories usually follows the first transfer. A linear cellular automata model is suitable for this kind of computation [3]. This model is used in the implementation of the agrep algorithm on the ArMen machine. It is described in more detail in the next section.

3 GLOBAL OPERATORS
During a computing cycle on a global operator (also called shared operator), one or several linear arrays of data are extracted from the node memories, they are passed as operands to the operator, and usually a result is fetched to be written back to the memories. As an example, it is possible to consider a whole line from an image as an operand. In the same way that operands can cross the node word boundaries, the basic element of information can also be redefined, for example as 1, 2 or 4 bit cells. Thus, a global operand is an array of cells, and its length matches the total linear data path of the node array.

3.1 Architectural Model
The global operators must have registers to hold the operands and an array of elementary processors producing a result in another register. It is reasonable to define them as a global array of local computations in order to keep a low delay in producing results. These
considerations lead to the use of the cellular automata model. Cellular automata have a data space evolving synchronously by local transformations. They are defined by a neighborhood relative to a central position, and a transition function applied to the neighborhood to transform the central position. We can see that cellular automata can be mapped on the global operator concept provided that the number of registers is equal to the height of the neighborhood. A side-effect of this model is that the set of registers must be able to shift their contents to avoid a full reloading from the node processor. Figure 2 shows a possible implementation of the model on ArMen. Each FPGA has a fixed interface module to the node processor, a bank of registers that shift data down, and an array of simple processors. The register bank as well as the processor array are connected to the lateral sides of the FPGA. The interest of such an architectural model is to allow an automatic generation of the hardware, based on a behavioral high level specification of the cellular automata.
Figure 2: The global operator architecture

Given such an operator addressed by a pointer, the node processors can transform a set of values using simple memory accesses: *FPGA = *sourcePtr++; *destPtr++ = *FPGA; ... During the accesses, a synchronizer ensures data consistency by locking the processor interface memory until the adjacent nodes are ready.
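Spelled out, the access pattern above amounts to a loop of paired write/read accesses on the memory-mapped operator. A minimal sketch (the register address and cell width are our assumptions, not ArMen's actual memory map):

  #include <stddef.h>

  /* Hypothetical memory-mapped address of the FPGA operator register. */
  #define FPGA_REG ((volatile unsigned long *)0x40000000UL)

  /* Push one line of cells through the shared operator and collect the results.
     Each write presents an operand word; the following read fetches the
     transformed word once the inter-node synchronizer has released the access. */
  void transform_line(const unsigned long *src, unsigned long *dst, size_t nwords)
  {
      volatile unsigned long *fpga = FPGA_REG;
      size_t i;

      for (i = 0; i < nwords; i++) {
          *fpga = src[i];     /* operand word goes to the operator        */
          dst[i] = *fpga;     /* result word comes back from the operator */
      }
  }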
3.2 Data Folding
To preserve the scalability of the MIMD framework, the global operator must support some kind of data folding. The most efficient method is to divide the data plane into slices that will be transformed sequentially. This implies special processing at the side of each slice to take care of data dependencies. Notice that step t needs lateral information from step t - 1 and produces lateral information for the transformation of adjacent slices at step t + 1. A special node called the margin node manages the lateral stripes for the operator [3]. The hardware part for this node is also produced by the CCEL compiler, described in section 3.3. The software part must be revised to sweep the operands horizontally in order to simulate a larger hardware. Unlike on a sequential computer, during the word by word sweeping of a large operand, the margin node eliminates the overhead of inter-word data dependency management.
3.3 Hardware Synthesis
The behavior of the operator is fully separated from the application software part. This means that an operator definition can be reused easily, and that an application sees the operator as an interchangeable black box with memory-based addresses allowing operands to be pushed or results to be fetched. Another consequence is the possibility of defining the operator using a specific tool. The CCEL compiler has been designed for this purpose. A CCEL program is partly based on the syntax of the C language. It has three parts:

• a declarative structure which describes the topology of the neighborhood using combinations of cardinal directions and a size in bits of the cells. Internal variables of a cell can be defined with the CellStat type.
• a set of C functions that operate on the neighborhood. Each function will be translated into an array of elementary processors.
• a main function which schedules the functions.

The compiler computes the shift register bank from the neighborhood definition. It produces a table of input and output values for each function. All the values are converted to boolean variables to build truth tables that are minimized using a logic optimizer such as ASYL or SIS. This first step provides a boolean definition of the local behavior of an elementary processor. In a second step CCEL replicates these descriptions to produce a full processing element for one FPGA. Last, there is a binding to standard components such as the processor interface, the handshake automata, and the FPGA ports. The design is given to technology mapping tools to produce a bitstream that will be loaded into the FPGA by the node processor.
4 AGREP ALGORITHM
For the string-matching problem, we are searching for a string P = p1 p2 ... pm inside a large text T = t1 t2 ... tn, both sequences of characters from a finite character set Σ. We want to find all occurrences of P in T. However, we do not always know exactly the pattern or the text. For example, we may not remember the exact spelling of a name we are searching for, the name may be misspelled in the text, or the text may be a sequence of DNA molecules in which we are looking for approximate patterns. The approximate string-matching problem allows finding substrings in T close to P. The most common measure of closeness is known as the edit distance. The Smith and Waterman algorithm is often implemented to compare biosequences [1, 2]. In the article [9], Wu and Manber present "a new algorithm which is very fast in practice, reasonably simple to implement, and supports a large number of variations of the approximate string-matching problem". The algorithm can handle most of the common types of queries, including arbitrary regular expressions, and several variations of closeness measures.
In this section, we begin by describing the algorithm for the pure string-matching problem. Then, we present the approximate matching. We do not describe in detail the variations and extensions of the basic algorithm of Wu and Manber.

4.1 Exact Matching
The original algorithm uses a bit array R of size m (the size of the pattern). R_j represents the value of the array R after the j-th character of the text has been processed.

• R_j[1] = 1 if the current letter t_j of the text matches the first letter p_1 of the pattern.
• R_j[2] = 1 if t_j = p_2 and t_{j-1} = p_1.
• In general, R_j[i] = 1 if the first i characters of the pattern match exactly the last i characters up to j of the text: p_1 p_2 ... p_i = t_{j-i+1} t_{j-i+2} ... t_j.

The transition from R_j to R_{j+1} follows the rule:
R_{j+1}[i] = 1 if R_j[i-1] = 1 and t_{j+1} = p_i, and 0 otherwise.   (1)
There is a match between the whole pattern and the last m characters up to j of the text when R_j[m] = 1. To speed up the algorithm, the authors of [9] suggest the use of a bit array S_i for each character s_i in the alphabet, where S_i denotes the indices of character s_i in the pattern. So, we have S_i[r] = 1 if p_r = s_i. Then, Wu and Manber show that the algorithm can be written with only a right shift of R_j (which fills the vacated first position with a 1) and an AND operation with the corresponding mask S_i: R_{j+1} := Rshift[R_j] AND mask[t_{j+1}]. An example is given in Table 1, where the text is aabaacaabac and the pattern is aabac.
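As an illustration of this bit-parallel formulation, here is a minimal C sketch of the exact-matching recurrence (our own code, not the agrep implementation; it uses the opposite bit ordering, so the paper's right shift becomes a left shift, and it assumes the pattern fits in one machine word):

  #include <stdio.h>
  #include <string.h>

  /* Shift-and exact matching: bit i of R is set when the first i+1 pattern
     characters match the text ending at the current position. */
  void shift_and(const char *text, const char *pattern)
  {
      unsigned long mask[256] = {0};
      size_t m = strlen(pattern), n = strlen(text);
      unsigned long R = 0, found = 1UL << (m - 1);

      for (size_t i = 0; i < m; i++)                /* S_i masks: bit r set iff p_r == s_i */
          mask[(unsigned char)pattern[i]] |= 1UL << i;

      for (size_t j = 0; j < n; j++) {
          R = ((R << 1) | 1UL) & mask[(unsigned char)text[j]];
          if (R & found)
              printf("match ending at position %zu\n", j);
      }
  }

  int main(void)
  {
      shift_and("aabaacaabac", "aabac");            /* the example of Table 1 */
      return 0;
  }

Running it on the example of Table 1 reports a single match, ending at the last position of the text.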
4.2 Approximate Matching
Wu and Manber show that the previous algorithm supports various kinds of errors (insertions, deletions or substitutions) when a fixed set of registers is maintained per error level. Actually, d additional arrays R^1, R^2, ..., R^d must be maintained to store all possible matches with up to d errors. To determine the transition from R^d_j to R^d_{j+1}, there are four possibilities for obtaining a match of the first i characters with <= d errors up to t_{j+1}:

• There is a match of the first i-1 characters with <= d errors up to t_j and t_{j+1} = p_i. This case corresponds to matching t_{j+1}.
• There is a match of the first i-1 characters with <= d-1 errors up to t_j. This case corresponds to substituting t_{j+1}.
• There is a match of the first i-1 characters with <= d-1 errors up to t_{j+1}. This case corresponds to deleting p_i.
• There is a match of the first i characters with <= d-1 errors up to t_j. This case corresponds to inserting t_{j+1}.
After this demonstration, the authors give a recurrent specification of the transition from R^d_j to R^d_{j+1}. Initially, R^d_0 := 11...100...0, with d ones. Then:
R^d_{j+1} := Rshift[R^d_j] AND mask[t_{j+1}] OR Rshift[R^{d-1}_j] OR Rshift[R^{d-1}_{j+1}] OR R^{d-1}_j   (2)

Table 1: an example of masks, exact matching and approximate matching with one error

  Masks (pattern aabac):
           a a b a c
    a      1 1 0 1 0
    b      0 0 1 0 0
    c      0 0 0 0 1

  R^0 (exact matching), one row per text character of aabaacaabac:
           a a b a c
    a      1 0 0 0 0
    a      1 1 0 0 0
    b      0 0 1 0 0
    a      1 0 0 1 0
    a      1 1 0 0 0
    c      0 0 0 0 0
    a      1 0 0 0 0
    a      1 1 0 0 0
    b      0 0 1 0 0
    a      1 0 0 1 0
    c      0 0 0 0 1

  R^1 (matching with one error), one row per text character of aabaacaabac:
           a a b a c
    a      1 1 0 0 0
    a      1 1 1 0 0
    b      1 1 1 1 0
    a      1 1 1 1 1
    a      1 1 1 1 1
    c      1 1 1 0 1
    a      1 1 0 1 0
    a      1 1 1 0 0
    b      1 1 1 1 0
    a      1 1 1 1 1
    c      1 1 0 1 1

Table 1 presents an example of R^0 and R^1, where the text is aabaacaabac and the pattern is aabac.
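For completeness, a hedged C sketch of this d-error recurrence follows (again our own code with the bit order reversed, so every Rshift becomes a left shift that also injects the implicit leading 1; the pattern length is limited to one machine word):

  #include <stdio.h>
  #include <string.h>

  #define MAXERR 8   /* illustrative bound on d */

  /* Report every text position at which the pattern matches with at most d errors. */
  void agrep_bits(const char *text, const char *pattern, int d)
  {
      unsigned long mask[256] = {0};
      size_t m = strlen(pattern), n = strlen(text);
      unsigned long R[MAXERR + 1], newR[MAXERR + 1];
      unsigned long found = 1UL << (m - 1);
      int l;

      for (size_t i = 0; i < m; i++)
          mask[(unsigned char)pattern[i]] |= 1UL << i;

      /* R[l] initially holds l low-order ones: prefixes of length <= l match trivially. */
      for (l = 0; l <= d; l++)
          R[l] = (1UL << l) - 1;

      for (size_t j = 0; j < n; j++) {
          unsigned long c = mask[(unsigned char)text[j]];
          newR[0] = ((R[0] << 1) | 1UL) & c;
          for (l = 1; l <= d; l++)
              newR[l] = (((R[l] << 1) | 1UL) & c)   /* match t_{j+1}      */
                      | (R[l - 1] << 1)             /* substitute t_{j+1} */
                      | (newR[l - 1] << 1)          /* delete p_i         */
                      | R[l - 1];                   /* insert t_{j+1}     */
          memcpy(R, newR, sizeof R);
          if (R[d] & found)
              printf("match with <= %d errors ending at position %zu\n", d, j);
      }
  }

  int main(void)
  {
      agrep_bits("aabaacaabac", "aabac", 1);   /* the example of Table 1 */
      return 0;
  }

With the Table 1 example and d = 1 it reports matches ending at text positions 3, 4, 5, 9 and 10 (0-based), exactly the rows of R^1 whose last column is 1.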
PARALLEL
AGREP
ON ArMen
MACHINE
For the Agrep implementation, the specification combines the p r o g r a m m i n g model, cellular a u t o m a t a , and the execution model based on a linear array of operators. T h e OCEL tool is used for synthesizing this specification. T h e characteristics of the algorithm are closed to the cellular a u t o m a t a with local d a t a dependencies and a recurrence for the d registers computation. T h e current state is comp u t e d with the previous state register. A c o m p u t a t i o n of a register is parallel for all the bits of this register.
5.1
Mapping
On Global Operators
T h e exact matching is obtained using a one dimension cellular a u t o m a t o n where each cell is one bit of R ~ T h e neighborhood definition maps the Rshift[R~i_l] on a west cen. T h e mask handle of the current character in the text falls outside the cellular a u t o m a t a context
(2)
253
Parallel Gap
because the whole data space uses the same current character. In our implementation, the mask is managed by a specific type of variable in the CCEL tool. For d errors, the algorithm introduces d registers of size m. These registers create a bidimensional array of size m • d. This array is similar to a bidimensional cellular automata data space. For the general equation, the four terms are not managed in the same way. On the one hand, the exact matching, insertion and substitution terms are respectively represented by the west, south and south-west cells. On the other hand, the deletion term is computed in the same step, the left part of the equation. Following formula 2, one can see that the recurrence topology is as in table 2.
l do-orsiii [ d-1
.....R,h StIR )
errors I R s h i f t [ R ~ - ' ]
w'-"
"
SW t-l ,'. . . .
i d-
I
sw!,,
R d-t
... S t'''l ] step t - 1
.J ' ~ _. ,Rj+,,,.c
] t
I ..... top I
,
]
,,
,
Table 2: Mapping
Figure 3 shows the neighborhood and the deletion term that cannot be represented by a cell of the neighborhood. So the bidimensional array must be processed line by line, R^d after R^{d-1}. The execution model of the global operator, based on a linear operator array, is suitable for this type of computation.
Figure 3: The resulting neighborhood and the implementation

Due to the FPGA data pipeline, the deletion term cannot be pushed in time. This term is stored in the FPGA in order to be used in the next computation. With respect to the execution model, the deletion term is processed in a parallel redundant computation. Actually, the result of a basic operator cannot be communicated to another operator. This computation increases the neighborhood definition with the west-west and the south-west-west cells. For the MIMD part, the error registers and the mask register are cut into slices and distributed over the nodes of the ArMen machine. On the other hand, the whole text is distributed over all nodes. With this specification, the processor program is as follows:
  for each letter in the text
    writeFPGA mask[letter]
    for 0 to d error registers
      writeFPGA register
      readFPGA register

In this program, for each letter of the text, the appropriate mask is written into the FPGA. Afterwards, all the registers are computed by d + 1 write-read cycles. Now, the internal FPGA architecture is synthesized respecting the specification and the processor program.

5.2 The CCEL Program and the Synthesized Circuit
As we have described in section 3.3, the circuit is automatically synthesized with the CCEL tool. A CCEL program is based on a declaration structure and C functions. The neighborhood is defined in the declaration part with one-bit cells. Three CellStat variables are declared which allow the mask, the deletion term and a counter to be stored.

  fonc(ww, sww, w, sw, s, centre)
  int ww, sww, w, sw, s, centre;
  {
    int tmp;
    switch (counter) {
      case 0: counter = 1; mask = (centre & 0x1) | (w << ...
    }
  }
Figure 4: The C function of the CCEL program

The counter is necessary to express the scheduling in the FPGA corresponding to the different cases of the switch instruction:

• First, the two mask bits are stored in the variable mask. Two bits are used: the masks of the current position and of the west position are stored. The west one is useful for the delTerm computation.
• Second, the delTerm variable is initialized for the R^0 computation.
• Case d is the computation of the last register. The mask, counter and delTerm
• The default case is the standard computation. The new value of R^l_{j+1} (with 0 < l < d) is combined with the current delTerm value and the value of the delTerm for R^{l+1}.

This program (figure 4) is compiled to an FPGA platform. The synthesized circuit computes 16 bits in parallel in a circuit. The logic resources of the Xilinx 3090 (320 CLBs) are 38% used. With a 7 processing node machine, the pattern is 112 bits long. For a larger pattern, the margin node is used to partition the pattern. A longer pattern is divided into 112-bit patterns which are computed sequentially. The margin node manages the consistency between the processing actions on the 112-bit patterns. Another use of this special node is to detect a successful match of the pattern. The detection condition is when the last position of the registers is set. A 1 is produced to the margin node by FPGA communications and the position of the pattern in the text is detected.

6 RESULTS AND CONCLUSION
The original algorithm described by Wu and Manber was known to be limited in the case where the size of the pattern exceeds the size of a machine word. This difficulty is removed by assembling processors to share a global operator. Moreover the solution is scalable and can work on a general purpose computer that will also be able to execute higher level algorithms. A naive theoretical interpretation of the implementation can be given as follows. Every Write-Read cycle on the cellular automata involves two calls to a barrier with the adjacent nodes. A barrier has a minimum delay of Tb. By simulating the inner loop on no-wait-state memories, one can compute Tw, the delay from a Write operation to the next Read operation, and respectively Tr between two successive Read and Write operations. Then the total delay for a cycle on the shared operator is Tw + Tr + 2 x Tb. Practically, Tb has been measured as 175 ns (Xilinx 3090 clocked at 20 MHz). This rather long delay is implied by the current implementation based on FPGA automata clocked at 20 MHz. In the best conditions (stack and code in internal memory), a 25 MHz transputer can execute a memory based Read-Write cycle, including pointer updates, in a 1.3 µs delay. It is difficult to give more accurate information about theoretical performance, because the transputers have a very specific stack oriented instruction set, and also because of the unpredictable delays of combinatorial functions in programmable circuits. So each operation for cellular automata access has a cost of 1.5 µs. Considering the algorithm itself, the computation of a match with d errors accepted requires d + 3 Write-Read cycles on the operator in the case where the machine data path matches exactly the pattern size (m). An exact matching is detected after 3 x m cycles. If the lines to be searched are provided using communication links, this data rate must at least be equivalent to the computation rate. In the transputer case we have 20 Mbit/s links that provide a data rate of roughly 1.8 Mbyte/s. If the data base is distributed on disks, this means that each node can receive a 32-bit word from an operand every 2.5 µs, a period that must be compared to the 1.5 x (d + 3) µs computation rate. Buffering and overlapping the data input easily provides the data to the nodes in a scalable way. Short investigations on
current RISC processors show that the global operator cycle could be at least 10 times faster than on ArMen: current processors and FPGAs appear to be able to execute a global operator cycle within 100 ns. Another interesting point is the speed of development on ArMen for such a problem. The cellular automata generator has been used to develop solutions for applications such as binary image processing (erosion, dilatation), 2D gas diffusion simulation, bit-serial computations, and pattern matching. Typical development cycle times for such circuits are in the order of 1 to 3 hours.

Acknowledgements
This work has been achieved within the ArMen project, which has been funded by PRC/GDR "Architectures Nouvelles de Machines" and C3. Many thanks are also given to the "Réseau Doctoral en Architectures des Machines et des Systèmes de la DRED" as well as Région Bretagne and ANVAR.

References
[1] D. Archambaud, I. Saraiva Silva and J. Penné. Systolic Implementation of Smith and Waterman Algorithm on a SIMD Coprocessor. In M. Moonen, editor, Proceedings of the 3rd International Workshop on Algorithms and Parallel VLSI Architectures, Leuven, Belgium, August 1994. Elsevier.
[2] L. Audoire, J. Codani, D. Lavenier and P. Quinton. Machines spécialisées pour la comparaison de séquences biologiques. Technique et Science Informatiques (TSI), 13(6), December 1994.
[3] K. Bouazza, J. Champeau, P. Ng, B. Pottier, and S. Rubini. Implementing cellular automata on the ArMen machine. In P. Quinton and Y. Robert, editors, Proceedings of the 2nd International Workshop on Algorithms and Parallel VLSI Architectures, pages 317-322, Bonas, France, June 1991. Elsevier.
[4] G. Burns, V. Radiya, R. Daoud, and R. Machiraju. All About TROLLIUS. Occam User Group Newsletter, pages 55-70, July 1990.
[5] P. Dhaussy, J.-M. Filloque, B. Pottier, and S. Rubini. Global Control Synthesis for an MIMD/FPGA Machine. In D. Buell and K. Pocek, editors, IEEE Workshop on FPGAs for Custom Computing Machines, pages 51-58, Napa, California, April 1994. IEEE Computer Society Press.
[6] J.-M. Filloque, E. Gautrin, and B. Pottier. Efficient global computation on a processor network with programmable logic. In E. Aarts, J. van Leeuwen, and M. Rem, editors, Proceedings of PARLE'91, pages 55-63, Eindhoven, NL, June 1991. Springer-Verlag.
[7] R.K. Gupta, C. N. Coelho, and G. De Micheli. Program Implementation Schemes for Hardware-Software Systems. IEEE Computer, 27(1):48-55, January 1994.
[8] M. Wazlowski, L. Agarwal, T. Lee, S. E. Lam, P. Athanas, H. Silverman, and S. Ghosh. PRISM-II Compiler and Architecture. In D. A. Buell and K. L. Pocek, editors, IEEE Workshop on FPGAs for Custom Computing Machines, Napa, California, April 1993.
[9] Sun Wu and Udi Manber. Fast Text Searching Allowing Errors. Communications of the ACM, 35(10):83-91, October 1992.
COMPILING FOR MASSIVELY PARALLEL ARCHITECTURES: A PERSPECTIVE

P. FEAUTRIER
Laboratoire PRISM, Université de Versailles Saint-Quentin, 45 Avenue des États-Unis, 78035 VERSAILLES CEDEX, FRANCE, [email protected]

ABSTRACT. The problem of automatically generating programs for massively parallel computers is a very complicated one, mainly because there are many architectures, each of them seeming to pose its own particular compilation problem. The purpose of this paper is to propose a framework in which to discuss the compilation process, and to show that the features which affect it are few and generate a small number of combinations.

KEYWORDS. Massively parallel compilers, automatic parallelization.

1 A FRAMEWORK FOR DISCUSSING MASSIVELY PARALLEL COMPILATION

The problem of automatically generating programs for massively parallel computers is a very complicated one, mainly because there are many architectures, each of them seeming to pose its own particular compilation problem. The purpose of this paper is to propose a framework in which to discuss the compilation process, and to show that the features which affect it are few and generate a small number of combinations. We will first introduce some notations for discussing parallel programs. We will then restate in our framework several classical techniques: memory expansion, scheduling, partitioning and tiling. It is then possible to explore the spectrum of parallel architectures, and show that each of them may be programmed by one of the above techniques, or by a combination of them. In the conclusion, we will point to several short range and long range unsolved problems.

1.1 Static Control Programs
An operation of a program is one execution of an instruction. While the number of instructions is roughly proportional to the size of the program text, the number of operations is
proportional to the running time of the program and may vary according to the size of the data. W h a t is to be taken as an instruction depends on the purpose of the analysis. In the case of source-to-source parallelization, instructions are identified with simple statements in the source high level language, e.g. with Fortran assignments. At present, paraUelization techniques apply only to static control programs, i.e. programs for which one may describe a priori the set of operations which are going to be executed in a given program run. Static control program are built from assignment statements and DO loops. The only data structures are arrays of arbitrary dimension. For technical reasons, loop bounds and array subscripts are restricted to affine forms in the loop counters and integral structure parameters, which are assumed to be known at the program start. For such a program, we may use linear algebra as a tool for semantical analysis, which means that the structure parameters can be carried as variables all through the compilation process. Generally, the result of a program depends on the order in which its operations are executed. The fact that operation u is executed before operation v is written u -~ v. If several programs are under discussion, their execution orders will be distinguished by subscripts. A sequential program is associated to a total execution order. A parallel program is associated to a partial execution order, and a particular run is associated to a total extension of the execution order. Two operations u and v are independent if their order of execution can be reversed without changing the global effect on the program store. There is a simple sufficient condition for independence, Bernstein's condition. If this condition is not satisfied, the operations u and v are said to be dependent, written u .l_ v. Suppose that two programs have the same set E of operations. One of them is sequential, with total execution order -~. The other one has execution order -~//, presumably parallel. One may show that a sufficient condition for the equivalence of these two programs is: Vu, v E E : u .l_ v ^ u -~ v =~ u -/v,
(1)
in words, that if two operations are dependent, then they are executed in the same order in the sequential and the parallel version. Another equivalent formulation is that the execution order of any correct parallelization must be an extension of the transitive closure ~DDG of the relation -~ N • the so-called Detailed Dependence Graph (DDG) of the source program. As a consequence, the parallel compilation process may be divided in two steps: firstly, compute the DDG of the source program, then select any extension of "~DDa which can be executed efficiently on the target architecture. With the exception of very simple p r o g r a m s - the basic blocks -: this proposal cannot be carried out literaly, since the size of the DDG is enormous and may vary from run to run. The concern of most parallelization methods is to construct a summary of the DDG, the objective being to keep just enough information for the construction of a parallel program.
2 BASIC ANALYSIS TECHNIQUES
2.1 Array Dataflow Analysis
Each edge in the dependence relation may be seen as a constraint on the final parallel program. There is an obvious interest in removing as many edges as possible. Some edges are related to memory reuse (the so-called output- and anti-dependences) and can be removed by modifying the data structure of the program. The remaining edges represent dataflow from a definition to a use of the same memory cell. However, a definition may be killed before being used by another operation. The aim of array dataflow analysis is to characterize in a compact way the set of proper flow dependences of a program. Let us consider an operation u in which a memory cell c is used. Let us write W(c) for the set of operations which write into c. The source of c at u is the latest write into c which precedes u: source(c, u) = m ax{v I v -~ u, v E W(r In most cases of interest, the source function is a piecewise affine function of the coordinates of the iteration vector x of u. In case of need, such a piece will be written: source(c,
(R,x))= (S, HRsx + hRs),
(2)
HRs is a matrix and hRs is a vector of suitable dimensions.
where
It may happen that some memory cell is not defined in the program fragment under study. In that case, we use a special operation which may be interpreted as the initial program loading. To simplify the exposition, we will suppose here that all initializations have been made explicit in the program text, and hence that this operation never occurs in our sources. There are many practical techniques for computing the source function. See for instance [5, 11, 10]. Consider the following matrix multiplication code: do J.= 1,n do j = 1,n 1
2
3
4
a(i,j) : ... b(i,j) = ... end do end do do i = 1,n do j = 1,n c(i,j) = o.o dok = t,n
c(:i.,j) = c(:i.,j) + a ( i , k ) * b ( k , j ) end do end do end do
The operations of this program are the instances of its four statements which are induced by the loops iterations. One iteration of e.g. statement 4 for given values i, j, k of the loop counters will be written (4, i, j, k/.
P. Feautrier
262
The various source functions are:
source(c(i,j),(4,i,j,k)) source(a(i,k),(4,i,j,k)) source(b(k,j), (4,i,j,k))
= ifk E 2then (4, i , j , k - 1)else (3, i,j), = (1,i,k), = (2, k,j).
(3) (4) (5)
When the source function is known, a new version of the equivalence condition (1) can be stated. L e t / ~ ( u ) be the set of memory cells which are used by operation u. One must have:
(6)
v,,, v,: e
in words, that the source of a value in any operation is executed before that operation in the parallel version of the program. The Dataflow Graph (DFG) of the program is a graph whose vertices are the operations. There is an edge from v to u iff v is the source of a value which is used by u. However, using any execution order which satisfies (6) for constructing a parallel program will give an incorrect result, because output- and anti-dependences have not been taken into account. One can get rid of these dependences by memory expansion. We will suppose here, for simplicity, that each operation returns only one result. Let us associate to each operation u one distinct memory cell M[u]. Let us write the statement associated to u as: a:= f(...,c,...). Consider the program in which operation u executes the following statement:
M[ul := f(.
. ., M[source(c, u)],...).
(7)
Since each operation u is executed only once, and since the result location M[u] is in oneto-one correspondence with u, this program has the single assignment property. When the program starts, all memory cells are undefined. The cell M[u] gets a value when u is executed, and this value does not change until the end of the program. One may prove that this single assignment program, executed according to any order -/which satisfies (6) is equivalent to the original sequential program. Single assignment programs have the property that their space complexity and time complexity are the same. This is in contrast to the situation which prevails for most scientific computing algorithms, where the space requirement is an order of magnitude less than the time requirement. It is likely that in many cases, one can obtain a correct parallel program at a much lower cost in memory expansion. 2.2
Scheduling
The problem is now to specify an execution order for the parallel program, i.e. an extension of the DDG or of the DFG. The specification of an order on a large set is very complicated. Our aim here is to find simple representations, even if we have to sacrifice some parallelism in order to achieve simplicity. The use of a schedule, i.e. of a function which maps the set of operations to "time" (i.e. to any linearly ordered set) is such a simple representation.
Compiling for Massively Parallel Architectures
263
To any function 0 mapping the set of operation to an ordered set, we may associate the partial order: u -~0 v = e ( u ) < 0(v).
Usually, the domain of 0 is taken to be the integers. 0 is a valid schedule if the corresponding order satisfies (1) or (6), i.e: z ~ ^
~, -~
~ ~ 0(~) < e(~),
(8)
or
Vu, Va E ~ ( u ) : 0(source(a, u)) < O(u).
(9)
Part of the difficulty of the scheduling problem is that these constraints have many solutions. The usual approach is to find the "best" schedule according to some quality criterion, the most widely used figure of merit being the maximum value of 0 or latency. It is clear that the source relation is included in the DDG. Hence, (8) is a tighter constraint than (9). As a consequence, the valid schedules according to (9) have a lower latency than those which satisfy (8). The price to pay is that if a schedule for (9) is used, the data space has to be expanded in order to restore correctness. In the case of our running example, the following functions:
0(1, i, j) = O,
θ(1, i, j) = 0,   θ(2, i, j) = 0,   θ(3, i, j) = 0,   θ(4, i, j, k) = k
.r(t) = {,, l e(u) = t}. ~'(t) is the front at time t. If 8 satisfies (8), it is clear that all Operations inside a front are independent, i.e. can be executed in parallel. If the schedule satisfies (9), then the source and the sink of any value do not belong to the same front: there is no data exchange inside a front. We may say that the set of operations has been partitioned into anti-chains - sets of non comparable operations - which are executed sequentially. This is the Sv.[~ of PhR
construction of [2]. Two operations in different fronts are executed in sequence, but are not necessarily dependent some parallelism has been lost in the interest of a simple parallel program representation.
2.3 Distribution
Let Q be any partition of the set of operations of a given program. One may associate to Q an execution order in the following way. Operations which belong to the same part of Q are executed according to the sequential execution order. Operations which belong to different parts are ordered in the sequential order if and only if they are dependent. It is quite clear that the resulting order satisfies (1). Intuitively, ~each part corresponds to a process. Ordering between operations in different parts corresponds to a synchronization operation. This is the PAR of $Eq style of programming of [2]. In the same fashion, one may introduce an ordering between two operations in different parts only if one is the gource and the other a sink for a given value. In that case, the ordering is obtained by transmitting a message from the source to the sink; the message carries the shared value. One may show that replicating the workspace of the original program in each processor provides enough memory expansion to eliminate anti- and output- dependences. Here, any partition gives a correct parallel program. However, efficiency considerations dictate that the set of residual synchronisation or communications be kept to a minimum, subject to the condition that all processes execute about the same amount of work. The problem has been widely studied (see [8] and the references therein). A plausible solution is the following. One postulates a p l a c e m e n t function II from the set of operations to the set of processes (also called virtual processors). II(u) is the name of the processor which executes u. The following equation: II(source(a, u)) = II(u)
(10)
expresses the fact that the source for cell a in operation u is in the same process as u. If all such equations can be satisfied, all residual communications will disappear. Since an arbitrary placement function is useless for program restructuring purposes, one makes the additional assumption that Π is affine: Π(R, x) = π_R.x + q_R, where π_R is an unknown vector and q_R an unknown alignment constant. To each piece (2) of the source function corresponds a system:

π_S H_RS = π_R,
π_S.h_RS + q_S = q_R.
Let us write ~ for a vector in which all unknowns lrn, qn for all R are collected. The above system may be summarized as: C~ = O, meaning t h a t ~ must belong to the null space of C. If, as is likely for real world programs, C is of full row rank, ~ = 0 is the only solution, and the calculation collapses on processor O. To obtain an interesting solution, one has to select a submatrix of C with a non trivial
Compiling for Massively Parallel Architectures
265
null space. The excluded rows corresponds to residual communications. Heuristics may be used to select the excluded communications among those with the lightest load. To the source functions (3-5) are associated the following placement equations:
II(4, i,j,k)
= II(4,i,j,k-1),
n(4, i, j, 1) = n(4, i, j, k) = II(4,i,j,k) =
12(3,i, j), n(1, i, k), II(2,k,j).
(11) (12) (13) (14)
From (11) we deduce that II(4, i,j,k) does not depend on k. Similarly, (13-14) imply that this function does not depend on either j or i, and hence is a constant. It is thus impossible to build a distributed program for our example without residual communications. Suppose now we ignore (14). We may now take:
II(4, i,j,k) = II(1, i , j ) = II(3, i , j ) = i. Equations (11-13) are satisfied. We may choose II(2, i,j) arbitrarily. Let us take II(2, i, j) = j. There are then two solutions. The first one is to program a communication from processor j to processor i in which b ( k , j ) is sent. The second one is to duplicate b on all processors. There is no residual communication in this ease. A linear placement function for a program whose iteration domain has characteristic dimension n has a domain of cardinality O(n), and hence, generates O(n) processes. This may be too much for some architectures. In that case, one folds the placement function by assigning several processes to one physical processor. Alternatively, O(n) may not be enough for some architectures like the CM2 or Maspar. In that case one uses two or more linearly independent placement functions. Such a d-dimensional placement function generates O(n d) processes, d being limited only by the dimension of the iteration space. One can compute a placement function without any reference to a schedule. However, there are two reasons not to do that. The first one is that knowing the schedule allows one to choose the dimensionality of the placement. If the iteration space has dimension d, and if the schedule is one dimensional, then each front is included in a subspace of dimension d - 1, and there is no need to use a processor grid of higher dimension. More generally, for each statement the dimension of the processor grid (or template in HPF, or geometry for the Connection Machine) is d - s where s is the dimension of the schedule. Secondly, the schedule and placement function seen as a space-time transform has to be one-to-one, meaning that each process executes at most one operation at any given time: Vu # v: H(u) = II(v) ~
2.4
8(u) # 8(v).
(15)
Supernode Partitioning
In this technique [9], one starts again with a partition S of the operation set. The elements of S are called supernodes or tiles. This partition is subjected to the requirement that the quotient of the dependence graph by S is acyclic: Va, r E S, ~ u , v E a , x , y E r : u -< x,y-~ v,u .1. x,v 2. y.
266
P. Feautrier
In the parallel version of the program, operations which belong to the same supernode are executed sequentially according to -~, while supernodes themselves are executed according to the quotient order. Supernode partitioning is an important technique for improving the performance of a parallel program, by adapting the "grain of parallelism" of the program to the grain of the target computer. Most often, supernodes are defined as identical tiles which have to cover the set of operations of the program. As a first approximation, the computing time of a tile is of the order of its volume, while the necessary communications or synchronization are of the order of its surface. Increasing the size of tiles improves the computation to communication ratio, at the price of reducing the amount of parallelism. The extreme is the case of only one tile, which generate no communication and no parallelism. The problem of writing the actual parallel program after tiling is simply displaced from the original dependence graph to the quotient graph, and the methods abovestill apply. 3
A D A P T I N G T H E C O M P I L E R TO T H E A R C H I T E C T U R E
3.1
Classifying A r c h i t e c t u r e s and Languages
It is a truism that each programming language defines - sometime explicitly, most of the time implicitly - an underlying virtual architecture. In many cases, the user of a massively parallel computer only sees the virtual architecture provided by his favorite programming language. This leads to the distinction between the programming model and the execution model [2]. In this discussion, we will mostly stay at the level of the programming model. For instance, any computer which runs Fortran 90 will be deemed a vector processor. Obviously, when constructing programs for massively parallel computers, one has to take the target architecture into account. My contention is that only broad characteristics of the target computer are important for the compiling process. Detailed parameters, like e.g. message latencies or cache size, are to be taken into account only when fine tuning the resulting program, as for instance when one has to decide the size of supernodes. The main characteristics of a parallel architecture are the following: 9 Is there a central clock to which all processors are synchronized? 9 Is there a global address space which can be accessed in a uniform manner by all processors? These two parameters are largely independent, and thus gives rise to four architectural classes. 3.2
Global M e m o r y Synchronous A r c h i t e c t u r e s
Under this category fall static control superscalar and VLIW processors, and also a few designs like 0psila (a global memory SIMD machine). Parallelism is obtained by executing a large number of operations simultaneously in each clock cycle. One may argue that
Compilingfor Massively Parallel Architectures
267
pipeline processors belong to this class, if one stays at the level of the vector instructions, and ignore the detailed programming of the pipelines. For a synchronous computer, each operation has a well defined date, which is obtained simply by counting clock cycles since the beginning of the program. This gives a natural schedule. Conversely, to a given schedule, one may associate the following abstract synchronous program: do t = 0, L
doall Y(t) end do where L is the latency of the schedule. The body of the loop may be understood as a very large instruction, each processor taking responsibility for one of its elementary operations. This program cannot in general be executed directly. Firstly, in a synchronous computer, all operations in a front are to be instances of the same instruction. Secondly, the size of the front is limited by the number of identical processors. One has to split the front into subfronts ~'s according to the statement S which is executed. One also has to adjust the schedule in such a way that no front has more operations than the available number of processors. This can be integrated into the scheduling process, or done a posteriori by variants of the well-known strip mining technique.
3.3
Distributed Memory Asynchronous Architectures /
This is a class of computers with a very large population, from Workstation networks to hypercube based architectures. Each processor works in asynchrony and has its own independent memory. A message exchange is necessary if one processor needs a value which has been computed elsewhere, and since message passing is always much slower than computation, such exchanges must be kept to a minimum. These computers are best programmed from a partition as in Section 2.3. To a first approximation, the data space of the original program can be replicated in each memory, thus insuring the needed memory expansion. Each processor runs a copy of the source program, each instruction being guarded to insure that it is only executed if it has been assigned to that processor. The overall effect is Single Program, Multiple Data programming style. Let is suppose that distribution is specified by a placement function II, and let q be the current processor number. Operation u is replaced by the following code [14]:
Va q R(u): i f II(u) # q ^ II(source(a, u)) - q then Send(a) to H(u) i f II(u) = q ^ II(source(a, u)) # q then Receive(a) from II(source(a, u)) i f II(u) = q then c - f(R(u)) One may prove that there is a one-to-one correspondence between Sends and Receives, that values are sent in the order in which they are received, and that Send-Receive pairs implement the needed synchronization between operations in different processors. Efficiency of the above program is dependent on several factors. Firstly, as much as possible of the guards in the above code should be pushed up into
P. Feautrier
268
the surrounding loops bounds. This can be done only for simple forms of the placement function. Secondly, proper choice of the placement function should minimize the number of residual communications. Sends and Receives should be grouped in order to have longer messages, perhaps by supernode partitioning. Nevertheless, any partition function leads to a correct if perhaps inefficient object program. This is the reason why it is feasible to have the distribution specified by the programmer, as in the HPF language. 3.4
SIMD Architectures
SIMD architectures, as for instance the CM-2 or Maspar computers, or systolic arrays, have synchronous processors and a distributed memory. Hence, for generating a para.llel program, one needs both a schedule and a placement function which together have to satisfy constraint (15). This is a very difficult problem. One does not even know which is to be found first, the schedule or the placement. Here is a possible suggestion for the case where the schedule has been found first, and one needs a compatible placement. The first step is to split each front into subfronts; since at any given time, each processor of a SIMD machine executes the same instruction, each subfront is associated to one statement in the source program (either a high level statement or a machine instruction, depending on the programming model). After this step, we only have to check (15) for operations which are instances of the same statement. Let (S, i) be one such instance. Suppose we have found an affine schedule of the form:
e(s, i) = hs.i + ks, where hs is known as the timing vector ill the systolic literature. The placement function has to satisfy: Vi ~ j : r [ ( i - j ) = 0 ~ h s . ( i - j) ~ O. This means that hs should not be orthogonal to the kernel of H. Since this is a non convex constraint, there is at present no known way of integrating it in the placement process except by using exhaustive search algorithms. 3.5
Asynchronous Shared Memory Architectures
An asynchronous multiprocessor with a global address space is apparently tile easiest parallel architecture to program. In fact, there are two ways of tackling the job. Firstly, it is easy to emulate a synchronous architecture - one needs only a fast barrier primitive - and still easier to emulate a distributed memory machine: one has only to partition the address space and to restrict access of each processor to the associated memory segment. This is the policy most operating system implement for data security reasons. The memory access limitation is raised for the time it takes to execute communication primitives. However, one should be aware that today global memories are build on top of message passing architectures, either as a Shared Virtual Memory, or with the help of distributed caches. In both cases, the performance of the computer is sensitive to the placement of
Compiling for Massively Parallel Architectures
269
the data and calculations. The most important parameter is the coherence protocol, which insure that no obsolete data is returned to a read request. Strong coherence protocols [3] work by invalidating copies of data which are going to be modified. In that case, the main concern is to avoid coherence induced cache misses. This is done, not by distributing the data, but by distributingthe operations, two operations sharing the same datum being preferably located on the same processor. Weak coherence protocols delimit sections'of the code where it is safe to waive the coherence check because there are no concurrent writes to the same memory cell. Observe that a front is such a section, because all operations inside it have the single assignment property. Coherence is restored as a side effect of the barrier operation. It is clear that fl'onts are exactly what is needed by weak coherence protocols, and that a schedule is the natural tool for constructing fronts. 4
CONCLUSION
This paper has attempted to give guidelines for the central part of massively parallel program synthesis in the case of static control programs. This has to be preceded by an analysis phase and followed by a code generation phase. The construction of the DFG and scheduling are well understood processes. Distribution/Placement is a fuzzier subject since there is no real constraint on the placement function, the problem being one of trading off load balancing against communications. Some work is still needed in that direction. The basic tools for code generation also are well understood. One of them is the construction of a loop nest from a set of affine constraints [1]. Recent results allows handling of non-unimodular space-time transformations [13]. There is still a difficulty when generating code for unperfectly nested loops, in which case some guards may still be necessary after restructuring. In the case of distributed memory architectures, there are different solutions for the object code [4]. Experimentation is needed here to select the best approach. The next step after that is to go beyond static control programs, exploring while loops and irregular data structures. There is still some hope of improving analysis and synthesis techniques in this direction. However, we are nearing the point at which the information content of the program text is nearly exploited to the full. After this stage, the only possibilities are either run-time parallelization methods or the definition of new programming languages, where more information is available for parallelization. Acknowledgment Many thanks to Luc Bough, who carefully criticized a first version of this paper. References Irigoin. Scanning polyhedra with do loops. In Proc. third SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages
[1] Corinne Ancourt and lh'anw
39-50. ACM Press, April 1991.
270
P. Feautrier
[2] Luc Bough. Le module de programmation ~ parall~lisme de donn~s : une perspective s~mantique. T.S.I., 12(5):541-562, 1993. [3] Lucien M. Censier and Paul A. Feautrier. A new solution to coherence problems in multicache systems. IEEE Trans. on Computers, C-27:1112-1118, December 1978. [4] Jean-Francois Collard. Code generation in automatic parallelizers. In Claude Girault, editor, Proc. Int. Conf. on Application in Parallel and Distributed Computing, IFIP WG 10.3, pages 185-194. North Holland, April 1994. [5] Paul Feautrier. Dataflow analysis of scalar and array references. Int. J. of Parallel Programming, 20(1):23-53, February 1991. [6] Paul Feautrier. Some efficient solutions to the affine scheduling problem, I, one dimensional time. Int. J. of Parallel Programming, 21(5):313-348, October 1992. [7] Paul Feautrier. Some efficient solutions to the affine scheduling problem, II, multidimensional time. Int. J. of Parallel Programming, 21(6):389-420, December 1992. [8] Paul Feautrier. Toward automatic distribution. Parallel Processing Letters, 4, 1994. to appear. [9] Francois Irigoin and R~mi Triolet. Supernode partitioning. In Proc. 15th POPL, pages 319-328, San Diego, Cal., January 1988. [10] Dror E. Maydan, Saman P. Amarasinghe, and Monica S. Lain. Array dataflow analysis and its use in array privatization. In Proc. of A CM Conf. on Principles of Programming Languages, pages 2-15, January 1993. [11] William Pugh and David Wonnacott. An evaluation of exact methods for analysis of value-ba~ed array data dependences. In Sixth Annual Workshop on Programming Languages and Compilers for Parallel Computing, Portland, OR, August 1993. [12] Patrice Quinton. The systematic design of systolic arrays. In F. Fogelman, Y. Robert, and M. Tschuente, editors, Automata networks in Computer Science, pages 229-260. Manchester University Press, December 1987.
[13] J. Xue. Automatic non-unimodular transformations of loop nests. Parallel Computing, 20(5):711-728, 1994. [14] H. P. Zima, H. J. Bast, and M. Gerndt. Superb : A tool for semi-automatic mired/sired paraUelization. Parallel Computing, 6:1-18, 1988.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 1995 Elsevier Science B.V.
271
DIV, FLOOR, CEIL, MOD AND STEP FUNCTIONS IN NESTED LOOP PROGRAMS AND LINEARLY BOUNDED LATTICES
P.C. HELD and A.C.J. KIENHUIS
Delft University of Technology Mekelweg 4 2600 GA Delft The Netherlands held @dutentb. et. tudelft, nl
ABSTRACT. This paper describes the conversion of nested loop programs into single assignment forms. The nested loop programs may contain the integer operators: integer division, floor, ceil, and modulo, in expressions and the stride, or step size, of for loops may be greater than one. The programs may be parametrized but must have static control. The conversion is done completely automatical by the tool HiPars. The output of HiPars is a single assignment program (SAP) and a dependence graph (DG). The description of the dependence graph is based on linearly bounded lattices. We will show the relation between the integer division operators in the SAP and linearly bounded lattices in the corresponding DG. KEYWORDS. Nested loop programs, data dependencies, single assignment programs, linearly bounded lattices.
1
INTRODUCTION
Many algorithms in the field of signal-processing are available in the form of sequential programs in various programming languages such as Fortran and C. To execute programs on dedicated parallel processor arrays [13][4], we need to know the data dependencies between the operations of the program [1]. We restrict ourselves to programs that belong to the class of nested loop programs. In addition, we require that the programs have static control and that the expressions inside the program are linear expressions. For the data dependence analysis, a linear programming technique (PIP - parametric integer programming) due to Feautrier can be used [5] [6]. We show that the class of nested loop programs which contains non-linear functions integer division, modulo, ceil and floor in its expressions can be converted to the previous. Based on
P. C. Held and A. C.J. Kienhuis
272
the definition of integer division, we can express these operators in terms of linear control statements, allowing us to use PIP for the analysis of these programs too. The result of the dependence analysis is a single assignment program (SAP). Our goal is to write this SAP in the form of a piecewise regular dependence graph. The description of the nodes and edges of the dependence graph is based on linearly bounded lattices [12]. We pay special attention to the problem of converting the integer division operators that occur in the SAP into lattice descriptions. 2
DEPENDENCE
ANALYSIS OF NESTED LOOP PROGRAMS
We have implemented a tool, called H iPars, which finds the data dependencies inside nested loop programs (NLP) [7]. We allow these NLPs to contain two kinds of control statements. The for-loop control statement and the conditional statement. In addition, the NLPs contain assignment statements, which take the form of function calls. We restrict ourselves to the class of a.ffine nested loop programs [4]. Some properties of affine nested loop programs are, among others: (1)expressions of loop bounds, conditionals, and indices of variables are ai~ne; (2) the programs have static control; (3) the number of iterations of the program is independent on the value of the variables of the program; (4) the programs may be parametrized. We will extend this class of programs to include nested loop programs containing the non-linear functions: integer division, ceil, floor, and modulo. The functions may appear in the expressions of the program. In addition we allow that the stride, or step size, of a for-loop to be greater than one. Let a be an aftine expression of variables with integral coefficients and b an integral constant. We denote integer division by div(a, b). We define the other operators in terms of the div operator, with ~ standing for the division of a and b:
9 floor(~) = div(a, b) rounds the result of ~ to the greatest integer smaller than or equal to ~. 9 ceil(~) - - d i v ( - a , b) rounds the result of ~ to the smallest integer greater than or equal to ~. , rood(a, b) - a - b , div(a, b) returns the value of the remainder of the integer division. With these definitions, we rewrite nested loop programs containing these functions in terms of the div functions only. Below we have listed program 2.1 containing two for-loops with stride two and div functions inside the expressions of the conditional statements. Two functions A and B are data-dependent if function A uses as argument a variable which value is defined by the evaluation of function B. To find the data dependencies, We represent the functions evaluated by the nested loop program as sets of iterations. We call these sets of iterations iteration-domains.
DIV, FLOOR, CEIL, MOD and STEP Functions
273
Program 2.1 Let M be a parameter. Let lunch and funcB be two functions. Let a be a two dimensional variable array. for i = I to M step 2, for j = 1 to 2 M step 2, if i - 3 , div(i,3) <-- 0 then [a(i,j)] -- lunch( ); end if j - 3* div(j,3) <= 0 then funcB(a(i,j) ); end
end
end [] The kernel of HiPars is formed by a parametric integer programming routine (PIP). PIP differs from classical ILP in two ways. First the constant vector of the LP system may be parametrized leading to parametrized output by symbolic evaluation. Secondly, the objective function is defined by the lexicographical ordering of the feasible solutions. In order to use PIP [5] for our dependence analysis we describe iteration domains by polytopes. Let Z be the set of integers. We define an integral polytope as a bounded set of points taking values in Z n specified by a system of linear inequalities. By definition, a polytope is a dense space. When a program contains for-loops with stride other than one, the expressions of the lower and upper bounds are not sufficient to describe the values that the iterator takes on. We have to add an additional constraint specifying that the difference between values of the iterator is a multiple of the stride. To model the stride, we introduce an integral variable q. Let lb be the lower- and ub be the upper- bound expression of the for-statement with iterat0r i and stride s. We model the stride by the equation: i = q , s + lb
(1)
with Ib < i < ub. It is easy to show, that this equation together with the inequalities of the bounds form a polytope in the (i, q)space that define the values of iterator i. So we properly transformed the non-dense iteration domain in i into a dense iteration domain in a higher dimensional space at the cost of an additional variable in the domain description. The outline of the sequel of the paper is as follows. In section 3, we will give the definition of integer division and substitute these operators inside NLPs by new control variables such that the resulting program contains only affine expressions. In section 4, we explain how a nested loop program is converted into a single assignment program and show that the SAP will contain additional control variables standing for integer divisions. Then our goal will be to write the SAP description as a DG with linearly bounded lattices. We will define linearly bounded lattices in section 5. In section 6 and section 7 we will derive the lattice
P. C Held and A. C.J. Kienhuis
274
vectors and lattice offsets, respectively. The procedure will be illustrated by an example.
3
INTEGER
DIVISION
When expressions of bounds or conditionals contain an integer division operator, we have to do a similar transformation as we did for the stride in order to obtain polytope descriptions of iteration domains. For this purpose, we look at the definition of integer division. Definition 3.1 Let a be an integer and b a positive integer. The result of integer division of a and b are integers q and r, such that a = b,q+r (2) 0
_
r_(b-1)
(3)
We call b the divisor and r the remainder of the division. The value of q is equal to div(a, b). Observe that div(5 + 3,2) # div(5, 2) + div(3, 2), which shows that integer division is not a linear operation. Definition 3.1 gives us a way to define the integer division operator div(a,b) by two linear inequalities [9]. We write r as a - b , q according to the equation and substitute it in the inequalities, resulting in: O<_a-b,q<(b-1) (4) In the expressions inside the nested loop programs, we substitute each div operator by a control variable q and add the two inequalities defining variable q in the form of conditional statements. After substitution of the div functions, the NLP contains only affine expressions. As a result, we can define all iteration domains by polytopes. E x a m p l e 3.1 Program 2.1 contains the conditional statement: if i - 3* div(i,3)
<= 0 then
We introduce variable q which we substitute for the function div(i, 3) in the expression. By definition 3.1 we define div(i,3) by the inequalities: 0 _< i - 3q _< 2 (5) We add the inequalities as conditional statements before the original if-statement, resulting in the following statements: if 0 <= i -3*q then if i - 3*q <= 2 then if i - 3, q <= 0 then
Now all inequalities are linear expressions of the variables i and q.
DIP', FLOOR, CELL, MOD and STEP Functions 4
275
SINGLE ASSIGNMENT PROGRAM
After substitution of div operators by inequalities inside the NLP, we find the data dependencies between the variables of the program. HiPars presents the result of the dependence analysis in the form of a single assignment program (SAP)[14] [3]. A complete dependence analysis of the program involves finding the dependencies for all right-hand side (RHS) variables appearing in the function call statements. A RHS variable can only be dependent on a left-hand side (LHS) variable of the same name. In case there are more LHS variables of the same name, we find first the solution between the RHS variable with each of the LHS variables separately. This is done by PIP, which returns the solution in the form of the index of the LHS variable and the iteration-domain for which the solution is valid [5] [6]. The index may be undefined, which means that the RHS variable does not depend on the LHS variable. After applying PIP for all the LHS variables, we determine the lexicographical largest index among the indices, which is the dependency. Dependencies are linear functions on the iterators, parameters and variables standing for the integer divisions. The complete solution of the dependence analysis of a single ttHS variable may consists of multiple dependencies defined on mutually exclusive iteration domains [8]. The procedure to construct the SAP of the NLP is straightforward [3]. First we substitute LHS-variables by variables with unique names and with identity functions as indexing functions. Next we replace each ttHS variables by the corresponding LHS variable in which the dependency is used as indexing function. If the dependency is undefined, we do not substitute the RHS variable. The iteration domains belonging to the dependencies are inserted in the SAP in the form of conditional statements. Below we show the SAP of example 2.1 that is automatically generated by HiPars. E x a m p l e 4.1 Let Mbe a parameter. Let funcA and funcB be two functions. Let a_l be a two dimensional variable and in0 a temp variable. ~. SAP Generated By HiPars Version 2.08
for i=1 to M step 2, for j=l to M step 2, q1=div(i,3); if -i+3*q1>=O then [ a_l( i,j ) ] = funcA( ); end
P. C. Held and A. C.J. Kienhuis
276
q2=div(j,3); i f -j+3,q2>=O, ql=div(i+l,2); if -i+2*ql-l>=O then q2=div(j+l,2); if -j+2*q2-1>=O then q3=div(2.ql+2,3); if -i+3.q3-3 >=0 then q4 = air(q3,2);
in0 = a _ l ( i , j ) ; else inO = a ( i , j ) ; end else
in0 = a ( i , j ) ; end else
inO = a ( i , j ) ; end
funcB( inO ); end end end
o Observe that the SAP contains additional div operators. The div functions are introduced by PIP during the' process of finding a solution of the integer programming problem. To find an integral solution, PIP adds extra inequalities, called cut planes, to the polytope. The construction of the cut plane involves integer division. If parameters are involved in the division, PIP introduces a new parameter in order to linearize the cut plane [2] [5] [10]. These new parameters corresponds to the div functions in the SAP. 5
LINEARLY BOUNDED LATTICE
The SAP is in fact an intermediate format in the HiFi design system [13]. Our goal is to represent the NLP as a dependence graph (DG). Nodes of the DG represent functions of the DG and edges data dependencies between the functions. We use domains to represent the elements of the graph in a reduced way. We define domains as linearly bounded lattices [12][41. A domain is specified by a polytope and a lattice. Let I be an index vector of length n and let K be a vector of m integral variables. With A 6 Z mXn an integral matrix and
DIV, FLOOR, CEIL, MOD and STEP Functions
277
B E Z ra a constant vector, we define a polytope by [10]: A K >_B (6) Now, with L an n • m integral matrix and O E Z '~ an integral vector, we define a lattice aS"
I = LK + 0 (7) We say, that the lattice of I is generated by the columns of L, with the variables of K bounded by the polytope. We call O the offset of the lattice and the columns of L the lattice vectors.
An example of a two dimensional lattice is:
Example 5.1 0
Below we will show how to write the SAP as linearly bounded lattices. We will focus on the lattice specification defined by the div functions inside the SAP.
6
INDEX DOMAINS WITH DIVS
In this section we will show the relation between lattices and the integer division functions. To illustrate the subject, we take as example the piece of code of program 4.1 containing four div statements:
Example 6.1 ql = div(i + 1,2) if-i+2,ql1 >_0 q 2 = d i v ( / + 1,2) i f - j + 2 , q 2 - 1 >_ 0 q3=div(2, ql + 2, 3) if - i + 3 , q3 - 3 >_ O q4= div(q3, 2) a 1 ( 3 , q 3 - 3 , 2 , q2 - 1) E! The example shows that div functions may be nested. The div functions and inequalities of the example form a part of specification of the iteration domain of the dependency for variable al. Our goal is to describe iteration domains of variables as linearly bounded lattices. The method is based on the so-called hermite normal decomposition [10]. Other approaches can be found in [9] [11]. To find the lattice defined by the integer divisions and inequalities involving the q's, we start by writing the div's as equations by setting the remainders r to zero.
P. C. Held and A. C.J. Kienhuis
278
Let N be a matrix of which the rows are the normals of these equations. Let Q = (ql,..,qm) be the vector of variables of the m divisions and let I be the vector of the iterators. We write the system of equations defined by the div's as: N
(i) Q
=0
E x a m p l e 6.2 With I = (i,j) t and Q = (ql,q2,qa, q4)t, the system of equations of example 6.1 is: i - 2ql = 0 j-
2"q2 = 0
2q2 - 3q3 = 0 q3-2q4 = 0 Thus matrix N is
i
._
1 0 0 0
0 -2 1 0 0 2 0 0
0 -2 0 0
0 0 -3 1
0 0 0 -2
We assume that the system has a solution. Otherwise, we would have removed this piece of code from the program by dead code elimination procedures. The system has m equations in n + m variables. Because each row k introduces variable qk it follows that the rows of N are independent. The nullspace of the system is thus n-dimensional, equal to the dimension of the iteration-space. We will call the variables corresponding to the nuUspace the free variables of the system. To find the solution, we use the hermite normal decomposition [10]. This procedure gives us two matrices C1 and C2 such that:
N[C, C2I = [H0] in which matrix H is called the hermite normal form of N. Matrix H has an inverse because the rows of N are linearly independent. Observe that matrix C2 consists of the vectors of the n nullspace vectors of N as NC2 = O. So any linear combination of the vectors of (72 added to a given solution s will also be a solution of the system. Because we are only interested in the values of I, we decompose matrix C1 into Cfl, size n by n, and C12 and decompose matrix C2 into matrices C21 and C2~ as follows:
Now, the columns of matrix C~1 are the lattice vectors. So the hermite normal form gives us directly lattice matrix L defined by the divs.
DIV, FLOOR, CEIL, MOD and STEP Functions
279
E x a m p l e {}.3 Hermite normal decomposition of matrix N gives: 1 0 0 0 0 0
C1 =
0 -2 1 0 0 -1 0 0 0 -1 0 0
0 0 0 0 0 -1
and matrix C2" 6 0 0 2 3 0 C2 :
0
1
2 0 1 0 The iterators i and j are defined by matrix C21. Let K I be the vector of free variables. We write I = (i,j) t, with offset O still to be determined, as:
I=
7
(60)Kf+O 02
LATTICE OFFSET
Next we have to find the lattice offsets. Let B = (bl, .., bin)t be the vector of the divisors of the integer divisions, with remainder rk between 0 < rk < bk. An offset 0 must first of all be an integral solution of the system:
0 _< N
(o) Q
_< B
(9)
Apart from these inequalities there may be others in the program that restrict the value of the variables standing for the integer divisions. We disregard inequalities not involving Q as they do not affect the lattice offset. Let < Nq, Bq > be the system of all inequalities involving Q. We assume that NqC2 = O. When this assumption is satisfied, we may use the vectors of C21 as lattice vectors because the variables corresponding to C21 are free. Let Kb be the vector of variables corresponding to matrix C1 and let Kf be the vector of variables corresponding to matrix C2. We define (O, Q)t as 0
(10)
P. C. Held and A. C.J. Kienhuis
280
and substitute it in the polytope:
Q) >_Bq
Nq( 0
(11)
after which we obtain the polytope:
NqC1Kb
>_ Bq
(12)
This polytope defines all offsets 0 = CllKb of the lattice and we call it the lattice offset domain. The number of offsets depend on the value of the divisors b. The lattices corresponding to the polytope in q are defined by:
I = C21K1 + 0 0 = Ca,Kb NqC1Kb >_ Bq
(13)
(14) (15)
The lattices are bounded by remaining inequalities of the nested loop program. These inequalities together with a lattice define an iteration-domain. A special case is when the offset domain contains a single point. Then the lattice descriptions reduces to I = C21K f + O, and we do not have to enumerate the lattice offset domain. E x a m p l e 7.1 In example 6.1 there are three if-statements defining inequalities in q: - i T 2 , q l - 1 >_ 0
(16)
-j+2,q2
(17)
1>_0
- i T 3 , q 3 - 3 >_ 0
(18) (19) After the substitution I = C11Kb and Q = C12Kb, we get inequalities in variables of I(b: - k 1 _> 1
-k2 > 1 - k l - k3 _~ 3 By the same substitution we get for the inequalities of the remainders: -l_
0 = CllKb = ( 3 , - 1 ) t
t is the only solution. So
DIV, FLOOR, CELL, MOD and STEP Functions 8
281
CONCLUSION
This paper shows the relation between several forms of describing the data dependencies of nested loop programs. In particular we have explained the relation between integer divisions inside the Single Assignment Programs generated by HiPars and linearly bounded lattices in descriptions of Dependence Graphs. We have extended the chss of nested loop that HiPar8 can take as input to programs containing non-linear integer division functions inside the expressions. This extension is based on the definition of integer division by which we can linearize the expressions in the program at the cost of additional variables. The conversion of SAP with div functions to linearly bounded lattice descriptions is achieved by taking the hermite normal form of the matrix N defined by integer divisions. This decomposition leads to matrices C1 and C2, with corresponding variable vectors Kb and Kf, respectively. Matrix C2 defines the lattice vectors with the variables of K I as free variables. The domain of lattice offsets is formed by a polytope in variables of Kb. The polytope is characterized by matrix C1 and inequalities of the variables standing for the integer divisions. As a result, we have transformed the polytopes defined by a SAP into linearly bounded lattices to be used in the description of the corresponding DG. References
[1] U. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers, 1988. [2] L. Brickman. Mathematical Introduction to Linear Programming and Game Theory. Springer-Verlag, 1989. [3] Jichun Bu. Systematic Design of Regular VLSI Processor Arrays. PhD thesis, Delft University of Technology, Delft, The Netherlands, May 1990. [4] Ed Deprettere, Peter Held, and Paul Wielage. Model and methods for regular array design. Int. J. of High Speed Electronics and systems, 4(2):Special issue on Massively Parallel Computing-Part II, 1993. [5] P. Feautrier. Parametric integer programming. Reche~vhe Opdrationelle; Operations Research, 22(3):243-268, 1988. [6] P. Feautrier. Dataflow analysis of array and scalar references. Int. J. Parallel Programming, 20(1):23-51, 1991. [7] Peter Held. Hipars' reference guide. Technical report, Dept. Electrical Engineering, Delft University of Technology, 1993. [8] Peter Held and Ed F. Deprettere. Hifi: From parallel algorithm to fixed-size vlsi processor array. In Francky Catthoor and Lars Svensson, editors, Application-Driven Architecture Synthesis, pages 71-92. Kluwer Academic Publishers, Dordrecht, 1993.
282
P. C. Held and A. C..l. Kienhuis
[9] F.Balasa F.Franssen F.Catthoor H.De Man. Transformation of nested loops with modulo indexing to affine recurrences. In C.Lengauer P.Quinton Y.Robert L.Thiele, editor, Special issue of Parallel Processing Letters on Parallelization techniques for uniform algorithms. World Scientific Pub., 1994. [10] G.L. Nemhauser and L.A. Wolsey. Integer and Combinatorial Optimization. John Wiley & Sons, Inc., 1988. [11] W. Pugh. A practical algorithm for exact array dependence analysis. Communications of the ACM, 35(8):102-114, 1992. [12] L. Thiele and U. Arzt. On the synthesis of massively parallel architectures. Int. J. of High Speed Electronics and Systems, 4(2):99-131, 1993. [13] Alfred van der Hoeven. Concepts and Implementation of a Design System for Digital Signal Processing. PhD thesis, Delft University of Technology, Delft, The Netherlands, October 1992. [14] Kung S. Y. VLSI Array Processors. Prentice Hall, 1988.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
UNIFORMISATION TECHNIQUES RECURRENCE EQUATIONS
FOR REDUCIBLE
283
INTEGRAL
L.RAPANOTTI and G.M.MEGSON Dept. of Computing Science University of Newcastle upon Tyne Claremont Tower Newcastle upon Tyne NE1 7RU United Kingdom lucia, rapanotti @newcastle. ac. uk graham, megson @newcastle. ac. uk ABSTRACT. The established techniques for the synthesis of regular array designs from algorithm specifications strongly rely on the linearity of the algorithm data dependencies. In this paper we show how this linearity constraint can be relaxed and synthesis methods extended to more general classes of problems. We introduce the class of integral recurrences and discuss their relationship with the more traditional uniform and affine recurrences. The structure of this class is investigated and two important subclasses of integral recurrences, so-called atomic integral and reducible integral, are identified. A uniformisation technique is defined which allows the systematic routing of integral data dependencies. As a significant example, we apply the technique to the knapsack problem, a well-known optimistion problem from the literature. KEYWORDS. Synthesis methods, regular arrays, integral recurrence equations, data dependencies, uniformisation.
1
INTRODUCTION
Synthesis techniques for parallel regular computations have become increasingly significant as a methodology for the development of regular array designs through the disciplined process of abstract algorithm representation and its subsequent manipulation into parallel computations. Synthesis methods also represent a conceptual framework to explore the potential and limitations of the regular array approach to parallel processing and provide the theoretical support to automatic tools implementing the methodology and so assisting algorithm designers in the more tedious and error prone phases of the design process. The established synthesis methods are based on an interpretation of the algorithm as a set of computation points in a Euclidean lattice space, and the synthesis process consists of a
L. Rapanotti and G.M. Megson
284
sequence of transformations applied in this geometric setting. The syntax used to specify the algorithm is that of recurrence equations. Synthesis techniques still suffer from a number of limitations. A major criticism is that they are mainly restricted to algorithms characterised by data dependencies which can be expressed as (or are reducible to) linear expressions. Linearity is also required to characterise the computation domain of the algorithm and linear transformations constitute the core of synthesis methods ([7, 10, 9, 8, 5]). Indeed, linearity strongly restricts the scope of the techniques, hence the range of problems which can be treated systematically. In this paper we explain how to relax the constraint of linearity on the data dependencies. In particular we introduce integral recurrence equations (IREs) as a class of recurrences characterised by data dependencies which are defined by integral functions. We will focus on the structure of the class introducing a taxonomy of integral recurrences based on some decomposition properties of their data dependencies, in particular identifying the classes of atomic integral (AIREs) and reducible integral (RIREs) recurrences. We will explore their relationship with the more traditional uniform and affine cases, and show how reducible and atomic integral data dependencies can be systematically made uniform. This has the important practical implication of extending the available synthesis techniques to interesting classes of problems from the literature, such as the knapsack problems, cyclic elimination and recursive doubling algorithms, real-time signal processing algorithms requiring piece-wise linear index functions 1, to name a few, which can all be expressed as , systems, of reducible integral recurrences. The paper is structured as follows. Section 2 briefly introduces recurrence equations and data dependencies. Section 3 discusses the class of integral recurrence equations, its structure and properties. Section 4 discusses the uniformisation of integral recurrences. In Section 5 a significant application of the technique is given by considering the knapsack problem, a well-known NP-hard optimisation problem ([4, 3]). Section 6 concludes the paper.
2
BASIC CONCEPTS
We consider algorithms specified by recurrence equations of the formU: z E D : U(z) = f ( . . . , l / ( I ( z ) , . . . ) , where D is a convex polyhedron in Z n, U, V are variable names, U is the result of the equation, V is one of its arguments, f is a function strict on its arguments and of constant time comple~ty, I is a function from Z'* to Z ~, called the index function and the dots indicate an arbitrary finite number of arguments of the same type. The value of a variable on a point z of the lattice space Z" is called a variable instance. The index function I determines the kind of data dependence existing between the instances of the variables U and V on the points of the domain D: in particular, for each z 6_. D, z - I(z) is the vector expressing such a data dependence. In order to guarantee the existence of a 1For this particular class of functions, a synthesis method, substantially different from the one we propose in this paper, is given in [2]. ~Recurrences of this type are known in the literature as fully indexed computation conditional recurrence equations (see, e.g., [9]). We omit this set of adjectives as they are not essential to the understanding of the work presented in the paper.
Reducible Integral Recurrence Equations
285
! Index Fiznction Uniform ([7]) Affine ([10]) Param. Linear ([9])
.
.
.
.
.
.
.
.
.
t(z)= z+b
.....
l(z) = A . z + b I(z)= A.z +b+ B.p
.
.
.
.
.
.
.
.
.
.
.
bEZ n A E Z nxn b E Z n A E Zn• E Z '~ B E Z '~• , p E P C _ Z m , p size parameter
Fig. 1. Classes of Recurrence Equations finite design, the equation domain is either finite or has at most one infinite direction 3. A system of recurrence equations is a set of recurrence equations with disjoint sets of result instances. Most of the definitions and properties for recurrence equations naturally extend to systems of recurrences. In particular, the domain of a system of equations is the convex hull of the union of the equation domains. In a system of equations all infinite directions must be the same. The classes of recurrence equations and systems in the literature are characterised by (and named after) the type of their index functions. A summary is given in Fig. 1.
3
THE CLASS OF I N T E G R A L R E C U R R E N C E E Q U A T I O N S
In this section we discuss the class of integral recurrence equations 4 (IREs). 3.1
General Form
In their most general form, integral recurrence equations are characterised by data dependencies which can be expressed as integral combinations of a finite number of direction vectors in the lattice space of the recurrences. Syntactically, their index functions can be defined as summations of a finite number of basic components, each component being a direction vector dj in Z n multiplied by a non-negative integral function gj. More formally: D e f i n i t i o n 3.1 [Integral Data Dependence] Let D be a convex polyhedron in Z n. All index function I : Z n --, Z n defines an integral data dependence over D if and only if, for all z E D, I ( z ) = z + ~j=x m gj(z )dj, where , for j = 1,.. ., m, gj 9 Z n -+ Z are flmctions non-negative and bounded over D and dj E Z n. 1 3.1 The boundedness of the functions gj over D means that the computation of a variable instance cannot depend on variable instances at an arbitrary long distance in the direction of dj in the lattice space (the intended notion of distance is the usual Euclidean distance). This property is necessary for realistic implementation of the algorithms. Indeed if the equation domain is bounded, any integral function defined on it is also bounded. The requirement for the functions gj to be non-negative over D is not restrictive as any integral function can be rewritten as the sum of its positive and negative parts. This definition of 3Domains with one infinite direction arise naturMly, for instance, in tile synthesis of digital signM processing applications in which the signMs are modeled as streams of data in time. Because of tile page limit imposed on the paper, we do not include fully detailed proofs of tile technical results. Such proofs can be found ill [12].
L. Rapanottl and G.M. Megson
286
index function is general enough to include any integral function from Z n to Z n which is component-wise bounded over the equation domain s. Therefore, affine index functions can be regarded as special cases of integral index functions and affine recurrences (AREs) as a subclass of integral recurrences. E x a m p l e 3.2 [Integral Data Dependence] Given the domain D = {(i,J) I 1 _< i < n, 1 < j _< n}, the following index function defines an integral data dependence over D" I ( i , j ) = ( i - ( j - 1)mod4,j- 1 ) = (i,j) + gl(i,j)dl 4-g2(i,j)d2, where gl(i,j)= 1, dl = ( - 1 , - 1 ) , g2(i,j) = ( j - 1)rood4- 1 and d2 = ( - 1 , 0 ) . Note that this representation of I is not unique. 9 3.2
3.2
A t o m i c Integral Recurrences
When the number of addenda defining an integral data dependence reduces to one, we regard the corresponding equation as belonging to a special subclass of integral recurrences, so-called atomic integral recurrences (AIREs). Atomicity in this context stands for the simplest form of integral data dependencies which we want to consider. Intuitively, an atomic integral data dependence is characterised by vectors aligned according to some direction vector d, and with lengths varying according to the values of an integral function g, which is a multiplicative coefficient of d. More formally: D e f i n i t i o n a . a [Atomic Integral Data Dependence] Let D be a convex polyhedron in Z n. An index function I : Z '~ --, Z '~ defines an atomic integral data dependence over D if and only if, for all z E D, I(z) = z + g(z)d, where g : Z n --. Z is a function non-negative and bounded over D and d E Z n. 9 3.3 Note that uniform data dependencies, which characterise uniform recurrence equations (UREs), are particular types of atomic integral dependencies where g is a constant function (typically equal to one). There are several reasons to consider this special subclass of integral recurrences. First of all, as we discuss in Section 4, uniformisation techniques of affine recurrences naturally extend to atomic integral recurrences. Moreover, any integral recurrence can always be replaced by an equivalent system of atomic integral recurrences. This is realised through decomposition, which we discuss in the following section. 3.3
Reducibility and Decomposition
In this section we briefly discuss some of the issues concerning the decomposition of integral recurrence equations. As the focus of the paper is on uniformisation, and because of space limitations, the discussion is not intended to be exhaustive. Further details can be found
[12]. The decomposition of an integral data dependence into a number of atomic components, is realised by replacing its integral index function I by a system of composable index functions, each function defining an atomic integral data dependence, and such that their SThis fact can be easily proved by considering, for instance, the standard basis of Zn, and rewriting the function as an integral combination of the vectors of this basis, possibly separating the positive and negative part of each resulting component integral function.
Reducible Integral Recurrence Equations
287
Fig. 2. The Class of Integral Recurrence Equations composition assumes the same values as I for each point of the domain. In geometric term, this corresponds to a form of routing, in which integral data dependence vectors are replaced by corresponding paths of atomic integral dependence vectors. As a form of routing, the decomposition of an integral data dependence is always possible, and has to guarantee that data conflicts are not generated. In doing so decomposition may require increasing the dimensionality of the index space. In particular, in [12] we introduced a decomposition technique for integral recurrences which requires an increase of the space dimensionality at each application. The technique relies on geometric properties only and can always be applied. However, less costly decomposition techniques may be defined, which rely on particular properties of the index functions involved (indeed, with a loss of generality). For instance, we may define:
Proposition 3.4
Let us consider an integral recurrence equation z E D 9 U(z) m gj(z)dj If h , defined as h ( z ) f ( V ( I ( z ) ) ) , where n C_ Z '~ and I(z) = z + g(z)d + ~'~j=l z + g(z)d for each z in D, is injective over D, then the equation is equivalent, over D, the following system of integral recurrence equations in Z n' z E D" U ( z ) = f(Vl(II(z))) z E D1 9 Vl(z) = V(I2(z)), where 14 is an auxiliary variable, I2(z) = z + ~_,j'n__lgj(I~l(z))dj, I~l(z) is a left-inverse 11 over Dr, and D1 is the convex closure of II(D).
= = to
of
PROOF: [sketched] It amounts to showing that it is possible to define a left-inverse of 11 over D1 and that the composition of the new index functions I2 o 11 is equivalent, over D, to the original index function I. II 3.4
Example 3.5 [Decomposition of an Integral Data Dependencef Consider the index function defined in Example 3.2. The function h ( i , j ) = ( i , j ) + gl(i,j)dl = ( i - 1 , j - 1) is injective over D (as gl is a constant function, 11 defines a translation over D), and its left-inverse can be defined as I~X(i,j) = (i + 1,j + 1) over the domain D1 = II(D) (as 11 is linear, II(D) is equal to its convex closure). Then I can be replaced by the composition of the index functions I i ( i , j ) = ( i , j ) + gl(i,j)dl and h ( i , j ) = ( i , j ) + g~(i,j)d2, where g~(i,j) = 9 2 ( I f l ( i , j ) ) = j m o d 4 - 1. 1 3.5 From an algorithm engineering point of view, we would like to distinguish those integral recurrences which may be decomposed without increasing the dimensionality of the index
L. Rapanotti and G.M. Megson
288
Class
Transformation Dimensions
Atomic Integral (n, 1) ReduCible Integral (n, m)
Uniformisation Reduction Uniformisation Decomposition Uniformisation
(Wo~t c',,~)
Integral (n, m)
.
.
.
.
.
.
.
n+l n+l
n+m n+m+l
.
Fig. 3. Summary of the Transformations space as they allow us to derive simpler architectures. Therefore, we introduce a finer classification of integral recurrences distinguishing those recurrences, so-called reducible integral recurrences (RIREs), which allow a decomposition at "zero cost", i.e., not requiring an increase in the dimensionality of the space. Trivially, any atomic integral recurrence is also reducible integral. The complete structure of the class of integral recurrences is given in Fig. 2. Once again we would like to stress that this brief discussion does not exhaust the matter of decomposition and, indeed, that different forms of decomposition may be defined. 4
UNIFORMISATION
OF INTEGRAL RECURRENCE
EQUATIONS
The uniformisation of integral recurrences is a two-step transformation, which consists of the decomposition of an integral recurrence into a corresponding system of atomic integral recurrences and their subsequent uniformisation. Uniformisation is based on routing techniques and generalises to integral recurrences the routing techniques presented by Quinton and Van Dongen in [9]. As both decomposition and uniformisation may require an increased number of dimensions, we have summarised the worst-case situations for the different subclasses of integral recurrences in Fig. 3, where Class indicates the type of the recurrence together with, in brackets, its dimension and the number of summands of its index function, Transformation indicates the transformation applied to the equation (decomposition and uniformisation are applied in this order), and Dimensions is the maximum number of dimensions of the resulting system of equations. The uniformisation of atomic integral data dependencies includes two basic situations. Similar to the affine case, uniformisation consists of replacing a non-uniform data dependence by a uniform routing of the data. The two cases we consider differ in the way the routing is defined. This depends on the geometric characteristics of the data dependence vectors and the direction of the affine closure of the equation domain, denoted by lin(D) in the following 6. In order to avoid data conflicts, the basic requirement which we need to enforce is the existence of a unique routing path for each point of the equation domain D. This occurs when the only points of a routing path which also belong to D (or to its afline closure) are the end-points of the path. If the direction vector d of the data dependence is not in lin(D), the vector d itself can be used as a routing direction and the resulting routing paths meet this requirement. Otherwise, when d is in lin(D), a more sophisticated routing is needed, where a set of routing directions are selected to constitute routing paths ~
e.g., [6] for an introduction to linear algebra.
Reducible Integral Recurrence Equations
289
outside the affine closure of D. Indeed, this selection is not unique, although all the possible choices yield equivalent routing systems. In both cases, routing is realised by defining routing domains (i.e., convex polyhedral sets of points which perform routing functions) and a system of routing recurrence equations over such domains. As the index mappings of atomic integral recurrences are, in general, non-linear, the routing paths only involve proper (in general non-convex) subsets of the routing domains. We introduce control variables (tr,/3 and 7 below) in order to distinguish the points of the routing paths inside the routing domains 7. 4.1
F i r s t Case: d r l i n ( D )
The first uniformisation technique is given in the following theorem: T h e o r e m 4.1 Let us consider an atomic integral recurrence equation z E D ' U(z) = f ( V ( l ( z ) ) ) , where D C Z n and l ( z ) = z + g ( z ) d . If d f[ f i n ( D ) , there exists a system of conditional uniform recurrence equations in Z n which is equivalent, over D, to the equation. n4.1 The proof of Theorem 4.1 is constructive and amounts to defining such a system of uniform recurrence equations. The system is defined as follows. Let ~ be the maximum value of g over D. If 0 > 0, consider: r . z = 0 a hyperplane containing the domain D such that 7r is a vector in tile space ( l i n ( n ) + ( d ) ) f q l i n ( D ) • (where (d) denotes the space spanned by d and l i n ( D ) • the space orthogonal to i i n ( D ) ) ; 71 = 7r .el and D1 = {z + hl [ z E D,O <_ l < 0}. The system of equations is' z e D 1 , 7 ( z ) = 0" U ( z ) = f ( R l ( z ) ) z E Dl,o~(z) > 7(z)'
Rt(z)=
oo
z e D~, ~(z) < "r(z)" R~(z) = R~(z + d) e D~,,~(~)= .y(~)" R~(~)= V(~) z 6. D l , lr . z <110 + O "
c~(z) =
e~(z + d)
z 6- D l , rr . z = 70 + O "
~(z) =
g ( z - Od)
z6-_.Dl,r.z<~lfl+O"
7(z)=
7(z+d)-I
z 6. D l , ~r . z = llO + O "
"/(z) =
9
This system is illustrated in Fig. 4, which shows, in a 2-dimensional space, a generic atomic data dependence before and after uniformisation. Variable R 1 routes on the domain D1 the values taken by the variable 1/'. Variables a and 7 are control variables which guarantee that the right value of V is assigned to R 1 for each point of the equation domain D. In particular, a carries the values of g, while 7 acts as a counter, initially set to the maximum value 0 of g over D, and decremented by one a.t each step. When c~(z) = 7(z), then R l ( z ) is assigned the va.lue V ( z ) . When 7(z) = 0, then U ( z ) is computed using the value carried by R l ( z ) . This occurs on the original equation domain D, which is a subdomain of D1. Note that this uniformisation technique guarantees that the length of each routing path corresponds to the modulo of the data dependence vector which it replaces, i.e., for each z rThe introduction of control variables in regular array design has been addressed by several authors in the literature (see [11, 1, 13, 14]). In particular, in [14], Xue describes how control signals can be systematically derived from the equation domain predicates.
L. Rapanotti and G.M. Megson
290
Jtzffie .
fin(D)
a) Before Uniformisafion
b) After Uniformisafion
Fig. 4. Uniformisation (d ~' lin(D),~7 > O) in D, the length of the routing path corresponding to z is equal to g(z). 4.2
S e c o n d Case: d E lin(D) and dim(D) < n
The second uniformisation technique is given in the following theorem: T h e o r e m 4.2 Let us consider an atomic integral recurrence equation z E D 9 U(z) = f ( Y ( I ( z ) ) ) , where D C._ Z n and I(z) = z + g(z)d. If d e lin(D) and dim(D) < n, there exists a system of conditional uniform recurrence equations in Z n which is equivalent, over D, to the equation. [] 4.2 Note that when the condition dim(D) < n of Theorem 4.2 is not satisfied, the equation needs to be re-indexed in Z n+l before the theorem may be applied. The resulting system of equations is then (n + 1)-dimensional. In this case, the system of equations is defined as follows. Let gma= = maxzeDg(z) and ~ = [gma~,/2Js. If ~ > 0, consider: ~r. z = 0 a hyperplane containing the domain D, such that r in lin(D)L; the routing directions = d + r and ~ = d - r ; r/= ~r.d; D1 = { z + l ~ l z e D,0 < l < ~} and D2 = {Z+lld+12~l z E O1,0 < 11 _< 1, 0 < 12 _< ~} N {z I 7r. z >_ 0}. The system of equations is: z eD.
z e D l , a ( z ) > 7(z)"
Rl(z)=
z e D l , a ( z ) < 7(z)" Rl(z) = z e D l , a ( z ) = 7(z),/~(z)= 1," R ' ( z ) = z e D l , a ( z ) = 7(z),/~(z) = 0"
Rl(z) =
z e D 2 , r . z > O " R2(z) = z E D2,~r.z = 0" z E D1,Tr.z < r/~+ 0 :
R2(z)
oo R l ( z + ~) R2(z + d) R2(z)
R2(z + ~) V(z)
=
o(z +
= 7/~ + 0 :
~(z) =
z E D l , r . z < r/~+ 0 :
=
z E D1,Tr.z = r/~+ O :
/~(z) =
g(z- .r
z E D1,Tr'z < rLq + 0 :
7(z)=
7(z+~)-I
z E Dl,r.z
[g(z- .r +
SWe denote Ix] the floor function returning the greatest integer less or equal than x, and x mode the modulo function returning the remainder of the integral division of z by c.
Reducible Integral Recurrence Equations
291
Fig. 5. Uniformisation (d E lin(D),9 > O)
zEDl,~r.z=y~+0" 7(z)= ~. In this case, the routing paths have the shape of a "roof-top" on the routing domains D1 and D2, as illustrated in Fig. 5 (in a 2-dimensional space). Two routing variables, R 2 and R 1, are needed to pipeline the values of V according to the two directions ~ (the "ascending" part of the paths) and ~ (the "descending" part of the paths), respectively. A single displacement at the "top" of the path is required when the length of the path is odd, and is flagged by the control variable/~. Variables (~ and 7 have a similar function as in the previous case. When ~(z) = 7(z), then a change of direction for the routing path is required and the value carried by R 2 is transferred to R 1. The latter is then pipelined to the original domain D (subdomain of D1), where it is used for the computation of U.
5
THE KNAPSACK
PROBLEM
As an example of our techniques, we consider the knapsack problem ([4, 3]), a classic combinatorial optimisation problem, which consists 9 of determining the optimal (i.e., the most valuable) selection of objects of given weight and value to carry in a knapsack of finite weight capacity. If c is a non-negative integer denoting the capacity of the knapsack, n the number of object types available, wk and vk, respectively, the weight and value of an object of type k, for 1 < k < n and wk > 0 and integral, a dynamic programming formulation of the problem is represented by the following system of recurrences 1~ (k, y) E Dl " F(k, y) = 0
(k,y)~D2" (k, y) E n3 " (k,y) ~ 1)4" (k, y) ~ D2 " (k,y) ~. D4"
F(k,y)= F(k, y) = F(k,y) = V(k, y) = V(k,y)=
0 -~ f ( F ( k - 1 , ~ ) , F ( k , y - ,,,k),V(k,y)) vk V ( k , y - 1),
9This is one of the several variants of the knapsack problem. A complete presentation together with a number of applications is given in [4]. 1~ system of recurrences corresponds to the so-called forward phase of the knapsack problem in which the optimal carried value is computed. The corresponding combination of objects can be determined from the optimal solution in a second phase, known as the backward phase of the algorithm, which is essentially sequential and mainly consists of a backward substitution process on sub-optimal values of F. An efficient algorithm for the backward phase is given in [3].
L. Rapanotti and G.M. Megson
292
Fig. 6. The Knapsack Problem with f defined as f(a, b, c) = max(a, b + c), for all a, b, c, and equation domains" D2 = {(k,y) l 1 _< k < n,y = 0} D1 = {(k,y) lk = 0,1 <_ y < c} Da={(k,y) ll<_k<_n,y<0} D4={(k,y) ll_
(k,y,z) e Di " F(k,y,z) = 0 (k,y,z)~. D~ " F ( k , y , z ) = 0 ( k , y , z ) e D~3 " F ( k , y , z ) = -co
(k,y,z) e D'4" F(k,y,z)=
] ( F ( k - 1,y,z),R~(k,y,z),V(k,y,z))
(k,y,z) en~"
V(k,y,z)=
vk
(k,y,z)eD'4"
V(k,y,z)=
Y(k,y-l,z)
( k , y , z ) e D~,l,7(k,y,z) < a ( k , y , z ) " R l ( k , y , z ) = R l ( k , y - 1, z 4- 1) ( k , y , z ) ~. D~4,~,7(k,y,z) > a(k,y,z)" R l ( k , y , z ) = co ( k , y , z ) e D~,l,7(k,y,z ) = a ( k , y , z ) , ~ ( k , y , z ) = O" R l ( k , y , z ) = R2(k,y,z) ( k , y , z ) E D'4,1,7(k, y, z) = a(k, y, z),~(k,y,z) = 1" R ~ ( k , y , z ) = ! ( k , y , z ) ~. D4,2, z > O" R 2 ( k , y , z ) = R2(k,y - 1, z - 1) !
( k , y , z ) E D4,2, z = O" R2(k,y,z) = F ( k , y , z ) ( k , y , z ) E D~,l,z < ~" ~ ( k , y , z ) a ( k , y - l , z + l) ( k , y , z ) E D~,~,z-'O" a ( k , y , z ) [g(k,y+~,z-~)/2]
R2(k
- l z),
Reducible Integral Recurrence Equations
293
Fig. 7. Uniformisation of the Atomic Integral Data Dependence ( k , y , z ) ~. D~,l,z < g " ( k , y , z ) E D~, 1 , z = g "
fl(k,y,z) = fl(k,y,z)=
f l ( k , y - l , z + l) g(k,y+g,z-O)mod2
( k , y , z ) ~. D~4,1,z < O "
7(k,y,z) =
7(k,y-
(k,y,z)
,y(k,y,z)
O,
e O'4,~,z = O "
=
l , z + l) - i
where 9 = [Wmax/2] and Wmax is the maximum weight of the objects. The new domains are defined as: D~
=
{(k, y, z) ] k = 0,1_< y _< c, z = 0}
D~
=
{ ( k , y , z ) J 1 <_ k <_ n,y = 0, z = 0}
D~
=
{ ( k , y , z ) ] 1 <_ k <_ n,y < O,z = 0}
D~4 = { ( k , y , z ) ] l < _ k < _ n , l < _ y < c , z = O } D~4,1 = { ( k , y , z ) ] l < k < n , l - z < y < c - z , O < _ z < _ ~ l } { ( k , y , z ) ] l < k <_ n , z - 2~ <_ y < c - z,O _< z _< O}. D'4 .2 = As this system of equations is uniform, corresponding regular array designs can be obtained by applying the usual synthesis techniques (see, e.g., [7, 11, 14]).
6
CONCLUSION
The main contribution of this paper is the definition of a new class of recurrences from which regular array designs can be systematically derived. The new recurrences, so-called integral recurrences, constitute a proper superclass of the more traditional affine and uniform recurrences, hence correspond to more general forms of algorithms. Our approach consistently generalises established synthesis techniques. In this way all the knowledge acquired during the past ten years of research on synthesis techniques is readily made available to this new class of algorithms. Besides, our approach facilitates the upgrading of existing software environments for the synthesis of regular array designs to the systematic treatment of integral recurrences. Indeed some computational aspects of the approach require further investigation, such as aspects related to the reducibility properties of generic integral recurrences. In particular the method cannot guarantee an optimal rewriting which minimises space-time
294
L. Rapanotti and G.M. Megson
complexity of arrays. Instead it provides a systematic framework for exploring the design space. The location of optimal solutions necessarily deals with combinatorial issues and demands approximate solutions or designer intervention. The work presented in this paper represents an initial, although essential, investigation into the class of integral recurrences. On-going research focuses on computability issues, the definition of timing and allocation functions to map integral recurrences onto regular arrays and the introduction of forms of parameterisation of the design process. Acknowledgments This work was supported by REFLEX, Science and Engineering Research Council (SERC) grant no. GR/H46725. References
[1] M.C.Chen, "A design methodology for synthesizing parallel algorithms and architectures", J. of Parallel and Distributed Computing, vol. 3, no. 4, pp. 461-491, 1986. [2] F.H.M.Franssen, F.Balasa, M.F.X.B.Van Swaaij, F.V.M.Catthoor, and H.J.De Man, "Modeling multidimensional data and control flow", IEEE Transactions on VLSI Systems, vol. 1, no. 3, pp. 319-327, 1993. [3] T.C.Hu, Combinatorial Algorithms. Addison-Wesley Publishing Company, 1982. [4] S.Martello, P.Toth, Knapsack Problems: Algorithms and Computer Implementation. John Wiley and Sons, 1990. [5] G.M.Megson, An Introduction to Systolic Algorithm Design. Oxford Science Publications, 1992. [6] E.D.Nerig, Linear algebra and matriz theory. J.Wiley & Sons Inc., 1963. [7] P.Quinton, "The systematic design of systolic arrays", INRIA Technical Report, no. 216, 1983. [8] P.Quinton, and Y.Robert, Systolic algorithms and architectures. Masson and Prentice Hall International, 1991. [9] P.Quinton, and V.Van Dongen, "The mapping of linear recurrence equations on regular arrays", J. of VLSI Signal Processing, vol. 1, pp. 95-113, 1989. [10] S.V.Rajopadhye, and R.M.Fujimoto, "Systolic array synthesis by static analysis of program dependencies", PARLE- Parallel Architecture and Languages Europe, Lecture Notes in Computer Science, vol. 258, pp. 295-310, Springer Verlag, 1987. [11] S.K.Rao, Regular Iterative Algorithms and their Implementation on Processor Arrays. PhD Thesis, Stanford University, 1985. [12] L.Rapanotti, G.M.Megson, "Uniformisation techniques for integral recurrence equations", The University of Newcastle upon Tyne, Computing Science, Technical Report Series, no. 478, 1994. [13] J.Teich, and L.Thiele, "Control generation in the design of processor arrays", J. of VLSI Signal Processing, vol. 3, no. 1/2, pp. 77-92, 1991. [14] J.Xue, The formal synthesis of control signals for systolic arrays. The University of Edinburgh, PhD Thesis, CTS-90-92, April 1992.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors)
9 1995 Elsevier Science B.V. All rights reserved.
HOPP
- A HIGHER-ORDER
PARALLEL
295
PROGRAMMING
MODEL
R. RANGASWAMI
Department of Computer Science The University of Edinburgh Mayfield Road, Edinburgh EH9 3JZ Scotland, U.K. ror@des,ed. ae. uk A B S T R A C T . The efficientprogramming of parallelcomputers is stilla difficulttask. This paper focesses on studying methods for expressing parallelism in programs without making the programmer explicitlyresponsible for parallelism. Parallelprogramming using a set of useful implicitly-paraUelconstructs is considered. These constructs are borrowed from FP and the Bird-Meertens approach. Programs are analysed statically,using an analytical cost model which selects a cost-effectiveimplementation for a chosen architecture. A brief overview of the model is presented and its operation is demonstrated with the aid of two case studies. K E Y W O R D S . Parallelisationtechniques, Bird-Meertens Formalism, compile-time analysis, distributed-memory M I M D Machines.
1
INTRODUCTION
Programming parallel computers is still a complex task. The programmer is often responsible for identifying the parallelism, sharing the work-load amongst the processors, and handling interprocess communications. The resulting programs are not particularly portable, being specific to a particular architecture. HOPP is an approach to parallel programming based on an implicitly-parallel functional language. Functional languages have advantages such as referential transparency and abstraction from detail. Referential transparency ensures that a given expression evaluates to the same value irrespective of the order of evaluation. Abstraction from detail is obtained by the use of higher.order functions. A function can be an argument to another function as well as the result of a function application. A function that has one or both of these properties is called a higher.order function. This ability to abstract away from detail hides the
296
R. Rangaswami
complexities of parallel programming from the programmer. However, one of the problems with functional languages is their poor efficiency of implementation compared to imperative languages. The main reason for this is the use of lists as the main data structure. The sequential implementations suffer from overheads arising from creation and destruction of lists and garbage collection. Also, list access is linear in its length as opposed to constant time access in the case of the array. In addition, parallel implementations using list data structures incur significant overheads due to packing and unpacking when communicated between processors. HOPP is based on constructs borrowed from the Bird-Meertens Formalism [2], and FP [1]. It incorporates a basic set of higher-order functions that perform general-purpose operations on lists. These functions are either already part of most functional languages, or can be defined in them. They are inherently parallel and are referred to as recognised .functions. Since the behaviour of each of these recognised functions is predetermined, a program which is expressed in terms of these functions can be analysed at compile-time to realise a costeffective parallel implementation. The main limitation of this approach is that the analysis can only handle regular problems. In this context, a regular problem is one whose behaviour does not depend on the actual input values. The set of higher-order functions themselves impose a degree of regularity on the problem. It would be difficult to express problems requiring irregular communications, using the set of available recognised functions. This means that many irregular problems cannot be parallelised using this approach. In the extreme case where a program does not contain any occurrences of recognised functions, no parallel implementation can be realised. This forces the programmer to remain within the fixed repertoire of available functions in order to obtain parallelism. HOPP aims to generalise the Skeletons approach considered in [3],[4], [5], [6]. Algorithmic Skeletons capture common computational forms. A skeleton is an abstraction of a wellknown parallel computational form, e.g., divide-and-conquer. Each application program must match one or more skeletons in order to be exploited for parallelism. The implementation scheme provides an efficient implementation for the program. The method considered in this paper aims to provide a more general basis for writing programs. The idea is to be able to express programs that may not necessarily match one or more skeletons. An approach similar to the one in this paper, is also considered in [8],[9]. However, the cost measures employed there are simplistic particularly when accounting for communication costs. This paper describes a hierarchical cost model, i.e., parallel implementations are considered for recognised functions and their arguments. The model is aimed at distributed-memory machines and attempts to minimise the communication costs of implementations.
2
O V E R V I E W OF T H E S C H E M E
A program in HOPP is expressed as a sequence of phases. Each phase comprises of one or more occurrences of one or more recognised functions along with instances of user-defined functions. Each recognised function has a predefined parallel implementation on a given target machine topology, along with an associated implementation cost. User-defined functions only have a sequential implementation and are also referred to as sequential functions. A knowledge of the target machine topology is essential in order to select the implementation
HOPP- A Higher-order Parallel Programming Model
297
Figure 1: The Analysis and Implementation Scheme
associated with that topology for the recognised function. Each implementation attempts to make optimum use of the machine connectivity in an effort to reduce communications overhead. The hypercube, 2-D torus and binary tree topologies have been studied. However, this paper only discusses the scheme in the context of the hypercube topology. Parallelism is only exploited within each phase. The phases themselves are sequential and phase i does not commence until phase i - 1 is completed. However, future work could consider pipelining as an option, in order to evaluate the phases in parallel. The basic set of functions in the Bird-Meertens Formalism has been extended to incorporate a number of new functions. It has been shown that all these functions can be expressed in terms of one or more of the existing functions. They have been included as recognised functions in their own right because they are found to be useful in many common problems. Also, the scheme provides a more efficient implementation for them as compared to the implementation that would be obtained if they were expressed in terms of the basic set of functions. The basic outline of the scheme is depicted in Figure 1. The analyser constructs a parse tree representing the program. Each branch in the tree corresponds to a phase of the program. A cost analysis is carried out on the program. The cost of a program comprising of n phases is given by: n
Cost = ~ C osti i-1
where, Cost~ is the cost of phase i. The cost of a phase depends on the nature and number of recognised functions in that phase and also the parallel implementation selected for that phase, on a given number of processors. In our scheme, a phase that has only
298
R. Rangaswami
one occurrence of a recognised function has only one parallel implementation, namely, the parallel implementation for that function on the target machine (in this case, a ddimensional p-processor hypercube). For a phase containing two recognised functions, one of which is the argument of the other, three parallel implementations are possible : 9 A parallel implementation for the outermost function on p processors. 9 A parallel implementation for the innermost function on p processors. 9 A parallel implementation for both functions. The d-dimensional hypercube is divided into 2 k (0 < k < d) smaller hypercubes, with each smaller hypercube containing 2d-k processors. The outer function is evaluated in parallel across 2k processors and the inner function is evaluated in parallel across 2 d-k processors. A similar argument can be applied to phases containing three or more recognised functions. However, parallel implementations are only considered up to three recognised functions in the phase. Any recognised functions below this are just implemented sequentially. Hence, a phase can have at most eight possible implementations (including a sequential one). The costs associated with all the implementations for each of the phases are estimated and a search tree is constructed. The most efficient implementation for the program corresponds to the least-cost path in the search tree. The code corresponding to this lea~t-cost path can then be generated and executed on the parallel machine. The search tree grows exponentially with the number of phases in the program. In this paper, only problems that comprise of a few phases are discussed, so this is not a serious issue. However, for bigger problems some heuristics for pruning the search tree would have to be considered. In order to predict costs fairly accurately and select efficient implementations, some estimate of the input data sizes and the costs of sequential functions are required. This could either be obtained by profiling or from the user. Since the system does not incorporate a profiler, the user is required to specify such information. In order to estimate communications costs, information such as start-up time K0, the bandwidth of the communication channel K1, and the size of the data to be communicated is required. K0 and K1 are machine-specific parameters. A linear model of communication is assumed. The size of the data to be communicated depends on the number of list elements and the size of each element. If the list size is unknown at compile-time, the user is required to estimate it as a function of the number of processors p. Simple comparison metrics are used, such as, list size >> p, list size ~ p, etc. The size of each base element depends on its type and this could be deduced by the compiler. However, programs in functional languages can be written at a very high level of abstraction. A program could apply to a whole group of list types and the exact type information will only be known at runtime. In such cases, the user will be required to specify type information at compile-time. It is important to realise that the analysis scheme does not account for costs arising from operations such as memory access. This would involve a very low level of detail for the analyser, and may make it very machine-specific. The selection of the least-cost implementation depends only on cost comparison and hence absolute costs are not of consequence. List processing costs are, however, accounted for by the model since these costs
HOPP - A Higher-order Parallel Programming Model
299
amount to a substantial overhead in functional languages. This includes costs that are incurred in constructing a new list or traversing a list. The analyser estimates these costs based on the nature of the input list and the details are transparent to the user. The code generator shown in Figure 1 would generate C-code for the target parallel machine with appropriate communication constructs inserted. However, this code generator has not been implemented. To provide preliminary evidence of the performance of the HOPP model and to assist in program development, some support is available in the form of a library of functions. This library contains the code for the various recognised functions and also code for performing various types of communications on the hypercube topology. The hypercube communication functions are based on [7]. The code in this library is used by all the example problems. The actual calls to the functions are at present generated by hand. The use of the same code ensures that performance figures for different examples can be sensibly compared. 3
CASE STUDIES
This section illustrates the method used in HOPP by considering two examples - matrix multiplication and merge sort. For each example, the problem is first expressed in the style just advocated - i.e., as a sequence of phases, with each phase comprising one or more recognised functions. The implementations considered and the costs computed by the analyser are examined. The results of implementing the most efficient implementation selected by the analyser on the Meiko Computing Surface (a transputer-based machine), are presented. The next best implementation selected by the analyser is also implemented, in order to verify the correctness of the model. Finally, a comparison of the predicted and experimental results is presented. Since the Meiko Computing Surface is transputerbased, a processor has at most four links. Hence, the maximum dimension of the hypercube considered for experimental purposes is four. 3.1
D e f i n i t i o n s a n d C o s t s for S o m e F u n c t i o n s
The examples discussed in this paper make use of only a few functions from the extended set of recognised functions. Their definitions are given below. Informal ML-style [10] notation is used. The cost of implementing each function on a d-dimensional, p-processor hypercube is also given. In the following discussion, C! represents the cost of the fimction f. As discussed previously, this cost is to be specified by the user or can be obtained by profiling techniques. Also, n represents the length of the input list. It is assumed that the input list is distributed across the p processors, [~] per processor. 9 m a p - applies some function to every element of a list. f[] =
[]
map f (x::xs)= (f x):: map f xs The cost of m a p is given by ' [~] Cf. The function m a p 2 is similar to m a p , but operates on two input lists of equal length. Its cost is the same as the cost of m a p .
R. Rangaswami
300
map2f[][]= [] map2 f (x::xs) (y::ys) = (f x y) "" map2 f xs ys 9 fold - combines the elements of a list, taken two at a time, using some binary operator f. The scheme assumes that the argument function f is associative. Only if this condition is met, can fold be implemented in parallel. fold f a []= a fold f a (x::xs) = fold f (f a x) xs When a call is made to the fold fimction, a must be the identity of the binary operator. At each step of the computation, a contains the partial result of the fold operation. Each processor performs the fold on its local elements. Then the partial results are combined globally to obtain the final result. The first step involves [ ~ ] - 1 applications of the function f. The second step involves d communications and d applications of the function f. Hence the cost of fold is given b y " C/([p] - 1 ) + d(Tcom + Cj). Tcom represents the cost of communicating the partial result from one processor to its neighbour. This cost depends on the size of the partial result being communicated. 9 c r o s s - p r o d u c t . Based on the order of evaluation, two functions are defined for performing the c r o s s - p r o d u c t operation. The basic operation is the same in both the cases, but the order in which they are performed is different, causing the ordering of elements to be changed in the result. - r_cross_product
r_cross_product f [a~,a2,...,am] [bl,b2, ...,b,] = [[(f a~ bl), (f al b2),..., (f t21 bn)], [(f a2 bl),(f 0.2 b2),..., (f a2 b,)], [(f am 61), (f a,~ b2),..., (f am b,)]] - c_cross_product
c_cross_product f [at,a2,...,a,~]
[bl,b2,...,b,]
= [[(f ax bl), (f a2 b l ) , . . . , (f am b~)],
[(f al b2),(f a2 b2),..., (f am b2)], [(f al b,), (f a2 b,),..., (f am b,)]] The cost of r_cross_product is given by" [m] nCf. The cost of c _ c r o s s _ p r o d u c t p is given by : [~]mCf. The second list would have to be broadcast to all the processors9 9 s p l i t - splits a given list into a specified number of sublists.
HOPP - A Higher-order Parallel Programming Model
301
split 1 [ x , , x ~ , . . . , x , , ] = [ [ x , , x , , . . . , x , , ] ] split k [x,,x~.,...,xn] = [[xl,x,.,...,zf~r]],...,[x(k_,)f~rl+,,...,x,,]]
split only involves list re-arrangement costs. A list of size n is re-arranged to a list of lists, with ( k - 1)sublists of size [~.] and the last sublist of length n - ( k - 1)[~]. These list processing costs are calculated by the analyser, based on the nature of the input list. For the purposes of this paper, the cost of splitting an n-element list into k parts, is represented by S,,k. 9 c o m p o s i t i o n . Tile c o m p o s i t i o n operation is represented by o and corresponds to the sequence of phases in a program. The terms composition and sequence will henceforth be used interchangeably. (f o g) x = f (g x) The cost of (f o g) is given by : C! + C 9. Occurrences of the form f (g x) will be replaced with (f o g) x before the cost analysis is carried out. 3.2
Matrix Multiplication
This section discusses the well-known problem of multiplying two matrices Am• and B,,• resulting in the matrix Cm• 3.2.1 Program Code The problem can be expressed as a composition of two phases as shown below. The recognised functions are depicted in bold face and informal ML-style notation is used.
fun mat almlt times plus A B = m a p ( m a p (fold plus 0)) o r_cross_product ( m a p 2 times) A BT; where, fun plus a b = a + b; fun t i m e s a b = a * b; B r represents the transpose of B The transpose is performed sequentially, in order to improve program performance. 3.2.2 The Analysis The analyser constructs the parse tree represented in Figure 2 for the problem. The root node represents c o m p o s i t i o n . For phase one, the analyser computes the costs for a sequential implementation and three parallel implementations, corresponding to the two recognised functions in the phase. On a p-processor hypercube, if r_cross_product is implemented on p~ processors and m a p 2 is implemented on p~ processors, p~ _( p, p~ _( p and PIP,. i 1 = P, the cost of phase one is as follows (see section 3.1) '
c,, = c.o. + rgl Ccom represents the costs incurred in distributing the initial input lists across the p processors. Cti,,e., represents the cost of the sequential function times. It may be noted that the result of phase one is a list of list of lists of size (m • n • k). Phase two contains three recognised functions. The analysis results in the costs for a sequential implementation and
302
R. Rangaswami
map
r_cross_product
I
I
map
map2
I
I
fold
times
plus
Figure 2: The Parse Tree for Matrix Multiplication seven different parallel implementations. If the outer m a p is implemented on p~ processors, the inner m a p is implemented on p~ processors and the fold is implemented on p~ proPlP2P3 = P, the cost of phase two is computed as follows c,., =
+
1 - 1)+
d(r o
+ G,..).
R~om represents the cost incurred in re-arranging the results of phase one to suit the implementation of phase two, d -- log2 P~3,T~om is the cost incurred in communicating the partial result of the fold on one processor to a neighbouring processor.
1 Two sets of matrices A12x32, B32x32 and A~4x32, B~ x16, were considered for analysis. In both the cases, the following implementations were selected as the best (i.e., least-cost) and second best parallel implementations respectively, for p = 2, 4, 8,16. 1. B e s t - For the two phases, p] = p, p~ = 1 and p~ = p, p~ = 1. This corresponds to a parallel implementation for r_cross_product in the first phase and a parallel implementation for the outer m a p in the second phase. 2. Second Best - For p = 2, it is clear that only one function in a phase can be implemented in parallel. For the second best implementation, m a p 2 is implemented in parallel in the first phase and fold is implemented in parallel in the second phase. For p = 4, 8,16, the following implementation was selected as the second best. In phase one, p~ = 2, p~ = p~T" In phase two, p~ = 2, p~ = 1 and p~ = L This corresponds to a parallel implementation for r_cross_product on p~ processors and m a p 2 on p~ processors in the first phase. In the second phase, the outer m a p and the fold are implemented in parallel on p~ and p~ processors respectively. The inner m a p is implemented sequentially. 3.2.3 The Results The theoretically predicted values are plotted along with those obtained experimentally in Figure 3, for both implementations 1 and 2. In each of the graphs, the trend predicted by the model is also obtained experimentally. The experimental curves are above the theoretical ones. This is because the model does not account for low level costs that are incurred by a practical implementation. For both sets of matrices, the
303
HOPP - A Higher-order Parallel Programming Model
M a t f l x M u l t l p l J c R t l o n - (64x32)X (32xl6)
Mmtrix Multiplication - (32x32) X (32x32)
~oo'oo
"l;~4;.~&;: ~ ; ; ~ ; . . ; . . . . . .
i
550,OO
~
5oo.oo
q ! . it
%7~.,'.7;
(500.(}0
IIb
.~.~
| I
4oo.oo
* ~
3.',0.o0
. I.. I I
~, ^ .
~
tl~
2.~.OO
~-oo
I
t I I
I I
,
- ..~..
n
IIIII
m
t
~ $$ "
tl
"
Q1~,;~4"~I~;;~; . . . . . r .....
.....
I
9~OO.OO
~' "l~.'lm.'tk'al [L,'M
5
T
"" "
"
.too.oo
tI II Q.
200.00
L No. cff l~N.'t'~un
IO
15
'
.'..-o Po F . .
4
'
IOO.OO
10
No. (ff l~wt't~,~ 15
Figure 3: The Results for Matrix Multiplication
optimal number of processors predicted is verified experimentally. Also, the theoretical and experimental curves in implementation 1 are much closer than the respective curves in implementation 2. The same sequential code is executed by both implementations. However, implementation 2 involves more communications (such as scattering and broadcasting) for initial data distribution. Also, the parallel implementation of fold in phase two involves communications. An implementation involving more communications appears to perform worse than predicted. This is probably due to the method used to estimate K0 and K1, the machine-specific communication parameters. Communication costs were measured for different data sizes. For each data size, an average cost over a large number of such communications was considered. K0 and Kx were then obtained by a curve fitting technique. In the case of a real program, the cost of just one instance of a communication is measured. The processors also probably do not operate in lock-step. Communication routines such as scatter and broadcast assume a number of overlapping nearest-neighbour communications. These may not necessarily overlap in practice. This probably explains the larger distance between the theoretical and experimental curves for implementation 2.
3.3
Merge Sort
This section discusses merge sort. A list of integers is to be sorted in ascending order. The strategy adopted is to sub-divide the list into a list of lists, with each sublist containing two elements. The two elements in each sublist are sorted and the sorted sublists are then merged, two at a time, to obtain a fully sorted list. For the sake of simplicity, the number of elements in the input list is assumed to be a power of 2.
3.3.1 Program Code The problem can be expressed as a sequence of three phases as shown below. The recognised functions are again depicted in bold face. The function g_fold will be explained shortly.
R. Rangaswami
304
fun sort n xs = g_fold (merge[]) o m a p msort o split (n/2) xs; where, n is the length of the list fun msort [x,y] = if(x > y) then [y,x] else Ix,y]; f..
[][]
-
[]
merge xs [] = xs merge [] ys = ys merge (x::xs) (y::ys) = if (x _< y) then x :: merge xs (y::ys) else y :: merge (x::xs) ys;
3.3.2 The Analysis In this case, all the three phases have only one recognised function each. Hence, only two implementations are possible for each phase - one parallel and one sequential. The cost of the function resort is a constant, but the cost of merge is proportional to the sizes of the two input lists that are being merged. In this case, the scheme allows the user to specify a cost function for the sequential function merge. This cost function is used by the analyser to estimate implementation costs. The analyser constructs the parse tree representing the program. For phase one, the cost of split is given by 8~,2~./. The cost for the sequential implementation is obtained by. making p = 1. At the end of phase one, each processor contains a list of lists of size (~p x 2). Phase two only involves a m a p and its cost is given by ~Cms, where C,,s represents the cost of the function resort. In phase three, after each merge operation, the size of the resulting list is the sum of the sizes of the two lists being merged. As the fold progresses, the size of the intermediate result increases. After each processor executes the fold locally, the number of elements on each processor is ~. In the first step of the combination of partial results, two lists each of size "are merged to produce a list of size 2~. The second step involves the p p merging of two lists each of size ~_n and so on. Let the cost of merging two lists of sizes m p and n be represented by f(m, n). Using the formula for the cost of fold in section 3.1, the cost of phase three is as follows 9 -2p a--1
d-1
f(2i, 2) + z_,,X"(T~'gco,,+ f(2' p, 2'np))" i=1
i=0
In order that the analyser accounts for the increasing size of tile emerging result while ca,1culating communications costs, the user is required to specify the fold operation by the use of g_fold. In the case of matrix multiplication the size of the emerging result is a constant. The computation in that case is therefore specified by an ordinary fold. Syntactically there is no difference between fold and g_fold, g_fold provides extra information to the analyser, enabling it to predict costs more accurately. Two lists of sizes 512 and 1024 respectively, were considered for analysis. In both the cases the following implementations were selected as the best (i.e., least-cost) and second best implementations respectively, for p = 2, 4,8,16. 1. Best - Implement each recognised function in parallel on p processors for all the phases.
HOPP- A Higher-order Parallel Programming Model
305
2. Second Best- In phase one implement the split sequentially on a single processor. Then scatter the result to implement the two recognised functions in the next two phases in parallel on p processors.
3.3.3 The Results The results for merge sort are plotted in Figure 4. Once again the trend predicted by the model is obtained experimentally. For the input list of size 512,
Figure 4: Tlle Results for Merge Sort the optimal number of processors is predicted to be 8 for both implementations. This is also obtained experimentally. For the list of size 1024, a more effective implementation was obtained with 16 processors both theoretically and experimentally. In the case of merge sort, fewer communications are involved in the initial distribution of data, as compared to either implementation of matrix multiplication. A closer match between the predicted and experimental results is obtained. 4
CONCLUSIONS
The paper discussed two application programs that were expressed using the HOPP model. The constructs used to express the applications had predefined parallel implementations and costs. An analyser computed the most cost-effective implementation for each of the programs. These were implemented and the predicted and experimental implementations were found to match fairly well. The two programs were of a regular nature and it was therefore possible to express them in terms of the existing constructs. Irregular problems such as quicksort that depend on the actual input data values cannot be effectively parallelised using this scheme. This is a limitation of the HOPP model and arises due to the use of compile-time techniques to predict program performance. The advantages of the model include architecture-independent programs and simplicity of programming parallel machines. Future work needs to concentrate on testing the expressiveness of the model by considering more examples. The code generator needs to be implemented. A profiler could be
306
R. Rangaswami
incorporated in the system, in order to reduce the amount of information required from the user. It would be interesting to study the HOPP model on shared-memory machines. Acknowledgements I would like to thank Murray Cole for supervising this work. Thanks are also due to the Edinburgh Parallel Computing Centre for providing parallel machine facilities. This work was partly funded by the Overseas Research Studentship awarded by the British Council. References [1] J.Backus. Can Programming be liberated from the Von Neumann stye : a Functional style and its algebra of programs. In Comm. ACM, Vol. 21, No.8 Aug. 1978, 613-41. [2] R.S.Bird. Algebraic Identities for Program Calculation. In The Computer Journal, Voi.32, No.2 1989, 122-126. [3] M.Cole. Algorithmic Skeletons : Structured Management of Parallel Computations. Research Monographs in Parallel and Distributed Computing Pitman 1989. [4] M.Cole. Higher-Order Functions for Parallel Evaluation. In Proceedings of the 1988 Glasgow Workshop on Functional Programming August 1988, 8-20. [5] J.Darlington et al. Parallel Programming Using Skeleton Functions. In PARLE '93 Parallel Architectures and Languages Europe. 5th International PARLE Conference, Munich, Germany, June 1993, 146-160. [6] H.Stolze, H.Kuchen. Parallel Functional Programming Using Algorithmic Skeletons. In Parallel Computing : Trends and Applications. Proceedings of the International Conference ParCo93, Grenoble, France, September 1993, 651-654. [7] Y.Saad, M.H.Schultz. Data Communications in Hypercubes. In Journal of Parallel and Distributed Computing 6, 1989, 115-135. [8] D.B. Skillicorn. Parallelism and the Bird-Meertens Formalism. Internal Report, Department of Computing and Information Sciences Queens University, Kingston, Canada, April 1992. [9] D.B.Skillicorn and W.Cai. A Cost Calculus for Parallel Functional Programming. Internal Report, Department of Computing and Information Sciences Queens University, Kingston, Canada, May 1992. [10] M,Tofte. Four Lectures on Standard ML. Technical Report, ECS-LFCS-89-T3, University of Edinburgh, 1989.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors)
9 1995 Elsevier Science B.V. All rights reserved.
307
DESIGN BY TRANSFORMATION OF SYNCHRONOUS DESCRIPTIONS
G. DURRIEU, M. LEMAITRE Ddpartement d'Etudes et de Recherches en Informatique, CERT / ONERA, ~ avenue E. Belin, 31055 Toulouse eedex, FRANCE.
ABSTRACT. This paper presents the results of a research project based on the transformational paradigm for architecture design. The purely functional and synchronous language LUSTRE is used for descriptions. Three examples of synthesis are given. KEYWORDS. Transformational design, synchronous functional language, LUSTRE. 1
INTRODUCTION
The approach of hardware design (VLSI architectures) by program transformation is advocated and investigated for some time [12], and still is an active working area [15, 13]. This paper presents the results of a research based on this paradigm. An interactive software tool has been developed as support environment for transformation of LUSTRE programs, which allows to achieve, through derivations, the description of a hardware system from a formal specification, and to prove its correctness. With this tool, called TRANSE, specifications progressively and interactively are transformed through successive rewritings, using previously proved transformation sketches, until programs that are structurally interpretable (in terms of architectures) and have good operational properties are obtained. After a brief presentation of the TRANSE system, the paper presents three examples of synthesis using transformations, belonging to different domains: signal processing, CPU architecture and arithmetic oriented architecture. In the conclusion, the interest and limitations of the proposed method are underlined. 1.1 C o n c e p t i o n and Validation
Two approaches are commonly used in order to verify that an implementation matches a specification' 9 simulation: it cannot be exhaustive since the number of different cases to be tested is gigantic, even in the case of a weak complexity architecture.
308
G. Durrieu and M. Lemaitre
9 formal a posteriori verification: the behaviour of tile designed architecture is inferred from its structural description, then the equivalence with the specification is formally proved. The latter, full description of which can be found in [8, 21], starts to be industrially used, for example: [11, 17, 5]. However it suffers from the following drawbacks: the behavioural specification and the structural description are two objects obtained without any formal link between them; the elaboration of the final structural description is not assisted; the development process is not kept and thus is not reusable. Moreover, really operational proof techniques, based on Binary Decision Diagrams ("tautology checking"), thus on boolean algebra, for instance are not able to take in account arithmetical or algebraic properties; only the "theorem proving" approach [22] can. There is a third approach, more prospective for the moment, called "design by computation", which doesn't suffer from the drawbacks mentioned above. This approach, which the present paper belongs to, is detailed in 1.3. 1.2 The L U S T R E Language Our work wa.~ based on LUSTRE [9, 23], a purely functional synchronous executable language. A LUSTRE program is a system of equations defining streams of values (signals) in a mutual recursive fashion. This formalism is used by Merlin-Gdrin for describing the control automata of its nuclear power stations (SAGA/LUSTRE environment) [24] and by Agrospatiale for the design of on board calculators for the Airbus aircraft. The skills of the language are the following: clarity, concision, power, abstraction, referential transparency, "natural" (maybe graphical) expression. Historically this formalism was first used by automaticians, and fits to hardware systems. There is a direct correspondence between LUSTRE programs and architecture diagrams, as we will see in the examples. LUSTRE is quite close to Silage [10], and gets a formal semantics.
1.3 Transformational Design Transformational design consists in progressively transforming behavioural specifications, through successive rewritings, and using previously proved transformation sketches, until programs which can be structurally interpreted in terms of architectures and having good operational properties are obtained. Thus the matter is the construction of a hardware architecture while establishing the proof of its correctness, using a top-down approach. Program transformation techniques set up a formal link, equivalence or implication, between different programs performing the same function. This approach has been investigated for a long time in software engineering, and has been moderately successful. We guess that it is better suited to the hardware domain, since the strong constraints attached to it limit the number of potential solutions. The design by calculation approach applies to either automatic synthesis (e.g. [3]) or semi-automatic aided synthesis, starting from either functional [19, 4] or relational [13] formalisms. ALPHA [16] formalism and environment originate in the same program transformational design principles, but are especially dedicated to the design of systolic arra.ys.
Transformation o f Synchronous Descriptions
rule QUO ; vat R,N: int ; eonst let assert (k>0); N = 0 -> p r e ( N + l ) R = N mod k ; Q = N div k ; tel ; m)
k
:
int
309
;
;
let Q
= 0 -> p r e
( is R
= k-1
then
Q+I
else
Q
) ;
tel;
Figure I An example of a transformation rule in the TRANSE system. Rules are expressed usi,g a simple extension of the LUSTRE la,guage. This transformatio, allows the computation of successive quotients (Q stream) of natural h~tegers (N) by a constant (k) to be optimized. Declarations in the first block express that for the tra,sformation to be applicable, the sub-expressions bound to the R a,d N streams must be of integer type, and the sub-expression bound to k must be an integer constant. The assertion within the block states that this constant must be positive. The additional equatio,s are templates which equatio,s of the program being transformed must match with, for the transformation to be applicable.
1.4 A Tool Supporting Transformations An interactive software tool providing assistance for transformations called TRANSE has been built, which specifically operates on LUSTRE programs. The synthesis performed could be labelled as semi.automatic or assisted, as in fact the designer transcribes his know-how into the tool, which only formally verifies the steps of the derivation. This environment, based on the CENTAUR [6] generic environment, has the following features: e it includes a parametrized set of transformation rules. Each rule (see example figure 1) is expressed in a formalism which is a simple extension of LUSTRE. 9 it presents to the designers the specification to be processed and a menu with the available transformations (see figure 2). 9 it allows at any moment the mouse selection of sub-expressions of the specification and the transformation to be applied (the result of all possible transformations on a given sub-expression is automatically produced). 9 it offers helps such: backtracking, trace, saving of results, history... The general rule set includes transformations such folding/unfolding, arithmetic and algebraic rules, evaluation rules ("constant folding"), rules specific to the LUSTRE language semantics, including local retimings and conditional equation manipulations... Figure 2 presents the interface of the TRANSE system. 2
THREE
EXAMPLES
OF SYNTHESIS
2.1 Signal Processing This is a small classical example, presented mainly in order to introduce the LUSTRE language and the interactive transformation method, from [18]. Starting from the
310
G. Durrieu and M. Lemaitre
Figure 2 The TRANSE system interface. The central window holds the specification being processed. Transformations which can be applied to the selected expression appear in the left window. Finally, the right window holds the history of already performed transformations.
Transformation of Synchronous Descriptions
eonmt node
a,b
: real
filter
;
(u:real) r e t u r n s (x:real) + b*pre(u) ;
;
let x = a * p r e ( x ) tel
aonst node
a,
b,
filter2
a2,
ab
:real;
(v :real)
a2
returns
= a'a, (x
a b = a'b;
:real) ;
let x = pre(a2*pre(x)) u = p r e (v) ;
+ pre(pre(ab*u)
+ pre(b*v));
tel
Figure 3 Programs of the linear filter i. LUSTRE. At the top: before transformations. At the bottom: after transformations enhancing the critical period.
Figure 4 Diagrams of the linear filter correspomting to the LUSTRE programs of figure 3. Every LUSTRE program structurally correspo.ds to an architecture diagram, in particular, the pre operator correspo.ds to a memory element (delay or flip-flop). At the top: before transformations. At the bottom: after transformations.
311
G. Durrieu and M. Lemaitre
312
equation of a linear filter: x,+l = ax, + bu,, a LUSTRE program is directly obtained (top of figure 3). The corresponding architecture represented at the top of figure 4, has a critical period of T,n + Ta, T,n and Ta representing the time taken by a multiplication and an addition. Having in mind the elaboration of an equivalent architecture with a shorter critical period, transformations are applied and the program shown at the bottom of figure 4 is obtained; the critical period of the corresponding architecture is max(Tm,Ta). The transformations performed are detailed below. The starting point is: X
= a*pre(x)
+ b*pre(u)
In order to enhance the critical period, we must alternate p r e and arithmetic operators on the whole path, including the feed-back. Let us unfold x in the right hand part 9 x
=
a*pre(a*pre(x)
+ b*pre(u))
+ b*pre(u)
Let us use the rule k * p r e ( z ) = p r e ( k ' z ) , provided that k is a constant" X
= pre(a*(a*pre(x)
+ b*pre(u)))
+ b*pre(u)
Let us use the conventional distribution x * ( y + z ) = ( x * y ) + ( x * z ) a2=a*a, ab=a*b 9 X
= pre(a2*pre(x)
+ ab*pre(u))
and let us state
+ b*pre(u)
Let us use the distribution rule p r e (x+y) = p r e (x) + p r e (y)" X
=
(pre(a2*pre(x))
+ pre(ab*pre(u)))
+ b*pre(u)
=
(pre(a2*pre(x))
+ pre(ab*pre(u)))
+ pre(b*u)
Let us use the previous rule, distribution on +, reversed 9 X
= pre(a2*pre(x))
+ pre(ab*pre(u)
+ b*u
)
= pre(a2*pre(x))
+ pre(pre(ab*u)
+ b*u
)
Let us state u = p r e (v) in order to eliminate the direct dependency of + and *. We obtain b * u = b * p r e (v) = p r e ( b ' v ) , and thus 9 x = pre(a2*pre(x)) + pre(pre(ab*u) + p r e ( b * v ) ), expression having the expected properties. It should be noted that the specification changed 9 the input is no longer the signal u but v which u depends on.
2.2 C P U A r c h i t e c t u r e
This example shows a possible use of the TRANSE system for pipelining combinatorial functions. The starting point is a simplified case study published by J.B SAXE et al. [20]. The pipelining technique is widely used in the design of most of recent RISC processors. The example thus is significant. The basic idea consists in reducing the length of a critical path by introducing intermediate registers, without altering the general function. The clock period can then be reduced. The Central Processing Unit pipelining transformation is complex; it is all the more complex since the given CPU is complex itself (remember that the considered example is a simplified one): errors can be entered when performing this transformation manually, which can be avoided using an assistance system such TRANSE. Let us consider the simplified representation of a CPU (figure 5) made up of: 9 an Arithmetic and Logic Unit (ALU) receiving as input a stream of unary operations (Op) to apply on a stream of data read into the memory (RD), and producing a stream of results to be written into the memory (WD).
Transformation of Synchronous Descriptions WA
Op
313
RA
Figure 5 Diagram of a mini-CPU. const s i z e = 8 ; -- n u m b e r of w o r d s type d a t a , a d d r e s s , o p e r a t i o n , r e g = function s e l e c t (RF:reg; R A : a d d r e s s ) function A L U ( O p : o p e r a t i o n ; R D : d a t a ) function a s s i g n (RF:reg; W A : a d d r e s s ; node
mini
(
RA:address ; WA:address ; Op:operation )
returns
( RF : register RD,WD:data ; R:reg ;
vat let
RD = select(RF,RA) ; WD = ALU(Op,RD) ; RF = pre(assign(RF,WA,WD)); tel
in R e g i s t e r F i l e (data m e m o r y ) data^size ; r e t u r n s (RD:data) ; r e t u r n s (WD:data) ; WD:data) returns (RFl:reg) ;
-- R e a d A d d r e s s -- W r i t e A d d r e s s -- A L U O p e r a t i o n );
-- c o n t e n t
of R e g i s t e r
-- R e a d D a t a -- c o m p u t e W r i t e D a t a -- s t o r e b a c k i n t o R e g .
File
File
.
Figure 6 LUSTRE program describing the mini-CPU (diagram of figure 5). 9 a Memory (/14) fitted with: a read combinatorial function ( s e l e c t ) receiving a,s input a stream of reading addresses (RA) and the memory contents (RF), and producing a stream of read data towards the ALU. a write combinatorial function ( a s s i g n ) receiving as input a stream of writing addresses (WA), the memory contents (RF) and a stream of data from the ALU (WD), and producing the new contents of the memory (RF). -
-
Figure 6 shows the specification in LUSTRE of this mini-CPU. The a s s e r t statement expresses a relation which is always true between the s e l e c t and a s s i g n functions. It must be stressed, as a feature of this CPU, that its cycle time is limited by the addition of the delays introduced by the combina.torial functions ALU, s e l e c t and a s s i g n . Using program transformations based in pa.rticular on the last assertion, one will be able to get a program corresponding to the diagram of figure 7. The complex combinatorial functions ALU, s e l e c t and a s s i g n are separated from ea.ch other by memories, in such a way that the cycle time can be reduced. 2.3 A r i t h m e t i c O r i e n t e d A r c h i t e c t u r e This synthesis example is a more algorithmic one. The issue here is the specification a.nd the
(7. Durrieu and M. Lemaitre
314 WA
l
W^I I :
.
.
.
.
l
Op
. . . . .
WA1 RAWA2 RA
-
RA
. . . . . . . .
Figure 7 Diagram of the mini-CPU after transformations, equivalent to the diagram of figure 5. synthesis of an architecture taking as input an integer value and computing as output the integer square root (by default) of this value. This problem is suggested by [1, page 113]. Starting from an informal description of the algorithm, a general automaton will be derived, then a LUSTRE specification. This program will then be progressively transformed in order to obtain an optimized architecture. In this example we are guided by a strategic goal which is the elimination of multiplications and divisions (other than by power of 2). R is the square root of N if and only if R 2 < N < (R + 1) 2. The algorithm proceeds in two steps. First step: the smallest P power of 2 such that N < p2 is found. The second step consists in searching for R by dichotomy between 0 and P. Two variables are defined: B (standing for "bottom") and H (standing for "top") such that during the computation B ~ < N < H 2 is always true. MoreoverD = H - B i s s t a t e d . IfD = 1 the algorithm finishes: the result is B, since B then matches the above definition. Else the mean M = (B + H)/2 is computed and examined" if N < M 2 then the new margin { B , M } replaces the old one { B , H } , else {M,H} does. A very simple way to more formally express an iterative algorithm such as above is to put it in automaton form (figure 8). The state variables are P, B, H, D and F l a g , the last one holding true during the first step (bounding) and false during the second step (binary search). " - " denotes an undefined ("don't care") value. The result variables will be R e a d y (holding true when the result is available) and R e s u l t . The formal specification in LUSTRE is a direct transcription of the automaton. This transcription is very easy to achieve. Reading the vertical columns associated to each variable is sufficient. For example, for P one obtains: P = 1 -> p r e
(if F l a g
then
if N < P * P
else
Undef_2
then Undef_l else 2*P ) ;
Transformation rules associated to undefined signals Undef_x allow, giving them a value, to simplify the above equation: P = 1 -> p r e
( 2*P
) ;
Transformation of SynchronousDescriptions
starting
state
transitions the
conditions
: P
B
H
D
Flag
1
-
-
-
true
(next v a l u e s )
transitions
are valid
on current
: with
D>I,
E = D div
state
next B
P Flag Flag not not
and N
315
and N<M*M and not(N<M*M)
R e a d y = (not Flag) a n d (D=I) R e s u l t = if R e a d y t h e n B e l s e
2 and M=B+E
state H D
Flag
J J
2*P
0 -
P -
P -
false true
J J
-
B M
M H
E E
false false
-
Figure 8 A program extracting integer square root in automaton form. "-" stands for .ndefined value (don't care). n o d e r a c (N:int) r e t u r n s ( R e s u l t : i n t ; R e a d y : b o o l ) v a t F l a g : b o o l ; P, B, D, E, M : int; let P = 1 -> p r e ( 2 * P ) ; B = U n d e r _ 3 -> p r e ( if F l a g t h e n 0 else Is N < M * M t h e n B else M ) ; D = U n d e f _ 7 -> p r e ( i s F l a g t h o n P o18o E ) ; F l a g = t r u e -> p r e ( F l a g a n d n o t N < P * P ) ; E = D div 2 ; M= B + E ; R e a d y = (not Flag) a n d (D=I) ; R e s u l t = ~f R e a d y t h e n B e l s e U n d e f _ 9 ; a s s o r t (D>0) ; tel
;
Figure 9 A program extracting integer square root in LUSTRE, intplied by the automaton of figure 8.
The second equation implies the first, Figure 9 shows a LUSTRE program implied by the automaton of figure 8, on which transformations will be applied, guided by the following strategic goals" 9 to suppress multiplications p*p and hi*M; 9 to keep as few as possible variable definitions which use a p r e , because according to the correspondence with architectures, each p r e implies a memory register;
G. Durrieu and M. Lemaitre
316
node r a c ~'ar
(N:int) r e t u r n s ( R e s u l t : i n t ; R e a d y : b o o 1 ) ; Flag, N I t M 2 : b o o l ; P2, M2, B2, E20 BD, BE, D2
: int
;
let
Flag
= true
-> p r e
{Flag a n d n o t N < P 2
R e a d y = {not Flag) a n d Result = BD ; NItM2 = N<M2 ; P2 = 1 -> p r e (4 * P2) M2 = B2 + B D + E2 ; E2 = D2 d i v 4 ; B2 = 0 -> p r e ( s Flag elge if e l s e M2 BE = BD div 2 ; B D = 0 -> p r e ( Is F l a g e l s e Is e l s e BE D2 = 0 -> p r e ( Is F l a g e l s e E2
(D2=1)
} ;
;
;
then 0 N I t M 2 t h e n B2 ) ; then 0 N I t M 2 t h e n BE + E2 ) ; t h e n P2 ) ;
tel
Figure !0 A LUSTRE program extracting integer square root equivalent to the program of figure 9 after transformations.
9 to hctorize as much as possible common expressions. Finally the program of figure 10 is obtained. There are no longer multiplications, except by 4, but that could be done using shifts. This final implementation keeps the sizeable parallelism already existing in the specification. It could be shown that the computation time (identical to that of the specification) is proportional to O(log 2 N). 3
CONCLUSION
The use of program transformations in hardware design is well-known ([12, 13, 15, 19, 20] for example). The contribution of this paper to the domain is twofold. It shows that: 9 the LUSTRE language is well-suited to support this approach. 9 as suggested by our three examples, the transformational approach applies to quite different domains of hardware synthesis; the last example (an arithmetic architecture computing integer square roots) shows that we can reuse and formalize results from the software transformation field. Opposite opinions are often encountered concerning the proposed approach. Some people like the interactive, exploratory feature, and the interactive capability left to the designer in front of the tool; others accept only fully automated tools. Some particular problems of retiming are in hct automatable [14]. The example of the filter probably could be automated (section 2.1), since the strategy is a quite simple and guiding one. The example of mini-CPU pipelining probably could not, since it is necessary to know the characteristic equation of the memory in order to carry out the work. The example of square root certainly could not since it exploits the creativity of the designer.
Transformation of Synchronous Descriptions
317
In the example of the mini-CPU, the synthesis emerges into a solution not known in advance. The designer tries to introduce the idea of buffering the lastest written elements, and corrective parts are brought to him without additional cost. On the contrary, in the example of square root, the general idea is rather the verification of a sequence of pre-existing transformations, found in literature. The rewriting system implemented by TRANSE can be seen as forward chaining deduction, which was found, considering this last example, to be sometimes rudimentary. This example in fact underlines the limits of the system at present time. There is thus an oscillation between verification and genuine synthesis (invention). It would be interesting to incorporate to TRANSE the possibilities of deductive systems proceeding with demonstration of sub-goals, LARCH [7] for example, already used for architecture verification [20]. This study incidentally showed the great quality of LUSTRE, a purely functional, synchronous, data flow, equational language, for specifying VLSI architectures and formally working on these specifications. One could criticize some misses at the expression of generic components level. On the other hand some possibilities of the language were not exploited here, in particular all the features concerning clocks, which allow the description of subsets working with separate clocks. This project is now joining a larger research project, named ASAR, grouping together six French teams, and oriented towards architectural and system synthesis, the goal of which is to build a generic multi-formalism framework for architectural synthesis [2]. Acknowledgments This work has been supported by Etude D R E T 3468-00 and French C N R S [ P R C A N M . The exposition of this paper has greatly benefittedfrom the comments of referees and members of the conference committee. References
[1] [2] [3] [4]
[5] [6] [r]
J. Arsac, Prdceptes pour programmer, DUNOD, 1991. P. Asar, Towards a multiformalism framework for architecural synthesis : the ASAR project, in Proc. of the Codes/CASH'94 Conf., Grenoble, France, Sept. 1994. G. Berry, Esterel on Hardware, in Mechanised Reasoning and Hardware Design, Prentice Hall, 1992. R.T. Boute, Systems semantics 9 Principles, applications, and implementation, ACM Trans. Prog. Lang. Syst., 10 (1988), pp. 118-155. Bull Systems Products, Proving "Zero Defect" with VFormal, 1993. D. Cldment, Gipe : Generation of interactive programming environments, TSI, 9 (1990). S. J. Garland, J. V. Guttag, A Guide to LP, The Larch Prover, tech. rep., MIT Laboratory for Computer Science, Dec. 1991.
318
[81 [9] [io] [11] [12] [13]
[14] [15l [16]
[17] [18]
[191
[20] [21]
[22] [23]
I24]
(7. Durrieu and M. Lemaitre A. Gupta, Formal hardware verification methods 9 A survey, Tech. Rep. CMU-CS-91193, Carnegie Mellon University, Pittsburg, PA 15213, Oct. 1991. N. Halbwachs, P. Caspi, P. Raymond, D. Pilaud, The synchronous dataflow programming language LUSTRE, Proceedings of the IEEE, 79 (1991), pp. !305-1320. P. N. Hilfinger, Silage, a high-level language and silicon compiler for digital signal processing, in IEEE Custom Integrated Circuits Conference, Portland, Oregon, May 1985, pp. 213-216. W. A. Hunt, B. C. Brock, A Formal HDL and its use in the FM9001 verification, in Mechanised Reasoning and Hardware Design, Prentice Hall, 1992. S. D. Johnson, Synthesis of digital designs from recursion equations, The MIT Press, Cambridge, Massachusetts, 1983. G. Jones, M. Sheeran, Designing arithmetic circuits by refinement in Ruby, Science of Computer Programming, 22 (1994), pp. 107-135. C. E. Leiserson, J. B. Saxe, Retiming synchronous circuitry, Algorithmica, 6 (1991), pp. 5-35. J. D. Man, J. Vanslembrouck, Transformational design of digital circuits, in EUROMICRO Cologne, 1989. C. Mauras, Aplha 9 un langage dquationnel pour la conception et la programmation d'architeetures parall~les synchrones, th~se de doctorat, Universit~ de Rennes I, d~cembre 1989. O. Coudert, J. C. Madre, A Unified Framework for the Formal Verification of Sequential Circuits, in Proc. of IEEE International Conference on Computer Aided Design'90, Santa Clara., California, Jan. 1990. / K. K. Parhi, D. G. Messerschmitt, Pipeline Interleaving and Parallelism in Recursive Digital Filters - Part I, IEEE Trans. on Acoustics, Speech and Signal Processing, 37 (1989), pp. 1099-1117. J. G. Samson, L. J. M. Claesen, H. J. deMan, Correctness Transformations on the Hough Algorithm, in CompEuro 92, The Hague, May 1992. J. B. Saxe, J. J. Homing, J. V. Guttag, S. Garland, Using transformations and verification in circuit design, Formal Methods in System Design, 3 (1993), pp. 181-210. V. Stavridou, Formal Methods and VLSI Engineering Practice, The Computer Journal, 37 (1994), pp. 96-113. V. Stavfidou, T. F. Melham, R. T. Boute, eds., Theorem Provers in Circuit Design, IFIP Transactions A: Computer Science and Technology, North-Holland, 1992. G. Thuau, B. Berkane, A unified framework for describing and verifying hardware synchronous sequential systems, Formal Methods in System Design, 2 (1993), pp. 259276. Verilog, AGE/SAGA Langage LUSTRE, manuel de r~f6rence, tech. rep., VERILOG, 38330 Montbonnot, France, Janvier 1994.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
319
HEURISTICS FOR EVALUATION OF ARRAY EXPRESSIONS ON STATE OF THE ART MASSIVELY PARALLEL MACHINES
V. BOUCHITTI~, P. BOULET, A. DARTE, Y. ROBERT 1 Laboratoire LIP, CNRS (U.R.A. no 1398)
Ecole Normale Supdrieure de Lyon 69364 LYON Cedex 07 vbouchit,pboulet, darte,[email protected],fr
ABSTRACT. This paper deals with the problem of evaluating High Performance Fortran style array expressions on massively paraUel distributed-memory computers (DMPCs). This problem has been addressed by Chatterjee et al. under the strict hypothesis that computations and communications cannot overlap. As such a model appears to be unnecessarily restrictive for modeling state-of-the-art DMPCs, we relax the restriction and allow for si. multaneous computations and communications. This simple modification has a tremendous effect on the complexity of the optimal evaluation of array expressions. We present here some heuristics, which we are able to guarantee in some very important cases in practice, namely for coarse-grain, or fine-grain computations. KEYWORDS. Parallelism, array expressions, distributed memory, communications overlap.
1
INTRODUCTION
We focus in this paper on the evaluation of HPF (High Performance Fortran) style array expressions on massively parallel distributed-memory computers (DMPCs). The difficulty of such an evaluation is to choose the placement of the evaluation order and location of the intermediate results. Chatterjee et a1.[9,3,4] have adressed this problem under the strict hypothesis that computations and communications cannot overlap. However state-of-the-art DMPCs can overlap computations and communications, so we relax this restriction. This 1Supported by the Project C3 of the French Council for Research CNRS, and by the ESPRIT Basic Research Action 6632 ~NANA2~ of the European Economic Community.
320
I/'. Bouchittd et al.
If'\
/: /:'\ A
B
C
E
F
D
Figure 1" a simple expression tree simple modification has a tremendous effect on the complexity of the optimal evaluation of array expressions. 1.1
T h e Rules of the G a m e
In this section we set the basic definitions and ground rules for our work, and illustrate them on a simple example. Our problem is to evaluate a binary expression T. We assume that T is given as a binary tree: commutative or associative rearrangement is not allowed, and there are no common subexpressions. Also, without loss of generality, we assume that there are no unary operators. Hence T can be captured as a locally complete binary tree: all internal nodes have in-degree 2. All nodes but the root have out-degree 1. See in figure 1 the tree corresponding to the expression T = f o ( f l ( f 2 ( A , B ) , f a ( C , D ) ) , f 4 ( E , F ) ) . For a moment we forget about distributed arrays and HPF expressions, i.e. we do not relate the nodes of the tree with data items. Rather, we give an abstract interpretation of our problem as follows: leaves represent locations, while internal nodes represent computations (we also say that such nodes are evaluated). An internal node can be computed if and only if both sons have been evaluated (leaves require no evaluation) and both sons share the same location. In the previous example, we have six leaves - - nodes A, B, C, D, E and F - - and five internal nodes fi, 0 _< i _ 4. If both sons of an internal node are not located at the same place, at least one communication is needed. For instance to evaluate node f2, we can decide to transfer data stored at location A to location B and evaluate node f2 at location B, or vice-versa. But we can also decide to have two communications, for instance from location A to location C, and from location B to location C, then node f2 will be evaluated at location C: this could enable the computation of node fl without communication at location C, provided that node f3 has also been evaluated at location C owing to a communication from leaf D to leaf C. W h a t are the rules of the game for the abstract problem? We have to evaluate the expression T as fast as possible, while respecting the partial order induced by the tree. Communications and computations can occur in parallel. More precisely, we assume that
Heuristics for Evaluation of Array Expressions
321
Table l ' a simple execution one communication and one independent computation can take place simultaneously. We suppose that communications are sequential among themselves and that computations are also sequential among themselves. In some sense, it is like a two-processor scheduling problem, but with one machine devoted to computations and the other to communications. 9 See Table 1 for an example. Here we assume that all communications cost 3 units of time while computation costs are listed in the table. Of course there is a priori no reason for the communication costs to be the same, but this turns out to be an important case in practice, (see section 1.2). We stress some characteristics of the problem" 9 The location where the final result should be available is not specified in the problem. If T is an assignment involving a binary expression, we may be imposed the location of the result. This does not change the problem much, as one can always add one communication at the end. 9 As for the model of evaluation we assume two things: -
At most one computation can occur at a given time-step. This is a natural consequence of the original formulation of the problem in terms of Fortran 90 array expressions (see section 1.2).
-
At most one communication can occur at a given time-step. This hypothesis is more questionable, since it comes from modeling actual DMPCs.
9 The same location can appear several times in the expression. This is not the case in our example, but we could replace leave E by a second occurrence of, say, A. Then the expression can no longer be captured with a tree: instead we use a DAG (Directed Acyclic Graph). 1.2
HPF
Array
Expressions
The original motivation for tile above problem comes from the evaluation of HPF array expressions. Consider again the expression
T = fo(fl(f2(A,B),f3(C,D)),f4(E,F)).
V. Bouchitt# et al.
322
Assume that we have a 2D-torus of processors of size P x P (such as the Intel Paragon), and arrays a, b, c, d, e, f and res of size N by N. Consider the following loop nest: for i = 3 to N - 2 do for j = 7 to N - 5 do res(i,j) = ( ( a ( i - 2 , j + 3) + b ( i , j - 6)) • ( c ( i - 1,j + 3) + d(i + 2,j + 5))) + ( e ( i - 1,j + 2)/I(i,j)) Here we have an assignment, we could easily modify the expression tree to handle it. Suppose we have the following HPF static distribution scheme: CONST N = 100, P = 10 CHDIR$ PROCESSORS PROC(P,P) CHDIR$ TEMPLATE T(N,N) CttDIR$ ALIGN a, b, c, d, e, f, res WITH T CttDIR$ DISTRIBUTE T(CYCLIC,CYCLIC) ONTO PROCESSORS for i = 3 to N - 2 do for j = 7 to N - 5 do res(i,j) = ( ( a ( i - 2,j + 3) + b(i,j - 6)) x ( c ( i - 1,j + 3) + d(i + 2,j + 5))) + ( e ( i - 1,j + 2)/f(i,j)) The distribution says that all arrays are aligned onto the same template, which is distributed along both dimensions in a cyclic fashion. Each processor of the 2D-torus thus holds a section of each array. We can now make the link with the locations A to F which occur in the expression T. Consider for instance node f~:
f~" temp(i,j) = a ( i - 2,j + 3) + b ( i , j - 6),3 < i < N - 2,7 < j < N - 5. The array element a ( i - 2 , j + 3) is stored in processor proca(i,j) = ( i - 2 rood P,j + 3 mod P) while array element b ( i , j - 6 ) i s stored in processor proc~(i mod P , j - 6 mod P). Therefore each element b ( i , j - 6) must be sent according to a distance vector ( - 2 , 9 ) t if we decide to compute the temporary result ternp(i,j) of node f2 in proca(i,j). Let r(u, v) denote the translation of vector (u, v) t. This communication amounts to a global shift along the grid of array b to "align" the origin of array a translated by r ( - 2 , 3) with the origin of array b translated by r ( 0 , - 6 ) (because a and b are aligned onto the same template). As already said, nothing prevents us from evaluating node ]'2 In a different location. We can choose another "origin" and communicate both arrays a and b accordingly. We understand now why we have assumed that there is a single computation that can be done at each time step: this is because all processors operate in parallel on different sections of the distributed arrays involved in the expression. However, we may have several global shifts in parallel, depending upon the communication capabilities of the target machine.
Most current DMPCs are capable of performing at least one communication in parallel with computations, hence our key assumption that communications and computations do overlap.
Heuristicsfor Evaluation of Array Expressions
323
Evaluating communication costs is much harder than for computation costs, and this for many reasons: 9 State-of-the-art DMPCs have dedicated routing facilities, so that the distance of a communication may be less important than the number of conflicts on communication links [5, 7, 15]. 9 Even in a simple case like the tree example, the size of the communications depends upon the distribution of the array and the alignment. Therefore we assume that communication costs are fixed parameters that may be calculated according to some intricate formula, where important parameters include the size of the message, the startup cost, the distance cost and principally the contention cost s. However an important practical case is to assume that these communication times are constant because it would be very complicated to compute an approximation of these times. Indeed, we do not know what algorithms are used for the routing of the messages and therefore we can not guess when there would be contentions or conflicts. So we consider here the case when the communication times are constants (the maximum or an average of real costs). This models particularly well the communication patterns that come from uniform data dependences. Another important practical case is when aU communication costs are smaller than any computation cost (coarse-grain formula). We deal with this case in section 3. 1.3
Paper Organization
The paper is organized as follows: first in Section 2 we survey previous work related to this paper. We propose then in Section 3 a heuristic that is optimal in a restricted but useful case, that of coarse-grain computations and also in a sub-case of fine grain computations. Finally, we give some conclusions in Section 4. 2
S U R V E Y OF P R E V I O U S W O R K
This work fits into some Many people have dealt Unobe [12], Li and Chen Sadayappan [16], Huang
work done about parallel compilers dealing with data placement. with the alignment problem: Anderson and Lain [1], Lukas and [11], Feautrier [8], O'Boyle and Hedayat [13, 14], Ramanujam and and Sadayappan [10] and Darte and Robert [6].
In the field of parallel evaluation of array expressions, Gilbert and Schreiber [9] proposed an optimal algorithm for aligning temporaries in expression trees by characterizing interconnection networks as metric spaces. Their algorithm applies to a class of so-called robust metrics. :More machine dependent experimentations would be necessary to determine an approximation formula that would give the communication time in function of the size of the data and of the communication pattern.
324
V. Bouchittd et al.
Chatterjee et al. [3] extended Gilbert and Schreiber's work in two ways. They reformulated the problem in terms of dynamic programming and proposed a family of algorithms to handle a collection of robust and non-robust metrics. They also allowed non conformable arrays by studying the case of edge-weighted expression trees. Chatterjee et al. [4] extended their previous work in dealing with a complete program and not only an array expression. They propose alignment for all objects in the code, both named variables and compiler generated temporaries. They do not restrict themselves to the "owner-computes" rule. We concentrate our work on array expressions but unlike Chatterjee et al., we do not focus on the shape of the interconnection network and the minimum communication time, but rather on the largest overlapping of the communications by computations. Indeed, we consider another model of modern DMPCs. Actually, most machines are able to overlap computations and communications, and moreover, their interconnection network is based on routers. So, the communication time cannot be derived easily from the layout of the processors but mainly depends upon contentions. 3
HEURISTICS
3.1
Introduction
We propose in this section heuristics to solve the general problem. Fortunately these heuristics give an optimal time in useful practical cases. Here is the notation used in this section:
expresslon-tree 9 is an expression-DAG whose internal nodes form a tree. n represents the number of internal nodes of the expression-tree. ~cale(i) is the time needed to compute the operation of node i. ~com is the time needed to move the data from their position on one node to their position on an other node.
r o o t represents the root node of the expression-tree, i.e. the only node with out-degree 0. Af is the set of the nodes of the expression-tree. s is the set of the leaves of the expression-tree. I is the set of the internal nodes of the expression-tree.
3.2
W h e n All Leaves Have Different Locations
3.~.I Hypotheses We assume here that the leaves are all distinct. We are then in the c ~ e where the expression-tree is in fact a locally complete binary tree.
P r o p e r t y 1 For a locally complete binary tree with n + 1 leaves, there are at least n communications to do.
Heuristics for Evaluation of Array Expressions 3.2.2 B
Lower bound Let = $com+max(
325
6c,lc(i), ( u - 1)$com) + ~fcalc(root) i~z\ ( roo~}
B is a lower bound on the time necessary to compute the expression. The first stage of the execution is a communication to move one of the operands of the first computation to the same location as the other operand, hence the first term Scorn. The last stage is the computation of the root of the tree, hence the $calc(rOot). In between, as the machine can only do one computation at a time, and all the computations are sequentialized, hence the term ~,iez\{roo~} 6calc(i). Likewise, the communications are also sequentialized, hence the ( n - 1)ticom. As computations and communications are assumed be simultaneous, B is a lower bound on the execution time of the expression tree.
3.2.3
The heuristic
T h e e v a l u a t i o n of one n o d e To minimize the number of communications, we compute one intermediate result at the location of one of the two operands it depends upon. So we have one communication and one computation for each node. T h e general strategy We consider a total order on the internal nodes of the tree. We then evaluate the nodes following the given order as soon as possible. Here is the order -~ that we consider: 1. The evaluation is done by depth level, starting with the deepest leaves, once one level has been computed, going up one level. Each level is evaluated from the left node to the right node. 2. The communication direction is always from the left son, except when the right son is a leaf, in which case, the communication direction is from the right son. 3. The communications are executed in the order -,: of execution of the nodes, as soon as possible. T h e o p t i m a l case We assume that either all computation times are lower than the communication time, or that they are all greater than the communication time. In this particular but realistic case (fine grain or coarse grain computation), we are able to describe a set of strategies that give a communication time equal to the lower bound B. T h e o r e m 1 The evaluation proposed above gives an optimal execution time. The proof of theorem 1 can be found in [2].
3.3
W h e n Leaves Can Share the Same Location
3.3.1 The case under study We study here the case when the data are not necessarily all stored at different locations (in terms of processors). But we impose in this case that
326
V. Bouchittd et al.
there is no temporary storage of the data. We mean that a data array stored at location a may be moved to location b for a computation but is not available at location b for another computation. We can thus represent the expression by an expression-tree a.
3.3.2 An algorithm to determine the minimal number of communications We present here an algorithm that computes the minimal number of communications needed to evaluate an expression-tree. The algorithm is built by following the dynamic programming paradigm: it is based on the labeling of each node of the expression-tree by the minimal number of communications needed to compute the subtree of this node and by the list of locations where this number can be reached. We will denote this label (n,,sc) where s~ = {11,12,...,lk}.
/~ Initialization */ f o r e a c h l in leaves(DAG) do label(/)=(O, {l}) enddo f o r e a c h n in internal_nodes(DAG) do label(n)=(O, 0) enddo
/* First Phase */ e x p l o r e the DAG backward from the leaves to the root l e t c be the current node l e t (n2, s2)=label(right_son(c)) if slns2 ~ @ t h e n + n else
+
+ 1,
u
endif endexplore
/* Second Phase */ l e t (nr, sr)=label(root) choose a location l for the root from sr label(root)=(n~, {/}) e x p l o r e the internal nodes of the DAG forward from the root to the leaves l e t c be the current node let (n~,s~)=label(c) l e t (n/, {l/})=label(father(c)) " i f 11 E s~ then choose a location I from s~ label(c)=(n~, {l}) else
3See section 3.1 for the definition of the expression-tree.
Heuristics for Evaluation of Array Expressions
327
label(c)=(nc,s I) endif
endexplore T h e o r e m 2 The algorithm described above gives the minimal number of communications
needed to compute the input expression and a strategy to allocate the intermediate results that uses this minimal number of communications. The proof of this theorem can be found in [2].
3.3.3 The heuristic We consider here a set of algorithms all based on the same scheme. The first assumption is that to compute quickly, we should use as few communications as possible. This is not always optimal. To do this we use an allocation of the intermediate results given by the algorithm of the previous section. This allocation and the structure of the expression-tree induce a partial order <1 on the nodes: iisasonofj =~ i < l j (1) i is a brother of j, i is communicated and not j <1 is transitive
:r
i <1 j
(2) (3)
P r o p e r t y 2 <1 is a partial order. The proof of this property can be found in [2]. From this partial order <1, we build by linear completion a global order -< on the nodes of the DAG. This global order is the order of execution of the nodes. The execution of one node is done in two steps: first, fetch the data needed for the computation (do the needed communications) and second, compute the operation of the node. We consider the leaves as internal nodes with no communication needed and a null computation time. A refinement to this is to choose the global order -~ among the orders that evaluate as the first internal node a node for which no communication is needed. Hence, when it is possible, this choice allows one to recover the first communication. There is a lot of freedom in this algorithm: one can construct many allocations that give the minimal number of communications and many global orders are linear completions of the partial order <1.
The optimal case In the case of coarse grain computations, where for each node, the computation time is greater than the communication time, the above heuristic is optimal and gives an execution time equal to the sum of all computation times, ~i~x Scale(i) when the first communication can be recovered and when it is not possible, the execution time becomes t;com + ~ z ~calc(i). Indeed, the allocation algorithm assures that at most one communication arrives at each node. However, in the fine grain case, the fact that there can be no communication for a node
328
V. Bouchitt~ et al.
invalidates this demonstration. Here is an example: take the DAG from the expression a = fo(.f,(A,B),A(Is(C,C),C)).
The communication cost is 10 for all communications and the computation costs are:
1511
1i01
See figure 2 for the different possible allocations of the data given by the algorithm. Although the executions induced by allocations (3) and (4) are optimal, the ones given by allocations (I) and (2) are not. This proves that our heuristic is not always optimal when we are in the fine grain case. 3.4
Extension
T h e o r e m 3 The heuristic is still optimal in a slightly more general case. When we say that the communication cost must be smaller than all the computation costs, we only use the fact that the slowest communication is faster than the fastest computation. Indeed, all the proofs referenced above are still valid in this slightly more complicated case. 4
CONCLUSION
We have addressed in this paper the problem of evaluating HPF (High Performance Fortran) style array expressions on massively parallel distributed-memory computers. We have considered here that the target machines were able to overlap computations and communications. We have presented a family of heuristics which give optimal results for important cases in practice: small grain computations for expression-trees with all leaves at different places and coarse grain computations for all expression-trees. Future work may include experimentations on real life machines to refine the heuristics and generalization of these techniques to blocks of instructions and global programs. References [1] Jennifer M. Anderson and Monica S. Lain. Global optimizations for parallelism and locality on scalable parallel machines. A CM Sigplan Notices, 28(6):112-125, June 1993. [2] Vincent Bouchitt~, Pierre Boulet, Alain Darte, and Yves Robert. Evaluating array expressions on massively parallel machines with communication/computation overlap. Technical Report 94-10, Laboratoire de l'Informatique du Parall~lisme, march 1994. [3] S. Chatterjee, J. R. Gilbert, D. S. Schreiber, and S.-H. Tseng. Optimal evaluation of array expressions on massively parallel machines. In Proceedings of the Second Workshop on Languages, Compilers, and Runtime Environments for Distributed Memory Multiprocessors, Boulder, CO, October 1992. [4] S. Chatterjee, J. R. Gilbert, D. S. Schreiber, and S.-H. Tseng. Automatic array alignment in data-parallel programs. In ACM Press, editor, Twentieth Annual A CM
Heuristicsfor Evaluation of Array Expressions
Figure 2: different allocations for expression G
329
330
V. Bouchittd et al. SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 1628, Charleston, South Carolina, January 1993.
[5] M. Cosnard and F. Desprez. Quelques Architectures de Nouvelles Machines. In Ecole d'Automne CAPA - Port d'AIbret. Hermes, 1994. To be published. [6] Alain Darte and Yves Robert. A graph-theoretic approach to the alignment problem. Technical Report 93-20, Laboratoire de l'Informatique du Parallfilisme, Ecole Normale Sup6rieure de Lyon, July 1993. [7] R. Esser and R. Knetcht. Intel Paragon XP/S - Architecture and Software Environment. Technical Report KFA-ZAM-IB-9305, Zentralinstitut fur Angewandte Mathematik- Forschungszentrum Julich, April 1993. [8] Paul Feautrier. Towards automatic distribution. Technical Report 92-95, Laboratoire MASI, Institut Blaise Pascal, Paris, December 1992. [9] J. R. Gilbert and D. S. Schreiber. Optimal expression evaluation for data parallel architectures. Journal of Parallel and Distributed Computing, 13(1):58-64, September 1991. [10] C.H. Huang and P. Sadayappan. Communication-free hyperplane partitioning of nested loops. In Banerjee, Gelernter, Nicolau, and Padua, editors, Languages and Compilers for Parallel Computing, volume 589 of Lecture Notes in Computer Science, pages 186200. Springer Verlag, 1991. [11] Jingke Li and Marina Chen. The data alignment phase in compiling programs for distributed memory machines. Journal Parallel Distributed Computing, 13:213-221, 1991. [12] Jaon D. Lukaz and Kathleen Knobe. Data optimization and its effect on communication costs in MIMD fortran code. In Dongarra, Kennedy, Messina, Sorensen, and Voigt, editors, Fifth SIAM Conference on Parallel Processing for Scientific Computing, pages 478-483. SIAM Press, 1992. [13] Michael O'Boyle. Program and Data Transformations for Efficient Execution on Distributed Memory Architectures. PhD thesis, University of Manchester, January 1992. [14] Michael O'Boyle and G.A. Hedayat. Data alignment: Transformations to reduce communications on distributed memory architectures. In Scalable High-performance Computing Conference SHPCC-92, pages 366-371. IEEE Computer Society Press, 1992. [15] J. Palmer and G. L. Steele. Connection Machine Model CM-5 Overview. In H.J. Siegel, editor, The Fourth Symposium on the Frontiers of Massively Parallel Computation, pages 474-483. IEEE Computer Society Press, 1992. [16] J. Ramanujam and P. Sadayappan. Compile-time techniques for data distribution in distributed memory machines. IEEE Trans. Parallel Distributed Systems, 2(4):472482, 1991.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
ON FACTORS LIMITING THE GENERATION COMPILER-PARALLELIZED PROGRAMS
331
OF EFFICIENT
M. R. WERTH and P. FEAUTttIER
PRISM Laboratory University of Versailles, 45, avenue des ~tats.Unis 78035 Versailles Cedez, France { Mourad.Raji. Werth, Paul.Feautrier} @prism.uvsq.fr
ABSTRACT. In this paper we discuss two factors which may limit the performance of parallel programs: 1. Rate of inactivity, where within a virtual processors grid a number of processors remain idle during the program execution. 2. Overheads due to a communications optimization step. These factors were first observed while testing the performance of programs generated by a paraUelizing compiler. KEYWORDS. ParaUelizing compilers, inactivity rate, unimodular transformations, optimization of communications.
1
INTRODUCTION
Our paraUelization method starts with a program written in a conventional language -e.g. Fortran-, then produces the parallel form of the program using a systematic transformation
scheme. This form is then adapted to generate code for a given parallel architecture. The underlying idea of our systematic transformation scheme is to retrieve a 8pace.time mapping of the program (a similar approach is used by the systolization community [i]), starting with a powerful tool, the Data Flow Graph [2] which provides the exact source of every value manipulated in the program, i.e. the exact instruction and the corresponding iteration (in the presence of a loop nest) that produces the value. The schedule [3] and the placement .function[4] of the program instructions are then derived. They are linear functions in terms of the surrounding loop indexes. The schedule expresses the date at which a given iteration must be scheduled, the placement function provides the coordinates of the
M.R. Werth and P. Feautrier
332
virtual processor-within a multi-dimensional grid- that will execute the corresponding computation. In a subsequent step the program is expanded in order to obtain its single assignment form and the new loop nests are constructed. The logical time and the grid axes represent the new indexes. Each original loop nest is embodied by a multi-dimensional virtual processors grid. The generated program contains a single global sequential loop over the time having an entirely parallel body [5]. Unfortunately, the resulting body incorporates several overheads that limit the performances of the generated code. These overheads are due to the presence of instruction guards and conditional operands and to the high cost of communications. These problems are dealt with in [6] and [7] respectively. Our experiments have shown that the performances of the generated programs (even though optimized) can be limited by several factors. In this paper we address two of them. In section 2, we present a simple example in order to illustrate our purposes. The first factor, which is related to the choice of the transformation applied to the sequential program, concerns the rate of inactivity of the virtual processors within the grid; we show that unimodular transformations are not always the best choice. This topic is treated in section 3. The second factor is a side effect of the optimization of communications. In fact, when g e t s are transformed into sends, the set of the sending processors must be delimited. This operation may, in some cases, induce an important overhead that considerably reduces the performances of the optimized program. Section 4 deals with that issue. In section 5, we present a discussion of possible solutions to the above-mentioned problems as well as our conclusions.
2
AN EXAMPLE
In order to illustrate our purposes, we use the Lamport loop as an example : doi=
I, 1
do j = 2, n-I do k - 2, n-I
a(j, k) = 0.25 *(a(j, k-l)§
k§247
k)§247
k))
end do end do end do
The Lamport loop is an example of a relaxation algorithm where an element of the matrix is determined by computing the average of its four neighbors. This computation is included in a global loop (the one in i from 1 to 1). This makes the dependency issue a bit more complex 'because the newly computed element must be considered for the next computations. Figure 1 illustrates the dependencies between the program iterations. The timing function of the computing instruction is t = 2i + j + k - 6. Thus, for any given time value, the iterations belonging to the hyperplane of equation 2i + j + k - 6 - t = 0 are entirely parallel.
Efficient Compiler-parallelized Programs
333
Figure 1' Dependencies between iterations of the Lamport loop example
In order to obtain a unimodular transformation, we have chosen the placement func-
ti~176176
(q~
= (i+kJ)
(see[5]f~176
transformation is thus '
T(i,j,k) =
qo q~
=
i+j k
the new loop nest is then indexed by t, qo and ql whose bounds are obtained by applying the Fourier-Motzkin algorithm and are given as follows' 1 < t < 21 + 2 n - 8
max(it + 8 - n I , 3 , t + 6 - l - n ) < _ q o < _ m i n ( [n + 2t +.......2 J , t + 2 , 1 + n - 1 ) max(2,t + 7 - 2q0,t + 5 - q 0 - I) < ql _< m i n ( n - 1, n + t + 4 - 2q0,t + 4 - q0) The outermost loop is sequential, the two others are entirely parallel and are incarnated by a 2-dimensional virtual processors (VPs) grid indexed by (q0, ql) and of size 2 n , n.
3
RATE OF INACTIVITY
At any given iteration of the loop on t, only those processors verifying the above constraints on q0 and ql may be active. In substance, these inequations form a parametric polyhedron (where parameters are t, n and l) of which it is not possible to have a visual representation. Therefore and in order to understand the behavior of the parallel program, we shall study the shape of this polyhedra within two time intervals which will enable us to get a visual representation of it. This translates to expressing the bounds of q0 and ql in their simplest form. For simplicity's sake, we assume that I = n. Let's begin with the time interval 90 _< t _< n - 2. Within this interval we have (see [8]): 3<_qo___t+2
M.R. Werth and P. Feautrier
334
Figure 2: R a t e of inactivity for t < n - 2
and
{
2 t + 7 - 2qo
if qo>_t~2~} < q l < t + 4 _ q elsewhere -
~
Figure 2 shows the shape of the active V P s ' domain in terms of t when 0 < t _< n - 2. Notice t h a t t h e r a t e of activity of the the VPs is very low. More t h a n 43- of the processors are idle 1 . As for n - 2 < t < 2n - 5, the bounds of q0 and ql are expressed as follows (see [8]):
t+8-n
_
2
n+t+2
2
and
{ 2
t + 7 - 2q0
_ if qo> elsewhere
} ~
{
.
n-1
if t+8"n
qo
Figure 3 shows the shape of the active V P s ' domain. T h e r a t e of activity is higher t h a n 1This is a rough approximation. If we assume that the number of integer-coordinates points of a triangle is equal to its area (which is asymptotically acceptable), the number of active processors (represented by the dark surface) is equal to : 1(t-1) (t-l) ~ (.-Z-l) ~ .~ (t 2 2 4 4 - 4 rt ~ Given that, the total number of processors is 2n 2 = 8-~--, more than 7 out of 8 processors are idle.
Efficient Compiler-parallelizedPrograms
335
Figure 3: Rate of inactivity for n - 2 _< t < 2 n - 5 the previous case, but it remains quite low 2. We stop our study at this point, suffice it to note that the VPs' activity rate remains low for the other cases For 2n - 5 < t < 3n - 4 we still have t+s-n < q0 < n§ and for 3n - 3 _< t _ 4n - 8 we have t + 6 - 2n _< q0 _< 2n - 1. In both cases the m a x i m u m number of active processors along with axis qo is less than n - 3. "
-
-
2
-
-
2
This obviously limits the performance of the generated program, especially for SIMD virtualized systems like the CM-2(200). Generally speaking, this is a problematic issue. Let's assume, for example, t h a t T is the number of operations for a given program and l the latency (the m a ~ m u m value of time). In order to obtain an optimal parallel execution, one may assure that T virtual processors are simultaneously active in each time step; the inactivity rate would thus be zero. This means t h a t the number of operations that have to be executed is the same for each time step, which is generally not the case. The trouble is that only a small number of operations are carried out at a given time, while there is a large number of available VPs. Knowing that these VPs have to be m a p p e d to the physical processors of the computer, including those VPs to which no computation has been assigned (because of the SIMD architecture of our target machine), it is clear that a high inactivity rate leads inevitably to poor performance. n 2 2Asymptotically, the dark surface doesn't exceed -y. It is in fact less than the area of the including n~ n~ parallelogram which is itself equal to n 2 - 2(-~-) = "T (the area of the including square minus tile surface of the two other triangles), which leads to a rate of inactivity greater than 88
336
M.R. Werth and P. Feautrier
We have replaced the placement function by (j,k) in order to obtain a smaller VP grid (n 2 instead of 2n2). This led to a non-unimodular transformation, but the resulting parallel program ran more than three times faster than the previous one. This result leads to the conclusion that unimodular transformations are not always the best choice. A number of authors have shown how to reduce non-unimodular transformation overheads.(see [10], for example) 4
A SIDE EFFECT OF THE OPTIMIZATION
OF C O M M U N I C A T I O N S
It is well known that communication time is one of the most limiting factors for obtaining high performance programs for parallel architectures. In a parallel program generated by our paralleUzation method, data has to be got from the processor where it is stored, when it is not local. This is most easily accomplished by using the g e t primitive. This solution has the advantage of being natural and easy to generate automatically 3, but unfortunately it is inefficient. Indeed, the source processor pointed out by the g e t instruction can be entirely arbitrary giving that all active processors execute the get at the same time and in parallel. This might lead to conflicts on the Communication links during the forwarding of the messages. Moreover, in some machines (such as the CM-2(200)) the target processor has first to send its address to the source processor in order to allow the latter to send it the desired datum. As a consequence, g e t s are twice as expensive as sends. In [7] we address the issue of communications optimization by proposing ways to replace the costly g e t communication schema by a more efficient equivalent one. This amounts to a decomposition of the get operation into one or more elementary operations. Our goal is to detect the most economical decomposition. We proposed an approach based on linear algebra. Two tracks have been explored : 1. Transforming g e t s into sends. As a result if this transformation, communications called spreads (or b r o a d c a s t s ) , in which all or some of the active processors asking for the same data, are detected. These communications are efficiently implemented on most architectures. 2. Detecting regular communications between neighbor processors called shift 4 communications. Owing to the regular communication pattern involved, these communications are also very efficiently implemented. Transforming g e t s into sends amounts to designating a set of processors within the sending grid that will be in charge to send data to a set of processors within the receiving grid. In order to achieve this one has to designate which of the processors belonging to the sending grid might be active. This operation, which is detailed in [7], may in certain cases involve supplementary tests in order to insure that the receiving processors have actually integer coordinates. 3one can find the same choice in other works, see [9] for example 4they are called NEWScommunications too.
Efficient Compiler-parallelized Programs
337
For instance, in the parallel version of the Lamport loop example, we find the following get operation 9 (qo, ql) '
(2q0 + ql - 5 - t, ql - 1)
which is transformed into a send as follows :
(q0, ql)
) (qo--ql+t+4 2 ,ql+l)
consequently requiring the addition of the test: modulo(q0 - ql + t + 4, 2)= 0. This happens 4 times in the Lamport loop and results in the communications-optimized program being about three times slower than the original one ! This loss of performance is chiefly due to the fact that the modulo function is extremely expensive (60 to 75 times a floating point addition) on our target machine (the CM-2(200)). This overhead is likely to be smaller on other architectures.
5
DISCUSSION AND CONCLUSIONS
In this paper, we have presented two limiting-performance factors that we have encountered during our experiments. The first one concerns the rate of inactivity of the virtual processors. As a consequence of the so-called "VP looping" mechanism in SIMD-virtualized machines, which leads to the evaluation of expressions on every VP, even on those ones which normally don't have to participate in the computation, the performance of programs with a high inactivity rate is dramatically affected. A number of authors from the "Systolic Arrays Synthesis" community (see for example [11] [12] [13]) have proposed solutions in the context of partitioning/clustering that we aim to exploit and adapt to our parallelization context in order to minimize the number of virtual processors. We also have shown that unimodular transformations are not always the best choice. On the other hand, the shape of the wavefronts (sets of iterations to be executed in parallel for a given value of t) is generally not rectangular. The wavefronts are rectangular when the schedule and each component of the placement function is parallel to one of the iteration domain axes. This corresponds to the particular case where the schedule and the components of the placement function take the particular linear form where only one factor is not zero. Unfortunately, the great majority of current machines allow the use of rectangular grids only. This leads to implementing the non-rectangular wavefronts in a rectangular grid which requires making the extra processors idle. In our future work we plan to submit our method to experimentation on a MIMD architecture. We expect the run-time system of such machines to be more efficient in dealing the inactivity problem than with SIMD machines. Due to the fact that physical processors are not obliged to work synchronously, i.e they don't have to execute the same instruction at the same time, the system may be able to manage more properly the virtualization so as to efficiently distribute the actual computations. The second factor is a side effect of the optimization of communications. When turning g e t s into sends, one has to assure that only processors involved in the communication have
338
M.R. Werth and P. Feautrier
to send their data. This operation may induce a great overhead in some cases. Further investigation will be needed to determine whether the generation of a space-time mapping is possible when taking this issue into account.
Acknowledgements We are grateful to our friend Andreas L6tter for proofreading the manuscript.
References
[1]
P. Quinton. Automatic Synthesis of Systolic Arrays from Uniform Recurrence Equations. IEEE, Int. Syrup. on Computer Architecture, pp 208-214, 1984.
[21
P. Feautrier. Datafiow Analysis of Scalar and Array References. Int. J. of Parallel Programming 20(1), pp 23-53, 1991.
[31
P. Feautrier. Some Efficient Solutions to the A/line Scheduling Problem I, One Dimensional Time. Int. J. of Parallel Programming 21(5) pp 313-348, 1992.
[4]
P. Feautrier. Toward Automatic Partitioning of Arrays for Distributed Memory Computers. Seventh ACM Int. Conf. on Supercomputing, pp 175-184, Tokyo 1993.
[5]
M. R. Werth & P. Feautrier. On Parallel Program Generation for Massively Parallel Architectures. High Performance Computing II, M. Durand and F. El Dabaghi editors. North-Holland, pp 115-126, Oct. 92.
[61
M. R. Werth and J. Zahorjan and P. Feautrier. Using Compile-Time Conditional Analysis to Improve the Performance of Compiler Parallelized Programs. Proc. Int. Conference on Massively Parallel Processing, Applications and Development, Elsevier Science Publisher, 1994.
[71
M. R. Werth & P. Feautrier. Optimizing Communications by Using Compile Time Analysis. CONPAR'g4, Linz-Austria, LNCS, Springer-Verlag, Sept. 1994.
[sl
Mourad Raji Werth. G6n6ration Syst6matique de Programmes Parall~les pour Architectures Massivement ParaU~les. Ph.D Thesis, Universitd P. et M. Curie, Nov. 93.
[9]
Luc Boug6. The Data-Parallel Programming Model: a Semantic Perspective. Tech. Report No LIP-IMAG 92-45, 1992.
[10]
M. Barnett and C. Lengauer. Unimodularity considered non-essential. Proc. CONPAR 92- VAPP V, LNCS 634, Springer Verlag, pp 659-664, 1992.
[11] P. Clauss and C. Mongenet and G.-R. Perrin. Calculus of space-optimal Mappings of Systolic Algorithms on Processor Arrays. International Conference on Application Specific Array Processors, IEEE Computer Society, pp 591-602, 1990.
[121
J.Bu, E.Deprettere, P.Dewilde, A design methodology for fixed-size systolic arrays, in Application-specifc array processors (S.Y.Kung, E.Swartzlander, J.Fortes eds.), IEEE computer society Press, pp.591-602, 1990.
Efficient Compiler-parallelized Programs
339
[13] J.Teich, L.Thiele, Partitioning of processor arrays" a piecewise regular approach, b~tegration: The VLSI Journal, 1992.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
341
FROM DEPENDENCE ANALYSIS TO COMMUNICATION GENERATION: THE "LOOK FORWARDS" MODEL
CODE
Ch. REFFAY
G.-tt. PERRIN
LIB, Universitg de Franche-Comtg 16, Route de Gray F-~5030 Besanfon reffay@comte, univ-fcomte.fr
ICPS, Universitg Louis Pasteur Boulevard Sgbastien Brant F-67400 Illkirch perrin @icps.u-strasbg,fr
ABSTRACT. This paper is devoted to code generation from paraUelization techniques. It focuses on the communications. The so called "look forwards" dependence modeling is recalled, and we show that it yields the determination of multiple uses of a data and either its broadcast or its local storage into virtual processors. KEYWORDS. Affine recurrence equations, code generation, communications. 1
INTRODUCTION
Thanks to the development of loop parallelization and systolic array synthesis techniques from recurrence equations, the last decade saw the first prototypes of "paralhlizing compilers" as in [1]. These techniques are founded on a dependence analysis, the definition of a schedule function which meets the dependences and the definition of an allocation function which maps the computations into a virtual architecture. Their objective is to transform the initial iteration space of an algorithm in a space.time domain. As compiling techniques, they mean the rewriting of the source code in order to emphasize a sequential loop to spend the time and a parallel one to describe the array of virtual processors currently used, either in a SIMD programming model or in a SPMD one. In this paper, we briefly recall the now well mastered techniques [10, 14] based on linear transformations of polytopes and their scanning with loop nests. To complete the code generation, communications between virtual processors have to be synthesized from the dependence analysis : this can be achieved by applying the same linear transformation as the one going from the initial iteration space to the space-time domain. However, such a translation can present some insufficiencies and the dependence model~ ing must capture the whole information needed to generate communication code such as broadcast instructions, in order to perform it with the intrinsic e~dency of a data~
342
C. Reffay and G.-R. Perrin
programming model for example. This enhances the interest of defining an appropriate dependence modeling and of relevant parallel code generation techniques from systems of atrme recurrence equations. So, this paper is organized as follows. Section 1 is devoted to the dependence modeling from a system of atrme recurrence equations [8]. The second section recalls classical results on space-time transformation and code generation. The third one emphasizes the problem of communication. The Cholesky factorization illustrates all the previous points. 2
RECURRENCE
EQUATIONS AND DEPENDENCE
MODELING
D e f i n i t i o n 1 A system of recurrence equations is a set of equations of the form 9 X [ ~ = f~(Y[PE,a(~I, . . ., Y[PE,t,s(z~]) where X is a variable indezed by n integer components of a vector ~ scanning a bounded polyhedron C~ C ~ , and fE is a function with a constant computation time. In the following, PE,i are supposed to be affme functions 9 PE,i(z~ = ME,i" z + eE,i where Mf,i is a n • n matrix and EE,i a constant vector of 7Z". This means that the variables are fully indexed. We define the iteration domain C C ~ " as the union of sub-domains C~r. E x a m p l e 1 Choleskyfactorization. Let A be a n • n symetric semi.positive matriz. The Cholesky factorization computes a lower triangular matriz L so that A = L. L t. It is defined by the following system of affine recurrence equations 9 L(i,j,k) L(i,j,k) L(i,j,k)
= L ( i , j , k - 1 ) - L(i,k + 1, k) • L ( j , k + 1, k) 2
(1) (2) (3)
where initialization of variable L is L ( i , j , - 1 ) = A f t , 1 <_ i < n, 1 <_ j < i, and the resulting matriz L is defined as Lij = L ( i , j , j - 1), 1 < i < n, 1 < j < i. Every computation X[~ is indexed by a n-vector ~'. So, if the computation of X[z-] uses Y[zo], we say that a causal dependence exists from Z'o to ~"and we represent it by a n-vector d = ~-~o. Every equation (E) of the system induces then kE dependence vectors d~,i defined as"
dE,i = ~'- pE,i(z.') = ( [ - ME,i)" z - cg,i
Vi E '[1,...,kE}
In the particular case of a uniform dependence, dE,i(~ is a constant vector, (M = 1 and dE,i(~ = -~'E,i) whereas in the case of an affme one, it depends on ~'. Subsequently, in the atfme case, a result Y[z'0] can be used more than once (see figure 1 for the Cholesky factorization) 9 the set of points ~"E CE such that X[z-] uses Y[~'0] according to dependence d~,iis called a utilisation set and denoted as UtilE,i(Y[7.o]). Definition 2 The utilization set UtilE,i(Y[~o]) associated with an occurrence Y[~'o] and according to a dependence dE,i of an equation (E) is the convez polyhedron defined
Dependence Analysis to Communication Code Generation
343
Figure 1" Dependence vectors dl.s
as"
UtilE,~(Y[s
= {~ E CE I PE,i(s = zo}
If UtilE,i(Y[~o]) has dimension he, a basis of this utilization set, i.e. a basis of Ker(Me,i) contains nu vectors called utilization vectors, denoted as 6E,i,1,..., uE,i,n~,. Since a utilization set is a convex bounded polyhedron, the set of dependence vectors having their origins at point Y[~'0] and their extremity in UtilE,i(Y[Y.o]) form a cone, called dependence cone. Its eztremal vectors are denoted as dE,i,effil,dE,i,effi2, etc. So, dependence vectors, eztremal vectors and utilization ones characterize a utilization set. To complete the dependence modeling, we introduce the concept of emission set as the lattice containing all points F0 which originate some dependence vector. Definition 3 An emission set associated with the dependence dE,i, denoted as EmitE,i is
the lattice EmitE,i=
PE,i(CE).
Informally, this dependence modeling, called "look forwards", considers dependence vectors from a point pE,i(~) sending Y[pE,i(z')] towards points ~" using this value. This model is presented in full details, and compared with the correlated literature, in [8]. So, at the code generation step, for a given computation point z'0, it allows to define the set of computation points using the result computed in F0 and then the set of processors involved in the communication of this result. This information is necessary to detect broadcasts, or to generate the communication code by using send primitives. Conversely, the "look backwards" model generates communications by defining the location of required data for any computation. It means that the primitives involved in the generated code are get requests in spite of the cost of such protocols. Moreover, this model does not contain intrinsically the necessary information to detect broadcasts.
3
SPACE-TIME TRANSFORMATION
An affme schedule function t so that V~"E C, t(~') = ~.~'+a can be defined from dependence analysis. This function associates an instant t(~') E 77, with every point ~', and defines a partial order in the computation space : ~ is normal to the timing hyperplanes [6]. Various studies are concerned with this function : for example, the existence of a linear schedule and possible transformations of equations [9], or the research of an optimal linear schedule in the
C. Reffay and G.-R. Perrin
344
case of uniform recurrence equations [3, 5]. A linear allocation function a is generally defined as a projection of the initial domain along a direction ~, called the allocation direction. Both allocation function and schedule define a linear application T that associates a point z; of the space-time domain with every point ~"of the initial iteration domain. Notice that since two simultaneous calculations can not be allocated to the same processor, matrix T is necessarily invertible. When such a space-time domain is obtained, the code generation consists in defining the nested loops scanning this domain either in a SIMD mode or a SPMD one" SIMD mode
SPMD mode
For t = to to tend Do Fora11 activ processor p(t) Do treat(t,p)
Forall processor p Do For t = to(p) to tend(p) Do treat(t,p)
End Do End Voall
End Doall End Do
Loop bounds can be calculated either by the Fourier-Motzkin algorithm [12], or by PIP [4], or else by the method proposed in [13] using the polyhedral library developped at IRISA. The problem is more complex when the transformation T is non-unimodular : the initial domain is transformed into a lattice indeed. A method based on the Hermite decomposition of T is given in [10, 11, 14]. It results in a matrix product H U where U is unimodular and H triangular inferior. Diagonal entries of H give the increment steps for nested loops scanning the lattice. Moreover, this method calculates exact loop bounds. Figure 2 describes geometrical transformations applied on the iteration domain for the Cholesky factorization. The final domain is represented with black points (actual images of the initial domain), whereas rings represent holes of the resulting lattice. E x a m p l e 2 Cholesky yactorization 9 a non-unimodular transformation. The initial iteration domain is defined by this algebraic relation 9
001
-1 1
0 -1
0 0
0
1
-1
0
x
n 0
+
-. > 0
1
From these following scheduling vector u and allocation direction ~ 9
(1)
~:
(1)
1
~ =
1
1
1
one can derive a transformation matrix T, whose first raw is y t and the rest is a matriz a obtained by the Blankinship method [2] applied on vector ~ 9 T =
I - 11 01 1
1 1
0
-1
and
det(T) = 3
Dependence Analysis to Communication Code Generation
345
Figure 2: Space-time transformation for the Cholesky factorization
This leads to a non-unimodular transformation of the initial domain. The Hermite decomposition of T is 9 T
=
H x U
=
(100)(11 1) 1 1 0 1
2
x
3
0 -2 0
1
-1 0
Let ( i l , j l , k l ) = ~ = Us By substituting U-lzl for ~ in the algebraic relation, we obtain the following system of constraints 9 I
- jl -- il
-- j l
il
+ jl jl
-2
kI -- k l
Jr n
-~3 ~1
--1
)_ 0 ) 0
~ 0 > 0
The Fourier-Motzkin algorithm determines the loop bounds of the nest scanning this inter. mediate polyhedron. The reduced system of constraints is " 2 -il [1-~B-3]
_~ Zl <_ jl ~_ k,
_<: 3 n - 1 -3il + 3 n - 1
<-rain(-2, t , ~_ rain ([-2-~], - i l -
1) jl + n)
This intermediate polyhedron is presented in figure ~ for n = 5. The transformation H can then be applied on this domain. Let (t, al,a2) = H ( i l , j l , k l ) and (Yl,Y2,Y3) = H-l(t, al,a2) 9
C. Reffay and G.-R. Perrin
346
y~ Y3
(i 0 0)() ( t )
=
-1 1 1/3-2/3
0 1/3
al
=
- t + al
t-2a,+a~ 3
a2
The resulting loop nest can be obtained by using the method proposed in [11]. By defining loop bounds in the basis (Yl, y2, ys ) " l ilma= = 3 n - 1 31,na= = rain/'-2 [ -sYt +3n-1 I~
ilmin -- 2 ff l rnin ---- - - Y l
~,
'L
J/
~
we obtain 9
{,.,.
al,,i,~
_-2 =0
{,.o. a~..=
a2,,,,~
= - t -I- 2al -I- 3 [l+~-a' 1
(
L
J)
= t "F rain - 2 , -3t +an-1 al,~,= = - t + 2al -I- 3 rain (t~:~J, -a~ + n)
These bounds and the diagonal of H (whose values are increments for each loop indez) determine the following loop nest, that covers the final domain on figure ~. Fort=2to3n-lStep
0 F o r a l l a2
1 Do J
L
l
Do
+ 2 a l + 3 [ l + ? a , ] to -t-l- 2al -1-3rnin ([Lr~J, - a l -l- n) Step 3 Do w
w
L[U-1H-l(t, al,a~)] = I(...L[p(U-1H-l(t, al,a2))]...) End Doall End Doall End Do The loop nest obtained by this technique scans the space-time domain according to the choice of schedule and allocation. Inside this nest, computations of the resulting values X [ U - 1 H - l d ] , using the various Yk[pk(U-1H-ld)] as data, must take place. In case of several equations, it is first necessary to determine the equation associated with the considered resulting values X[U -1H -1 all. This can be implemented by using a selector according to the value of z-~ i.e. (t, al, a2) for the example. Moreover, these calculations use values Yk[pk(U -I H -~ d)] which can be results calculated by others processors. So, it is now necessary to synthesize communications which transfer results calculated by processors and possibly used by other ones.
4
GENERATING
COMMUNICATION
CODE
The loop nest generated by these techniques scans points z ~ of the transformed domain T(C). In this section, we focus on the code generation for the acquisition of data needed for a calculation, the calculation itself defined by the associated equation (E), the propagation of some data, and the emission of the result to virtual processors which use it. Our "look forwards" dependence model explicitly defines results emissions. One can then take advantage of functions, such as put and broadcast, which are intrinsic to some languages. They can be generated according to the sort of emission :
Dependence Analysis to Communication Code Generation
347
9 either local memory management, if the data or some result of a calculation are used on the same processor, 9 or broadcast, if a set of processors use the same item simultaneously, 9 or else simple point to point communication. For sake of simplicity, we consider in the following that the dimension of a utilization set UtilE,i(Y[~o]) is at most one. For this reason, we call it a utilization segment. Considering a schedule vector ~ and an allocation direction ~, we have the following properties. 4.1
Sort of Communication
P r o p e r t y 1 A broadcast is needed if all computation points of the utilization segment UtiIE,i(Y[~o]) are scheduled at the same instant, i.e. if u $E,i = 0 A ~F~,i ~ O. When computation points of a utilization segment are scheduled at different instants, we consider (for convenience) that t(~o + ffE,i,~=t) < t(~o + dE,i,~=2) where d'E,i,e=~ and dE,i,effi2 are extremal vectors. Any communicated value then has the following behaviour ' 1. the item Y[z~] is calculated by the virtual processor ~(z~), .#
2. this value is communicated from ~r(z~) to virtual processor ~(z~ + forms the first computation point belonging to UtiIE,i(Y[~o]),
dE,i,e|
that per-
3. each virtual processor p performing a computation point in Utils,i(Y[~o]),but the last one, must propagate Y[z~] to the nezt virtual processor 9 p~ = p + ~(~/~,i). This behaviour is generated by communication code depending on z~ and z~ -{-d~,i,exland on the fact that all points of UtiIE,i(Y[~o])are allocated to the same virtual processor or not. Our model can take into account these different cases by using simple algebraic properties 2 and 3. P r o p e r t y 2 If there is no broadcast, the transmission of any value Y[z~] from the virtual processor which calculates it to the first virtual processor which uses it, is implemented by a local memory management if and only if r = O. P r o p e r t y 3 If there is no broadcast, the propagation of any value Y[z~] from some virtual processor p which did use this value, to another which uses this value too, is implemented by a local memory management if and only if r - 0 ^ uE,i ~ O.
4.2
Data Acquisition and Calculation
These last properties allow to synthesize the sort of data acquisition. However, notice that information required for the code generation for data acquisition is intrinsically contained in the equations. Therefore, there is no particular problem in this part. Thus, instructions defining the acquisition code can be generated by the following algoritlun 9
348
C. Reffay and G.-R. Perrin
f o r i = 1 to ~E do generate("If dE,i(~ = dE,i,e~l Then") if ft. dE,i,e.t- 0 then generate("argE,i = bufferE,i[...]") else generate("receive(argE,i)") end if generate ("Els e" ) i f ~(uE,i) = 0 ^ uE,i ~ 6 then generat, e("argE,i = localE,i") else generate("receive(argE,i)") end i f g e n e r a t e ("End If") end do where we introduce two kinds of buffers, according to the fact wether the value is calculated or only transmitted by the considered virtual processor. A buffer is called bufferE,i[...] in the case of transmission and localE,i in the case of propagation. E x a m p l e 3 Choleskyfactorization. For short, we discuss the method only for the routine Equation1, that presents all interesting characteristics concerning the code generation for data acquisition. The application of the preceeding algorithm on this ezample generates the following code for the arguments acquisition of routine Equation1 9 I f (0, 0,1) = (0, 0,1) Then r e c e i v e ( a r g l ) Else argx = local1,1 End I f I f ( 0 , ] - k - 1 , 0 ) = (0,1,0) Then receive(arg2) Else receive(arg2) End I f I f (i - j , j - k - 1,0) = ( 0 , ] - k - 1,0) Then receive(arg3) Else receive(arg3) / L = a r g l - arg2 • args This generated code can be optimized as follows, by using a simple syntactic processing 9 receive(argl) receive(arg2) receive(arg3) L = arglarg2 x arg3
4.3
D a t a P r o p a g a t i o n t h r o u g h Utilization S e g m e n t
In the case of broadcast on a utilization segment, the problem of propagation does not hold. In case of no broadcast (guaranted by the condition : ~. ~E,i ~ 0), for each computation point in the utilization segment but the last (guaranted by the test" dE,i(~') ~ dz,i,effi2), the set of instructions that propagate Y[z~] from this point to its successor in the utilization segment can be generated in the code of routine EquationE associated with equation (E), by the following algorithm: f o r i = 11;o kEdO g e n e r a t e ( " I f dE,i(O ~ dE,i,e~,) Then") i f ~(uE,i) = 0 then generate("localz,i = argE,i") e l s e generate("send(argE,i) to ~(~+ end i f generate ("End If") end if end do
5E,i)")
Dependence Analysis to Communication Code Generation
349
E x a m p l e 4 Cholesky factorization. Considering an emitting point z0 = (i0,j0, k0), utilization segments associated with dependences of equation (1) are the followings : Utill.l(L[s = Utill.s(L[~o]) = Utill.3(L[~o]) =
{(io,jo,ko + 1)} {(io,j,jo- 1) E ~e {(i, io, jo - 1)E
I jo+l<j
The generated code propagation of data inside utilization segments Utill,s(L[~o]) and Utill,3(L[~o]) /or routine Equation1 is then : Ifj~iThen send(art2) go (al - l, as) End If If i ~ n Then send(art3) go (el + 1, as + 1) End If ..#
Because dl.1 is a uniform dependence, Utill.l(L[Y,o]) contains only one point. Therefore no propagation processing is needed.
4.4
Code G e n e r a t i o n for Emissions
In order to fullyuse the communications primitives,the generated code for emissions has to distinguish cases of local memory management, point to point communication and broadcast. The code generated by the following algorithm defines the "work" a point in CE is supposed to do for all emission sets EmitE,j such that EmitE, i n CE ~ 0 " generate("If ~ E EmitE,,i Then") if ~. ~m',i = 0 then g e n e r a t e ( " b r o a d c a s t ( X ) go cr(UtilE, i(X[z']))") e l s e if ~(dE',i,e~t) = 6 then generate("bufferE,,i[...] = X " ) else generate("send(X) go p + r end i f end i f generate ("End If")
E x a m p l e 5 Cholesky/actorization. The emission sets are respectively" Emitl.2(L) Ernitx.a(L)
= {(io,]o, j o - 1 ) E / ~ = {(io,]o, Y o - 1 ) E ~3
[ 2_
So, points of subdomain C2 are concerned for emission with the dependences dl.2 and dl.3. The code describing emissions in routine Equation~ must then be refined by observing that Emit1.2 = Emit1.3 = C2 and that points of C2 do not belong to any other emission set. Thus, it is useless to test the emission set a given point of C2 belongs to. if
u ~l.s = 0 then genera te("broadcast(L) go o'(Ut//1.2(X[~))") else if ~(d1.2,e=1)-- O then generate("bu//erl.2[...] = L") else generate ("send(L) go P + ~(~.2,,,~)") end i f
end if
C Reffay and G.-R. Perrin
350
i f ~. 51.3 = 0 then generate("broadcast(L) to cr(Utill.3(X[z-1))") else i t ~(d,.:.,,,)= 6 then generate("bul/er,.3[...] = L") else generate("send(L) to p + r end i f end i f This code is obtained by using the code generator. But its purpose is to enhance tests that can work during the code generation step. On this ezample, these tests produce the following instructions concerning emissions of the routine Equations 9 send(L) to (ax - 1, as) send(L) to (al + k - i + 1,as) -4
5
CHOLESKY FACTORIZATION : A COMPLETE EXAMPLE
In this section, we collapse all previous pieces of code generation, in order to develop the whole example. The whole parallel program for Cholesky factorization is obtained by applying two steps. The firstone determines the loop nest scanning the space-time domain presented here in a S I M D form, where the external loop describes the sequence of instants while the internal nest scans the processor space in parallel. The second step generates code for communications. This code is supposed to take place inside the nest computed by the first step and therefore parametrized by all loop indices. Such a program expresses on the one hand the span of a space-time lattice,and on the other hand, for each computation point ~" in this lattice,the subdomain C/~ where X [ ~ is defined and the needed communications to acquire its data, possibly to propagate these data and to emit its result. Section 3 showed how to find the loop nest. The following paragraph presents all intermediate calculations (i.e. the definition of geometrical objects in the "look forwards" model) necessary to apply our method of code generation for communications.
5.1
The Geometrical Objects
Dependence vectors ~.1 =(0,0,1)
Extremal vectors
Utilization 'vectors
~.2 = (0,i- k- 1, 0)
~.~.,,,= (0,I,0)
~1.1 = 0 ~ = (0,i,0)
d~l.s,,ffi., = (0, i- t - 1, 0) ,~.~ = i~ - i , i - k - i, O) d,.a,,ffi,= (0,]- k - 1,0) d1.3,.. ' = (n - / , j - k - 1 , 0 ) ~r,.~ = (o,o,1) .... i d3.,,.., - (I,0,0) d,, = (i- i,0,0) d'~.',-: - (" - I, o, o)
~ = {0,0,I) .
,
.
. .
.
.
.
.
. . . .
.
.
ul.3 " (1,0,0) '~2.1 = 0 ,,
,
~2.2 = (I, o, o) .=,
~=0
Emit,.,(L) U Emits.~(L) u Ernit3(L) = C Ernit,.s(L) = Ernitl.3(L) = Cs Emit~.~( L ) = Cs
..#
.
Dependence Analysis to Communication Code Generation
5.2
351
Code Generation
After the necessary geometrical objects have been calculated and simplified, the method presented in this paper generates the following parallel code 9 F o r t = 2 t o 3 n - l S t e p I Do Fora11 a l - 0 t o , + rain (-2, s~.p~ Do L.t = (/,j,k) = .x, .2) case ~ q C1 Do r e c e i v e ( a r g l )
receive(arg2) I f i - ] Then r e c e i v e ( a r g 3 ) Else arg3 - Ioca11,3 End I f L = a r g l - arg2 • arg3 Ifj~iThen send(arg2) t o (al - l,a2) End I f I f i ~ n Then send(arg3) to (al + 1,a2 + 1) End I f send(L) to (ax,a2 + 1) End Do case ~ q C2 Do r e c e i v e ( a r g l ) If j--i-1 Then arg2 - buffer2,2[...] Else arg2 = local2,2 End If L = argl/arg2 If i - j + n ~ 1 Then local2,2 = arg2 End I f send(L) to (al + 1,a2) buffer1,3[...] = L End Do case ~ E Ca Do r e c e i v e ( a r g l ) L = (argl)}
Buffer2,2[...]- L End Do End Doall End Voall End Do
6
CONCLUSION
As compared to the state of art concerning program para]lelization, our approach completes results about nest loop rewriting. We propose a code generator for communications. The study focused on the interesting case of afiine dependences that may induce the use of various primitives of specific communications. Our dependence modeling called "look forwards" is essentially founded on the notion of utilization set. We have shown how it is able to generate an efficient communication code, synthesizing either local memory management or point to point communication, or broadcast. Note that broadcasts are detected at the code generation level, by using basic linear algebraic tests.
352
C. Reffay and G.-R. Perrin
For sake of simplicity, the presentation has been made using the "look forwards" model in the initial iteration domain. However, in order to increase efficiency in the generated code at run-time, this model can be transposed in the space-time domain. We presented this method in the simple case of one-dimensional utilization set : it has to be extended to sets of any dimension. It can be also completed by the automatic synthesis of communication buffers [7]. All techniques presented in this article have to be implanted in the OPERA software, in order to generate parallel code from affme recurrence equations. References [1] M. Barnett. A Systolizing Compiler. PhD thesis, Dpt of Computer Science, The University of Texas at Austin, May 1992. Report TR-92-13. [2] W.A. Blankinship. A new version of the euclidean algorithm. American Mathematical Monthly, pages 742-745, 1963. [3] A. Darte. Techniqnes de paraU~lisation aetomatique de aids de boncles. PhD thesis, ENS Lyon, April 1993. [4] P. Feautrier and N. Tawbi. R~solution de syst~mes d'in~quations lin~aires : mode d'emploi du logiciel PIP. Technical report, Univ. P. et M. Curie, Paris, 1989. [5] Paul Feautrier. Some efficient solutions to the sffme scheduling problem. Journal of Parallel Programming, 21(5):313-348, October 1992. [6] L. Lamport. The parallel execution of DO loops. 17(2):83-93, February 1974.
Communications of the A CM,
[7] C. Mongenet. Data compiling for systems of uniform recurrence equations, 1994. PPL Special Issue on Dagsthul Seminar.
[8] C. Mongenet, P. Clauss, and G.-R. Perrin. Geometrical tools to map systems of ai~ne recurrence equations on regular arrays. Acta Informatica, 31:137-160, 1994. [9] Catherine Mongenet. Affme timings for systems of afline recurrence equations, 1991. LCNS 505. [10] J. Ramanujam. Non-unimodular transformations of nested loops. IEEE Computer Society Press, 16(20):214-223, November 1992. [11] T. Risset. Parall~lisation automatique : du module systolique ~ la compilation des nids de boucles. PhD thesis, ENS Lyon, February 1994. [12] A. Schrijver. Theory of Linear and Integer Programming. Wiley-Interscience Publication. J. Wiley, The Nederlands, 1986. [13] H. Le Verge, V. Van Dongen, and D. K. Wilde. La synth~se de nids de boucles avec la biblioth~que poly~drique. In RenPar'6, pages 73-76, June 1994. [14] J. Xue. Automatic non-unimodular transformations of loop nests. Parallel Computing, 20(5):711-728, May 1994.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
353
MAPPING COMPLEX IMAGE PROCESSING ALGORITHMS ONTO HETEROGENEOUS MULTIPROCESSORS R E G A R D I N G A I : t C H I T E C T U R E D E P E N D E N T PEI:tFOI:tMANCE P A R A M E T E R S
M. SCHWIEGERSHAUSEN, M. SCHONFELD, P. PIRSCH
Laboratorium fuer Informationstechnologie Universit~t Hannover Schneiderberg 3~, 30167 Hannover, Germany e-maihschwiege@mst, uni.hannover.de ABSTRACT. The high performance requirements of sophisticated image processing algorithms as well as the different computational demands of the individual parts request heterogeneous multiprocessor systems composed of dedicated as well as programmable processors. Mapping onto multiprocessors comprises partitioning of the algorithm into tasks and assignment of the tasks to processors. For this assignment an extended mixed integer linear programming formulation is presented including architecture dependent performance parameters. KEYWOttDS. System-level synthesis, mapping, MILP, heterogeneous hultiprocessors. 1
INTRODUCTION
Image processing applications like high definition TV (HDTV), 3D-image analysis or scene interpretation often have to be processed under real time constraints. This requires extremely high computational and throughput rates, which can only be achieved by massive application of parallel processing and pipelining as provided by multiprocessor systems. Complex image processing schemes can be splitted into several subtasks with different requirements leading to different architectures appropriate for the individual tasks. This results in a heterogeneous multiprocessor system for the implementation of the complete scheme. These heterogeneous systems may be composed of different, application specific processors like array processors for implementing low and medium level tasks with high computational demands and programmable processing elements on the other hand to provide the flexibility required by the high level tasks. For the design of such heterogeneous multiprocessor systems it is inevitable to find an efficient mapping of the individual parts of the image processing algorithm to suitable processors. This comprises not only the derivation of processors adapted to the computational needs of the single algorithms but also to combine them to an optimal entire system.
M. $chwiegershausen, M. SchOnfeld and P. Pirsch
354 2
PREVIOUS WORK
In the past, mainly two approaches for mapping algorithms onto multiprocessor systems can be distinguished. First, in the field of high-level synthesis the problem of mapping one single algorithm onto a dedicated datapath or processor was tackled [1, 2, 3, 4]. Especially in case of low level algorithms with high computational requirements methodologies for mapping an algorithm onto a regular array of processing elements like systolic array processors or related architectures were derived [5, 6, 7, 8]. But these approaches provide no means for mapping a complex scheme of several different image processing algorithms onto an appropriate multiprocessor system. On the other side, synthesis scripts like CATHEDRAL H, 2nd [9, 10] were developed in order to derive VLIW architectures consisting of synchronous DSP units with dedicated datapaths connected via a bus. Here, the application domain is a subset of realtime signal processing algorithms addressing medium throughput applications. owever, image processing applications demand for architectures providing extremely high computational and throughput rates. Other projects deal with mapping algorithms onto a distributed multiprocessor system with identical processors. But architectural aspects are hardly taken into consideration [11, 12, 13, 14]. In [1.5] an approach for domain-specific multiprocessor systems is presented but in opposite to our work it aims at calculating feasible schedules for a given application specific architecture without considering the area expense as well as the throughput rate. Furthermore, architectures excluding access conflicts on buses are regarded only. Therefore, these approaches have to be extended for mapping a sequence of complex image processing algorithms onto heterogeneous multiprocessor systems, consisting of different application specific processors and of flexible programmable processors. In this paper a new approach for mapping algorithms onto hardware is presented, taking into consideration the architectures of the related processors as well as the buses for the data transfer. The dependence of the alternative architectures on the execution and communication times as well as on the area expense is regarded. 3
MAPPING
APPROACH
Mapping the whole algorithm onto a heterogeneous multiprocessor system can be divided into three steps, namely partitioning the algorithm into several subtasks, determining possible assignments of the subtasks to dedicated processors during the classification and finally choosing the best assignment and deriving a schedule for the tasks, i.e. determining the temporal order in which the tasks are performed on the different processors (Figure 1). In the following the three design steps are explained in more detail. P a r t i t i o n i n g First of all, the overall algorithm has to be analyzed in order to detect non dependent parts which consequently can be paraUelized and therefore executed simultaneously on different processors [16, 17, 18]. This results in partitioning the overall algorithm into M dependent tasks mi, each of which represents a single image processing algorithm. Thereby, the dependencies between the different tasks can be modelled by means of a directed, acyclic graph GT = (V, E), whereas each vertex vi E V represents exactly one task mi and a directed arc eij E E from task mi to mj states, that there exists a precedence constraint between these tasks, denoted by a relation mi -< rrtj. Thus, the overall algorithm
Mapping ComplexImage ProcessingAlgorithms
355
can be described by a task graph, as shown in the top box of figure 1. Classification During the classification, it is determined for each task which assignments of the task's algorithm to suitable architectures of a given processor library is feasible with respect to the necessary throughput rate of the whole algorithm. Therefore, the classification requires a profound knowledge of the algorithms as well as of the architectures. The classification has to support different types of image processing algorithms distinguishing between filter algorithms (FIR,MEDIAN,etc.), transform algorithms (DCT, FFT, etc.), and non regular, data dependent algorithms (Quantization,Coding) on the other side. Furthermore, a distinction between separable and non separable as well as between 1-dimensional and 2-dimensional transform/filter algorithms is necessary. For each of these algorithms the following topics can be derived: The number and types of operations, the input/output data generated or needed by successive/precedent tasks and the inherent operation concurrency of the underlying algorithm. In case of real-time applications, additionally the necessary degree of operation concurrency can be calculated, whereas this corresponds to the average number of operations that have to be performed in parallel in order to fulfill the given timing constraint. In order to find appropriate architectures for the different requirements of the algorithms, a library of five alternative types of application specific processors is provided. This library comprises a single processor with dedicated datapath (SINGLE) performing all operations sequentially, several such dedicated processors in parallel (PAR) or connected in a pipeline mode (PIPE), and an array of dedicated processors (ARRAY)providing both, pipr and parallel operation mode. Thereby, the datapaths of the application specific processors depend on the kind of operation which are required for a given task. Additionally, a flexible and programmable processor (PRO(]) preferably suitable for performing the medium level tasks is provided.
Figure 1: Mapping Approach
These five processor types are only templates, whereas the datapath of each of the types is adapted to the operations that have to be performed for each task. For example, a SINGLE processor's datapath performing a filter task like FIR differs from a SINGLE processor's datapath performing a motion estimation task like BMA. By this adaptation of the datapath to a given task a large variety of architectures is derivable for each processor type. Fortunately, the number of feasible architectures can be reduced since a task with high requirement concerning operation concurrency can only be performed by a processor
356
M. $chwiegershausen, M. $chOnfeld and P. Pirsch
type with pipeline and/or parallel operation mode, namely PAP,, PIPE, ARRAY.Nevertheless, the remaining processor types lead to rather different performance data and area. expense. Thereby, architectures providing high computational rates often have large area expense. Thus, a trade off between computation speed and chip area has to be found for the overall multiprocessor system. According to the large number of feasible assignments of tasks to processor types, a cost function is necessary to judge a certain mapping. Clearly, the cost function has to take into account the achievable computation and throughput rate as well as the area of the related processors. Finding the best assignment of tasks to processors according to the cost function mentioned above leads to a combinatorial optimization problem as described in section 4. In order to derive a suitable formulation for this assignment problem the different architectural alternatives have to be described by means of a set of parameters. Therefore, it is reasonable to introduce the following performance and area parameters: E x e c u t i o n T i m e ri,j The execution time determines the time interval to perform task mi on processor pj. Generally, there are several architectures suitable to perform the task mi. Nevertheless, the number of clock cycles r i j to execute once the job may vary in a wide range, if function oriented as well as programmable processors are taken into consideration. P i p e l i n e P e r i o d c~i,~,j The pipeline period is the time interval between the execution of two tasks mi and m k , in case the tasks are both assigned to the same processor pj . If the processor is not capable of performing the tasks in a pipelined fashion, which means that task mk starts not until the completion of task mi, c~i,k,j just denotes the execution time of task mi. Otherwise, ~i,k,j < ri,j. B u s T r a n s f e r R a t e BR, BL The bus transfer rate BR for a remote data transfer denotes the number of clock cycles necessary to transmit a given amount of data on one of the buses, connecting different processors. Tile bus transfer rate for a local transfer BL denotes the time necessary to transfer a given amount of data within the same processor, for example if an exchange between several register files or local memory units is necessary. D a t a T r a n s f e r A m o u n t Di,k The data transfer amount gives the number of data, which have to be transmitted from task mi to m~. In case both tasks are executed by the same processor, no data transfer is necessary. Thus, the data can be held locally inside the processor's local memory. By multiplication of the data transfer amount and the bus transfer rate the time for data exchange can be calculated. I n p u t D a t a I n t e r l e a v e Factor/3il, k The input data interleave factor is the number of clock cycles between the execution end time of any task mi, which transmits data to a task mk, with mi -~ mk, and the start time of task mk on processor pj. Usually, it is assumed that task mk can start execution only after all of its predecessors have transmitted their data. In case task mk does not need all data from a predecessor to begin execution, this task can start execution while there are still data transferred to the processor. Thus, the task works in an interleaved or stream mode. For example, filtering tasks often process their data in a stream mode, whereas transforms as the FFT work in a block mode, since all input data are accessed in parallel. 
O u t p u t D a t a I n t e r l e a v e F a c t o r f~i, ~ The output data interleave factor determines the number of clock cycles between the execution start time and the begin of the data transfer of task mi if executed on processor pj. In case task mi generates its output data at the end of the computation,/~i,~ is never smaller than the execution time ri,j. For example, motion
Mapping Complex Image Processing Algorithms
357
estimation algorithms like block matching first of all compare the whole search area of the previous frame to a reference block of image data of the current frame, before the position or displacement of the most similar block can be determined. Thus, only few output data are generated, but only at the end of the computation. In contrast, convolution algorithms generate many output data at a high rate just after a few input data have been processed. This leads to a very small number of clock cycles for fli,~ P r o c e s s o r A r e a E x p e n s e Aj The processor area expense is an estimation for the number of transistors necessary to realize the datapath of a specific processor type. This depends on the processor type as well as on the task to be performed. In order to calculate these performance parameters as well as area expense parameters, the different application specific architectures for each of the processor types SINGLE, PhR, PIPE, hB.Rhg and their datapaths dedicated to the different algorithms were synthesized on register transfer level. Thus, it becomes possible to derive realistic estimates concerning the execution time cycles as well as the number of transistors for each processor. Scheduling and Allocation Finally, the scheduling and allocation has to select from the large amount of feasible assignments derived during the classification the one which leads to an overall multiprocessor system with high computational and throughput rate as well as a low area expense. So, it is necessary to determine for each task the processor type, the temporal order of the tasks and of the data transfers, taking into consideration the precedence between the tasks, the availability of input/output data and the non overlapping usage of processors and buses in order to derive a valid multiprocessor schedule. This assignment problem can be formulated as an optimization problem and efficiently solved by means of linear programming, as described in the next section. 4
MILP FORMULATION
Although the idea to tackle scheduling and allocation by (mixed integer) linear programruing is not new, the presented approach is extended in comparison to previous works [19] by two main aspects. First, a multitude of different application specific processors can be included by introduction of architecture dependent parameters which result from the classification step. Second, periodic applications are treated taking into consideration requirements of real-time processing. The scheduling and allocation, i.e. the assignment of tasks to processors and of data transfers to buses can be described by means of a Mixed Integer Linear Programming. A Mixed Integer Linear Programming (MILP) is a general Linear Programming problem, where some of the variables are constrained to be integer. Here, the MILP models the problem of assigning a set of image processing tasks to a heterogeneous multiprocessor system aiming at high computational and throughput rates on one side and a low silicon area of the application specific and programmable processors on the other side. This is achieved by formulating the assignment problem by means of a set of binary decision variables responsible for the assignment of tasks to processors or data transfers to buses and by a set of real valued timing variables corresponding to the start and end times of task executions or data transfers. For example, if task mi is executed by processor pj then the corresponding binary decision variable xi,j = 1 and zero otherwise. Equivalently, the time at which a task rai starts execution, is denoted by the real valued variable si. The restrictions concerning the assignment problem like unique processor se-
M. $chwiegershausen, M. $chOnfeld and P. Pirsch
358
lection or non overlal~ping data bus transfer can be described by a set of (in)equalities. In order to judge a s~ecific assignment of tasks to processors, an objective (cost) function must be given. In the remainder of this section the used variables, the cost function, and the restrictions are presented: 4.1
B i n a r y decision variables
The binary (0-1) decission variables used in the MILP model are: 1. x i j = 1
2. 3. 4. 5. 6. 7. 8. 4.2
~j = 1 ~i,k = 1 eil,kl,i2,k2 = 1 7i,~ = 1 7Lk = 1 I/t = 1 ~7~.k= 1
if task mi is executed by processor pj. if processor pj is necessary, i.e. at least one task is executed by pj. if task mi starts execution before task mj. if the data transfer mi~ ---, mk~ starts before the transfer mi2 "* ink2. if the tasks mi, mk are executed by different processors. if the tasks mi, mk are executed by the same processor. if BUSt is necessary, i.e. there is at least one data transfer on BUSt. if the data transfer mi~ --, mk~ is carried out by using BUSt.
Integer variables
The integer valued variables used in the MILP model are: 1. Y 2. B 4.3
total number of processors. total number of buses.
Real timing and area variables
The real valued timimg and area variables used in the MILP model are: 1. si 2. ei 3. csi,k 4. cei,k
5. T 6. P 7. A 4.4
time step when the execution of task mi starts. time step when the execution of task mi ends. time step when the data transfer from task mi to task mk starts. time step when the data transfer from task mi to task mk ends. computation time necesarry to execute once the whole algorithm. computation period, i.e. the time between two concecutive computations. total area expense of the multiprocessor system.
Objective function
The objective function is formulated as a weighted sum of all those parameters influencing the performance of the overall multiprocessor system. Therefore, the following system parameters are taken into consideration: The computation time (latency) T, necessary to compute once the whole algorithm and the computation period P, i.e. the time interval between two consecutive computations. Note, that there is a difference between the computation time and the computation period. Since image processing schemes often have to be processed periodically on a video-frame or even block level, it is assumed that the processors work in an overlapped, pipelined mode in order to achieve high throughput rates. Titus, fully static and overlapped schedules similar to the static rate-optimal scheduling approaches for iterative dataflow-programs [20, 21] are considered, leading to computation periods P _< T. Furthermore, the area expense of the multiprocessor system has to be
Mapping Complex Image Processing Algorithms
359
taken into consideration: This comprises the area Aj of all processors pj which were selected (~r = 1), the number Y of processors a.nd the number B of all.buses 1 which were allocated (17/= 1).
[rain It,.. r. +. . .p +. . .A +. . . V +
B]
The individual weights CT, c p , . . . , cB provide the opportunity to consider different constraints of the designer. This leads to a priority-based optimization technique. For example, if the main goal is to derive an overall multiprocessor system with minimum area expense A and additionally an execution time T as small as possible the priority order of the objective function's parameters is A ~, T ~- ... ~ B. The corresponding, individual weights according to the designer's intention have to be CA >> CT >> ... >> cB. In order to achieve the optimum according to the main goal, it is recommended to choose the ratio r for two adjacent parameters of the priority order in the range of r = c~ > 102 To guarantee this, it is necessary to transform all the parameters of the objective function to an identical value range, for example [0..1]. This can be achieved by calculating the minimum as well as the maximum possible values for each parameter G according to (2). r
G - Groin
g = Gm,~;LGmin
4.5
0 <_ g <_ 1
--
O e {T,P,A,Y,B}
(2)
Restrictions
To ensure a valid multiprocessor schedule the following restrictions have to be taken into consideration: First of all, any task mi has to be assigned to exactly one processor pj, which is capable of performing the associated operations of this task which leads to equation (3). Furthermore, any processor is only necessary, i.e. ~j = 1 if one or more tasks are assigned to it. Therefore, ~j may be expressed by the logical OR of all tasks mi mapped onto the processor pj, see (4). Equation (5) states the precedence relation between two tasks, mi, mk denoted by mi ~ mk, according to data dependencies between the tasks. Some of these precedence constraints can be directly derived from the task graph, because a directed arc ei,k from task mi to mk denotes a precedence relation mi "~ mk. Nevertheless, if no path exists from mi to mk, no information is available, which of the tasks is executed first. In this case, the precedence relation can only be determined after the MILP has been solved. In order to guarantee a correct processor utilization, we must ensure that no two tasks are executed simultaneously by the same processor or equivalently if the processor pj can perform the two tasks, in a pipelined fashion with a pipeline period eq,k,j, then task mk can be scheduled only after task mi with respect to this pipeline period. In order to express the fact that whenever two tasks mi and mk are assigned to the same processor pj, i.e. xi,i = X k , j ---- 1 and task mi starts execution before task ink, i.e. Si (_ Sk, then si + t~i,k,j <_ sk. To do this ifi,k is 1 if si <_ sk and zero otherwise. We can guarantee this by equation (6). Any task mk can start only after all of its predecessors mi with mi .z, mk have completed execution on one of the processors or equivalently after they have transmitted
M. Schwiegershausen, M. SchOnfeld and P. Hrsch
360
their data. Therefore, if cei,k denotes the time at which the communication from task mi to task mk has completed, task mk can only be started after the end of the data transfer of all of its predecessors: sk > cei,k. In case that task mk does not need all data from a predecessor before the execution can start on a specific processor pj the above equation may be relaxed to (7) where/~[j denotes the number of clock cycles before the end of the data transfer at which the task mk is allowed to begin execution. Assume that the task mi was scheduled on processor pj at time si and the execution time of the task on this processor is given by rid, then the time at which the execution is completed can be cMculated as the sum of start and execution time (8). N
E zi,j = 1
1< i < M
(3)
~J "- V Xi,j
1 _< j < N
(4)
m, -< mk
(5)
j=l
~i,} = 1 N
sk > si + ~
6i,k = 1
ai,kd " xid
A :rid = xA:,./= 1
(6)
j=l N
>_
-
m i -~ m k
(7)
1 _< i _< M
(8)
1 __ i _< M
(9)
mi -q rnk
(10)
t t fit,kl,i~,k2 "- 1 h l~il,kl -- ~i~,k~ -" 1
(11)
m i ~ rnk
1 < i _< M
(12) (13)
6i,k = 1 A x/,j = x~ d = 1
(14)
j=l N
ei = si + ~_~ l"i,j ":rid j=l N j=l
1 1 cei,k = csi.k + 7i.~ " ei , , " ff~n + 7Lk " Oi.' " -ff~ L C8i~,ka ~ Ceit,kl
rll -- V rl~,k a T)ei P>_ e ~ - s i N
A= Z j=l
N
I,
(i "A.i Y = E ~.i B = E r l t j=l
(15)
i=1
Generally, it is assumed that the data transfer from task m / t o a successor task mk takes place after the task has completed execution and thus calculated all result values. In case the processor is able to perform execution and I/O simultaneously, the data transfer may begin even earlier. Furthermore, let us denote by fli,~ the number of clock cycles after starting task mi on processor pj when the first result value is valid and thus the data transfer can start. This leads to the constraint (9). T h e t i m e at which data transfer from task mi to task mk ends depends on the number of data items Di,k that have to be transmitted, the transfer rate B n of the bus or the used communication link between the associated processors and whether it is a local transfer or a remote one. In case the two tasks communicating with each other are assigned to the same processor we denote this by a local transfer, i.e. 7i,L = 0 with little or even zero communication time, determined by the so called local transfer rate
Mapping Complex Image Processing Algorithms
361
BL. Otherwise, a remote transfer, i.e. 7i,R = 1 takes place using the communication link between the processors, see equation (10). Concerning the bus selection we must ensure that no collision occurs on any bus at any time, i.e. the data transfers on the buses are restricted to be nonoverlapped. Now we can express the fact, that two data transfers have to be performed in a nonoverlapped fashion on any of the L buses by equation (11). Furthermore, a BUSI is only necessary, if there is at least one remote transfer on that bus, i.e. 3(i,k) rl~,k = 1. Since r/t = 1 if BUSt is necessary, this constraint can be expressed by the logical OR, see equation (12). Finally, to ensure that T represents the computation time of the whole algorithm it cannot be smaller than the time when the last task being executed by the multiprocessor system has completed execution. Thus, equation (13) must hold. In case the whole algorithm has to be executed periodically, the computation period P between two consecutive computations can be calculated as follows: Assume, that task mi is scheduled at si on processor pj. Then the second computation of task mi will be scheduled at si + P on the same processor. In order to avoid processor usage conflicts, we must ensure that whenever two tasks mi, rnk are scheduled on the same processor, i.e. xi,j = xk,j = 1 and mi is scheduled before ink, i.e. tii,k = 1 the start time of the ( n + l ) t h computation of mi must be greater than the execution end ek of any task mk of the nth computation scheduled later than mi on the same processor, see equation (14). Clearly, the area expense A of the whole multiprocessor system, the number Y of processors as well as the number B of buses (15) can be easily calculated from the corresponding decision variables ~j and ~i. 5
EXPERIMENTAL
RESULTS
In order to validate the mapping approach using mixed integer linear programming a typical complex image processing application was chosen: Low bit rate coding according to the hybrid videocodec scheme based on CCITT recommendation t1.261 [22]. It is used for data reduction necessary to transmit video data on a p x 64 kbit/s line (1 _< p _< 30), e.g. video telephone, video conferencing or even multimedia.
•
Figure 2: Video Coding Algorithm
The application consists of several computational extensive regular, low level tasks like motion estimation (BMA), discrete cosine transform and inverse (DCT, IDCT), and FIR filtering (LF) as well as data dependent, irregular medium level tasks like quantization and inverse (Q, IQ) and variable length coding (VLC). For the mapping it is assumed that the video frames are coded as luminance and two color difference signals sampled with N v = 352(176) pixels/line, Nt = 288(144) lines/frame, and a frame rate of fFrame = 10 Hz. Since the whole video codec scheme can be decomposed into the computation of so called macro blocks with 16'16 pixels, NMB = 352 Tff' 2ss TC = 396 computations of different macro blocks are necessary. Thus, the computation for each macro block has to be completed after Tp -< Tp . . . . = 0.25 msec. Assuming a clock frequency of f = 40 MHz leads to the NMB real-time constraint that the computation period P is given by P <_Tp. f = 10101 cycles. Figure 3 shows tile corresponding task graph derived from the video coding algorithm H.261. Here, IN denotes the buffer of the current video frame and BUF the buffer of the previous frame. The right box of Figure 3 shows the possible assignment of tasks to pro-
362
M. Schwiegershausen, M. SchOnfeld and P. Pirsch
cessors, assuming three different processor types (SINGLE, PIPE, ARRAY) for the low level tasks and a programmable processor (PROG) for the medium level coding task. Concerning the data transfer 32-bit buses are assumed which means that in case of 8-bit video data four data items can be transmitted in parallel on each of the buses. Furthermore, the size of the processor's local memories were not taken into consideration. The corresponding MILP model under these simplified assumptions leads to 125 variables and 284 restrictions which could be solved by branch-and-bound within a few cpu-seconds. According to the designers goal or priority order different solutions for an emcient heterogeneous multiprocessot system can be derived. The performance and area expense of four solutions derived by solving the MILP under different assumptions concerning the designer's priority are shown in Table 1. The value of the most important parameter concerning the priority order is sketched grey in each column. Figure 3: Task Graph and Assignments As can be seen, the minimum value with respect to the designer's main goal could always be achieved. If for example the designer's main goal is to derive an overall system with minimum area expense A and additionally small latency T as well as small computation period P, i.e. A ~- T, P ~- ... ~- B, the result is shown in the first column of Table 1.
Table 1" Performance and Area Expense of different Multiprocessor Systems The corresponding architecture is shown in Figure 4. It consists of Y = 7 procesors, three of type SINGLE, two of type PAR, and one of type PIPE and PROG,leading to on overall system with minimum area expense A = 56792. Theoretically, the area expense could be further reduced by selecting more processors of type SINGLE, but because of the increase in execution time for this processor type compared to a processor of type PIPE or PAR this would violate the real-time constraint P <_ Figure 4" Heterogeneous Multiprocessor System 10101!
Mapping Complex Image Processing Algorithms
Figure 5: Periodic and Overlapped Schedule
363
Figure 5 shows the overlapped and periodic schedule for the multiprocessor system. It can be seen, that the computations of three macro blocks overlap. The computation period, i.e. the time interval
between two consecutive computations is P = 6156 cycles, whereas the time necessary to compute once a whole macro block of 16x16 pixels takes T = 20646 cycles. Clearly, the processor idle times could be further reduced, if resource sharing of the processor's datapaths is allowed. The four solutions derived are sketched in the area-time plane, see Figure 6. The possible design space is marked by the dashed rectangle, showing the possible solutions concerning area expense A and computation period P, whereas the right edge of the design space reflects for example the real-time constraint (P _< 10101). Depending on the designer's main goal either area or time optimal solutions can be derived. 6
Figure 6' Design Space A,P
CONCLUSION
In this paper, an extended MILP model is presented to describe the mapping of complex image processing algorithms consisting of tasks with different computational complexity onto a heterogeneous multiprocessor system. The alternative processor types and their dedicated data paths are described by a set of architecture dependent parameters, like execution times, pipeline period, detailled architectural dependent data transfer times, and area expense. These parameters were estimated by synthesizing the datapaths on register transfer level leading to realistic assumptions concerning execution time cycles and transistor expense. Thereby, it becomes possible to derive an overall heterogeneous multiprocessor system which is optimal with respect to a designer's priority or main goal like minimum area expense or execution time. First encouraging results under simplified assumptions concerning non-overlapping computation and data-transfer could be derived for a typical application example taken from the field of video coding, e.g. video telephone. Acknowledgements This work is supported by the Deutsche Forschungsgemeinschaft under contract PiReferences
[1] W. Rosenstiel, H. KrKmer, "Scheduling and Assignment ill High Level Synthesis", in High Level VLSI Synthesis, It. Composano, W. Wolf, editors, Kluwer Academic Publishers, pp. 355-382, 1991. [2] P.G. Paulin, "Global Scheduling and Allocation Algorithms in the HAL System", in High Level VLSI Synthesis, R. Composano, W. Wolf, editors, Kluwer Academic Publishers, pp. 255-281, 1991.
364
M. Schwiegershausen, M. SchOnfeld and P. Pirsch
[3] Y. Hsu, Y. Lin, "High Level Synthesis in the THEDA System", in High Level VI`SI Synthesis, R. Composano, W. Wolf, editors, Kluwer Academic Publishers, pp. 255-281, 1991. [4] F. Tsai, Y. Hsu, "STAR: An Automatic Data Path Allocator", IEEE Transactions on Computer-Aided Design, Vol. 11, No. 9, pp. 1053-1064, September 1992. [5] S. Y. Kung, S. N. Jean, "A VLSI Array Compiler System (VACS) for Array Design", R. W. Brodersen, H. S. Moscovitz, editors, VI,SI Signal Processing llI, Chapter 45, pp. 495 - 508, IEEE Press, New York, 1988. [6] P. Frison, P. Gachet, P. Quinton, "Designing Systolic Arrays with DIASTOL", S.Y. Kung, R.E. Owen, J.G. Nash, editors, VI,SI Signal Processing II, Chapter 9, pp. 93-105, IEEE Press, New York, 1986. [7] D.I. Moldavan, "ADVIS: A Software Package for the Design of Systolic Arrays", IEEE Transactions on Computer-Aided Design, Vol. CAD-6, No. 1, pp. 33-40, January 1987. [8] U. Arzt, J. Teich, M. Schumacher, L. Thiele, "Hierarchical Concepts in the Design of Processor Arrays", CompEuro '9s pp. 232-237, March 1992. [9] J. Rabbaey, H. de Man, J. Vanhoof, G. Goosens, F. Cathoor, "CATHEDRAL II: A Synthesis System for Multiprocessor DSP Systems", in Silicon Compilation, Addison-Wesley, pp. 311360, 1988. [10] J. Vanhoof, I. Bolsens, G. Goosens, H. de Man, K. Rompaey, "High level synthesis for real-time digital signal processing", Kluwer Academic Press, 1993. [11] S.Y. Lee, J.K. Aggrarwal, "A Mapping Strategy for Parallel Processing", IEEE Transactions on Computers, Vol. C-36, No. 4, pp. 433-441, April 1987. [12] D. Fernandez-Baca, "Allocating tasks to Processors in a distributed System", IEEE Transactions on Software Engineering, Vol. 15, No. 11, pp. 1427-1436, November 1989. [13] S.H. Bokhari, "Partitioning Problems in Parallel, Pipelined and Distributed Computing", IEEE Transactions on Computers, Vol. 37, No. 1, pp. 48-57, January 1988. [14] K. Konstantinides, R.T. Kaneshiro, J.R. Tani, "Task Allocation and Scheduling Models for Multiprocessor Digital Signal Processing", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 38, No. 12, pp. 2151-2161, December 1990. [15] A. Bachmann, M. Sch6binger, L. Thiele, "Synthesis for Domain-Specific Multiprocessor Systems including Memory Design", L.D.J. Eggermont, P. Dewilde, E. Depettre, J. van Meerbergen, editors, VLSI Signal Processing VI, Part 3, pp. 417-425, IEEE Press, New York, 1993. [16] U. Vehlies, U. Seller, "The Application of Compiler Techniques in Systolic Array Design", Proc. ISCAS, pp. 240- 243, June 1991. [17] M. Girkar, C.D. Polychronopoulos, "Automatic Extraction of Functional Parallelism from Ordinary Programs", IEEE Transactions on Parallel and Distributed Systems, Vol. 3, no. 2, pp. 166- 178, March 1992. [18] M. Gupta, P. Banerjee, "Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers", IEEE Transactions on Parallel and Distributed Systems, Vol. 3, no. 2, pp. 179- 193, March 1992. [19] S.Prakash, A.C.Parker, "Synthesis of Application Specific Heterogeneous Multiprocessor Systems", Journal of Parallel and Distributed Com.putin.9, no. 16, pp. 338- 351, December 1992. [20] K.K. Parhi, G.D. Messerschmitt, "Static Rate-Optimal Scheduling of Iterative Data-Flow Programs via Optimum Unfolding", IEEE Transactions on Computers,Vol. 40, no. 2, pp. 178 195, February 1991. [21] K.K. Parhi, "Algorithm Transformation Techniques for Concurrent Processors", Proceedings of the IEEE, Vol. 77, no. 12, pp. 1879- 1895, December 1989. 
[22] CCITT Study Group XV: Recommendation H.261, "Video Codec for Audiovisual Services at p x 64 kbit/s, Report R37, Geneva, July 1990.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
OPTIMAL COMMUNICATION CHIP COMPILER
FOR
A GRAPH
365
BASED
DSP
H.-K. KIM Electronics and Telecommunications Research Institute (ETRI} Yusong P. O. Boz 106, Taejon Korea kim @media. etri. re. kr
A B S T R A C T . This paper describes a complete chip compiler which can generate a circuit diagram with the associated control signals efficientlyfrom a given digitalsignal processing (DSP) algorithm. Especially, we focus on the development of a schedule modification methodology which can be applied to schedules generated by a cyclo-static scheduler. The objective of the schedule modification methodology is to obtain a modified schedule that requires both the minimum number of processing dements and the minimum number of communicating edges. This paper illustrates the synthesis process with an example of a fourth order Wave Digital Filter. KEYWORDS. filter.
1
Digital signal processing, cyclo-staticscheduler, chip compiler, Wave Digital
INTRODUCTION
In our previous research, we proposed a chip [5] compiler which can generate a VLSI layout efficientlyfrom a recursive DSP algorithm by combining a cyclo-staticmultiprocessor scheduler [8] with the L A G E R system [9]. For a given DSP algorithm represented by a fully specified flow graph (FSFG), a cyclo-static scheduler generates a rate-optimal, processoroptimal, and delay-optimal schedule [8]. Then, the schedule is modified so as to simplify the architectures of processors to be used as basic modules for the VLSI realization. Taking the modified schedule and associated FSFG, the Circuit Generator automatically creates a complete circuit diagram for both the computational and control circuitry necessary to implement the input algorithm. The resulting circuit diagram is used as input to the L A G E R system to create a final chip layout for fabrication. An important issue which was not addressed in the previous paper was the minimization of the interconnection network of the multiprocessor implementation. In general, efncient
366
H.-K. Kim
rate-optimal and processor-optimal fine grained multiprocessor implementations require considerable interprocessor communications. If the implementation is composed of n processors, n ( n - 1) uni-directional communications channels may be required for the worst case. From a VLSI implementation viewpoint, minimal interconnections are often a critical factor because communications channels (wires) occupy so much space. This paper describes a complete chip compiler which can generate a circuit diagram with the associated control signals efficiently from a given DSP algorithm. Especially, we focus on the development of a schedule modification methodology which can be applied to schedules generated by a cyclo-static scheduler. The objective of the schedule modification methodology is to obtain a modified schedule that requires both the minimum number of processing elements and the minimum number of communicating edges. In addition, we designed three prototypes of target processors in order to implement both adaptive and nonadaptive algorithms efficiently. They are to be used as basic modules in VLSI. Although we employed the LAGER system to obtain a VLSI layout in the previous paper, other existing CAD tools may be also used. For the sake of clarity, we will explain the design process except VLSI layout generation with an example of a fourth order Wave Digital Filter.
2
CHIP COMPILER OVERVIEW
The chip compiler takes as input a recursive DSP algorithm described by a graph, and generates a complete VLSI layout. Figure 1 depicts the block diagram of the overall chip compiler. The chip compiler is divided into five modules: the Flow Graph Analysis, the Cyclo-static Scheduler, the Schedule Modification, the Circuit Generator, and the Layout Generator. Each module can be briefly described as follows. 9 The Flow Graph Analysis module takes a fully specified flow graph (FSFG) as input, and extracts the following three optimality conditions: the Iteration Period Bound (IPB), the Processor Bound (PB), and the Period Delay Bound (PDB) [8]. These three optimality conditions are fundamental and are uniquely determined by the given FSFG and the speed of the elements which will be used in the implementation. These bounds do not rely on input algorithm transformations such as loop unrolling and retiming/loop folding. 9 The Cyclo-static Scheduler module finds an optimal cyclo-static multiprocessor schedule using the three optimality conditions computed in the Flow Graph Analysis module [8]. The resulting cyclo-static schedule always guarantees an implementation which can operate at the highest input sampling rate, with the least input-output delay, and with the highest processor efficiency (fewest possible number of processors). 9 The Schedule Modification module modifies the cyclo-static schedule to minimize the number of the processing elements and the number of interconnections between the constituent processors required for the implementation. The resulting modified schedule still always satisfies the three optimality conditions imposed on the original cyclostatic schedule.
A Graph Based DSP Chip Compiler
367
FlowGraph Analysis I Optimality Conditions
l Cydo-statio Scheduler
Modification
ISchedule Modified
I c,ro,, Control I Signals ~ Dia~lram Generator
Circuit
Layout Generator
Figure 1: Chip Compiler-Overview 9 The Circuit Generator generates a complete circuit diagram for both computational and control circuitry from the modified schedule and the associated FSFG of the input algorithm. The Circuit Generator can be considered the final phase of the behavioral synthesis process. 9 The Layout Generator creates a final VLSI layout for fabrication. The verification of the final layout can be performed by existing CAD tools. 3
SCHEDULING
3.1
Input Algorithm Representation
In our chip compiler, input algorithms are specified by fully specified flow graphs (FSFGs) [8]. A FSFG is a shift-invariant flow graph (SIFG) in which the nodal operations are additionally constrained to be atomic operation on a target processor. By atomic operation, we mean the finest granularity at which parallelism may be exploited. In general, there may be many FSFGs which corresponds to a particular SIFG. From a given SIFG, we can obtain a FSFG by expanding all non-atomic nodes (macro nodes) in the SIFG into atomic operations. Typically the macro node expansion is performed in a manner that maximizes the parallelism of the resulting FSFG [6]. A FSFG can be described as a finite directed graph. Each node corresponds to an atomic
368
H.-K. Kim
4
6
7
14
II
Input
19
21
2
)
) 5
~ .
.
.
.
,
8
. . . . . .
~5
v Output
20
Figure 2: FSFG of Fourth Order Wave Digital Filter operation on a target processor. Each directed edge represents data flow between the nodes. Each delay element, denoted by z -1, describes an ideal delay. The ideal delay relates data between successive iterations of the computation represented by flow graph. An example of the FSFG is shown in Figure 2. The operations depicted are either multiplications or additions. However, in the general case of the FSFG, the atomic operations could be of any granularity of computation. 3.2
Flow Graph Bounds
For a given FSFG, there are three fundamental bounds on the performance of a multiprocessor implementation. The basic assumption for these bounds is that the computational delays of all nodes are known [1, 7]. I t e r a t i o n Period Bound (IPB): The IPB is the minimum achievable latency between iterations of the algorithm. In other words, the IPB defines the maximal possible sampling rate of input streams when the speed of hardware unit is known. An implementation that achieves the IPB is called a rate optimal implementation. The IPB (To) is given by
To = [ max { Dl leloops m't }1 D, =
(1)
dj
where nt=total number of delay elements in loop l
dj=computational delay of the jth node of the Ita loop Dl-computational delay around loop l /=loop index. The quantity inside the brackets, P_t is called loop bound, Tl for loop l [3]. In Figure 2, if n i '
A Graph Based DSP Chip Compiler
369
we assume dmuu = dadd = 1, the IPB is 8. P r o c e s s o r B o u n d (PB): The PB is the lower bound on the number of processors required to implement a particular FSFG at the IPB. It is defined by Po >_
F~l
(2)
where D=total computational delay of all the nodes To=IPB. If all operations in the time span of IPB are well balanced, and bottlenecks are not present, the equality condition usually holds. Any implementation that uses Po processors, and achieves an IPB is called processor optimal. For the example in Figure 2, the PB is 3. Periodic T h r o u g h p u t Delay B o u n d (PDB): The PDB is the minimum achievable time between an input and the corresponding output. Schedules that operate at the PDB are called delay optimal. The PDB (Do) is defined as
Do = max(Dp - npTo) pEi/o
(3)
where Dp= computational delay of all the nodes along the I/O path p rip=number of delay elements along path p To = IPB. The quantity inside the parenthesis ( D p - npTo) is called the Throughput Delay along path p. The PDB of the FSFG in Figure 2 is 5, and tile corresponding I/O path is 14-13-12-15-16. For a particular multiprocessor system and a particular FSFG, the bounds described above are fundamental. Hence, if implementations can be developed which achieve these bounds, then it is clear that no other implementations exist which can operate at a higher rate, with less delay, or with a smaller number of processors.
3.3
T h e Cyclo-static Scheduler
We have employed a Cyclo-static Scheduler to obtain an optimal schedule in our chip compiler. The Cyclo-static Scheduler is a scheduler which generates a rate optimal, processor optimal, and delay optimal schedule that meets all precedence constraints of the given FSFG [8]. Moreover, since a cyclo-static schedule is periodic with respect to iteration period bound, it is only necessary to specify the schedule information for one iteration period [3]. The Cyclo-static Scheduler uses a constrained recursive, depth first search to find an optimal schedule. To find a rate and processor optimal schedule, the Cyclo-static Scheduler identifies all critical loops, and determines the slack time of all other loops. A critical loop is any loop whose loop bound is equal to the IPB. Non-critical loops have spare time in the computational deadlines of all operations in a loop, and this spare time is called the slack
370
H.-K. Kim
Pr#l Pr#2 Pr#3
140 130 30. 20 190
120 50 18~
150 160 80... 90 i70 200 .
.
.
10o
70
4i -
.
Figure 3: Cyclo-static Schedule of Fourth Order Wave Digital Filter time. The slack time (ts,) for a loop I is: t s, - n t T o - ~
(4)
di
where nt =
total number of delay elements in loop l
To -
IPB
di =
computational delay of the
jth
node of the Ith loop
l -- loop index In a valid rate and processor optimal schedule, all operations in critical loops are scheduled sequentially without gaps, and all operations in non-critical loops are scheduled sequentially with a maximum total gap equal to the slack time of the loop. To obtain a delay optimal schedule, a cyclo-static scheduler computes the Throughput Delay of each I/O path using Eq. 3, and arranges the I/O paths in order of decreasing Throughput Delay. All operations in the minimum delay (critical) path are scheduled contiguously, and all operations not in the minimum delay path are scheduled sequentially with a maximum total gap equal to the difference between the PDB and the corresponding I/O path Throughput Delay. Figure 3 shows a cyclo-static schedule of the fourth order Wave Digital Filter shown in Figure 2. Each row of the schedule will be executed by a single processor, and each column corresponds to each time step. As stated above, since cyclo-static schedules are rate, processor, and delay optimal, it is clear that the implementation based on the cyclo-static schedules can operate at the highest rate, with the least delay, and with the highest processor efficiency. However, cyclostatic schedules may have complex communication architectures [8], and this has led to the development of a schedule modification. 3.4
Schedule Modification
The objective of the schedule modification is to minimize the number of processing elements and communicating edges between the constituent processors in VLSI implementation. This can be achieved by exploiting the property of the cyclo-static schedule that rotating the columns and permuting the rows of the schedule do not affect the optimality of the schedule but only the communication structure between the processors and frame of the reference of the schedule [2]. In this context, a processing element means an element of a processor which performs arithmetic operations. In general, the number of processors in the original schedule and the modified schedule may be different. However, since the structure of the constituent processor of the original
A Graph Based DSP Chip Compiler
time
Prl(muitlj
.
.
.
.
.
PrpNM+l(addl) .
.
.
.
.
.
.
.
.
1 Mll
' .
.
.
.
.
.
.
.
.
.
.
.
.
9
APNM+I .
.
.
.
.
.
I
371
IPB M1 IPB MPNM IF' APNM+I IPB o
PrPNM+pNA(addpN A) .... ApNM+PNA1
APNM+PNA IPB
Prk: processor k Mmn" multiplication node to be computed by processor m at time step n Amn" addition node to be computed by processor m at time step n Figure 4: General Form of Rearranged Processor Schedule schedule is different from that of the modified schedule, the processor bound condition is nevertheless conserved. The following steps describe the procedure for finding a modified schedule from the schedule generated by a cyclo-static scheduler. 1. Find the minimum numbers of multipliers (PNM) and adders (PNA) necessary for implementing the schedule9 They are given by
PNM= PNA=
max NMi I
(5)
max NAi
(6)
i
where
NMi = the number of multipliers operating at time step i NAi = the number of adders operating at time step i For the Schedule shown in Figure 3, P N M = 1, and P N A = 3. 2. Rearrange the original Schedule so that each processor performs only one type of operation by permuting the rows of the Schedule. Figure 4 depicts a general form of the rearranged Schedule. In the rearranged Schedule, the number of processors may be increased9 However, since each processor in the rearranged Schedule performs only one type of operation, the actual number of processing elements is always reduced. Thus, the following inequality always holds.
Number of multipliers and adders in the original Schedule >_ Number of multipliers and adders in the rearranged Schedule 3. Create a Communications Table [4]. The CT specifies all the communicating edges that are assigned to a particular time step of a schedule. Table 1 shows a CT for the schedule of the fourth order Wave Digital Filter shown in Figure 3.
H.-K. Kim
372
1 3 ~ _., 2 ~ 140 - , 130 140 - , 150
13 ~ 20 20 20 190
2 .-, --, _., .., --.,
190 ~
12 ~ 50 41 21 200
3 18 ~ - , 17 ~ 120 --~ 80 120 -., 100 120 ~ 150 50 ~ 8 0
15 ~ 170 170 80 80
4 -, -, --, ~ ...,
160. 160 200 70 70
5 9 ~ ..,' 10 o 160 - - , O u t 20 ~ --, 191
7 7 ~ _.} 41
6 10 ~ _.~' 7 O 100 --~ 12 l 100 --, 141
8 41 _.. 31 41 ..., 51
180
Table 1" Communications Table 4. At each time step (starting from the time step 2), assign each multiplication node (Mli,"" , M p N M i, where 1 <_ i < P N M ) to processor #1, and then, compute the number of additional communicating edges needed for each try. Repeat this step for the remaining processors (Pr2,..', PrpNM). When a node has N P predecessors and N S successors, the node needs N P + N S communicating edges for the worst case. If N E additional communicating edges are required by assigning this node to processor Pri, the actual number of additional communicating edges will be N E - N P - NS. Assign the node to processor Pri which gives minimum N E - N P - NS. Assign the next node processor Prj which gives next minimum N E - N P - NS. Repeat this procedure until all entries at each time step are filled. The computation complexity for this step is rv~z..,=2 n 2 ). 5. Perform step 4 for addition nodes. The computation complexity for this step is O~v.PNA n2). kZ-,n=2 6. Repeat step 4 and 5 for the remaining columns of the modified schedule. The computation complexity for the combined steps is O((IPB - 1) 9 [K",PNh4 ~ n = 2 n2 + ~'~PNA z_,n=2 ~t2). This number is computationally tractable. A modified schedule obtained in this way can be implemented with the minimum number of interconnections between the processors but still conserving the optimality conditions imposed on the original schedule. Figure 5 shows the modified schedule of the fourth order !
time P r # 1 (mult) P r # 2 (add) Pr#3 (add) P r # 4 (add)
1 30 14 o
2 130 2o 19~
3 180 120 5~
4 150 80 17o
..
5 90 160 200
6
7
8
7~
41
100'
..
Figure 5" Modified Schedule of Fourth Order Wave Digital Filter Wave Digital Filter. To implement the modified schedule we need one multiplier and three adders. The required number of the communicating edges between the processors is seven. If we want to implement the original schedule, we need three multipliers, three adders, and for the worst case, up to twelve communicating edges.
A Graph Based DSP Chip Compiler
TypeI
Type II
373
Type I|I
Figure 6: Prototypes of Processors 4
CIRCUIT GENERATOR
The Circuit Generator module translates the modified schedule into a hardware architecture composed of modules such as functional units, storage units and I/O devices together with the control signals required to implement the algorithm. The resulting structural description is used as input to Layout Generation module for generating a physical layout. 4.1
T h e A r c h i t e c t u r e of the Target Processors
Figure 6 shows the three prototypes of target processors. Since the schedule is modified so that each processor in the schedule performs only one type of operation, each processor has only one type of processing element. Depending on the nature of input algorithm, the proper types of target processors will be selected.
4.2
Circuit Synthesis
Taking as input a modified cyclo-static schedule and the associated FSFG, the Circuit Generator automatically generates a complete circuit diagram and control specifications using the prototypes of processors defined above. The procedure to create the desired output is briefly described as follows. More details can be found in [5]. 1. Determine the type of each processor (multiplier or adder) in the modified schedule. 2. Specify the types of multipliers (Type I or II) to be used. 3. Find the communication architecture between the processors and perform register (REG1) allocation. 4. Perform multiplexer allocation. 5. Find the control signals for loading REGls and selecting the inputs of MUXls and MUX2s.
H.-K. Kim
374
1
2
3
,: .... ,I II
~
#2
T
t
in
#4
1
~
Figure 7: Circuit Diagram of Fourth Order Wave Digital Filter Figure 7 and Table 2 show the circuit diagram and control signals for the fourth order Wave Digital Filter. The control signals are used for loading registers and selecting multiplexor inputs, and represented as a Truth Table which is periodic. Thus, only one period of the table is sufficient for the generation of required control circuitry. The circuit diagram and control specifications can be applied to the Layout Generator for VLSI layout generation. 5
CONCLUSION
This paper described a complete chip compiler which can generate a circuit diagram with the associated control signals efficiently from a given DSP algorithm. Especially, we focused on the development of a schedule modification methodology which can be applied to schedules generated by a cyclo-static scheduler. In the future, a design synthesis system may be developed by combining the chip compiler with a proper CAD tool.
A Graph Based DSP Chip Compiler
I,Ume
i
375
11 21
31
41
51
61
7
1
11 3,, 1 0. 01 1 1
X
X
Ii
X X
8l ,
I p r # 1 j R1 9
, ' pr # 2
;
i M1
1
"
i
;
'
3
2
.
i
i
X
3
X!
,,
X X
!
Rll . .0' . 0 I . X. 1 1 1 X X X i'R12 : X " 1 1 01 X l Xl r R13" 0" 0" 0 0 i 0 0 o R14 1 1 1 0, 0, x x~i " Ri5 ~ X X X.' X 1 X X X ~Mll r X ' X' X' x 2' 1 .... 2 X i "Mi2 ~ 1 ' 1" 2' X X X X X X I ' M21 ' R l l 'R13 ' R l l ' R12 R l l ' Rll i X " M22 ' R14"i R14 ' RI~i ' R14 R14 R14 X X p r ' # 3 " Ril ' i .... 0 ' '0 ' 1 1' O' 0 1 ' Ri2" 0i 0!, i', 0 0 1 0 o 'R13" 1 O' 1 0 0 0 0 0 1R14' X' X' X' 1 1 X i x X X 2 3 X "Mll i 3 X 3 'M12 ~ X ' X ' X' 3' 4' X 2 X ' M i 3 i in X 2 XI X X X X' ' M21 ' X ' R12 " R l l ' R l l "R12 ' X ' Rll I R l 1 ' "M22 i' X R13' R13' R14' R14' X' pr #4 X X 9
t,
.
.
.
.
,
,,|
i
[
|,
,!
|
. . . .
|
|
!
|
|
i
|
!
,
,i,,
9
|
,
,,
|
!
i
|
|
I
I
,
,
X
!
|
|
. . . .
9
i
i
|
X
Table 2: Control Signals for Circuit Diagram shown in Figure 7 Acknowledgements The author would like to thank tile members of Media Application Section, ETRI for their support.
References [1] T.P. Barnwell III and C.J.M. Hodges, "Optimal Implementation of Signal Flow Graphs on Synchronous Multiprocessors," International Conference on Parallel Processing, Belaire, Michigan, pp. 90-95, Aug. 1982. [2] H. R. Forren, Multiprocessor Design Methodology for Real-Time DSP Systems Represented by Shift-Invariant Flow Graphs, Ph.D. thesis, Georgia Institute of Technology, May 1988. [3] P. Gelabert and T.P. Barnwell III, "Optimal Automatic Periodic Scheduler for Full" Specified Flow Graphs," IEEE Trans. on ASSP, March, 1993.
376
H.-K. Kim
[4] P. Gelabert and T.P. Barnwell III, "Optimal Automatic Periodic Multiprocessor Compiler for Multi-bus Networks," International Conference on Acoustics, Speech, and Signal Processing, San Francisco, pp. V593-V596, March, 1992. [5] H. Kim and T. P. Barnwell III, "A Chip Compiler for Rate-Optimal Multiprocessor Schedules", International Symposium on Circuits and Systems, !EEE, San Diego, May, 1992. [6] S. H. Lee, A Unified Approach to Optimal Multiprocessor Implementation from NonParallel Algorithm Specifications, Ph.D. thesis, Georgia Institute of Technology, Dec. 1986. [7] M. Renfors and Y. Neuvo, "The Maximum Sampling Rate of Digital Filters Under Hardware Speed Constraints," IEEE Trans. on Circuits and Systems, pp. 196-202, March 1981. [8] D.A. Schwartz and T.P. Barnwell III, VLSI Signal Processing II, "Cyclo-Static Solutions: Optimal Multiprocessor Realization of Recursive Algorithms," ed. H. T. Kung, IEEE Press, 1986. [9] C. Shung, R. Jain, K. Rimsay, E. Wang, M. Srivastava, E. Lettang, S. Azim, B. Richards, P. Hilfinger, J. Rabaey, and R. Brodersen, "An Integrated CAD System for AlgorithmSpecific IC Design," IEEE Trans. on CAD, May, 1991.
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
377
RESOURCE-CONSTRAINED SOFTWARE PIPELINING FOR HIGH-LEVEL SYNTHESIS OF DSP SYSTEMS I F. S~NCHEZ, J. COKTADELLA
Polytechnic University of Catalonia Department of Computer Architecture Campus Nord, Mbdul D6 08071 Barcelona (Spain) [email protected] ABSTRACT. This paper presents UNRET (Unrolling and Retiming), a new approach for software pipeUning with resource constraints which is suitable for high-level synthesis of DSP systems. UNRET works with the data-flow graph which describes the loop body. Two graph transformations are considered: loop unrolling and retiming. The target architecture is composed of a limited number of different (possibly pipelined) functional units. The goal of UNRETis to find a scheduling that maximizes the resource utilization of the architecture. UNRETimproves the results obtained by other systems, achieving optimal solutions in most cases.
KEYWORDS. Software pipelining, loop pipelining, unrolling, retiming, scheduling, resource utilization.
1
INTRODUCTION
Software Pipelining techniques [11] attempt to overlap the execution of different loop iterations, by searching for a new loop body (steady state) with more execution parallelism. In general, in order to maintain the semantics of the original loop, a piece of code (prologue) must be executed before starting the execution of the steady state pattern. Similarly, another piece of code (epilogue) is usually necessary after the execution of the last iteration of the steady state. This paper focuses on how to obtain the steady state pattern. These techniques have received much attention recently. Techniques such as modulo scheduling [19], URPR [23], perfect pipelining [1], or Lam's algorithm [11] have been proposed for parallel architectures, while loop folding [6], retiming [13], loop winding [4] and functional pipelining [7] have been proposed for high-level synthesis. aThis work was supported by CYCYT TIC-91-I036
F. Sdnchez and J. Cortadella
378
UNRET is a new software pipelining approach which aims at maximizing resource utilization of the target architecture. The architecture is composed of a limited number of functional units (FUs). For each FU, the following characteristics are defined: the types of operations executable by the FU, the execution time, T(u), and the latency, L(u), of each operation u. The latency is defined as the minimum cycle count between succesive inputs to the FU (for pipeUned FUs, T(u) > L(u)). The scope of the paper is limited to single nested DO-like loops whose body is a basic block. A loop is represented as a labelled directed graph G(V, E, A, ~), called r-graph, where V is the set of nodes (operations), E is the set of edges (data dependences), and A and are two labelling functions representing the iteration index (,~) for each operation and the number of iterations (distance) traversed by each data dependence ($). The paper is organized as follows: Section 2 shows how a 7r-graph can be transformed while maintaining the semantics of the loop. Section 3 describes the two graph transformations used by UNRET. Bounds on the initiation interval (II) and the resource utilization of any schedule of a loop are estimated in Section 4. Section 5 explains the UNRET algorithm, as well as some details about scheduling. An algorithm to reduce the number of registers required to execute the loop is described in Section 6. Some results obtained by using well-known examples are presented in Section 7, with conclusions in Section 8.
2
EQUIVALENT
GRAPHS
Initially, a loop ~r = G(V,E,A,$) is represented by a ~r-graph in which A(u) = 0 for all operations u E V. However, a loop can be represented by different, though equivalent, r-graphs. Two r-graphs, r = G(V,E,A,$) and 7r' = G(V,E,)~,~), are equivalent [21] (represent the same loop) if V(u, v) E E:
~ ( v ) - )~(u)-}- ,~(u, v) = A'(v) - ~'(u)-t- $'(u, v)
Figure 1: Scheduling of equivalent 1r-graphs
(1)
Resource-constrained Software Pipelining
379
In general, the scheduling constraints imposed by dependences decrease as their distance increases. The example shown in Figure 1 depicts two equivalent r-graphs and their schedules (assuming all operations are additions that can be executed in one cycle, and the architecture has three adders). Edge labels identify the distance (8) of each dependence. An unlabelled edge denotes a data dependence with distance 0. Such edges represent a data dependence between two operations belonging to the same iteration, a.nd are called Intra. Loop Dependences (ILD). Edges e E E with 8(e) > 0 represent data dependences between operations from different iterations. They are called Loop-Carried Dependences (LCD). On the other hand, node subscripts denote the iteration index (A) for each operation. Thus, operation Ai denotes the execution of operation A at the ith iteration. Both r-graphs in Figure 1 represent the same loop (Equation (1) is fulfilled for each dependence). Each iteration of the loop ill Figure l(a) requires two cycles to be executed due to the existence of ILDs (an ILD, Ai ~ Bj, states that operation A from iteration i must be executed to completion before starting the execution of operation B from iteration j). The LCDs are always honored because of the sequential execution of the steady state. Due to the existence of ILDs, no schedule less than 2 cycles exists for the r-graph from Figure l(a). This r-graph corresponds to the initial representation of the loop. Its schedule does not require either prologue or epilogue 2. However, the loop body in Figure l(b) may be scheduled in only one cycle (H = 1) since no ILD exists a. This schedule contains operations belonging to two different iterations of the original loop (i and i + 1), and the execution of the new loop requires the execution of a prologue and an epilogue.
3
GRAPH
3.1
TRANSFORMATIONS
Dependence Retiming
Since ILDs constrain the scheduling of the loop body, we are interested in increasing their distance by transforming them into LCDs. The transformation dependence retiming is defined to achieve this goal. Given a dependence e = (u, v), dependence retiming transforms (~(e) according to equation (1) by performing the following steps: * A'(u):= A ( u ) + 1
9 9
w):=
+ 1,
V(u,w) E E
l,
V(w, u) E E
Dependence retiming is equivalent to operation retiming, as was defined by Leiserson and Saxe in [13]. Dependence retiming yields a r-graph equivalent to the original one. ~The k iterations of the loop are executed by the steady state. ZThe distance of dependences A ----* B and A -----,C has been updated by changing the iteration index of A according to Equation (1).
F. Sanchez and J. Cortadella
380
3.2
Loop Unrolling
In general, finding an optimal schedule of a loop requires more than one instance of the loop body [9]. The loop unrolling transformation [21] of a 10op ~r generates a new loop body lr K in which each operation and each dependence are repeated K times (the loop is unrolled K - 1 times)J20]. The effectiveness of loop unrolling is iUustrated by using the example in Figure 1. Figure l(a) shows a possible schedule when two FUs are available. One iteration is executed every two cycles (H = 2). However, if the loop is unrolled once, a schedule with shorter initiation interval can be found by applying dependence retiming, as shown in Figure 2. In this schedule, 2 iterations are executed every 3 cycles (H = ~).
Figure 2: Schedule with 2 FUs after loop unrolling and dependence retiming 4
4.1
BOUNDS ON THE INITIATION INTERVAL AND UTILIZATION
THE RESOURCE
M i n i m u m Initiation Interval
The initiation interval of a loop schedule is limited by the set of FUs of the architecture and the cycles (recurrences) formed by the dependences of the x-graph. Two lower bounds for the initiation interval can be distinguished: 9 recMII: the minimum initiation interval due to the recurrences of the loop body. 0
if the loop has no recurrences
9
recMII =
T(u)
(u,v)en maxRc_E Z ~u~v)
iftheloop has recurrences
(u,v)r
where R is a cycle (recurrence) of the dependence graph. 9 resMII: the minimum initiation interval due to the resources.
resMII = max vu~n~ VR~ ni where Ri is a resource type of the architecture, ni is the number of resources of type Ri available in the architecture, u M Ri states that operation u is executed in a FU of type Ri, and L(u) is the latency of u.
Resource-constra#wd Software Pipelining
381
The minimum initiation interval ( M I I ) achievable for any schedule of the loop is the maximum of the two previous lower bounds [22]. MII = max(recMII, resMII)
4.2
M a x i m u m Resource Utilization
A schedule so that each iteration takes MII cycles achieves a maximum utilization of the resources. The Resource Utilization (U) of a schedule depends on' 9 the sum of the latencies of all operations in the loop, L = ~
L(u)
u
9 the number of instances of the loop (K) involved in the schedule 9 the number of available resources (R) 9 the number of cycles of the schedule 4 (IIK) The resource utilization of any schedule of r K in IIK cycles is the fraction: x y
L.K R . IIi~"
For a target resource utilization, the loop unrolling degree (K) and the expected initiation interval (IIK) of the schedule can be computed by solving the following 2-variable linear Diophantine equation [2]: x . R . IIK - y . L . K = 0
5
(2)
UNRET
5.1
Farey's Series
Since the initiation interval of a loop schedule is bounded by MII, an upper bound for the resource utilization (MaxU)also exists. For a given loop and for a target architecture, all the possible values for the resource utilization of a loop schedule can be ordered in decreasing order of magnitude starting from MaxU. This sequence is defined by Farey's Series [23]. Farey's Series or order D (FD) defines the sequence (in increasing order)of all the reduced fractions with nonnegative denominator _< D. For example, Fs in the interval (0,1] is the series of fractions:
F5-
01112132341 1'5'4'3'5'2'5'3'4'5'1
Let ~ be the ith element of the series. FD can be generatedby the following recurrence: 9 The first two elements are respectively ~ : ~ and "~1 = ~i 4Note that IIK is tile number of cycles to execute K iterations of the loop. Every iteration is executed in II = ~h" cycles.
R Sdnchez and J. Cortadella
382 9 The generic term ~
XK+2
can be calculated as:
IY +D I
= t Y g + i J "XK+,
- Xz~-
.
IYK+DI
11:'+2 = I.' Yg+t J " YK+I
-
-
YK
Since U must be explored in decreasing order (starting from U = MaxU), and the range for U is U E [0, 1], we are interested in the series 1 - FD. For each value of U, the pairs (IIK, K) can be computed by solving equation (2). Figure 3 shows an example of generation of pairs (IIK, K) (the architecture has 4 adders, each of which performs an addition in 1 cycle). Figure 3(a) shows an example of a rgraph, in which all operations are additions. Figure 3(b) shows a diagram representing all possible pairs (IIK, K). Each point in the diagram represents a possible schedule. Point A represents a schedule of 3 instances of the loop in 4 cycles. The existence of such a schedule depends on the topology of the dependences of the loop. Point B represents a time-optimal schedule. Point C represents a schedule with the same resource utilization as point A, but with a longer IIK (the initiation interval for each iteration is the same). Figure 3(c) shows the schedule found by UNRET, which corresponds to point A after dependence retiming.
Figure 3: Exploration of resource utilization (a) Example of loop (b) Diagram representing the resource utilization for 4 adders (c) Schedule found by UNRET 1
Farey fraction
T
(ILK, K)
(5,4)
31
16
3"2"""1"7
I__5
29
16
31 "'" 19
(4,3) (8,6)
I__7
2~
8
28
9 " " " 31
(7,5)
26
5_ 6
(3,2) (6,4)
Table 1" Farey's Series F32 and legal pairs (IIK, K) associated with IIK <_8 In order to generate the Farey's Series, a ma~mum length for the schedule (the denominator D = R. IIg must be finite). This bound is called MaxH. we are interested in pairs (IIK, K) so that IlK <_MaxII. Such pairs are called Some Farey's fractions do not correspond to any legal pair, while other fractions
is required Therefore, legal pairs. correspond
Resource-constrained Software Pipelining
383
to several of them. Table 1 shows the Farey's Series F32 (for 4 adders and MaxlI = 8) which generates the pairs represented in Figure 3(b). For each fraction, the associated legal pairs (solutions of Diophantine equations) are shown when they exist.
UNRET explores the sequence of legal pairs (IlK, K) until a schedule with initiation interval IIK is found. For each pair (IIK,K), x K is built (by using loop unrolling) and iteratively transformed (by using dependence retiming). The ~'-graph is scheduled after each dependence retiming transformation. 5.2
Scheduling A l g o r i t h m
Let ASAPu(v) denote the first cycle at which operation v E V can be scheduled 5 due to dependence e = (u, v), and by ASAP(v) the first cycle at which operation v can be scheduled by considering all its predecessors in the 7r-graph. Let us assume that u has been previously scheduled, and let H be the expected number of cycles for the schedule. As was shown in [22], for any dependence e = (u, v) C E, ASAP~(v), is given by
ASAPu(v) = max(0, S(u) + T ( u ) - II. 6(e))
(3)
where T(u) is the execution time of operation u, S(u) is the cycle at which operation u has been scheduled and 8(e) is the distance of dependence e.
ASAP(v)=
maxg(ASAPu(v))
V(u,~)~
(4)
UNRET uses the well-known list scheduling as scheduling algorithm [5]. The operation selected is ASAP-scheduled by taking the available resources into account. More details about the scheduling algorithm can be found in [22]. 5.3
Fitted-in Schedule
The schedules of two consecutive iterations can be overlapped (in the same way as the execution of iterations from the original loop can be overlaped by software pipelining). Such schedules may not have a rectangular shape, as shown in Figure 4. In such schedules, part of an operation can be executed by an iteration, whereas the rest is executed by following iterations. Therefore, the schedule of an iteration may fit in with the schedule of the next iteration. The number of cycles required for a schedule can be significantly reduced by fitting the schedule in, while resources are available and dependences allow it. Figure 4 shows such an example of a fitted-in schedule. The ~r-graph in Figure 4(a) corresponds to the resolution of the differential equation from [17]. The target architecture has 2 non-pipelined multipliers that perform a multiplication in 2 cycles and 1 ALU which is able to execute additions, substractions and comparisons in a single cycle. A schedule with H = 6 can be obtained because of the fitting-in feature (Bi s t a r t s at iteration i and completes at iteration i + 1). SWe consider that the first cycle of a schedule is cycle 0.
F. Sdnchez and J. Cortadella
384
Figure 4: DifferentiM equation and fitted-in schedule (a)lr-graph of the loop body (b)A schedule for 2 multipliers and 1 ALU (c)Non-rectangular shape of the schedule (b) 5.4
UNRET: Algorithm
The UNRET algorithm is next sketched. The function find_schedule succesively transforms the ~r-graph by means of dependence retiming until the best graph for scheduling is found. Heuristics are provided to determine when a graph cannot be further improved for scheduling[22]. Such heuristics are based on shortening the critical path and reducing the number of ILDs. Finally, the best graph is scheduled. P r o c e d u r e UNRET Calculate MII and MaxU; Repeat Generate_pair(I/k, I(); ~rK := unroll(~r, K); schedule: =find_schedule(Tr K , IIK ); until length(schedule)=IIK ; end 6
SPAN REDUCTION
Once a schedule with a given resource utilization U has been found, UNRET attempts to reduce the SPAN (range of iterations involved in the schedule) while maintaining the initiation interval. In general, a reduction of the SPAN produces both; a reduction in the size of the prologue and the epilogue and a reduction in the number of registers required to store partiM results across iterations (due to shorter variable lifetimes). We next describe the steps to reduce the SPAN of a 1r-graph while maintaining a schedule in H cycles. First, the maximum value for A(u) (Amax) is computed by exploring all nodes. Then, this value is iteratively decreased until no schedule in //cycles is possible. Each time an operation index is decreased, the algorithm attempts to reduce the number of ILDs
Resource-constrained Sojhvare Pipelining
385
(without increasing the SPAN) before scheduling the loop. The number of required registers is calculated after each schedule is found. The final schedule is the one with fewest registers. In order to reduce Amax, the algorithm to reduce the SPAN uses a transformation similar to dependence retimin 9 that decreases A(u) by also transforming ~(e) for the incoming and outgoing edges of u.
7
E X P E R I M E N T A L RESULTS
We have implemented UNRET in C++ on a SUN-4 SPARC-10 workstation. UNRET has been executed in several examples. In particular we present Cytron's loop [3], the differential equation [17], the 16-Point Digital FIR Filter [15], the Fifth Order Elliptic Filter [10], the Fast Discrete Cosine Transform Kernel [14] and the 256-point Discret Fourier Transform Algorithm [6]. Tables 2-8 show the results. The CPU-time is given in seconds. All examples have been executed with MaxH = 30. We compared the different examples with Force Directed Scheduling (FDS) [16], Percolation Based Synthesis (PBS)[18], SEHWA [15], Pipelined Synthesis (PLS)[7], ATOMICS in the CATHEDRAL II compiler (ATM) [6], the Integer Linear Programming Approach ALPS [8] and the algorithm Theda.Fold (TF) [12]. All systems with the exception of FDS perform loop pipelining. Tables 2-8 show the number and type of FUs which are available, the MII for such set of resources and the initiation intervals achieved by different techniques. The last columns indicate the unrolling degree (K) and the number of cycles (IIK) of the schedule found by UNRET, as well as the time used to find the schedule.
UNRET significantly improves some results obtained by the above mentioned approaches, achieving optimal schedules in most cases. Since ALPS is an integer linear programming approach, its results are time-optimal. UNRET finds schedules with the same initiation interval as ALPS for most cases. Moreover, some results are improved by the fit-in feature of the schedule (see table 3). By initially unrolling the loop, significant improvements over other techniques have also been obtained (see tables 2, 4, 7 and 8). ["Resources [[" MH ][ ....
..il
I Unlimited d 5 .... 4 . 3
Algorithms
il PBs I ATM| PLS [ TF[ 3 3 3.4 4.25 5.66
3 3 3 3 4 4 5 5 6 '..ii ~ 6
3 3 4 5 6'
3 3 4 5 6
.... UNRET
3 3 3.4 4.25 5.66
Table 2: Cytron's example
][ (IIK,K)[] c P u [] !1 [I (secs i II 'i3,1) 0.13 (3,1) 01i'3 (17,5) 1:85 (17,4) 1.20 (17,3) 0:70
386
F. S~nchez and J. Cortadella
[I Resources II MII !!
I* I ALUs !1 .
3 2 2
2 2 i
II (HK, K)rl cp_u
Algorithms
I! FDs I.ALPS ] UNRET
.
.
.
6 6 6
.
6 7 ,
.
.
.
.
6 7 -
.
.
.
.
.
6 6 6 .
il
.
l(secs)
(6,1) (6,1) (6,1)
.
,
0.05 0.10 0.2-0
Table 3" Differential Equation
II Algorithms I! (H~,IC)If (secs Ci~U !1 ii MH S E H W A FDS U]VRET I1'1 + ll I " I Trl II ' I1... )11 II 31 6/112.66 il 3 i13. l.." 2.66 i} (8,3)11 4:o5 ll II 31 5 112.6o_11 3 I 3 i 3 2.66 (8,3), !1 4,00 il I! Resources
Table 4" 16-Point Digital FIR Filter
3 2 2
3' 2 2 1
16 16 i6 !6
17 18 19 21
16 !9,
17 18 18 21
UNRET ~' 16 16 16 17 ' 17 ' 19 ....... 19
"i16,1)' (16',1) "(17,1)' (19!1)
0.61 0.70 1.21 3'!1
Table 5: Fifth Order Elliptic Filter with Non-Pipelined Multipliers
II
Resources 3 3 2 i
2 1 1 1
!1.... !! 16 16 16 26
.... 17 18 19 -
16 17 28
!i (I)K; K)II
cPu
(16,1) (16,1) ~ (17,1) (2811)
0:53 0.61 1.8.5 ~ 1.70
Algorithms' 17 18 19 -
,,
17 ..... 28
16 16 17 28
Table 6: Fifth Order Elliptic Filter with Pipelined Multipliers I Res0urces [[ MII [i ...... Algorithms +l'_" I.* S E H W A [ P L S I TF 3 3 5 5 6 2.66 4 4 4 4 444 5 5 5 3 3 4 4.33 6 6 6 3 -3 3 5.33 8 10 10 222 8 17 16 17 17 1 1 1 .,,
....
il (u,,-; 1c) il CPU .11
UNRET I[ _.
2.66
(8,3)
64.0
4
(4,1)
2.78
4.4
(22,5)
5.33
(1G3)
88.4 27.3
(8,1)
1.2.5
8 .
16
.,,
Table 7: Fast Discrete Cosine Transform
.
.
.
.
.
.
(16,1)
1.55
II
Resource-constrained Software Pipelining
AL U i 1 i "i 1
Resources RAM ACU i 1 2 2 2 3
1 1 2
4 1!5
2
li5
ATM 4 2
Algorithms PBS UNRET 4 2 1.5 1.5 ,,
......
'
2
,
4'
,3 ..... 4
2
1
i
387
Ii!,,,, H(~ec~) jj (4,1) (2,1)
0.10 0.18
(3,2) (3,2)
0.~0 o.5o
. (1,1)
o.16
Table 8: 256-point discrete Fourier transform algorithm 8
CONCLUSIONS
This paper has presented UNRET, which is a new algorithm for resource-constrained software pipelining. The algorithm is based on the exploration of the resource utilization of a schedule. For a target resource utilization, the degree of unrolling of the loop and the expected initiation interval of the schedule are analytically computed. In order to perform software pipelining, a loop transformation called dependence retiming has been proposed. Multi-cycle operations and pipelined functional units have been taken into account. UNRET is an appropriate methodology for high-level synthesis of DSP systems. Its effectiveness has been shown in several examples. Loops with conditional statements and nested loops will be studied in the future.
References
[1] A. Aiken and A. Nicolau. Perfect Pipelining: A new Loop Parallelization Technique, volume 300 of Lecture Notes in Computer Science, pages 221-235. Springer Verlag, March 1988. [2] U. Banerjee. Dependence Analysis for Supercomputing. Kluwer Acad. Pub., 1989.
[3] R. Cytron. Compiler-Time Scheduling and Optimization for Asynchronous Machines. PhD thesis, University of Illinois at Urbana-Champaign, 1984.
[4] E.F. Girczyc. Loop winding: A data flow approach to functional pipelining. In Proc. Int. Syrup. Circuits and Systems, pages 382-385, May 1987. [5] E.G. Goffman Jr. NewYork, 1976.
Computer and Job Scheduling Theory. John Wiley and Sons,
[6] G. Goossens, J. Vandewalle, and H. De Man. Loop optimization in register-transfer scheduling for DSP systems. In Proc. of the 26th Design Automation Conf., pages 826-831, 1989.
[7] C,-T. Hwang, Y-C. Hsu, and Y-L. Lin. Scheduling for functional pipelining and loop winding. In Proc. of the 28th Design Automation Conf., pages 764-769, June 1991. [8] C,-T. tIwang, J-It. Lee, and Y-C. Hsu. A formal approach to the scheduling problem in high level synthesis. IEEE Trans. on CAD, 10(4):464-475, April 1991.
388
F. Sdnchez and J. Cortadella
[9]
R.B. Jones and V.H. Allan. Software pipelining: A comparison and improvement. In Proc. ~3rd Ann. Workshop on Microprogramming and Microarchitecture, pages 46-56, November 1990.
[101
S.Y. Kung, H.J. Whitehouse, and T. Kailath. VLSI and Modern Signal Processing. Prentice Hall, 1985.
[11] M. Lain. Software pipelining: An effective scheduling technique for VLIW machines. In Proc. of the SIGPLAN'88, pages 318-328, June 1988. [12] T-F. Lee, A. C-H. Wu, Y-L. Lin, and D.D. Gajski. A transformation method for loop folding. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 13(4):439-450, April 1994. [13] C.E. Leiserson and J.B. Saxe. Retiming synchronous circuitry. Algorithmica, 6:5-35, 1991. [14] D.J. Mallon and P.B. Denyer. A new approach to pipeline optimization. In Proc. European Conf. Design Automation, pages 83-88, 1990. N. Park and A.C. Parker. Sehwa: A software package for synthesis of pipelines from behavioral specifications. IEEE Trans. on CAD, 7(3):356-370, March 1988. [16] P.G. Paulin and J.P. Knight. Force-directed scheduling for the behavioral synthesis of ASICs. IEEE Trans. on CAD, 8(6):661-679, June 1989.
[171
P.G. Paulin, J.P. Knight, and E.F. Girczyc. HAL: a multi-paradigm approach to automatic data path synthesis. In Proc. of the 23th Design Automation Conf., pages 263-270, 1986.
[18] R. Potasman, J. Lis, A. Nicolau, and D.D. Gajski. Percolation based synthesis. In Proc. of the 27th Design Automation Conf., pages 444-449, 1990. [19] B.R. Ran and C.D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proc. of the 14th Annual Workshop on Microprogramming, pages 183-198, October 1981.
[2o]
M. Rim. High-Level Synthesis of VLSI Designs for Scientific Programs. PhD thesis, University of Wisconsin-Madison, 1993.
[21]
F. SAnchez and J. Cortadella. Resource-constrained pipelining based on loop transformations. Microprocessing and Microproyramming, 38(1-5):429-436, September 1993.
[22]
F. SAnchez and J. Cortadella. Unrolling and retiming: A new approach for resourceconstrained software pipelining. Technical Report RR-94/17, UPC-DAC, August 1994.
[23]
M.R. Schroeder. Number Theory in Science and Communication. Springer-Verlag, 1990.
[24]
B. Su, S. Ding, and J. Xia. URPR: An extension of URCR for software pipelining. In 19th Microprogramming Workshop (MICRO-19), pages 104-108, October 1986,
Algorithms and Parallel VLSI Architectures III M. Moonen and F. Catthoor (Editors) 9 1995 Elsevier Science B.V. All rights reserved.
389
A PORTABLE TESTBED FOR EVALUATING DIFFERENT APPROACHES TO D I S T R I B U T E D L O G I C S I M U L A T I O N
P. L U K S C H Lehrstuhl fr Rechnerteehnik und Rechnerorganisation Technische Universitat Manchen D-80~90 Manehen; Germany luksch @inf ormatik, tu- m uenchen, de
ABSTRACT. This paper presents a flexible and portable testbed that enables an unbiased comparison of different methods for distributed logic simulation. The great variety of algorithms for parallel discrete event simulation that have been proposed up to now is subdivided into a small number of fundamentally different approaches. Criteria for the classification are the distribution of functions and data structures and the way processes are synchronized. Based on this classification, a representative set of paraUelizations has been selected and applied to a gate level logic simulator. Run-time measurements will be presented for the iPSC/2 and iPSC/860 distributed memory multiprocessors. KEYWORDS. Distributed logic simulation, distributed memory multiprocessors.
INTRODUCTION In recent years, simulation has become an indispensable tool in VLSI design. As system complexity increases, simulation is used not only for validation purposes but also helps to design system hardware and software in parallel and to evaluate performance of different design alternatives. Sequential computers no longer can cope with the increasing demand for computational power that emerges from the desire for more comprehensive simulation of increasingly complex systems. Currently, parallel simulation on general purpose multiprocessors is the most promising and cost-effective way of providing the necessary compute power. A great variety of methods for executing discrete event simulation in parallel have been proposed in the literature. For most of them, prototype implementations have been reported
390
P. Luksch
with different simulators on different target architectures. Run-time measurements ranging from no speedup at all to super-linear speedup do not clearly favor any specific approach. Parallel simulation efficiency not only depends on the parallelization method employed. Properties of the simulation problem and parameters of the target architecture play an important role, too. For an unbiased comparison of different methods run-time measurements have to be made under uniform conditions. The goal of the testbed presented in this paper is to enable a detailed analysis of a great number of parallelization methods under uniform conditions. The major requirements in the design have been 9 to enable measurements under uniform conditions, 9 portability: the testbed should not be restricted to any particular multiprocessor system. 9 the testbed should be based on a "real world" simulator. 9 a broad coverage of all major approaches to distributed discrete event simulation should be provided. 9 extendibility: new paralleUzation methods should be easy to integrate. The subsequent sections will outline the design of the testbed. In section I a classification scheme will be presented which subdivides the great variety of parallelization methods into a small number of fundamentally different approaches. The fundamental concepts of our test environment as well v,s the four parallelization methods that have been implemented until now will be described in section 2. Run-time measurements will be presented in section 5. C L A S S I F I C A T I O N OF M E T H O D S EVENT SIMULATION
FOR
DISTRIBUTED
DISCRETE
Based on the distribution of functions and data structures and on the type of process synchronization the variety of methods for parallelizing discrete event simulation can be subdivided into a small number of classes representing fundamentally different approaches to parallelization. The classification scheme which has been used to make a representative selection of parallelizations for implementation in our testbed, is summarized in fig. 1. In this section, the subdivision imposed by our classification is motivated by outlining the basic concepts of the different approaches. For a more detailed description, the reader is referred to the literature, e.g. [5, 3, 6]. 2
A TESTBED FOR DISTRIBUTED SIMULATION
The main objective in the design of our testbed has been to provide a platform to implement and analyze a large number of distributed simulation strategies under uniform conditions. Besides the paraUelization strategy the main factors that contribute to parallel simulation behavior are the target architecture and the field of application where discrete event simulation is used. Together with our primary topic of interest, parallelization strategies, these environment parameters span a three-dimensional space as illustrated in fig. 2. Our testbed has been designed to cover as large a portion of it as possible.
Different Approaches to Distributed Logic Simulation
391
Figure 1: Classification of methods for distributed discrete event simulation
In order to eliminate the influence of the simulation application, some simulator had to be selected as the basis for the testbed. Given the importance of gate level simulation in VLSI design and its ever increasing demands for computational power, we have chosen a gate level logic simulator. It implements most of today's state-of-the-art techniques in the modeling of digital systems [7]. Thus, properties of the simulation application in our testbed are very close to those found in commercial CAE systems. Because of its computational complexity computer-aided VLSI design is likely to become the first production-use application of distributed simulation. Portability has been an important requirement in the design of our test environment. As implementation platforms we consider the whole class of scalable MIMD multiprocessors, especially distributed memory computers. The nodes of the multiprocessor are assumed to be virtually fully connected, without storing and forwarding messages on intermediate nodes. Parallelization is done explicitly on a medium to coarse grain level using the message passing model. Communication has to be tuned to optimize the relation of computation to communication on a given multiprocessor (c.f. ??). In the implementation of parallelization strategies we have restricted ourselves to the basic communication primitives which can be found in nearly all existing multiprocessor's programming models. Thus, a high degree of
P. Luksch
392
Figure 2: parameter space of distributed simulation covered by the test environment (The methods implemented in the testbed are depicted as dark planes marked with the abbreviations introduced in fig. 1)
portability has been achieved despite the fact that until now there is no generally accepted standard for message passing programming. Based on the classification scheme introduced in section 1, four parallelizations of the logic simulator have been implemented within the test environment. Each of them belongs to a different leaf in the tree of fig. 1. The parallel simulators have been instrumented to collect detailed run-time statistics including parameters specific to each approach, such as the number of roll-backs in Time Warp. On the one hand, implementation of a representative selection of parallelization strategies allows the main approaches to distributed simulation to be compared under uniform conditions. On the other hand, a library of functions is provided which forms the basis for a flexible test environment. This is summarized in table 1. Using the library, a variety of parallelizations can be analyzed with minimal implementation effort. In the following sections, the four parallelizations that have been implemented are presented.
Table 1: Library of functions provided by the test environment
function decomposition: function decomposition into six processes; element evaluation: one- and two-phase approach; communication mechanism: number of items per message adjustable as a parameter; instrumentation for run-time statistics
model partitioning: interface to circuit partitioning; static partitioning: natural partitioning and min-cut (generalization of Fiduccia/Mattheyses' algorithm by Vijayan [4, 12])
conservative approach: modified control structure for conservative synchronization; deadlock avoidance by time requests; deadlock recovery with the vector method (circulating control vector, parallel vector method); two options for the definition of external events; instrumentation for run-time statistics
Time Warp: rollback mechanism; optimized incremental state saving; aggressive and lazy cancelation; optimized re-simulation after rollback; dynamic re-partitioning; instrumentation for run-time statistics

3 FUNCTION DECOMPOSITION

The sequential simulator is decomposed into six tasks that process the stream of events in a pipeline:
• Input: read the stimuli for the primary inputs.
• Output: read the circuit description and the stimuli, write the result.
• Event administration: event list management and advancing simulation time.
• Event generation: events are generated from new values at output signals; preliminary signal values (see [7]) are computed for current events.
• Event execution: compute new signal values, update the signal list, determine fan-outs.
• Element evaluation.
4 MODEL PARTITIONING
In the model partitioning approach, a partitioning procedure is needed to assign elements to simulators. Our testbed currently implements two algorithms: natural partitioning assigns elements to partitions in the order in which they appear in the circuit description; min-cut partitioning is a generalization of Fiduccia/Mattheyses' min-cut procedure to the case of non-bipartitioning [12]. In our implementation, elements and signals can be weighted individually to account for different activity rates and evaluation complexities. Efficient communication is a key factor in parallel simulation performance. Existing distributed memory multiprocessors have quite high communication latencies: costs that have to be paid for each message irrespective of its length. This is why an event buffering mechanism has been implemented in our testbed, controlled by two parameters, lmin and amax. Instead of sending each event as an individual message, events are collected in a buffer which is sent as soon as its length reaches lmin events. To prevent events from being withheld too long in the buffer, a second parameter is used. If sending an event is deferred for too long, then the synchronization overhead increases, because simulators remain suspended unnecessarily in the conservative approach, while in
the optimistic approach more speculative computation has to be undone by roll-backs. Therefore, if an event has been in the buffer for more than amax units of simulated time, then the buffer is transmitted regardless of its length. The partitioning procedure and the event buffering mechanism described above have been implemented in all the parallelizations based on model partitioning within the test environment, as described in the following sections.
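To make the buffering rule concrete, the following C sketch shows how a per-channel event buffer controlled by the two parameters lmin and amax could be organized. It is only an illustration under simple assumptions (fixed-size buffer, a placeholder send routine, unit-free simulated time); the names and constants are not taken from the testbed's code.

/* Sketch of the event buffering mechanism: events destined for another
 * simulator are collected per channel and flushed either when the buffer
 * holds lmin events or when the oldest buffered event has been waiting for
 * more than amax units of simulated time. */
#include <stdio.h>

#define LMIN 8          /* flush when this many events are buffered   */
#define AMAX 50         /* ... or when the oldest event is this "old" */
#define CAP  64

typedef struct { long time; int signal, value; } Event;

typedef struct {
    Event buf[CAP];
    int   count;
    long  oldest;       /* time stamp of the first buffered event */
} EventBuffer;

/* Placeholder for the actual message-passing send. */
static void send_buffer(int dest, EventBuffer *b)
{
    printf("send %d events to simulator %d\n", b->count, dest);
    b->count = 0;
}

/* Called whenever an event for a remote simulator is generated. */
void buffer_event(int dest, EventBuffer *b, Event e)
{
    if (b->count == 0) b->oldest = e.time;
    b->buf[b->count++] = e;
    if (b->count >= LMIN || b->count == CAP)
        send_buffer(dest, b);
}

/* Called as simulation time advances, to bound the buffering delay. */
void flush_if_stale(int dest, EventBuffer *b, long now)
{
    if (b->count > 0 && now - b->oldest > AMAX)
        send_buffer(dest, b);
}

int main(void)
{
    EventBuffer b = { .count = 0 };
    for (long t = 0; t < 30; t += 3)
        buffer_event(1, &b, (Event){ t, 7, 1 });
    flush_if_stale(1, &b, 100);    /* forces the remaining events out */
    return 0;
}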
4.1 Deadlock Avoidance with Time Requests
Based on an algorithm proposed by Bain and Scott [1], a conservative protocol has been implemented that avoids deadlocks by means of time requests. A time request (Ti, Si) is issued by a simulator Si to all Sj with li[j] < Ti, i.e. all predecessors Sj that prevent Si from advancing its simulation time to Ti, which is the time stamp of the next event in its event list. A time request (Ti, Si) asks the receiving process Sj whether its simulation time has already reached time Ti. If so, a YES reply is sent back. As an optimization, a YES reply carries the local simulation time of the replying process to keep the sender's channel time up to date. Otherwise, the request is queued and the requesting process is left waiting for the reply. If there are any predecessors Sk of Sj with lj[k] < Ti, Sj sends a request (Ti, Si) to these Sk. lj[k] is the channel time of Sj for channel (j, k), i.e. Sk's local simulation clock at the time of generating the last event received by Sj via channel (j, k). Also note that the request has Si as its second component: the simulator that has generated the request for Ti. Its identity is needed for cycle detection, as explained below. A cycle is detected if a simulator Sj receives a request (Ti, Si) from some process Sl while it has an identical request in its queue. Then a RYES ("reflected yes") reply is sent to Sl irrespective of the current local simulation time Tj. Sj has, however, to keep in mind that a RYES reply has been given to Sl. Assume that, later on, an event e1 with time stamp t1 < Ti is sent to Sj. Then the copy of request (Ti, Si) which had been queued upon receipt from some process Sm must be answered by a NO reply, because event e1 might cause an event e2 with time stamp t2 < Ti to be generated and sent to Sm. This is the only situation where NO replies are generated. A request by Si is completed if all predecessors to which the request has been sent have sent their replies. If the request has been originated by Si itself, then the replies decide whether simulation time can be advanced: if all replies are either YES or RYES, simulation will proceed; if any NO replies have been received, then simulation must remain suspended. Having updated its channel times li[j] according to the replies, Si generates a new time request. If a request (Tk, Sk) has been completed which had been originated by another process, a reply is sent to the process from which the request had been received: if there is at least one NO reply, then a NO reply is sent. Otherwise, if all replies are YES, a YES reply is sent as soon as Tj > Tk. Otherwise, i.e. if there are no NO replies but at least one RYES reply, a RYES reply is sent. If Tj > Tk, then the RYES can be converted to YES.
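As an illustration of the reply rules just described, the following C sketch combines the replies collected for a completed time request (Tk, Sk) that was forwarded on behalf of another process. The enum values and the function name are assumptions, and the all-YES case is simplified: strictly, the YES reply is withheld until the local simulation time has reached Tk.

/* Combine the replies received from all predecessors to which a time
 * request (Tk, Sk) was forwarded, and decide what to answer upstream. */
#include <stdio.h>

typedef enum { REPLY_YES, REPLY_RYES, REPLY_NO } Reply;

Reply combine_replies(const Reply replies[], int n, long local_time, long Tk)
{
    int any_no = 0, any_ryes = 0;
    for (int i = 0; i < n; i++) {
        if (replies[i] == REPLY_NO)   any_no = 1;
        if (replies[i] == REPLY_RYES) any_ryes = 1;
    }
    if (any_no)
        return REPLY_NO;        /* at least one NO: answer NO                  */
    if (!any_ryes)
        return REPLY_YES;       /* all YES: answer YES (once local_time >= Tk) */
    /* no NO but at least one RYES: answer RYES, upgraded to YES if this
     * simulator has itself already reached Tk */
    return (local_time >= Tk) ? REPLY_YES : REPLY_RYES;
}

int main(void)
{
    Reply r[] = { REPLY_YES, REPLY_RYES, REPLY_YES };
    printf("%d\n", combine_replies(r, 3, 120, 100));   /* prints 0, i.e. YES */
    return 0;
}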
4.2 Deadlock Recovery with the Vector Method
In the conservative approach, an alternative to deadlock avoidance is to allow deadlocks to occur and then to detect and recover from them. Our implementation of deadlock recovery is based on Mattern's vector method [10]. Two variants of this deadlock detection algorithm have been implemented: a circulating control vector and a parallel version of the vector method. During deadlock detection, the next event time is collected from each simulator. The deadlock is broken by computing the minimum of these times: all simulators with the minimum next event time are restarted.
The circulating control vector. The vector method detects deadlock by having each process count the number of messages that are sent to and received from other processes. Each simulator Si has a (local) vector Li. If Si sends a message to Sj, Li[j] is incremented by one; if Si receives a message, Li[i] is decremented by one. A circulating control vector collects this information on its way through the simulators. A simulator Si that has received the control vector keeps it until it has to suspend its simulation because li[j] < Ti for some j. Then it updates C by adding its local vector to it, which is then reset, i.e. C := C + Li; Li := 0. The control vector is passed to a process Sj with C[j] > 0. If C equals the zero vector upon update, deadlock has been detected: all processes have suspended simulation and there is no event message in transit.
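The message counting that drives the circulating control vector can be sketched as follows; the fixed number of processes, the structure names and the update routine are illustrative assumptions rather than the testbed's implementation.

/* Bookkeeping for the circulating control vector (Mattern's vector method). */
#include <stdio.h>
#include <string.h>

#define N 4                             /* number of simulator processes */

typedef struct { long L[N]; } Local;    /* local message counters of one simulator */
typedef struct { long C[N]; } Control;  /* the circulating control vector          */

void on_send(Local *l, int dest)    { l->L[dest] += 1; }   /* Si sends to Sj */
void on_receive(Local *l, int self) { l->L[self] -= 1; }   /* Si receives    */

/* Called when a suspended simulator holds the control vector: fold its local
 * counters into C, reset them, and report whether deadlock has been detected. */
int update_control(Control *c, Local *l)
{
    int deadlock = 1;
    for (int j = 0; j < N; j++) {
        c->C[j] += l->L[j];
        l->L[j] = 0;
        if (c->C[j] != 0) deadlock = 0;   /* some message may still be in transit */
    }
    return deadlock;   /* C equal to the zero vector: all suspended, nothing in transit */
}
/* Otherwise the vector is passed on to some Sj with C[j] > 0 (not shown). */

int main(void)
{
    Control c; Local l;
    memset(&c, 0, sizeof c); memset(&l, 0, sizeof l);
    on_send(&l, 2); on_receive(&l, 0);
    printf("deadlock detected: %d\n", update_control(&c, &l));   /* prints 0 */
    return 0;
}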
4.3 Time Warp
In the Time Warp parallel simulator, state information is saved incrementally instead of periodically saving the state as a whole (check-pointing). Upon execution, events are not removed from the event list. Instead, the signal value prior to event execution is stored in the event data structure. If a rollback to time tr occurs, a forward search is started in the event list beginning at time tr. The value of a signal s is restored from the first event affecting s that is found in this search. Incremental state saving is preferred to check-pointing in logic simulation because check-pointing would result in very inefficient memory usage, since each event changes only a small part of the system state. Both methods for undoing external events have been implemented: aggressive and lazy cancelation. With aggressive cancelation, an anti-message m- is sent for each event message m+ generated in the rolled-back period immediately upon rollback. With lazy cancelation, an anti-message m- is not sent before the local simulation time (LVT) reaches the time stamp of m+. Only if m+ is not generated once again in the re-simulation will m- be sent. (By re-simulation we mean the renewed simulation of the rolled-back period of simulated time.) The idea behind lazy cancelation is that re-simulation will re-generate most of the events undone in the rollback. (Strictly speaking, this assumption casts doubt on Time Warp's efficiency; however, several studies have shown that lazy cancelation can be more efficient than aggressive cancelation.) Global virtual time (GVT) is approximated using Samadi's GVT2 algorithm [11]. Despite it being one of the earliest GVT algorithms, run-time measurements have shown a
sufficiently close approximation of GVT. GVT2 outperformed a newer algorithm proposed by Lin/Lazowska [8], which does not require simulators to stop computation temporarily but requires more messages to be sent. In our implementation of GVT2, however, the requirement of stopping simulation could be relaxed so that simulators may continue computation but must refrain from sending messages. In any case, investigating newer GVT algorithms such as the one proposed in [2] will be an interesting application of the test environment. Two extensions to the basic Time Warp mechanism have been implemented within our testbed. Motivated by the same assumption as lazy cancelation, optimized re-simulation aims at reducing the number of element evaluations during re-simulation, which is especially useful for circuits containing complex elements. Dynamic re-partitioning attempts to compensate for uneven load distribution by moving elements from a heavily loaded processor to a lightly loaded one. Even if static partitioning has generated equally sized partitions, load may be distributed unevenly if elements have different rates of activity or if the activity distribution in the circuit changes over time.
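Before detailing these two extensions, the basic incremental state saving and rollback mechanism described at the beginning of this section can be illustrated with a small sketch; the array-based event list, the integer signal values and all names are simplifying assumptions, not the testbed's data structures.

/* Every executed event keeps the signal value that was valid before its
 * execution; a rollback to time tr restores each signal from the first such
 * event found in a forward search through the (time-ordered) event list. */
#include <stdio.h>

#define NSIG 8

typedef struct {
    long time;
    int  signal;
    int  old_value;     /* signal value before this event was executed */
    int  executed;
} Event;

int signal_value[NSIG];

void rollback(Event list[], int n, long tr)
{
    int restored[NSIG] = { 0 };
    for (int i = 0; i < n; i++) {                /* forward search from tr onwards */
        Event *e = &list[i];
        if (e->time < tr || !e->executed) continue;
        if (!restored[e->signal]) {              /* first event affecting this signal */
            signal_value[e->signal] = e->old_value;
            restored[e->signal] = 1;
        }
        e->executed = 0;                         /* will be re-simulated */
    }
}

int main(void)
{
    Event list[2] = { { 10, 3, 0, 1 }, { 20, 3, 1, 1 } };
    signal_value[3] = 0;     /* value after both events have been executed */
    rollback(list, 2, 15);   /* undoes only the event at t = 20 */
    printf("signal 3 restored to %d\n", signal_value[3]);   /* prints 1 */
    return 0;
}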
Optimized Re-Simulation. Assume an element E has been evaluated during the rolled-back simulation at (simulated) time t, resulting in an event e1 being generated. If, during re-simulation, E is evaluated once again at time t, e1 will be generated again if the state of E is the same as in the corresponding evaluation before the rollback. The state of an element is defined as the vector of its input signals and its internal state variables. The idea of our optimization is to re-use the event generated before the rollback instead of evaluating the element once again if the above condition is met. More precisely, optimized re-simulation works as follows: during "normal" simulation the simulator keeps track of the causal relationship between events and element evaluations, i.e. it stores information of the form "event e1 caused elements E1, E2 to be evaluated; evaluation of E1 generated e3, evaluation of E2 generated e4" (e3 and e4 are called follow events of e1, caused by the evaluation of E1 and E2, respectively). In addition, the element state has to be remembered for each evaluation. At rollback, local events are marked as "undone" instead of being removed from the list. If during re-simulation element E1 is evaluated at time t, the simulator checks whether there is information stored about follow events. If so, it compares E1's state at the corresponding evaluation before the rollback to its current state. If the states are identical, e3 (which is a follow event of e1) is re-scheduled by removing the "undone" mark. Only if no follow-event information is stored, or the states do not match, must E1 be evaluated.
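A minimal sketch of the re-use test performed during optimized re-simulation is given below; the record layout, the fixed-size state vector and the function name are illustrative assumptions.

/* If an element is evaluated again at the same simulated time during
 * re-simulation and its state vector matches the one recorded before the
 * rollback, the previously generated follow event is re-scheduled by clearing
 * its "undone" mark instead of evaluating the element once more. */
#include <string.h>

#define NSTATE 4

typedef struct {
    long time;
    int  undone;                 /* marked "undone" at rollback */
} Event;

typedef struct {
    long  time;                  /* simulated time of the recorded evaluation      */
    int   state[NSTATE];         /* element inputs and internal state at that time */
    Event *follow;               /* follow event generated by that evaluation      */
} EvalRecord;

/* Returns 1 if the stored follow event could be re-used, 0 if the element
 * really has to be evaluated again. */
int try_reuse(EvalRecord *rec, long t, const int current_state[NSTATE])
{
    if (rec == 0 || rec->time != t)
        return 0;                                        /* no matching record       */
    if (memcmp(rec->state, current_state, sizeof rec->state) != 0)
        return 0;                                        /* state differs: evaluate  */
    rec->follow->undone = 0;                             /* re-schedule follow event */
    return 1;
}

int main(void)
{
    Event e = { 42, 1 };
    int s[NSTATE] = { 1, 0, 1, 0 };
    EvalRecord r = { 42, { 1, 0, 1, 0 }, &e };
    return try_reuse(&r, 42, s) ? 0 : 1;   /* the follow event is re-used */
}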
Dynamic Re-Partitioning. There are three alternatives for implementing load balancing at simulation time:
1. There are several simulator processes on each node. The operating system determines the load on each node and migrates processes as necessary. However, currently no multiprocessor operating system supporting dynamic load balancing is available for production use. In addition, the Time Warp protocol makes it hard for an operating system to measure load, since Time Warp simulators are ready to compute all the time. Moreover, optimal scheduling of several simulators on one node, i.e. lowest LVT first, cannot be implemented with any of the existing operating systems.
2. Each simulator processes several partitions, each of which has its own LVT. They are scheduled such that the partition with the minimal LVT is simulated first. If load imbalance is detected, then a heavily loaded process gives one or more of its partitions to a lightly loaded process. Load is measured as the minimum LVT of a simulator's partitions, as reported in the snapshots taken for GVT computation. As LVTs may move back and forth quickly due to speculative computations followed by roll-backs, mean values taken over a number of snapshots provide a more realistic image of the load distribution. Since partitions have their own LVTs, migrating them is relatively straightforward. However, as partitions may only be moved as a whole, their number must be much larger than the number of processors in order to be able to balance load exactly. Since the communication structure is fixed to a great extent by statically clustering elements into partitions, a good static partitioning policy is required.
3. To compensate for load imbalance, a set of elements is selected from the partition of a heavily loaded simulator and moved into that of a lightly loaded one. Load is measured by observing LVTs as in method 2. Compared to methods 1 and 2, element-wise re-partitioning allows very fine-grained redistribution of computational load. Communication relations between processes can be rearranged freely because there are no restrictions due to static partitioning. However, migrating elements is not as straightforward as migrating partitions, which have their own LVTs. Usually, the source partition's LVT, Tsrc, is lower than the destination partition's LVT, Tdest. Therefore, the simulator processing the destination partition, Sdest, must perform a modified form of rollback to Tsrc in order to simulate unprocessed events for the "new" signals. (This rollback does not require local events to be undone.) Unprocessed events for signals that migrate into the destination partition have to be sent from Ssrc to Sdest. Also, the rollback at Sdest may require the signal history for [GVT, Tsrc] to be transferred to Sdest.
In the current version of the testbed, method 3 has been implemented. A comparison of methods 2 and 3 will be an interesting topic for applications of and extensions to our testbed.
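The load measure used to trigger element migration can be sketched as follows: the LVT of each simulator is sampled in the snapshots taken for GVT computation, and mean values over a small window identify the slowest (most heavily loaded) and fastest (most lightly loaded) simulators. The window size, the absence of a migration threshold and all names are illustrative assumptions.

/* Record LVT snapshots and pick source/destination candidates for migration. */
#include <stdio.h>

#define NSIM   4        /* number of simulators             */
#define WINDOW 8        /* snapshots averaged per simulator */

static long samples[NSIM][WINDOW];
static int  next_slot = 0;

void record_snapshot(const long lvt[NSIM])     /* called at each GVT snapshot */
{
    for (int i = 0; i < NSIM; i++)
        samples[i][next_slot] = lvt[i];
    next_slot = (next_slot + 1) % WINDOW;
}

/* The simulator lagging furthest behind is considered overloaded (source of
 * migrated elements), the one furthest ahead underloaded (destination).
 * A real implementation would also require the gap to exceed a threshold. */
void pick_candidates(int *heavy, int *light)
{
    long mean[NSIM];
    for (int i = 0; i < NSIM; i++) {
        long sum = 0;
        for (int k = 0; k < WINDOW; k++) sum += samples[i][k];
        mean[i] = sum / WINDOW;
    }
    *heavy = *light = 0;
    for (int i = 1; i < NSIM; i++) {
        if (mean[i] < mean[*heavy]) *heavy = i;
        if (mean[i] > mean[*light]) *light = i;
    }
}

int main(void)
{
    long lvt[NSIM] = { 100, 400, 250, 900 };
    for (int k = 0; k < WINDOW; k++) record_snapshot(lvt);
    int heavy, light;
    pick_candidates(&heavy, &light);
    printf("move elements from simulator %d to simulator %d\n", heavy, light);
    return 0;
}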
5 EXPERIMENTAL RESULTS
The testbed has been implemented on top of the machine-independent parallel programming library MMK, which has been developed in our research group and is currently available for the iPSC/2, the iPSC/860 and networks of Sun Sparc workstations. Run-time measurements have been performed on the iPSC distributed memory multiprocessors using the ISCAS-89 benchmark circuits as workloads. Function decomposition has a theoretical speedup of 3-4. Parallelization overhead (without communication cost) has been measured to be less than 50%. Nevertheless, no speedup has been observed in our run-time measurements because of the implementation platform's high communication latency, which is about 600 μs for MMK on the iPSC/860 and 2 ms on the iPSC/2. For function decomposition to be efficient, communication latency must be low or circuits must be very large, so that the data exchanged between pipeline stages can be packed into long messages while keeping the pipeline busy.
Figure 3: Time Warp and deadlock recovery: experimental results (parameter s: number of simulation time units between the application of successive input vectors to the primary inputs)
In our measurements, the performance of the parallelizations based on model partitioning has been shown to depend strongly on the circuit being simulated and on the stimuli applied to its primary inputs, as depicted in fig. 3 for some examples. Maximum speedups are about half the number of simulators involved in the simulation. However, in many cases no clear relationship can be established between the number of simulators and the achieved speedup. As the function of the ISCAS benchmarks is not known, random sequences of input vectors have been applied to the circuits at different frequencies. The parameter s in fig. 3 denotes the number of simulation time units between two successive input vectors. The examples shown suggest that Time Warp outperforms conservative synchronization with deadlock recovery. However, our measurements do not clearly favor any of the three approaches that have been analyzed. Circuit topology and stimuli have impacted performance much more than the method of synchronization did, for both of our static partitioning procedures. Run-time statistics revealed the reason for this rather unexpected behavior: load has been distributed very unevenly among the simulators. Further analysis has shown that activity rates vary by several orders of magnitude from element to element. Also, the "center of activity" within a circuit tends to move during simulation. In Time Warp, uneven load distribution has resulted in an extreme divergence of LVTs. Fig. 4 shows the result of an observation of LVTs and GVT with the TOPSYS distributed monitoring system. The GVT approximation is sufficiently close. One simulator increases its LVT without roll-backs, another one proceeds at nearly the same rate but with frequent and short roll-backs. The other simulators periodically run far ahead of GVT and then roll back over long periods of simulated time. As a result of being far ahead of GVT, the latter processes use up all their memory for state saving if large circuits are simulated. In order to get such simulations finished, Time Warp's optimism had to be limited by suspending simulators which are running short of memory if they are more than a predefined amount
Figure 4: Time Warp: LVTs and GVT observed with the TOPSYS distributed monitoring system
of simulated time ahead of GVT.

6 CONCLUSIONS AND FUTURE WORK
A test environment has been designed which allows easy implementation of a great number of parallelization strategies by providing a comprehensive library of functions, and which enables an unbiased evaluation of different parallelization strategies. Four parallelizations have been implemented and analyzed. However, the number of run-time measurements has been limited by the instability of both the iPSC multiprocessors and the programming environment. Since some of the results obtained have been quite unexpected, further run-time measurements should be carried out in the future, including larger circuits and circuits of known function for which input stimuli can be provided that "make sense". From our measurements performed so far, the following conclusions can be drawn:
1. Given its limited potential for speedup and its sensitivity to communication latency, the function decomposition approach can be applied successfully only in combination with the model partitioning approach. In future multiprocessors where each node has several CPUs sharing a common memory, a simulator running on one node may be parallelized using function decomposition while simulation is distributed among the nodes using the model partitioning approach.
2. Different activity rates must be accounted for in the static partitioning procedure. Most heuristic algorithms can be modified to have individual weight factors for elements and signals. Since in the design phase of a circuit typically a number of nearly identical simulations is run in sequence (e.g. for debugging the design), these weight factors can easily be obtained from statistics collected in a previous run at no extra cost.
Dynamic re-partitioning has proved to reduce the LVT divergence in Time Warp. However, further
measurements will be necessary in order to evaluate its effects comprehensively. Topics for future research include using the testbed as a basis for the implementation and analysis of optimizations of existing and new parallelization strategies, and porting the testbed to a more widely used programming model, e.g. PVM or P4. Enlarging the set of hardware platforms on which the testbed is available will allow us to evaluate different multiprocessors with respect to their appropriateness for distributed discrete event simulation. Considering other application areas of discrete event simulation will show to what extent results obtained from logic simulation can be generalized to other types of simulation problems. The parallelization of a commercial simulator designed for modeling production processes in factories has just begun.

Acknowledgements
This work has been partially funded by the DFG ("Deutsche Forschungsgemeinschaft", German Science Foundation) under contract No. SFB 342, TP A1.

References
[1] W.L. Bain and D.S. Scott. An algorithm for time synchronisation in distributed discrete event simulation. In Distributed Simulation, 1988.
[2] H. Bauer and C. Sporrer. Distributed Logic Simulation and an Approach to Asynchronous GVT-Calculation. In Proceedings of the 1992 SCS Western Simulation Multiconference on Parallel and Distributed Simulation (PADS 92), pages 205-209, Newport Beach, California, January 1992.
[3] K.M. Chandy and J. Misra. Asynchronous Distributed Simulation via a Sequence of Parallel Computations. Communications of the ACM, 24(11), April 1981.
[4] C.M. Fiduccia and R.M. Mattheyses. A Linear-Time Heuristic for Improving Network Partitions. In 19th Design Automation Conference, pages 175-181, 1982.
[5] R.M. Fujimoto. Parallel Discrete Event Simulation. Communications of the ACM, 33(10):30-53, October 1990.
[6] D. Jefferson. Virtual Time. ACM Transactions on Programming Languages and Systems, 7(3):404-425, July 1985.
[7] T.H. Krodel and K. Antreich. An Accurate Model for Ambiguity Delay Simulation. In 27th ACM/IEEE Design Automation Conference, pages 122-127, 1990.
[8] Y.-B. Lin and E.D. Lazowska. Determining the Global Virtual Time in a Distributed Simulation. In Proceedings of the 1990 International Conference on Parallel Processing, volume III, pages 201-209, 1990.
[9] P. Luksch. Parallelisierung ereignisgetriebener Simulationsverfahren auf Mehrprozessorsystemen mit verteiltem Speicher. Verlag Dr. Kovač, Hamburg, 1994.
[10] F. Mattern. Verteilte Basisalgorithmen, volume 226 of Informatik-Fachberichte. Springer-Verlag, Berlin, 1989.
[11] B. Samadi. Distributed Simulation, Algorithms and Performance Analysis. Technical Report, University of California, Los Angeles (UCLA), 1985.
[12] G. Vijayan. Min-Cost Partitioning on a Tree Structure and Applications. In 26th ACM/IEEE Design Automation Conference, pages 771-774, 1989.
A SIMULATOR FOR OPTICAL PARALLEL COMPUTER ARCHITECTURES
N. LANGLOH, H. SAHLI†, A. DAMIANAKIS‡, M. MERTENS, J. CORNELIS
Vrije Universiteit Brussel, Dept. ETRO/IRIS, Pleinlaan 2, B-1050 Brussel, Belgium
nlangloh@etro.vub.ac.be
† also Ecole Royale Militaire, Brussels, Belgium
‡ also FORTH-Hellas, Crete, Greece
ABSTRACT. With the demonstration of optical data transcription of images and logic operations on images, it has been shown that it is feasible to build optical computer architectures with arrays of differential pairs of optical thyristors. This paper describes a simulator for the execution of (image processing) algorithms on arbitrary optical architectures. The results of the simulations will allow the estimation of the execution speed of different architectures and the improvement of the architecture itself. KEYWORDS. Optical computing, simulation, parallel computer architecture.
1 INTRODUCTION
The PnpN optical thyristor is one of the most promising elements for parallel optical information processing [4]. Currently, PnpN devices with a very good optical sensitivity (250 fJ) at an operation cycle frequency of 15 MHz [5] are available. These fast switching times are achieved through the fabrication of a new type of optical thyristor which can be completely depleted by means of a negative electrical pulse [3]. A physical implementation of an array of differential pairs of PnpN optical thyristors with the possibility of performing optical data transcription and optical logic has been shown in [10]. The ability to execute AND, OR, and NOT operations with these PnpN optical thyristor arrays allows us to design optical computer architectures capable of executing all possible Boolean functions [6].
Massively parallel optical computer architectures can be designed with arrays of differential pairs of PnpN optical thyristors. When the architecture is used for image processing, each differential pair of an array represents a pixel of the image. Executing a Boolean operation with this architecture means that every pixel of the same optical thyristor array undergoes an identical Boolean operation. In [6], it was shown using a worst-case analysis that, for images of at least 64x64 pixels, the calculation of an arbitrary Boolean function containing 10 different variables needs fewer clock cycles on an SIMD architecture based on optical thyristors than on a sequential architecture. The design of the architectures built with the PnpN thyristor arrays must be carried out carefully, so that they will be competitive with currently existing parallel and sequential (electronic) computers. We have therefore developed two simulators. A first prototype (OptoSim), which is capable of simulating a fixed SIMD architecture containing 6 optical thyristor arrays, has already been developed in [8]. The simulator gives the sequence of operations that the optical thyristor arrays must perform to execute a program. The architecture is fully SIMD, because only one plane at a time can perform an operation. The simulator optsim that is currently being developed will not have this disadvantage. One of the objectives of this simulator is to simulate architectures consisting of several primitive computer architectures (standard cells), connected with each other through an optical communication (bus) structure. All of these standard cells must be able to perform operations simultaneously. At the level of pixel data, this architecture is still SIMD, but at the level of image data, it can be viewed as an MIMD architecture. If the program to be executed can be partitioned such that many standard cells simultaneously contribute to the solution, then the degree of parallelisation will be some orders of magnitude higher than in [8].
Outline. In section 2, the typical optical components which are used in the architectures will be described. In section 3, some examples of elementary optical computer architectures will be given. Section 4 will describe the implementation of a first simulator (OptoSim). In section 5, a hierarchical description of an optical parallel computer architecture will be given. Section 6 will describe the simulator optsim, which is currently still under development. In section 7, we will draw conclusions.
2 THE BASIC ELEMENTS
Several optical components are needed to build an optical computer architecture. Besides the PnpN thyristor array, which enables logic operations to be performed, one also needs elements which allow the blocking of optical signals (like a shutter) and the routing of optical signals to more than one destination (like a beam splitter). A system description for these components will be given in this section.
2.1 The PnpN Thyristor Array
The basic component which allows to perform logic operations is the completely depleted optical PnpN thyristor [3]. Depending on the anode-cathode voltage of the thyristor, it can be in one of the following four states. (1) When we apply a high positive voltage (around ten volts), a current will flow through the device and it will emit light. Also a huge amount of charge will be accumulated in the device. (2) When we apply a low positive voltage (a few volts), the device will remain idle; it will not emit light, and it will not accumulate nor lose charges. (3) When we apply a zero voltage, then it will accumulate charges proportional to the optical energy of the light that falls on the gate of the thyristor. The factor of proportionality depends on the wavelength of the light that shines on the gate of the device. (4) When we apply a high negative voltage, all charges accumulated in the device will be sucked out. The removal of the charge can happen in a few nanoseconds because the device is completely depleted.
Figure 1: A differential pair of optical thyristors and an electronic model of the pair

It is interesting to set up two thyristors as a differential pair [4] (see Figure 1). This pair then behaves like a "winner takes all" network: when we apply a high voltage over the differential pair, the thyristor with the most accumulated charge will conduct and emit light. The other thyristor will remain idle. In other words, only one of the two thyristors will send out light. If we adopt the convention that a logic "true" corresponds to the situation where one of the thyristors of the differential pair emits light, and a logic "false" corresponds to the situation where the other thyristor sends out light (see Figure 2-a), then it is possible to perform an AND and an OR operation with the differential pair. The AND operation is depicted in Figure 3: (i) The AND Plane (array) and the Buffer Plane are reset with a high negative anode-cathode voltage. Then the contents of the A Plane are copied to the Buffer Plane and the AND Plane receives an optical bias (every pixel of the AND Plane receives a logic "false") from the Bias Plane. (ii) The contents of the Buffer Plane are transmitted to the AND Plane. Thereafter, the Buffer Plane is reset. (iii) The contents of the B Plane are copied to the Buffer Plane. (iv) The Buffer Plane transmits its contents to the AND Plane, and then the Buffer Plane is reset again. (v) The contents of the AND Plane are sent to the Buffer Plane. According to the "winner takes all" principle, a pixel of the AND Plane will be logic "true" only if the corresponding pixels on the A Plane and the B Plane were both logic "true", because the thyristor of the differential pair which corresponds to the logic "true" received twice as much light (it received light from the A Plane and the B Plane) as the thyristor corresponding to the logic "false" (which only received light from the Bias Plane). If at least one of the pixels on the A Plane or the B Plane was logic "false", then the logic "false" thyristor of the AND Plane receives more light than the logic "true" thyristor, and according to the "winner takes all" principle the logic "false" thyristor of the AND Plane will emit light. (vi) The contents of the Buffer Plane are copied to a C Plane. An OR operation is selected when the logic "true" thyristor is optically biased before the two optical input signals are put on the differential pair. The NOT and shifting operations need a differential pair with a special electrode configuration. Every thyristor of the differential pair has two top electrodes, so that it is possible to control the position of the optical output of the activated thyristor. The electrodes can be normally-connected or cross-connected (see Figure 2-b). This makes it possible to have inverting logic and shifting [10].

Figure 2: Logic representation of a differential thyristor pair. a) Logic representation of a thyristor pair vertically (resp. horizontally) oriented: logic "true" when the left (resp. upper) thyristor emits light; logic "false" when the right (resp. lower) thyristor emits light. b) Normally-connected and cross-connected electrodes (e.g. ncv is normally-connected vertical).

Figure 3: The steps of an AND operation with optical input data in the A Plane and the B Plane, and the result stored in the C Plane
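The "winner takes all" principle and the selection of AND or OR by the optical bias can be made concrete with a small numerical sketch; the unit light doses and the data types are deliberate simplifications of the device behaviour described above, not a device-level model.

/* Each thyristor of a differential pair accumulates optical charge; when the
 * high voltage is applied, the thyristor with the larger charge emits light.
 * Biasing the "false" thyristor yields an AND, biasing the "true" one an OR. */
#include <stdio.h>

typedef struct { double q_true, q_false; } Pair;   /* accumulated charges */

void reset(Pair *p)      { p->q_true = p->q_false = 0.0; }
void bias_false(Pair *p) { p->q_false += 1.0; }    /* selects AND */
void bias_true(Pair *p)  { p->q_true  += 1.0; }    /* selects OR  */

/* An input plane currently showing value v adds one light dose to one side. */
void expose(Pair *p, int v) { if (v) p->q_true += 1.0; else p->q_false += 1.0; }

/* Applying the high voltage: the thyristor with most charge wins. */
int winner(const Pair *p) { return p->q_true > p->q_false; }

int main(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++) {
            Pair p;
            reset(&p);
            bias_false(&p);        /* AND behaviour */
            expose(&p, a);         /* light from the A Plane */
            expose(&p, b);         /* light from the B Plane */
            printf("%d AND %d = %d\n", a, b, winner(&p));
        }
    return 0;
}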
2.2 The Beam Splitter
A cube beam splitter is a well-known optical component. When we send light into one of the faces of the cube, a fraction x (smaller than one) of the optical energy of the light comes out on the other side of the cube, and a fraction 1 - x comes out of a third face of the cube, if we ignore the optical losses in the component.
2.3 The Shutter
A shutter can be used to decide electronically whether light will pass through the component. These shutters are usually made with liquid crystal displays and have the disadvantage of being slow. But with arrays of differential pairs of PnpN optical thyristors and beam splitters, one is able to build an active high-speed shutter. Figure 4 describes such a shutter, containing only one optical thyristor array and one beam splitter. This shutter can change its state in several clock cycles (less than 100 ns) instead of several milliseconds. When the optical thyristor array receives an optical input, it is possible to decide whether or not to send out this same optical signal through the beam splitter.
3 SOME BASIC ARCHITECTURES
Different optical computer architectures have already been examined. In [6], a basic optical computer architecture consisting of just three thyristor arrays and one beam splitter was studied (see Figure 5). It has been shown in [6] that this simple architecture
was capable of performing all possible Boolean operations. A straightforward extension to grey value image processing was also demonstrated.

Figure 4: A dynamic shutter
Figure 5: A basic architecture [6] (outside-world input and output connected to an AND processing plane, an OR processing plane and a shifter-inverter processing plane)
4 OPTOSIM - A FIRST COMPILER AND SIMULATOR
A simulator of an optical computer architecture containing six optical thyristor arrays and six beam splitters was developed in [8]. The simulator was also capable of compiling CLIP instructions [2] and SSL instructions [1] to the low level instructions suitable for controlling the PnpN optical thyristor arrays. But the simulator has two major drawbacks. Firstly, it can compile CLIP programs and SSL programs only for one specific computer architecture, and secondly the execution time of the program is rather long. The simulator and the compiler are implemented on a Macintosh.
5 HIERARCHICAL DESCRIPTION OF AN OPTICAL PARALLEL COMPUTER ARCHITECTURE
Our purpose here is to present a more powerful simulator which permits us (1) to simulate general purpose optical parallel computer architectures, and (2) to map image processing algorithms onto these architectures. An optical parallel computer architecture can be viewed as a distributed system formed by connecting several primitive processing units (standard cells) through a communication (bus) structure. We suggest a unified hierarchical approach for describing and implementing standard cells and more complex optical parallel computers. Such a hierarchical description considerably simplifies the analysis, the design and the implementation of the simulator. Thus we distinguish four levels: (i) the physical implementation; (ii) the functional description; (iii) the graph representation; and (iv) the algebraic description.
Figure 6: Hierarchical standard cell description and corresponding simulator structure

These description levels can be summarized as follows (see Figure 6-a): The physical implementation of the architecture (or the standard cell) is a scheme of how the architecture (or the standard cell) is built. This contains the optical components used to implement the architecture (or the standard cell), such as lenses, beam splitters, optical thyristor arrays, holograms, diffractive elements, shutters, etc. The functional description of the architecture (or the standard cell) only contains the
elements which are necessary to describe the operations that the architecture (or the standard cell) can perform. Optical elements like lenses, etc. are not present here, because they just ensure that the light rays emitted by the optical thyristor arrays will not spread out. The graph representation of the architecture (or standard cell) describes the architecture (or standard cell) in terms of nodes and links between these nodes. Each node is an element capable of processing data (an element that must be described with internal state variables), like the optical thyristor array. A link represents the communication path between the nodes (the path transmitted light can follow), e.g. free air. Each node must be described by the transformation of data it can perform as a function of the input data and the internal variables. The algebraic description of the standard cell is the sequence of instructions this standard cell must perform to execute a Boolean operation. The algebraic description of the architecture is the sequence of operations all elements in the architecture must perform to execute a given program. A formal language has been defined for the design of algorithms.
6 OPTSIM - A SIMULATOR FOR GENERAL OPTICAL COMPUTER ARCHITECTURES
To develop the simulator we decided to follow a bottom-up approach, which we will describe below (see Figure 6-b). For each level of the hierarchical description of section 5, we defined and developed dedicated tasks (processes). We started with the development of a simulator for the optical components, the process element, which allows the internal and external state of an element after a given time Δt to be simulated, given the initial internal and external state of the element. This corresponds to the functional description of the optical components (see section 5). The communication between the elements is simulated by the process graph. This corresponds to the graph representation of the architecture (see section 5). In order to obtain the complete functional description of the architecture, a third process, kernel, simulates the state changes of a complete optical computer architecture. The algebraic description of the architecture is given by two tasks: assembler and compiler. The process assembler allows the simulation of a sequence of instructions of the optical processor. It also has debugging capabilities which allow step-by-step tracing of a program. The process compiler will allow higher level languages to be developed. As shown, the simulator is viewed as a collection of concurrently executing processes. These processes communicate with each other using a message passing model. All of these processes have been developed on a UNIX workstation using the C language. The fact that the processes are executed concurrently and that the communication happens via a non-blocking message passing model allows the exploitation of any parallelism available on the workstation on which the simulator is running.
6.1 The Processes
6.1.1 The Element Process

The process element allows the optical components of the computer architecture to be simulated. Each optical component is described by the transformation of input data into output data as a function of its internal state and the applied instruction.
Given the kind of component (e.g. optical thyristor array, beam splitter, shutter, diffractive element, ...), given a description of the current state of the element (e.g. the charge already accumulated in the junction of a PnpN thyristor), given the optical inputs of the element, and given the time period Δt during which the component will have these inputs, this process will calculate the new state of the element and its optical outputs after the time period Δt.
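The kind of interface the process element could expose is sketched below for three component kinds; the types, names and the very simple device models are illustrative assumptions and do not reproduce the actual optsim code.

/* Given a component's kind, its current internal state, its optical inputs
 * and the time period dt, compute its new internal state and its outputs. */
#include <stdio.h>

#define MAX_PORTS 8

typedef enum { THYRISTOR_ARRAY, BEAM_SPLITTER, SHUTTER } ElementKind;

typedef struct {
    ElementKind kind;
    double      charge;            /* e.g. charge accumulated in a PnpN junction */
    double      in[MAX_PORTS];     /* optical input energy per port  */
    double      out[MAX_PORTS];    /* optical output energy per port */
} Element;

void element_calculate(Element *e, double dt)
{
    switch (e->kind) {
    case THYRISTOR_ARRAY:
        e->charge += e->in[0] * dt;   /* "receive" state: integrate incoming light */
        e->out[0] = 0.0;
        break;
    case BEAM_SPLITTER: {
        const double x = 0.5;         /* fraction x out of one face, 1 - x out of the other */
        e->out[0] = x * e->in[0];
        e->out[1] = (1.0 - x) * e->in[0];
        break;
    }
    case SHUTTER:
        e->out[0] = 0.0;              /* blocked; the pass-through state is not modelled here */
        break;
    }
}

int main(void)
{
    Element bs = { .kind = BEAM_SPLITTER, .in = { 2.0 } };
    element_calculate(&bs, 1.0);
    printf("%.2f %.2f\n", bs.out[0], bs.out[1]);   /* prints 1.00 1.00 */
    return 0;
}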
6.1.2 The Graph Process

The process element contains the information about all the optical components of the optical computer architecture, but it does not know how these components are connected with each other. This is the task of the process graph. It contains a graph description of the architecture and makes sure that the optical input images of the component simulated by the process element are equal to the output images of the components to which it is connected. It also checks the coherence of the architecture.

6.1.3 The Kernel Process

The task of the process kernel is to calculate the new state of the optical computer after a time period Δt, given the current state of each component of the architecture. At first sight, this seems to be a straightforward problem. But there can exist optical loops in the graph, and then only an iterative process can calculate the new state of the architecture. Knowing that passive optical components just dissipate light and that active optical components generate light independently of the received optical energy, this problem can easily be solved. First, the components of the architecture are simulated assuming that they have no optical input. This way, the optical energy generated in the system is known. Then the process kernel iteratively asks the process element to calculate the optical output of each component, knowing that its optical inputs are the optical outputs of the components connected to it. The process kernel can then ask for the optical energy dissipated in the element. The process iterates until the total optical energy dissipated in the architecture is close enough to the optical energy generated.

6.1.4 The Assembler Process

With the process element, the process graph and the process kernel, it is possible to define the components and how these components are connected with each other in order to form an optical computer architecture. It is also possible to calculate the new state of the architecture after a time period Δt, given the current state of the architecture and the instructions every component of the architecture must perform. But the level of these instructions is very low. E.g. the instructions for an optical thyristor (Reset, Receive, Idle, and Send) correspond to the voltages to be put over the thyristor (see section 2.1). The process assembler allows the user to give a sequence of instructions to be processed by the optical architecture. The process assembler sends the sequence, instruction by instruction, to the process kernel. It also makes sure that the new state of the components becomes the old state of these components for the next instruction.

6.1.5 The Compiler Process

It is clear that the process assembler will only handle sequences of very low level instructions. But in most application domains, some sequences of instructions will appear on different occasions in a program. The aim of the process compiler is to allow the user to construct high level commands which will be translated into a sequence of low level commands. The most prominent application in which the computer
architectures based on the PnpN optical thyristors will be used is image processing. With the process compiler it will be possible to define operations on images as sequences of low level instructions.
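The iterative calculation performed by the process kernel (section 6.1.3) in the presence of optical loops can be illustrated with the following toy relaxation, in which one active source feeds a single lossy loop and the sweeps stop once the energy dissipated per sweep is close enough to the energy generated; the network and all numbers are purely illustrative.

/* Fixed-point iteration: outputs are first computed without optical input,
 * then light is propagated around the loop until generation and dissipation
 * balance to within eps. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double generated = 1.0;   /* optical energy generated per kernel step           */
    const double loss      = 0.3;   /* fraction of its input the passive loop dissipates  */
    const double eps       = 1e-9;

    double circulating = 0.0;       /* light still travelling around the loop */
    double dissipated  = 0.0;       /* energy dissipated in the current sweep */
    int    sweeps      = 0;

    do {
        double input = generated + circulating;   /* source plus fed-back light */
        dissipated   = loss * input;               /* absorbed in this sweep     */
        circulating  = (1.0 - loss) * input;       /* passed on into the loop    */
        sweeps++;
    } while (fabs(generated - dissipated) > eps);

    printf("converged after %d sweeps, circulating energy %.6f\n", sweeps, circulating);
    return 0;
}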
6.2 The Communication Between the Processes of OPTSIM
The processes are designed to be communication driven. This means that a process will be idle until another process asks it to perform a task. Each process can be treated as a software "black box" communicating with the outside world via message passing [11, 12]. This service is provided by the communication software. The communication software handles the messages to be transmitted to (and received by) any process. If process A receives a request from another process B, then this request will contain (i) the name of the process B that started the request, (ii) an identification number that must be returned with every answer to the request, so that process B knows to which request the answer belongs, (iii) an answer identification number of the request to which this is the answer (this number will be zero if it is not an answer but a newly generated request), (iv) a command field containing the command to be executed (or the answer), and (v) a body field containing supplementary data. The communication software supports the notion of recovery functions. If a process has a request for another one, it will use its own communication software to send the request, and put a recovery function call in a recovery function queue by passing (i) a pointer to a function, (ii) the answer identification number, and (iii) a status structure corresponding to the current execution context. When a process receives a request, it will first check the answer identification number. If this number is zero, the process will assume that it received a new command, so it will interpret the command field and execute the associated command. If the answer identification number is non-zero, the process will check to which recovery function this answer corresponds, and will start executing the code of this function.
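The five-field request message and the recovery-function queue described above could be represented as in the following sketch; the field sizes, type names and dispatch routine are illustrative assumptions, not the published optsim data structures.

/* A request message and the entry queued when a request is sent, so that the
 * eventual answer can resume the saved execution context. */
#include <stdio.h>
#include <string.h>

#define NAME_LEN 32
#define BODY_LEN 256

typedef struct {
    char          sender[NAME_LEN];   /* (i)   name of the requesting process          */
    unsigned long request_id;         /* (ii)  id to be echoed in every answer         */
    unsigned long answer_id;          /* (iii) id of the request this answers; 0 = new */
    char          command[NAME_LEN];  /* (iv)  command to execute (or the answer)      */
    char          body[BODY_LEN];     /* (v)   supplementary data                      */
} Message;

typedef struct { int context; } Status;   /* execution context saved with the request */

typedef struct {
    void        (*recover)(const Message *answer, Status *saved);
    unsigned long answer_id;              /* matches the request_id of our own request */
    Status        saved;
} RecoveryEntry;

/* Dispatch loop of a process: new commands are executed, answers resume the
 * recovery function that was queued when the corresponding request was sent. */
void dispatch(const Message *m, RecoveryEntry queue[], int n)
{
    if (m->answer_id == 0) {
        /* a new command: interpret m->command and execute it (not shown) */
        return;
    }
    for (int i = 0; i < n; i++)
        if (queue[i].answer_id == m->answer_id) {
            queue[i].recover(m, &queue[i].saved);
            return;
        }
}

static void on_ports_answer(const Message *answer, Status *saved)
{
    (void)saved;
    printf("number of ports: %s\n", answer->body);
}

int main(void)
{
    RecoveryEntry queue[1] = { { on_ports_answer, 7, { 0 } } };
    Message answer;
    memset(&answer, 0, sizeof answer);
    answer.answer_id = 7;
    strcpy(answer.body, "4");
    dispatch(&answer, queue, 1);   /* resumes the queued recovery function */
    return 0;
}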
6.3 The User Interface
It is the user who controls the operations that the processes perform. Initially, the user uses a command line as the interface to communicate with the other processes. With this interface, the user can send commands to the processes and receive the answers from the processes. The form of the command line is: "command" "body", where command is the command to be executed, while body contains the parameters supplied with the command. For example:
element create
    /* Allocate a new element and return an element id number. */

element load type
    /* Load the type of an element (e.g. PnpN type). */
    begin begin element_id end begin type_id end end

element request ports
    /* Ask for the number of ports of an element. */
    begin element_id end

element calculate
    /* Calculate the optical output signals of an element. */
    begin element_id end

graph load connection
    /* Connect pairs of elements. */
    begin begin element_id end begin port_id end end
    begin begin element_id end begin port_id end end

graph calculate next element
    /* Return the next element to be processed, and ask the process element to update the influenced inputs. */

It is obvious that a command line is largely insufficient as a user communication interface. Therefore, we are currently developing a graphical user interface which will allow the user to communicate with the other processes in a very intuitive manner. The graphical user interface will be developed in Motif under X-Windows.
7 CONCLUSIONS AND FURTHER DEVELOPMENT
A first simulator and compiler for a fixed optical computer architecture has been implemented. It allows the user to write a program in CLIP or SSL and to translate the program to low level instructions suitable for direct control of the PnpN thyristor arrays. A more powerful simulator, which permits the simulation of general purpose optical computer architectures, is currently under development. This simulator allows us to: (1) describe several optical primitive processing units (standard cells), (2) simulate the functionality of optical components, and (3) map algorithms onto optical computer architectures. We have adopted a unified hierarchical approach for describing and implementing standard cells and optical parallel computers. First results show the usefulness of simulating even simple basic cells, whose functioning would otherwise be intractable. The next step will be the introduction of a new hierarchical layer which will combine different basic optical processing units in order to build complex optical computer architectures which can act as parallel MIMD architectures and which will allow coarse grain parallelisation of algorithms.
Acknowledgements
This work is supported by a joint IMEC/VUB project and a Human Capital and Mobility network Vision Algorithms and Optical Computer Architectures, contract no. ERBCHRXCT930382. The authors also wish to thank the Applied Physics department of the Vrije Universiteit Brussel.
References
[1] K.H. Brenner, A. Huang, N. Streibl. Digital optical computing with symbolic substitution. Applied Optics 25, p. 3054, 1986.
[2] M.J.B. Duff, T.J. Fountain. Cellular Logic Image Processing. Academic Press, 1986.
[3] P. Heremans, M. Kuijk, R. Vounckx, and G. Borghs. The Completely Depleted PnpN Optoelectronic Switch. Abstract submitted to Optical Computing 94, Edinburgh, August 1994.
[4] M. Kuijk, P. Heremans, R. Vounckx, and G. Borghs. The Double Heterostructure Optical Thyristor in Optical Information Processing Applications. Journal of Optical Computing 2, pp. 433-444, 1991.
[5] M. Kuijk, P. Heremans, R. Vounckx, and G. Borghs. Optoelectronic Switch Operating with 0.2 fJ/m2 at 15 MHz. Accepted for Optical Computing 94, Edinburgh, August 1994.
[6] N. Langloh, M. Kuijk, J. Cornelis, and R. Vounckx. An Architecture for a General Purpose Optical Computer Adapted to PnpN Devices. In: S.D. Smith and R.F. Neale (Eds.), Optical Information Technology: State of the Art Report, Springer Verlag, pp. 291-299, 1991.
[7] N. Langloh. A Simulator for Optical Parallel Computer Architectures. HCM ERBCHRXCT930382 note, Vrije Universiteit Brussel, 1994.
[8] M. Mertens. Een compiler voor beeldverwerkingsalgoritmen op een PnpN optische computer. Engineering Thesis, VUB, 1993.
[9] M. Mertens. A Simulator for Optical Parallel Computer Architectures: Description of a Standard Cell. HCM ERBCHRXCT930382 note, Vrije Universiteit Brussel, 1994.
[10] H. Thienpont, M. Kuijk, W. Peiffer, et al. Optical Data Transcription and Optical Logic with Differential Pairs of Optical Thyristors. Topical Meeting of the International Commission for Optics, Kyoto, Japan, April 4-8, 1994.
[11] C.A.R. Hoare. Communicating Sequential Processes. Prentice-Hall, 1985.
[12] Parallel C User Guide, Texas Instruments TMS320C40. 3L Ltd., 1992.
AUTHORS INDEX
Archambaud D., 155
Arioli M., 97
Arvind D.K., 203
Åström A., 215
Bouchitté V., 319
Boulet P., 319
Brown D.W., 37
Cardarilli G.C., 109
Catthoor F., 131
Catthoor F., 191
Champeau J., 245
Christopoulos C.A., 235
Clark J.J., 85
Cornelis J., 235
Cornelis J., 401
Le Pape L., 245
Lemaitre M., 307
Lojacono R., 109
Luksch P., 389
McWhirter J.G., 25
Megson G.M., 283
Mertens M., 401
Penné J., 155
Perrin G.-R., 341
Pirsch P., 179
Pirsch P., 353
Popp O., 167
Pottier B., 245
Proudler I.K., 25
Rangaswami R., 295