Embedded Multiprocessors
Scheduling and Synchronization

Sundararajan Sriram
Shuvra S. Bhattacharyya
Tsuhan Chen, Carnegie Mellon University
Sadaoki Furui, Tokyo Institute of Technology
Aggelos K. Katsaggelos, Northwestern University
S. Y. Kung, Princeton University
P. K. Raja Rajasekaran, Texas Instruments
John A. Sorenson, Technical University of Denmark
1. Digital Signal Processing for Multimedia Systems, edited by Keshab K. Parhi and Takao Nishitani
2. Multimedia Systems, Standards, and Networks, edited by Atul Puri and Tsuhan Chen
3. Embedded Multiprocessors: Scheduling and Synchronization, Sundararajan Sriram and Shuvra S. Bhattacharyya

Additional Volumes in Preparation

Signal Processing for Intelligent Sensor Systems, David C. Swanson
Compressed Video over Networks, edited by Ming-Ting Sun and Amy Reibman
Blind Equalization and Identification, Zhi Ding and Ye (Geoffrey) Li
Marcel Dekker, Inc.
New York • Basel
Sriram, Sundararajan
Embedded multiprocessors: scheduling and synchronization / Sundararajan Sriram, Shuvra S. Bhattacharyya.
p. cm. (Signal processing series; 3)
Includes bibliographical references and index.
ISBN 0-8247-9318-8 (alk. paper)
1. Embedded computer systems. 2. Multiprocessors. 3. Multimedia systems. 4. Scheduling. I. Bhattacharyya, Shuvra S. II. Title. III. Signal processing (Marcel Dekker, Inc.); 3.
TK7895.E42 S65 2000
004.16-dc21
00-0~2900
This book is printed on acid-free paper.
Marcel Dekker, Inc.
270 Madison Avenue, New York, NY 10016
tel: 212-696-9000; fax: 212-685-4540
Marcel Dekker AG
Hutgasse 4, Postfach 812, CH-4001 Basel, Switzerland
tel: 41-61-261-8482; fax: 41-61-261-8896
The publisher offers discounts on this book when ordered in bulk quantities. For more information, write to Special Sales/Professional Marketing at the headquarters address above.
Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher.
Current printing (last digit): 10 9 8 7 6 5 4 3 2 1
To my parents, and Uma
Sundararajan Sriram
To Arundhati
Shuvra S. Bhattacharyya
Over the past 50 years, digital signal processing has evolved as a major engineering discipline. The fields of signal processing have grown from the origin of fast Fourier transform and digital filter design to statistical spectral analysis and array processing, and image, audio, and multimedia processing, and shaped developments in high-performance VLSI signal processor design. Indeed, there are few fields that enjoy so many applications: signal processing is everywhere in our lives. When one uses a cellular phone, the voice is compressed, coded, and modulated using signal processing techniques. As a cruise missile winds along hillsides searching for the target, the signal processor is busy processing the images taken along the way. When we are watching a movie in HDTV, millions of audio and video data are being sent to our homes and received with unbelievable fidelity. When scientists compare DNA samples, fast pattern recognition techniques are being used. On and on, one can see the impact of signal processing in almost every engineering and scientific discipline.

Because of the immense importance of signal processing and the fast-growing demands of business and industry, this series on signal processing serves to report up-to-date developments and advances in the field. The topics of interest include but are not limited to the following:

Signal theory and analysis
Statistical signal processing
Speech and audio processing
Image and video processing
Multimedia signal processing and technology
Signal processing for communications
Signal processing architectures and VLSI design
I hope this series will provide the interested audience with high-quality, state-of-the-art signal processing literature through research monographs, edited books, and rigorously written textbooks by experts in their fields.

K. J. Ray Liu
[Figure: block diagram of an embedded multiprocessor containing two programmable DSPs (DSP 1, DSP 2), a microcontroller (MCU), and an ASIC.]
Embedded systems are computers that are not first and foremost computers. They are pervasive, appearing in automobiles, telephones, pagers, consumer electronics, toys, aircraft, trains, security systems, weapons systems, printers, modems, copiers, thermostats, manufacturing systems, appliances, etc. A technically active person today probably interacts regularly with more embedded systems than conventional computers. This is a relatively recent phenomenon. Not so long ago automobiles depended on finely tuned mechanical systems for the timing of ignition and its synchronization with other actions. It was not so long ago that modems were finely tuned analog circuits.

Embedded systems usually encapsulate domain expertise. Even small software programs may be very sophisticated, requiring deep understanding of the domain and of supporting technologies such as signal processing. Because of this, such systems are often designed by engineers who are classically trained in the domain, for example, in internal combustion engines or in communication theory. They have little background in the theory of computation, parallel computing, and concurrency theory. Yet they face one of the most difficult problems addressed by these disciplines, that of coordinating multiple concurrent activities in real time, often in a safety-critical environment. Moreover, they face these problems in a context that is often extremely cost-sensitive, mandating optimal designs, and time-critical, mandating rapid designs.

Embedded software is unique in that parallelism is routine. Most modems and cellular telephones, for example, incorporate multiple programmable processors. Moreover, embedded systems typically include custom digital and analog hardware that must interact with the software, usually in real time. That hardware operates in parallel with the processor that runs the software, and the software must interact with it much as it would interact with another software process running in parallel. Thus, in having to deal with real-time issues and parallelism, the designers of embedded software face on a daily basis problems that occur only in esoteric research in the broader field of computer science.
Computer scientists refer to the use of physically distinct computational resources (processors) as "parallelism," and to the logical property that multiple activities occur at the same time as "concurrency." Parallelism implies concurrency, but the reverse is not true. Almost all operating systems deal with concurrent execution, which is managed by multiplexing multiple processes or threads on a processor. A few also deal with parallelism, for example by mapping processes onto physically distinct processors. Typical embedded systems exhibit both concurrency and parallelism, but their context is different from that of general-purpose operating systems in many ways.

In embedded systems, concurrent tasks are often statically defined, largely fixed for the lifetime of the system. A cellular phone, for example, has distinct modes of operation (dialing, talking, standby, etc.), and in each mode of operation a well-defined set of tasks is concurrently active (speech encoding, etc.). The static structure of the concurrency permits much more detailed analysis and optimization than would be possible in a more dynamic environment. This book is about such analysis and optimization. The ordered transaction strategy, for example, leverages that relatively static structure of embedded software to dramatically reduce the synchronization overhead of communication between processors. It recognizes that embedded software is intrinsically less predictable than hardware and more predictable than general-purpose software. Indeed, minimizing synchronization overhead by exploiting static information about the application is the major theme of this book.

In general-purpose computation, communication is relatively expensive. Consider for example the interface between the audio hardware and the software of a typical personal computer today. Because the transaction costs are extremely high, data is extensively buffered, resulting in extremely long latencies. A path from the microphone of a PC into the software and back out to the speaker typically has latencies of hundreds of milliseconds. This severely limits the utility of the audio hardware of the computer. Embedded systems cannot tolerate such latencies.

A major theme of this book is communication between components. The methods given in the book are firmly rooted in a manipulable and tractable formalism, yet are directly applied to hardware design. The closely related IPC (interprocessor communication) graph and synchronization graph models, introduced in Chapters 7 and 9, capture the essential properties of this communication. Through the use of graph-theoretic properties of IPC and synchronization graphs,
optimization problems are formulated and solved. For example, the notion of resynchronization, where explicit synchronization operations are minimized through manipulation of the synchronization graph, proves to be an effective optimization tool.

In some ways, embedded software has more in common with hardware than with traditional software. Hardware is highly parallel. Conceptually, hardware is an assemblage of components that operate continuously or discretely in time and interact via synchronous or asynchronous communication. Software is an assemblage of components that trade off use of a CPU, operating sequentially, and communicating by leaving traces of their (past and completed) execution on a stack or in memory. Hardware is temporal. In the extreme case, analog hardware operates in a continuum, a computational medium that is totally beyond the reach of software. Communication is not just synchronous; it is physical and fluid. Software is sequential and discrete. Concurrency in software is about reconciling sequences. Concurrency in hardware is about reconciling signals. This book examines parallel software from the perspective of signals, and identifies joint hardware/software designs that are particularly well-suited for embedded systems.

The primary abstraction mechanism in software is the procedure (or the method in object-oriented designs). Procedures are terminating computations. The primary abstraction mechanism in hardware is a module that operates in parallel with the other components. These modules represent non-terminating computations. These are very different abstraction mechanisms. Hardware modules do not start, execute, complete, and return. They just are. In embedded systems, software components often have the same property. They do not terminate.

Conceptually, the distinction between hardware and software, from the perspective of computation, has only to do with the degree of concurrency and the role of time. An application with a large amount of concurrency and a heavy temporal content might as well be thought of as using the abstractions that have been successful for hardware, regardless of how it is implemented. An application that is sequential and ignores time might as well be thought of as using the abstractions that have succeeded for software, regardless of how it is implemented. The key problem becomes one of identifying the appropriate abstractions for representing the design. This book identifies abstractions that work well for the joint design of embedded software and the hardware on which it runs.

The intellectual content in this book is high. While some of the methods it describes are relatively simple, most are quite sophisticated. Yet examples are given that concretely demonstrate how these concepts can be applied in practical hardware architectures. Moreover, there is very little overlap with other books on parallel processing. The focus on application-specific processors and their use in
embedded systems leads to a rather different set of techniques. I believe that this book defines a new discipline. It gives a systematic approach to problems that engineers previously have been able to tackle only in an ad hoc manner.
Edward A. Lee
Professor
Department of Electrical Engineering and Computer Sciences
University of California at Berkeley
Berkeley, California
Software implementation of compute-intensive multimedia applications such as video conferencing systems, set-top boxes, and wireless mobile terminals and base stations is extremely attractive due to the flexibility, extensibility, and potential portability of programmable implementations. However, the data rates involved in many of these applications tend to be very high, resulting in relatively few processor cycles available per input sample for a reasonable processor clock rate. Employing multiple processors is usually the only means for achieving the requisite compute cycles without moving to a dedicated ASIC solution. With the levels of integration possible today, one can easily place four to six digital signal processors on a single die; such an integrated multiprocessor strategy is a promising approach for tackling the complexities associated with future systems-on-a-chip. However, it remains a significant challenge to develop software solutions that can effectively exploit such multiprocessor implementation platforms.

Due to the great complexity of implementing multiprocessor software, and the severe performance constraints of multimedia applications, the development of automatic tools for mapping high level specifications of multimedia applications into efficient multiprocessor realizations has been an active research area for the past several years. Mapping an application onto a multiprocessor system involves three main operations: assigning tasks to processors, ordering tasks on each processor, and determining the time at which each task begins execution. These operations are collectively referred to as scheduling the application on the given architecture. A key aspect of the multiprocessor scheduling problem for multimedia system implementation that differs from classical scheduling contexts is the central role of interprocessor communication: the efficient management of data transfer between communicating tasks that are assigned to different processors. Since the overall costs of interprocessor communication can have a dramatic impact on execution speed and power consumption, effective handling of interprocessor communication is crucial to the development of cost-effective multiprocessor implementations.

This book reviews important research in three key areas related to multiprocessor implementation of multimedia systems, and it also exposes important synergies between efforts related to these areas. Our areas of focus are the incorporation of interprocessor communication costs into multiprocessor scheduling decisions; a modeling methodology, called the "synchronization
graph," for multiprocessor system performance analysis; and the application of the synchronization graph model to the development of hardware and software optimizations that can significantly reduce the interprocessor communication overhead of a given schedule.

More specifically, this book reviews, in a unified manner, several important multiprocessor scheduling strategies that effectively incorporate the consideration of interprocessor communication costs, and highlights the variety of techniques employed in these multiprocessor scheduling strategies to take interprocessor communication into account. The book also reviews a body of research performed by the authors on modeling implementations of multiprocessor schedules, and on the use of these modeling techniques to optimize interprocessor communication costs. A unified framework is then presented for applying arbitrary scheduling strategies in conjunction with the application of alternative optimization algorithms that address specific subproblems associated with implementing a given schedule. We provide several examples of practical applications that demonstrate the relevance of the techniques described in this book.

We are grateful to the Signal Processing Series Editor Professor K. J. Ray Liu (University of Maryland, College Park) for his encouragement of this project, and to Executive Acquisition Editor B. J. Clark (Marcel Dekker, Inc.) for his coordination of the effort. It was a privilege for both of us to be students of Professor Edward A. Lee (University of California at Berkeley). Edward provided a truly inspiring research environment during our doctoral studies, and gave valuable feedback while we were developing many of the concepts that underlie the material in this book. We also acknowledge helpful proofreading assistance from Nitin Chandrachoodan, Mukul Khandelia, and Vida Kianzad (University of Maryland at College Park); enlightening discussions with Dick Stevens (U.S. Naval Research Laboratory) and Praveen Murthy (Angeles Design Systems). Financial support (for S. S. Bhattacharyya) for the development of this book was provided by the National Science Foundation.

Sundararajan Sriram
Shuvra S. Bhattacharyya
Contents

Series Introduction (K. J. Ray Liu)
Foreword (Edward A. Lee)
Preface

1 INTRODUCTION
1.1 Multiprocessor DSP systems
1.2 Application-specific multiprocessors
1.3 Exploitation of parallelism
1.4 Dataflow modeling for DSP design
1.5 Utility of dataflow for DSP
1.6 Overview

2 APPLICATION-SPECIFIC MULTIPROCESSORS
2.1 Parallel architecture classifications
2.2 Exploiting instruction level parallelism
2.2.1 ILP in programmable DSP processors
2.2.2 Sub-word parallelism
2.2.3 VLIW processors
2.3 Dataflow DSP architectures
2.4 Systolic and wavefront arrays
2.5 Multiprocessor DSP architectures
2.6 Single chip multiprocessors
2.7 Reconfigurable computing
2.8 Architectures that exploit predictable IPC
2.9 Summary

3 BACKGROUND TERMINOLOGY AND NOTATION
3.1 Graph data structures
3.2 Dataflow graphs
3.3 Computation graphs
3.4 Petri nets
3.5 Synchronous dataflow
3.6 Analytical properties of SDF graphs
3.7 Converting a general SDF graph into a homogeneous SDF graph
3.8 Acyclic precedence expansion graph
3.9 Application graph
3.10 Synchronous languages
3.11 HSDFG concepts and notations
3.12 Complexity of algorithms
3.13 Shortest and longest paths in graphs
3.13.1 Dijkstra's algorithm
3.13.2 The Bellman-Ford algorithm
3.13.3 The Floyd-Warshall algorithm
3.14 Solving difference constraints using shortest paths
3.15 Maximum cycle mean
3.16 Summary

4 MULTIPROCESSOR SCHEDULING MODELS
4.1 Task-level parallelism and data parallelism
4.2 Static versus dynamic scheduling strategies
4.3 Fully-static schedules
4.4 Self-timed schedules
4.5 Dynamic schedules
4.6 Quasi-static schedules
4.7 Schedule notation
4.8 Unfolding HSDF graphs
4.9 Execution time estimates and static schedules
4.10 Summary

5 IPC-CONSCIOUS SCHEDULING ALGORITHMS
5.1 Problem description
5.2 Stone's assignment algorithm
5.3 List scheduling algorithms
5.3.1 Graham's bounds
5.3.2 The basic algorithms HLFET and ETF
5.3.3 The mapping heuristic
5.3.4 Dynamic level scheduling
5.3.5 Dynamic critical path scheduling
5.4 Clustering algorithms
5.4.1 Linear clustering
5.4.2 Internalization
5.4.3 Dominant sequence clustering
5.4.4 Declustering
5.5 Integrated scheduling algorithms
5.6 Pipelined scheduling
5.7 Summary

6 THE ORDERED-TRANSACTIONS STRATEGY
6.1 The ordered-transactions strategy
6.2 Shared bus architecture
6.3 Interprocessor communication mechanisms
6.4 Using the ordered-transactions approach
6.5 Design of an ordered memory access multiprocessor
6.5.1 High level design description
6.5.2 A modified design
6.6 Design details of a prototype
6.6.1 Top level design
6.6.2 Transaction order controller
6.6.3 Host interface
6.6.4 Processing element
6.6.5 FPGA circuitry
6.6.6 Shared memory
6.6.7 Connecting multiple boards
6.7 Hardware and software implementation
6.7.1 Board design
6.7.2 Software interface
6.8 Ordered I/O and parameter control
6.9 Application examples
6.9.3 1024 point complex Fast Fourier Transform (FFT)
6.10 Summary

7 ANALYSIS OF THE ORDERED-TRANSACTIONS STRATEGY
7.1 Inter-processor communication graph (Gipc)
7.2 Execution time estimates
7.3 Ordering constraints viewed as edges added to Gipc
7.4 Periodicity
7.5 Optimal order
7.6 Effects of changes in execution times
7.6.1 Deterministic case
7.6.2 Modeling run-time variations in execution times
7.6.3 Bounds on the average iteration period
7.6.4 Implications for the ordered transactions schedule
7.7 Summary

8 EXTENDING THE OMA ARCHITECTURE
8.1 The Boolean dataflow model
8.1.1 Scheduling
8.2 Parallel implementation on shared memory machines
8.2.1 General strategy
8.2.2 Implementation on the OMA
8.2.3 Improved mechanism
8.2.4 Generating the annotated bus access list
8.3 Data-dependent iteration
8.4 Summary

9 SYNCHRONIZATION IN SELF-TIMED SYSTEMS
9.1 The barrier MIMD technique
9.2 Redundant synchronization removal in non-iterative dataflow
9.3 Analysis of self-timed execution
9.3.1 Estimated throughput
9.4 Strongly connected components and buffer size bounds
9.5 Synchronization model
9.5.1 Synchronization protocols
9.5.2 The synchronization graph Gs
9.6 A synchronization cost metric
9.7 Removing redundant synchronizations
9.7.1 The independence of redundant synchronizations
9.7.2 Removing redundant synchronizations
9.7.3 Comparison with Shaffer's approach
9.7.4 An example
9.8 Making the synchronization graph strongly connected
9.8.1 Adding edges to the synchronization graph
9.9 Insertion of delays
9.9.1 Analysis of DetermineDelays
9.9.2 Delay insertion example
9.9.3 Extending the algorithm
9.9.4 Complexity
9.9.5 Related work
9.10 Summary

10 RESYNCHRONIZATION
10.1 Definition of resynchronization
10.2 Properties of resynchronization
10.3 Relationship to set covering
10.4 Intractability of resynchronization
10.5 Heuristic solutions
10.5.1 Applying set-covering techniques to pairs of SCCs
10.5.2 A more flexible approach
10.5.3 Unit-subsumption resynchronization edges
10.5.4 Example
10.5.5 Simulation approach
10.6 Chainable synchronization graphs
10.6.1 Chainable synchronization graph SCCs
10.6.2 Comparison to the Global-Resynchronize heuristic
10.6.3 A generalization of the chaining technique
10.6.4 Incorporating the chaining technique
10.7 Resynchronization of constraint graphs for relative scheduling
10.8 Summary

11 LATENCY-CONSTRAINED RESYNCHRONIZATION
11.1 Elimination of synchronization edges
11.2 Latency-constrained resynchronization
11.3 Intractability of LCR
11.4 Two-processor systems
11.4.1 Interval covering
11.4.2 Two-processor latency-constrained resynchronization
11.4.3 Taking delays into account
11.5 A heuristic for general synchronization graphs
11.5.1 Customization to transparent synchronization graphs
11.5.2 Complexity
11.5.3 Example
11.6 Summary

12 INTEGRATED SYNCHRONIZATION OPTIMIZATION
12.1 Computing buffer sizes
12.2 A framework for self-timed implementation
12.3 Summary

FUTURE RESEARCH DIRECTIONS

Bibliography
Index
The focus of this book is the exploration of architectures and design methodologies for application-specific parallel systems in the general context of embedded applications in digital signal processing (DSP). In this context, such multiprocessors typically consist of one or more central processing units (micro-controllers or programmable digital signal processors), and one or more application-specific hardware components (implemented as custom application-specific integrated circuits (ASICs) or reconfigurable logic such as field programmable gate arrays (FPGAs)). Such embedded multiprocessor systems are becoming increasingly common today in applications ranging from digital audio/video equipment to portable devices such as cellular phones and personal digital assistants. With increasing levels of integration, it is now feasible to integrate such heterogeneous systems entirely on a single chip. The design task of such multiprocessor systems-on-a-chip is complex, and the complexity will only increase in the future. One of the critical issues in the design of embedded multiprocessors is managing communication and synchronization overhead between the heterogeneous processing elements. This book discusses systematic techniques aimed at reducing this overhead in multiprocessors that are designed to be application-specific. The scope of this book includes both hardware techniques for minimizing this overhead based on compile time analysis, as well as software techniques for strategically designing synchronization points in a multiprocessor implementation with the objective of reducing synchronization overhead. The techniques presented here apply to DSP algorithms that involve a predictable control structure; the precise domain of applicability of these techniques will be formally stated shortly.

Applications in signal, image, and video processing require large computing power and have real-time performance requirements. The computing engines in such applications tend to be embedded as opposed to general-purpose. Custom
VLSI implementations are usually preferred in such high throughput applications. However, custom approaches have the well known problems of long design cycles (the advances in high-level VLSI synthesis notwithstanding) and low flexibility in the final implementation. Programmable solutions are attractive in both of these respects: the programmable core needs to be verified for correctness only once, and design changes can be made late in the design cycle by modifying the software program. Although verifying the embedded software to be run on a programmable part is also a hard problem, in most situations changes late in the design cycle (and indeed even after the system design is completed) are much easier and cheaper to make in the case of software than in the case of hardware. Special processors are available today that employ an architecture and an instruction set tailored towards signal processing. Such software programmable integrated circuits are called "Digital Signal Processors" (DSP chips or DSPs for short). The special features that these processors employ are discussed extensively by Lapsley, Bier, Shoham, and Lee [LBSL94]. However, a single processor, even a DSP, often cannot deliver the performance requirement of some applications. In these cases, use of multiple processors is an attractive solution, where both the hardware and the software make use of the application-specific nature of the task to be performed.

For a multiprocessor implementation of embedded real-time DSP applications, reducing interprocessor communication (IPC) costs and synchronization costs becomes particularly important, because there is usually a premium on processor cycles in these situations. For example, consider processing of video images in a video-conferencing application. Video-conferencing typically involves Quarter-CIF (Common Intermediate Format) images; this format specifies data rates of 30 frames per second, with each frame containing 144 lines and 176 pixels per line. The effective sampling rate of the Quarter-CIF video signal is 0.76 Megapixels per second. The highest performance programmable DSP processor available as of this writing (1999) has a cycle time of 5 nanoseconds; this allows about 260 instruction cycles per processor for processing each sample of the video signal sampled at 0.76 MHz. In a multiprocessor scenario, IPC can potentially waste these precious processor cycles, negating some of the benefits of using multiple processors. In addition to processor cycles, IPC also wastes power since it involves access to shared resources such as memories and busses. Thus reducing IPC costs also becomes important from a power consumption perspective for portable devices.
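The cycle budget quoted above follows directly from the frame format and the processor clock rate. The short C program below makes the arithmetic explicit; it is only a worked-example sketch, using the frame dimensions and the 5 nanosecond cycle time given in the text.

    #include <stdio.h>

    /* Worked example: processor cycles available per QCIF video sample. */
    int main(void) {
        const double frames_per_sec  = 30.0;
        const double lines_per_frame = 144.0;
        const double pixels_per_line = 176.0;
        const double cycle_time_sec  = 5e-9;   /* 5 ns, i.e., a 200 MHz clock */

        double samples_per_sec   = frames_per_sec * lines_per_frame * pixels_per_line;
        double cycles_per_sample = (1.0 / cycle_time_sec) / samples_per_sec;

        /* Prints roughly 0.76 Mpixels/s and about 263 cycles per sample. */
        printf("sample rate: %.2f Mpixels/s\n", samples_per_sec / 1e6);
        printf("cycle budget: %.0f cycles/sample\n", cycles_per_sample);
        return 0;
    }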
Over the past few years several companies have offered boards consisting of multiple DSPs. More recently, semiconductor companies have been offering
chips that integrate multiple DSP engines on a single die. Examples of such integrated multiprocessor DSPs include commercially available products such as the Texas Instruments TMS320C80 multi-DSP [GGV92], the Philips Trimedia processor [RS98], and the Adaptive Solutions CNAPS processor. The Hydra research at Stanford [HO98] is another example of an effort focussed on single-chip multiprocessors.

Multiprocessor DSPs are likely to be increasingly popular in the future for a variety of reasons. First, VLSI technology today enables one to "stamp" 4-5 standard DSPs onto a single die; this trend is certain to continue in the coming years. Such an approach is expected to become increasingly attractive because it reduces the testing time for the increasingly complex VLSI systems of the future. Second, since such a device is programmable, tooling and testing costs of building an ASIC (application-specific integrated circuit) for each different application are saved by using such a device for many different applications. This advantage of DSPs is going to be increasingly important as circuit integration levels continue their dramatic ascent. Third, although there has been reluctance in adopting automatic compilers for embedded DSPs, such parallel DSP products make the use of automated tools feasible; with a large number of processors per chip, one can afford to give up some processing power to the inefficiencies in the automatic tools. In addition, new techniques are being researched to make the process of automatically mapping a design onto multiple processors more efficient; the research results discussed in this book are also attempts in that direction. This situation is analogous to how logic designers have embraced automatic logic synthesis tools in recent years: logic synthesis tools and VLSI technology have improved to the point that the chip area saved by manual design over automated design is not worth the extra design time involved; one can afford to "waste" a few gates, just as one can afford to waste a limited amount of processor cycles to compilation inefficiencies in a multiprocessor DSP system. Finally, a proliferation of telecommunication standards and signal formats, often giving rise to multiple standards for the very same application, makes software implementation extremely attractive. Examples of applications in this category include set-top boxes capable of recognizing a variety of audio/video formats and compression standards, modems supporting multiple standards, multi-mode cellular phones and base stations that work with multiple cellular standards, multimedia workstations that are required to run a variety of different multimedia software products, and programmable audio/video codecs. Integrated multiprocessor DSP systems provide a very flexible software platform for this rapidly-growing family of applications.
A natural generalization of such fully-programmable, multiprocessor inte-
grated circuits is the class of multiprocessor systems that consists of an arbitrary,
possibly heterogeneous collection of programmable processors as well as a
set of zero or more custom hardware elements on a single chip. Mapping applications onto such an architecture is then a hardware/software codesign problem. However, the problems of interprocessor communication and synchronization are, for the most part, identical to those encountered in fully-programmable systems. In this book, when we refer to a "multiprocessor," we will imply an architecture that, as described above, may be comprised of different types of programmable processors, and may include custom hardware elements. Additionally, the multiprocessor systems that we address in this book may be packaged in a single integrated circuit chip, or may be distributed across multiple chips. All of the techniques that we present in this book apply to this general class of parallel processing architectures.
Although this book addresses a broad range of parallel architectures, it focuses on the design of such architectures in the context of specific, well-defined families of applications. We focus on application-specific parallel processors instead of applying the ideas to general purpose parallel systems because such systems are typically components of embedded applications, and the computational characteristics of embedded applications are fundamentally different from those of general-purpose systems. General purpose parallel computation involves user-programmable computing devices, which can be conveniently configured for a wide variety of purposes, and can be re-configured any number of times as the user's needs change. Computation in an embedded application, however, is usually one-time programmed by the designer of that embedded system (a digital cellular radio handset, for example) and is not meant to be programmable by the end user. Also, the computation in embedded systems is specialized (the computation in a cellular radio handset involves specific DSP functions such as speech compression, channel equalization, modulation, etc.), and the designers of embedded multiprocessor hardware typically have specific knowledge of the applications that will be developed on the platforms that they develop. In contrast, architects of general purpose computing systems cannot afford to customize their hardware too heavily for any specific class of applications. Thus, only designers of embedded systems have the opportunity to accurately predict and optimize for the specific application subsystems that will be executing on the hardware that they develop. However, if only general purpose implementation techniques are used in the development of an embedded system, then the designers of that embedded system lose this opportunity.
Furthermore, embedded applications face very different constraints compared to general purpose computation. Non-recurring design costs, competitive time-to-market constraints, limitations on the amount and placement of memory, constraints on power consumption, and real-time performance requirements are a few examples. Thus for an embedded application, it is critical to apply techniques for design and implementation that exploit the special characteristics of the application in order to optimize for the specific set of constraints that must be satisfied. These techniques are naturally centered around design methodologies that tailor the hardware and software implementation to the particular application.
Parallel computation has of course been a topic of active research in computer science for the past several decades. Whereas parallelism within a single processor has been successfully exploited (instruction-level parallelism), the problem of partitioning a single user program onto multiple such processors is yet to be satisfactorily solved. Although the hardware for the design of multiple processor machines (the memory, interconnection network, input/output subsystems, etc.) has received much attention, efficient partitioning of a general program (written in C, for example) across a given set of processors arranged in a particular configuration is still an open problem. The need to detect parallelism from within the overspecified sequencing in popular imperative languages such as C, the need to manage overhead due to communication and synchronization between processors, and the requirement of dynamic load balancing for some programs (an added source of overhead) complicate the partitioning problem for a general program.

If we turn from general purpose computation to application-specific domains, however, parallelism is often easier to identify and exploit. This is because much more is known about the computational structure of the functionality being implemented. In such cases, we do not have to rely on the limited ability of automated tools to deduce this high-level structure from generic, low-level specifications (for instance, from a general purpose programming language such as C). Instead, it may be possible to employ specialized computational models such as one of the numerous variants of dataflow and finite state machine models that expose relevant structure in our targeted applications, and greatly facilitate the manual or automatic derivation of optimized implementations. Such specification models will be unacceptable in a general-purpose context due to their limited applicability, but they present a tremendous opportunity to the designer of embedded applications. The use of specialized computational models, particularly dataflow-based models, is especially prevalent in the DSP domain.
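As a small illustration of this over-specification (the fragment below is hypothetical and not drawn from the book), the two loops carry no data dependence on each other, yet C's sequential semantics fixes an execution order that a parallelizing compiler must first prove irrelevant; a dataflow specification would record only the true dependences and leave the two stages free to run on different processors.

    /* Hypothetical fragment: the two stages read disjoint inputs and write
     * disjoint outputs, so they could execute in parallel, but the imperative
     * program text orders them and hides that freedom. */
    void independent_stages(const float *a, const float *b,
                            float *y, float *z, int n) {
        for (int i = 0; i < n; i++)
            y[i] = 2.0f * a[i];        /* stage 1: depends only on a[] */
        for (int i = 0; i < n; i++)
            z[i] = b[i] * b[i];        /* stage 2: depends only on b[] */
    }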
Similarly, focusing on a particular application domain may inspire the discovery of highly streamlined system architectures. For example, one of the most extensively studied families of application-specific parallel processors is the class of systolic array architectures [Kun88][Rao85]. These architectures consist of regularly arranged arrays of processors that communicate locally, onto which a certain class of applications, specified in a mathematical form, can be systematically mapped. Systolic arrays are further discussed in Chapter 2.
1.4 Dataflow modeling for DSP design

The necessary elements in the study of application-specific computer architectures are: 1) a clearly defined set of problems that can be solved using the particular application-specific approach, 2) a formal mechanism for specification of these applications, and 3) a systematic approach for designing hardware and software from such a specification. In this book we focus on embedded signal, image, and video signal processing applications, and a specification model called Synchronous Dataflow that has proven to be very useful for design of such applications.

Dataflow is a well-known programming model in which a program is represented as a set of tasks with data precedences. Figure 1.1 shows an example of a dataflow graph, where computation tasks (actors) A, B, C, and D are represented as circles, and arrows (or arcs) between actors represent FIFO (first-in-first-out) queues that direct data values from the output of one computation to the input of another. Figure 1.2 shows the semantics of a dataflow graph. Actors consume data (or tokens, represented as bullets in Figure 1.2) from their inputs, perform computations on them (fire), and produce a certain number of tokens on their outputs. The functions performed by the actors define the overall function of the dataflow graph.
Figure 1.1. An example of a dataflow graph.
For example, in Figure 1.1, A and B could be data sources, C could be a simple addition operation, and D could be a data sink. Then the function of the dataflow graph would be simply to output the sum of two input tokens. Dataflow graphs are a very useful specification mechanism for signal processing systems since they capture the intuitive expressivity of block diagrams, flow charts, and signal flow graphs, while providing the formal semantics needed for system design and analysis tools. The applications we focus on are those that can be described by Synchronous Dataflow (SDF) [LM87] and its extensions; we will discuss the formal computational model in detail in Chapter 3.

SDF in its pure form can only represent applications that have no decision making at the task level. Extensions of SDF (such as the Boolean dataflow (BDF) model [Lee91][Buc93]) allow control constructs, so that data-dependent control flow can be expressed in such models. These models are significantly more powerful in terms of expressivity, but they give up some of the useful analytical properties possessed by the SDF model. For instance, Buck shows that it is possible to simulate any Turing machine in the BDF model [Buc93]. The BDF model can therefore compute all Turing computable functions, whereas this is not possible in the case of the SDF model. We further discuss the Boolean dataflow model in Chapter 8.
Figure 1.2. Actor "firing".
In exchange for the limited expressivity of an SDF representation, we can efficiently check conditions such as whether a given SDF graph deadlocks, and whether it can be implemented using a finite amount of memory. No such general procedures can be devised for checking the corresponding conditions (deadlock behavior and bounded memory usage) for a computation model that can simulate any given Turing machine. This is because the problems of determining if any given Turing machine halts (the halting problem), and determining whether it will use less than a given amount of memory (or tape), are undecidable; that is, no general algorithm exists to solve these problems in finite time.

In this work, we first focus on techniques that apply to SDF applications, and we will propose extensions to these techniques for applications that can be specified essentially as SDF, but augmented with a limited number of control constructs (and hence fall into the BDF model). SDF has proven to be a useful model for representing a significant class of DSP algorithms; several computer-aided design tools for DSP have been developed around SDF and closely related models. Examples of commercial tools based on SDF are the Signal Processing Worksystem (SPW) from Cadence [PLN92][BL91], and COSSAP from Synopsys [RPM92]. Tools developed at various universities that use SDF and related models include Ptolemy [PHLB95a], the Warp compiler [Pri92], DESCARTES [RPM92], GRAPE [LEAP94], and the Graph Compiler [VPS90].
Figure 1.3. A block diagram specification of a DSP system in Cadence Signal Processing Worksystem (SPW).
Figure 1.3 shows an example of a DSP system specified as a block diagram in Cadence SPW.
The SDF model is popular because it has certain analytical properties that are useful in practice; we will discuss these properties and how they arise in the following section. The most important property of SDF graphs in the context of this book is that it is possible to effectively exploit parallelism in an algorithm specified as an SDF graph by scheduling computations in the SDF graph onto multiple processors at compile or design time rather than at run-time. Given such a schedule that is determined at compile time, we can extract information from it with a view towards optimizing the final implementation. In this book we present techniques for minimizing synchronization and inter-processor communication overhead in statically (i.e., compile time) scheduled multiprocessors in which the program is derived from a dataflow graph specification. The strategy is to model the run-time execution of such a multiprocessor to determine how processors communicate and synchronize, and then to use this information to optimize the final implementation.
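To make the firing semantics and the role of a compile-time schedule concrete, the sketch below encodes the four-actor graph of Figure 1.1 in C: actors exchange tokens only through FIFO queues, and a firing order fixed in the source code plays the part of a static schedule. The buffer size, the actor bodies, and the schedule itself are illustrative assumptions rather than material taken from any particular tool.

    #include <stdio.h>

    /* Arcs of the Figure 1.1 graph (A -> C, B -> C, C -> D) modeled as
     * small FIFO queues of integer tokens. */
    #define CAP 8
    typedef struct { int data[CAP]; int head, count; } Fifo;
    static void push(Fifo *f, int tok) { f->data[(f->head + f->count++) % CAP] = tok; }
    static int  pop (Fifo *f) { int t = f->data[f->head]; f->head = (f->head + 1) % CAP; f->count--; return t; }

    static Fifo ac, bc, cd;

    /* Each firing consumes and produces a fixed number of tokens (one each
     * here), which is what makes the graph synchronous dataflow. */
    static void fire_A(int i) { push(&ac, i); }                        /* source */
    static void fire_B(int i) { push(&bc, 2 * i); }                    /* source */
    static void fire_C(void)  { push(&cd, pop(&ac) + pop(&bc)); }      /* adder  */
    static void fire_D(void)  { printf("D consumed %d\n", pop(&cd)); } /* sink   */

    int main(void) {
        /* The firing order A B C D is fixed at "compile time"; no run-time
         * scheduling decisions are needed. */
        for (int i = 0; i < 4; i++) {
            fire_A(i); fire_B(i); fire_C(); fire_D();
        }
        return 0;
    }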
As mentioned before, dataflow models such as SDF (and other closely related models) have proven to be useful for specifying applications in signal processing and communications, with the goal of both simulation of the algorithm at the functional or behavioral level, and synthesis from such a high level specification to a software description (e.g., a C program), a hardware description (e.g., VHDL), or a combination thereof. The descriptions thus generated can then be compiled down to the final implementation, e.g., an embedded processor or an ASIC. One of the reasons for the popularity of such dataflow based models is that they provide a formalism for block-diagram based visual programming, which is a very intuitive specification mechanism for DSP; the expressivity of the SDF model sufficiently encompasses a significant class of DSP applications, including multirate applications that involve upsampling and downsampling operations. An equally important reason for employing dataflow is that such a specification exposes parallelism in the program. It is well known that imperative programming styles such as C and FORTRAN tend to over-specify the control structure of a given computation, and compilation of such specifications onto parallel architectures is known to be a hard problem. Dataflow on the other hand imposes minimal data-dependency constraints in the specification, potentially enabling a compiler to detect parallelism very effectively. The same argument holds for hardware synthesis, where it is also important to be able to specify and exploit concurrency.
The SDF model has also proven to be useful for compiling DSP applications on single processors. Programmable digital signal processing chips tend to have special instructions such as a single cycle multiply-accumulate (for filtering functions), modulo addressing (for managing delay lines), and bit-reversed addressing (for FFT computation). DSP chips also contain built-in parallel functional units that are controlled from fields in the instruction (such as parallel moves from memory to registers combined with an ALU operation). It is difficult for automatic compilers to optimally exploit these features; executable code generated by commercially available compilers today utilizes one-and-a-half to two times the program memory that a corresponding hand optimized program requires, and results in two to three times higher execution time compared to hand-optimized code [ZVSM95]. There are however significant research efforts underway that are narrowing this gap; for example, see [LDK95][SM~97]. Moreover, some of the newer DSP architectures, such as the Texas Instruments TMS320C60, are more compiler friendly than past DSP architectures; automatic compilers for these processors often rival hand optimized assembly code for many standard DSP benchmarks.

Block diagram languages based on models such as SDF have proven to be a bridge between automatic compilation and hand coding approaches; a library of reusable blocks in a particular programming language is hand coded, and this library then constitutes the set of atomic SDF actors. Since the library blocks are reusable, one can afford to carefully optimize and fine tune them. The atomic blocks are fine to medium grain in size; an atomic actor in the SDF graph may implement anything from a filtering function to a two input addition operation. The final program is then automatically generated by concatenating code corresponding to the blocks in the program according to the sequence prescribed by a schedule. This approach is mature enough that there are commercial tools available today, for example the SPW and COSSAP tools mentioned earlier, that employ this technique. Powerful optimization techniques have been developed for generating sequential programs from SDF graphs that optimize for metrics such as program and data memory usage, the run-time efficiency of buffering code, and context switching overhead between sub-tasks [BML96].
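The sketch below suggests what such generated code looks like: each library block is an ordinary hand-written C function, and the tool emits calls to the blocks in the order prescribed by the schedule, wiring them together through statically allocated buffers. The block names, the schedule (2 src)(2 fir)(1 sink), and the buffer sizes are made-up illustrations, not the output of any particular tool.

    #include <stdio.h>

    static float buf_src_fir[2];    /* arc: src -> fir  */
    static float buf_fir_sink[2];   /* arc: fir -> sink */

    /* Hand-coded library blocks (the atomic SDF actors). */
    static void src (float *out)                  { static float t; *out = t++; }
    static void fir (const float *in, float *out) { *out = 0.5f * (*in); }
    static void sink(const float *in)             { printf("%f %f\n", in[0], in[1]); }

    int main(void) {
        for (int i = 0; i < 4; i++) {                 /* iterations of the periodic schedule */
            src(&buf_src_fir[0]);                     /* (2 src) */
            src(&buf_src_fir[1]);
            fir(&buf_src_fir[0], &buf_fir_sink[0]);   /* (2 fir) */
            fir(&buf_src_fir[1], &buf_fir_sink[1]);
            sink(buf_fir_sink);                       /* sink consumes two tokens per firing */
        }
        return 0;
    }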
Scheduling is a fundamental operation that must be performed in order to implement SDF graphs on both uniprocessors as well as multiprocessors. Uniprocessor scheduling simply refers to determining a sequence of execution of actors such that all precedence constraints are met and all the buffers between actors (corresponding to arcs) return to their initial states. Multiprocessor scheduling involves determining the mapping of actors to available processors, in addition to determining the sequence in which actors execute. We discuss the issues involved in multiprocessor scheduling in subsequent chapters.
1.6 Overview

The following chapter describes examples of application specific multiprocessors used for signal processing applications. Chapter 3 lays down the formal notation and definitions used in the remainder of this book for modeling run-time synchronization and interprocessor communication. Chapter 4 describes scheduling models that are commonly employed when scheduling dataflow graphs on multiple processors. Chapter 5 describes scheduling algorithms that attempt to maximize performance while accurately taking interprocessor communication costs into account.

Chapters 6 and 7 describe a hardware based technique for minimizing IPC and synchronization costs; the key idea in these chapters is to predict the pattern of processor accesses to shared resources and to enforce this pattern during run-time. We present the hardware design and implementation of a four processor machine, the Ordered Memory Access Architecture (OMA). The OMA is a shared bus multiprocessor that uses shared memory for IPC. The order in which processors access shared memory for the purpose of communication is predetermined at compile time and enforced by a bus controller on the board, resulting in a low-cost IPC mechanism without the need for explicit synchronization. This scheme is termed the Ordered Transactions strategy. In Chapter 7, we present a graph theoretic scheme for modeling the run-time synchronization behavior of multiprocessors using a structure we call the IPC graph, which takes into account the processor assignment and ordering constraints that a self-timed schedule specifies. We also discuss the effect of run-time variations in execution times of tasks on the performance of a multiprocessor implementation.
In Chapter 8, we discuss ideas for extending the Ordered Transactions strategy to models more powerful than SDF, for example, the Boolean dataflow (BDF) model. The strategy here is to assume we have only a small number of control constructs in the SDF graph and explore techniques for this case. The domain of applicability of compile time optimization techniques can be extended to programs that display some dynamic behavior in this manner, without having to deal with the complexity of tackling the general BDF model.

The ordered memory access approach discussed in Chapters 6 to 8 requires special hardware support. When such support is not available, we can utilize a set of software-based approaches to reduce synchronization overhead. These techniques for reducing synchronization overhead consist of efficient algorithms that minimize the overall synchronization activity in the implementation of a given self-timed schedule. A straightforward multiprocessor implementation of a dataflow specification often includes redundant synchronization points, i.e., the objective of a certain set of synchronizations is guaranteed as a side effect
of other synchronization points in the system. Chapter 9 discusses efficient algorithms for detecting and eliminating such redundant synchronization operations. We also discuss a graph transformation called Convert-to-SC-graph that allows the use of more efficient synchronization protocols.

It is also possible to reduce the overall synchronization cost of a self-timed implementation by adding synchronization points between processors that were not present in the schedule specified originally. In Chapter 10, we discuss a technique, called resynchronization, for systematically manipulating synchronization points in this manner. Resynchronization is performed with the objective of improving throughput of the multiprocessor implementation. Frequently in real-time signal processing systems, latency is also an important issue, and although resynchronization improves the throughput, it generally degrades (increases) the latency. Chapter 10 addresses the problem of resynchronization under the assumption that an arbitrary increase in latency is acceptable. Such a scenario arises when the computations occur in a feedforward manner, e.g., audio/video decoding for playback from media such as Digital Versatile Disk (DVD), and also for a wide variety of simulation applications. Chapter 11 examines the relationship between resynchronization and latency, and addresses the problem of optimal resynchronization when only a limited increase in latency is tolerable. Such latency constraints are present in interactive applications such as video conferencing and telephony, where beyond a certain point the latency becomes annoying to the user. In voice telephony, for example, the round trip delay of the speech signal is kept below about 100 milliseconds to achieve acceptable quality.

The ordered memory access strategy discussed in Chapters 6 through 8 can be viewed as a hardware approach that optimizes for IPC and synchronization overhead in statically scheduled multiprocessor implementations. The synchronization optimization techniques of Chapters 9 through 12, on the other hand, operate at the level of a scheduled parallel program by altering the synchronization structure of a given schedule to minimize the synchronization overhead in the final implementation. Throughout the book, we illustrate the key concepts by applying them to examples of practical systems.
Chapter 2
elements could themselves be self-contained processors that exploit parallelism within themselves. In the latter case, we can view the parallel program as being split into multiple threads of computation, where each thread is assigned to a processing element. The processing element itself could be a traditional von Neumann-type Central Processing Unit (CPU), sequentially executing instructions fetched from a central instruction storage, or it could employ instruction level parallelism (ILP) to realize high performance by executing in parallel multiple instructions in its assigned thread.

The interconnection mechanism between processors is clearly crucial to the performance of the machine on a given application. For fine-grained and instruction level parallelism support, communication often occurs through a simple mechanism such as a multi-ported register file. For machines composed of more sophisticated processors, a large variety of interconnection mechanisms have been employed, ranging from a simple shared bus to 3-dimensional meshes and hyper-trees [Lei92]. Embedded applications often employ simple structures such as hierarchical busses or small crossbars.

The two main flavors of ILP are superscalar and VLIW (Very Long Instruction Word) [PH96]. Superscalar processors (e.g., the Intel Pentium processor) contain multiple functional units (ALUs, floating point units, etc.); instructions are brought into the machine sequentially and are scheduled dynamically by the processor hardware onto the available functional units. Out-of-order execution of instructions is also supported. VLIW processors, on the other hand, rely on a compiler to statically schedule instructions onto functional units; the compiler determines exactly what operation each functional unit performs in each instruction cycle. The "long instruction word" arises because the instruction word must specify the control information for all the functional units in the machine. Clearly, a VLIW model is less flexible than a superscalar approach; however, the implementation cost of VLIW is also significantly less because dynamic scheduling need not be supported in hardware. For this reason, several modern DSP processors have adopted the VLIW approach; at the same time, as discussed before, the regular nature of DSP algorithms lends itself well to the static scheduling approach employed in VLIW machines. We will discuss some of these machines in detail in the following sections.

Given multiple processors capable of executing autonomously, the program threads running on the processors may be tightly or loosely coupled to one another. In a tightly coupled architecture the processors may run in lockstep executing the same instructions on different data sets (e.g., systolic arrays), or they may run in lock step, but operate on different instruction sequences (similar to VLIW). Alternatively, processors may execute their programs independent of one
another, only communicating or synchronizing when necessary. Even in this case there is a wide range of how closely processors are coupled, which can range from a shared memory model where the processors may share the same memory address space to a "network of workstations" model where autonomous machines communicate in a coarse-grained manner over a local area network.

In the following sections, we discuss application-specific parallel processors that exemplify the many variations in parallel architectures discussed thus far. We will find that these machines employ tight coupling between processors; these machines also attempt to exploit the predictable run-time nature of the targeted applications, by employing architectural techniques such as VLIW, and employing processor interconnections that reflect the nature of the targeted application set. Also, these architectures rely heavily upon static scheduling techniques for their performance.
2.2.1 ILP in programmable DSP processors

DSP processors have incorporated ILP techniques since inception; the key innovation in the very first DSPs was a single cycle multiply-accumulate unit. In addition, almost all DSP processors today employ an architecture that includes multiple internal busses allowing multiple data fetches in parallel with an instruction fetch in a single instruction cycle; this is also known as a "Harvard" architecture. Figure 2.1 shows an example of a modern DSP processor (Texas Instruments TMS320C54x DSP) containing multiple address and data busses, and parallel address generators.

Since filtering is the key operation in most DSP algorithms, modern programmable DSP architectures provide highly specialized support for this function. For example, a multiply-and-accumulate operation may be performed in parallel with two data fetches from data memory (for fetching the signal sample and the filter coefficient); in addition, an update of two address registers (potentially including modulo operations to support circular buffers and delay lines), and an instruction fetch can also be done in the same cycle. Thus, there are as many as seven atomic operations performed in parallel in a single cycle; this allows a finite impulse response (FIR) filter implementation using only one DSP instruction cycle per filter tap. For example, Figure 2.2 shows the assembly code for the inner loop of an FIR filter implementation on a TMS320C54x DSP. The MAC instruction is repeated for each tap in the filter; for each repetition this instruction fetches the coefficient and data pointed to by address registers AR2 and AR3, multiplies and accumulates them into the "A" accumulator, and postincrements the address registers.
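For comparison with the single-instruction inner loop just described, a plain C rendering of the same computation is sketched below; every operation in the loop body corresponds to work that the repeated MAC instruction, together with its parallel data fetches and address-register updates, performs in one cycle (circular-buffer addressing is omitted for clarity). The function and variable names are illustrative, not taken from the book.

    /* Straightforward C version of an N-tap FIR inner loop.  On a DSP such
     * as the TMS320C54x the multiply, the accumulation, both data fetches,
     * and both pointer updates collapse into one repeated MAC instruction. */
    long fir_tap_sum(const short *coeff, const short *delay_line, int num_taps) {
        long acc = 0;                      /* plays the role of the "A" accumulator */
        for (int k = 0; k < num_taps; k++) {
            acc += (long)coeff[k] * (long)delay_line[k];  /* multiply-accumulate */
            /* coeff[k] and delay_line[k] are the two parallel data fetches;
             * k++ stands in for the post-incremented address registers. */
        }
        return acc;
    }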
DSP processors have a complex instruction set and follow a philosophy very different from the "Reduced Instruction Set Computer" (RISC) architectures that are prevalent in the general purpose, high performance microprocessor domain. The advantages of a complex instruction set are compact
Figure 2.1. A simplified view of the Texas Instruments TMS320C54x DSP.
object code and deterministic performance, while the price of supporting a complex instruction set is lower compiler efficiency and lesser portability of the software. The constraint of low power and high performance-to-cost ratio requirements for embedded DSP applications has resulted in very different evolution paths for DSP processors compared to general-purpose processors. Whether these paths eventually converge in the future remains to be seen.
Sub-word parallelism refers to the ability to divide a wide ALU into narrower slices so that multiple operations on a smaller data type can be performed on the same datapath in an SIMD fashion (Figure 2.3). Several general purpose microprocessors employ a multimedia-enhanced instruction set that exploits sub-word parallelism to achieve higher performance on multimedia applications that require a smaller precision. The MMX technology-enhanced Intel Pentium processor is a well-known general purpose CPU with an enhanced instruction set to handle throughput-intensive "media" processing. The MMX instructions allow a 64-bit ALU to be partitioned into 8-bit slices, providing sub-word parallelism. The 8-bit ALU slices work in parallel in an SIMD fashion. The Pentium can perform operations such as addition, subtraction, and logical operations on eight 8-bit samples (e.g., image pixels) in a single cycle. It can also perform data movement operations such as single cycle swapping of bytes within words, packing smaller sized words into a 64-bit register, etc. Arithmetic operations such as four 8-bit multiplies (with or without saturation), shifts within sub-words, and sums of products of sub-words may all be performed in a single cycle. Similarly enhanced microprocessors have been developed by Sun Microsystems (the "VIS" instruction set for the SPARC processor) and Hewlett-Packard (the "MAX" instructions for the PA RISC processor). The VIS instruction set includes a capability for performing sum of absolute differences (for image compression applications). The MAX instructions include a sub-word average, shift and add, and fairly generic permute instructions
that change the positions of the sub-words within a 64-bit word boundary in a very flexible manner. The permute instructions are especially useful for efficiently aligning data within a 64-bit word before employing an instruction that operates on multiple sub-words. DSP processors such as the TMS320C60 and TMS320C80, and the Philips Trimedia, also support sub-word parallelism. Exploiting sub-word parallelism clearly requires extensive static or compile time analysis, either manually or by a compiler.
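The following sketch illustrates the effect of a sub-word (SIMD) addition like the one depicted in Figure 2.3: a 32-bit word is treated as four independent 8-bit lanes, and each lane is added separately with either saturation or truncation. This is a behavioral model only, assuming unsigned lanes; it does not correspond to any particular MMX, VIS, or MAX instruction.

def packed_add_bytes(word_a, word_b, saturate=True):
    """Add two 32-bit words lane-wise as four unsigned 8-bit sub-words."""
    result = 0
    for lane in range(4):
        a = (word_a >> (8 * lane)) & 0xFF
        b = (word_b >> (8 * lane)) & 0xFF
        s = a + b
        s = min(s, 0xFF) if saturate else (s & 0xFF)  # saturate or truncate
        result |= s << (8 * lane)
    return result

# 0x04 + 0xFF saturates to 0xFF in the lowest lane; the other lanes add normally.
print(hex(packed_add_bytes(0x01020304, 0x010203FF)))  # -> 0x20406ff (lanes 02, 04, 06, FF)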
VLIW Processors

As discussed before, the lower cost of the compiler-scheduled approach employed in VLIW machines, compared to the hardware scheduling employed in superscalar processors, makes VLIW a good candidate as a DSP architecture. It is therefore no surprise that several semiconductor manufacturers have recently announced VLIW-based signal processor products. The Philips Trimedia [RS98] processor, for example, is geared towards video signal processing and employs a VLIW engine. The Trimedia processor also has special I/O hardware for handling various standard video formats. In addition, hardware modules for highly specialized functions such as Variable Length Decoding (used for MPEG video decoding), and color and format conversion, are also provided. Trimedia also has instructions that exploit sub-word parallelism among byte-sized samples within a 32-bit word.
The Chromatic MPACT architecture [Pur97] uses an interesting hardware/software partitioned solution to provide a programmable platform for PC-
Figure 2.3. Example of sub-word parallelism: addition of bytes within a 32-bit register (saturation or truncation could be specified).
based multimedia. The target applications are graphics, audio/video processing, and video games. The key idea behind Chromatic's multimedia solution is to use some amount of processing capability in the native x86 CPU, and to use the MPACT processor for accelerating certain functions when multiple applications are operated simultaneously (e.g., when a FAX message arrives while a teleconferencing session is in operation).
Finally, the Texas Instruments TMS320C6x DSP [Tex98] is a high performance, general purpose DSP that employs a VLIW architecture. The C6x processor is designed around eight functional units that are grouped into two identical sets of four functional units each (see Figure 2.4). These functional units are the D unit for memory load/store and add/subtract operations; the M unit for multiplication; the L unit for addition/subtraction, logical, and comparison operations; and the S unit for shifts, in addition to add/subtract and logical operations. Each set of four functional units has its own register file, and a bypass is provided for accessing each half of the register file by either set of functional units. Each functional unit is controlled by a 32-bit instruction field; the instruction word for the processor therefore has a length between 32 bits and 256 bits, depending on how many functional units are actually active in a given cycle. Features such as predicated instructions allow conditional execution of instructions; this allows one to avoid branching when possible, a very useful feature considering the deep pipeline of the C6x.
Several multiprocessors geared towards signal processing are based on the dataflow architecture principles introduced by Dennis [Den80]; these machines deviate from the traditional von Neumann model of a computer. Notable among these are the Hughes Data Flow Multiprocessor [GB91], the Texas Instruments Data Flow Signal Processor [Gri84], and the AT&T Enhanced Modular Signal Processor [Blo86]. The first two perform the processor assignment step at compile time (i.e., tasks are assigned to processors at compile time) and tasks assigned to a processor are scheduled on it dynamically; the AT&T machine performs even the assignment of tasks to processors at run-time. The main steps involved in scheduling tasks on multiple processors are discussed fully in Chapter 4. Each of these machines employs elaborate hardware to implement dynamic scheduling within processors, and employs expensive communication networks to route tokens generated by actors assigned to one processor to tasks on other processors that require these tokens. In most DSP applications, however, such dynamic scheduling is unnecessary, since compile time predictability makes static scheduling techniques viable. Eliminating dynamic scheduling results in much simpler hardware without an undue performance penalty.
Another example of an application-specific dataflow architecture is the single-chip dataflow processor described in [Cha84], which is geared towards image processing applications. Each chip contains one functional unit; multiple such chips can be connected together to execute programs in a pipelined fashion. The actors are statically assigned to each processor, and actors assigned to a given processor are scheduled on it dynamically. The primitives that this chip supports (convolution, bit manipulations, accumulation, etc.) are specifically designed for image processing applications.
Systolic arrays consist of processors that are locally connected and may be arranged in different interconnection topologies: mesh, ring, torus, etc. The term "systolic" arises because all processors in such a machine run in lock-step, alternating between a computation step and a communication step. The model followed is usually SIMD (Single Instruction Multiple Data). Systolic arrays execute a certain class of problems that can be specified as "Regular Iterative Algorithms (RIA)" [Rao85]; systematic techniques exist for mapping an algo-
Figure 2.4. The TMS320C6x VLIW architecture (256-bit instruction word).
rithm specified in RIA form onto dedicated processor arrays in an optimal fashion. Optimality is measured in terms of metrics such as processor and communication link utilization, scalability with the problem size, and achieving the best possible performance for a given number of processors. Several numerical computation problems were found to fall into the RIA category: linear algebra, matrix operations, singular value decomposition, etc. (see [Lei92] for interesting systolic array implementations of a variety of different numerical problems). Only highly regular computations can be specified in the RIA form; this makes the applicability of systolic arrays somewhat restrictive.
Wavefront arrays are similar to systolic arrays except that processors are not under the control of a global clock [Kun88]. Communication between processors is asynchronous or self-timed; handshaking between processors ensures run-time synchronization. Thus processors in a wavefront array can be complex, and the arrays themselves can consist of a large number of processors without incurring the associated problems of clock skew and global synchronization. The flexibility of wavefront arrays over systolic arrays comes at the cost of the additional handshaking hardware.
The Warp machine developed at Carnegie Mellon University [A+87] is an example of a programmable systolic array, as opposed to a dedicated array designed for one specific application. The processors are arranged in a linear array and communicate with their neighbors through dedicated links (Figure 2.5). Programs are written for this computer in a high-level language. The Warp project also led to the iWarp design, which incorporates elaborate inter-processor communication support; each iWarp node is a single VLSI component composed of a computation engine and a communication engine. The computation agent consists of an integer and logical unit as well as a floating point adder and multiplier. Each unit is capable of running independently, connected to a multi-ported register file. The communication agent connects to its neighbors via four bidirectional communication links, and provides the interface to support message-passing type communication between cells as well as word-based systolic communication. The iWarp nodes can therefore be connected in various single and two dimensional topologies. Various image processing applications (e.g., FFT, image smoothing, computer vision) and matrix algorithms (e.g., LU decomposition) have been reported for this machine [Lou93].
Next, we discuss multiprocessors that make use of multiple off-the-shelf programmable DSP chips. An example of such a system is the SMART architecture [Koh90], which is a reconfigurable bus-based design comprised of AT&T DSP32C processors and custom VLSI components for routing data between pro-
cessors. Clusters of processors may be connected onto a common bus, or may form a linear array with neighbor-to-neighbor communication. This allows the multiprocessor to be reconfigured depending on the communication requirements of the particular application being mapped onto it. Scheduling and code generation for this machine are done by an automatic parallelizing compiler [HJ92].
The DSP3 multiprocessor [SW92] is comprised of AT&T DSP32C processors connected in a mesh configuration. The mesh interconnect is implemented using custom VLSI components for data routing. Each processor communicates with four of its adjacent neighbors through this router, which consists of input and output queues, and a crossbar that is configurable under program control. Data packets contain headers that indicate the ID of the destination processor.
The Ring Array Processor (RAP) system [M+92] uses TI TMS320C30 processors connected in a ring topology. This system is designed specifically for speech-recognition applications based on artificial neural networks. The RAP system consists of several boards that are attached to a host workstation, and acts as a co-processor for the host. The unidirectional pipelined ring topology employed for interprocessor communication was found to be ideal for the particular algorithms that were to be mapped to this machine. The ring structure is similar to the SMART array, except that no processor ID is included with the data, and processor reads and writes into the ring are scheduled at compile time. The ring is used to broadcast data from one processor to all the others during one
Figure 2.5. WARP array.
phase of the neural network algorithm, and is used to shift data from processor to processor in a pipelined fashion in the second phase.
Several modern off-the-shelf DSP processors provide special support for multiprocessing. Examples include the Texas Instruments TMS320C40 (C40), the Motorola DSP96000, the Analog Devices ADSP-21060 "SHARC", as well as the Inmos (now owned by SGS Thomson) Transputer line of processors. The DSP96000 processor is a floating point DSP that supports two independent busses, one of which can be used for local accesses and the other for inter-processor communication. The C40 processor is also a floating point processor with two sets of busses; in addition it has six 8-bit bidirectional ports for interprocessor communication. The ADSP-21060 is a floating point DSP that also provides six bidirectional links for interprocessor communication. The Transputer is a CPU with four serial links for interprocessor communication.
Owing to the ease with which these processors can be interconnected, a number of multi-DSP machines have been built around the C40, DSP96000, SHARC, and the Transputer. Examples of multi-DSP machines composed of DSP96000s include MUSIC [G+92], which targets neural network applications, as well as the OMA architecture described in Chapter 6; C40-based parallel processors have been designed for beamforming applications [Ger95] and machine vision among others; ADSP-21060-based multiprocessors include speech-recognition applications [T+95], applications in nuclear physics [A+98], and digital music [Sha98]; and machines built around Transputers have targeted applications in scientific computation [Mou96] and robotics [YM96].
Modern VLSI technology enables multiple CPUs to be placed on a single die, to yield a multiprocessor system-on-a-chip. Olukotun et al. [O+96] present an interesting study that concludes that going to a multiple processor solution is a better path to high performance than going to higher levels of instruction level parallelism (using a superscalar approach, for example). Systolic arrays have been proposed as ideal candidates for application-specific multiprocessor-on-a-chip implementations; however, as pointed out before, the class of applications targeted by systolic arrays is limited. We discuss next some interesting single chip multiprocessor architectures that have been designed and built to date.
The Texas Instruments TMS320C80 (Multimedia Video Processor) [GGV92] is an example of a single chip multi-DSP. It consists of four DSP cores, and a RISC processor for control-oriented applications. Each DSP core has its own local memory and some amount of shared RAM. Every DSP can access the shared memory in any one of the four DSPs through an interconnection network. A powerful transfer controller is responsible for moving data on-chip, and also
for moving data to and from off-chip memory, which is important in graphics applications. Data transfers are all performed under the control of this transfer controller. Another single-chip multiprocessor designed for video processing consists of nine individual processing elements; each processing element exploits instruction level parallelism by means of four individual processing units, which can perform multiple arithmetic operations each cycle. Thus this is a highly parallel architecture that exploits parallelism at multiple levels.
Embedded single-chip multiprocessors may also be composed of heterogeneous processors. For example, many consumer devices today (disk drive controllers, etc.) are composed of two processors: one is a DSP that handles the signal processing tasks, while the other is a microcontroller such as an ARM. Such a two-processor system is increasingly found in embedded applications because of the types of architectural optimization used in each processor. The microcontroller has an efficient interrupt-handling capability, and is more
amenable to compilation from a high-level language; however, it lacks the multiply-accumulate performance of a DSP processor. The microcontroller is thus ideal for performing user interface and protocol processing type functions that are somewhat asynchronous in nature, while the DSP is more suited to signal processing tasks that tend to be synchronous and predictable. Even though new DSP processors boasting microcontroller capabilities have been introduced recently (e.g., the Hitachi SH-DSP and the TI TMS320C27x series), an ARM plus DSP two-processor solution is expected to be popular for embedded signal processing and control applications in the near future. A good example of such an architecture is described in [Reg94]; this part uses two DSP processors along with a microcontroller to implement audio processing and voice band modem functions in software.
Reconfigurable computers are another approach to application-specific computing that has received significant attention lately. Reconfigurable computing is based on implementing a function in hardware using configurable logic (e.g., a field programmable gate array or FPGA), or higher-level building blocks that can be easily configured and reconfigured to provide a range of different functions. Building a dedicated circuit for a given function can result in large speedups; examples of such functions are bit manipulation in applications such as cryptography and compression; bit-field extraction; highly regular computations such as Fourier and Discrete Cosine Transforms; pseudo-random number generation; compact lookup tables; etc. One strategy that has been employed for building configurable computers is to build the machine entirely out of reconfigurable logic; examples of such machines, used for applications such as DNA sequence matching, finite field arithmetic, and encryption, are discussed in [G+91][GMN96][~+96].
A second and more recent approach to reconfigurable architectures is to augment a programmable processor with configurable logic. In such an architecture, functions best suited to a hardware implementation are mapped to the FPGA to take advantage of the resulting speedup, and functions more suitable to software (e.g., control dominated applications, and floating point intensive computation) can make use of the programmable processor. The Garp processor [HW97], for example, combines a Sun UltraSPARC core with an FPGA that serves as a reconfigurable functional unit. Special instructions are defined for configuring the FPGA, and for transferring data between the FPGA and the processor. The authors demonstrate a 24x speedup over a Sun UltraSPARC machine for an encryption application. In [HFHK97] the authors describe a similar architecture, called Chimaera, that augments a RISC processor with an FPGA. In the Chimaera architecture, the reconfigurable unit has access to the processor register
file; in the Garp architecture the processor is responsible for directly reading data from and writing data to the reconfigurable unit through special instructions that are added to the native instruction set of the RISC processor. Both architectures include special instructions in the processor for sending commands to the reconfigurable unit.
Another example of a reconfigurable architecture is Matrix [MD97], which attempts to combine the efficiency of processors on irregular, heavily multiplexed tasks with the efficiency of FPGAs on highly regular tasks. The Matrix architecture allows selection of the granularity according to application needs. It consists of an array of basic functional units (BFUs) that may be configured either as functional units (add, multiply, etc.), or as control for another BFU. Thus one can configure parts of the array to function in SIMD mode under a common control, while each such partition runs an independent thread in an MIMD mode.
In [ASI+98] the authors describe the idea of domain-specific processors that achieve low power dissipation for the small class of applications they are optimized for. These processors, augmented with general purpose processors, yield a practical trade-off between flexibility, power, and performance. The authors esti-
Figure 2.7. A RISC processor augmented with an FPGA-based accelerator [HW97][HFHK97].
mate that such an approach can reduce the power utilization of speech coding implementations by over an order of magnitude compared to an implementation using only a general purpose DSP processor.
PADDI (Programmable Arithmetic Devices for DIgital signal processing) is another reconfigurable architecture that consists of an array of high performance execution units (EXUs) with localized register files, connected via a flexible interconnect mechanism [CR92]. The EXUs perform arithmetic functions such as add, subtract, shift, compare, accumulate, etc. The entire array is controlled by a hierarchical control structure: a central sequencer broadcasts a global control word, which is then decoded locally by each EXU to determine its action. The local EXU decoder ("nanostore") handles local control, for example the selection of operands and program branching.
Finally, Wu and Liu [WLR98] describe a reconfigurable processing unit that can be used as a building block for a variety of video signal processing functions, including FIR, IIR, and adaptive filters, and discrete transforms such as the DCT. An array of processing units along with an interconnection network is used to implement any one of these functions, yielding throughput comparable to custom ASIC designs but with much higher flexibility and potential for adaptive operation.
As we will discuss in Chapter 4, compile time scheduling is very effective for a large class of applications in signal processing and scientific computing. Given such a schedule, we can obtain information about the pattern of inter-processor communication that occurs at run-time. This compile time information can be exploited by the hardware architecture to achieve efficient communication between processors. We exploit this fact in the ordered transaction strategy discussed in Chapter 3. In this section we discuss related work in this area of employing compile time information about inter-processor communication, coupled with enhancements to the hardware architecture, with the objective of reducing IPC and synchronization overhead.
Determining the pattern of processor communications is relatively straightforward in SIMD implementations. Techniques applied to systolic arrays in fact use the regular communication pattern to determine an optimal interconnect topology for a given algorithm.
An interesting architecture in this context is the GF11 machine built at IBM [BDW85]. The GF11 is an SIMD machine in which processors are interconnected using a Benes network (Figure 2.8), which allows the GF11 to support a variety of different interprocessor communication topologies rather than a fixed topology. Benes networks are non-blocking, i.e., they can provide one-to-one con-
nections from all the network inputs to the network outputs simultaneously, according to any specified permutation. These networks achieve the functional capability of a full crossbar switch with much simpler hardware. The drawback, however, is that in a Benes network, computing the switch settings needed to achieve a particular permutation involves a somewhat complex algorithm [Lei92]. In the GF11, this problem is solved by precomputing the switch settings based on the program to be executed on the array. A central controller is responsible for reconfiguring the Benes network at run-time based on these predetermined switch settings. Interprocessor communication in the GF11 is synchronous with respect to computations in the processors, similar to systolic arrays. The GF11 has been used for scientific computing, e.g., calculations in quantum physics, finite element analysis, LU decomposition, and other applications.
An example of a mesh connected parallel processor that uses compile time information at the hardware level is the NuMesh system at MIT [SHL+97]. In this system, it is assumed that the communication pattern (the source and destination of each message, and the communication bandwidth required) can be extracted from the parallel program specification. Some amount of dynamic execution is also supported by the architecture. Each processing node in the mesh gets a communication schedule which it follows at run-time. If the compile time estimates of bandwidth requirements are accurate, the architecture realizes
Figure 2.8. The IBM GF11 architecture: an example of statically scheduled communication.
efficient, hot-spot free, low-overhead communication. Incorrect bandwidth estimates or dynamic execution are not catastrophic, but these do cause lower performance.
The RAW machine [W+97] is another example of a parallel processor in which communications are configured statically. The processing elements are tiled in a mesh topology; each element consists of a RISC-like processor, with configurable logic that implements special instructions and configurable data widths. The communication switches enforce a compile-time determined static communication pattern, allowing dynamic switching when necessary. Implementing the static communication pattern reduces synchronization overhead and network congestion. A compiler is responsible for partitioning the program into threads mapped onto each processor, configuring the reconfigurable logic on each processor, and routing communications statically.
In this chapter we discussed various types of application-specific multiprocessors employed for signal processing. Although these machines employ parallel processing techniques that are well known in general purpose computing, the predictable nature of the computations allows for simplified system architectures. It is often possible to configure processor interconnects statically to make use of compile time knowledge of inter-processor communication patterns. This allows for low overhead interprocessor communication and synchronization mechanisms that employ a combination of simple hardware support and software techniques applied to the programs running on the processors. We explore these ideas further in the following chapters.
In this chapter we introduce terminology and definitions used in the remainder of the book, and formalize the dataflow model that was introduced intuitively in Chapter 1. We also briefly introduce the concept of algorithmic complexity, and discuss various shortest and longest path algorithms in weighted directed graphs along with their associated complexity. These algorithms are used extensively in subsequent chapters.
To start with, we define the difference of two arbitrary sets S1 and S2 by S1 − S2 = { s ∈ S1 | s ∉ S2 }, and we denote the number of elements in a finite set S by |S|. Also, if r is a real number, then we denote the smallest integer that is greater than or equal to r by ⌈r⌉.
A directed graph is an ordered pair (V, E), where V is the set of vertices and E is the set of edges; each edge is an ordered pair (v1, v2), where v1, v2 ∈ V. If e = (v1, v2) ∈ E, we say that e is directed from v1 to v2; v1 is the source vertex of e, and v2 is the sink vertex of e. We also refer to the source and sink vertices of a graph edge e ∈ E by src(e) and snk(e). In a directed graph we cannot have two or more edges that have identical source and sink vertices. A generalization of a directed graph is a directed multigraph, in which two or more edges can have the same source and sink vertices.
Figure 3.1(a) shows an example of a directed graph, and Figure 3.1(b) shows an example of a directed multigraph. The vertices are represented by circles and the edges are represented by arrows between the circles. Thus, the vertex set of the directed graph of Figure 3.1(a) is {A, B, C, D}, and the edge set is {(A, B), (A, D), (A, C), (D, B), (C, C)}.
A dataflow graph is a directed multigraph, where the vertices (actors) represent computation and the edges (arcs) represent FIFO (first-in-first-out) queues that direct data values from the output of one computation to the input of another. Edges thus represent data precedences between computations. Actors consume data (or tokens) from their inputs, perform computations on them (fire), and produce certain numbers of tokens on their outputs.
Programs written in high-level functional languages such as pure LISP, and in dataflow languages such as Id and Lucid, can be directly converted into dataflow graph representations; such a conversion is possible because these languages are designed to be free of side-effects, i.e., programs in these languages do not contain global variables or data structures, and functions in these languages cannot modify their arguments [Ack82]. Also, since it is possible to simulate any Turing machine in one of these languages, questions such as deadlock (or, equivalently, terminating behavior) and determining the maximum buffer sizes required for the arcs become undecidable when realizing the specified computation in hardware or software. Restricted dataflow models, in which such questions can be answered at compile time, are therefore of particular interest.
One such restricted model (and in fact one of the earliest graph-based computation models) is the computation graph model of Karp and Miller [KM66], where the authors establish that the computation graph model is determinate, i.e., the sequence of tokens produced on the edges of a given computation graph is unique, and does not depend on the order in which the actors in the graph fire, as long as all data dependencies are respected by the firing order. The authors also provide an algorithm that, based on topological and algebraic properties of the graph, determines whether the computation specified by a given computation graph will eventually terminate. Because of the latter property, computation graphs clearly cannot simulate all Turing machines, and hence are not as expressive as a general dataflow language like Lucid or pure LISP. Computation graphs provide some of the theoretical foundations for the SDF model, to be discussed in detail in Section 3.5.
Another model of computation relevant to dataflow is the Petri net model [Pet81][Mur89]. A Petri net consists of a set of transitions, which are analogous to actors in dataflow, and a set of places, which are analogous to arcs. Each transition has a certain number of input places and output places connected to it. Places may contain one or more tokens. A Petri net has the following semantics: a transition fires when all its input places have one or more tokens and, upon firing, it produces a certain number of tokens on each of its output places.
A large number of different kinds of Petri net models have been proposed in the literature for modeling different types of systems. Some of these Petri net models have the same expressive power as Turing machines: for example, if transitions are allowed to possess "inhibit" inputs (if a place corresponding to such an input to a transition contains a token, then that transition is not allowed to fire), then a Petri net can simulate any Turing machine (pp. 201 in [Pet81]). Others (depending on topological restrictions imposed on how places and transitions can be interconnected) are equivalent to finite state machines, and yet others are similar to SDF graphs. Some extended Petri net models allow a notion of time, to model execution times of computations. There is also a body of work on stochastic extensions of timed Petri nets that are useful for modeling uncertainties in computation times. We will touch upon some of these Petri net models again in Chapter 4. Finally, there are Petri nets that distinguish between different classes of tokens in the specification (colored Petri nets), so that tokens can have information associated with them. We refer to [Pet81][Mur89] for details on the extensive variety of Petri nets that have been proposed over the years.
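The firing rule just described can be made concrete with a minimal simulation. The sketch below is a generic interpreter of the basic (one-token-per-place) rule, not any particular Petri net tool; the place and transition names are made up for illustration, and each firing consumes and produces exactly one token per connected place.

def enabled(transition, marking):
    """A transition is enabled when every input place holds at least one token."""
    return all(marking[p] >= 1 for p in transition["inputs"])

def fire(transition, marking):
    """Fire an enabled transition: consume one token from each input place,
    produce one token on each output place."""
    assert enabled(transition, marking)
    for p in transition["inputs"]:
        marking[p] -= 1
    for p in transition["outputs"]:
        marking[p] += 1

# Two places and one transition t that moves a token from p1 to p2.
marking = {"p1": 1, "p2": 0}
t = {"inputs": ["p1"], "outputs": ["p2"]}
if enabled(t, marking):
    fire(t, marking)
print(marking)  # {'p1': 0, 'p2': 1}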
The particular restricted dataflow model we are mainly concerned with in this book is the SDF (Synchronous Data Flow) model proposed by Lee and Messerschmitt [LM87]. The SDF model poses restrictions on the firing of actors: the number of tokens produced (consumed) by an actor on each output (input) edge is a fixed number that is known at compile time. The number of tokens produced and consumed by each SDF actor on each of its edges is annotated in illustrations of an SDF graph by numbers at the arc source and sink respectively. In an actual implementation, arcs represent buffers in physical memory.
The arcs in an SDF graph may contain initial tokens, which we also refer to as delays. Arcs with delays can be interpreted as data dependencies across iterations of the graph; this concept will be formalized in the following chapter when we discuss scheduling models. We will represent delays using bullets on the edges of the SDF graph; we indicate more than one delay on an edge by a number alongside the bullet. An example of an SDF graph is illustrated in Figure 3.2.
DSP applications typically represent computations on an indefinitely long data sequence; therefore the SDF graphs we are interested in for the purpose of signal processing must execute in a non-terminating fashion. Consequently, we must be able to obtain periodic schedules for SDF representations, which can then be run as infinite loops using a finite amount of physical memory. Unbounded buffers imply a sample rate inconsistency, and deadlock implies that all actors in the graph cannot be iterated indefinitely. Thus, for our purposes, correctly constructed SDF graphs are those that can be scheduled periodically using a finite amount of memory. The main advantage of imposing restrictions on the SDF model (over a general dataflow model) lies precisely in the ability to determine whether or not an arbitrary SDF graph has a periodic schedule that neither
Figure 3.2. An SDF graph.
deadlocks nor requires unbounded buffer sizes [LM87]. The buffer sizes required to implement arcs in SDF graphs can be determined at compile time (recall that this is not possible for a general dataflow model); consequently, buffers can be allocated statically, and the run-time overhead associated with dynamic memory allocation is avoided. The existence of a periodic schedule that can be inferred at compile time implies that a correctly constructed SDF graph entails no run-time scheduling overhead.
This section briefly describes some useful properties of SDF graphs; for a more detailed and rigorous treatment, please refer to the work of Lee and Messerschmitt [LM87][Lee86]. An SDF graph is compactly represented by its topology matrix. The topology matrix, referred to henceforth as Γ, represents the SDF graph structure; this matrix contains one column for each vertex, and one row for each edge in the SDF graph. The (i, j)-th entry in the matrix corresponds to the number of tokens produced by the actor numbered j onto the edge numbered i. If the j-th actor consumes tokens from the i-th edge, i.e., the i-th edge is incident into the j-th actor, then the (i, j)-th entry is negative. Also, if the j-th actor neither produces nor consumes any tokens from the i-th edge, then the (i, j)-th entry is set to zero. For example, the topology matrix Γ for the SDF graph in Figure 3.2 is:

Γ = [ 2  −3   0 ]
    [ 1   0  −1 ] ,     (3-1)
where the actors A, B, and C are numbered 1, 2, and 3 respectively, and the edges (A, B) and (A, C) are numbered 1 and 2 respectively. A useful property of Γ is stated by the following theorem.

Theorem 3.1: A connected SDF graph with S vertices that has consistent sample rates is guaranteed to have rank(Γ) = S − 1, which ensures that Γ has a non-trivial null space.
Proof: See [LM87]. This can easily be verified for (3-1).
This fact is utilized to determine the repetitions vector of an SDF graph. The repetitions vector q for an SDF graph with S actors numbered 1 to S is a column vector of length S, with the property that if each actor i is invoked a number of times equal to the i-th entry of q, then the number of tokens on each edge of the SDF graph remains unchanged. Furthermore, q is the smallest integer vector for which this property holds.
Clearly, the repetitions vector is very useful for generating infinite schedules for SDF graphs by indefinitely repeating a finite length schedule, while maintaining small buffer sizes between actors. Also, q will only exist if the SDF graph has consistent sample rates. The conditions for the existence of q are determined by Theorem 3.1 coupled with the following theorem.

Theorem 3.2: The repetitions vector for an SDF graph with consistent sample rates is the smallest integer vector in the null space of its topology matrix. That is, q is the smallest integer vector such that Γq = 0.
Proof: See [LM87].
The repetitions vector is easily obtained by solving a set of linear equations; these are called balance equations, since they represent the constraint that the number of samples produced and consumed on each edge of the SDF graph be the same after each actor fires a number of times equal to its corresponding entry in the repetitions vector. For the example of Figure 3.2, from (3-1),
q = [ 3  2  3 ]^T .     (3-2)
Clearly, if actors A, B, and C are invoked 3, 2, and 3 times respectively, the number of tokens on the edges remains unaltered (the tokens on (A, B) and on (A, C) are restored to their initial values). Thus, the repetitions vector in (3-2) brings the SDF graph back to its "initial state".
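As the theorem above states, the repetitions vector is the smallest positive integer vector in the null space of the topology matrix, and it can be computed directly from Γ. The following sketch does this with exact rational arithmetic; it assumes a connected graph with consistent sample rates (so the null space is one-dimensional), and the example matrix at the end is the reconstruction of (3-1) given above.

from fractions import Fraction
from functools import reduce
from math import gcd

def repetitions_vector(gamma):
    """Smallest positive integer q with gamma * q = 0.

    gamma[i][j] = tokens produced (+) or consumed (-) by actor j on edge i.
    Assumes a connected, sample-rate-consistent SDF graph (rank = #actors - 1)."""
    rows = [[Fraction(x) for x in row] for row in gamma]
    n = len(gamma[0])
    pivots, r = [], 0
    for c in range(n):                       # Gauss-Jordan elimination
        piv = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        rows[r] = [x / rows[r][c] for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    free = next(c for c in range(n) if c not in pivots)   # single free variable
    q = [Fraction(0)] * n
    q[free] = Fraction(1)
    for row_idx, c in enumerate(pivots):     # back-substitute the pivot variables
        q[c] = -sum(rows[row_idx][j] * q[j] for j in range(n) if j != c)
    lcm = reduce(lambda a, b: a * b // gcd(a, b), (x.denominator for x in q), 1)
    ints = [int(x * lcm) for x in q]
    g = reduce(gcd, (abs(v) for v in ints))
    # For a consistent graph all entries share the same sign, so abs() is safe.
    return [abs(v) // g for v in ints]

print(repetitions_vector([[2, -3, 0], [1, 0, -1]]))  # -> [3, 2, 3], matching (3-2)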
An SDF graph in which every actor consumes and produces only one token on each of its inputs and outputs is called a homogeneous SDF graph (HSDFG). An HSDFG actor fires when it has one or more tokens on all its input edges; it consumes one token from each input edge when it fires, and produces one token on each of its output edges. An HSDFG is very similar to a marked graph in Petri net terminology: transitions in the marked graph correspond to actors, places correspond to edges, and the initial tokens (or initial marking) of the marked graph correspond to initial tokens (or delays) in the HSDFG. The repetitions vector defined in the previous section can be used to con-
vert a general SDF graph G into an equivalent HSDFG; we briefly outline this transformation here, and Figure 3.3 illustrates the result of this transformation for a single SDF edge. Each actor A in G is replaced by a number of copies (invocations) of A equal to the corresponding entry of the repetitions vector. For an edge (A, B) in G, let nA represent the number of tokens produced onto the edge each time A fires, and let nB represent the number of tokens consumed from the edge each time B fires. Since each copy of A in the HSDFG produces and consumes only one token on each of its edges, each copy of A must now be the source vertex for nA edges, one for each token that the corresponding invocation of A produces in the original graph, and each copy of B must similarly be the sink vertex for nB edges; let us call these connection points output and input ports, respectively. The k-th sample (token) generated on the edge (A, B) in one iteration of G is then routed to the input port of the particular invocation of B that consumes it, so that the data dependencies of the original SDF graph are preserved.
An SDF graph that is not an HSDFG can always be converted into an equivalent HSDFG [Lee86]. The resulting HSDFG has a larger number of actors than the original SDF graph. It in fact has a number of actors equal to the sum of the entries in the repetitions vector. In the worst case, the SDF to HSDFG transformation may result in an exponential increase in the number of actors (see [PBL95] for an example of a family of SDF graphs in which this blowup occurs). Such a transformation, however, appears to be necessary when constructing periodic multiprocessor schedules from multirate SDF graphs, although there has been some work on reducing the complexity of the HSDFG that results from transforming a given SDF graph by applying graph clustering techniques to that SDF graph [PBL95]. An SDF graph converted into an HSDFG for the purposes of multiprocessor scheduling can be further converted into an Acyclic Precedence Expansion Graph (APEG)
Figure 3.3. Expansion of an edge in an SDF graph G into multiple edges in the equivalent HSDFG. Note the input and output ports on the vertices of the HSDFG.
by removing from the HSDFG the arcs that contain initial tokens (delays). Recall that arcs with initial tokens on them represent dependencies between successive iterations of the dataflow graph. An APEG is therefore useful for constructing multiprocessor schedules that, for algorithmic simplicity, do not attempt to overlap multiple iterations of the dataflow graph by exploiting precedence constraints across iterations. Figure 3.5 shows an example of an APEG. Note that the precedence constraints present in the original HSDFG of Figure 3.4
Figure 3.4. HSDFG obtained by expanding the SDF graph in Figure 3.2.
Figure 3.5. APEG obtained from the HSDFG in Figure 3.4.
are maintained by this APEG, as long as each iteration of the graph is completed before the next iteration begins.
Since we are concerned with multiprocessor schedules, we assume henceforth that we are working with an application represented as a homogeneous SDF graph, unless we state otherwise. This of course results in no loss of generality, because a general SDF graph is converted into a homogeneous graph for the purposes of multiprocessor scheduling anyway. In Chapter 8 we discuss how the ideas that apply to HSDF graphs can be extended to graphs containing actors that display data-dependent behavior (i.e., dynamic actors).
An HSDFG representation of an algorithm (for example, a filter bank, or a Fast Fourier Transform) is called an application graph. For example, Figure 3.7(a) shows an SDF representation of a two-channel multirate filter bank that consists of a pair of analysis filters followed by synthesis filters. This graph can be transformed into an equivalent HSDFG, which represents the application graph for the two-channel filter bank, as shown
Figure 3.7. (a) SDF graph representing a two-channel filter bank. (b) Application graph.
in Figure 3.7(b). Algorithms that map applications specified as SDF graphs onto single and multiple processors take the equivalent application graph as input. Such algorithms will be discussed in Chapters 4 and 5. Chapter 7 will discuss how the performance of a multiprocessor system after scheduling is conveniently modeled by another HSDFG, called the interprocessor communication graph, or IPC graph. The IPC graph is derived from the original application graph and the given parallel schedule. Furthermore, Chapters 9 to 11 will discuss how a third HSDFG, called the synchronization graph, can be used to analyze and optimize the synchronization structure of a multiprocessor system. The full interaction of the application graph, the IPC graph, and the synchronization graph, and also the formal definitions of these graphs, will be further elaborated in Chapters 7 through 11.
SDF should not be confused with synchronous languages (e.g., LUSTRE, SIGNAL, and ESTEREL), which have very different semantics from SDF. Synchronous languages have been proposed for formally specifying and modeling reactive systems, i.e., systems that constantly react to stimuli from a given physical environment. Signal processing systems fall into the reactive category, and so do control and monitoring systems, communication protocols, man-machine interfaces, etc. In synchronous languages, variables are possibly infinite sequences of data of a certain type. Associated with each such sequence is a conceptual (and sometimes explicit) notion of a clock signal. In LUSTRE, each variable is explicitly associated with a clock, which determines the instants at which the value of that variable is defined. SIGNAL and ESTEREL do not have an explicit notion of a clock. The clock signal in LUSTRE is a sequence of Boolean values, and a variable in a LUSTRE program assumes its n-th value when its corresponding clock takes its n-th TRUE value. Thus we may relate one variable with another by means of their clocks. In ESTEREL, on the other hand, clock ticks are implicitly defined in terms of instants when the reactive system corresponding to an ESTEREL program receives (and reacts to) external events. All computations in synchronous languages are defined with respect to these clocks.
In contrast, the term "synchronous" in the SDF context refers to the fact that SDF actors produce and consume fixed numbers of tokens, and these numbers are known at compile time. This allows us to obtain periodic schedules for SDF graphs such that the average rates of firing of actors are fixed relative to one another. We will not be concerned with synchronous languages in this book, although these languages have a close and interesting relationship with dataflow models used for specification of signal processing algorithms [LP95].
A homogeneous synchronous dataflow graph (HSDFG) is a directed multigraph (V, E); we denote the number of initial tokens (delays) on an edge e by delay(e). We say that e is an output edge of src(e), and that e is an input edge of snk(e). We will also use the notation (vi, vj), vi, vj ∈ V, for an edge directed from vi to vj. The delay on the edge is denoted by delay((vi, vj)) or simply delay(vi, vj).
A path in (V, E) is a finite, non-empty sequence (e1, e2, ..., en), where each ei is a member of E, and snk(e1) = src(e2), snk(e2) = src(e3), ..., snk(e_{n−1}) = src(en). We say that the path p = (e1, e2, ..., en) contains each ei and each subsequence of (e1, e2, ..., en); p is directed from src(e1) to snk(en); and each member of {src(e1), src(e2), ..., src(en), snk(en)} is on p. A path originates at vertex src(e1) and terminates at vertex snk(en). A dead-end path is a path that terminates at a vertex that has no successors. That is, p = (e1, e2, ..., en) is a dead-end path if there is no e ∈ E such that src(e) = snk(en). A path that is directed from a vertex to itself is called a cycle, and a fundamental cycle is a cycle of which no proper subsequence is a cycle.
If (p1, p2, ..., pk) is a finite sequence of paths such that, for 1 ≤ i ≤ k, pi = (e_{i,1}, e_{i,2}, ..., e_{i,ni}), and snk(e_{i,ni}) = src(e_{i+1,1}) for 1 ≤ i ≤ (k − 1), then we define the concatenation of (p1, p2, ..., pk), denoted ⟨(p1, p2, ..., pk)⟩, by

⟨(p1, p2, ..., pk)⟩ = (e_{1,1}, ..., e_{1,n1}, e_{2,1}, ..., e_{2,n2}, ..., e_{k,1}, ..., e_{k,nk}) .

Clearly, ⟨(p1, p2, ..., pk)⟩ is a path from src(e_{1,1}) to snk(e_{k,nk}).
If p = (e1, e2, ..., en) is a path in an HSDFG, then we define the path delay of p, denoted Delay(p), by

Delay(p) = Σ_{i=1}^{n} delay(ei) .     (3-3)
Since the delays on all HSDFG edges are restricted to be non-negative, it is easily seen that between any two vertices x, y ∈ V, either there is no path directed from x to y, or there exists a (not necessarily unique) minimum-delay path between x and y. Given an HSDFG G, and vertices x, y in G, we define ρ_G(x, y) to be equal to the path delay of a minimum-delay path from x to y if there exist one or more paths from x to y, and equal to ∞ if there is no path from x to y. If G is understood, then we may drop the subscript and simply write "ρ" in place of "ρ_G". It is easily seen that minimum delay path lengths satisfy the following triangle inequality: for any vertices x, y, z in G,

ρ_G(x, z) ≤ ρ_G(x, y) + ρ_G(y, z) .
By a subgraph of (V, E), we mean the directed graph formed by any subset V′ ⊆ V together with the set of edges {e ∈ E | src(e), snk(e) ∈ V′}. We denote the subgraph associated with the vertex-subset V′ by subgraph(V′). We say that (V, E) is strongly connected if for each pair of distinct vertices x, y, there is a path directed from x to y and there is a path directed from y to x. We say that a subset V′ ⊆ V is strongly connected if subgraph(V′) is strongly connected. A strongly connected component (SCC) of (V, E) is a strongly connected subset V′ ⊆ V such that no strongly connected subset of V properly contains V′. If V′ is an SCC, then when there is no ambiguity, we may also say that subgraph(V′) is an SCC. If C1 and C2 are distinct SCCs in (V, E), we say that C1 is a predecessor SCC of C2 if there is an edge directed from some vertex in C1 to some vertex in C2; C1 is a successor SCC of C2 if C2 is a predecessor SCC of C1. An SCC is a source SCC if it has no predecessor SCC, and an SCC is a sink SCC if it has no successor SCC. An edge e is a feedforward edge of (V, E) if it is not contained in an SCC, or equivalently, if it is not contained in a cycle; an edge that is contained in at least one cycle is called a feedback edge.
A sequence of vertices (v1, v2, ..., vk) is a chain that joins v1 and vk if v_{i+1} is adjacent to v_i for i = 1, 2, ..., (k − 1). We say that a directed multigraph is connected if for any pair of distinct vertices A, B, there is a chain that joins A and B. Given a directed multigraph G = (V, E), there is a unique partition (unique up to a reordering of the members of the partition) V1, V2, ..., Vn such that for 1 ≤ i ≤ n, subgraph(Vi) is connected; and for each e ∈ E, src(e), snk(e) ∈ Vj for some j. Thus, each Vi can be viewed as a maximal connected subset of V, and we refer to each Vi as a connected component of G.
A topological sort of an acyclic directed multigraph (V, E) is an ordering v1, v2, ..., v_|V| of the members of V such that for each e ∈ E, ((src(e) = vi) and (snk(e) = vj)) ⇒ (i < j); that is, the source vertex of each edge occurs earlier in the ordering than the sink vertex. An acyclic directed multigraph is said to be well-ordered if it has only one topological sort, and we say that an n-vertex well-ordered directed multigraph is chain-structured if it has (n − 1) edges.
For elaboration on any of the graph-theoretic concepts presented in this section, we refer the reader to Cormen, Leiserson, and Rivest [CLR92].
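As an illustration of the topological sort just defined, the following sketch orders the vertices of an acyclic directed multigraph by repeatedly removing vertices with no remaining incoming edges (Kahn's method); it is a generic utility with made-up vertex names, not tied to any particular graph in this chapter.

from collections import deque

def topological_sort(vertices, edges):
    """Return a topological ordering of an acyclic directed multigraph,
    or raise ValueError if the graph contains a cycle.

    edges is a list of (src, snk) pairs; parallel edges are allowed."""
    indegree = {v: 0 for v in vertices}
    succ = {v: [] for v in vertices}
    for u, v in edges:
        indegree[v] += 1
        succ[u].append(v)
    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != len(vertices):
        raise ValueError("graph has a cycle; no topological sort exists")
    return order

print(topological_sort(["A", "B", "C", "D"],
                       [("A", "B"), ("A", "C"), ("C", "D"), ("B", "D")]))
# ['A', 'B', 'C', 'D']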
A large class of problems, the NP-complete problems, is widely believed not to be solvable in polynomial time; showing that a given problem is at least as complex as one of these problems is therefore strong evidence of its intractability. A polynomial transfor-
mation from "B" to "A" implies that a polynomial time algorithm to solve "A" can be used to solve "B" in polynomial time, and if "B" is NP-complete then the transformation implies that "A" is at least as complex as any NP-complete problem. Such a problem is called NP-hard. We illustrate this concept with a simple example.
Consider the set-covering problem, where we are given a collection of subsets C of a finite set S, and a positive integer I ≤ |C|. The problem is to find out if there is a subset C′ ⊆ C such that |C′| ≤ I and each element of S belongs to at least one set in C′. By finding a polynomial transformation from a known NP-complete problem to the set-covering problem we can prove that the set-covering problem is NP-hard. For this purpose, we choose the vertex cover problem, where we are given a graph G = (V, E) and a positive integer I ≤ |V|, and the problem is to determine if there exists a subset of vertices V′ ⊆ V such that |V′| ≤ I and for each edge e ∈ E either src(e) ∈ V′ or snk(e) ∈ V′. The subset V′ is said to be a cover of the set of vertices V. The vertex cover problem is known to be NP-complete, and by transforming it to the set-covering problem in polynomial time, we can show that the set-covering problem is NP-hard.
Given an instance of vertex cover, we can convert it into an instance of set-covering by first letting S be the set of edges E. Then for each vertex v ∈ V, we construct the subset of edges T_v = {e ∈ E | v = src(e) or v = snk(e)}. The sets {T_v | v ∈ V} form the collection C. Clearly, this transformation can be done in time at most linear in the number of edges of the input graph, and the resulting C has size equal to |V|. Our transformation ensures that V′ is a vertex cover for G if and only if {T_v | v ∈ V′} is a set cover for the set of edges E. Now, we may use a solution of set cover to solve the transformed problem, since a vertex cover |V′| ≤ I exists if and only if a corresponding set cover |C′| ≤ I exists for E. Thus, the existence of a polynomial time algorithm for set cover implies the existence of a polynomial time algorithm for vertex cover. This proves that set cover is NP-hard.
It can easily be shown that the set cover problem is also NP-complete by showing that it belongs to the class NP. However, since a formal discussion of complexity classes is beyond the scope of this book, we refer the interested reader to [GJ79] for a comprehensive discussion of complexity classes and the definition of the class NP.
In summary, by finding a polynomial transformation from a problem that is known to be NP-complete to a given problem, we can prove that the given problem is NP-hard. This implies that a polynomial time algorithm to solve the given problem in all likelihood does not exist, and if such an algorithm does exist, a major breakthrough in complexity theory would be required to find it. This provides a justification for solving such problems using suboptimal
polynomial time heuristics. It should be pointed out that a polynomial transformation of an NP-complete problem to a given problem, if it exists, is often quite involved, and is not necessarily as straightforward as in the case of the set-covering example discussed here.
In Chapter 10, we use the concepts outlined in this section to show that a particular synchronization optimization problem is NP-hard by reducing the set-covering problem to the synchronization optimization problem. We then discuss efficient heuristics to solve that problem.
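The polynomial transformation described above is easy to state in code. The sketch below builds, from an instance of vertex cover, the corresponding instance of set cover; it is only a restatement of the construction in the text, with made-up container types, and it uses edge indices as the elements of the ground set S.

def vertex_cover_to_set_cover(vertices, edges):
    """Transform a vertex cover instance (G = (V, E), bound I) into a set cover
    instance over the ground set S = E.

    Returns (S, collection) where collection maps each vertex v to the set T_v
    of edges incident on v; a vertex cover of size <= I exists iff a set cover
    of size <= I exists."""
    ground_set = set(range(len(edges)))          # one element per edge
    collection = {
        v: {i for i, (src, snk) in enumerate(edges) if v in (src, snk)}
        for v in vertices
    }
    return ground_set, collection

S, C = vertex_cover_to_set_cover(["A", "B", "C"],
                                 [("A", "B"), ("B", "C"), ("A", "C")])
# {"A", "B"} covers all three edges, and correspondingly C["A"] | C["B"] == S.
print(C["A"] | C["B"] == S)  # True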
There is a rich history of work on shortest path algorithms, and there are many variants and special cases of these problems (depending, for example, on the topology of the graph, or on the values of the edge weights) for which efficient algorithms have been proposed. In what follows we focus on the most general, and, from the point of view of this book, most useful shortest path algorithms.
Consider a weighted, directed graph G = (V, E), with real valued edge weights w(u, v) for each edge (u, v) ∈ E. The single-source shortest path problem finds a path with minimum weight (defined as the sum of the weights of the edges on the path) from a given vertex vs ∈ V to all other vertices u ∈ V, u ≠ vs, whenever at least one path from vs to u exists. If no such path exists, then the shortest path weight is set to ∞. The two best known algorithms for the single-source shortest path problem are Dijkstra's algorithm and the Bellman-Ford algorithm. Dijkstra's algorithm is applicable to graphs with non-negative weights (w(u, v) ≥ 0). The running time of this algorithm is O(|V|^2). The Bellman-Ford algorithm solves the single-source shortest path problem for graphs that may have negative edge weights; the Bellman-Ford algorithm detects the existence of negative weight cycles reachable from vs and, if such cycles are detected, it reports that no solution to the shortest path problem exists. If a negative weight cycle is reachable from vs, then clearly we can reduce the weight of any path by traversing this negative cycle one or more times. Thus, no finite solution to the shortest path problem exists in this case. An interesting fact to note is that for graphs containing negative cycles, the problem of determining the weight of the shortest simple path between two vertices is NP-hard [GJ79]. A simple path is defined as one that does not visit the same vertex twice, i.e., a simple path does not include any cycles.
The all-pairs shortest path problem computes the shortest path between all pairs of vertices in a graph. Clearly, the single-source problem can be applied
repeatedly to solve the all-pairs problem. However, a more efficient algorithm based on dynamic programming, the Floyd-Warshall algorithm, may be used to solve the all-pairs shortest path problem in O(|V|^3) time. This algorithm solves the all-pairs problem in the absence of negative cycles.
The corresponding longest path problems may be solved using the shortest path algorithms: the straightforward way to do this is to simply negate all edge weights (i.e., use the edge weights w′(u, v) = −w(u, v)) and apply the shortest path algorithm for the single-source or all-pairs problem. If all the edge weights are negated, cycles whose total weight is positive in the original graph become negative cycles, and the problem of determining the longest simple path becomes NP-hard when such cycles are reachable from the source vertex.
In the following sections, we briefly describe the shortest path algorithms discussed thus far. We describe the algorithms in pseudo-code, and assume we only need the weight of the longest or shortest path; these algorithms also yield the actual path, but we do not need this information for the purposes of this book. Also, we will not delve into the correctness proofs of these algorithms; we refer the reader to texts such as [CLR92] for a detailed discussion of these graph algorithms.

Dijkstra's Algorithm
The pseudo-code for the algorithm is shown in Figure 3.8. Since the loop in Step 4 is executed |V| times, and a straightforward implementation of extracting the minimum element takes O(|V|) time for each iteration of the loop, the total time spent by the algorithm is O(|V|^2) when it is implemented in this manner. A more clever implementation of the minimum extraction step (for example, using a binary heap) leads to a modified implementation of the algorithm with running time O(|E| log |V|).
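For concreteness, the following is a direct runnable sketch of the pseudo-code in Figure 3.8, using the simple O(|V|^2) minimum-extraction strategy discussed above; the adjacency-list graph representation is just one possible choice made for this sketch.

def dijkstra(graph, source):
    """Single-source shortest paths with non-negative edge weights.

    graph: dict mapping each vertex to a list of (neighbor, weight) pairs."""
    INF = float("inf")
    dist = {v: INF for v in graph}
    dist[source] = 0
    unvisited = set(graph)
    while unvisited:
        u = min(unvisited, key=lambda v: dist[v])   # O(|V|) extraction
        unvisited.remove(u)
        if dist[u] == INF:
            break                                   # remaining vertices unreachable
        for v, w in graph[u]:
            if v in unvisited and dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist

g = {"s": [("a", 1), ("b", 4)], "a": [("b", 2)], "b": []}
print(dijkstra(g, "s"))  # {'s': 0, 'a': 1, 'b': 3}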
The Bellman-Ford algorithm solves the single-source shortest path problem even when edge weights are negative, provided that no negative cycles are reachable from the designated source vertex; the algorithm detects such cycles when they are present. The nested For loop in Step 4 determines the complexity of the algorithm, which is O(|V||E|). This algorithm is based on the dynamic programming technique.
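The same algorithm in runnable form follows; this is a sketch of the computation in Figure 3.9 rather than an optimized implementation, and the edge-list representation is an assumption of the sketch.

def bellman_ford(vertices, edges, source):
    """Single-source shortest paths allowing negative edge weights.

    edges: list of (u, v, w) triples. Returns (dist, negative_cycle_exists)."""
    INF = float("inf")
    dist = {v: INF for v in vertices}
    dist[source] = 0
    for _ in range(len(vertices) - 1):       # |V| - 1 relaxation passes (Step 4)
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    negative_cycle = any(dist[u] + w < dist[v] for u, v, w in edges)
    return dist, negative_cycle

verts = ["s", "a", "b"]
es = [("s", "a", 1), ("a", "b", -2), ("s", "b", 4)]
print(bellman_ford(verts, es, "s"))  # ({'s': 0, 'a': 1, 'b': -1}, False)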
Next, consider the all-pairs shortest path problem. One simple method of solving this is to apply the single-source problem to all vertices in the graph; this takes O(|V|^2 |E|) time using the Bellman-Ford algorithm. The Floyd-Warshall algorithm improves upon this. A pseudo-code specification of this algorithm is given in Figure 3.10. The triply nested For loop in this algorithm clearly implies a complexity of O(|V|^3). This algorithm is also based upon dynamic programming: at the k-th iteration of the outermost For loop, the shortest path from the vertex numbered i to the vertex numbered j is determined among all paths that do not visit any vertex numbered greater than k. Again, we leave it to texts such as [CLR92] for a formal
Input: Weighted directed graph G = (V, E), with non-negative edge weight w(e) for each e ∈ E, and a source vertex s ∈ V.
Output: d(v), the weight of the shortest path from s to each vertex v ∈ V.

1. Initialize d(s) = 0, and d(v) = ∞ for all other vertices.
2. V_S ← ∅
3. V_Q ← V
4. While V_Q is not empty
       Extract u ∈ V_Q such that d(u) = min { d(v) | v ∈ V_Q }
       V_Q ← V_Q − {u}; V_S ← V_S ∪ {u}
       For each edge e = (u, t) such that t ∈ V_Q
           d(t) ← min(d(t), d(u) + w(e))

Figure 3.8. Dijkstra's algorithm.
proof of correctness.
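A compact runnable sketch of the computation in Figure 3.10 follows; the dictionary-based representation of the weight function and the vertex labels are arbitrary choices made for this illustration.

def floyd_warshall(vertices, weight):
    """All-pairs shortest paths.

    weight: dict mapping (u, v) to the edge weight; missing pairs mean no edge."""
    INF = float("inf")
    d = {(u, v): (0 if u == v else weight.get((u, v), INF))
         for u in vertices for v in vertices}
    for k in vertices:                      # triply nested loop: O(|V|^3)
        for i in vertices:
            for j in vertices:
                if d[i, k] + d[k, j] < d[i, j]:
                    d[i, j] = d[i, k] + d[k, j]
    return d

w = {("A", "B"): 3, ("B", "C"): 2, ("A", "C"): 10}
print(floyd_warshall(["A", "B", "C"], w)[("A", "C")])  # 5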
As discussed in subsequent chapters, a feasible solution to certain problems that we will encounter can be obtained as the solution of a system of difference constraints. Difference constraints are of the form
SingleSourceShortestPath
Input: Weighted directed graph G = (V, E), with edge weight w(e) for each e ∈ E, and a source vertex s ∈ V.
Output: d(v), the weight of the shortest path from s to each vertex v ∈ V, or else a Boolean indicating the presence of negative cycles reachable from s.

1. Initialize d(s) = 0, and d(v) = ∞ for all other vertices.
2. V_S ← ∅
3. V_Q ← V
4. For i = 1, 2, ..., |V| − 1
       For each edge (u, v) ∈ E
           d(v) ← min(d(v), d(u) + w(u, v))
5. For each edge (u, v) ∈ E
       If d(v) > d(u) + w(u, v)
           Set NegativeCyclesExist = TRUE
Figure 3.9. The Bellman-Ford algorithm.
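(The runnable Bellman-Ford sketch given earlier in this section corresponds step for step to this figure: the |V| − 1 relaxation passes implement Step 4, and the final scan over the edges implements the negative-cycle check in Step 5.)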
x_i − x_j ≤ c_ij ,     (3-5)
where the x_i are unknowns to be determined, and the c_ij are given; this problem is a special case of linear programming. The data precedence constraints between actors in a dataflow graph often lead to a system of difference constraints, as we shall see later. Such a system of inequalities can be solved using shortest path algorithms, by transforming the difference constraints into a constraint graph. This graph consists of a number of vertices equal to the number of variables x_i,
Input: Weighted directed graph G = (V, E), with edge weight w(e) for each e in E.
Output: d(u, v), the weight of the shortest path from u to v, for each pair of vertices u, v in V.
1. Let |V| = n; number the vertices 1, 2, ..., n.
2. Let A be an n x n matrix; set A(i, j) to the weight of the edge from the vertex numbered i to the vertex numbered j. If no such edge exists, A(i, j) = infinity. Also, A(i, i) = 0.
3. For k = 1, 2, ..., n:
     For i = 1, 2, ..., n:
       For j = 1, 2, ..., n:
         A(i, j) <- min(A(i, j), A(i, k) + A(k, j))
4. For vertices u, v in V with enumeration u <- i and v <- j, set d(u, v) = A(i, j).
Figure 3.10. The Floyd-Warshall algorithm.
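The triply nested loop of Figure 3.10 translates directly into code; the following Python sketch assumes the distance matrix has already been initialized as in Step 2 of the figure.

def floyd_warshall(A):
    """A: n x n matrix with A[i][i] = 0 and float("inf") where no edge exists."""
    n = len(A)
    d = [row[:] for row in A]                 # work on a copy of A
    for k in range(n):                        # Step 3: O(|V|^3)
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d                                  # d[i][j] = weight of shortest path i -> j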
and for each difference constraint x_i - x_j <= c_ij, the graph contains an edge (v_j, v_i), with edge weight w(v_j, v_i) = c_ij. An additional vertex v_0 is also included, with zero-weight edges directed from v_0 to all other vertices in the graph. The solution to the system of difference constraints is then simply given by the weights of the shortest paths from v_0 to all other vertices in the graph. That is, setting each x_i to be the weight of the shortest path from v_0 to v_i results in a feasible solution to the set of difference constraints. A feasible solution exists if, and only if, there are no negative cycles in the constraint graph. Difference constraints can therefore be solved using the Bellman-Ford algorithm. The reason for adding v_0 is to ensure that negative cycles in the graph, if present, are reachable from the source vertex. This in turn ensures that, given v_0 as the source vertex, the Bellman-Ford algorithm will determine the existence of a feasible solution. For example, consider the following set of inequalities in three variables:

x_1 - x_2 <= 2
x_2 - x_3 <= -3
x_3 - x_1 <= 1

The constraint graph obtained from these inequalities is shown in Figure 3.11. A feasible solution is obtained by computing the shortest paths from v_0 to each x_i; thus x_1 = -1, x_2 = -3, and x_3 = 0 is a feasible solution. Clearly, given such a feasible solution {x_1, x_2, x_3}, if we add the same constant to each x_i, we obtain another feasible solution. We make use of such a solution of difference constraints in Chapter 7.
Figure 3.11. Constraint graph.
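To make the constraint-graph construction concrete, the following Python sketch (with illustrative names, not taken from the text) builds the constraint graph, adds the extra source vertex v0, and applies Bellman-Ford relaxation to obtain a feasible solution or report infeasibility.

def solve_difference_constraints(variables, constraints):
    """constraints: list of (i, j, c) meaning x_i - x_j <= c.
    Returns {i: x_i} giving a feasible solution, or None if a negative cycle exists."""
    # Constraint graph: edge (v_j -> v_i) with weight c, plus a source v0 with
    # zero-weight edges to every variable vertex.
    edges = [(j, i, c) for (i, j, c) in constraints]
    edges += [("v0", v, 0) for v in variables]
    d = {v: float("inf") for v in variables}
    d["v0"] = 0
    for _ in range(len(variables)):            # |V| - 1 passes, |V| = len(variables) + 1
        for u, v, w in edges:
            if d[u] + w < d[v]:
                d[v] = d[u] + w
    if any(d[u] + w < d[v] for u, v, w in edges):
        return None                            # negative cycle: constraints infeasible
    return {v: d[v] for v in variables}

# The three-variable example above:
# solve_difference_constraints([1, 2, 3], [(1, 2, 2), (2, 3, -3), (3, 1, 1)])
# -> {1: -1, 2: -3, 3: 0}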
The maximum cycle mean of an HSDFG G is defined as:

MCM(G) = max over all cycles C in G of ( T(C) / Delay(C) ),

where T(C) is the sum of t(v) over all actors v traversed by the cycle C, and Delay(C) is the total number of delays on the edges of C. As we shall see in subsequent chapters, the maximum achievable throughput for a given HSDFG is determined by its maximum cycle mean. Several algorithms are available for computing the maximum cycle mean; we refer the reader to the literature for a comprehensive overview of these algorithms and their relative efficiency.
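As a hedged illustration of one standard way to compute the maximum cycle mean (not necessarily the most efficient algorithm alluded to above), the following Python sketch performs a binary search on a candidate value lam and uses negative-cycle detection on re-weighted edges; the graph representation and tolerance are assumptions of this sketch, which also assumes every cycle carries at least one delay (a deadlock-free HSDFG).

def max_cycle_mean(vertices, edges, t, tol=1e-6):
    """edges: list of (u, v, delay); t: dict of actor execution times."""
    def has_cycle_with_mean_above(lam):
        # A cycle has sum(t)/sum(delay) > lam  iff  sum(lam*delay - t) < 0 on that cycle.
        d = {v: 0.0 for v in vertices}          # all vertices act as sources at once
        updated = False
        for _ in range(len(vertices)):
            updated = False
            for u, v, delay in edges:
                w = lam * delay - t[u]
                if d[u] + w < d[v] - 1e-12:
                    d[v] = d[u] + w
                    updated = True
            if not updated:
                return False                    # converged: no negative cycle
        return updated                          # still relaxing: negative cycle exists
    lo, hi = 0.0, sum(t.values())               # MCM cannot exceed the total execution time
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if has_cycle_with_mean_above(mid):
            lo = mid
        else:
            hi = mid
    return hi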
In this chapter, we discussed background relating to dataflow models of computation, discussed conversion of a general SDF graph into an HSDFG and generation of an Acyclic Precedence Expansion Graph, briefly described asymptotic notation and the notion of NP-complete problems, described some useful shortest path algorithms that are used extensively in the following chapters, and defined the maximum cycle mean. This background will be used extensively in the remainder of this book.
This chapter discusses parallel scheduling of application graphs. The performance metric of interest for evaluating schedules is the average iteration period T: the average time it takes for all the actors in the graph to be executed once. Equivalently, we could use the throughput T^(-1) (i.e., the number of iterations of the graph executed per unit time) as a performance metric. Thus an optimal schedule is one that minimizes T.
In the execution of a dataflow graph, actors fire when a sufficient number of tokens are present at their inputs. A dataflow graph therefore lends itself naturally to functional parallelism, where the problem is to assign tasks (actors) to processors. Systolic and wavefront arrays exploit data parallelism, where the data set is partitioned among multiple processors executing the same program. Ideally, we would like to exploit data parallelism along with functional parallelism within the same parallel programming framework. Such a combined framework is currently an active research topic; several parallel languages have been proposed recently that allow a programmer to specify both data as well as functional parallelism [BH98][RSB97]. Ramaswamy et al. ([RSB97]) propose a hierarchical Macro Dataflow Graph representation of programs written in FORTRAN. Atomic nodes at the lowest level of the hierarchy represent tasks that are run in a data parallel fashion on a specified number of processors. The nodes themselves are run concurrently, utilizing functional parallelism. The work of Printz [Pri91] on geometric scheduling, and the Multidimensional SDF model proposed by Lee in [Lee93], are two other promising approaches for combining data and functional parallelism.
The ordered-transactions strategy is essentially a self-timed approach where the order in which processors communicate is determined at compile time, and the target hardware enforces the predetermined transaction order during run-time. Such a strategy leads to a low overhead interprocessor communication mechanism. We will discuss this model in greater detail in the following two chapters. The trade-off between generality of the applications that can be targeted by a particular scheduling model, and the run-time overhead and implementation complexity entailed by that model, is shown in Figure 4.1.
We discuss these scheduling strategies in detail in the following sections.
In the fully-static strategy, the exact firing time of each actor is assumed to be known at compile time. Such a scheduling style is used in conjunction with the systolic array architectures discussed in Section 2.4, for scheduling
Figure 4.1. Trade-off of generality against run-time overhead and implementation complexity.
VLIW processors discussed in Section 2.2.3, and also in high-level VLSI synthesis of applications that consist only of operations with guaranteed worst-case execution times [De94]. Under a fully-static schedule, all processors run in lock step; the operation each processor performs on each clock cycle is predetermined at compile time and is enforced at run-time either implicitly (by the program each processor executes, perhaps augmented with "nop"s or idle cycles for correct timing) or explicitly (by means of a program sequencer, for example). A fully-static schedule of a simple HSDFG G is illustrated in Figure 4.2. The fully-static schedule is schematically represented as a Gantt chart, which indicates the processors along the vertical axis and time along the horizontal axis. The actors are represented as rectangles with horizontal length equal to the execution time of the actor. The left edge of each rectangle in the Gantt chart corresponds to the starting time of the corresponding actor. The Gantt chart can be viewed as a processor-time plane; scheduling can then be viewed as a mechanism to tile this plane while minimizing total schedule length, or equivalently minimizing idle time ("empty spaces" in the tiling process). Clearly, the fully-static strategy is viable only if actor execution time estimates are accurate and data-independent, or if tight worst-case estimates are available for these execution times. As shown in Figure 4.2, two different types of fully-static schedules arise, depending on how successive iterations of the HSDFG are treated. Execution times of all actors are assumed to be one time unit in this example. The fully-static schedule in Figure 4.2(b) represents a blocked schedule: successive iterations of the HSDFG in a blocked schedule are treated separately so that each iteration is completed before the next one begins. A more elaborate blocked schedule on five processors is shown in Figure 4.3. The HSDFG is scheduled as if it executes for only one iteration, i.e., inter-iteration dependencies are ignored; this schedule is then repeated to get an infinite periodic schedule for the HSDFG. The length of the blocked schedule determines the average iteration period T. The scheduling problem is then to obtain a schedule that minimizes T (which is also called the makespan of the schedule). A lower bound on T for a blocked schedule is simply the length of the critical path of the graph, which is the longest delay-free path in the graph.
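The critical-path lower bound just mentioned is easy to compute once the delay-free subgraph is extracted; the following Python sketch (illustrative names, and assuming the zero-delay subgraph is acyclic, as it must be for a deadlock-free HSDFG) does this with a memoized longest-path search.

from functools import lru_cache

def critical_path_length(actors, edges, t):
    """actors: iterable of actor names; edges: list of (src, snk, delay);
    t: dict mapping actor -> execution time estimate."""
    succ = {a: [] for a in actors}
    for u, v, delay in edges:
        if delay == 0:                   # edges with delay represent inter-iteration dependencies
            succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(a):
        # execution time of a plus the longest chain of zero-delay successors
        return t[a] + max((longest_from(b) for b in succ[a]), default=0)

    return max(longest_from(a) for a in actors)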
Ignoring the inter-iteration dependencies when scheduling an application graph is equivalent to the classical multiprocessor scheduling problem for an Acyclic Precedence Expansion Graph (APEG). As discussed in Section 3.8, the APEG is obtained from the given application graph by eliminating all edges with delays on them (edges with delays represent dependencies across iterations) and replacing multiple edges that are directed between the same two vertices in the same direction with a single edge. This replacement is done because such multiple edges represent identical precedence constraints; these edges are taken into
account individually during buffer assignment, however. Optimal multiprocessor scheduling of an acyclic graph is known to be NP-hard [GJ79], and a number of heuristics have been proposed for this problem. One of the earliest, and still popular, solutions to this problem is list-scheduling, first proposed by Hu [Hu61]. List-scheduling is a greedy approach: whenever a task is ready to run, it is scheduled
Figure 4.2. Fully-static schedule: (a) HSDFG and its acyclic precedence graph; (b) blocked schedule on two processors (T = 2 t.u.); (c) overlapped schedule.
as soon as a processor is available to run it. Tasks are assigned priorities, and among the tasks that are ready to run at any instant, the task with the highest priority is executed first. Various researchers have proposed different priority mechanisms for list-scheduling [ACD74], some of which use critical-path-based priorities [Koh75][Bla87] ([Bla87] summarizes a large number of such priority mechanisms).
Figure 4.3. Fully-static schedule on five processors: (a) an HSDFG "G" with its actor execution times; (b) blocked schedule on five processors (T = 11); (c) fully-static execution.
An overlapped schedule (Figure 4.2(c)), in contrast, allows successive iterations of the HSDFG to overlap in time; the technique of unfolding, which schedules N iterations of the graph together, is described in detail in Section 4.8. Under suitable assumptions, such an overlapped schedule
can be computed efficiently and optimally in polynomial time [PM91][GS92]. Overlapped scheduling heuristics have not been as extensively studied as blocked schedules. The main work in this area is by Lam [Lam88], and deGroot [dGH92], who propose a modified list-scheduling heuristic that explicitly constructs an overlapped schedule. Another work related to overlapped scheduling is the "cyclo-static scheduling" approach proposed by Schwartz. This approach attempts to optimally tile the processor-time plane to obtain the best possible schedule. The search involved in this process has a worst-case complexity that is exponential in the size of the input graph, although it appears that the complexity is manageable in practice, at least for small examples [SISS].
The fully-static approach introduced in the previous section cannot be used when actors have variable execution times; the fully-static approach requires precise knowledge of actor execution times to guarantee sender-receiver synchronization. It is possible to use worst-case execution times and still employ a fully-static strategy, but this requires tight worst-case execution time estimates that may not be available to us. An obvious strategy for solving this problem is to introduce explicit synchronization whenever processors communicate. This leads to the self-timed scheduling (ST) strategy in the scheduling taxonomy of Lee and Ha [LH89]. In this strategy we first obtain a fully-static schedule using techniques that will be discussed in Chapter 5, making use of the execution time estimates. After computing the fully-static schedule (Figure 4.4(b)), we simply discard the timing information that is not required, and only retain the processor assignment and the ordering of actors on each processor as specified by the fully-static schedule (Figure 4.4(c)). Each processor is assigned a sequential list of actors, some of which are send and receive actors, which it executes in an infinite loop. When a processor executes a communication actor, it synchronizes with the processor(s) it communicates with. Exactly when a processor executes each actor depends on when, at run-time, all input data for that actor is available, unlike the fully-static case where no such run-time check is needed. Conceptually, the processor sending data writes data into a FIFO buffer, and blocks when that buffer is full; the receiver, on the other hand, blocks when the buffer it reads from is empty. Thus flow control is performed at run-time. The buffers may be implemented using shared memory, or using hardware FIFOs between processors. In a self-timed strategy, processors run sequential programs and communicate when they execute the communication primitives embedded in their programs, as shown schematically in Figure 4.4(c). The multiple DSP machines that we discussed in Section 2.5 all employ some form of self-timed scheduling. Clearly, general purpose parallel
machines can also be programmed using the self-timed scheduling style, since these machines provide mechanisms for run-time synchronization and flow control.
A self-timed scheduling strategy is robust with respect to changes in execution times of actors, because sender-receiver synchronization is performed at run-time. Such a strategy, however, implies higher IPC costs compared to the fully-static strategy because of the need for synchronization (e.g., using semaphore management). In addition, the self-timed scheduling strategy faces arbitration costs: the fully-static schedule guarantees mutually exclusive access of shared communication resources, whereas shared resources need to be arbitrated at run-time in the self-timed schedule. Consequently, whereas IPC in the fully-static schedule simply involves reading and writing from shared memory (no synchronization or arbitration needed), implying a cost of a few processor cycles for IPC, the self-timed scheduling strategy requires on the order of tens of processor cycles, unless special hardware is employed for run-time flow control. Run-time flow control allows variations in execution times of tasks; in addition, it
Figure 4.4. Steps in a self-timed scheduling strategy: (a) HSDFG; (b) fully-static schedule; (c) self-timed implementation (schematic).
simplifies the compiler software, since the compiler no longer needs to perform detailed timing analysis of the generated code. Machines that could potentially use fully-static scheduling often still choose to implement such run-time flow control (at the expense of additional hardware) for the resulting software simplicity. Lam presents an interesting discussion of the trade-off involved between hardware cost and compiler complexity when we consider dynamic flow control implemented in hardware versus flow control enforced by a compiler.
At the other extreme lies fully dynamic scheduling, in which assignment, ordering, and firing times are all determined at run-time. This is analogous to the dynamically scheduled processors discussed in Section 2.1, where hardware schedules instructions in the processor at run-time. The classic dataflow machines also fall into this category, at the cost of considerable hardware support for run-time scheduling and communication. Embedded signal processing systems will usually not require this type of scheduling owing to the run-time overhead and complexity involved, and the availability of compile time information that makes static scheduling techniques practical.
Actors that exhibit data-dependent execution times usually do so because they include one or more data-dependent control structures, for example conditionals and data-dependent iterations. In such a case, if we have some knowledge about the statistics of the control variables (the number of iterations a loop will go through, or the Boolean value of the control input to an if-then-else type construct), it is possible to obtain a static schedule that optimizes the average execution time of the overall computation. The key idea here is to define an execution profile for each actor in the dataflow graph. An execution profile for a dynamic construct consists of the number of processors assigned to it, and a local schedule of that construct on the assigned processors; the profile essentially defines the shape that a dynamic actor takes in the processor-time plane. In case the actor execution is data-dependent, an exact profile cannot be pre-determined at compile time. In such a case, the profile is chosen by making use of statistical information about the actor, e.g., average execution time, probability distributions of control variables, etc. Such an approach is called quasi-static scheduling [Lee88b]. Figure 4.5 shows a quasi-static strategy applied to a conditional construct (adapted from [Lee88b]).
Ha [HL97] has applied the quasi-static approach to dataflow constructs representing data-dependent iteration, recursion, and conditionals, where optimal profiles are computed assuming knowledge of the probability density functions of data-dependent variables that influence the profile. The data-dependent constructs must be identified in a given dataflow graph, either manually or automatically, before Ha's techniques can be applied. These techniques make the simplifying assumption that the control tokens for different dynamic actors are independent of one another, and that each control stream consists of tokens that take TRUE or FALSE values randomly and are independent and identically distributed (i.i.d.) according to statistics known at compile time.
Ha's quasi-static approach constructs a blocked schedule for one iteration of the dataflow graph. The dynamic constructs are scheduled in a hierarchical fashion; each dynamic construct is scheduled on a certain number of processors, and is then converted into a single node in the graph and assigned a certain execution profile. When scheduling the remainder of the graph, the dynamic construct is treated as an atomic block, and its execution profile is used to determine how to schedule the remaining actors around it; the profile helps tiling actors in
the processor-time plane with the objective of minimizing the overall schedule length. Such a hierarchical scheme effectively handles nested control constructs, e.g., nested conditionals. The locally optimal decisions made for the dynamic constructs are shown to be effective when the variability in a dynamic construct
Figure 4.5. A quasi-static schedule for a conditional construct, showing the TRUE branch and FALSE branch profiles on three processors (adapted from [Lee88b]).
is small. We will return to quasi-static schedules again in Chapter 8.
To model execution times of actors (and to perform static scheduling), we associate an execution time t(v) (a non-negative integer) with each actor v in the HSDFG; t(v) assigns an execution time to each actor v (the actual execution time can be interpreted as t(v) cycles of a base clock). Interprocessor communication costs are represented by assigning execution times to the send and receive actors. The values t(v) may be set equal to execution time estimates when exact execution times are not available, in which case results of the computations that make use of these values (e.g., the iteration period T) are compile-time estimates. Recall that actors in an HSDFG are executed essentially infinitely. Each firing of an actor is called an invocation of that actor. An iteration of the HSDFG corresponds to one invocation of every actor in the HSDFG. A schedule specifies processor assignment, actor ordering, and firing times of actors, and these may be determined at compile-time or at run-time, depending on the scheduling strategy being employed. To specify firing times, we let the function start(v, k) (a non-negative integer) represent the time at which the k-th invocation of the actor v starts. Correspondingly, the function end(v, k) represents the time at which the k-th execution of the actor v completes, at which point v produces data tokens at its output edges. Since we are interested in the k-th execution of each actor for k = 0, 1, 2, 3, ..., we set start(v, k) = 0 and end(v, k) = 0 for k < 0 as the "initial conditions". If the k-th invocation of an actor v_i takes exactly t(v_i) time units to complete for all k, then we can claim:

end(v_i, k) = start(v_i, k) + t(v_i).
Recall that a fully-static schedule specifies a processor assignment, actor ordering on each processor, and also the precise firing times of actors. We use the following notation for a fully-static schedule: a fully-static schedule S (for P processors) specifies a triple

S = {sigma_p(v), sigma_t(v), T_FS},

where sigma_p(v), taking values in {1, 2, ..., P}, is the processor assignment, and T_FS is the iteration period. A fully-static schedule specifies the firing times start(v, k) of all actors, and since we want a finite representation for an infinite schedule, a fully-static schedule is constrained to be periodic:

start(v, k) = sigma_t(v) + k T_FS;

sigma_t(v) is thus the starting time of the first execution of actor v (i.e., start(v, 0) = sigma_t(v)). Clearly, the throughput for such a schedule is 1/T_FS.
The sigma_p(v) function and the sigma_t(v) values are chosen so that all data precedence constraints and resource constraints are met. We define precedence constraints as follows: an edge (v_j, v_i) in E of an HSDFG (V, E) represents the (data) precedence constraint

start(v_i, k) >= end(v_j, k - delay(v_j, v_i)), for all k >= delay(v_j, v_i).   (4-1)
The above definition arises because each actor consumes one token from each of its input edges when it fires. Since there are already delay(e) tokens on each incoming edge e of actor v, another (k + 1 - delay(e)) tokens must be produced on e before the k-th execution of v can begin. Thus the actor src(e) must have completed its (k - delay(e))-th execution before v can begin its k-th execution. The "+1" arises because we define start(v, k) for k >= 0 rather than k > 0; this is done for notational convenience. Any schedule that satisfies all the precedence constraints specified by G is called an admissible schedule for G [Rei68]. A valid execution of an HSDFG corresponds to a set of firing times {start(v_i, k)} that form an admissible schedule; that is, a valid execution respects all data precedences specified by the HSDFG. For the purposes of the techniques presented in this book, we are only concerned with the precedence relationships between actors in the HSDF graph. In a general HSDFG, one or more pairs of vertices can have multiple edges connecting them in the same "direction"; in other words, a general HSDFG is a directed multigraph (Section 3.1). Such a multigraph often arises when a multirate SDF graph is converted into an HSDFG. Multiple edges between the same pair of vertices in the same direction are redundant as far as precedence relationships are concerned. Suppose there are multiple edges from vertex v_i to v_j, and amongst these edges the minimum edge delay is equal to d_min. Then, if we replace all of these edges by a single edge with delay equal to d_min, it is easy to verify that this single edge maintains the precedence constraints for all of the edges that were replaced. Thus a general HSDFG may be preprocessed into a form where the source and sink vertices uniquely identify an edge in the graph, and we represent an edge e in E by the ordered pair (src(e), snk(e)). An HSDFG that is a directed multigraph may in this way be transformed into an HSDFG that is a directed graph such that the precedence constraints of the original HSDFG are maintained by the transformation. Such a transformation is illustrated in Figure 4.6. The multiple edges are taken into account individually when buffers are assigned to the arcs in the graph. We perform such a transformation to avoid needless clutter in analyzing HSDFGs, and to reduce the running time of algorithms that operate on HSDFGs.
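A small sketch of the preprocessing step just described follows; the (src, snk, delay) edge representation is an assumption of this illustration.

def collapse_parallel_edges(edges):
    """Replace multiple edges between the same ordered vertex pair by a single
    edge carrying the minimum delay, preserving all precedence constraints."""
    min_delay = {}
    for src, snk, delay in edges:
        key = (src, snk)
        if key not in min_delay or delay < min_delay[key]:
            min_delay[key] = delay
    return [(src, snk, d) for (src, snk), d in min_delay.items()]

# e.g. collapse_parallel_edges([("A", "B", 2), ("A", "B", 0), ("B", "A", 1)])
# -> [("A", "B", 0), ("B", "A", 1)]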
In a self-timed scheduling strategy, we determine a fully-static schedule using the execution time estimates, but we retain only the processor assignment and the ordering of actors on each processor as specified by that schedule; we discard the precise timing information specified in the fully-static schedule. Although we may start the schedule by setting start(v, 0) = sigma_t(v), subsequent start(v, k) values are determined at run-time based on the availability of data at the inputs of each actor. The average iteration period of a self-timed schedule therefore depends on the actual run-time behavior; we analyze the evolution of self-timed schedules in the following chapters.
As we discussed in Section 4.3, in some cases it is advantageous to unfold a graph by a certain unfolding factor, say N, and schedule N iterations of the graph together in order to exploit inter-iteration parallelism more effectively. In this section, we describe the unfolding transformation.
An HSDFG G = (V, E) unfolded N times represents N iterations of G; the unfolding transformation therefore results in another HSDFG, G_N = (V_N, E_N), that contains N copies of each of the vertices of G. Let v_i^0, v_i^1, ..., v_i^(N-1) denote the N copies of a vertex v_i in V, so that |V_N| = N|V|. From the definition of v_i^l, it is obvious that:

start(v_i^l, m) = start(v_i, mN + l), and   (4-2)
end(v_i^l, m) = end(v_i, mN + l),   (4-3)

for all m >= 0 and 0 <= l < N.
Also, G_N maintains exactly the same precedence constraints as G, and therefore the edges E_N must reflect the same inter- and intra-iteration
Figure 4.6. Transforming an HSDFG that is a directed multigraph into an HSDFG that is a directed graph while maintaining precedence constraints.
constraints as the edges E. For the precedence constraint in G represented by the edge (v_j, v_i) in E, there will be a set of one or more edges in E_N that represents the same precedence constraint in G_N. The construction of E_N is as follows. From (4-1), an edge (v_j, v_i) in E represents the precedence constraint: start(v_i, k) >= end(v_j, k - delay(v_j, v_i)), for all k >= delay(v_j, v_i).
Now, we can let k = mN + l, and write delay(v_j, v_i) as

delay(v_j, v_i) = floor(delay(v_j, v_i)/N) N + delay(v_j, v_i) mod N,   (4-4)
where (x mod y) equals the value of x taken modulo y, and floor(x/y) equals the quotient obtained when x is divided by y. Then, (4-1) can be written as:

start(v_i, mN + l) >= end(v_j, (m - floor(delay(v_j, v_i)/N)) N + (l - delay(v_j, v_i) mod N)).   (4-5)
We now consider two cases:
1. If delay(v_j, v_i) mod N <= l < N, then (4-5) may be combined with (4-2) and (4-3) to yield:
start(v_i^l, m) >= end(v_j^(l - delay(v_j, v_i) mod N), m - floor(delay(v_j, v_i)/N)).   (4-6)
2. If 0 <= l < delay(v_j, v_i) mod N, then (4-5) may be combined with (4-2) and (4-3) to yield:
start(v_i^l, m) >= end(v_j^(N + l - delay(v_j, v_i) mod N), m - floor(delay(v_j, v_i)/N) - 1).   (4-7)
Equations (4-6) and (4-7) are summarized as follows. For each edge (v_j, v_i) in E, there is a set of N edges in E_N, the edge set of G_N. In particular, E_N contains edges from v_j^(l - delay(v_j, v_i) mod N) to v_i^l, each with delay floor(delay(v_j, v_i)/N), for values of l such that delay(v_j, v_i) mod N <= l < N, and edges from v_j^(N + l - delay(v_j, v_i) mod N) to v_i^l, each with delay floor(delay(v_j, v_i)/N) + 1, for values of l such that 0 <= l < delay(v_j, v_i) mod N. Note that if delay(v_j, v_i) = 0, then there is a zero-delay edge from each v_j^l to v_i^l for 0 <= l < N. Figure 4.7 shows an example of the unfolding transformation. Figure 4.8 lists an algorithm that may be used for unfolding. Note that this algorithm has a complexity of O(N|E|).
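The edge-construction rules summarized above can be coded directly; the following Python sketch (with illustrative names) produces the vertex and edge sets of the unfolded graph and has the O(N|E|) complexity noted above.

def unfold(vertices, edges, N):
    """vertices: iterable of actor names; edges: list of (src, snk, delay).
    Vertex copies are represented as pairs (v, l) for 0 <= l < N."""
    V_N = [(v, l) for v in vertices for l in range(N)]
    E_N = []
    for (vj, vi, delay) in edges:
        q, r = delay // N, delay % N        # delay = q*N + r, as in (4-4)
        for l in range(N):
            if l >= r:
                # case (4-6): copy (l - r) of vj feeds copy l of vi, with delay q
                E_N.append(((vj, l - r), (vi, l), q))
            else:
                # case (4-7): copy (N + l - r) of vj feeds copy l of vi, with delay q + 1
                E_N.append(((vj, N + l - r), (vi, l), q + 1))
    return V_N, E_N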
When constructing a schedule for an unfolded graph, the processor assignment sigma_p and the actor starting times sigma_t are defined for all vertices of the unfolded graph (i.e., sigma_p and sigma_t are defined for N invocations of each actor); T_FS is the iteration period for the unfolded graph, and the average iteration period for the original graph is then (T_FS/N). In the remainder of this book, we assume we are dealing with the unfolded graph and we refer only to the iteration period and throughput of the unfolded graph, if unfolding is in fact employed, with the understanding that these quantities can be scaled by the unfolding factor to obtain the corresponding quantities for the original graph.
Figure 4.7. Example of an unfolding transformation: HSDFG G is unfolded by a factor of 3 to obtain the unfolded HSDFG G_3.
We assume that we have reasonably good estimates of actor execution times available to us at compile time to enable us to exploit static scheduling techniques; however, these estimates need not be exact, and execution times of actors may even be data-dependent. Thus we allow actors that have different execution times from one iteration of the HSDFG to the next, as long as these variations are small or rare. This is the case when estimates are available for the task execution times, and the actual execution times are close to the corresponding estimates with high probability, but deviations from the estimates of (effectively) arbitrary magnitude occasionally occur due to phenomena such as cache misses, interrupts, user inputs, or error handling. Consequently, tight worst-case execution time bounds cannot generally be determined for such operations; how-
Input: HSDFG G = (V, E) and an unfolding factor N.
Output: The unfolded HSDFG G_N = (V_N, E_N).
1. For each vertex v_i in V, add the N copies v_i^0, v_i^1, ..., v_i^(N-1) to V_N.
2. For each edge (v_j, v_i) in E and for each l, 0 <= l < N:
     If 0 <= l < delay(v_j, v_i) mod N,
       add the edge (v_j^(N + l - delay(v_j, v_i) mod N), v_i^l) to E_N, with delay floor(delay(v_j, v_i)/N) + 1;
     else,
       add the edge (v_j^(l - delay(v_j, v_i) mod N), v_i^l) to E_N, with delay floor(delay(v_j, v_i)/N).
Figure 4.8. An algorithm for unfolding an HSDFG by a factor of N.
ever, reasonably good execution time estimates can in fact be obtained for these operations, so that static assignment and ordering techniques are viable. For such applications self-timed scheduling is ideal, because the performance penalty due to lack of dynamic load balancing is overcome by the much smaller run-time scheduling overhead involved when static assignment and ordering is employed. The estimates for execution times of actors can be obtained by several different mechanisms. The most straightforward method is for the programmer to provide these estimates while developing the library of primitive blocks (actors). In this method, the programmer specifies the execution time estimates for each actor as a mathematical function of the parameters associated with that actor (e.g., the number of filter taps for an FIR filter, or the block size of a block operation such as an FFT). This strategy is used in the Ptolemy system [Pto98] for example, and is especially effective for libraries in which the primitives are written in the assembly language of the target processor. The programmer can provide a good estimate for blocks written in such a low-level library by counting the number of processor cycles each instruction consumes, or by profiling the block on an instruction-set simulator. It is more difficult to estimate execution times for blocks that contain control constructs such as data-dependent iterations and conditionals within their body, and when the target processor employs pipelining and caching. Also, it is difficult, if not impossible, for the programmer to provide reasonably accurate estimates of execution times for blocks written in a high-level language (as in the C code generation library in Ptolemy). The solution adopted in the GRAPE system [LEP90] is to automatically estimate these execution times by compiling the block (if necessary) and running it by itself in a loop on an instruction-set simulator for the target processor. To take into account data-dependent execution behavior, different input data sets can be provided for the block during simulation. Either the worst-case or the average-case execution time is used as the final estimate. The estimation procedure employed by GRAPE is obviously time-consuming; in fact, estimation turns out to be the most time-consuming step in the GRAPE design flow. Analytical techniques can be used instead to reduce this estimation time; for example, Li and Malik [LM95] have proposed algorithms for estimating the execution time of embedded software. Their estimation technique, which forms a part of a tool called cinderella, consists of two components: 1) determining the sequence of instructions in the program that results in maximum execution time (program path analysis), and 2) modeling the target processor to determine how much time the worst-case sequence determined in step 1 takes to execute (micro-architecture modeling). The target processor model also takes the effect of instruction pipelines and cache activity into account. The input to the tool is a generic C program with annotations that specify the loop bounds (i.e.,
the maximum number of iterations for which a loop runs). Although the problem is formulated as an integer linear program (ILP), the claim is that practical inputs to the tool can be efficiently analyzed using a standard ILP solver. The advantage of this approach, therefore, is the efficient manner in which estimates are obtained as compared to simulation. It should be noted that the program path analysis component of the Li and Malik technique is, in general, an undecidable problem; therefore, for these techniques to function, the programmer must ensure that his or her program does not contain pointer references, dynamic data structures, recursion, etc., and must provide bounds on all loops. Li and Malik's technique also depends on the accuracy of the processor model, although one can expect good models to eventually evolve for DSP chips and microcontrollers that are popular in the market. The problem of estimating execution times of blocks is central for us to be able to effectively employ compile time design techniques. This problem is an important area of research in itself, and the strategies employed in Ptolemy and GRAPE, and those proposed by Li and Malik, are useful techniques, and we expect better estimation techniques to be developed in the future.
In this chapter, we discussed various scheduling models for dataflow graphs on multiprocessor architectures that differ in whether scheduling decisions are made at compile time or at run-time. The scheduling decisions are actor assignment, actor ordering, and determination of exact firing times of each actor. A fully-static strategy lies at one extreme, in which all of the scheduling decisions are made at compile time, whereas a dynamic strategy makes all scheduling decisions at run-time. The trade-off involved is the low complexity of static techniques against the greater generality and tolerance to data-dependent behavior in dynamic strategies. For dataflow-oriented signal processing applications, the availability of compile time information makes static techniques very attractive. A self-timed strategy is commonly employed for such applications, where actor assignment and ordering is fixed at compile time (or system design time) but the exact firing time of each actor is determined at run-time, in a data-driven fashion. Such a strategy is easily implemented in practical systems through sender-receiver synchronization during interprocessor communication.
In this chapter, we focus on techniques that are used in self-timed scheduling algorithms to handle IPC costs. Since a tremendous variety of scheduling algorithms have been developed to date, it is not possible here to provide comprehensive coverage of the field. Instead, we highlight some of the most fundamental developments to date in IPC-conscious multiprocessor scheduling strategies for HSDFGs.
To date, most of the research on scheduling dataflow graphs has focused on the problem of minimizing the schedule makespan, which is the time required to execute all actors in the HSDFG once. When a schedule is executed repeatedly, for example by being encapsulated within an infinite loop as would typically be the case for a DSP application, the resulting throughput is equal to the reciprocal (1/p) of the schedule makespan p if all processors synchronize (perform a "barrier synchronization" as described later in Section 9.1) at the end of each schedule iteration. The throughput can often be improved beyond (1/p) by abandoning a global barrier synchronization, and implementing a self-timed execution of the schedule, as described in Section 4.4 [Sri95]. To further exploit parallelism between graph iterations, one may employ the technique of unfolding, which is discussed in Section 4.8. To model the transit time of interprocessor communication data in a multiprocessor system (for example, the time to write and read data values to and from shared memory), HSDFG edges are typically weighted by an estimate of the delay to transmit and receive the associated data if the source and sink actors of the edge are assigned to different processors. Such estimates are similar
to the execution time estimates that we use to model the run-time of individual dataflow actors, as discussed in Section 4.9. In this chapter, we are concerned primarily with efficient scheduling of HSDFGs in which an IPC cost is associated with each edge, and with the problem of constructing a minimum-makespan schedule for a given target multiprocessor architecture. As this definition suggests, solutions to the scheduling problem are heavily dependent on the underlying target architecture. In an attempt to decompose this problem and separate target-specific aspects from aspects of the problem that are fundamental to the structure of the input HSDFG, some researchers have applied a two-phased approach, pioneered by Sarkar. The first phase involves scheduling the input graph onto a virtual multiprocessor architecture, which consists of an infinite number of processors connected by an idealized interconnection network in which every processor can perform interprocessor communication without contention; the second phase then maps this schedule onto the given target architecture.
Clearly, the complexity of this second phase of the scheduling process is heavily dependent on the target architecture (for example, whether the target is a uniprocessor or a chain-structured processor interconnection topology), and much of the research in this area has focused on the derivation of effective heuristics. Since the inter-iteration dependencies represented by edges with delays are not relevant in the context of minimum-makespan scheduling, we assume in this chapter that the application is specified as an acyclic precedence expansion graph (Section 3.8) obtained by removing the edges that have delays; such an acyclic graph is often referred to as a task graph. We also assume that for each e in E, we are given an IPC cost that applies whenever the source and sink of e are assigned to different processors.
The classic algorithm for computing an assignment of actors to processors based on network flow principles was developed by Stone [Sto77]. This algorithm is designed for heterogeneous multiprocessor systems, and its goal is to
map actors to processors so that the sum of computation time and time spent on IPC is minimized. More specifically, suppose that we are given a target multiprocessor architecture consisting of n (possibly heterogeneous) processors P_1, P_2, ..., P_n; a set of actors A_1, A_2, ..., A_m; a set of actor execution times {t_i(A_j)}, where for each i in {1, 2, ..., n} and each j in {1, 2, ..., m}, t_i(A_j) gives the execution time of actor A_j on processor P_i; and a set of inter-actor communication costs {C_rs}, where C_rs = 0 if actors A_r and A_s do not exchange data, and otherwise C_rs gives the cost of exchanging data between A_r and A_s if A_r and A_s are assigned to different processors. The goal of Stone's assignment algorithm is to compute an assignment
F: {A_1, A_2, ..., A_m} -> {P_1, P_2, ..., P_n} such that the net computation and communication cost

cost(F) = sum over j of t_{F(A_j)}(A_j) + sum over pairs (A_r, A_s) with F(A_r) != F(A_s) of C_rs   (5-2)
is minimized. Note that minimizing (5-2) is not equivalent to minimizing the makespan. For example, if the set of target processors is homogeneous, then an optimal solution with respect to (5-2) results from simply assigning all actors to a single processor. The core of the algorithm is an elegant approach for transforming a given instance I of the assignment problem into an instance Z(I) = (V(I), E(I)) of the minimum-weight cutset problem. For example, for a two-processor system, two vertices p_1 and p_2 are created in Z(I) corresponding to the two heterogeneous target processors, and a vertex a_i is created for each actor A_i. For each A_i, an undirected edge in Z(I) is instantiated between a_i and p_1, and the weight of this edge is set to the execution time t_2(A_i) of A_i on p_2. This edge models the execution time cost of actor A_i that results if a_i and p_1 lie on opposite sides of the cutset that is computed for Z(I). Similarly, an edge (a_i, p_2) is instantiated with weight w((a_i, p_2)) = t_1(A_i). Finally, for each pair of vertices a_i and a_j such that C_ij != 0, an edge (a_i, a_j) in Z(I) is instantiated with weight w((a_i, a_j)) = C_ij. From a minimum-weight cutset in Z(I) that separates p_1 and p_2, an optimal solution to the heterogeneous processor assignment problem can easily be derived. Specifically, if R, a subset of E(I), is a minimum-weight cutset that separates p_1 and p_2, an optimal assignment can be derived from R by assigning A_i to P_1 if the edge (a_i, p_2) is contained in R, and to P_2 if the edge (a_i, p_1) is contained in R.
The net computation and communication cost of this assignment is simply

cost(F) = sum over y in R(I) of w(y).   (5-4)
An illustration of Stone's Algorithm is shown in Figures 5.2 and 5.1. Figure 5.2(a) shows the actor interaction structure of the input application; Figure 5.2(b) specifies the actor execution times; and Figure 5.2(c) gives all non-zero communication costs C_rs. The associated instance Z(I) of the minimum-weight cutset problem that is derived by Stone's Algorithm is depicted in Figure 5.1(a), and a minimum-weight cutset
R(I) is shown in Figure 5.1(b). The optimal assignment that results from this cutset is given by
F(A_1) = F(A_2) = P_1, and F(A_3) = F(A_4) = P_2.   (5-6)
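For the two-processor case, Stone's construction can be solved with any max-flow/min-cut routine; the following Python sketch uses the networkx package and models each undirected edge as a pair of opposite directed arcs. The function and argument names are illustrative assumptions, not part of Stone's original formulation.

import networkx as nx

def stone_two_processor_assignment(exec_times, comm_costs):
    """exec_times: dict actor -> (time on P1, time on P2);
    comm_costs: dict (actor_r, actor_s) -> C_rs (nonzero costs only)."""
    G = nx.DiGraph()
    for a, (t1, t2) in exec_times.items():
        # Stone's weights: the edge to p1 carries the cost of running on P2, and vice versa.
        G.add_edge("p1", a, capacity=t2)
        G.add_edge(a, "p1", capacity=t2)
        G.add_edge("p2", a, capacity=t1)
        G.add_edge(a, "p2", capacity=t1)
    for (ar, as_), c in comm_costs.items():
        G.add_edge(ar, as_, capacity=c)
        G.add_edge(as_, ar, capacity=c)
    cut_value, (side1, side2) = nx.minimum_cut(G, "p1", "p2")
    assignment = {a: ("P1" if a in side1 else "P2") for a in exec_times}
    return assignment, cut_value      # cut_value equals cost(F) in (5-4)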
For the two-processor case (n = 2), a variety of efficient algorithms can be used to derive a minimum-weight cutset for Stone's construction Z(I). When the target architecture contains more than two processors (n >= 3), the weight of each edge (a_i, p_j) is set to a weighted sum of the values
Figure 5.1. (a) The instance of the minimum-weight cutset problem that is derived from the example of Figure 5.2. (b) An illustration of a solution to this instance of the minimum-weight cutset problem.
Figure 5.2. An example that is used to illustrate Stone's Algorithm for computing heterogeneous processor assignments: the actor interaction structure, the actor execution times, and the nonzero communication costs C_rs = C_sr.
{t_j(A_i) | 1 <= j <= n}, and an optimal assignment F is derived by computing a minimum n-way cutset in Z(I). When n >= 4, Stone's approach becomes computationally intractable. Although Stone's algorithm has high intuitive appeal and has had considerable influence on the SDF scheduling community, the most effective algorithms known today for self-timed scheduling of SDF graphs have jointly considered both the assignment and ordering sub-problems. The approaches used in these joint algorithms fall into two broad categories: approaches that are driven by iterative, list-based mapping of individual tasks, and those that are based on constructing clusters of tasks that are to be executed on the same processor. These categories of scheduling techniques are discussed in the following three sections.
In a list-scheduling algorithm, a priority list L of actors is first constructed; a global time clock cc is maintained; and each task T is eventually mapped into a time interval [x_T, y_T] on some processor (the time intervals for two distinct actors assigned to the same processor cannot overlap). The priority list L is a linear ordering (v_1, v_2, ..., v_|V|) of the actors in the input task graph G = (V, E), where V = {v_1, v_2, ..., v_|V|}, such that for any pair of distinct actors v_i and v_j, v_i is given higher scheduling priority than v_j if and only if i < j. Each ready actor is mapped to an available processor as soon as it becomes the highest-priority actor, according to L, among all actors that are ready. An actor is ready if it has not yet been mapped, but its predecessors have all been mapped, and all of the associated intervals satisfy y_T <= t, where t is the current value of cc. For self-timed implementation, actors on each processor are ordered according to the order of their associated time intervals.
An important generalization of list scheduling, which we call ready-list scheduling, has been formalized by Printz [Pri91]. Ready-list scheduling maintains the scheduling convention that a schedule is constructed by repeatedly scheduling ready actors, but eliminates the notion of a static priority list and a global time clock. Thus, the only list that is fundamental to the scheduling process is the list of actors that are ready at a given scheduling step. A number of effective ready-list algorithms have been developed for this scheduling problem.
To be effective when IPC costs are not negligible, a list-scheduling or ready-list algorithm must incorporate the latencies associated with IPC operations. This involves either explicitly scheduling IPC operations onto the communication resources of the target architecture as the scheduling process progresses, or incorporating estimates of the time that it takes for data that is produced by an actor on one processor to be available for consumption by an actor that has been assigned to another processor. In either case, an additional constraint is imposed on the earliest possible starting times of actors that depend on the arrival of data.
A number of useful results and properties have been shown to hold for a special case of the scheduling problem, which we call ideal scheduling. In ideal scheduling, the target multiprocessor architecture consists of a set of processors that is homogeneous (the execution time of an actor is independent of the processor it is assigned to), and IPC is performed in zero time. Although issues of heterogeneous processing times and IPC cost are avoided, the ideal scheduling problem is intractable [GJ79]. When list scheduling is applied to an instance of the ideal scheduling problem and a given priority list for the problem instance, the resulting schedule is not necessarily unique, and generally depends on the details of the particular list-scheduling algorithm that is used. Specifically, the schedule depends on the processor selection scheme that is used when more than one processor is available at a given scheduling step. For example, consider the simple task graph in Figure 5.3(a); and suppose that t(A) = t(B) = t(C) = 1, and the target multiprocessor architecture consists of two processors P_1 and P_2. If list scheduling is applied to this example with priority list L = (A, B, C), then any one of the four schedules illustrated in Figure 5.3(b) may result, depending on the processor selection scheme.
Let ISP(G, t, n) denote the instance of the ideal scheduling problem that consists of task graph G = (V, E); actor execution times {t(v) | v in V} (on each processor in the target architecture); and a target, zero-IPC architecture that consists of n identical processors. Then, given a list-scheduling algorithm A and a priority list L = (v_1, v_2, ..., v_|V|) for G, we define S_A(G, t, n, L) to be the schedule produced by A when it is applied to ISP(G, t, n) with priority list L, and mu_A(G, t, n, L) to be the makespan of S_A(G, t, n, L).
We also define Sigma(G, t, n, L) = {S_A(G, t, n, L) | A is a list-scheduling algorithm}, the set of schedules that can be produced when a list-scheduling algorithm is applied to ISP(G, t, n) with priority list L. For the example of Figure 5.3, this set contains the four schedules shown in Figure 5.3(b).
Figure 5.3. An example that illustrates the dependence of list scheduling on processor selection (for a given priority list); the four possible schedules S1 through S4 on two processors are shown.
It is easily shown that schedules produced by list-scheduling algorithms on a given instance ISP(G, t, n) all have the same makespan. That is,

|{mu_A(G, t, n, L) | S_A(G, t, n, L) in Sigma(G, t, n, L)}| = 1.   (5-9)
This property of uniform makespan does not generally hold, however, if we allow heterogeneous processors in the target architecture or if we incorporate non-zero IPC costs. Clearly, effective construction of the priority list L is critical to achieving high-quality results with list scheduling. Graham has shown that when arbitrary priority lists are allowed, it is possible for the list scheduling approach to produce unusual results. In particular, it is possible that increasing the number of processors, reducing the execution time of one or more actors, or relaxing the precedence constraints in an SDF graph (removing one or more edges) can all cause a list scheduling algorithm to produce results that are worse (longer total execution time) than those obtained when the algorithm is applied with the original number of processors, the original set of SDF edges, or the original execution times, respectively [Gra69]. Graham, however, has established a tight bound on the anomalous performance degradation that can be encountered with list scheduling [Gra69]. This result is summarized by the following theorem.

Theorem: Suppose that G = (V, E) is a task graph; L = (v_1, v_2, ..., v_|V|) is a priority list for G; n and n' are positive integers such that n' >= n; t: V -> {0, 1, 2, ...} and t': V -> {0, 1, 2, ...} are assignments of non-negative integers to members of V (sets of actor execution times) such that for each v in V, t'(v) <= t(v); E' is a subset of E; S in Sigma(G, t, n, L) and S' in Sigma(G', t', n', L), where G' = (V, E'). Then

makespan(S') / makespan(S) <= 1 + (n - 1)/n',   (5-10)
and this is the tightest possible bound. Graham has also established a tight bound on the variation in list scheduling performance that can be encountered when different priority lists are used for the same instance of the ideal scheduling problem.
Theorem: Suppose that G = (V, E) is a task graph; L and L' are priority lists for G; n is a positive integer; t: V -> {0, 1, 2, ...} is an assignment of execution times to members of V; S in Sigma(G, t, n, L); and S' in Sigma(G, t, n, L'). Then

makespan(S') / makespan(S) <= 2 - 1/n,   (5-11)
and this is the tightest possible bound. The level of an actor A in an acyclic SDF graph G is defined to be the length of the longest directed path in G that originates at A. Here, the length of a path is taken to be the sum of the execution times of the actors on the path. Intuitively, actors with high level values need to be scheduled early since long sequences of computation depend on their completion. One of the earliest and most widely-used list-scheduling algorithms is the HLFET (highest level first with estimated times) algorithm [Hu61]. In this algorithm, L is created by sorting the actors in decreasing order of their levels. HLFET is guaranteed to produce an optimal result if there are only two processors, both processors are identical, and all tasks have identical execution times. For the general ideal scheduling problem (any finite number of processors is allowed, execution times need not be identical, and IPC costs are uniformly zero), HLFET has been proven to frequently produce near-optimal schedules [ACD74]. Early strategies for incorporating IPC costs into list scheduling include the algorithm of Yu [Yu84]. Yu's algorithm, a modification of HLFET scheduling, repeatedly selects the ready actor that has the highest level, and schedules it on the processor that can finish its execution at the earliest time. The earliest finishing time of a ready actor A on a processor P depends both on the time intervals that have already been scheduled on P, and on the IPC time required for the data required from the predecessors of A to arrive at P. In contrast to these early algorithms, the ETF (earliest task first) algorithm of Hwang, Chow, and Anger uses the level metric only as a tie-breaking criterion. At each scheduling step in ETF, the value t_e(A, P), the earliest time at which actor A can commence execution on processor P, is computed for every ready actor A and every target processor P. If an actor-processor pair (A*, P*) uniquely minimizes t_e(A, P), then A* is scheduled to execute on P* starting at time t_e(A*, P*); otherwise, the tie is resolved by selecting the actor-processor pair that has the highest level.
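As an illustration of level-based list scheduling for the ideal (zero-IPC, homogeneous) case, the following Python sketch computes static levels and repeatedly maps the highest-level ready actor to the earliest-available processor; it is a simplified sketch in the spirit of HLFET, not a reproduction of any of the algorithms cited above, and its names are illustrative.

def hlfet_schedule(actors, edges, t, n):
    """actors: list of names; edges: list of (pred, succ) precedence pairs;
    t: dict of execution times; n: number of identical processors.
    Returns the makespan and a map actor -> (processor, start_time)."""
    succ = {a: [] for a in actors}
    preds = {a: set() for a in actors}
    for u, v in edges:
        succ[u].append(v)
        preds[v].add(u)

    levels = {}
    def level(a):                       # longest path (by execution time) starting at a
        if a not in levels:
            levels[a] = t[a] + max((level(b) for b in succ[a]), default=0)
        return levels[a]

    proc_free = [0] * n                 # time at which each processor becomes idle
    finish = {}                         # actor -> finish time
    schedule = {}
    unscheduled = set(actors)
    while unscheduled:
        ready = [a for a in unscheduled if preds[a] <= finish.keys()]
        a = max(ready, key=level)       # highest level first
        data_ready = max((finish[p] for p in preds[a]), default=0)
        p = min(range(n), key=lambda i: proc_free[i])
        start = max(proc_free[p], data_ready)
        schedule[a] = (p, start)
        finish[a] = start + t[a]
        proc_free[p] = finish[a]
        unscheduled.remove(a)
    return max(finish.values()), schedule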
El-Rewini and Lewis proposed a list-scheduling algorithm, the mapping heuristic (MH), which attempts to account for communication traffic within an arbitrary, multi-hop interconnection network. Since maintaining and applying a precise accounting of traffic within such a network can be computa-
tionally expensive, the algorithm has been devised to maintain an approximate view of the network state E(t), from which reasonable estimates of communication delay can be derived for scheduling purposes. At any given scheduling time step t, E(t) incorporates three |P| x |P| matrices H, L, and D, where P is the set of processors in the target multiprocessor architecture. Given p_1, p_2 in P, H(p_1, p_2) gives the number of hops between p_1 and p_2 in the interconnection network; L(p_1, p_2) gives the preferred outgoing communication channel of p_1 that should be used when communicating data to p_2; and D(p_1, p_2) gives the communication delay between p_1 and p_2 that arises due to contention with other IPC operations in the system. In the MH algorithm, actors are first prioritized by a static, modified level metric l_mh that incorporates the communication costs that are assigned to the task graph edges. For an actor x, l_mh(x) is the longest path length in the task graph that originates at x, where the length of a path is taken as the sum of the actor execution times and edge communication costs along the path. A list-scheduling loop is then carried out in which, at any given scheduling time step t, an actor x* that maximizes l_mh is selected from among the actors that are ready at t. Processor selection is then achieved by assigning x* to the processor that allows the earliest estimated completion time; the estimated completion time is derived from the network state approximation E(t). El-Rewini and Lewis observe that there is a significant trade-off between the frequency with which the network state approximation E(t) is updated (which affects the accuracy of the approximation), and the time complexity of the resulting scheduling algorithm. The MH algorithm addresses this trade-off by updating E(t) only when a scheduled actor begins sending IPC data to a successor actor that is scheduled on another processor, or when the IPC data associated with a task graph edge (x, y) arrives at the processor that y is assigned to.
Loosely speaking, the prioritization mechanism in the MH algorithm is the converse of that employed in ETF. In MH, the "earliest actor-processor mapping" is used as a tie-breaking criterion, while the modified level l_mh is used as the primary priority function. Note also that when the target processor set P is homogeneous, selecting an actor-processor pair that minimizes the starting time is equivalent to selecting a pair that minimizes completion time, while this equivalence does not necessarily hold for a heterogeneous architecture. Thus, the concept of "earliest actor-processor mapping" that is employed by ETF is different from that in MH only in the heterogeneous processor case.
In the DLS (dynamic level scheduling) algorithm of Sih and Lee, the use of levels in traditional HLFET scheduling is replaced by a measure of scheduling priority that is continually re-evaluated as the schedule is constructed
[SL93a]. Sih and Lee demonstrated that such a concept is preferable because the "scheduling affinity" between actor-processor pairs depends not only on longest paths in the task graph, but also on the current scheduling state, which includes the actor/time-interval pairs that have already been scheduled on processing resources, and the IPC operations that have already been scheduled on the communication resources. As with ETF, the DLS algorithm also abandons the use of the global scheduling clock cc, and allows all target processors to be considered as candidates in every scheduling step (instead of just those processors that are idle at the current value of cc). With the elimination of cc, Sih's metric for prioritizing the ready actors, the dynamic level, can be formulated as

DL(A, P, sigma) = L(A) - max{D(A, P, sigma), F(P, sigma)},   (5-12)
where sigma represents the scheduling state at the current scheduling step; L(A) denotes the conventional (static) level of actor A; D(A, P, sigma) denotes the earliest time at which all data required by actor A can arrive at processor P; and F(P, sigma) gives the completion time of the last actor that is presently assigned to P.
While the incorporation of scheduling state by the DLS algorithm represents an important advancement in IPC-conscious scheduling, the formulation of the dynamic level in (5-12) contains a subtle limitation, which was observed by Kwok and Ahmad [KA96]. This limitation arises because the relative contributions of the two components in (5-12) (the static level and the data arrival time) to the dynamic level metric vary as the scheduling process progresses. Early in the scheduling process, the static levels are usually high, since the actors considered generally have relatively many topological descendants, and similarly, data arrival times are low, since the scheduling of actors on each processor begins at the origin of the time axis and progresses towards increasing values of time. As more and more scheduling steps are carried out, the static level parameters of ready actors will decrease steadily (implying lower influence on the dynamic level), and the data arrival times will increase (implying higher influence). Thus, the relative weighting of the static level and the data arrival time are not constant, but rather can vary strongly between different scheduling steps. This variation is not taken into account in the DLS algorithm.
Motivated partly by their observation on the limitations of dynamic level scheduling,KwokandAhmadhavedevelopedan alternative variation of list scheduling, called the DCP (dynamic critical path) algorithm, that also dynamically re-evaluates actor priorities [ U 9 6 ] . The DCP algorithm is motivated by the observation that the set of critical paths in a task graph can change from one
I~C-CO~SCIOUS se ~ ~ U L ALGO~ITHMS I ~ G scheduling step to the next, where the critical path is defined to be a directed path along which the sum of computation and communication timesis m a x i ~ i ~ ~ d . For example, consider the task graph depicted in Fig. 5.4. Here, the number beside each actor gives the execution time of the actor, and each numeric edge weight gives the IPC cost associated with the edge. Initially, in this graph, the critical path is A -+B -+ C ”+ F , and the length of this path is 16 time units. If the first two scheduling steps map both actors A and B to the same processor (e. g., to minimize the starting time of B ),then the weight of the edge (A,B ) in Fig. 5.4 effectively changes to zero. The critical path in the new “partially scheduled” graph thus becomes the path A -+ D -+E -+ F , which has a length of 14 time units. Because critical paths can change in this manner as the scheduling process progresses, the critical path of the partially scheduled graph is called the The DCP algorithm operates by repeatedly selecting and scheduling actors on the dynamic critical path, and updating thepartially scheduled graph as scheduling decisions are made. An elaborate processor selection scheme is also incorporated to map the actor selected at each scheduling step. This scheme not only considers the arrival of required data on each candidate processor, but also takes into accountthe possible starting timesof the taskgraphsuccessorsof the selected actor [KA96].
ltiprocessor scheduling operate by incrementally constructing groupings, called clusters, of actors that are to be executed
Figure 5.4. An illustrationof “dynamic” critical paths inmultiproc~ssorschedu~in~.
Chapter 5
on the same processor. Clustering and list scheduling can be used in a complem ~ n t afashion. ~ ~ypically,clustering is applied to focus the efforts of a listscheduler on effective processor assignments. When used ef~ciently,clustering can signi~cantlyenhance the results produced by a list scheduler (and a variety of other scheduling techniques). A scheduling algorithm (such as a list scheduler) processes a clustered H ~ by constraining ~ F the ~ vertices of V that are encompassed by each cluster to be assigned to the same processor. More than one cluster may be mapped by the scheduling algorithm to execute on the same processor; thus, a sequence of clustering operations does not necessarily specify a complete processor assignment, even when the target processors are all homogeneous. The net result of a clustering algorithm is to identify a family of disjoint subsets M , , M2, ...,M kc: V such that the underlying scheduling algorithm is forced to avoid IPC costs between any pair of actors that are members of the i . In the remainder of this section, we examine a variety of algorithms for computing such a family of subsets.
In the Linear Clustering Algorithm of Kim and Browne, longest paths in the input task graph are iteratively identified and clustered until every edge in the graph is either encompassed by a cluster or is incident to a cluster at both its source and sink. The path length metric is based on a function of the computation and communication along a given path. If p = (e_1, e_2, ..., e_n) is a task graph path, then the value of Kim and Browne's path length metric (5-13) for p is a weighted combination of three quantities: the execution times of the actors in T_p, where T_p, defined in (5-14), is the set of actors traversed by p; the IPC costs c(e) associated with the edges e along p; and, for each actor v ∈ T_p, the total IPC cost (5-15) between v and the actors that are not contained in T_p. The "normalization factors" α and β that weight these quantities are parameters of the algorithm.
Kim and Browne do not give a systematic technique for determining the normalization factors that should be used with the Linear Clustering Algorithm. Indeed, the derivation of the most appropriate normalization factors based on characteristics of the input task graph and the target multiprocessor architecture appears to be an interesting direction for further study. When α = 0.5 and β = 1, linear clustering reduces to clustering of critical paths.
Sarkar's internalization algorithm [Sar89] for graph clustering is based on determining a set of clustering operations that do not degrade the performance of a task graph on a machine with boundless processing resources (i.e., an unbounded number of processors). In internalization, the task graph edges are first sorted in decreasing order of their associated IPC costs. The edges in this list are then traversed according to this ordering. When each edge e is visited in this traversal, an estimate T_e of the parallel execution time is computed with the source and sink vertices of e constrained to execute on the same processor. This estimate is derived for an unbounded number of processors and a fully-connected communication network. If T_e does not exceed the parallel execution time estimate of the current clustered graph, then the current clustered graph is modified by merging the source and sink of e into the same cluster. An important strength of internalization is its simplicity, which makes it easily adaptable to accommodate additional scheduling objectives beyond minimizing execution time. For example, a hierarchical scheduling framework for multirate DSP systems has been developed using Sarkar's clustering technique as a substrate [PBL95]. This hierarchical framework provides a systematic method for combining multiprocessor scheduling algorithms that minimize execution time with uniprocessor scheduling techniques that optimize a target program's code and data memory requirements.
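The structure of internalization is simple enough to summarize in a short sketch. In the fragment below, the cluster assignment is a plain dictionary, and `parallel_time` is a stand-in (an assumption of this sketch) for the unbounded-processor, fully connected makespan estimate used to accept or reject a merge.

```python
import copy

def internalization(actors, edges, ipc_cost, parallel_time):
    """Sketch of Sarkar's internalization heuristic (cluster labels in a dict)."""
    cluster_of = {a: a for a in actors}        # initially one cluster per actor
    # visit edges in decreasing order of their IPC cost
    for (u, v) in sorted(edges, key=lambda e: ipc_cost[e], reverse=True):
        if cluster_of[u] == cluster_of[v]:
            continue
        trial = copy.copy(cluster_of)
        keep, absorbed = trial[u], trial[v]
        for a in trial:                        # tentatively merge the two clusters
            if trial[a] == absorbed:
                trial[a] = keep
        # accept the merge only if the estimated makespan does not increase
        if parallel_time(trial) <= parallel_time(cluster_of):
            cluster_of = trial
    return cluster_of
```

The caller supplies `parallel_time`, which is where the scheduling model lives; the greedy loop itself is what makes internalization easy to adapt to additional objectives.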
The DSC (dominant sequence clustering) algorithm of Yang and Gerasoulis incorporates principles similar to those used in the DCP algorithm, but applies these principles under the methodology of clustering rather than list scheduling [YG94]. As with DCP, a "partially scheduled graph" (PSG) is repeatedly examined and updated as scheduling steps are carried out. The IPC costs of intra-cluster edges in the PSG are all zero; all other IPC costs are the same as the corresponding costs in the task graph. Additionally, the DSC algorithm inserts new intra-cluster edges into the PSG so that a linear (total) ordering of actors is always maintained within each cluster. Initially, each actor in the task graph is assigned to its own cluster. Each clustering step selects a task graph actor that has not been selected in any previ-
ous clustering step, and determines whether or not to merge the selected actor with one of its predecessors in the PSG. The selection process is based on a priority function f(A), which is defined to be the length of the longest path (computation and communication time) in the PSG that traverses actor A. This priority function fully captures the concept of dynamic critical paths: an actor maximizes f(A) if, and only if, it lies on a critical path of the PSG.
At a given clustering step, an actor is selected if it is "free," which means that all of its PSG predecessors have been selected in previous clustering steps, and it maximizes f(A) over all free actors. Thus, if a free actor exists that is on a dynamic critical path, then an actor on the dynamic critical path will be selected. However, it is possible that none of the free actors is on the PSG critical path. In such cases, the selected actor is not on the critical path (in contrast, the DCP algorithm always selects actors that are on the dynamic critical path). Once an actor A is selected, its predecessors are sorted in decreasing order of the sum t(x) + c(x) + h(x), where t(x) is the execution time of predecessor x, c(x) is the IPC cost of edge (x, A), and h(x) is the length of the longest directed path in the PSG that terminates at x. A set of one or more predecessors x_1, x_2, ..., x_r is then chosen from the head of this sorted list such that "zeroing" (setting the IPC cost to zero) the associated output edges (x_1, A), (x_2, A), ..., (x_r, A) minimizes the value of h(A), and hence f(A), in the new PSG that results from clustering the subset of PSG vertices {x_1, x_2, ..., x_r, A}.

The DSC algorithm was designed with low computational complexity as the primary objective. The algorithm achieves a time complexity of O((N + E)logN), where N is the number of task graph actors, and E is the number of edges. In contrast, linear clustering is O(N(N + E)) [GY92]; internalization is an O(E(N + E)) algorithm; ETF is O(PN²), where P is the number of target processors; DLS is O(N³Pg(P)), where g is the complexity of the data routing algorithm that is used to compute D(A, P, Σ); DCP is O(N³); and the complexity of the Declustering Algorithm is discussed in Section 5.4.4 below.
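The predecessor-zeroing step just described can be sketched as follows. This sketch makes a simplifying assumption that is not part of the published algorithm: the arrival time of the selected actor A is estimated from its immediate predecessors only, and the intra-cluster ordering that DSC maintains is ignored.

```python
def choose_edges_to_zero(preds, t, c, h):
    """preds: predecessors of A; t, c, h: dicts following the definitions in the text."""
    order = sorted(preds, key=lambda x: t[x] + c[x] + h[x], reverse=True)

    def arrival_time(num_zeroed):
        # estimated arrival of A's data if the first num_zeroed edges in `order` are zeroed
        return max(h[x] + t[x] + (0 if i < num_zeroed else c[x])
                   for i, x in enumerate(order))

    best = min(range(1, len(order) + 1), key=arrival_time)
    return order[:best]      # predecessors whose edges into A are zeroed (merged with A)

# Hypothetical numbers for two predecessors of A (illustrative only):
print(choose_edges_to_zero(["x1", "x2"],
                           t={"x1": 4, "x2": 2},
                           c={"x1": 5, "x2": 1},
                           h={"x1": 3, "x2": 6}))      # ['x1', 'x2']
```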
As with internalization, DSC is designed for a fully connected network containing an unbounded number of processors, and for practical, processor-constrained systems it can be used as a preprocessing or intermediate compilation phase. As discussed in Section 5.1, optimal scheduling in the presence of IPC costs is intractable even for fully connected, infinite processor systems, and thus, given the polynomial complexity of DSC and internalization, we cannot expect guaranteed optimality from these algorithms. However, DSC is shown to be opti-
mal for a number of non-trivial sub-classes of task graphs [YG94]. Sih and Lee have developed a clustering approach called Declustering that is based on examining pairs of paths in the task graph to systematically determine which instances of parallelism should be preserved during the clustering process [SL93b]. Rather than exhaustively examining all pairs of paths (in general, a task that is hopelessly time-consuming), the Declustering technique focuses on paths rooted at branch actors, which are actors that have multiple successors. Branch actors are examined in increasing order of their static levels. Examination of a branch actor B begins by sorting its successors in decreasing order of their static levels. The two successors C_1 and C_2 at the head of this list (highest static levels) are then categorized as being either an Nbranch ("non-intersecting branch") pair or an Ibranch ("intersecting branch") pair. To perform this categorization, it is necessary to compute the transitive closures of C_1 and C_2. The transitive closure of an actor X, denoted TC(X), in a task graph G is simply the set of actors Y such that there is a delayless path in G directed from X to Y. Given the transitive closures TC(C_1) and TC(C_2), the successor pair (C_1, C_2) is an Nbranch instance if
TC(C_1) ∩ TC(C_2) = ∅,          (5-16)
and otherwise (if the transitive closures have non-empty intersection), (C_1, C_2) is an Ibranch instance. Intuitively, the transitive closure is relevant to the derivation of parallel schedules since two actors can execute in parallel (execute over overlapping segments of time) if, and only if, neither actor is in the transitive closure of the other. Once the branch-actor successor pair (C_1, C_2) is categorized as being an Ibranch or Nbranch instance, a TPPI is derived from it to determine an effective means for capturing the parallelism associated with (C_1, C_2) within a clustering framework. If (C_1, C_2) is an Nbranch instance, then the TPPI associated with (C_1, C_2) is the subgraph formed by combining a longest path (cumulative execution time) from C_1 to any task graph sink actor (an actor that has no output edges), a longest path from C_2 to any task graph sink actor, the associated branch actor B, and the connecting edges (B, C_1) and
(B, C_2).
For example, consider the task graph shown in Figure 5.5(a); for simplicity, assume that the execution time of each actor is unity; observe that the set of branch actors in this graph is {Q, T, U}; and consider the TPPI computation associated with branch actor T. The successors of this branch actor, U and V, satisfy TC(U) = {W, X, Z} and TC(V) = {Y}. Thus, we have
TC(U) ∩ TC(V) = ∅, which indicates that for branch actor T, the successor pair (U, V) is an Nbranch instance. The TPPI associated with this branch instance is shown in Figure 5.5(b). If (C_1, C_2) is an Ibranch instance, then the TPPI associated with (C_1, C_2) is derived by first selecting an actor M, called a merge actor, from the intersection TC(C_1) ∩ TC(C_2) that has maximum static level. The TPPI is then the subgraph formed by combining a longest path from C_1 to M, a longest path from C_2 to M, the associated branch actor B, and the
connecting edges (B, C_1) and (B, C_2).
Among the branch actors in Figure 5.5(a), only actor Q has an Ibranch instance associated with it. The corresponding TPPI, derived from merge actor U, is shown in Figure 5.5(c). After a TPPI is identified, an optimal schedule of the TPPI onto a two-processor architecture is derived. Because of the restricted structure of TPPI topologies, such an optimal schedule can be computed efficiently. Furthermore, depending on whether the TPPI corresponds to an Ibranch or an Nbranch instance, and on whether the optimal two-processor schedule utilizes both target processors, the optimal schedule can be represented by removing zero, one, or two edges, called cut arcs, from the TPPI: after removing the cut arcs from the TPPI, the (one or two) connected components in the resulting subgraph give the processor assignment associated with the optimal two-processor schedule. The declustering algorithm repeatedly applies the branch actor analysis discussed above for all branch actors in the task graph, and keeps track of all cut arcs that are found during this traversal of branch actors. After the traversal is complete, all cut arcs are temporarily removed from the task graph, and the connected components of the resulting graph are clustered. These clusters are then combined (hierarchical cluster grouping) in a pairwise fashion, two clusters at a time, to produce a hierarchy of two-actor clusters. Careful graph analysis is used to guide this hierarchy formation to preserve the most useful instances of parallelism for as large a depth as possible within the cluster hierarchy.
Then, during the cluster hierarchy decomposition and cluster breakdown phases of the Declustering Algorithm, the cluster hierarchy is systematically broken down and scheduled to match the characteristics of the target multiprocessor architecture. For full details on hierarchical cluster grouping, hierarchy decomposition, and cluster breakdown, the reader is encouraged to consult [Sih91, SL93b].
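The transitive-closure test that drives the Nbranch/Ibranch categorization is straightforward to express. The sketch below treats every path as delayless and assumes a particular connectivity around branch actor T (chosen only so that TC(U) = {W, X, Z} and TC(V) = {Y}, as stated in the example); it is an illustration, not the Declustering implementation.

```python
def transitive_closure(succ, x):
    """All actors reachable from x along directed (delayless) paths, excluding x."""
    seen, stack = set(), list(succ[x])
    while stack:
        y = stack.pop()
        if y not in seen:
            seen.add(y)
            stack.extend(succ[y])
    return seen

def classify_branch_pair(succ, c1, c2):
    common = transitive_closure(succ, c1) & transitive_closure(succ, c2)
    return "Nbranch" if not common else "Ibranch"

# Assumed connectivity below branch actor T (illustrative only):
succ_T = {"U": ["W"], "W": ["X"], "X": ["Z"], "Z": [], "V": ["Y"], "Y": []}
print(classify_branch_pair(succ_T, "U", "V"))      # Nbranch
```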
Due in part to the high complexity of the assignment and ordering problems in the presence of IPC costs, independent comparisons on subsets of algo-
Figure 5.5. An illustration of TPPIs in the Declustering Algorithm.
rithms developed for the scheduling problem consistently reveal that no single algorithm dominates as a clear "best choice" that handles most applications better than all of the other algorithms (for example, see [LAAG94]). Thus, an important challenge facing tool designers for application-specific multiprocessor implementation is the development of efficient methods for integrating the variety of algorithmic innovations in IPC-conscious scheduling so that their advantages can be combined in a systematic manner. One example of an initial effort in this direction is the DS (dynamic selection) strategy [LAAG94]. DS is a list scheduling algorithm that compares the number of available processors n_p to the number of executable (ready) actors n_r at each scheduling step. If n_p ≤ n_r, then one step of the DLS algorithm is invoked to complete the current scheduling step; otherwise, a minor variation of HLFET is applied to complete the step. This algorithm was motivated by experiments that revealed certain "regions of operation" in which scheduling algorithms exhibit particularly strong or weak performance compared to others. The performance of DS is shown to be significantly better than that of DLS or HLFET alone.
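The DS selection rule itself is a one-line policy. In the sketch below, the two helper callables stand in for one scheduling step of DLS and of the HLFET variant; they are assumptions of this sketch, not implementations of either algorithm.

```python
def dynamic_selection_step(ready_actors, idle_processors, dls_step, hlfet_step):
    """One scheduling step of the DS strategy described above."""
    n_p, n_r = len(idle_processors), len(ready_actors)
    if n_p <= n_r:
        # processors are the scarce resource: spend the effort of a full DLS step
        return dls_step(ready_actors, idle_processors)
    # otherwise a cheaper HLFET-style step is applied
    return hlfet_step(ready_actors, idle_processors)
```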
Pipelined scheduling algorithms attempt to efficiently partition a task graph into stages, assign groups of processors to stages, and construct schedules for each pipeline stage. Under such a scheduling model, the slowest pipeline stage determines the throughput of the multiprocessor implementation. In general, pipelining can significantly improve the throughput beyond what is achievable by the classical (minimum-makespan) scheduling problem; however, this improvement in throughput may come at the expense of a significant increase in latency (e.g., over the latency that is achievable by employing a minimum-makespan schedule). Research on pipelined scheduling is at a significantly less mature state than research on the classical scheduling problem defined in Section 5.1. Due to its high relevance to DSP and multimedia applications, we expect that in the coming years, there will be increasing activity in the area of pipelined scheduling. Bokhari developed fundamental results on the mapping of task graphs into pipelined schedules. Bokhari demonstrated an efficient, optimal algorithm for mapping a chain-structured task graph onto a linear chain of processors (the chain pipelining problem); the algorithm is based on an innovative data structure, the layered assignment graph, for modeling the chain pipelining problem. Figure 5.6 illustrates an instance of the chain pipelining problem, and the corresponding layered assignment graph. The task graph to be scheduled, the linearly-connected target multiprocessor architecture, and the layered assignment
Figure 5.6. An instance of the chain pipelining problem ((a) and (b)), and the associated layered assignment graph.
graph associated with the given task graph and target architecture are shown in Figures 5.6(a), 5.6(b), and 5.6(c), respectively. The number above each task graph actor A_i in Figure 5.6(a) gives the execution time t(A_i) of A_i, and the number above each task graph edge gives the IPC cost incurred between the associated source and sink actors if the source and sink are mapped to successive stages in the linear chain of target processors. If the source and sink actors of an edge are mapped to the same stage (processor), then the communication cost is taken to be zero. Given an arbitrary chain-structured task graph G = (V, E) consisting of actors {X_1, X_2, ..., X_n} such that
E = {(X_i, X_{i+1}) : 1 ≤ i < n},
and given a linearly connected multiprocessor architecture consisting of processors P_1, P_2, ..., P_m such that each P_i is connected to P_{i+1} by a single unidirectional communication link, the associated layered assignment graph is constructed by first associating a layer L_i of candidate "actor subchains" with each processor P_i in the target architecture. Each layer consists of a vertex for each subsequence (X_r, X_{r+1}, ..., X_{r+k}) of task graph actors that may be assigned to processor P_i. When constructing this association of candidate subchains to layers, it is assumed, without loss of generality, that a processor P_r may be assigned a non-empty subset of actors only if P_1, P_2, ..., P_{r-1} are all assigned non-empty subsets as well. It is also assumed that m ≤ n (otherwise mapping each actor to a separate processor trivially achieves the optimum result). Thus, the set S_i of vertices assigned to a layer L_i can be specified as a set of triples (equations (5-17) and (5-18)),
where each triple (i, b, c) corresponds to the assignment of actors {X_b, X_{b+1}, ..., X_c} to processor P_i. The set S_i represents the set of all valid assignments of actor subsets to processor P_i under the chain pipelining scheduling model. Edges in the layered assignment graph model compatibility relationships between elements of successive S_i's, and weights on these edges model the computation and communication costs associated with specific processor assignments. For each pair of vertices v_1 = (a, b, c) and v_2 = (a+1, b', c') in "adjacent" layers (a and a+1) that satisfy b' = c + 1, an edge (v_1, v_2) is instantiated in the layered assignment graph. The weight w((v_1, v_2)) assigned to (v_1, v_2) is the total computation time of the subset of actors associated with the assignment v_1 plus the IPC cost between the last actor, X_c, associated with v_1, and the first actor, X_{b'}, associated with v_2. In other words,

w((v_1, v_2)) = Σ_{i=b}^{c} t(X_i) + c((X_c, X_{b'})),          (5-19)
where c(e) denotes the IPC cost associated with edge e in the input task graph. From the above formulation for the construction of the layered assignment graph, the graph illustrated in Figure 5.6(c) is easily seen to be the layered assignment graph associated with Figures 5.6(a-b). For clarity, the layer identifier a is omitted from the label of each vertex (a, b, c), and instead, the grouping of vertices into layers is designated by the annotations "Layer 1", "Layer 2", "Layer 3" on the right side of Figure 5.6(c). Each dead-end path in Figure 5.6(c) that originates at a vertex in Layer 1 represents a possible solution to the chain pipelining problem of Figure 5.6(a-b). For example, the path (((1, 1, 1), (2, 2, 3)), ((2, 2, 3), (3, 4, 4))) corresponds to the processor assignment illustrated in Figure 5.7(a), and the single-edge path (((1, 1, 2), (2, 3, 4))) corresponds to the assignment shown in Figure 5.7(b). In general, the processing rate, or throughput, of the pipelined implementation associated with a given dead-end path p = (e_1, e_2, ..., e_k) in a layered assignment graph is given by

1 / max{w(e_i) : e_i ∈ p},          (5-20)

that is, the throughput is simply the reciprocal of the maximum weight (computation plus communication) of an edge in the path p. An edge in p that achieves this maximum weight is called a bottleneck edge. Thus, the chain pipelining problem reduces to the problem of computing a minimum-bottleneck dead-end path in the layered assignment graph that originates in Layer 1. Referring back to the example of Figure 5.6, the throughputs of the assignments corresponding to Figures 5.7(a) and 5.7(b) are easily seen to be 1/9 and 1/11, respectively, and thus, Figure 5.7(a) leads to a more efficient implementation. However, two other dead-end paths
both lead to still more efficient pipelined implementations. Both of these paths achieve the maximum achievable throughput of 1/7 for Figure 5.6(a-b). The associated processor assignments are shown in Figures 5.8(a) and (b), respectively. Bokhari observed that computing minimum-bottleneck paths in layered assignment graphs can be performed in polynomial time by applying an adaptation by Edmonds and Karp [EK72] of Dijkstra's shortest path algorithm [Dij59].
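A small sketch makes the formulation concrete. The code below builds the layered-graph view implicitly (vertices (a, b, c), edge weights per (5-19)) and adapts Dijkstra-style relaxation to the minimax (bottleneck) cost, in the spirit of the adaptation cited above; it is not Bokhari's algorithm. Two choices of this sketch are its own: it also charges the final stage's computation time, and the instance numbers are hypothetical rather than those of Figure 5.6.

```python
import heapq

def chain_pipeline(t, c, m):
    """t[i]: execution time of X_{i+1}; c[i]: IPC cost of (X_{i+1}, X_{i+2}); m: processors."""
    n = len(t)

    def stage_weight(b, e, has_next):
        # (5-19): computation of X_b..X_e plus the IPC cost to the next stage
        return sum(t[b - 1:e]) + (c[e - 1] if has_next else 0)

    heap = [(stage_weight(1, e, e < n), 1, e) for e in range(1, n + 1)]   # Layer 1
    heapq.heapify(heap)
    best = {}
    while heap:
        bottleneck, layer, last = heapq.heappop(heap)
        if best.get((layer, last), float("inf")) <= bottleneck:
            continue
        best[(layer, last)] = bottleneck
        if last == n:                                   # dead-end path: all actors assigned
            return bottleneck, 1.0 / bottleneck         # bottleneck weight and throughput (5-20)
        if layer == m:                                  # no processors left to extend with
            continue
        for e in range(last + 1, n + 1):                # next stage covers X_{last+1}..X_e
            w = max(bottleneck, stage_weight(last + 1, e, e < n))
            heapq.heappush(heap, (w, layer + 1, e))
    return None

# Hypothetical four-actor, three-processor chain (illustrative numbers only):
print(chain_pipeline(t=[2, 3, 4, 2], c=[1, 2, 1], m=3))   # (6, 0.1666...)
```

For this toy instance the best partition is {X_1}, {X_2}, {X_3, X_4}, whose slowest stage weighs 6 units, so the pipeline throughput is 1/6.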
Using the Edmonds-Karp adaptation allows an optimal chain pipelining to be computed in O(m²n⁴) time, where m is the number of processors in the target multiprocessor architecture, and n is the number of actors in the chain-structured task graph. However, Bokhari has devised a significantly more efficient algorithm that exploits the layered structure of his assignment graph model; using this technique, optimal solutions to the chain pipelining problem can be computed considerably faster. Bokhari also developed a number of extensions to his algorithm for chain pipelining. These compute optimal solutions for host-satellite pipeline systems under various restrictions on the application structure [Bok88]. A host-satellite system
Figure 5.7. Two possible chain pipelining implementations for the system shown in Figure 5.6. Both of these processor assignments are suboptimal.
consists of an arbitrary number of independent chain pipelining systems that have access to a single, shared host processor. Heuristics for more general forms of the pipelined scheduling problem, for example, pipelined scheduling that considers arbitrary task graph topologies and more general classes of target multiprocessor architectures, have been developed by Hoang and Rabaey, Banerjee et al. [BHCF95], and Liu and Prasanna [LP98]. A more detailed discussion of pipelined scheduling techniques is beyond the scope of this book. For further elaboration on this topic, the reader is encouraged to consult the aforementioned references.
Figure 5.8. Two alternative chain pipelining implementations for the system shown in Figure 5.6. Both of these solutions attain the maximum achievable throughput.
This chapter has surveyed IPC-conscious scheduling techniques for application-specific multiprocessors, and has emphasized the broad range of fundamental, graph-theoretic insights that have been established in the context of IPC-conscious scheduling. More specifically, we have reviewed key algorithmic developments in four areas relevant to the scheduling of HSDFGs: static assignment, list scheduling, clustering, and pipelined scheduling. For minimum-makespan scheduling, techniques that jointly address the assignment and ordering sub-problems are typically much more effective than techniques that are based on static assignment algorithms, and thus, list scheduling and clustering have received significantly more attention than static assignment techniques in this problem domain. However, no particular list scheduling or clustering algorithm has emerged as a clear, widely-accepted best algorithm that outperforms all other algorithms on most applications. Moreover, there is recent evidence that techniques to systematically integrate different scheduling strategies can lead to algorithms that significantly outperform the individual algorithms on which such integrated algorithms are based. The development of such integrated algorithms appears to be a promising direction for further work on scheduling. Another recent trend that is relevant to application-specific multiprocessor implementation is the investigation of pipelined scheduling strategies, which focus on throughput as the key performance metric. In this chapter, we have outlined fundamental results that apply to restricted versions of the pipelined scheduling problem. The development of algorithms that address more general forms of pipelined scheduling, which are commonly encountered in the design of application-specific multiprocessors, is currently an active research area.
The self-timed scheduling strategy described in Chapter 4 introduces synchronization checks when processors communicate; such checks permit variations in actor execution times, but they also imply run-time synchronization and arbitration costs. In this chapter we present a scheduling model called ordered transactions that alleviates some of these costs, and in doing so, trades off some of the run-time flexibility afforded by the self-timed approach. The ordered-transactions strategy was first proposed by Bier, Lee, and Sriram [LB90]. In this chapter, we describe the idea behind the ordered-transactions strategy; then we discuss the design and hardware implementation of a shared-bus multiprocessor that makes use of this strategy to achieve low-cost interprocessor communication using simple hardware. The software environment for this board, covering application specification, scheduling, and object code generation for the DSP processors, is based on the Ptolemy system developed at the University of California at Berkeley [Pto98].
In the ordered-transactions strategy, we first obtain a fully-static schedule using the execution time estimates, but we discard the precise timing information specified in the fully-static schedule; as in the self-timed schedule, we retain the processor assignment and the ordering of actors on each processor as specified by the fully-static schedule; in addition, we also retain the order in which processors communicate with one another, and we enforce this order at run-time. We formalize the concept of transaction order in the following. Suppose there are k inter-processor communication points (s_1, r_1), (s_2, r_2), ..., (s_k, r_k), where each (s_i, r_i) is a send-receive pair in the fully-static schedule that we obtain as a first step in the construction of a self-timed schedule. Let R be the set of receive actors and S be the set of send actors (i.e., R = {r_1, r_2, ..., r_k} and S = {s_1, s_2, ..., s_k}).
We define a transaction order to be a sequence

O = (v_1, v_2, v_3, ..., v_{2k-1}, v_{2k}),

where {v_1, v_2, ..., v_{2k-1}, v_{2k}} = S ∪ R (each communication actor is present in the sequence O). A transaction order O (as defined above) is enforced on a multiprocessor if at run-time the send and receive actors are forced to execute in the sequence specified by O. That is, if O = (v_1, v_2, v_3, ..., v_{2k-1}, v_{2k}), then imposing O means ensuring the constraints: end(v_1, k) ≤ start(v_2, k), end(v_2, k) ≤ start(v_3, k), ..., end(v_{2k-1}, k) ≤ start(v_{2k}, k), ∀ k ≥ 0.
Thus, the ordered-transactions schedule is essentially a self-timed schedule with the added transaction order constraints specified by O. The transaction order constraints must satisfy the data precedence constraints of the HSDFG being scheduled; we call such a transaction order an admissible transaction order. One simple mechanism to obtain an admissible transaction order is as follows. After a fully-static schedule is obtained using the execution time estimates, an admissible transaction order is obtained from the start times it specifies, by setting the transaction order to O = (v_1, v_2, v_3, ..., v_{2k-1}, v_{2k}), where the communication actors are indexed in order of nondecreasing start times.
An admissible transaction order can therefore be determined by sorting the set of communication actors (S ∪ R) according to their start times in the fully-static schedule. Figure 6.1 shows an example of how such an order could be derived from a given fully-static schedule. This fully-static schedule corresponds to the HSDFG and schedule illustrated in Chapter 4 (Figure 4.3). Such an order is clearly not the only admissible transaction order; for example, an order (s_1, r_1, s_2, r_2, s_3, r_3, s_4, r_4, s_6, r_6, s_5, r_5)
also satisfies all precedence constraints, and hence is admissible. In the next chapter we will discuss how to choose a good transaction order, which turns out
to be close to optimal under certain reasonable assumptions. For the purposes of this chapter, we will assume a given admissible transaction order, and defer the details of how to choose a good transaction order to the next chapter. The transaction order is enforced at run-time by a controller implemented
in hardware. The main advantage of ordering interprocessor transactions is that it allows us to restrict access to communication resources statically, based on the communication pattern determined at compile time. Since communication resources are typically shared between processors, the need for run-time arbitration of these resources, as well as the need for sender-receiver synchronization, is eliminated by ordering processor accesses to them; this results in an efficient IPC mechanism at low hardware cost. We have built a prototype four-processor DSP board, called the Ordered Memory Access (OMA) architecture, that demonstrates the ordered-transactions concept. The OMA prototype board utilizes shared memory and a single shared bus for IPC: the sender writes data to a particular shared memory location that is allocated at compile time, and the receiver reads that location. In this multiprocessor, a very simple controller on the board enforces the pre-determined transaction order at run-time, thus eliminating the need for run-time bus arbitration or semaphore synchronization. This results in efficient IPC (comparable to the fully-static strategy) at relatively low hardware cost. As in the self-timed scenario, the ordered-transactions strategy is tolerant of variations in execution times of actors, because the transaction order enforces correct sender-receiver synchronization; however, this strategy is more constrained than self-timed scheduling, which allows the order in which communication actors fire to vary at run-time. The ordered-transactions strategy, therefore, falls in between fully-static and self-timed strategies in that, like the self-timed strategy, it is tolerant of variations in execution times and, like the fully-static strategy, has low communication and synchronization costs. These performance issues will be discussed quantitatively in the following chapter; the remainder of this chapter describes the hardware and software implementation of the OMA prototype.
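The derivation of an admissible transaction order described earlier in this section amounts to a sort. The following sketch assumes the scheduler provides a dictionary of start times for the communication actors; the actor names and times are illustrative only.

```python
def admissible_transaction_order(sends, receives, start_time):
    """Sort the communication actors by their fully-static start times."""
    return sorted(list(sends) + list(receives), key=lambda a: start_time[a])

# Hypothetical two-pair example (times are illustrative):
order = admissible_transaction_order(
    sends=["s1", "s2"], receives=["r1", "r2"],
    start_time={"s1": 10, "r1": 14, "s2": 25, "r2": 30})
print(order)      # ['s1', 'r1', 's2', 'r2']
```

The resulting list is exactly what gets downloaded into the controller's access order list at run-time.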
Figure 6.1. One possible transaction order derived from a fully-static schedule.
The OMA architecture uses a single shared bus and shared memory for inter-processor communication. This kind of shared memory architecture is attractive for embedded multiprocessor implementations owing to its relative simplicity and low hardware cost, and to the fact that it is moderately scalable; a fully interconnected processor topology, for example, would not only be much more expensive than a shared bus topology, but would also suffer from limited scalability. Bus bandwidth limits scalability in shared bus multiprocessors, but for medium throughput applications (digital audio, music, etc.), a single shared bus provides sufficient bandwidth (of the order of 100 MBytes/s). One solution to the scalability problem is the use of multiple busses and hierarchies of busses, for which the ideas behind the OMA architecture directly apply. The reader is referred to Lee and Bier [LB90] for how the OMA concept is extended to such hierarchical bus structures. Although in this book we apply the ordered-transactions strategy to a single shared bus architecture, the synchronization optimization techniques described in Chapters 9 through 11 are applicable to more general platforms and are not restricted to medium throughput applications. From Figure 4.4 we recall that the self-timed scheduling strategy falls naturally into a message-passing paradigm that is implemented by the send and receive primitives inserted in the HSDFG. Accordingly, the shared memory in an architecture implementing such a scheduling strategy is used solely for message passing: the send primitive corresponds to writes to shared memory locations, and the receive primitive corresponds to reads from shared memory. Thus the shared memory is not used for storing shared data structures or for storing shared program code. In a self-timed strategy we can further ensure, at compile time, that each shared memory location is written to by only one processor. One way of doing this is to simply assign distinct shared buffers to each of the send primitives; this is the scheme implemented in the multiprocessor DSP code generation domain in the Ptolemy environment [Pto98].
Let us now consider the implementation of IPC in self-timed schedules on such a shared bus multiprocessor. The sender has to write into shared memory, which involves arbitration costs: it has to request access to the shared bus, and the access must be arbitrated by a bus arbiter. Once the sender obtains access to shared memory, it needs to perform a synchronization check on the shared memory location to ensure that the receiver has read the data that was written in the previous iteration, to avoid overwriting previously written data. Such synchronization is typically implemented using a semaphore mechanism; the sender waits until a semaphore is reset before writing to a shared memory location, and upon writing
that shared memory location, it sets that semaphore (the semaphore could be a bit in shared memory, one bit for each send operation in the parallel schedule). The receiver, on the other hand, busy-waits until the semaphore is set before reading the shared memory location, and resets the semaphore after completing the read operation. It can easily be verified that this simple protocol guarantees correct sender-receiver synchronization, and, even though the semaphore bits have multiple writers, no atomic test-and-set operation is required of the hardware. In summary, the operations of the sender are: request the bus, wait for arbitration, busy-wait until the semaphore is in the correct state, write the shared memory location when the semaphore is in the correct state, and then release the bus. The corresponding operations for the receiver are: request the bus, wait for arbitration, busy-wait on the semaphore, read the shared memory location when the semaphore is in the correct state, and release the bus. The IPC costs are therefore due to bus arbitration time and due to semaphore checks. If no special hardware support is employed for IPC, such overhead consumes on the order of tens of instruction cycles, and also expends power, an important concern for portable applications. In addition, semaphore checks consume shared bus bandwidth. An example of this is a four-processor DSP96000-based shared bus system designed by Dolby Labs for digital audio processing applications. In this machine, processors communicate through shared memory, and a central bus arbiter resolves bus request conflicts between processors. When a processor gets the bus, it performs a semaphore check, and continues the shared memory transaction if the semaphore is in the correct state. It explicitly releases the bus after completing the shared memory transaction. A receive and a send together consume 30 instruction cycles, even if the semaphores are in their correct state and the processor gets the bus immediately upon request. Such a high cost of communication forces the scheduler to insert as few interprocessor communication nodes as possible, which in turn limits the amount of parallelism that can be extracted from the algorithm. One solution to this problem is to send more than one data sample when a processor gets access to the bus; the arbitration and synchronization costs are then amortized over several data samples. An approach along these lines has been proposed by Zivojinovic et al., in which retiming is used to move delays in the HSDFG so that data can be transferred in blocks, instead of one sample at a time. Several issues need to be taken care of before the vectorization strategy can be employed. First, retiming SDFGs has to be done very carefully: moving delays across actors can change the initial state of the HSDFG, causing undesirable transients in the algorithm implementation. This can potentially be solved by including preamble code to compute the value of the sample corresponding to the delay when that delay is moved across actors. This, however, results in increased code size, and other
associated code generation complications. Second, the work of Zivojinovic et al. does not apply uniformly to all HSDFGs: if there are tight cycles in the graph that need to be partitioned among processors, the samples simply cannot be "vectorized." Thus, the presence of a tight cycle precludes arbitrary blocking of data. Third, vectorizing samples leads to increased latency in the implementation; some signal processing tasks, such as interactive speech, are sensitive to delay, and hence the delay introduced due to blocking of data may be unacceptable. Finally, the problem of vectorizing data in HSDFGs into blocks, even with all the above limitations, appears to be fundamentally hard; the algorithms proposed by Zivojinovic et al. have exponential worst-case run-times. Code generated currently by the Ptolemy system does not support blocking (or vectorizing) of data, for many of the above reasons. Another possible solution is to use special hardware. One could provide a full interconnection network, thus obviating the need to go through shared memory. Semaphores could be implemented in hardware. One could use multi-ported memories. Needless to say, this solution is not favorable because of cost and potentially higher power consumption, especially when targeting embedded applications. A general-purpose shared bus machine, the Sequent Balance for example, will typically use caches between the processor and the bus. Caches lead to increased shared memory bandwidth due to the averaging effect provided by block fetches and due to probabilistic memory access speedup due to cache hits. In signal processing and other real-time applications, however, there are stringent requirements for deterministic performance guarantees, as opposed to probabilistic speedup. In fact, the unpredictability in task execution times introduced by the use of caches may be a disadvantage for static scheduling techniques that utilize compile-time estimates of task execution times to make scheduling decisions (we recall the discussion in Section 4.9 on techniques for estimating task execution times). In addition, due to the deterministic nature of most signal processing problems (and also many scientific computation problems), shared data can be deterministically prefetched, because information about when particular blocks of data are required by a particular processor can often be predicted by a compiler. This feature has been studied in prior work proposing memory allocation schemes that exploit predictability in the memory access patterns of DSP algorithms; such a "smart allocation" scheme alleviates some of the memory bandwidth problems associated with high throughput applications. Processors with caches can cache semaphores locally, so that busy-waiting can be done locally to the processor without having to access the shared bus, hence saving the bus bandwidth normally expended on semaphore checks. Such a procedure, however, requires special hardware (a snooping cache controller, for
example) to maintain cache coherence; the cost of such hardware usually makes it prohibitive in embedded scenarios. Thus, for the embedded signal, image, and video processing applications that are the primary focus of this book, we argue that caches do not often have a significant role to play, and we claim that the ordered-transactions approach discussed previously provides a cost-effective solution for minimizing IPC overhead in implementing self-timed schedules.
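As a concrete illustration of the conventional protocol discussed in this section, the toy Python model below mimics the sender and receiver semaphore operations for one shared buffer. The bus request and arbitration steps, and the actual shared-memory hardware, are not modeled; the names and structure are assumptions of this sketch.

```python
import threading, time

shared = {"data": None, "full": False}     # "full" plays the role of the semaphore bit

def send(value):
    while shared["full"]:                  # busy-wait: receiver has not read the last value
        time.sleep(0)
    shared["data"] = value                 # shared memory write
    shared["full"] = True                  # set the semaphore

def receive():
    while not shared["full"]:              # busy-wait until the semaphore is set
        time.sleep(0)
    value = shared["data"]                 # shared memory read
    shared["full"] = False                 # reset the semaphore
    return value

producer = threading.Thread(target=lambda: [send(k) for k in range(3)])
consumer = threading.Thread(target=lambda: print([receive() for _ in range(3)]))
producer.start(); consumer.start(); producer.join(); consumer.join()   # prints [0, 1, 2]
```

Every iteration of this protocol spends bus cycles on the two busy-wait loops in addition to the data transfer itself, which is precisely the overhead that ordering the transactions removes.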
The ordered-transactions strategy, we recall, operates on the principle of determining (at compile time) the order in which processor communications occur, and enforcing that order at run-time. For a shared bus implementation, this translates into determining the sequence of shared memory (or, equivalently, shared bus) accesses at compile time and enforcing this predetermined order at run-time. This strategy, therefore, involves no run-time arbitration; processors are simply granted the bus according to the pre-determined access order. When a processor obtains access to the bus, it performs the necessary shared memory transaction, and releases the bus; the bus is then granted to the next processor in the ordered list. The task of maintaining ordered access to shared memory is performed by a central transaction controller. When the processors are downloaded with code, the controller is loaded with the pre-determined access order list. At run-time the controller simply grants bus access to processors according to this list, granting access to the next processor in the list when the current bus owner releases the bus. Such a mechanism is robust with respect to variations in execution times of the actors; the functionality of the system is unaffected by poor estimates of these execution times, although the real-time performance obviously suffers, as in any scheduling strategy that involves static ordering and assignment. We will show that if we are able to perform accurate compile time analysis, then the new transaction ordering constraints do not significantly impact performance. Also, no arbitration needs to be done, since the transaction controller grants exclusive access to the bus to each processor. In addition, no semaphore synchronization needs to be performed, because the transaction ordering constraints respect data precedences in the algorithm; when a processor accesses a shared memory location and is correspondingly allowed access to it, the data accessed by that processor is certain to be valid. As a result, under an ordered-transactions scenario, a send (receive) operation always occupies the shared bus for only one shared memory write (read) cycle. This reduces contention for the bus and reduces the number of shared memory accesses required for each IPC operation by at least a factor of two, and possibly much more, depending on the
amount of polling required in a conventional arbitration-based shared bus implementation. The performance of this scheme depends on how accurately the execution times of the actors are known at compile time. If these compile time estimates are reasonably accurate, then an access order can be obtained such that a processor gains access to shared memory whenever necessary. Otherwise, a processor may have to idle until it gets a bus grant, or, even worse, a processor that is granted the bus may not complete its transaction immediately, thus blocking all other processors from accessing the bus. This problem would not arise in normal arbitration schemes, because dynamic reordering of independent shared memory accesses is possible. We will quantify these performance issues in the next chapter, where we show that when reasonably good estimates of actor execution times are available, forcing a run-time access order does not in fact sacrifice performance significantly.
Figure 6.2 shows a high-level block diagram of the OMA prototype: four DSP96002 floating-point processors are connected to the shared bus, and shared memory resides on this bus. After a processor obtains access to the shared bus, it performs a single shared memory operation (send or receive) and releases the bus. The transaction controller detects the release of the bus and steps through its ordered list, granting the bus to the next processor in its list. The cost of transfer of one word of data between processors is 3 instruc-
tion cycles in the ideal case where the sender and the receiver obtain access to the shared bus immediately upon request; two of these correspond to a shared memory write (by the sender) and a shared memory read (by the receiver), and an extra instruction cycle is expended in bus release by the sender and bus acquisition by the receiver. Such low-overhead interprocessor communication is obtained with the transaction controller providing the only additional hardware support. As described in a subsequent section, this controller can be implemented with very simple hardware.
In the design discussed above, processor-to-processor communication occurs through a central shared memory; two transactions (one write and one read) must occur over the shared bus for each sender-receiver pair. This situation can be improved by distributing the shared memory among processors, as shown in Figure 6.3, where each processor is assigned shared memory in the form of hardware FIFO buffers. Writes to each FIFO are accomplished through the shared bus; the sender simply writes to the FIFO of the processor to which it
Figure 6.2. Block diagram of the OMA prototype.
wants to send data by using the appropriate shared memory address. Use of a FIFO implies that the receiver must know the exact order in which data is written into its input queue. This, however, is guaranteed by the ordered-transactions strategy. Thus replacing a RAM (random access memory) based shared memory with distributed FIFOs does not alter the functionality of the design. The sender need only block when the receiving queue is full, which can be accomplished in hardware by using the 'Transfer Acknowledge' (TA) signal on the DSP96002; a device can insert an arbitrary number of wait states in the processor memory cycle by de-asserting the TA line. Whenever a particular FIFO is accessed, its 'Buffer Full' line is enabled onto the TA line of the processors (Figure 6.4). Thus a full FIFO automatically blocks the processor trying to write into it, and no polling needs to be done by the sender. At the receiving end, reads are local to a processor, and do not consume shared bus bandwidth. The receiver can be made to either poll the FIFO empty line to check for an empty queue, or one can use the same TA signal mechanism to block processor reads from an empty queue. The TA mechanism will then use the local ("A") bus control signals ("A" bus TA signal, "A" address bus, etc.). This is illustrated in Figure 6.4.
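A toy software model of this distributed-FIFO variant is sketched below. The FIFO depth and class names are assumptions of this sketch (the text notes that the DSP96002 host interface can act as a 2-deep FIFO), and the TA-line wait states are represented simply by blocking queue operations.

```python
import queue

FIFO_DEPTH = 2        # assumed depth, matching the 2-deep on-chip host interface FIFO

class ProcessorModel:
    def __init__(self):
        self.in_fifo = queue.Queue(maxsize=FIFO_DEPTH)

    def send(self, receiver, word):
        # write over the shared bus into the receiver's FIFO;
        # blocks while that FIFO is full (the TA-line wait states)
        receiver.in_fifo.put(word)

    def receive(self):
        # purely local read; blocks while the FIFO is empty
        return self.in_fifo.get()

p0, p1 = ProcessorModel(), ProcessorModel()
p0.send(p1, 0x1234)           # one shared-bus write
print(hex(p1.receive()))      # local read on the receiving processor
```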
Figure 6.3. The OMA design modified to use distributed hardware FIFOs (one per processor) as shared memory.
Use of such a distributed shared memory mechanism has several advantages. First, the shared bus traffic is effectively halved, because only writes need to go through the shared bus. Second, in the design of Figure 6.2, if a processor that is granted the bus is delayed in completing its shared memory access, all other processors waiting for the bus get stalled; this does not happen for half the transactions in the modified design of Figure 6.3, because receiver reads are local. Thus there is more tolerance to variations in the time at which a receiver reads data sent to it. Last, a processor can broadcast data to all (or any subset) of the processors in the system by simultaneously writing to more than one FIFO buffer. Such a broadcast is not possible with a central shared memory. The modified design, however, involves a significantly higher hardware cost than the design proposed in Section 6.5.1. As a result, the OMA prototype discussed in the following sections (Sections 6.6 to 6.9) was built around the central shared memory design and not the FIFO-based design. In addition, the DSP96002 processor has an on-chip host interface unit that can be used as a 2-deep FIFO; therefore, the potential advantage of using distributed FIFOs can still
Figure 6.4. Details of the "TA" line mechanism (only one processor is shown).
be evaluated to some degree by using the on-chip host interface even in the absence of external FIFO hardware. Simulation models were written for both of the above designs using the Thor hardware simulator [Tho86] under the Frigg multiprocessor simulation system. Frigg allows the Thor simulator to communicate with a timing-driven functional simulator for the DSP96002 processor provided by Motorola. This simulator also simulates the input/output (I/O) operations of the pins of the processor, and Frigg interfaces the signals on the pins to the rest of the Thor simulation; as a result, hardware associated with each processor (memories, decoding logic, etc.) and interaction between processors can be simulated under Frigg. This allows the functionality of the entire system to be verified by running actual programs on the processor simulators. This model was not used for performance evaluation of the OMA prototype, however, because with just a four-processor system the cycle-by-cycle Frigg simulation was far too slow, even for very simple programs. A higher-level (behavioral) simulation would be more useful than a cycle-by-cycle simulation for the purposes of performance evaluation, but such high-level simulation was not carried out. The remainder of this chapter describes the hardware and software design of the OMA board prototype.
A prototype board implementing the OMA architecture has been designed and built. The transaction controller is implemented on a Xilinx FPGA; this Xilinx chip also handles the host interface functions, and implements a simple I/O mechanism. A hierarchical description of the hardware design follows.
The top-level design is shown in Figure 6.5. At the top level, there are four "processing element" (PE) blocks that consist of the processor, local memory, local address decoder, and some glue logic. Address, data, and control busses from the PE blocks are connected to form the shared bus. Shared memory is connected to this bus; address decoding is done by the "shared address decoder" PAL (programmable array logic) chip. A central clock generator provides a common clock signal to the processors. A Xilinx FPGA (XC3090) implements the transaction controller, and is also used to implement latches and buffers during bootup. A fast static RAM (8 bits wide) stores the bus
access order in the form of processor identifications (IDs). The sequence of processor IDs is stored in this "schedule RAM," and this determines the bus access order. An external latch is used to store the processor ID read from the schedule RAM. This ID is then decoded to obtain the processor bus grants.
A subset of the 32 shared bus address lines connects to the Xilinx chip, for addressing the I/O registers and other internal registers. All 32 lines of the shared data bus are connected to the Xilinx. The shared data bus can be accessed from the external connector (the "right side" connector in Figure 6.5) only through the Xilinx chip. This feature can be made use of when connecting multiple OMA boards: shared busses from different boards can be made into one contiguous bus, or they can be left disconnected, with communication between busses occurring via asynchronous "bridges" implemented on the Xilinx FPGAs. We discuss this further in Section 6.6.7. Connectors on both ends of the board bring out the shared bus in its entirety. Both the left and right side connectors follow the same format, so that multiple boards can easily be connected together. Shared control and address busses are buffered before they go off board via the connectors, and the shared data bus is buffered within the Xilinx. The DSP96000 processors have on-chip emulation ("OnCE" in Motorola terminology) circuitry for debugging purposes, whereby a serial interface to the OnCE port of a processor can be used for in-circuit debugging. On the OMA board, the OnCE ports of the four processors are multiplexed and brought out as a single serial port; a host may select any one of the four OnCE ports and communicate with it through a serial interface. We discuss the design details of the individual components of the prototype system next.
The task of the transaction order controller is to enforce the predetermined bus access order at run-time. A given transaction order determines the sequence of processor bus accesses that must occur at run-time; we refer to this sequence of bus accesses as the bus access order list. Since the bus access order list is program-dependent, the controller must possess memory into which this list is downloaded after the scheduling and code generation steps are completed, and when the transaction order that needs to be enforced is determined. The controller must step through the access order list, and must loop back to the first processor ID in the list when it reaches the end. In addition, the controller must be designed to effectively use the bus arbitration logic present on-chip, to conserve hardware.
The bus grant (BG) signal on the DSP chip is used to allow the processor to perform a shared bus access, and the bus request (BR) signal is used to tell the controller when a processor completes its shared bus access. Each of the two ports on the DSP96002 has its own set of arbitration signals; the BG and BR signals are the most relevant signals for the OMA design, and these signals are relevant only for the processor port connected to the shared bus. As the name suggests, the BG line (which is an input to the processor) must be asserted before a processor can begin a bus cycle: the processor is forced to wait for BG to be asserted before it can proceed with an instruction that requires access to the bus. Whenever an external bus cycle needs to be performed, a processor asserts its BR signal, and this signal remains asserted until an instruction that does not access the shared bus is executed. We can therefore use the BR signal to determine when a shared bus owner has completed its usage of the shared bus (Figure 6.6(a)).
The rising edge of the BR line is used to detect when a processor releases the bus. To reduce the number of signals going from the processors to the controller, we multiplexed the BR signals from all processors onto a common BR signal. The current bus owner has its BR output enabled onto this common signal; this provides sufficient information to the controller, because the controller only needs to observe the BR line of the current bus owner. This arrangement is shown in Figure 6.6(b); the controller grants access to a processor by asserting the corresponding BG line, and then waits for a rising edge on the common BR line. On receiving a positive-going edge on this line, it grants the bus to the next processor in its list.
One straightforward implementation of the above functionality is to use a counter addressing a RAM that stores the access order list in the form of processor IDs. We call this counter the schedule counter, and the memory that stores the processor IDs is called the schedule RAM. Decoding the output of the RAM provides the required BG lines. The counter is incremented at the beginning of a processor transaction by the negative-going edge of the common BR signal, and the output of the RAM is latched at the positive-going edge of BR, thus granting the bus to the next processor as soon as the current processor completes its shared memory transaction. The counter is reset to zero after it reaches the end of the list (i.e., the counter counts modulo the bus access list size). This is shown in Figure 6.7. Incrementing the counter as soon as BR goes low ensures enough time for the counter outputs and the RAM outputs to stabilize. For a 33 MHz processor with zero wait states, this interval is a minimum of 60 nanoseconds; thus the counter increment and the RAM access must both finish within this time.
We therefore need a fast counter and a fast static RAM for the schedule memory. The width of the counter determines the maximum allowable size of the access order list (a counter width of n bits implies a maximum list size of 2^n); a wider counter, however, implies a slower counter. If, for a certain width, the counter (implemented on the Xilinx part in our case) turns out to be too slow, i.e., the output of the schedule memory will not stabilize at least one latch setup period before the positive-going edge of BR arrives, wait states may have to be inserted in the processor bus cycle to delay the positive edge of BR. In our implementation, a 10-bit-wide counter does not require any wait states, and allows a maximum of 1024 entries in the access order list.
Figure 6.7. Schedule counter and schedule RAM implementation of the transaction controller: the counter addresses the schedule RAM (which contains the access list as address : processor ID entries), and the latched, decoded RAM output drives the bus grant lines BG0, BG1, ..., BGn.
A single bus access list implies that we can only enforce one bus access pattern at run-time. In order to allow for some run-time flexibility, we have implemented the OMA controller using a presettable counter. The processor that currently owns the bus can preset this counter by writing to a certain shared memory location. This causes the controller to jump to another location in the schedule memory, allowing multiple bus access schedules to be maintained in the schedule RAM and allowing switching between them at run-time depending on the outcome of computations in the program. The counter appears as an address in the shared memory map of the processors. The presettable counter mechanism is shown in Figure 6.8. An arbitrary number of lists may, in principle, be maintained in the schedule memory. This feature can be used to support algorithms that display data dependency in their execution. For example, a dataflow graph with a conditional construct will, in general, require a different access schedule for each outcome of the conditional. One of two different SDF subgraphs is executed in this case, depending on the branch outcome, and the processor that determines the branch outcome can also be assigned the task of presetting the counter, making it branch to the access list of the appropriate SDF subgraph. The access controller then behaves as in Figure 6.8(b). We discuss the use of this presettable feature in detail later in the book.
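The controller behavior described above can be summarized by a small behavioral model. This is a software sketch for exposition only, not the Xilinx implementation; the schedule contents are hypothetical, and looping back to the start of a list is shown here via the preset mechanism (the actual hardware can also wrap the counter modulo the list size).

```python
class TransactionController:
    """Behavioral model of the ordered-transactions bus controller."""
    def __init__(self, schedule_ram):
        self.ram = schedule_ram        # processor IDs in bus-access order
        self.counter = 0               # the presettable schedule counter

    def current_grant(self):
        return self.ram[self.counter]  # decoded RAM output = asserted BG line

    def bus_released(self):
        """Advance on the rising edge of the common BR line."""
        self.counter += 1

    def preset(self, address):
        """A bus owner writes this address to switch (or loop) access lists."""
        self.counter = address

# Two access lists stored back to back in the schedule RAM (hypothetical contents):
ctrl = TransactionController([0, 1, 2, 3,      # list A at addresses 0..3
                              0, 2, 1, 3])     # list B at addresses 4..7
for _ in range(4):
    print(ctrl.current_grant(), end=" ")       # 0 1 2 3
    ctrl.bus_released()
ctrl.preset(4)                                 # branch outcome selects list B
print(ctrl.current_grant())                    # 0
```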
The function of the host interface is to allow downloading programs onto the OMA board, controlling the board, setting parameters of the application being run, and debugging from a host workstation. The host for the OMA board connects to the shared bus through the Xilinx chip, via one of the shared bus connectors. Since part of the host interface is configured inside the Xilinx, different hosts (32 bit, 16 bit) with different handshake mechanisms can be used with the board. The host that is being used for the prototype is a Motorola DSP56000-based DSP board called the S-56X card, manufactured by Ariel Corp [Ari91]. The S-56X card is designed to fit into one of the Sbus slots in a Sun Sparc workstation; a user-level process can communicate with the S-56X card via a UNIX device driver. Thus the OMA board, too, can be controlled (via the S-56X card) by a user process running on the workstation. The host interface configuration is depicted in Figure 6.9. Unlike the DSP56000 processors, the DSP96002 processors do not have built-in serial ports, so the S-56X board is also used as a serial I/O processor for the OMA board. It essentially performs serial-to-parallel conversion of data,
Figure 6.8. Presettable counter implementation.
Chapter 6
buffering of data, and interrupt management. The Xilinx on the OMA board implements the necessary transmit and receive registers, and synchronization flags; we discuss the details of the Xilinx circuitry in Section 6.6.5. The S-56X card communicates with the Sparc Sbus using DMA (direct memory access). A part of the DSP56000 bus and control signals are brought out of the S-56X card through another Xilinx FPGA (XC3040) on the S-56X. For the purpose of interfacing the S-56X board with the OMA board, the Xilinx on the S-56X card is configured to bring out 16 bits of data and 5 bits of address from the DSP56000 processor onto the cable connected to the OMA (see Figure 6.9). In addition, the serial I/O port (the SSI port) is also brought out, for interfacing with I/O devices such as A/D and D/A convertors. By making the DSP56000 write to appropriate memory locations, the 5 bits of address and 16 bits of data going into the OMA may be set and strobed for a read or a write, to or from the OMA board. In other words, the OMA board occupies certain locations in the DSP56000 memory map; host communication is done by reading and writing to these memory locations.
Figure 6.9. Host interface.
Each processing element consists of a DSP96002 processor, local memory, a local address decoder, and some address decoding logic; the design is very similar to that of an existing DSP96002-based development system board. The local bus is brought out into a 96-pin euro-connector; this connector can be used for local memory expansion, and we have used it to provide a local I/O interface to the processing element (as an alternative to using the shared bus for I/O). Port A of the processor forms the local bus, connecting to local memory and the address decoding logic, while the processor's other bus port is connected to the shared bus; the board also contains address buffers, and logic to set up the bootup mode of the processor.
Actors communicating with one another on the same processor can be handled as described in Section 6.5.2, except that the FIFO buffer is internal to each processor.
As mentioned previously, the XC3090 Xilinx FPGA is used to implement the transaction controller as well as a simple I/O interface. It is also configured to provide latches and buffers for addressing the Host Interface (HI) ports on the DSP96002s during bootup and downloading of code onto the processors. For this to work, the Xilinx is first configured to implement the bootup- and download-related circuitry, which consists of latches to drive the shared address bus and to access the schedule memory. After downloading code onto the processors, and downloading the bus access order into the schedule RAM, the Xilinx chip is reconfigured to implement the transaction controller and I/O interface. Thus the process of downloading and running a program requires configuring the Xilinx chip twice.

There are several possible ways in which a Xilinx part may be programmed; in our case, the configuration bitmap is downloaded byte by byte from the host (the S-56X card). The bitmap file, generated and stored as a binary file, is read in by a function implemented in the qdm software (the software interface, described later in this chapter) and written into the appropriate memory location on the S-56X, which strobes these bytes into the Xilinx. The user can reset and reconfigure the Xilinx by manipulating the Xilinx control pins, which is done by writing to a "Xilinx configuration latch" on the OMA board; the various configuration pins of the Xilinx chip are manipulated by writing different values into this latch.
We use two different Xilinx circuits, one during bootup and the other during run-time. The Xilinx configuration used during bootup helps eliminate some glue logic that would otherwise be required to latch and decode address and data from the S-56X host. This configuration allows the host to read and write from any of the HI ports of the processors, and also to access the schedule memory and the shared memory on board. The run-time configuration of the Xilinx consists of the transaction controller, implemented as a presettable counter. The counter can be preset through the shared bus. It addresses an external fast RAM (8 nanosecond access time) that contains processor IDs corresponding to the bus access schedule. Output from the schedule memory is externally latched and decoded to yield the bus grant lines (Figure 6.7).
A schematic of the Xilinx configuration at run-time is given in Figure 6.11. This configuration is for I/O with an S-56X (16-bit data) host, although it can easily be modified to work with a 32-bit host. The S-56X board reads data from the Transmit (Tx) register and writes data to the Receive (Rx) register on the Xilinx. These registers are memory-mapped such that any processor that possesses the bus may write to the Tx register or read from the Rx register. For a 16-bit host, two transactions are required to perform a read or write with the 32-bit Tx and Rx registers. The processors themselves need only one bus access to load or unload data from the I/O interface. Synchronization on the S-56X (host) side is done by polling status bits that indicate an Rx empty flag (if true, the host performs a write, otherwise it busy-waits) and a Tx full flag (if true, the host performs a read, otherwise it busy-waits). On the OMA side, synchronization is done by the use of the TA (transfer acknowledge) pin on the processors. When a processor attempts to read Rx or write Tx, the appropriate status flags are enabled onto the TA line, and wait states are automatically inserted in the processor bus cycle whenever the TA line is not asserted, which in our implementation translates to wait states whenever the status flags are false. Thus, processors do not have the overhead of polling the I/O status flags; an I/O transaction is identical to a normal bus access, with zero or more wait states inserted automatically. The DSP56000 processor on the S-56X card is responsible for performing I/O with the actual (possibly asynchronous) data source and acts as the interrupt processor for the OMA board, relieving the board of tasks such as interrupt servicing and data buffering. This of course has the downside that the DSP56000 needs to be dedicated as an I/O unit for the OMA processor board, which limits other tasks that could potentially run on the host. Sockets for memory modules are provided, so that up to 512K of shared memory can reside on board; the memory must have an access time of 25 ns to achieve zero-wait-state operation.
Several features have been included in the design to facilitate connecting together multiple OMA boards. The connectors on either end of the shared bus are compatible, so that boards may be connected together in a linear fashion (Figure 6.12). As mentioned before, the shared data bus goes to the "right side connector" through the Xilinx chip. By configuring the Xilinx to "short" the external and internal shared data busses, processors on different boards can share one contiguous bus.
Figure 6.11. Xilinx configuration at run-time (shared data bus, host data bus, host address, and TA signals).
Alternatively, the busses can be "cleaved" on the Xilinx chip, with communication between busses implemented on the Xilinx via an asynchronous mechanism (e.g., read and write latches synchronized by "full" and "empty" flags). This concept is similar to the idea used in the SMART processor array [Koh90], where the processing elements are connected to a switchable bus: when the bus switches are open, processors are connected only to their neighbors (forming a linear processor array), and when the switches are closed, processors are connected onto a contiguous bus. Thus the SMART array allows formation of clusters of processors that reside on a common bus; these clusters then communicate with adjacent clusters. When we connect multiple OMA boards together, we get a similar effect: in the "shorted" configuration processors on different boards connect to a single bus, whereas in the "cleaved" configuration processors on different boards reside on separate busses, and neighboring boards communicate through an asynchronous interface. Figure 6.12 illustrates the above scheme. The highest 3 bits of the shared address bus are used as the "board ID" field. Memory, processor Host Interface ports, configuration latches, etc., decode the board ID field to determine whether a shared memory or host access is meant for them. Thus, a total of 8 boards can be hooked onto a common bus in this scheme.
We used single-sided through-hole printed circuit board technology for the OMA prototype. The printed circuit board design was done using the SIERA system developed in Professor Brodersen's group at the University of California at Berkeley [Sri92]. Under this system, a design is entered hierarchically using a netlist language called SDL (Structure Description Language). Geometric placement of components can be easily specified in the SDL netlist itself, and a 'tiling' feature is also provided to ease compact fitting of components. The SDL files were written in a modular fashion; the schematics hierarchy is shown in Figure 6.12. The SIERA design manager (DMoct) was then used to translate the netlists into an input file acceptable by Racal, a commercial PCB layout tool, which was then used to auto-route the board. Figure 6.13 shows a photograph of the board.
As discussed earlier, we use an S-56X card attached to a Sparc as a host for the OMA board. The Xilinx chip on the S-56X card is configured to provide 16 bits of data and 5 bits of address. We use the qdm software [Lap91] as an interface for the S-56X board; qdm is a debugger/monitor that has several useful built-in routines for controlling the S-56X board; for example, data can be written to and read from any location in the DSP56000 address space through function calls in qdm. Another useful feature of qdm is that it uses Tcl, an embeddable, extensible, shell-like interpreted command language [Ous94]. Tcl provides a set of built-in functions (such as an expression evaluator, variables, control-flow statements, etc.) that can be executed via user commands typed at its textual interface, or from a specified command file. Tcl can be extended with application-specific commands; in our case, these commands correspond to the debugging/monitor commands implemented in qdm as well as commands specific to the OMA. Another useful feature of Tcl is the scripting facility it provides; sequences of commands can be conveniently integrated into scripts, which are in turn executed by issuing a single command. Some functions specific to the OMA hardware that have been compiled into qdm are the following:
- Configure the Xilinx with the configuration specified by a given .bit file.
- Load bootstrap monitor code into a specified processor.
- Load a DSP96002 .lod file into a specified processor.
- Load a bus access schedule into the schedule memory.

Figure 6.12. Connecting multiple boards: busses on different boards may be connected together so that more than four processors reside on a single bus, or processors may be kept on separate busses with a handshake between the busses, which helps the scalability of the system.

These functions use existing qdm functions for reading and writing values to the DSP56000 memory locations that are mapped to the OMA host interface. Each processor is programmed through its Host Interface via the shared bus. First, a monitor program (omamon.lod) consisting of interrupt routines is loaded and run on the selected processor. Code is then loaded into processor
memory by writing address and data values into the HI port and interrupting the processor. The interrupt routine on the processor is responsible for inserting the data into the specified memory location. The S-56X host forces different interrupt routines for specifying which of the three (X, Y, or P) memories the address refers to, and for specifying a read or a write to or from that location. This scheme is similar to that employed in downloading code onto the S-56X itself. Status and control registers on the OMA board are memory mapped to the DSP56000 address space and can be accessed to reset, reboot, monitor, and debug the board. Tcl scripts were written to simplify commands that are used most often (e.g., 'change y:fff0 0x0' was aliased to 'omareset'). A Ptolemy multiprocessor hardware target [Pro9 ] was written for the OMA board, for automatic partitioning, code generation, and execution of an SDF block diagram specification. A simple heterogeneous multiprocessor target was also written in Ptolemy for the OMA together with the S-56X card; this target generates DSP56000 code for the S-56X and DSP96002 multiprocessor code for the OMA board.

Figure 6.13. Prototype board photograph.
A mechanism has also been implemented for performing I/O within the ordered-transactions framework. I/O in signal processing applications is periodic: samples (or blocks of samples) typically arrive at constant, periodic intervals, and the processed output is again required (by, say, a digital-to-analog convertor) at periodic intervals. With this observation, it is in fact possible to schedule the I/O operations within the multiprocessor schedule, and consequently to determine when, relative to the other shared bus accesses, the shared bus is required for I/O. This allows us to include bus accesses for I/O in the bus access order list. In our particular implementation, shared address locations address the Tx and Rx registers (Section 6.6.5), which in turn communicate with the host; a processor accesses these registers as if they were a part of shared memory. It obtains access to these registers when the transaction controller grants it access to the shared bus; bus grants for the purpose of I/O are taken into account when constructing the access order list. Thus accesses to shared I/O resources can be ordered much as accesses to the shared bus and memory.

The ordered memory access strategy can also be applied to run-time parameter control. By run-time parameter control we mean controlling parameters of the algorithm (gain of some component, bit-rate of a coder, pitch of synthesized music sounds, etc.) while the algorithm is running in real time on the hardware. Such a feature is obviously very useful and sometimes indispensable. Usually, one associates such parameter control with an asynchronous user interrupt: the user changes a parameter (ideally by means of a suitable graphical interface) and this change causes an interrupt to occur on the processor running the algorithm; an interrupt handler then performs the appropriate operations that cause the parame-
ter change that the user requested. For the OMA architecture, however, unpredictable interrupts are not desirable, as was noted earlier in this chapter; on the other hand, shared memory accesses and shared I/O are relatively inexpensive owing to the ordered-transactions mechanism. Parameter control is therefore implemented in the following manner: the host handles user interrupts, and whenever the user changes a parameter, the host services the corresponding interrupt and modifies a particular location in shared memory. The OMA board, on the other hand, reads the contents of this location every schedule period, whether it was actually modified or not. Thus the processors never "see" a user interrupt; they in essence sample the parameter value stored in shared memory on every iteration of the dataflow graph. Since reading in the value of a parameter takes only a few instruction cycles, the overhead involved in this scheme is minimal. An added practical advantage of the above scheme is that the tcl/tk primitives that have been implemented in Ptolemy for the S-56X card ([Pto98]) can be directly used with the OMA board for parameter control purposes.
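As a concrete illustration of this polling scheme, the following Python sketch (ours, not from the original software; all names are hypothetical) mimics the idea: the host writes a parameter into a shared location at arbitrary times, and each schedule period simply reads that location once, so the processors never service an interrupt.

# A minimal sketch (not from the book) of the run-time parameter control
# scheme described above: the host updates a shared location asynchronously,
# while each processor "samples" that location once per schedule period.

shared_mem = {"gain": 1.0}          # stands in for the shared-memory location

def host_update(param, value):
    """Runs on the host, e.g., from a GUI callback or interrupt handler."""
    shared_mem[param] = value

def schedule_period(samples):
    """One iteration of the multiprocessor schedule on the OMA board."""
    gain = shared_mem["gain"]        # one cheap shared-bus read per period
    return [gain * x for x in samples]

host_update("gain", 0.5)             # user turns a knob at some arbitrary time
print(schedule_period([1.0, 2.0]))   # the next period picks up the new value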
In this section we discuss several applications that were implemented using the OMA prototype.
The Karplus-Strong algorithm [KS83] is a well known approach for synthesizing the sound of a plucked string. The basic idea is to pass noise through a feedback loop consisting of a delay, a low pass filter, and a multiplier with a gain of less than one. The delay determines the pitch of the generated sound, and the multiplier gain determines the rate of decay. Multiple voices can be generated and combined by implementing one feedback loop for each voice and adding the outputs from all the loops. If we want to generate sound at a sampling rate of 44.1 kHz (the compact disc sampling rate), we can implement 7 voices on a single processor in real time using blocks from the Ptolemy code generation library (the 7 voices consume 370 of the 380 instruction cycles available per sample period). On the four-processor OMA board, we implemented 28 voices, organized as four hierarchical blocks, each consisting of 7 copies of the basic feedback loop, one loop per voice. The outputs are added together, and this sum is fed to a digital-to-analog convertor after being converted from a floating point representation into a fixed-point representation.

Figure 6.14. Hierarchical specification of the Karplus-Strong algorithm in 28 voices.

A schedule for this application is shown in Figure 6.15. The makespan for this schedule is 377 instruction cycles, which is just within the maximum allowable limit of 380. This schedule uses 15 pairs of sends and receives, and is therefore not communication-intensive. Even so, a higher IPC cost than the three instruction cycles the OMA architecture affords us would not allow this schedule to execute in real time at a 44.1 kHz sampling rate, because there is only a three-instruction-cycle margin between the makespan of this schedule and the maximum allowable makespan. To schedule this application, we employed Hu-level scheduling along with manual assignment of some of the blocks.
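For readers unfamiliar with the algorithm, the following Python sketch (ours; the actual implementation uses Ptolemy code generation blocks and DSP96002 assembly) shows one Karplus-Strong voice as described above: a noise-filled delay line, a two-tap averaging low pass filter, and a decay gain of less than one inside the feedback loop. Pitches, gain, and sampling rate are illustrative assumptions.

# A minimal sketch (not from the book) of one Karplus-Strong voice.

import random

def karplus_strong_voice(pitch_hz, n_samples, fs=44100, decay=0.996, seed=0):
    rng = random.Random(seed)
    delay = int(fs / pitch_hz)                       # the delay sets the pitch
    buf = [rng.uniform(-1.0, 1.0) for _ in range(delay)]   # initial noise burst
    out = []
    for i in range(n_samples):
        y = buf[i % delay]
        nxt = buf[(i + 1) % delay]
        buf[i % delay] = decay * 0.5 * (y + nxt)     # low-pass + decay in the loop
        out.append(y)
    return out

# Several voices are formed by summing independent loops, as in the 28-voice
# implementation; three voices are mixed here for brevity.
mix = [sum(s) / 3.0 for s in zip(karplus_strong_voice(220, 44100),
                                 karplus_strong_voice(277, 44100),
                                 karplus_strong_voice(330, 44100))]
print(len(mix), max(abs(v) for v in mix) <= 1.0)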
A Quadrature Mirror Filter (QMF) bank consists of a set of analysis filters used to decompose a signal (usually audio) into frequency bands, and a bank of synthesis filters used to reconstruct the decomposed signal [Vai93]. In the analysis bank, a filter pair is used to decompose the signal into high pass and low pass components, which are then decimated by a factor of two. The low pass component is then decomposed again into low pass and high pass components, and this process proceeds recursively. The synthesis bank performs the complementary operation of upsampling, filtering, and combining the high pass and low pass components; this process is again performed recursively to reconstruct the input signal.

Figure 6.16(a) shows a block diagram of an analysis filter bank followed by a synthesis bank. The filter banks are designed such that the analysis bank cascaded with the synthesis bank yields a transfer function that is a pure delay (i.e., has unity response except for a delay between the input and the output). Such filter banks are also called perfect reconstruction filter banks, and they find applications in high quality audio compression; each frequency band is quantized according to its energy content and its perceptual importance. Such a coding scheme is used, for example, in the audio portion of several compression standards. We implemented a perfect-reconstruction filter bank to decompose audio from a compact disc player into 15 bands; the synthesis bank was implemented together with the analysis part. There are a total of 36 filters, and the structure is specified hierarchically as shown in Figure 6.16(a). Delay blocks are required in the first 13 output paths of the analysis bank to compensate for the delay through successive stages of the analysis filter bank.
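The recursive analysis/synthesis structure can be illustrated with a minimal Python sketch (ours, not the 36-filter implementation described above); it uses the simplest perfect-reconstruction pair, the two-tap Haar filters, purely to show how the low pass branch is re-split on analysis and recombined on synthesis.

# A minimal sketch (not from the book) of a recursive, perfect-reconstruction
# analysis/synthesis filter bank using Haar filters.

import math

R2 = math.sqrt(2.0)

def analyze(x):
    """One analysis stage: low-pass and high-pass halves, decimated by 2."""
    lo = [(x[2*i] + x[2*i+1]) / R2 for i in range(len(x)//2)]
    hi = [(x[2*i] - x[2*i+1]) / R2 for i in range(len(x)//2)]
    return lo, hi

def synthesize(lo, hi):
    """Complementary synthesis stage: upsample, filter, and combine."""
    x = []
    for a, d in zip(lo, hi):
        x += [(a + d) / R2, (a - d) / R2]
    return x

def analysis_bank(x, levels):
    """Recursively re-split the low-pass branch, as in the 15-band bank."""
    if levels == 0:
        return [x]
    lo, hi = analyze(x)
    return analysis_bank(lo, levels - 1) + [hi]

def synthesis_bank(bands):
    x = bands[0]
    for hi in bands[1:]:
        x = synthesize(x, hi)
    return x

signal = [float(i % 7) for i in range(64)]
bands = analysis_bank(signal, 3)                       # 4 bands from 3 levels
rec = synthesis_bank(bands)
print(max(abs(a - b) for a, b in zip(signal, rec)))    # ~0: perfect reconstruction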
There are 1010 instruction cycles of computation per sample period in this example. Using Sih's Dynamic Level (DL) scheduling heuristic [Sih91], we were able to achieve an average iteration period of 366 instruction cycles. The schedule that was actually constructed (Gantt chart of Figure 6.16(b)) spans several input samples, because this number of samples is required before every actor in the graph fires at least once; this makes manual scheduling very difficult. We found that the DL heuristic performs close to 20% better than the classic Hu-level heuristic in this example, although it takes longer to compute the schedule than the Hu-level heuristic does.

The next example is an FFT computation distributed over the four processors. If the processors had independent access to the shared memory (if the shared memory were 4-ported, for example), we could achieve an ideal speedup of four, because each processor's FFT computation
is independent of the others except for data input and output. For this example, data partitioning, shared memory allocation, scheduling, and writing the assembly program were done by hand, using the 256-point complex FFT block in the Ptolemy CG96 domain as a building block. The Gantt chart
Figure 6.16. The 15-band analysis and synthesis filter bank (including its delay blocks), and its schedule on four processors (using Sih's DL heuristic [Sih91]).
for the hand-generated schedule, including IPC costs through the OMA controller, is shown in Figure 6.17.
In this chapter, we discussed the ideas behind the ordered-transactions scheduling strategy. This strategy combines compile time analysis of the IPC pattern with simple hardware support to minimize interprocessor communication overhead. We discussed the hardware design and implementation details of a prototype shared bus multiprocessor, the Ordered Memory Access (OMA) architecture, that uses the ordered-transactions scheduling strategy to statically assign the sequence of processor accesses to shared memory. External I/O and user-specified control inputs can also be taken into account when scheduling accesses to the shared bus. We also discussed the software interface details of the prototype and illustrated some applications that were implemented on the OMA prototype.
Figure 6.17. Schedule for the FFT example (1024 complex values are read by each processor; each processor writes a result of 256 complex values).
In this chapter the limits of the ordered-transactions scheduling strategy are systematically analyzed. Recall that the self-timed schedule is obtained by first generating a fully-static schedule {σ_p(v), σ_t(v), T_FS}, and then ignoring the exact firing times specified by the fully-static schedule; the fully-static schedule itself is derived using compile time estimates of actor execution times. As defined in the previous chapter, the ordered-transactions strategy is essentially the self-timed strategy with added ordering constraints O that force processors to communicate in an order predetermined at compile time. The questions addressed in this chapter are: What exactly are we sacrificing by imposing such an order? Is it possible to choose a transaction order such that this penalty is minimized? What is the effect of variations of task (actor) execution times on the throughput achieved by a self-timed strategy and by an ordered transactions strategy?

The effect of imposing a transaction order on a self-timed schedule is best illustrated by the following example. Let us assume that we use the dataflow graph and its schedule that were introduced in Chapter 4 (Figure 4.3), and that we enforce the transaction order (obtained by sorting the σ_t values) of Figure 6.1; we reproduce these for convenience in Figure 7.1(a) and (b). If we observe how the schedule "evolves" as it is executed in a self-timed manner (essentially a simulation in time of when each processor executes the actors assigned to it), we get the "unfolded schedule" of Figure 7.2; successive iterations of the HSDFG overlap in a natural manner. This is of course an idealized scenario where IPC costs are ignored; we do so to avoid unnecessary detail in the diagram, since IPC costs can be included in our analysis in a straightforward manner. Note that the self-timed schedule in Figure 7.2 eventually settles to a periodic pattern consisting of two iterations of the HSDFG; the average iteration
period under the self-timed schedule is 9 units. The average iteration period (which we will refer to as T_ST) for such an idealized (zero IPC cost) self-timed schedule represents a lower bound on the iteration period achievable by any schedule that maintains the same processor assignment and actor ordering. This is because the only run-time constraint on processors that the self-timed schedule imposes is due to data dependencies: each processor executes the actors assigned to it (including the communication actors) according to the compile-time-determined order. An actor at the head of this ordered list is executed as soon as data is available for it. Any other schedule that maintains the same processor assignment and actor ordering, and respects data precedences in G, cannot result in an execution where actors fire earlier than they do in the idealized self-timed schedule. In particular, the overlap of successive iterations of the HSDFG in the idealized self-timed schedule ensures that T_ST ≤ T_FS in general. The self-timed schedule allows reordering among IPCs at run-time. In fact, we observe from Figure 7.2 that once the self-timed schedule settles into a
Figure 7.1. Fully-static schedule on five processors: (a) the HSDFG with actor execution times and processor assignments; (b) the fully-static schedule on processors Proc 1 through Proc 5.
periodic pattern, the IPCs in successive iterations are ordered differently: in the first iteration, the order in which IPCs occur is indeed the transaction order shown in Figure 7.1(b):

(s1, r1, s2, r2, s3, r3, s4, r4, s5, r5, s6, r6).

However, once the schedule settles into a periodic pattern, the order alternates between two different orderings of the sends and receives in successive iterations.

In contrast, if we impose the transaction order of Figure 7.1(b), that is, if we enforce the order

(s1, r1, s2, r2, s3, r3, s4, r4, s5, r5, s6, r6),
the resulting ordered transactions schedule evolves as shown in Figure 7.3. Notice that enforcing this schedule introduces idle time (hatched rectangles in Figure 7.3); as a result, T_OT, the average iteration period for the ordered transactions schedule, is 10 units, which is (as expected) larger than the iteration period of the ideal self-timed schedule with zero arbitration and synchronization overhead (9 units), but is smaller than T_FS (11 units). In general T_FS ≥ T_OT ≥ T_ST: the self-timed schedule only has assignment and ordering constraints, the ordered transactions schedule has the transaction ordering constraints in addition to the constraints in the self-timed schedule, whereas the fully-static schedule has exact timing constraints that subsume the constraints in the self-timed and ordered transactions schedules. The question we would like to answer is: is it possible to choose the
transaction ordering more intelligently than the straightforward σ_t-sorted order chosen in Figure 7.1(b)?

As a first step towards determining how such a "best" possible access order might be obtained, we attempt to model the self-timed execution itself and try to determine the precise effect (e.g., the increase in the iteration period) of adding transaction ordering constraints. Note again that as the schedule evolves in a self-timed manner in Figure 7.2, it eventually settles into a periodically repeating pattern that spans two iterations of the dataflow graph, and the average iteration period, T_ST, is 9. We would like to determine these properties of self-timed schedules analytically, without having to resort to simulation.
In a self-timed strategy a schedule S specifies the actors assigned to each processor, including the IPC actors send and receive, and specifies the order in which these actors must be executed. At run-time each processor executes the actors assigned to it in the prescribed order. When a processor executes a send it writes into a certain buffer of finite size, and when it executes a receive, it reads from a corresponding buffer; it checks for buffer overflow (on a send) and buffer underflow (on a receive) before it performs communication operations, and it blocks, or suspends execution, when it detects one of these conditions. We model a self-timed schedule using an HSDFG G_ipc = (V, E_ipc) derived from the application graph G = (V, E). The graph G_ipc, which we will refer to as the IPC graph for short, models the fact that actors of G assigned to
Figure 7.3. Schedule evolution when the transaction order of Figure 7.1(b) is enforced (T_OT = 10; hatched regions indicate idle time due to the ordering constraints).
the same processor execute sequentially, and it models constraints due to interprocessor communication. For example, the self-timed schedule in Figure 7.1(b) can be modeled by the IPC graph in Figure 7.4. The IPC graph has the same vertex set V as G, corresponding to the set of actors in G. The self-timed schedule specifies the actors assigned to each processor, and the order in which they execute. For example, in Figure 7.1, processor 1 executes A and then E; we model this in G_ipc by drawing a cycle around the vertices corresponding to A and E, and placing a delay on the edge from E to A. The delay-free edge from A to E represents the fact that the k-th execution of A precedes the k-th execution of E, and the edge from E to A with a delay represents the fact that the k-th execution of A can occur only after the (k-1)-th execution of E has completed. Thus if actors v_1, v_2, ..., v_n are assigned to the same processor in that order, then G_ipc would have a cycle ((v_1, v_2), (v_2, v_3), ..., (v_{n-1}, v_n), (v_n, v_1)), with delay((v_n, v_1)) = 1 (because v_1 is executed first). If there are P processors in the schedule, then we have P such cycles corresponding to the processors. The additional edges due to these constraints are shown as dashed arrows in Figure 7.4.
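The construction of these processor cycles is mechanical; the following Python sketch (ours, with an illustrative processor assignment rather than that of Figure 7.1) builds the corresponding edges: delay-free edges between consecutively executed actors, plus a wrap-around edge carrying one delay from the last actor back to the first.

# A minimal sketch (not from the book): processor-cycle edges of an IPC graph.

def processor_cycle_edges(assignment):
    """assignment: proc -> ordered list of actors.  Returns (src, dst, delay)."""
    edges = []
    for order in assignment.values():
        for a, b in zip(order, order[1:]):
            edges.append((a, b, 0))                  # same-iteration precedence
        edges.append((order[-1], order[0], 1))       # next-iteration wrap-around
    return edges

# Hypothetical two-processor assignment (not Figure 7.1):
print(processor_cycle_edges({1: ["A", "E"], 2: ["B", "F"]}))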
As mentioned before, edges in G that cross processor boundaries after scheduling represent interprocessor communication. Communication actors (send and receive) are inserted for each such edge; these are shown in Figure 7.1.
Figure 7.4. IPC graph for the schedule in Figure 7.1 (send and receive actors are shown explicitly; the critical cycle is highlighted).
The IPC graph has the same semantics as an HSDFG, and its execution models the execution of the corresponding self-timed schedule. The following definitions are useful to formally state the constraints represented by the IPC graph. Time is modeled as an integer that can be viewed as a multiple of a base clock. Recall that the function start(v, k) ∈ Z+ represents the time at which the k-th execution of actor v starts in the self-timed schedule. The function end(v, k) ∈ Z+ represents the time at which the k-th execution of the actor v ends and v produces data tokens at its output edges, and we set start(v, k) = 0 and end(v, k) = 0 for k < 0 as the "initial conditions". The start(v, 0) values are specified by the schedule: start(v, 0) = σ_t(v). By Definition 4.2, as per the semantics of an HSDFG, each edge (v_j, v_i) of G_ipc represents the following data dependence constraint:

\[ start(v_i, k) \ge end\bigl(v_j,\, k - delay((v_j, v_i))\bigr), \quad \text{for all } (v_j, v_i) \in E_{ipc} \text{ and for all } k \ge delay(v_j, v_i). \tag{7-4} \]
The constraints in (7-4) are due both to communication edges (representing synchronization between processors) and to edges that represent sequential execution of actors assigned to the same processor. Also, to model execution times of actors, we associate an execution time t(v) with each vertex of the IPC graph; t(v) assigns a positive integer execution time to each actor v (which can be interpreted as t(v) cycles of a base clock). Interprocessor communication costs can be represented by assigning execution times to the send and receive actors. We may substitute

\[ end(v_j, k) = start(v_j, k) + t(v_j) \]

in (7-4) to obtain

\[ start(v_i, k) \ge start\bigl(v_j,\, k - delay((v_j, v_i))\bigr) + t(v_j), \quad \text{for all } (v_j, v_i) \in E_{ipc}. \tag{7-5} \]
In the self-timed schedule, actors fire as soon as data is available at all input edges. Such an "as soon as possible" (ASAP) firing pattern implies

\[ start(v_i, k) = \max\bigl(\{\, start(v_j,\, k - delay((v_j, v_i))) + t(v_j) \;:\; (v_j, v_i) \in E_{ipc} \,\}\bigr). \tag{7-6} \]

In contrast, recall that in the fully-static schedule we would force actors to fire periodically according to

\[ start(v, k) = \sigma_t(v) + k \cdot T_{FS}. \tag{7-7} \]
The IPC graph has the same semantics as a marked graph used in Petri net theory [Pet81][RCG80]: the transitions of a marked graph correspond to the nodes of the IPC graph, the places of a marked graph correspond to its edges, and the initial marking of a marked graph corresponds to the initial tokens on the edges. The IPC graph is also similar to Reiter's computation graph [Rei68]. The same properties hold for it, and we state some of the relevant properties here. The proofs listed here are similar to the proofs for the corresponding properties of marked graphs and computation graphs in the references above.

Lemma 7.1: The number of tokens in any cycle of the IPC graph is always constant over all possible valid firings of actors in the graph, and is equal to the path delay of that cycle.

Proof: For each cycle C in the IPC graph, the number of tokens on C can only change when actors that are on C fire, because actors not on C remove and place tokens only on edges that are not part of C. If C = ((v_1, v_2), (v_2, v_3), ..., (v_{n-1}, v_n), (v_n, v_1)) and any actor v_k (1 ≤ k ≤ n) fires, then exactly one token is moved from the edge (v_{k-1}, v_k) to the edge (v_k, v_{k+1}), where v_0 = v_n and v_{n+1} = v_1. This conserves the total number of tokens on C. QED.

An HSDFG G is said to be deadlocked if at least one of its actors cannot fire an infinite number of times in any valid sequence of firings of actors in G. Thus, when executing a valid schedule for a deadlocked HSDFG, some actor v fires only a finite number of times and is never enabled to fire thereafter.
An HSDFG G (in particular, an IPC graph) is free of deadlock if and only if it does not contain delay-free cycles.

Proof: Suppose there is a delay-free cycle C = ((v_1, v_2), (v_2, v_3), ..., (v_{n-1}, v_n), (v_n, v_1)). By Lemma 7.1, none of the edges (v_1, v_2), (v_2, v_3), ..., (v_{n-1}, v_n), (v_n, v_1) can contain tokens during any valid execution of G. Then each of the actors v_1, ..., v_n has at least one input that never contains any data. Thus, none of the actors on C is ever enabled to fire, and hence G is deadlocked. Conversely, suppose G is deadlocked, i.e., there is one actor v_1 that never fires after a certain sequence of firings of actors in G. Thus, after this sequence of firings, there must be an input edge (v_2, v_1) that never contains data. This implies that the actor v_2 in turn never gets enabled to fire, which in turn implies that there must be an edge (v_3, v_2) that never contains data. In this manner we can trace a path p = ((v_{n+1}, v_n), ..., (v_3, v_2), (v_2, v_1)) for n = |V| back from v_1 that never contains data on its edges after the given sequence of firings. Since G contains only |V| actors, p must visit some actor twice, and hence must contain a cycle C'. Since the edges of p do not contain data, C' is a delay-free cycle. QED.

Definition: A schedule S is said to be deadlocked if after a certain finite time at least one processor blocks (on a buffer full or buffer empty condition) and stays blocked.
If the specified schedule is deadlock-free, then the corresponding IPC graph is deadlock-free. This is because a deadlocked IPC graph would imply that a set of processors depend on data from one another in a cyclic manner, which in turn implies a schedule that displays deadlock.

The iteration period for a strongly connected IPC graph, when actors execute as soon as data is available at all inputs, is given by:

\[ T_{ST} = \max_{\text{cycle } C \in G_{ipc}} \left\{ \frac{\sum_{v \in C} t(v)}{Delay(C)} \right\}. \tag{7-9} \]

Note that Delay(C) > 0 for an IPC graph constructed from an admissible schedule, so the quotient is well defined. This result has been proved in so many different contexts that we do not present another proof of this fact here.

The quotient in (7-9),

\[ \frac{\sum_{v \in C} t(v)}{Delay(C)}, \tag{7-10} \]

is called the cycle mean of the cycle C, and the entire quantity on the right hand side of (7-9) is the maximum cycle mean of the strongly connected IPC graph G_ipc. If the IPC graph contains more than one SCC, then different SCCs may have different asymptotic iteration periods, depending on their individual maximum cycle means. In such a case, the iteration period of the overall graph (and hence the self-timed schedule) is the maximum over the maximum cycle means of all the SCCs of G_ipc, because the execution of the schedule is constrained by the slowest component in the system. Henceforth, we will define the maximum cycle mean as follows.
Definition: The maximum cycle mean of an IPC graph G_ipc, denoted MCM(G_ipc), is the maximum of the cycle means over all fundamental cycles of G_ipc. That is,

\[ MCM(G_{ipc}) = \max_{\text{cycle } C \in G_{ipc}} \left\{ \frac{\sum_{v \in C} t(v)}{Delay(C)} \right\}. \tag{7-11} \]

Note that MCM(G) may be a non-integer rational quantity. We will use the term MCM instead of MCM(G) when the graph being referred to is clear from the context. A fundamental cycle in G_ipc whose cycle mean is equal to the MCM is called a critical cycle of G_ipc. Thus the throughput of the system of processors executing a particular self-timed schedule is equal to the corresponding (MCM)^{-1} value.
For example, in Figure 7.4, G_ipc has one SCC, and its maximum cycle mean is 7 time units; this corresponds to the critical cycle highlighted in Figure 7.4. We have not included IPC costs in this calculation, but these can be included in a straightforward manner by appropriately setting the execution times of the send and receive actors.

As explained in Section 3.15, the maximum cycle mean can be calculated in time O(|E| T'), where T' is the sum of t(v) over all actors in the HSDFG.
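Since the MCM is the ratio of a cycle's total execution time to its total delay, it can also be computed by a binary search on the candidate value combined with a negative-cycle test; the following Python sketch (ours, not the algorithm of Section 3.15) takes this route. The actor names, execution times, and delays in the example are illustrative assumptions.

# A minimal sketch (not from the book) of computing
#   MCM(G_ipc) = max over cycles C of (sum of t(v), v on C) / Delay(C)
# by binary search on the ratio plus Bellman-Ford negative-cycle detection.

def has_negative_cycle(nodes, edges, weight):
    """Bellman-Ford negative-cycle test with edge weights given by weight(e)."""
    dist = {v: 0.0 for v in nodes}            # virtual source at distance 0
    for _ in range(len(nodes)):
        changed = False
        for e in edges:
            u, v = e[0], e[1]
            w = weight(e)
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return False
    return True                               # still relaxing => negative cycle

def max_cycle_mean(t, edges, tol=1e-6):
    """t: actor -> execution time; edges: list of (src, dst, delay)."""
    nodes = list(t)
    lo, hi = 0.0, float(sum(t.values()))      # MCM lies in [0, sum of t(v)]
    while hi - lo > tol:
        lam = (lo + hi) / 2.0
        # a cycle with mean > lam exists iff weights lam*delay - t(src)
        # admit a negative cycle
        if has_negative_cycle(nodes, edges, lambda e: lam * e[2] - t[e[0]]):
            lo = lam                          # some cycle mean exceeds lam
        else:
            hi = lam
    return 0.5 * (lo + hi)

# Hypothetical four-actor, two-processor example:
t = {"A": 3, "E": 4, "B": 2, "F": 5}
edges = [("A", "E", 0), ("E", "A", 1),        # processor 1: A then E
         ("B", "F", 0), ("F", "B", 1),        # processor 2: B then F
         ("A", "B", 0), ("F", "E", 1)]        # IPC edges
print(max_cycle_mean(t, edges))               # -> approximately 7.0 here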
If we only have execution time estimates available instead of exact values, and we set t(v) in the previous section to these estimated values, then we obtain the estimated iteration period by calculating the MCM. Henceforth we will assume that we know the estimated throughput, calculated by setting the t(v) values to the available timing estimates. As discussed in Chapter 1, for most practical scenarios we can only assume such compile time estimates, rather than clock-cycle accurate execution time estimates. In fact, this is the reason we had to rely on self-timed scheduling, and we proposed the ordered transactions strategy as a means of achieving efficient IPC despite the fact that we do not assume knowledge of exact actor execution times. Section 4.9 discusses estimation techniques for actor execution times.

The transaction ordering constraints can be modeled as additional edges in the IPC graph: an edge (v_j, v_i) with zero delay represents the constraint start(v_i, k) ≥ end(v_j, k). The ordering constraints can therefore be expressed as a set of edges between communication actors. For example, the constraints O = (s1, r1, s2, r2, s3, r3, s4, r4, s5, r5, s6, r6) applied to the IPC graph of Figure 7.4 are represented by the graph in Figure 7.5. If we call these additional ordering constraint edges E_OT (solid arrows in Figure 7.5), then the graph (V, E_ipc ∪ E_OT) represents the constraints in the ordered transactions schedule as it evolves in Figure 7.3. Thus, the maximum cycle mean of (V, E_ipc ∪ E_OT) represents the effect of
Figure 7.5. IPC graph with the transaction ordering constraints shown as additional edges between the send and receive actors.
adding the ordering constraints. The critical cycle of this graph is drawn in Figure 7.5; it is different from the critical cycle in Figure 7.4 because of the added transaction ordering constraints. Ignoring communication costs, the MCM is 9 units, which was also observed from the evolution of the transaction-constrained schedule in Figure 7.3. The problem of finding an "optimal" transaction order can therefore be stated as: determine a transaction order such that the resultant constraint edges E_OT do not increase the maximum cycle mean of the IPC graph.
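Given a candidate transaction order, the extra edges E_OT (and hence the resulting MCM) are easy to generate mechanically. The short Python sketch below (ours) assumes, as an illustrative modeling choice, that the wrap-around edge from the last communication actor back to the first carries one delay so that the order repeats every iteration; it can be combined with the max_cycle_mean sketch given earlier.

# A minimal sketch (not from the book): ordering constraints as edges E_OT.

def ordering_edges(order):
    """order: list of send/receive actors, e.g. ['s1', 'r1', 's2', 'r2', ...]."""
    e = [(a, b, 0) for a, b in zip(order, order[1:])]   # enforce o1 before o2 ...
    e.append((order[-1], order[0], 1))                  # assumed one-delay wrap-around
    return e

# Effect of an order on the iteration period (illustrative call):
# constrained_mcm = max_cycle_mean(t, ipc_edges + ordering_edges(order))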
We noted earlier that as the self-timed schedule in Figure 7.2 evolves, it settles into a periodic repeating pattern that spans two iterations of the dataflow graph. It can be shown that a self-timed schedule always settles down into a periodic execution pattern; in [BCOQ92] the authors show that the firing times of transitions in a marked graph are asymptotically periodic. In terms of our notation, for any strongly connected IPC graph,

\[ start(v_i, k + N) = start(v_i, k) + N \cdot T_{ST} \]

for all v_i ∈ V and for all k > K, for some positive integers N and K. Thus, after a "transient" that lasts K iterations, the schedule settles into a periodic pattern, and the periodic pattern itself spans N iterations. The periodicity N depends on the number of delays in the critical cycles of G_ipc, and it can be exponential in the size of the graph [BCOQ92]; the example of Figure 7.2 has one critical cycle containing two delays, which results in a periodicity of two for that schedule. The "transient" region defined by K (which is 1 in Figure 7.2) can also be exponential. The effect of transients followed by a periodic regime is essentially a property of longest paths in weighted directed graphs. These effects have been studied in the context of instruction scheduling, as-soon-as-possible firing of transitions in Petri nets, and clock schedules for sequential logic circuits. In the instruction scheduling context, for example, it has been noted that if instructions in an iterative program (represented as a dependency graph) are scheduled in an as-soon-as-possible fashion, a pattern of parallel instructions "emerges", and determining this pattern (essentially by simulation) leads to a parallelization of the program. In [Chr85] the author studies periodic firing patterns of transitions in Petri nets. Iterative algorithms for determining clock schedules for sequential circuits have convergence properties similar to the transients in self-timed schedules (such an algorithm converges when an equivalent self-timed schedule reaches a periodic regime).

Returning to the problem of determining the optimal transaction order, one possible scheme is to derive the transaction order from the repeating pattern that the self-timed schedule settles into. That is, instead of using the transaction order of Figure 6.1, if we enforce the transaction order that repeats over two iterations in the evolution of the self-timed schedule of Figure 7.2, the ordered transactions schedule would "mimic" the self-timed schedule exactly, and we would obtain an ordered transactions schedule that performs as well as the ideal self-timed schedule, and yet involves low IPC costs in practice. However, as pointed out above, the number of iterations that the repeating pattern spans depends on the critical cycles of G_ipc, and it can be exponential in the size of the HSDFG [BCOQ92]. In addition, the "transient" region before the schedule settles into a repeating pattern can also be exponential. Consequently, the memory requirements for the controller that enforces the transaction order can be prohibitively large in certain cases; in fact, even for the example of Figure 7.2, the doubling of the controller memory that such a strategy entails may be unacceptable. We therefore restrict ourselves to determining and enforcing a transaction order that spans only one iteration of the HSDFG; in the following section we show that there is no sacrifice in imposing such a restriction, and we discuss how such an "optimal" transaction order is obtained.
In this section we show how to determine an order O* on the IPCs in the schedule such that imposing O* yields an ordered transactions schedule that has an iteration period within one unit of the ideal self-timed schedule (T_ST ≤ T_OT ≤ T_ST + 1). Thus, imposing the order O* results in essentially no loss in performance over an unrestrained schedule, and at the same time we get the benefit of cheaper IPC.

Our approach to determining the transaction order O* is to modify a given fully-static schedule so that the resulting fully-static schedule has T_FS equal to ⌈T_ST⌉, and then to derive the transaction order from that modified schedule. Intuitively it appears that, for a given processor assignment and ordering of actors on processors, the self-timed approach always performs better than the fully-static or ordered transactions approach (T_FS ≥ T_OT ≥ T_ST), simply because it allows successive iterations to overlap. The following result, however, tells us that it is always possible to modify any given fully-static schedule so that it performs nearly as well as its self-timed counterpart. Stated more precisely:

Theorem 7.1: Given a fully-static schedule S = {σ_p(v), σ_t(v), T_FS}, let T_ST be the average iteration period for the corresponding self-timed schedule (as mentioned before, T_FS ≥ T_ST). Suppose T_FS > T_ST; then there exists a valid fully-static schedule S' that has the same processor assignment as S and the same order of execution of actors on each processor, but an iteration period of ⌈T_ST⌉. That is, S' = {σ_p(v), σ'_t(v), ⌈T_ST⌉}, where, if actors v_i, v_j are on the same processor (i.e., σ_p(v_i) = σ_p(v_j)), then σ_t(v_i) > σ_t(v_j) if and only if σ'_t(v_i) > σ'_t(v_j). Furthermore, S' is obtained by solving the following set of linear inequalities for σ'_t:

\[ \sigma'_t(v_j) - \sigma'_t(v_i) \le \lceil T_{ST} \rceil \times delay(v_j, v_i) - t(v_j), \quad \text{for each edge } (v_j, v_i) \text{ in } G_{ipc}. \tag{7-14} \]
Proofi Let S’ have a period equal to T.Then, under the schedule S’ ,the k th starting time of actor vi is given by start( vi, k ) =
d t (
vi)
+ kT .
(7- 15)
Also, data precedence constraints imply (as in (7-5)) s t a ~ ( v ik, ) >start(vj, k-deZay(vj, vi))+ t ( v j ) , for all (vj, vi) E EiF
(7- 16)
Substituting (7- 15) in (7- 16), we have
+ kT 2a ‘ , ( v j ) + (k-deZay(vj, v,))T + t(vj) ,
dt(vj)
for all (vj, vi) E EiF.That is, d t ( v j )-o ’ , ( v j ) I T x d ( v j , v i )-t(vj) ,for all ( v j , v i )E EiF.
(7- 17)
Note that the construction of G_ipc ensures that processor assignment constraints are automatically met: if σ_p(v_i) = σ_p(v_j) and v_i is to be executed immediately after v_j, then there is an edge (v_j, v_i) in G_ipc. The relations in (7-17) represent a system of |E_ipc| inequalities in |V| unknowns (the quantities σ'_t(v_i)). This system of inequalities is a difference constraint problem that can be solved in polynomial time (O(|E_ipc| |V|)) using the Bellman-Ford shortest path algorithm, as described in Section 3.14. Recall that a feasible solution to a given set of difference constraints exists if and only if the corresponding constraint graph does not contain a negative weight cycle; this is equivalent to the condition

\[ T \ge \frac{\sum_{v \in C} t(v)}{Delay(C)} \quad \text{for every cycle } C \text{ in } G_{ipc}, \tag{7-18} \]

and, from (7-9), this is equivalent to T ≥ T_ST.
If we set T = ⌈T_ST⌉, then the right hand sides of the system of inequalities in (7-17) are integers, and the Bellman-Ford algorithm yields integer solutions σ'_t(v). This is because the weights on the edges of the constraint graph, which are equal to the right hand sides of the difference constraints, are integers whenever T is an integer; consequently, the shortest paths calculated on the constraint graph are integers. Thus S' = {σ_p(v), σ'_t(v), ⌈T_ST⌉} is a valid fully-static schedule. QED.

Theorem 7.1 essentially states that a fully-static schedule can be modified by skewing the relative starting times of processors so that the resulting schedule has an iteration period less than (T_ST + 1); the resulting iteration period lies within one time unit of its lower bound for the specified processor assignment and actor ordering. It is possible to unfold the graph and generate a fully-static schedule with average period exactly T_ST, but the resulting increase in code size is usually not worth the benefit of (at most) one time unit of reduction in the iteration period. Recall that a "time unit" is essentially the period of the base clock; therefore, one time unit can usually be neglected.
For example, the static schedule S corresponding to Figure 7.1 has T_FS = 11 > T_ST = 9 units. Using the procedure outlined in the proof of Theorem 7.1, we can skew the starting times of processors in the schedule S to obtain a schedule S' that has a period equal to 9 units (Figure 7.6). Note that the processor assignment and actor ordering in the schedule of Figure 7.6 are identical to those of the schedule in Figure 7.1; the values σ'_t(v) produced by the skewing procedure are indicated in Figure 7.6.
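The skewing procedure in the proof amounts to a single-source shortest-path computation on the constraint graph derived from (7-14). The following Python sketch (ours, not the book's implementation) solves the difference constraints with Bellman-Ford and returns the skewed starting times, or reports infeasibility when the chosen period is below T_ST; the transaction order O* can then be read off by sorting the communication actors by their σ'_t values.

# A minimal sketch (not from the book) of the post-processing step of
# Theorem 7.1: solve sigma'_t(v_j) - sigma'_t(v_i) <= T*delay(v_j,v_i) - t(v_j)
# with Bellman-Ford on the corresponding constraint graph.

def skew_start_times(t, ipc_edges, T):
    """t: actor -> execution time; ipc_edges: list of (v_j, v_i, delay);
    T: target period (e.g., ceil(T_ST)).  Returns sigma'_t, or None if T < T_ST."""
    nodes = list(t)
    # constraint x[v_j] - x[v_i] <= T*delay - t[v_j]  ==>  edge v_i -> v_j
    cons = [(vi, vj, T * d - t[vj]) for (vj, vi, d) in ipc_edges]
    dist = {v: 0 for v in nodes}                 # virtual source at distance 0
    for _ in range(len(nodes)):
        changed = False
        for (u, v, w) in cons:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break
    else:
        return None                              # negative cycle: T < T_ST
    base = min(dist.values())
    return {v: dist[v] - base for v in nodes}    # shift so the earliest start is 0

# Example usage (illustrative):
#   sigma = skew_start_times(t, ipc_edges, T=ceil(T_ST))
#   O_star = sorted(comm_actors, key=lambda a: sigma[a])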
Theorem 7.1 may not seem useful at first sight: why not obtain a fully-static schedule that has a period of T_ST to begin with, thus eliminating the post-processing step suggested in Theorem 7.1? Recall from Chapters 4 and 5 that a
Figure 7.6. Modified schedule S'.
fully-static schedule is usually obtained using heuristic techniques that are either based on blocked non-overlapped scheduling (which use critical path based heuristics) [Sih91] or based on overlapped scheduling techniques that employ list scheduling heuristics [dGH92][Lam88]. None of these techniques guarantees that the generated fully-static schedule will have an iteration period within one unit of the period achieved if the same schedule were run in a self-timed manner. Thus, for a schedule generated using any of these techniques, we might be able to obtain a gain in performance, essentially for free, by performing the post-processing step suggested in Theorem 7.1. What we propose can therefore be added as an efficient post-processing step to existing schedulers. Of course, an exhaustive search procedure like the one proposed in [SI85] will certainly find the schedule S' directly. We set the transaction order O* to be the transaction order suggested by the modified schedule S' (as opposed to the transaction order from S used in Figure 6.1). Thus, for the example of Figure 7.1(a),

O* = (s1, r1, s3, r3, s2, r2, s4, r4, s6, r6, s5, r5).
Imposing the transaction order O* as in Figure 7.6 results in a T_OT of 9 units instead of the 10 units we get if the transaction order of Figure 7.1(b) is used. Under the transaction order specified by S', T_ST ≤ T_OT ≤ ⌈T_ST⌉; thus imposing the order O* ensures that the average period is within one unit of that of the unconstrained self-timed strategy. Again, unfolding may be required to obtain a transaction-ordered schedule that has a period exactly equal to T_ST, but the extra cost of a larger controller (to enforce the transaction ordering) outweighs the small gain of at most one unit of reduction in the iteration period. Thus for all practical purposes O* is the optimal transaction order. The "optimality" is in the sense that the transaction order O* we determine statically is the best possible one, given the timing information available at compile time.
We recall that the execution times we use to determine the actor assignment and ordering in a self-timed schedule are compile time estimates, and we have been stating that static scheduling is advantageous when we have "reasonably good" compile time estimates of the execution times of actors. Also, intuitively we expect an ordered transactions schedule to be more sensitive to changes in execution times than an unconstrained self-timed schedule. In this section we attempt to formalize these notions by exploring the effect of changes in the execution times of actors on the throughput achieved by a static schedule. Compile time estimates of actor execution times may be different from their actual values at run-time due to errors in estimating execution times of actors that otherwise have fixed execution times, and due to actors that display
run-time variations in their execution times, because of conditionals or data-dependent loops within them, for example. The first case is simple to model, and we will show in Section 7.6.1 how the throughput of a given self-timed schedule changes as a function of actor execution times. The second case is inherently difficult; how do we model run-time changes in execution times due to data-dependencies, or due to events such as error-handling, cache misses, and pipeline effects? In Section 7.6.2 below we briefly discuss a very simple model for such run-time variations; we assume actors have random execution times according to some known probability distribution. We conclude that analysis of even such a simple model for the expected value of the throughput is often intractable, and we discuss efficiently computable upper and lower bounds for the expected throughput.
Consider the IPC graph in Figure 7.7, which is the same IPC graph as in Figure 7.4 except that we have used a different execution time for actor H to make the example more illustrative. The number next to each actor represents that actor's execution time. We let the execution time of actor C be t(C) = t_C, and we determine the iteration period as a function of t_C, which we denote T_ST(t_C). The iteration period is given by MCM(G_ipc), the maximum cycle mean. The function T_ST(t_C) is shown in Figure 7.8. When 0 ≤ t_C ≤ 1, the cycle ((A, s6), (s6, r6), (r6, E), (E, A)) is critical, and the MCM is constant at 7, since C is not on this cycle; when 1 ≤ t_C ≤ 9, the cycle

((B, s1), (s1, r1), (r1, E), (E, s4), (s4, r4), (r4, H), (H, s3), (s3, r3), (r3, C), (C, s5), (s5, r5), (r5, B))
is critical, and since this cycle has two delays, the slope of T_ST(t_C) is 0.5 in this region; finally, when 9 ≤ t_C, the cycle ((C, s5), (s5, G), (G, C)) becomes critical, and the slope is now one, because there is only one delay on that cycle. Thus the iteration period is a piecewise linear function of actor execution times. The slope of this function is zero if the actor is not on a critical cycle; otherwise it depends on the number of delays on the critical cycle(s) that the actor lies on. The slope is at most one (when the critical cycle containing the particular actor has a single delay on it). The iteration period is a convex function of actor execution times.

Definition: A function f(x) is said to be convex over an interval (a, b) if for every x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1,

\[ f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2). \]
Geometrically, if we plot a convex function f(x) along x, a line drawn between two points on the curve lies above the curve (but it may overlap sections of the curve). It is easily verified geometrically that T_ST(t_C) is convex: since this function is piecewise linear with a slope that is non-negative and non-decreasing, a line joining two points on it must lie above (but may coincide with) the curve. We can also plot T_ST as a function of the execution times of more than one actor (e.g., T_ST(t_A, t_B, ...)); this function is a convex surface consisting of intersecting planes. Slices of this surface along each variable look like Figure 7.8, which is a slice parallel to the t_C axis, with the other execution times held constant (t_A = 3, t_B = 3, etc.). The modeling described in this section is useful for determining how "sensitive" the iteration period is to fixed changes in the execution times of actors, given a processor assignment and actor ordering. We observe that the iteration period increases linearly (with slope one) at worst, and does not change at all at best, when the execution time of an actor is increased beyond its compile time estimate.
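Assuming the max_cycle_mean function from the earlier sketch, the piecewise-linear convex shape of T_ST(t_C) can be traced numerically by sweeping one actor's execution time; the small graph below is an illustrative assumption, not the graph of Figure 7.7.

# A short sketch (ours) tracing the iteration period as one actor's
# execution time varies; reuses max_cycle_mean from the earlier sketch.

t = {"A": 3, "E": 4, "B": 2, "F": 5, "C": 1}
edges = [("A", "E", 0), ("E", "A", 1),
         ("B", "F", 0), ("F", "B", 1),
         ("A", "B", 0), ("F", "E", 1),
         ("E", "C", 0), ("C", "B", 1)]          # C sits on a two-delay cycle

for tc in range(0, 12):
    t["C"] = tc
    print(tc, round(max_cycle_mean(t, edges), 2))   # slope 0 up to t_C = 3, then 1/2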
Figure 7.7. G_ipc where actor C has execution time t_C, constant over all invocations of C.

The effect of variations in the execution times of actors on the performance of statically scheduled hardware is inherently difficult to quantify, because these variations could occur due to a large number of factors (conditional branches or data-dependent loops within an actor, error handling, user interrupts, etc.), and because these variations could have a variety of different characteristics, from being periodic, to being dependent on the input statistics, to being completely random. As a result, thus far we have had to resort to statements like "for a static scheduling strategy to be viable, actors must not show significant variations in execution times." In this section we point out the issues involved in modeling the effects of variations in execution times of actors.
A very simple model for actors with variable execution times is to assign to each actor an execution time that is a random variable (r.v.) with a discrete probability distribution (p.d.f.); successive invocations of each actor are assumed statistically independent, execution times of different actors are also assumed to be independent, and the statistics of the random execution times are assumed to be time-invariant. Thus, for example, an actor A could have execution time t_1 with probability (w.p.) p and execution time t_2 w.p. (1 - p). The model is essentially that A flips a coin each time it is invoked to decide what its execution time should be for that invocation. Such a model could describe a data-dependent conditional branch, for example, but it is of course too simple to capture many real scenarios.
Figure 7.8. T_ST(t_C) plotted as a function of t_C (the middle segment has slope 1/2).
Dataflow graphs where actors have such random execution times have been studied by Olsder et al. [Ols89][ORV90] in the context of modeling data-driven networks (also called wave-front arrays [KLL87]), where the multiply operations in the array display data-dependent execution times. The authors show that the behavior of such a system can be described by a discrete-time Markov chain. The idea behind this, briefly, is that such a system is described by a state space consisting of a set of state vectors s. Entries in each vector s represent the k-th starting time of each actor, normalized with respect to one (any arbitrarily chosen) actor:
\[ s_k = \bigl[\; 0,\;\; start(v_2, k) - start(v_1, k),\;\; \ldots,\;\; start(v_n, k) - start(v_1, k) \;\bigr]^T. \tag{7-19} \]
The normalization (with respect to actor v_1 in the above case) is done to make the state space finite; the number of distinct values that the vector s (as defined above) can assume is shown to be finite in [ORV90]. The states of the Markov chain correspond to the distinct values of s. The average iteration period, which is defined as

\[ T = \lim_{K \to \infty} \frac{start(v_i, K)}{K}, \tag{7-20} \]
can then be derived from the stationary distribution of the Markov chain. There are several technical issues involved in this definition of the average iteration period; for example, when does the limit in (7-20) exist, and how do we show that the limit is in fact the same for all actors (assuming that the HSDFG is strongly connected)? These questions are fairly non-trivial, because the process {start(v_i, k)} may not even be stationary. These questions are answered rigorously in [BCOQ92], where it is shown that

\[ T = \lim_{K \to \infty} \frac{start(v_i, K)}{K} = E[T] \quad \forall v_i \in V. \tag{7-21} \]
Thus the limit T is in fact a constant almost surely [Pap91]. The problem with such an exact analysis, however, is the very large state space involved. We found that for an IPC graph similar to the one in Figure 7.4, assuming that the execution time of only a single actor is random (taking one of two different values on a weighted coin flip) while the others are deterministic, we could get several thousand states for the Markov chain. A graph with more vertices leads to an even larger state space; the size of the state space can be exponential in the number of vertices (exponential in |V|). Solving for the stationary distribution of such chains would require solving a set of linear equations equal in number to the number of states, which is highly compute intensive. Thus we conclude that this approach has limited use in determining the effects of varying execution times; even for unrealistically simple stochastic models, the computation of exact solutions is prohibitive.

If we assume that all actors have exponentially distributed execution times, then the system can be analyzed using continuous-time Markov chains. This is done by exploiting the memoryless property of the exponential distribution: the state of the system at any moment does not depend on how long an actor has spent executing its function; the state changes only when that actor completes execution. The number of states for such a system is equal to the number of different valid token configurations on the edges of the dataflow graph, where by "valid" we mean any token configuration that can be reached by a sequence of firings of enabled actors in the HSDFG. This is also equal to the number of valid retimings [LS91] that exist for the graph; this number, unfortunately, can again be exponential in the size of the graph. Analysis of graphs with exponentially distributed execution times has been extensively studied in the area of stochastic Petri nets (the Petri net literature provides a large and comprehensive list of references, a number of which focus on stochastic Petri nets). There is a considerable body of work that attempts to cope with the state explosion problem: some of these works attempt to divide a given Petri net into parts that can be solved separately (e.g., [VvV93]), some others propose simplified solutions when the graphs have particular structures (e.g., [CS93]), and others propose approximations for values such as the expected firing rate of transitions. None of these methods is general enough to handle even a significant class of graphs. Again, exponentially distributed execution times for all actors is clearly too crude an approximation of any realistic scenario to make the computations involved in exact calculations worthwhile.

As an alternative to determining the exact value of E[T], we discuss how to determine efficiently computable bounds for it. For an HSDFG G = (V, E) that has actors with random execution times, define Ḡ = (V, E) to be an equivalent graph with actor execution times equal to the expected values of the execution times in G.

Fact 7.1: [Dur91] (Jensen's inequality) If f(x) is a convex function of x, then
$E[f(x)] \geq f(E[x])$.
In [RS94] the authors use Fact 7.1 to show that E[T] >= MCM(G_avg). This follows from the fact that MCM(G) is a convex function of the execution times of each of its actors. This result is especially interesting because of its generality; it is true no matter what the statistics of the actor execution times are (even the various independence assumptions we made can be relaxed!). One might wonder what
the relationship between E[T] and E[MCM(G)] is. We can again use Fact 7.1, along with the fact that the maximum cycle mean is a convex function of actor execution times, to show the following:

$E[MCM(G)] \geq MCM(G_{avg})$.   (7-22)
However, we cannot say anything in general about E[T] in relation to E[MCM(G)]: there are IPC graphs where E[T] > E[MCM(G)], and others where E[T] < E[MCM(G)].
If the execution times of actors are all bounded ($t_{min}(v) \leq t(v) \leq t_{max}(v)$ for all $v \in V$; e.g., if all actors have execution times uniformly distributed in some interval [a, b]), then we can say the following:

$MCM(G_{min}) \leq MCM(G_{avg}) \leq E[T] \leq MCM(G_{max})$   (7-23)
where G_max = (V, E) is the same as G except the random actor execution times are replaced by their upper bounds (t_max(v)), and similarly G_min = (V, E) is the same as G except the random actor execution times are replaced by their lower bounds (t_min(v)). Equation (7-23) summarizes the useful bounds we know for the expected value of the iteration period for graphs that contain actors with random execution times. It should be noted that good upper bounds on E[T] are not known. Rajsbaum and Sidi propose upper bounds for exponentially distributed execution times [RS94]; these upper bounds are typically more than twice the exact value of E[T], and hence not very useful in practice. We attempted to simplify the Markov chain model (i.e., reduce the number of states) for the self-timed execution of a stochastic HSDFG by representing such an execution by a set of self-timed schedules of deterministic HSDFGs, between which the system makes transitions randomly. This representation reduces the number of states of the Markov chain to the number of different deterministic graphs that arise from the stochastic HSDFG. We were able to use this idea to determine an upper bound for E[T]; however, this bound also proved to be too loose in general (hence we omit the details of this construction here).
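As a concrete illustration of these bounds, the sketch below estimates E[T] by Monte Carlo simulation of the self-timed execution recursion, in which each firing starts as soon as all of its predecessor firings (offset by the edge delays) have completed. The graph, the execution-time distribution, and all function names here are our own illustrative assumptions, not taken from the text; the resulting estimate can be compared against the Jensen lower bound MCM(G_avg) of (7-23).

```python
import random

# Hypothetical strongly connected IPC graph: edges are (src, snk, delay).
edges = [("A", "B", 0), ("B", "C", 0), ("C", "A", 1), ("C", "B", 1)]

def sample_time(v):
    # Only t_C is random: it takes one of two values on a weighted coin flip.
    if v == "C":
        return 2 if random.random() < 0.3 else 4
    return {"A": 2, "B": 3}[v]

def estimate_iteration_period(edges, sample_time, iters=10000):
    actors = {v for e in edges for v in e[:2]}
    in_edges = {v: [e for e in edges if e[1] == v] for v in actors}
    end = {}                       # end[(v, k)] = completion time of k-th firing
    last_finish = 0.0
    for k in range(1, iters + 1):
        pending = set(actors)
        while pending:             # fire actors whose dependencies are resolved
            for v in sorted(pending):
                deps = [(s, k - d) for (s, _, d) in in_edges[v]]
                if all(kk <= 0 or (s, kk) in end for (s, kk) in deps):
                    start = max([end[s, kk] for (s, kk) in deps if kk > 0] + [0.0])
                    end[v, k] = start + sample_time(v)
                    last_finish = max(last_finish, end[v, k])
                    pending.remove(v)
    return last_finish / iters     # Monte Carlo estimate of E[T]

print("estimated E[T]:", estimate_iteration_period(edges, sample_time))
```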
Intuitively, an ordered transactions schedule is more sensitive to variations
in execution times; even though in a functional sense the computations performed using the ordered transactions schedule are robust with respect to execution time variations (the transaction order ensures correct sender-receiver synchronization), the ordering restriction makes the iteration period more dependent on execution time variations than the ideal ST schedule. This is apparent from our IPC graph model; the transaction ordering constraints add additional edges (E_OT) to G_ipc. For example, an IPC graph with transaction ordering constraints represented as dashed arrows is shown in Figure 7.9 (we use the transaction order O* = (s1, r1, s3, r3, s2, r2, s4, r4, s6, r6, s5, r5) determined in Section 7.5; communication times are not included). The curve for T_OT(t_c) is now different, and is plotted in Figure 7.8. Note that the T_OT(t_c) curve for the ordered transactions schedule (solid) is "above" the corresponding curve for the unconstrained ST schedule (dashed): this shows precisely what we mean by an ordered transactions schedule being more sensitive to variations in execution times of actors. The "optimal" transaction order O* we determined ensures that the transaction constraints do not sacrifice throughput (it ensures T_OT = T_ST) when actor execution times are equal to their compile-time estimates; O* was calculated using t_c = 3 in Section 7.5, and sure enough, T_OT(t_c) = T_ST(t_c) when t_c = 3.
Figure 7.9. IPC graph G_ipc with transaction ordering constraints represented as dashed arrows.
Modeling using random variables for the ordered transactions schedule can again be done as before, and since we have more constraints in this schedule, the expected iteration period will in some cases be larger than that for a self-timed schedule.
In this chapter we presented a quantitative analysis of self-timed and ordered transactions schedules, and showed how to determine the effects of imposing a transaction order on a self-timed schedule. If the actual execution times do not deviate significantly from the estimated values, the difference in performance of the self-timed and ordered transactions strategies is minimal. If the execution times do in fact vary significantly, then even a self-timed strategy is not practical; it then becomes necessary to use a more dynamic strategy such as static assignment or fully dynamic scheduling [LH89] to make the best use of computing resources. Under the assumption that the variations in execution times are small enough that a self-timed or an ordered transactions strategy is viable, it may be wiser to use the ordered transactions strategy rather than the self-timed strategy because of the more efficient IPC of the ordered transactions strategy. This is because a transaction order O* can be efficiently determined such that the ordering constraints do not sacrifice performance; if the execution times of actors are
Figure 7.10. T_ST(t_c) and T_OT(t_c) for the example of Figure 7.9.
close to their estimates, the ordered transactions schedule with O* as the transaction order has an iteration period close to the minimum achievable period T_ST. Thus we make the best possible use of compile-time information when we determine the transaction order O*. The complexities involved in modeling run-time variations in execution times of actors were also discussed; even highly simplified stochastic models are difficult to analyze precisely. We pointed out bounds that have been proposed in the Petri net literature for the value of the expected iteration period, and concluded that although a lower bound is available for this quantity for rather general stochastic models (using Jensen's inequality), tight upper bounds are not known to date, except for the trivial upper bound using maximum execution times of actors.
The techniques of the previous chapters apply compile-time analysis to static schedules for application graphs that have no decision-making at the dataflow graph (inter-task) level. This chapter considers graphs with data-dependent control flow. Recall that atomic actors in an SDF graph are allowed to perform data-dependent decision-making within their body, as long as their input/output behavior respects SDF semantics. We show how some of the ideas we explored previously can still be applied to dataflow graphs containing actors that display data-dependent firing patterns, and therefore are not SDF actors.
The Boolean dataflow (BDF) model was proposed by Lee [Lee91] and Buck [Buc93] for extending the SDF model to allow data-dependent control actors in the dataflow graph. BDF actors are allowed to contain a control input, and the number of tokens consumed and produced on the arcs of a BDF actor can be a two-valued function of a token consumed at the control input. Actors that follow SDF semantics, i.e., that consume and produce fixed numbers of tokens on their arcs, are clearly a subset of the set of allowed BDF actors (SDF actors simply do not have any control inputs). Two basic dynamic actors in the BDF model are the SWITCH and SELECT actors shown in Figure 8.1. The SWITCH actor consumes one Boolean-valued control token and another input token; if the control token is TRUE, the input token is copied to the output labelled T, otherwise it is copied to the output labelled F. The SELECT actor performs the complementary operation; it reads an input token from its T input if the control token is TRUE, otherwise it reads from its F input; in either case, it copies the token to its output. Constructs such as conditionals and data-dependent iterations can easily be represented in a BDF graph, as illustrated in
Figure 8.2. The vertices A, B, C, etc., in Figure 8.2 need not be atomic actors; they could also be arbitrary SDF subgraphs. A BDF graph allows SWITCH and SELECT actors to be connected in arbitrary topologies. Buck [Buc93] in fact shows that any Turing machine can be expressed as a BDF graph, and therefore the problems of determining whether such a graph deadlocks and whether it uses bounded memory are undecidable. Buck proposes heuristic solutions to these problems based on extensions of the techniques for SDF graphs to the BDF model.
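The firing rules of SWITCH and SELECT can be summarized by the following sketch; this is a simple functional rendering for illustration only, and the function names are ours rather than part of the BDF model.

```python
def switch_fire(control, token):
    """SWITCH: consume one control token and one data token; copy the data
    token to the T output if the control token is TRUE, else to the F output."""
    return {"T": [token], "F": []} if control else {"T": [], "F": [token]}

def select_fire(control, t_queue, f_queue):
    """SELECT: the control token chooses which input queue to read from; the
    token read is copied to the single output."""
    return t_queue.pop(0) if control else f_queue.pop(0)

# If-then-else in the style of Figure 8.2(a): route a token through SWITCH,
# apply one of two branches, and recombine the result with SELECT.
outs = switch_fire(True, 42)
result = select_fire(True, [x + 1 for x in outs["T"]], [x - 1 for x in outs["F"]])
print(result)
```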
Buck presents techniques for statically scheduling BDF graphs on a single processor; his methods attempt to generate a sequential program without a dynamic scheduling mechanism, using control constructs where required. Because of the inherent undecidability of determining deadlock behavior and bounded memory usage, these techniques are not always guaranteed to generate a static schedule, even if one exists; a dynamically scheduled implementation, where a run-time kernel decides which actors to fire, can be used when a static schedule cannot be found in a reasonable amount of time. Automatic parallel scheduling of general BDF graphs is still an unsolved problem. A naive mechanism for scheduling graphs that contain SWITCH and SELECT actors is to generate an Acyclic Precedence Extension Graph (APEG), similar to the APEG generated for SDF graphs discussed in Section 3.8, for every possible assignment of the Boolean-valued control tokens in the BDF graph. For example, the if-then-else graph in Figure 8.2(a) could have two different APEGs, shown in Figure 8.3, and the APEGs thus obtained can be scheduled individually using a self-timed strategy; each processor now gets several lists of actors, one
Figure 8.1. BDF actors SWITCH and SELECT.
list for each possible assignment of the control tokens. The problem with this approach is that for a graph with n different control tokens, there are 2^n possible distinct APEGs, each corresponding to an execution path in the graph. Such a set of APEGs can be compactly represented using the so-called Annotated Acyclic Precedence Graph (AAPG) of [Buc93], in which actors and arcs are annotated with conditions under which they exist in the graph. Buck uses the AAPG construct to determine whether a bounded-length uniprocessor schedule exists. In the case of multiprocessor scheduling, it is not clear how such an AAPG could be used to explore scheduling options for the different values that the control tokens could take, without explicitly enumerating all possible execution paths.
Figure 8.2. (a) Conditional (if-then-else) dataflow graph. The branch outcome is determined at run-time by actor B. (b) Graph representing data-dependent iteration. The termination condition for the loop is determined by actor D.
A useful body of work in parallel scheduling of dataflow graphs that have dynamic actors is the quasi-static scheduling approach, discussed in Section 4.6. In this work, techniques are developed that statically schedule standard dynamic constructs such as data-dependent conditionals, data-dependent iterations, and recursion. Such a quasi-static scheduling approach clearly does not handle a general BDF graph, although it is a good starting point for doing so.
We will consider only the conditional and the iteration constructs here. We assume that we are given a quasi-static schedule, obtained either manually or using the techniques [HL97] that were described briefly in Section 4.6. We explore how the techniques proposed in the previous chapters for multiprocessors that utilize a self-timed scheduling strategy apply when we implement a quasi-static schedule on a multiprocessor. First, we propose an implementation of a quasi-static schedule on a shared-memory multiprocessor, and then we show how we can implement the same program on the OMA architecture, using the hard-
Figure 8.3. Acyclic precedence extension graphs (APEGs) corresponding to the if-then-else graph of Figure 8.2. (a) corresponds to the TRUE assignment of the control token, (b) to the FALSE assignment.
ware support provided in the OMA architecture prototype.
A quasi-static schedule ensures, by means of the execution profile, that the pattern of processor availability is identical regardless of how the data-dependent construct executes at run-time; in the case of the conditional construct this means that irrespective of which branch is actually taken, the pattern of processor availability after the construct completes execution is the same. This has to be ensured by inserting idle time on processors when necessary. Figure 8.4 shows a quasi-static schedule for a conditional construct. Maintaining the same pattern of processor availability allows static scheduling to proceed after the execution of the conditional; the data-dependent nature of the control construct can be ignored at that point. In Figure 8.4, for example, the scheduling of subgraph-1 can proceed independent of the conditional construct because the pattern of processor availability after this construct is the same independent of the branch outcome; note that "nops" (idle processor cycles) have been inserted to ensure this.
Implementing a quasi-static schedule directly on a multiprocessor, however, implies enforcing global synchronization after each dynamic construct in order to ensure a particular pattern of processor availability. We therefore use a mechanism similar to the self-timed strategy; we first determine a quasi-static schedule using the methods of Lee and Ha, and then discard the timing information and the restriction of maintaining a processor availability profile. Instead, we only retain the assignment of actors to processors, the order in which they execute, and also under what conditions on the Boolean tokens in the system each actor should execute. Synchronization between processors is done at run-time whenever processors communicate. This scheme is analogous to constructing a self-timed schedule from a fully-static schedule, as discussed in Section 4.3. Thus the quasi-static schedule of Figure 8.4 can be implemented by the set of programs in Figure 8.5 for the three processors. Here, {rc1, rc2, r1, r2} are the receive actors, and {sc1, s1, s2} are the send actors. The subscript "c" refers to actors that communicate control tokens. The main difference between such an implementation and the self-timed implementation we discussed in earlier chapters is the control tokens. When a conditional construct is partitioned across more than one processor, the control token(s) that determine its behavior must be broadcast to all the processors that execute that construct. Thus, in Figure 8.4, the value c, which is computed by Processor 2 (since the actor that produces c is assigned to Processor 2), must be broadcast to the other two processors. In a shared-memory machine this broadcast can be implemented by allowing the processor that evaluates the control
token (Processor 2 in our example) to write its value to a particular shared-memory location preassigned at compile time; the processor will then update this location once for each iteration of the graph. Processors that require the value of a particular control token simply read that value from shared memory, and the processor that writes the value of the control token needs to do so only once. In
Figure 8.4. Quasi-static schedule for a conditional construct on three processors.
this way, actor executions can be conditioned upon the values of control tokens evaluated at run-time. In the previous chapters, we discussed synchronization associated with data transfer between processors. Synchronization checks must also be performed for the control tokens; the processor that writes the value of a token must not overwrite the shared-memory location unless all processors requiring the value of that token have in fact read the shared-memory location, and processors reading a control token must ascertain that the value they read corresponds to the current iteration rather than a previous iteration. The need for broadcast of control tokens creates additional communication overhead that should ideally be taken into account during scheduling. The methods of Lee and Ha, and also prior research related to quasi-static scheduling that they refer to in their work, do not take this cost into account. Static multiprocessor scheduling applied to graphs with dynamic constructs, taking the costs of distributing control tokens into account, is thus an interesting problem for further study.
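A minimal sketch of such a broadcast slot is shown below; the data structure and function names are our own illustrative assumptions, not the book's implementation. The token value is tagged with the graph iteration number so that a reader never consumes a stale value from a previous iteration.

```python
# Hypothetical shared-memory slot for broadcasting a control token.
shared_slot = {"value": None, "iteration": 0}

def write_control_token(value, k):
    # One shared-memory write per graph iteration by the producing processor.
    shared_slot["value"] = value
    shared_slot["iteration"] = k

def try_read_control_token(k):
    # A consuming processor accepts the value only if it belongs to the
    # iteration it is currently executing; otherwise it retries later.
    if shared_slot["iteration"] == k:
        return shared_slot["value"]
    return None

write_control_token(True, 1)
assert try_read_control_token(1) is True
assert try_read_control_token(2) is None   # iteration 2's value not yet written
```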
Recall that the OMA architecture imposes an order in which shared memory is accessed by processors in the machine. This is done to implement the OT
Proc 1:  A; receive c (rc1); if (c) { E; receive (r1); F } else { I; receive (r2); ... }

Proc 2:  B; send c (sc1); C; if (c) { send (s1); G } else { K }; <code for subgraph-1>

Proc 3:  D; receive c (rc2); if (c) { H } else { L }; send (s2); <code for subgraph-1>
Figure 8.5. Programs on three processors for the quasi-static schedule of Figure 8.4.
strategy, and is feasible because the pattern of processor communications in a self-timed schedule of an HSDFG is in fact predictable. What happens when we want to run a program derived from a quasi-static schedule, such as the parallel program in Figure 8.5, which was derived from the schedule in Figure 8.4? Clearly, the order of processor accesses to shared memory is no longer predictable; it depends on the outcome of the run-time evaluation of the control token c. The quasi-static schedule of Figure 8.4 specifies the schedules for the TRUE and FALSE branches of the conditional. If the value of c were always TRUE, then we can determine from the quasi-static schedule that the transaction order would be (sc1, rc1, rc2, s1, r1, <access order for subgraph-1>), and if the value of c were always FALSE, the transaction order would be
(sc1, rc1, rc2, s2, r2, <access order for subgraph-1>).   (8-1)

Note that writing the control token c once to shared memory is enough, since the same shared location can be read by all processors requiring the value of c.
In the OMA architecture, a possible strategy is to switch between these two access orders at run-time. This is enabled by the preset feature of the transaction controller (Section 6.6.2). Recall that the transaction controller is implemented as a presettable schedule counter that addresses a memory containing the processor IDs corresponding to the bus access order. To handle conditional
constructs, we derive two bus access lists corresponding to each path in the program, and the processor that determines the branch condition (Processor 2 in our example) forces the controller to switch between access lists by loading the schedule counter with the appropriate value (address "7" in the bus access schedule of Figure 8.7). Note from Figure 8.7 that there are two points where the schedule counter can be set; one is at the completion of the TRUE branch, and the other is a jump into the FALSE branch. The branch into the FALSE path is best taken care of by Processor 2, since it computes the value of the control token c, whereas the branch after the TRUE path (which bypasses the access list of the FALSE branch) is best taken care of by Processor 1, since Processor 1 already possesses the bus at the time when the counter needs to be loaded. The schedule counter load operations are easily incorporated into the sequential programs of Processors 1 and 2. The mechanism of switching between bus access orders works well when the number of control tokens is small. But if the number of such tokens is large,
Figure 8.6. Bus access list stored in the schedule RAM; the loading operation of the schedule counter, conditioned on the value of c, is also shown.
then this mechanism breaks down, even if we can efficiently compute a quasi-static schedule for the graph. To see why this is so, consider the graph in Figure 8.8, which contains k conditional constructs in parallel paths going from the input to the output. The functions "f_i" and "g_i" are assumed to be subgraphs that are assigned to more than one processor. In Ha's hierarchical scheduling approach, each conditional is scheduled independently; once scheduled, it is converted into an atomic node in the hierarchy, and a profile is assigned to it. Scheduling of the other conditional constructs can then proceed based on these profiles. Thus, the scheduling complexity in terms of the number of parallel paths is O(k) if there are k parallel paths. If we implement the resulting quasi-static schedule in the manner stated in the previous section, and employ the OMA mechanism above, we would need one bus access list for every combination of the Booleans b1, ..., bk. This is because each f_i and g_i will have its own associated bus access list, which then has to be combined with the bus access lists of all the other branches to yield one list. For example, if all Booleans b_i are TRUE, then all the f_i's are
Figure 8.8. Conditional constructs in parallel paths.
executed, and we get one access list. If b1 is TRUE, and b2 through bk are FALSE, then f1 is executed, and g2 through gk are executed. This corresponds to another bus access list. This implies 2^k bus access lists, one for each combination of the f_i and g_i that execute, i.e., one for each possible execution path in the graph.
Although the idea of maintaining separate bus access lists is a simple mechanism for handling control constructs, it can sometimes be impractical, as in the example above. We propose an alternative mechanism based on masking that handles parallel conditional constructs more effectively. The main idea behind masking is to store the ID of a Boolean control token along with the processor ID in the bus access list. The Boolean ID determines whether a particular bus grant is "enabled." This allows us to combine the access lists of all the nodes f1 through fk and g1 through gk. The bus grant corresponding to each f_i is tagged with the Boolean ID of the corresponding b_i, and an additional bit indicates that the bus grant is to be enabled when b_i is TRUE. Similarly, each bus grant corresponding to the access list of g_i is tagged with the ID of b_i, and an additional bit indicates that the bus grant must be enabled only if the corresponding control token has a FALSE value. At run-time, the controller steps through the bus access list as before, but instead of simply granting the bus to the processor at the head of the list, it first checks that the control token corresponding to the Boolean ID field of the list entry is in its correct state. If it is in the correct state (TRUE for a bus grant corresponding to an f_i, and FALSE for a bus grant corresponding to a g_i), then the bus grant is performed, otherwise it is masked. Thus the run-time values of the Booleans must be made available to the transaction controller for it to decide whether to mask a particular bus grant or not. In general, a bus grant may need to be enabled by a product of the Booleans in the dataflow graph, arising from nested conditionals and from conditionals in parallel branches of the graph. Thus, in general we need to implement an annotated bus access list of the form {(c1)ProcID_1, (c2)ProcID_2, ...}; each bus access is annotated with a Boolean-valued condition c_i, indicating that the bus should be granted to the processor corresponding to ProcID_i when c_i evaluates to
TRUE; c_i could be an arbitrary product function of the Booleans {b1, b2, ..., bn} in the system and their complements (e.g., $c_j = b_2\bar{b}_k$, where the bar over a variable indicates its complement).
This scheme is implemented as shown in Figure 8.9. The schedule memory now contains two fields corresponding to each bus access, <Cond><ProcID>, instead of the <ProcID> field alone that we had before. The <Cond> field encodes a unique product c_i associated with that particular bus access. In the OMA prototype, we can use 3 bits for <ProcID> and 5 bits for <Cond>. This would allow us to handle 8 processors and 32 product terms over the Booleans. There can be up to m = 3^n product terms in the worst case corresponding to n Booleans in the system, because for each Boolean b_i, a product term could contain b_i, or it could contain its complement $\bar{b}_i$, or else b_i could be a "don't care." It is unlikely that all 3^n possible product terms
Figure 8.9. Masked bus access mechanism conditioned on control tokens that are evaluated at run-time; the flags b1 through bn are memory-mapped to the shared bus, and a signal indicates whether the current bus grant is to be masked.
will be required in practice; we therefore expect such a scheme to be practical. The necessary product terms (c_i) can be implemented within the controller at compile time, based on the bus access pattern of the particular dynamic dataflow graph to be executed. In Figure 8.9, the flags b1, b2, ..., bn are 1-bit memory elements (flip-flops) that are memory-mapped to the shared bus, and store the values of the Boolean control tokens in the system. The processor that computes the value of each control token updates the corresponding b_i by writing to the shared memory location that maps to b_i. The product combinations c1, c2, ..., cm are simply functions of the b_i's and their complements; for example, c_j could be $b_1\bar{b}_2$. As the schedule counter steps through the bus access list, the bus is granted only if the condition corresponding to that access evaluates to TRUE. Thus, if an entry $(b_1\bar{b}_2)$Proc1 appears at the head of the bus access list, then Processor 1 receives a bus grant only if the control token b1 is TRUE and b2 is FALSE; otherwise the bus grant is masked and the schedule counter moves on to the next entry in the list. This scheme can be incorporated into the transaction controller in our OMA architecture prototype, since the controller is implemented in programmable logic; the product terms c1, c2, ..., cm may be programmed into it at compile time. When we generate the programs for the processors, we also generate the annotated bus access list and a hardware description (for the programmable logic, say) that implements the required product terms.
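The run-time decision made by the masked controller can be sketched as follows; the list encoding and function names are our own, and a real controller would of course realize the product terms in hardware as described above.

```python
def masked_bus_grants(access_list, booleans):
    """access_list: sequence of (condition, proc_id) entries, where condition
    maps each relevant Boolean token to the value it must have (a product
    term); booleans: current run-time values of the control tokens.
    Returns the processor IDs that actually receive bus grants."""
    grants = []
    for condition, proc_id in access_list:
        if all(booleans[b] == required for b, required in condition.items()):
            grants.append(proc_id)          # condition TRUE: grant the bus
        # otherwise the grant is masked and the schedule counter moves on
    return grants

# Entry 1 is enabled only when b1 is TRUE and b2 is FALSE; entry 2 always.
access_list = [({"b1": True, "b2": False}, 1), ({}, 2), ({"b1": False}, 3)]
print(masked_bus_grants(access_list, {"b1": True, "b2": False}))   # -> [1, 2]
```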
A straightforward, even if inefficient, mechanism for obtaining such a list is to use enumeration; we simply enumerate all possible combinations of Booleans in the system (2^n combinations for n Booleans), and determine the bus access sequence (a sequence of ProcIDs) for each combination. Each combination corresponds to an execution path in the graph, and we can estimate the time of occurrence of the bus accesses corresponding to each combination from the quasi-static schedule. For example, the bus accesses corresponding to the two execution paths in the quasi-static schedule of Figure 8.4 can be arranged along the time axis as shown in Figure 8.10 (we have ignored the accesses corresponding to subgraph-1 to keep the illustration simple). The bus access schedules
for each of the combinations can now be col-
lapsed into one annotated list, as in Figure 8.10; the fact that accesses for each combination are ordered with respect to time allows us to enforce a global order on the accesses in the collapsed bus access list, and the accesses are annotated with their respective conditions. The collapsed list obtained above can be used, as is, in the masked-controller scheme; however, there is potential for optimizing this list. Note that the same transaction may appear in the access lists corresponding to different Boolean combinations, because a particular Boolean may not influence that bus access. For example, the first few transactions are present in both execution paths, because they are independent of the value of c. In the worst case, a bus access that is independent of all the Booleans will end up appearing in the bus access lists of all of the combinations. If these bus accesses appear contiguously in the collapsed bus access sequence, we can combine them into one. For example, "(c)Proc2, ($\bar{c}$)Proc2" in the annotated schedule of Figure 8.10 can be combined into a single "Proc2" entry, which is not conditioned on any control token. Consider another example: if we get contiguous entries
Figure 8.10. Bus access lists for c = TRUE and c = FALSE, and the annotated list corresponding to them.
"$(b_1\bar{b}_2)$Proc3" and "$(b_1 b_2)$Proc3" in the collapsed list, we can replace the two entries with a single entry "$(b_1)$Proc3".
More generally, if the collapsed list contains a contiguous segment of the form

$\{\ldots, (c_1)ProcID_k, (c_2)ProcID_k, \ldots, (c_m)ProcID_k, \ldots\}$,

then each such contiguous segment can be written as

$\{\ldots, (c_1 + c_2 + \ldots + c_m)ProcID_k, \ldots\}$,

where the bus grant condition is an expression $(c_1 + c_2 + \ldots + c_m)$, which is a sum-of-products (SOP) function of the Booleans in the system. Two-level logic minimization can then be applied to determine a minimal representation of each of these expressions. Such two-level minimization can be done using a logic minimization tool such as ESPRESSO [BHMSV84], which simplifies a given expression into an equivalent representation with a minimal number of product terms. Suppose the expression $(c_1 + c_2 + \ldots + c_m)$ can be minimized into another SOP expression $(c_1' + c_2' + \ldots + c_p')$, where $p \leq m$. The segment

$\{\ldots, (c_1)ProcID_k, (c_2)ProcID_k, \ldots, (c_m)ProcID_k, \ldots\}$

can then be replaced with an equivalent segment of the form

$\{\ldots, (c_1')ProcID_k, \ldots, (c_p')ProcID_k, \ldots\}$.
This procedure results in a minimal set of contiguous appearances of a bus grant to the same processor. Another optimization that can be performed is to combine annotated bus access lists with the switching mechanism of Section 8.2.1. Suppose we have the following annotated bus access list:
$\{\ldots, (b_1\bar{b}_2)ProcID_i, (b_1 b_3)ProcID_j, (b_1 b_4 b_5)ProcID_k, \ldots\}$.

Then, by "factoring" $b_1$ out, the above list may be equivalently written as:

$\{\ldots, (b_1)\{(\bar{b}_2)ProcID_i, (b_3)ProcID_j, (b_4 b_5)ProcID_k\}, \ldots\}$.

Now, all three bus accesses may be skipped whenever the Boolean $b_1$ is
FALSE by loading the schedule counter and forcing it to increment its count by three, instead of evaluating each access separately and skipping over each one individually. This strategy reduces overhead, because it costs an extra bus cycle to skip a bus access when a condition corresponding to that bus access evaluates to FALSE; by skipping over three bus accesses that we know are going to be disabled, we save three idle bus cycles. There is an added cost of one cycle for loading the schedule counter; the total savings in this example is therefore two bus cycles. One of the problems with the above approach is that it involves explicit enumeration of all possible combinations of Booleans, the complexity of which
limits the size of problems that can be tackled with this approach. An implicit mechanism for representing all possible execution paths is therefore desirable. One such mechanism is the use of Binary Decision Diagrams (BDDs), which have been used to efficiently represent and manipulate Boolean functions for the purpose of logic minimization [Bry86]. BDDs have been used to compactly represent large state spaces, and to perform operations implicitly over such state spaces when methods based on explicit techniques are infeasible. One difficulty encountered in applying BDDs to the problem of representing execution paths is that it is not obvious how precedence and ordering constraints can be encoded in such a representation. The execution paths corresponding to the various Boolean combinations can be represented using a BDD, but it isn't clear how to represent the access orders corresponding to the different execution paths.
Consider the data-dependent iteration construct shown in Figure 8.2(b). A quasi-static schedule for such a construct may look like the one in Figure 8.11. The vertices A, B, C, and D of Figure 8.2(b) are assumed to be subgraphs rather than atomic actors. Such a quasi-static schedule can also be implemented in a straightforward manner on the OMA architecture, provided that the data-dependent construct spans all the processors in the system. The bus access schedule corresponding to the iterated subgraph is simply repeated until the iteration construct terminates. The processor responsible for determining when the iteration terminates can be made to force the schedule counter to loop back until the termination condition is reached. This is shown in Figure 8.12.
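A sketch of this loop-back behaviour is given below; the function and parameter names are ours, for illustration only. The body's access list is replayed until the processor that evaluates the termination condition stops reloading the schedule counter.

```python
def run_data_dependent_iteration(body_access_list, tail_access_list,
                                 continue_iterating, grant):
    """Replay the iterated subgraph's bus accesses until the termination
    condition is reached, then fall through to the rest of the access list."""
    while True:
        for proc_id in body_access_list:
            grant(proc_id)                 # accesses of one loop iteration
        if not continue_iterating():
            break                          # schedule counter not reloaded
        # otherwise the schedule counter is reset to the top of the body
    for proc_id in tail_access_list:
        grant(proc_id)

# Example: the loop body runs three times, then the accesses after the loop.
decisions = iter([True, True, False])
run_data_dependent_iteration([2, 3], [1], lambda: next(decisions), print)
```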
Figure 8.11. Quasi-static schedule for the data-dependent iteration graph of Figure 8.2(b).
This chapter has dealt with extensions of the ordered-transactions approach to graphs with data-dependent control flow. The Boolean dataflow model was briefly reviewed, along with the quasi-static approach to scheduling conditional and data-dependent iteration constructs. A scheme was then described whereby the ordered transactions approach can be used when such control constructs are included in the dataflow graph. In this scheme, bus access schedules are computed for each set of values that the control tokens in the graph evaluate to, and the bus access controller is made to select between these lists at run-time based on which set of values the control tokens actually take at any given time. This was also shown to be applicable to data-dependent iteration constructs. Such a scheme is feasible when the number of execution paths in the graph is small. A mechanism based on masking of bus accesses depending on run-time values of control tokens may be used for handling the case when there are multiple conditional constructs in "parallel."
Figure 8.12. A possible access order list corresponding to the quasi-static schedule of Figure 8.11. The processor that determines the termination condition of the iteration can also reinitialize the schedule counter.
The previous three chapters have been concerned with the ordered transactions strategy, which is a hardware approach to reducing IPC and synchronization costs in self-timed schedules. In this chapter and the following chapters, we discuss software-based strategies for minimizing synchronization costs in the final implementation of a given self-timed schedule. These software-based techniques are widely applicable to shared-memory multiprocessors that consist of homogeneous or heterogeneous collections of processors, and they do not require the availability of hardware support for employing the OT approach or any other form of specialized hardware support.

Recall that the self-timed scheduling strategy introduces synchronization checks whenever processors communicate. A straightforward implementation of a self-timed schedule would require that, for each inter-processor communication, the sending processor ascertain that the buffer it is writing to is not full, and the receiver ascertain that the buffer it is reading from is not empty. Processors block (suspend execution) when the appropriate condition is not met. Such sender-receiver synchronization can be implemented in many ways, depending on the particular hardware platform under consideration: in shared-memory machines, such synchronization involves testing and setting semaphores in shared memory; in machines that support synchronization in hardware (such as barriers), special synchronization instructions are used; and in the case of systems that consist of a mix of programmable processors and custom hardware elements, synchronization is achieved by employing interfaces that support blocking reads and writes.

In each type of platform, each IPC that requires a synchronization check costs performance, and sometimes extra hardware complexity. Semaphore checks cost execution time on the processors, synchronization instructions that
make use of special synchronization hardware such as barriers also cost execution time, and blocking interfaces between a programmable processor and custom hardware in a combined hardware/software implementation require more hardware than non-blocking interfaces [H+93]. In this chapter, we present algorithms and techniques that reduce the rate at which processors must access shared memory for the purpose of synchronization in multiprocessor implementations of SDF programs. One of the procedures we present, for example, detects when the objective of one synchronization operation is guaranteed as a side effect of other synchronizations in the system, thus enabling us to eliminate such superfluous synchronization operations. The optimization procedures that we propose can be used as a post-processing step to any static scheduling technique (for example, to any one of the techniques presented in Chapter 5) for reducing synchronization costs in the final implementation. As before, we assume that "good" estimates are available for the execution times of actors and that these execution times rarely display large variations, so that self-timed scheduling is viable for the applications under consideration. If additional timing information is available, such as guaranteed upper and lower bounds on the execution times of actors, it is possible to use this information to further optimize synchronizations in the schedule. However, the use of such timing bounds will be left as future work; we mention this again in Chapter 13.
Among the prior work that is most relevant to this chapter is the barrier-MIMD principle of Dietz, Zaafrani, and O'Keefe, which is a combined hardware and software solution to reducing run-time synchronization overhead [DZO92]. In this approach, a shared-memory MIMD computer is augmented with hardware support that allows arbitrary subsets of processors to synchronize precisely with respect to one another by executing a synchronization operation called a barrier. If a subset of processors is involved in a barrier operation, then each processor in this subset will wait at the barrier until all other processors in the subset have reached the barrier. After all processors in the subset have reached the barrier, the corresponding processes resume execution in exact synchrony.

In [DZO92], the barrier mechanism is applied to minimize synchronization overhead in a self-timed schedule with hard lower and upper bounds on the task execution times. The execution time ranges are used to detect situations where the earliest possible execution time of a task that requires data from another processor is guaranteed to be later than the latest possible time at which the required data is produced. When such an inference cannot be made, a barrier is instantiated between the sending and receiving processors. In addition to performing the required data synchronization, the barrier resets (to zero) the
uncertainty between the relative execution times of the processors that are involved in the barrier, and thus enhances the potential for subsequent timing analysis to eliminate the need for explicit synchronizations.

The techniques of barrier MIMD do not apply to the problem that we address because they assume that a hardware barrier mechanism exists; they assume that tight bounds on task execution times are available; they do not address iterative, self-timed execution, in which the execution of successive iterations of the dataflow graph can overlap; and, even for non-iterative execution, there is no obvious correspondence between an optimal solution that uses barrier synchronizations and an optimal solution that employs decoupled synchronization checks at the sender and receiver ends. This last point is illustrated in Figure 9.1. Here, in the absence of execution time bounds, an optimal application of barrier synchronizations can be obtained by inserting two barriers: one barrier across A1 and A3, and the other barrier across A4 and A5. This is illustrated in Figure 9.1(c). However, the corresponding collection of directed synchronizations (A1 to A3, and A5 to A4) is not sufficient, since it does not guarantee that the data required by A5 from A1 is available before A5 begins execution.
In [Sha89], Shaffer presents an algorithm that minimizes the number of directed synchronizations in the self-timed execution of a dataflow graph. However, this work, like that of Dietz et al., does not allow the execution of successive iterations of the dataflow graph to overlap. It also avoids having to consider dataflow edges that have delay. The technique that we discuss in this chapter for removing redundant synchronizations can be viewed as a generalization of Shaffer's algorithm to handle delays and overlapped, iterative execution, and we will discuss this further in Section 9.7. The other major software-based techniques for synchronization optimization that we discuss in this book, namely handling the feedforward edges of the synchronization graph (to be defined in Section 9.5.2), discussed in Section 9.8, and "resynchronization," discussed in Chapters 10 and 11, are fundamentally different from Shaffer's technique, since they address issues that are specific to the more general context of overlapped, iterative execution.

As discussed in Chapter 4, a multiprocessor executing a self-timed schedule is one where each processor is assigned a sequential list of actors, some of which are send and receive actors, which it executes in an infinite loop. When a processor executes a communication actor, it synchronizes with the processor(s) it communicates with. Thus, exactly when a processor executes each actor depends on when, at run-time, all input data for that actor is available, unlike the
Figure 9.1. (b) A three-processor self-timed schedule for (a). (c) Placement of barriers.
fully-static case where no such run-time check is needed. In this chapter we use "processor" in slightly general terms: a processor could be a programmable component, in which case the actors mapped to it execute as software entities, or it could be a hardware component, in which case actors assigned to it are implemented and execute in hardware. See [KL93] for a discussion on combined hardware/software synthesis from a single dataflow specification. Examples of application-specific multiprocessors that use programmable processors and some form of static scheduling are described in [Koh90] and other references discussed in Chapter 2.

Inter-processor communication between processors is assumed to take place via shared memory. Thus the sender writes to a particular shared-memory location and the receiver reads from that location. The shared memory itself could be global memory between all processors, or it could be distributed between pairs of processors (as hardware FIFO queues or dual-ported memories, for example). Each inter-processor communication edge in an HSDFG translates into a buffer of a certain size in shared memory. Sender-receiver synchronization is also assumed to take place by setting and testing flags in shared memory. Special hardware for synchronization (barriers, semaphores implemented in hardware, etc.) would be prohibitive for the embedded multiprocessor machines for applications such as DSP that we are considering. Interfaces between hardware and software are typically implemented using memory-mapped registers in the address space of the programmable processor (again a kind of shared memory), and synchronization is achieved using flags that can be tested and set by the programmable component; the same can be done by an interface controller on the hardware side [H+93].

Under the model above, the benefits of synchronization optimization become obvious. Each synchronization that is eliminated directly results in one less synchronization check, or, equivalently, one less shared-memory access. For example, where a processor would have to check a flag in shared memory before executing a receive primitive, eliminating that synchronization implies there is no longer need for such a check. This translates to one less shared-memory read. Such a benefit is especially significant for simplifying interfaces between a programmable component and a hardware component: a send or a receive without the need for synchronization implies that the interface can be implemented in a non-blocking fashion, greatly simplifying the interface controller. As a result, eliminating a synchronization directly results in simpler hardware in this case.

Thus, the metric for the optimizations we present in this chapter is the total number of accesses to shared memory that are needed for the purpose of synchronization in the final multiprocessor implementation of the self-timed schedule. This metric will be defined precisely in Section 9.6.
We model synchronization in a self-timed implementation using the IPC graph model introduced in the previous chapter. As before, an IPC graph G_ipc(V, E_ipc) is extracted from a given HSDFG G and multiprocessor schedule; Figure 9.2 shows one such example, which we use throughout this chapter. We will find it useful to partition the edges of the IPC graph in the following manner: E_ipc = E_int ∪ E_comm, where E_comm are the communication edges (shown dashed in Figure 9.2(d)) that are directed from the send to the receive actors in G_ipc, and E_int are the "internal" edges that represent the fact that actors assigned to a particular processor (actors internal to that processor) are executed sequentially according to the order predetermined by the self-timed schedule.

A communication edge e in E_comm of G_ipc represents two functions: 1) reading and writing of data values into the buffer represented by that edge; and 2) synchronization between the sender and the receiver. As mentioned before, we assume the use of shared memory for the purpose of synchronization; the synchronization operation itself must be implemented using some kind of software protocol between the sender and the receiver. We discuss these synchronization protocols shortly.
Recall from Lemma 7.3 that the average iteration period corresponding to a self-timed schedule with an IPC graph G_ipc is given by the maximum cycle mean MCM(G_ipc). If we only have execution time estimates available instead of exact values, and we set the execution times of actors t(v) to be equal to these estimated values, then we obtain the estimated iteration period by computing MCM(G_ipc). Henceforth we will assume that we know the estimated throughput 1/MCM calculated by setting the t(v) values to the available timing estimates.
In all the transformations that we present in the rest of the chapter, we will preserve the estimated throughput by preserving the maximum cycle mean of G_ipc, with each t(v) set to the estimated execution time of v. In the absence of more precise timing information, this is the best we can hope to do.
In dataflow semantics, the edges between actors represent infinite buffers. Accordingly, the edges of the IPC graph are potentially buffers of infinite size. However, from Lemma 7.1, every feedback edge (an edge that belongs to a strongly connected component, and hence to some cycle) can only have a finite number of tokens at any time during the execution of the IPC graph. We will call
Figure 9.2. Self-timed execution.
this constant the self-timed buffer bound of that edge, and we will represent it by B_fb(e); for a feedback edge e, Lemma 7.1 yields the following self-timed buffer bound:

$B_{fb}(e) = \min(\{Delay(C) \mid C \text{ is a cycle that contains } e\})$   (9-1)
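For a feedback edge, the bound in (9-1) can be computed directly from the graph; the small sketch below is our own illustration (not from the text), and uses the observation that the minimum-delay cycle through e consists of e itself plus a minimum-delay path from snk(e) back to src(e).

```python
import heapq

def min_delay_path(edges, source, target):
    """Dijkstra over edge delays (all delays are non-negative)."""
    dist, frontier = {source: 0}, [(0, source)]
    while frontier:
        d, u = heapq.heappop(frontier)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for (a, b, delay) in edges:
            if a == u and d + delay < dist.get(b, float("inf")):
                dist[b] = d + delay
                heapq.heappush(frontier, (dist[b], b))
    return float("inf")

def self_timed_buffer_bound(edges, e):
    # (9-1): minimum total delay over all cycles that contain e.
    src, snk, delay = e
    return delay + min_delay_path(edges, snk, src)

edges = [("A", "B", 0), ("B", "C", 0), ("C", "A", 2), ("C", "B", 1)]
print(self_timed_buffer_bound(edges, ("B", "C", 0)))
```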
Feedforward edges (edges that do not belong to any SCC) have no such bound on buffer size; therefore, for practical implementations, we need to impose a bound on the buffer sizes of these edges. For example, Figure 9.3(a) shows an IPC graph where the communication edge (s, r) could require an unbounded buffer when the execution time of A is less than that of B. In practice, we need to bound the buffer size of such an edge; we will denote such an imposed bound for a feedforward edge e by B_ff(e). Since the effect of placing such a restriction is to "artificially" constrain src(e) from getting more than B_ff(e) invocations ahead of snk(e), its effect on the estimated throughput can be modeled by adding a reverse edge that has m delays on it, where m = B_ff(e) - delay(e) (shown dashed in Figure 9.3(b)). Since the addition of this edge has the potential to reduce the estimated throughput, B_ff(e) must be chosen large enough that the maximum cycle mean remains unchanged upon adding the reverse edge with m delays.
Figure 9.3. An IPC graph with a feedforward edge: (a) original graph, in which the buffer may be unbounded; (b) the imposed buffer bound modeled as a reverse edge.
Sizing buffers optimally such that the maximum cycle mean remains unchanged has been studied by Kung, Lewis, and Lo [KLL87], where the authors propose an optimization formulation of the problem, with the number of constraints equal to the number of fundamental cycles in the HSDFG (potentially an exponential number of constraints). An efficient heuristic procedure is to first choose, for each feedforward edge e, a bound B_ff(e) large enough that the maximum cycle mean of the resulting graph does not exceed MCM(G_ipc); then, a binary search on B_ff(e) for each feedforward edge, computing the maximum cycle mean at each step and ascertaining that it does not exceed MCM(G_ipc), results in a final buffer assignment for the feedforward edges. Although this procedure is efficient, it is suboptimal because the order in which the edges e are chosen is arbitrary and may affect the quality of the final solution. As we will see in Section 9.8, however, imposing such a bound is not the best approach for bounding buffer sizes, because such a bound entails a synchronization cost. In Section 9.8 we show that there is a better technique for bounding buffer sizes; this technique achieves bounded buffer sizes by transforming the graph into a strongly connected graph by adding a minimal number of additional synchronization edges. Thus, in the final algorithm, it is not in fact necessary to use or compute these bounds.
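The following sketch is a simplified illustration of such a search, entirely our own (it is not the procedure of [KLL87]): the maximum cycle mean is evaluated with a parametric Bellman-Ford test, and B_ff(e) is the smallest bound whose reverse edge leaves the MCM unchanged. The example graph and execution times are hypothetical.

```python
def has_positive_cycle(vertices, edges, t, lam):
    # A cycle with mean > lam exists iff some cycle has positive total weight
    # under w(u, v) = t(u) - lam * delay(u, v).
    dist = {v: 0.0 for v in vertices}
    for _ in range(len(vertices) - 1):
        for (u, v, d) in edges:
            dist[v] = max(dist[v], dist[u] + t[u] - lam * d)
    return any(dist[u] + t[u] - lam * d > dist[v] + 1e-9 for (u, v, d) in edges)

def mcm(vertices, edges, t, tol=1e-4):
    lo, hi = 0.0, sum(t.values()) + 1.0      # every cycle has at least one delay
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if has_positive_cycle(vertices, edges, t, mid) else (lo, mid)
    return hi

def min_feedforward_bound(vertices, edges, t, ff_edge, max_bound=64):
    """Smallest B_ff(e) such that adding the reverse edge with
    B_ff(e) - delay(e) delays does not increase the maximum cycle mean."""
    base = mcm(vertices, edges, t)
    src, snk, delay = ff_edge
    lo, hi = delay + 1, max_bound
    while lo < hi:
        mid = (lo + hi) // 2
        extended = edges + [(snk, src, mid - delay)]
        if mcm(vertices, extended, t) <= base + 1e-3:
            hi = mid
        else:
            lo = mid + 1
    return lo

verts = ["A", "B", "C", "D"]
times = {"A": 2, "B": 3, "C": 2, "D": 4}
g = [("A", "B", 1), ("B", "A", 1), ("C", "D", 1), ("D", "C", 1), ("B", "C", 0)]
print(min_feedforward_bound(verts, g, times, ("B", "C", 0)))
```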
We define two basic synchronization protocols for a communication edge based on whether or not the length of the corresponding buffer is guaranteed to be bounded by the analysis presented in the previous section. Given an IPC graph G, and a communication edge e in G, if the length of the corresponding buffer is not bounded (that is, if e is a feedforward edge of G), then we apply a synchronization protocol called unbounded buffer synchronization (UBS), which guarantees that (a) an invocation of snk(e) never attempts to read data from an empty buffer; and (b) an invocation of src(e) never attempts to write data into the buffer unless the number of tokens in the buffer is less than some pre-specified limit B_ff(e), which is the amount of memory allocated to the buffer as discussed in the previous section.

On the other hand, if the topology of the IPC graph guarantees that the
buffer length for e is bounded by some value B_fb(e), then we use a simpler protocol, called bounded buffer synchronization (BBS), that only explicitly ensures (a) above. We now outline how the two synchronization protocols defined so far operate.
In the BBS mechanism, a write pointer wr(e) for e is maintained on the processor that executes src(e); a read pointer rd(e) for e is maintained on the processor that executes snk(e); and a copy of wr(e) is maintained in some shared memory location sv(e). The pointers rd(e) and wr(e) are initialized to zero and delay(e), respectively. Just after each execution of src(e), the new data value produced onto e is written into the shared memory buffer for e at offset wr(e); wr(e) is updated by the operation wr(e) <- (wr(e) + 1) mod B_fb(e); and sv(e) is updated to contain the new value of wr(e). Just before each execution of snk(e), the value of sv(e) is repeatedly examined until it is found to be not equal to rd(e); then the data value residing at offset rd(e) of the shared memory buffer is read, and rd(e) is updated in the same modulo fashion.

The UBS mechanism also uses the read/write pointers rd(e) and wr(e), and these are initialized the same way; however, rather than maintaining a copy of wr(e) in the shared memory location sv(e), we maintain a count (initialized to delay(e)) of the number of unread tokens that currently reside in the buffer. Just before src(e) executes, sv(e) is repeatedly examined until its value is found to be less than B_ff(e); then the new data value produced onto e is written into the shared memory buffer for e at offset wr(e); wr(e) is updated as in BBS (except that the new value is not written to shared memory); and the count in sv(e) is incremented. Just before each execution of snk(e), the value contained in sv(e) is repeatedly examined until it is found to be nonzero; then the data value residing at offset rd(e) of the shared memory buffer is read; rd(e) is updated as in BBS; and the count in sv(e) is decremented.

If there is enough shared memory to hold a buffer of size B_ff(e) for each feedforward communication edge e and a buffer of size B_fb(e) for each feedback communication edge, then each buffer can be allocated at its full size; otherwise, some of the buffers for the feedforward edges must be made smaller, with a possible loss in estimated throughput. The problem of optimally choosing which edges should be subject to stricter buffer bounds when there is a shortage of shared memory, and the selection of these stricter bounds, is an interesting area for further investigation.
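The two protocols can be rendered functionally as follows. This is a sequential Python sketch of the pointer and counter logic only (class and method names are ours); a real implementation would place sv(e) in shared memory and busy-wait on it as described above.

```python
class BBS:
    """Bounded buffer synchronization: the buffer length is provably bounded,
    so only the receiver performs an explicit check (its local read pointer
    versus the shared copy of the write pointer)."""
    def __init__(self, size, delay):
        self.buf = [None] * size
        self.size = size
        self.rd = 0                   # read pointer, local to the receiver
        self.wr = delay % size        # write pointer, local to the sender
        self.sv = delay % size        # shared copy of the write pointer

    def send(self, token):            # no check needed before writing
        self.buf[self.wr] = token
        self.wr = (self.wr + 1) % self.size
        self.sv = self.wr             # one shared-memory write

    def can_receive(self):
        return self.sv != self.rd     # one shared-memory read

    def receive(self):
        token = self.buf[self.rd]
        self.rd = (self.rd + 1) % self.size
        return token


class UBS(BBS):
    """Unbounded buffer synchronization: sv holds a count of unread tokens,
    and the sender must also check that the buffer is not full."""
    def __init__(self, size, delay):
        super().__init__(size, delay)
        self.sv = delay               # count of unread tokens

    def can_send(self):
        return self.sv < self.size    # buffer below its imposed bound

    def send(self, token):
        self.buf[self.wr] = token
        self.wr = (self.wr + 1) % self.size
        self.sv += 1                  # increment the shared count

    def can_receive(self):
        return self.sv != 0

    def receive(self):
        token = self.buf[self.rd]
        self.rd = (self.rd + 1) % self.size
        self.sv -= 1                  # decrement the shared count
        return token
```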
An important parameter in an implementation of UBS and BBS is the back-off time T_b. If a receiving processor finds that the corresponding IPC buffer is empty, then the processor releases the shared memory bus, and waits T_b time units before requesting the bus again to re-check the shared memory synchronization variable. Similarly, a sending processor that finds the buffer full waits T_b time units between successive accesses of the same synchronization variable. The back-off time can be selected experimentally by simulating the execution of the given synchronization graph (with the available execution time estimates) over a wide range of candidate back-off times, and selecting the back-off time that yields the highest simulated throughput.
As we discussed in the beginning of this chapter, some of the communication edges in G_ipc need not have explicit synchronization, whereas others require synchronization, which needs to be implemented either using the UBS protocol or the BBS protocol. All communication edges also represent buffers in shared memory. Thus we divide the set of communication edges as follows: E_comm = E_s ∪ E_r, where the edges E_s need explicit synchronization, and the edges E_r need no explicit synchronization. Recall that a communication edge (v_j, v_i) of G_ipc represents the synchronization constraint:

$start(v_i, k) \geq end(v_j, k - delay((v_j, v_i))) \quad \forall k > delay((v_j, v_i))$.   (9-3)
Thus, before we perform any optimization on synchronizations, E_comm = E_s and E_r = ∅, because every communication edge represents a synchronization point. However, in the following sections we describe how we can move certain edges from E_s to E_r, thus reducing synchronization operations in the final implementation. After all synchronization optimizations have been applied, the communication edges of the IPC graph fall into either E_s or E_r. At this point the edges E_s ∪ E_r in G_ipc represent buffer activity, and must be implemented as buffers in shared memory, whereas the edges E_s represent synchronization constraints, and are implemented using the UBS and BBS protocols introduced in the previous section. For the edges in E_s, the synchronization protocol is executed before the buffers corresponding to the communication edge are accessed, so as to ensure sender-receiver synchronization. For edges in E_r, however, no synchronization needs to be done before accessing the shared buffer. Sometimes we will also find it useful to introduce synchronization edges without actually communicating data between the sender and the receiver (for the purpose of ensuring finite buffers, for example), so that no shared buffers need to be assigned to these edges, but the corresponding synchronization protocol is invoked for these edges.
All optimizations that move edges from E_s to E_r must respect the
synchronization constraints implied by G_ipc. If we ensure this, then we only need to implement the synchronization edges E_s. We call the graph G_s = (V, E_int ∪ E_s) the synchronization graph: G_s represents the synchronization constraints that must be ensured, and the algorithms we present for minimizing synchronization costs operate on G_s. Before any synchronization-related optimizations are performed, G_s ≡ G_ipc, because E_comm = E_s at this stage; but as we move communication edges from E_s to E_r, G_s has fewer and fewer edges. Thus, moving edges from E_s to E_r can be viewed as removal of edges from G_s. Whenever we remove edges from G_s we have to ensure, of course, that the resulting synchronization graph G_s at that step respects all the synchronization constraints of G_ipc, because we only implement the synchronizations represented by the edges E_s in G_s. The following theorem is useful to formalize the concept of when the synchronization constraints represented by one synchronization graph G_s^1 imply the synchronization constraints of another graph G_s^2. This theorem provides a useful constraint for synchronization optimization, and it underlies the validity of the main techniques that we will present in this chapter.
= (V, tiongraph
U Esl)imply
.Ei,,$
the synchronization cons~aintsof the sync~roniza-
GS2= ( V , EiatU ES2) if the following condition holds:
Es',p,(
'V'E
s.t.
src (E),snk( E))5 delay (E) ;that is, if for each edge E that
CS2but not in
G,' there is a mini mu^ delay path from src( E) to
,'that has total delay of at most deZay(E) . ote that since the vertex sets for the two graphs ar
entical, it is meaningfu~
to refer to src( E) and snk( E) as being vertices of
even though there are
edges
E
sat. E E Es2,E P E,' .)
First we prove the following lemma. :If there is a path p
= (el, e2,e3, ...,e,,) in
stffrt(snk (e,,), k ) 2end( src( el), k -
Gsl,then
9. l : e following constraints hold along such a path p (as per (4-1))
rouf of ~e~~
imilarly,
start( snk( e2),k ) 2 end( src(e2),k -~eZay( e 2 ) ).
Noting that $src(e_2)$ is the same as $snk(e_1)$, we get $start(snk(e_2), k) \geq end(snk(e_1), k - delay(e_2))$.
Causality implies $end(v, k) \geq start(v, k)$, so we get

$start(snk(e_2), k) \geq start(snk(e_1), k - delay(e_2))$.   (9-5)
~ ~ b s t i t u t i n(9-4) g in (9-S), start(snk(e2),k ) 2 end(src(e,),k -delay(e2)-d e Z a y ( e ~ ) ) .
~ontinuingalong p in this manner, it can easily be verified that start(snk(e,)9k ) 2 end(src(e,), k -deZay(e,) -delay(e,-i) ...-deZay(e,))
that is,
start((snk (en), k ) 2 end( src ( e k -Delay ( p ) ) ).QED.
Proof of Theorem 9.1: If ε ∈ Es2 and ε ∈ Es1, then the synchronization constraint due to the edge ε holds in both graphs. But for each ε s.t. ε ∈ Es2, ε ∉ Es1, we need to show that the constraint due to ε,

start(snk(ε), k) ≥ end(src(ε), k − delay(ε)),   (9-6)

holds in Gs1 provided ρ_Gs1(src(ε), snk(ε)) ≤ delay(ε), which implies that there is at least one path p = (e1, e2, e3, ..., en) from src(ε) to snk(ε) in Gs1 (src(e1) = src(ε) and snk(en) = snk(ε)) such that Delay(p) ≤ delay(ε).

From Lemma 9.1, the existence of such a path p implies

start(snk(en), k) ≥ end(src(e1), k − Delay(p)),

that is,

start(snk(ε), k) ≥ end(src(ε), k − Delay(p)).   (9-7)

If Delay(p) ≤ delay(ε), then end(src(ε), k − Delay(p)) ≥ end(src(ε), k − delay(ε)). Substituting this in (9-7) we get

start(snk(ε), k) ≥ end(src(ε), k − delay(ε)).

The above relation is identical to (9-6), and this proves the theorem. QED.
The above theorem motivates the following definition.

Definition 9.1: If Gs1 = (V, Eint ∪ Es1) and Gs2 = (V, Eint ∪ Es2) are synchronization graphs with the same vertex set, we say that Gs1 preserves Gs2 if, for all ε s.t. ε ∈ Es2, ε ∉ Es1, we have ρ_Gs1(src(ε), snk(ε)) ≤ delay(ε).

Thus, Theorem 9.1 states that the synchronization constraints of (V, Eint ∪ Es1) imply the synchronization constraints of (V, Eint ∪ Es2) if (V, Eint ∪ Es1) preserves (V, Eint ∪ Es2). Given an IPC graph Gipc and a synchronization graph Gs such that Gs preserves Gipc, suppose we implement the synchronizations corresponding to the synchronization edges of Gs. Then, the iteration period of the resulting system is determined by the maximum cycle mean of Gs (λmax(Gs)). This is because the synchronization edges alone determine the interaction between processors; a communication edge without synchronization does not constrain the execution of the corresponding processors in any way.
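The condition in Theorem 9.1 and Definition 9.1 can be checked mechanically once minimum-delay path values are available. The following Python sketch is an illustration under our own representation assumptions (the SyncGraph class, the helper names, and the use of a vertex list plus an edge-to-delay map are not notation from the text): it tests whether one synchronization graph preserves another by verifying ρ_G1(src(ε), snk(ε)) ≤ delay(ε) for every edge ε of G2 that is absent from G1.

import heapq
from typing import Dict, List, Tuple

# Illustrative representation of a synchronization graph: a vertex list and a
# map from directed edges (u, v) to their (non-negative) delays.
class SyncGraph:
    def __init__(self, vertices: List[str], delays: Dict[Tuple[str, str], int]):
        self.vertices = vertices
        self.delays = delays
        self.succ = {v: [] for v in vertices}
        for (u, v), d in delays.items():
            self.succ[u].append((v, d))

def min_path_delay(g: SyncGraph, src: str, snk: str) -> float:
    # Dijkstra over edge delays; returns rho_g(src, snk), or infinity if no path.
    dist = {v: float("inf") for v in g.vertices}
    dist[src] = 0
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in g.succ[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist[snk]

def preserves(g1: SyncGraph, g2: SyncGraph) -> bool:
    # Definition 9.1: g1 preserves g2 if every edge of g2 that is absent from g1
    # is covered by a path in g1 whose total delay does not exceed the edge delay.
    # (Edges common to both graphs, e.g. the internal edges Eint, pass trivially.)
    for (u, v), d in g2.delays.items():
        if (u, v) not in g1.delays and min_path_delay(g1, u, v) > d:
            return False
    return True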
We refer to each access of the shared memory "synchronization variable" sv(e) by src(e) and snk(e) as a synchronization access to shared memory. If synchronization for e is implemented using UBS, then we see that on average, 4 synchronization accesses are required for e in each iteration period, while BBS implies 2 synchronization accesses per iteration period. We define the synchronization cost of a synchronization graph Gs to be the average number of synchronization accesses required per iteration period. Thus, if nff denotes the number of synchronization edges in Gs that are feedforward edges, and nfb denotes the number of synchronization edges that are feedback edges, then the synchronization cost of Gs can be expressed as (4nff + 2nfb). In the remainder of this chapter, we develop techniques that apply the results and the analysis framework developed in the previous sections to minimize the synchronization cost of a self-timed implementation of an HSDFG without sacrificing the integrity of any inter-processor data transfer or reducing the estimated throughput.
Note that in the measure defined above of the number of shared memory accesses required for synchronization, some accesses to shared memory are not taken into account. In particular, the "synchronization cost" metric does not consider accesses to shared memory that are performed while the sink actor is waiting for the required data to become available, or while the source actor is waiting for an "empty slot" in the buffer. The number of accesses required to perform these "busy-wait" or "spin-lock" operations is dependent on the exact relative execution times of the actor invocations. Since, in the problem context under consideration, this information is not generally available to us, the best-case number of accesses (the number of shared memory accesses required for synchronization assuming that IPC data on an edge is always produced before the corresponding sink invocation attempts to execute) is used as an approximation.

In the remainder of this chapter, we discuss two mechanisms for reducing synchronization accesses. The first (presented in Section 9.7) is the detection and removal of redundant synchronization edges, which are synchronization edges whose respective synchronization functions are subsumed by other synchronization edges, and thus need not be implemented explicitly. This technique essentially detects the set of edges that can be moved from the set Es to the set Er. In Section 9.8, we examine the utility of adding additional synchronization edges to convert a synchronization graph that is not strongly connected into a strongly connected graph. Such a conversion allows us to implement all synchronization edges with BBS. We address optimization criteria in performing such a conversion, and we will show that the extra synchronization accesses required for such a conversion are always (at least) compensated by the number of synchronization accesses that are saved by the more expensive UBS synchronizations that are converted to BBS synchronizations. Chapters 10 and 11 discuss a mechanism, called resynchronization, for inserting synchronization edges in a way that the number of original synchronization edges that become redundant exceeds the number of new edges added.
The first technique that we explore for reducing synchronization overhead is removal of redundant synchronization edges from the synchronization graph, i.e., finding a minimal set of edges Es that need explicit synchronization.

Definition 9.2: A synchronization edge is redundant in a synchronization graph G if its removal yields a synchronization graph that preserves G. Equivalently, from Definition 9.1, a synchronization edge e is redundant in the synchronization graph G if there is a path p ≠ (e) in G directed from src(e) to snk(e) such that Delay(p) ≤ delay(e). The synchronization graph G is reduced if G contains no redundant synchronization edges.

Thus, the synchronization function associated with a redundant synchronization edge "comes for free" as a by-product of other synchronizations. Figure 9.4 shows an example of a redundant synchronization edge. Here, before executing actor D, the processor that executes {A, B, C, D} does not need to synchronize with the processor that executes {E, F, G, H} because, due to the synchronization edge x1, the corresponding invocation of F is guaranteed to complete before each invocation of D is begun. Thus, x2 is redundant in Figure 9.4 and can be removed from Es into the set Er. It is easily verified that the path
p = ((F, G), (G, H), x1, (B, C), (C, D))

is directed from src(x2) to snk(x2), and has a path delay (zero) that is equal to the delay on x2. In this section, we discuss an efficient algorithm to optimally remove redundant synchronization edges from a synchronization graph.
The following theorem establishes that the order in which we remove redundant synchronization edges is not important; therefore all the redundant synchronization edges can be removed together.

Theorem 9.2: Suppose that Gs = (V, Eint ∪ Es) is a synchronization graph, e1 and e2 are distinct redundant synchronization edges in Gs (i.e., these are edges that could be individually moved to Er), and Ĝs = (V, Eint ∪ (Es − {e1})). Then e2 is redundant in Ĝs. Thus both e1 and e2 can be moved into Er together.

Proof: Since e2 is redundant in Gs, there is a path p ≠ (e2) in Gs directed from src(e2) to snk(e2) such that

Delay(p) ≤ delay(e2).   (9-8)

Similarly, since e1 is redundant in Gs, there is a path p' ≠ (e1) in Gs directed from src(e1) to snk(e1) such that

Delay(p') ≤ delay(e1).   (9-9)
Figure 9.4. x2 is an example of a redundant synchronization edge.
S Y ~ C H ~ O N I Z A T IIN O ~S E L F - T I ~ SE Y~ S T E ~ S m
Now, if p does not contain e1, then p exists in Ĝs, and we are done. Otherwise, let p' = (x1, x2, ..., xt); observe that p is of the form

p = (y1, y2, ..., y(k−1), e1, yk, y(k+1), ..., ym);

and define

p'' = (y1, y2, ..., y(k−1), x1, x2, ..., xt, yk, y(k+1), ..., ym).

Clearly, p'' is a path from src(e2) to snk(e2) in Ĝs. Also,

Delay(p'') = Delay(p') + (Delay(p) − delay(e1)) ≤ Delay(p)   (from (9-9))
Delay(p'') ≤ delay(e2)   (from (9-8)). QED.
Theorem 9.2 tells us that we can avoid implementing synchronization for all redundant synchronization edges, since the "redundancies" are not interdependent. Thus, an optimal removal of redundant synchronizations can be obtained by applying a straightforward algorithm that successively tests the synchronization edges for redundancy in some arbitrary sequence, and since computing the weight of the shortest path in a weighted directed graph is a tractable problem, we can expect such a solution to be practical.
Figure 9.5 presents an efficient algorithm, based on the ideas presented in the previous subsection, for optimal removal of redundant synchronization edges. In this algorithm, we first compute the path delay of a minimum-delay path from x to y for each ordered pair of vertices (x, y); here, we assign a path delay of ∞ whenever there is no path from x to y. This computation is equivalent to solving an instance of the well-known all points shortest paths problem (see Section 3.13). Then, we examine each synchronization edge e in some arbitrary sequence and determine whether or not there is a path from src(e) to snk(e) that does not contain e, and that has a path delay that does not exceed delay(e). This check for redundancy is equivalent to the check that is performed by the if statement in RemoveRedundantSynchs because if p is a path from src(e) to snk(e) that contains more than one edge and that contains e, then p must contain a cycle c such that c does not contain e; and since all cycles must have positive path delay (from Lemma 7.1), the path delay of such a path p must exceed delay(e). Thus, if e0 satisfies the inequality in the if statement of RemoveRedundantSynchs, and p* is a path from snk(e0) to snk(e) such that Delay(p*) = ρ(snk(e0), snk(e)), then p* cannot contain e. This observation allows us to avoid having to recompute the shortest paths after removing a candidate redundant edge from Gs.

From the definition of a redundant synchronization edge, it is easily verified that the removal of a redundant synchronization edge does not alter any of the minimum-delay path values (path delays). That is, given a redundant synchronization edge er in Gs, and two arbitrary vertices x, y ∈ V, if we let Ĝs = (V, Eint ∪ (Es − {er})), then ρ_Ĝs(x, y) = ρ_Gs(x, y). Thus, none of the minimum-delay path values computed in Step 1 need to be recalculated after removing a redundant synchronization edge in Step 3.
Observe that the complexity of the function RemoveRedundantSynchs is dominated by Step 1 and Step 3. Since all edge delays are non-negative, we can repeatedly apply Dijkstra's single-source shortest path algorithm (once for each vertex) to carry out Step 1 in O(|V|^3) time; we discussed Dijkstra's algorithm in
Figure 9.5. An algorithm that optimally removes redundant synchronization edges. (The listing defines function RemoveRedundantSynchs, which takes a synchronization graph Gs = (V, Eint ∪ Es) as input and returns the reduced graph Gs* = (V, Eint ∪ (Es − Er)).)
Section 3.13. A modification of Dijkstra's algorithm can be used to reduce the complexity of Step 1 to O(|V|^2 log2(|V|) + |V||E|) [CLR92]. In Step 3, |E| is an upper bound for the number of synchronization edges, and in the worst case, each vertex has an edge connecting it to every other member of V. Thus, the time complexity of Step 3 is O(|V||E|), and if we use the modification to Dijkstra's algorithm mentioned above for Step 1, then the time complexity of RemoveRedundantSynchs is O(|V|^2 log2(|V|) + |V||E| + |V||E|) = O(|V|^2 log2(|V|) + |V||E|).

In [Sha89], Shaffer presents an algorithm that minimizes the number of directed synchronizations in the self-timed execution of an HSDFG under the (implicit) assumption that the execution of successive iterations of the HSDFG are not allowed to overlap. In Shaffer's technique, a construction identical to the synchronization graph is used except that there is no feedback edge connecting the last actor executed on a processor to the first actor executed on the same processor, and edges that have delay are ignored since only intra-iteration dependencies are significant. Thus, Shaffer's synchronization graph is acyclic. RemoveRedundantSynchs can be viewed as an extension of Shaffer's algorithm to handle self-timed, iterative execution of an HSDFG; Shaffer's algorithm accounts for self-timed execution only within a graph iteration, and in general, it can be applied to iterative dataflow programs only if all processors are forced to synchronize between graph iterations.
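The following Python sketch illustrates the structure of RemoveRedundantSynchs as described above; it is an illustration only, and the graph representation, the helper names, and the use of Floyd-Warshall in place of repeated Dijkstra runs are our own assumptions rather than the book's pseudocode. Step 1 computes all-pairs minimum path delays ρ; Step 3 then tests each synchronization edge e (all synchronization edges are assumed to appear as keys of the delay map) for a bypass path of delay at most delay(e) that leaves src(e) through some other output edge. By Theorem 9.2, all edges found redundant against the original ρ values can be removed together.

from typing import Dict, List, Set, Tuple

INF = float("inf")
Edge = Tuple[str, str]

def all_pairs_min_delay(vertices: List[str],
                        delays: Dict[Edge, int]) -> Dict[Edge, float]:
    # Step 1: all-points shortest paths over edge delays (here via Floyd-Warshall).
    rho = {(u, v): INF for u in vertices for v in vertices}
    for v in vertices:
        rho[(v, v)] = 0
    for (u, v), d in delays.items():
        rho[(u, v)] = min(rho[(u, v)], d)
    for w in vertices:
        for u in vertices:
            for v in vertices:
                if rho[(u, w)] + rho[(w, v)] < rho[(u, v)]:
                    rho[(u, v)] = rho[(u, w)] + rho[(w, v)]
    return rho

def remove_redundant_synchs(vertices: List[str],
                            delays: Dict[Edge, int],
                            sync_edges: Set[Edge]) -> Set[Edge]:
    # Returns the subset of sync_edges that must be kept (the reduced set Es).
    rho = all_pairs_min_delay(vertices, delays)
    kept = set(sync_edges)
    for e in sorted(sync_edges):          # Step 3: arbitrary examination order
        u, v = e
        for e0, d0 in delays.items():
            # e is redundant if some other output edge e0 of src(e) starts a
            # path to snk(e) whose total delay does not exceed delay(e).
            if e0 == e or e0[0] != u:
                continue
            if d0 + rho[(e0[1], v)] <= delays[e]:
                kept.discard(e)
                break
    return kept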
In this subsection, we illustrate the benefits of removing redundant synchronizations through a practical example. Figure 9.6(a) shows an abstraction of a three-channel, multi-resolution quadrature mirror filter (QMF) bank, which has applications in signal compression [Vai93]. This representation is based on the general (not homogeneous) SDF model, and accordingly, each edge is annotated with the number of tokens produced and consumed by its source and sink actors. Actors A and F represent the subsystems that, respectively, supply and consume data to/from the filter bank system; B and C each represents a parallel combination of decimating high and low pass FIR analysis filters; D and E represent the corresponding pairs of interpolating synthesis filters. The amount of delay on the edge directed from B to E is equal to the sum of the filter orders of C and D. For more details on the application represented by Figure 9.6(a), we refer the reader to [Vai93].

To construct a periodic parallel schedule, we must first determine the number of times q(N) that each actor N must be invoked in the periodic schedule, as described in Section 3.6. Next, we must determine the precedence relationships between the actor invocations.
Figure 9.6. (a) A multi-resolution QMF filter bank used to illustrate the benefits of removing redundant synchronizations. (b) The precedence graph for (a). (c) A self-timed, two-processor, parallel schedule for (a). (d) The initial synchronization graph for (c).
In determining the exact precedence relationships, we must take into account the dependence of a given filter invocation not only on the invocation that produces the token that is "consumed" by the filter, but also on the invocations that produce the n preceding tokens, where n is the order of the filter. Such dependence can easily be evaluated with an additional dataflow parameter on each actor input that specifies the number of past tokens that are accessed [Pri91].¹ Using this information, together with the invocation counts specified by q, we obtain the precedence relationships specified by the graph of Figure 9.6(b), in which the i-th invocation of actor N is labeled Ni, and each edge e specifies that invocation snk(e) requires data produced by invocation src(e) delay(e) iteration periods after the iteration period in which the data is produced.

A self-timed schedule for Figure 9.6(b) that can be obtained from Hu's list scheduling method [Hu61] (described in Section 5.3.2) is specified in Figure 9.6(c), and the synchronization graph that corresponds to the IPC graph of Figure 9.6(b) and Figure 9.6(c) is shown in Figure 9.6(d). All of the dashed edges in Figure 9.6(d) are synchronization edges. If we apply Shaffer's method, which considers only those synchronization edges that do not have delay, we can eliminate the need for explicit synchronization along only one of the 8 synchronization edges. In contrast, if we apply RemoveRedundantSynchs, we can detect the redundancy of that edge as well as four additional redundant synchronization edges: two further edges directed from invocations of A to invocations of B, and two edges directed from invocations of B to invocations of E. Thus, RemoveRedundantSynchs reduces the number of synchronizations from 8 down to 3, a reduction of 62%. Figure 9.7 shows the synchronization graph of Figure 9.6(d) after all redundant synchronization edges are removed. It is easily verified that the synchronization edges that remain in this graph are not redundant; explicit synchronizations need only be implemented for these edges.

1. It should be noted that some SDF-based design environments choose to forgo parallelization across multiple invocations of an actor in favor of simplified code generation and scheduling. For example, in the GRAPE system, this restriction has been justified on the grounds that it simplifies inter-processor data management, reduces code duplication, and allows the derivation of efficient scheduling algorithms that operate directly on general SDF graphs without requiring the use of the acyclic precedence graph (APG) [BELP94].
In Section 9.5.1, we defined two different synchronization protocols: bounded buffer synchronization (BBS), which has a cost of 2 synchronization accesses per iteration period, and can be used whenever the associated edge is contained in a strongly connected component of the synchronization graph; and
unbounded buffer synchronization (UBS), which has a cost of 4 synchronization accesses per iteration period. We pay the additional overhead of UBS whenever the associated edge is a feedforward edge of the synchronization graph. One alternative to implementing UBS for a feedforward edge e is to add synchronization edges to the synchronization graph so that e becomes encapsulated in a strongly connected component; such a transformation would allow e to be implemented with BBS. However, extra synchronization accesses will be required to implement the new synchronization edges that are inserted. In this section, we show that by adding synchronization edges through a certain simple procedure, the synchronization graph can be transformed into a strongly connected graph in a way that the overhead of implementing the extra synchronization edges is always compensated by the savings attained by being able to avoid the use of UBS. That is, the conversion to a strongly connected synchronization graph ensures that the total number of synchronization accesses required (per iteration period) for the transformed graph is less than or equal to the number of synchronization accesses required for the original synchronization graph. Through a practical example, we show that this transformation can significantly reduce the number of required synchronization accesses. Also, we discuss a technique to compute the delay that should be added to each of the new edges added in the conversion to a strongly connected graph.
Figure 9.7. The synchronization graph of Figure 9.6(d) after all redundant synchronization edges are removed.
This technique computes the delays in a way that the estimated throughput of the IPC graph is preserved with minimal increase in the shared memory storage cost required to implement the communication edges.

Figure 9.8 presents an efficient algorithm for transforming a synchronization graph that is not strongly connected into a strongly connected graph. This algorithm simply "chains together" the source SCCs, and similarly, chains together the sink SCCs. The construction is completed by connecting the first SCC of the "source chain" to the last SCC of the "sink chain" with an edge that we call the sink-source edge. From each source or sink SCC, the algorithm selects a vertex of minimum execution time to be the chain "link" corresponding to that SCC. Minimum execution time vertices are chosen in an attempt to minimize the amount of delay that must be inserted on the new edges to preserve the estimated throughput of the original graph.
Input: A synchronization graph G that is not strongly connected.
Output: A strongly connected graph obtained by adding edges between the SCCs of G.
Generate an ordering C1, C2, ..., Cm of the source SCCs of G, and similarly, generate an ordering D1, D2, ..., Dn of the sink SCCs of G.
Select a vertex v1 ∈ C1 that minimizes t(*) over C1.
For i = 2, 3, ..., m
    Select a vertex vi ∈ Ci that minimizes t(*) over Ci.
    Instantiate the edge d0(v(i−1), vi).
End For
Select a vertex w1 ∈ D1 that minimizes t(*) over D1.
For i = 2, 3, ..., n
    Select a vertex wi ∈ Di that minimizes t(*) over Di.
    Instantiate the edge d0(w(i−1), wi).
End For
Instantiate the edge d0(wn, v1).

Figure 9.8. An algorithm for converting a synchronization graph that is not strongly connected into a strongly connected graph.
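A compact way to see how the chaining works is the sketch below. It is a schematic rendering of the chaining scheme of Figure 9.8 under our own representation assumptions (a networkx digraph, a per-vertex execution time map, and arbitrary SCC orderings, which the algorithm permits); networkx is used here only as a convenience for SCC computation.

import networkx as nx

def convert_to_sc_graph(g: nx.DiGraph, exec_time: dict) -> list:
    # Return the list of zero-delay edges to add so that g becomes strongly
    # connected, following the chaining scheme of Figure 9.8.
    cond = nx.condensation(g)                 # DAG of SCCs; node attr 'members'
    sccs = {n: cond.nodes[n]["members"] for n in cond.nodes}
    source_sccs = [sccs[n] for n in cond.nodes if cond.in_degree(n) == 0]
    sink_sccs = [sccs[n] for n in cond.nodes if cond.out_degree(n) == 0]

    def link(scc):
        # chain "link": a minimum execution time vertex of the SCC
        return min(scc, key=lambda v: exec_time[v])

    v = [link(c) for c in source_sccs]        # arbitrary ordering C1..Cm
    w = [link(d) for d in sink_sccs]          # arbitrary ordering D1..Dn
    new_edges = []
    for i in range(1, len(v)):                # chain the source SCCs
        new_edges.append((v[i - 1], v[i]))
    for i in range(1, len(w)):                # chain the sink SCCs
        new_edges.append((w[i - 1], w[i]))
    new_edges.append((w[-1], v[0]))           # the sink-source edge
    return new_edges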
mated t ~ o u g h p u of t the original graph. In Section 9.9, we discuss the selection of delays for theedges introduced by Convert-to-SC-grap~. It is easily verified that algorithm Convert-to-~C-gr~ph always produces a strongly connected graph, and that a conversion to a strongly connected graph cannot be attained by adding fewer edges than the number of edges added by Conve~-to-SC-gra~~, Figure 9.9 illustrates a possible solution obtained by algoWere, the black dashed edges are the synchronizarithm Convert~to"SC"gra~h. tion edges contained in the original sync~onizationgraph, and the grey dashed The dashed edge edges are the edges that are added by Convert~to"SC-gra~h. labeled e, is the sink-source edge. ~ s s u m i n gthe synchronization graph is connected, the number of feedforward edges nf must satisfy (nf 2( n , -1)) ,where n, is the n~mberof SCCs. This follows from the fundamental graph theoretic fact that in a connected graph (V*,E*) ,IE.1 must be at least (1V.l -1) .Now, it is easily verified that the number of new edges introduced by Convert-to-~C~grap~ is equal to (nsrc+ n,,k -1) where n,,., is the number of source secs,and It,,k is the number of sink SCCs. Thus, the number of syn~hronizationaccesses per iteration period, S, that is required to implement the edges introd~cedby CoItvert-t0-S~graph is (2 x (n,,.,+ nsnk-1)) ,while the number of sync~onizationaccesses, ?
Figure 9.9. An illustration of a possible solution obtained by algorithm Convert-to-SC-graph.
It follows that the net change (S+ − S−) in the number of synchronization accesses satisfies

(S+ − S−) = 2(nsrc + nsnk − 1) − 2nf ≤ 2(nc − 1 − nf) ≤ 2(nc − 1 − (nc − 1)) = 0,

and thus, (S+ − S−) ≤ 0. We have established the following result.
Theorem 9.3: Suppose that G is a synchronization graph, and G̃ is the graph that results from applying algorithm Convert-to-SC-graph to G. Then the synchronization cost of G̃ is less than or equal to the synchronization cost of G.

For example, without the edges added by Convert-to-SC-graph (the dashed grey edges) in Figure 9.9, there are 6 feedforward edges, which require 24 synchronization accesses per iteration period to implement. The addition of the 4 dashed edges requires 8 synchronization accesses to implement these new edges, but allows us to use BBS for the original feedforward edges, which leads to a savings of 12 synchronization accesses for the original feedforward edges. Thus, the net effect achieved by Convert-to-SC-graph in this example is a reduction of the total number of synchronization accesses by (12 − 8) = 4.

As another example, consider Figure 9.10, which shows the synchronization graph topology (after redundant synchronization edges are removed) that results from a four-processor schedule of a synthesizer for plucked-string musical instruments in seven voices based on the Karplus-Strong technique. This algorithm was also discussed in Chapter 3, as an example application that was implemented on the ordered memory access architecture prototype. This graph contains ni = 6 synchronization edges (the dashed edges), all of which are feedforward edges, so the synchronization cost is 4ni = 24 synchronization accesses per iteration period. Since the graph has one source SCC and one sink SCC, only one edge is added by Convert-to-SC-graph, and adding this edge reduces the synchronization cost to 2ni + 2 = 14, a 42% savings. Figure 9.11 shows the topology of a possible solution computed by Convert-to-SC-graph on this example. Here, the dashed edges represent the synchronization edges in the synchronization graph returned by Convert-to-SC-graph.
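The bookkeeping in these examples follows directly from the cost expressions given earlier (4 accesses per UBS edge, 2 per BBS edge, and 2 accesses for each edge added by the conversion). The small sketch below, with parameter names of our own choosing, reproduces that arithmetic and checks the two examples just discussed.

def sync_cost(n_ff: int, n_fb: int) -> int:
    # Synchronization cost with n_ff feedforward (UBS) and n_fb feedback (BBS) edges.
    return 4 * n_ff + 2 * n_fb

def cost_after_conversion(n_ff: int, n_fb: int, n_added: int) -> int:
    # After Convert-to-SC-graph every original synchronization edge uses BBS,
    # plus 2 accesses per newly added edge (n_added = nsrc + nsnk - 1).
    return 2 * (n_ff + n_fb) + 2 * n_added

# Figure 9.9 example: 6 feedforward edges, 4 edges added by the conversion.
assert sync_cost(6, 0) == 24
assert cost_after_conversion(6, 0, 4) == 20      # net reduction of 24 - 20 = 4

# Figure 9.10 example: 6 feedforward edges, one source and one sink SCC (1 edge added).
assert sync_cost(6, 0) == 24
assert cost_after_conversion(6, 0, 1) == 14      # 2*6 + 2 = 14, a 42% savings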
One important issue that remains to be addressed in the conversion of a synchronization graph Gs into a strongly connected graph G̃s is the proper insertion of delays so that G̃s is not deadlocked, and does not have lower estimated throughput than Gs. The potential for deadlock and reduced estimated throughput arise because the conversion to a strongly connected graph must necessarily introduce one or more new fundamental cycles. In general, a new cycle may be
delay-free, or its cycle mean may exceed that of the critical cycle in Gs. Thus, we may have to insert delays on the edges added by Convert-to-SC-graph. The location (edge) and magnitude of the delays that we add are significant since they affect the self-timed buffer bounds of the communication edges, as shown subsequently in Theorem 9.4. Since the self-timed buffer bounds determine the amount of memory that we allocate for the corresponding buffers, it is desirable to prevent deadlock and decrease in estimated throughput in a way that the sum of the self-timed buffer bounds over all communication edges is minimized. In this section, we outline a simple and efficient algorithm called DetermineDelays for addressing this problem. Algorithm DetermineDelays produces an optimal result if Gs has only one source SCC or only one sink SCC; in other cases, the algorithm must be viewed as a heuristic.
Figure 9.10. The synchronization graph, after redundant synchronization edges are removed, induced by a four-processor schedule of a music synthesizer based on the Karplus-Strong algorithm.
In practice, the assumptions under which we can expect an optimal result are frequently satisfied. For simplicity in explaining the optimality result that has been established for Algorithm DetermineDelays, we first specify a restricted version of the algorithm that assumes only one sink SCC. After explaining the optimality of this restricted algorithm, we discuss how it can be modified to yield an optimal algorithm for the general single-source-SCC case, and finally, we discuss how it can be extended to provide a heuristic for arbitrary synchronization graphs.

Figure 9.12 outlines the restricted version of Algorithm DetermineDelays that applies when the synchronization graph Gs has exactly one sink SCC. Here, BellmanFord is assumed to be an algorithm that takes a synchronization graph Z as input, and repeatedly applies the Bellman-Ford algorithm discussed in Section 3.13 to return the cycle mean of the critical cycle in Z; if one or more cycles exist that have zero path delay, then BellmanFord returns ∞.
Figure 9.11. A possible solution obtained by applying Convert-to-SC-graph to the example of Figure 9.10.
Function DetermineDelays
Input: Synchronization graphs Gs = (V, E) and G̃s, where G̃s is the graph computed by Convert-to-SC-graph when applied to Gs. The ordering of source SCCs generated in Step 2 of Convert-to-SC-graph is denoted C1, C2, ..., Cm. For i = 1, 2, ..., m − 1, ei denotes the edge instantiated by Convert-to-SC-graph from a vertex in Ci to a vertex in C(i+1). The sink-source edge instantiated by Convert-to-SC-graph is denoted e0.
Output: Non-negative integers d0, d1, ..., d(m−1) such that the estimated throughput of G̃s when delay(ei) = di, 0 ≤ i ≤ m − 1, equals the estimated throughput of Gs.

X0 = G̃s[e0 → ∞, ..., e(m−1) → ∞]    /* set delays on each edge to be infinite */
λmax = BellmanFord(X0)    /* compute the max. cycle mean of Gs */
dub = an upper bound on the delay required for any ei
For i = 0, 1, ..., m − 1
    δi = MinDelay(Xi, ei, λmax, dub)
    X(i+1) = Xi[ei → δi]    /* fix the delay on ei to be δi */
End For
Return δ0, δ1, ..., δ(m−1).

Function MinDelay(X, e, λ, B)
Input: A synchronization graph X, an edge e in X, a positive real number λ, and a positive integer B.
Output: Assuming X[e → B] has estimated throughput no less than 1/λ, determine the minimum d ∈ {0, 1, ..., B} such that the estimated throughput of X[e → d] is no less than 1/λ.
Perform a binary search in the range [0, 1, ..., B] to find the minimum value of r ∈ {0, 1, ..., B} such that BellmanFord(X[e → r]) returns a value less than or equal to λ. Return this minimum value of r.

Figure 9.12. An algorithm for determining the delays on the edges introduced by algorithm Convert-to-SC-graph.
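To make the binary-search structure of DetermineDelays and MinDelay concrete, the sketch below gives one possible Python rendering under our own representation assumptions (vertex execution times, an edge-to-delay map, and a list new_edges corresponding to e0, e1, ..., e(m−1)). The estimated-throughput test is implemented with the standard parametric check: the maximum cycle mean is at most λ exactly when there is no positive-weight cycle under edge weights t(src(e)) − λ·delay(e), which a Bellman-Ford style relaxation can detect; this stands in for the BellmanFord routine assumed in Figure 9.12 and is not the book's code.

from typing import Dict, List, Tuple

Edge = Tuple[str, str]

def cycle_mean_at_most(vertices: List[str], delays: Dict[Edge, int],
                       exec_time: Dict[str, float], lam: float) -> bool:
    # True iff every cycle C satisfies (sum of t(v) on C) / (sum of delay on C) <= lam,
    # i.e. no positive-weight cycle under w(u, v) = t(u) - lam * delay((u, v)).
    index = {v: i for i, v in enumerate(vertices)}
    w = [(index[u], index[v], exec_time[u] - lam * d) for (u, v), d in delays.items()]
    dist = [0.0] * len(vertices)              # implicit source reaching every vertex
    for _ in range(len(vertices)):
        changed = False
        for u, v, wt in w:
            if dist[u] + wt > dist[v] + 1e-9:
                dist[v] = dist[u] + wt
                changed = True
        if not changed:
            return True                       # stabilized: no positive cycle
    return False                              # still relaxing: positive cycle exists

def min_delay(vertices, delays, exec_time, edge: Edge, lam: float, bound: int) -> int:
    # Binary search for the smallest delay d in [0, bound] on 'edge' such that the
    # maximum cycle mean stays at most lam (cf. function MinDelay of Figure 9.12);
    # 'bound' is assumed to be a valid upper bound.
    lo, hi = 0, bound
    while lo < hi:
        mid = (lo + hi) // 2
        trial = dict(delays)
        trial[edge] = mid
        if cycle_mean_at_most(vertices, trial, exec_time, lam):
            hi = mid
        else:
            lo = mid + 1
    return lo

def determine_delays(vertices, delays, exec_time,
                     new_edges: List[Edge], lam_max: float, d_ub: int) -> List[int]:
    # Fix delays on e0, e1, ..., e(m-1) one at a time, each to the minimum value
    # that keeps the maximum cycle mean at most lam_max (cf. DetermineDelays).
    current = dict(delays)
    for e in new_edges:
        current[e] = d_ub                     # stands in for the infinite initial delays
    result = []
    for e in new_edges:
        d = min_delay(vertices, current, exec_time, e, lam_max, d_ub)
        current[e] = d
        result.append(d)
    return result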
In developing the optimality properties of Algorithm DetermineDelays, we will use the following definitions.

Definition: If G = (V, E) is a DFG, (e0, e1, ..., e(n−1)) is a sequence of distinct members of E, and Δ0, Δ1, ..., Δ(n−1) are non-negative integers, then G[e0 → Δ0, ..., e(n−1) → Δ(n−1)] denotes the DFG obtained from G by replacing each ei with an edge ei' defined by src(ei') = src(ei), snk(ei') = snk(ei), and delay(ei') = Δi. Thus, G[e0 → Δ0, ..., e(n−1) → Δ(n−1)] is simply the DFG that results from "changing the delay" on each ei to the corresponding new delay value Δi.

Definition: Suppose that G is a synchronization graph that preserves Gipc, and e is an IPC edge (in Gipc). An IPC sink-source path of e in G is a minimum-delay path in G directed from snk(e) to src(e).

The motivation for Algorithm DetermineDelays is based on the observations that the set of IPC sink-source paths introduced by Convert-to-SC-graph can be partitioned into m non-empty subsets P0, P1, ..., P(m−1) such that each member of Pi contains e0, e1, ..., ei, and contains no other members of {e0, e1, ..., e(m−1)}; and similarly, the set of fundamental cycles introduced by Convert-to-SC-graph can be partitioned into W0, W1, ..., W(m−1) such that each member of Wi contains e0, e1, ..., ei, and contains no other members of {e0, e1, ..., e(m−1)}.
By construction, a nonzero delay on any of the edges e0, e1, ..., ei contributes to reducing the cycle means of all members of Wi.¹ Algorithm DetermineDelays starts (iteration i = 0 of the For loop) by determining the minimum delay δ0 on e0 that is required to ensure that none of the cycles in W0 has a cycle mean that exceeds the maximum cycle mean λmax of Gs. Then (in iteration i = 1) the algorithm determines the minimum delay δ1 on e1 that is required to guarantee that no member of W1 has a cycle mean that exceeds λmax, assuming that delay(e0) = δ0.

Now, if delay(e0) = δ0, delay(e1) = δ1, and δ1 > 0, then for any positive integer k ≤ δ1, k units of delay can be "transferred" from e1 to e0 without violating the property that no member of (W0 ∪ W1) contains a cycle whose cycle mean exceeds λmax. However, such a transformation increases the path delay of each member of P0 while leaving the path delay of each member of P1 unchanged, and therefore such a transformation cannot reduce the self-timed buffer bound of any IPC edge.

1. See Figure 9.12 for the specification of what the ei's represent.
Furthermore, apart from transferring delay from e1 to e0, the only other change that can be made to delay(e0) or delay(e1) without introducing a member of (W0 ∪ W1) whose cycle mean exceeds λmax is to increase one or both of these values by some positive integer amount(s). Clearly, such a change cannot reduce the self-timed buffer bound on any IPC edge. Thus, we see that the values δ0 and δ1 computed by DetermineDelays for delay(e0) and delay(e1), respectively, optimally ensure that no member of (W0 ∪ W1) has a cycle mean that exceeds λmax. After computing these values, DetermineDelays computes the minimum delay δ2 on e2 that is required for all members of W2 to have cycle means less than or equal to λmax, assuming that delay(e0) = δ0 and delay(e1) = δ1.

Given the "configuration" (delay(e0) = δ0, delay(e1) = δ1, delay(e2) = δ2), transferring delay from e2 to e1 increases the path delay of all members of P1, while leaving the path delay of each member of (P0 ∪ P2) unchanged; and transferring delay from e2 to e0 increases the path delay across (P0 ∪ P1), while leaving the path delay across P2 unchanged. Thus, by an argument similar to that given to establish the optimality of (δ0, δ1) with respect to (W0 ∪ W1), we can deduce that (1) the values computed by DetermineDelays for the delays on e0, e1, e2 guarantee that no member of (W0 ∪ W1 ∪ W2) has a cycle mean that exceeds λmax; and (2) for any other assignment of delays (δ0', δ1', δ2') to (e0, e1, e2) that preserves the estimated throughput across (W0 ∪ W1 ∪ W2), and for any IPC edge e such that an IPC sink-source path of e is contained in (P0 ∪ P1 ∪ P2), the self-timed buffer bound of e under the assignment (δ0', δ1', δ2') is greater than or equal to the self-timed buffer bound of e under the assignment (δ0, δ1, δ2) computed by iterations i = 0, 1, 2 of DetermineDelays.

After extending this analysis successively to each of the remaining iterations i = 3, 4, ..., m − 1 of the For loop in DetermineDelays, we arrive at the following result.
Theorem 9.4: Suppose that Gs is a synchronization graph that has exactly one sink SCC; let G̃s and (e0, e1, ..., e(m−1)) be as in Figure 9.12; let (d0, d1, ..., d(m−1)) be the result of applying DetermineDelays to Gs and G̃s; and let (d0', d1', ..., d(m−1)') be any sequence of m non-negative integers such that G̃s[e0 → d0', ..., e(m−1) → d(m−1)'] has the same estimated throughput as Gs. Then the sum of the self-timed buffer bounds over all IPC edges in Gipc induced by the synchronization graph G̃s[e0 → d0', ..., e(m−1) → d(m−1)'] is greater than or equal to the sum induced by G̃s[e0 → d0, ..., e(m−1) → d(m−1)].
in G, induced by the sync~onizationgraph X. Figure 9.13 illustrates a solution obtained from ~ e t e r ~ i n e ~ e L aHere y s . we assume that t ( v ) = 1 ,for eachvertex v, and we assume that the set of IPC edges is {e,, e b } (for clarity, we are assuming in this example that the IPC edges are present in the given synchronization graph). The grey dashed edges are the edgesadded by Convert-to-SC-~rap~. We see that h,,,, is determined by the cycle in the sink SCC of the original graph, and inspection of this cycle yields h,,, = 4 , Also, we see that the set W O the set of fundamental cycles that contain e, ,and do not contain e l consists of a single cycle cO that contains three edges. By inspection of this cycle, we see that the minimum delay on e,, required to guarantee that its cycle mean does not exceed h,,,v is 1. Thus, the i = 0 iteration of the For loop in ~ e t e r ~ ~ n e ~ ecomputes Z a y s 6, = l .Next, we see that Wl consists of a single cycle that contains five edges, and we see that two delays must be present on this cycle for its cycle mean to be less than or equal to h,,, .Since one delay has been placed on e , ,~ e t e r ~ i n e ~ e Lcomays putes 6, = l in the i = l iteration of the For loop. Thus, the solution determined by ~ e t e r ~ i n e ~ e z for a y sFigure 9.13 is (6,, 6,) = (1, l ) ;the resulting self-timed buffer bounds of e, and eb are, respectively, 1 and 2 ; and
Figure 9.13. An example used to illustrate a solution obtained by algorithm DetermineDelays.
Now (2, 0) is an alternative assignment of delays on (e0, e1) that preserves the estimated throughput of the original graph. However, in this assignment the self-timed buffer bounds of ea and eb are identically equal to 2, for a total of 4, one greater than the corresponding sum from the delay assignment (1, 1) computed by DetermineDelays. Thus, if G̃s denotes the graph returned by Convert-to-SC-graph for the example of Figure 9.13, we have that the sum of the self-timed buffer bounds over all IPC edges induced by G̃s[e0 → 2, e1 → 0] is greater than the sum induced by G̃s[e0 → 1, e1 → 1].
Algorithm DetermineDelays can easily be modified to optimally handle general graphs that have only one source SCC. Here, the algorithm specification remains essentially the same, with the exception that for i = 1, 2, ..., (m − 1), ei denotes the edge directed from a vertex in D(m−i) to a vertex in D(m−i+1), where D1, D2, ..., Dm is the ordering of sink SCCs generated in Step 2 of the corresponding invocation of Convert-to-SC-graph (e0 still denotes the sink-source edge instantiated by Convert-to-SC-graph). By adapting the reasoning behind Theorem 9.4, it is easily verified that when it is applicable, this modified algorithm always yields an optimal solution.

As far as we are aware, there is no straightforward extension of DetermineDelays to general graphs (multiple source SCCs and multiple sink SCCs) guaranteed to yield optimal solutions. The fundamental problem for the general case is the inability to derive the partitions W0, W1, ..., W(m−1) (P0, P1, ..., P(m−1)) of the fundamental cycles (IPC sink-source paths) introduced by Convert-to-SC-graph such that each Wi (Pi) contains e0, e1, ..., ei, and contains no other members of Ea = {e0, e1, ..., e(m−1)}, where Ea is the set of edges added by Convert-to-SC-graph. The existence of such partitions was crucial to our development of Theorem 9.4 because it implied that once the minimum values for e0, e1, ..., ei are successively computed, "transferring" delay from some ei to some ej, j < i, cannot reduce the self-timed buffer bound of any IPC edge.
DetermineDelays can be extended to yield heuristics for the general case in which the original synchronization graph Gs contains more than one source SCC and more than one sink SCC. For example, if (a1, a2, ..., ak) denote the edges that were instantiated by Convert-to-SC-graph "between" the source SCCs, with each ai representing the i-th edge created, and similarly, (b1, b2, ..., bl) denote the sequence of edges instantiated between the sink SCCs, then algorithm DetermineDelays can be applied with the modification that m = k + l + 1, and with the remaining edges e1, e2, ..., e(m−1) taken from among the ai's and bi's, where e0 is the sink-source edge from Convert-to-SC-graph.

The derivation of alternative heuristics for general synchronization graphs appears to be an interesting direction for further research. It should be noted, though, that practical synchronization graphs frequently contain either a single source SCC or a single sink SCC, or both, such as the example of Figure 9.10, so that algorithm DetermineDelays, together with its counterpart for graphs that have a single source SCC, forms a widely-applicable solution for optimally determining the delays on the edges created by Convert-to-SC-graph.
Figure 9.14. A synchronization graph, after processing by Convert-to-SC-graph, such that there is no m-way partition W0, W1, ..., W(m−1) of the fundamental cycles introduced by Convert-to-SC-graph that satisfies both (1) each Wi contains e0, e1, ..., ei, and (2) each Wi does not contain any member of {e(i+1), e(i+2), ..., e(m−1)}. Here, the dashed edges are the edges instantiated by Convert-to-SC-graph. It is easily verified that the fundamental cycles introduced by Convert-to-SC-graph cannot be decomposed in this manner, even if we are allowed to reorder the ei's.
If there exist constants T and D such that t(v) ≤ T for all v, and delay(e) ≤ D for all edges e, then the complexity of BellmanFord is O(|V||E|log2(|V|)) (see Section 3.13.2); and we have

Σ over v of t(v) ≤ T|V|,

so that dub ≤ DT|V|. Thus, each invocation of MinDelay runs in O(|V||E|(log2(|V|))^2) time.
It follows that DetermineDelays, and any of the variations of DetermineDelays defined above, is O(m|V||E|(log2(|V|))^2), where m is the number of edges instantiated by Convert-to-SC-graph. Since m = (nsrc + nsnk − 1), where nsrc is the number of source SCCs and nsnk is the number of sink SCCs, it is obvious that m ≤ |V|. With this observation, and the observation that |E| ≤ |V|^2, we have that DetermineDelays and its variations are O(|V|^4 (log2(|V|))^2). Furthermore, it is easily verified that the time complexity of DetermineDelays dominates that of Convert-to-SC-graph, so the time complexity of applying Convert-to-SC-graph and DetermineDelays in succession is also O(|V|^4 (log2(|V|))^2).
Although the issue of deadlock does not explicitly arise in algorithm DetermineDelays, the algorithm does guarantee that the output graph is not deadlocked, assuming that the input graph is not deadlocked. This is because (from Lemma 7.1) deadlock is equivalent to the existence of a cycle that has zero path delay, and is thus equivalent to an infinite maximum cycle mean. Since DetermineDelays does not increase the maximum cycle mean, it follows that the algorithm cannot convert a graph that is not deadlocked into a deadlocked graph.
Converting a mixed grain HSDFG that contains feedforward edges into a strongly connected graph has been studied by Zivojnovic et al. in the context of retiming when the assignment of actors to processors is fixed beforehand. In this case, the objective is to retime the input graph so that the number of communication edges that have nonzero delay is maximized, and the conversion is performed to constrain the set of possible retimings in such a way that an integer linear programming formulation can be developed. The technique generates two dummy vertices that are connected by an edge; the sink vertices of the original graph are connected to one of the dummy vertices, while the other dummy vertex is connected to each source. It is easily verified that in a self-
timed execution, this scheme requires at least four more synchronization accesses per graph iteration than the method that we have proposed. We can obtain further relative savings if we succeed in detecting one or more beneficial resynchronization opportunities. The effect of Zivojnovic's retiming algorithm on synchronization overhead is unpredictable since, on one hand, a communication edge becomes "easier to make redundant" when its delay increases, while on the other hand, the edge becomes less useful in making other communication edges redundant since the path delay of all paths that contain the edge increases.
This chapter has developed two software strategies for minimizing synchronization overhead when implementing self-timed, iterative dataflow programs. These techniques rely on a graph-theoretic analysis framework based on two data structures called the interprocessor communication graph and the synchronization graph. This analysis framework allows us to determine the effects on throughput and buffer sizes of modifying the points in the target program at which synchronization functions are carried out, and we have shown how this framework can be used to extend an existing technique (removal of redundant synchronization edges for non-iterative programs) to the iterative case, and to develop a new method for reducing synchronization overhead (the conversion of a synchronization graph into a strongly connected graph so that a more efficient synchronization protocol can be used).

As in Chapter 7, the main premise of the techniques discussed in this chapter is that estimates are available for the execution times of actors such that the actual execution time of an actor exhibits large variation from its corresponding estimate only with very low frequency. Accordingly, our techniques have been devised to guarantee that if the actual execution time of each actor invocation is always equal to the corresponding execution time estimate, then the throughput of an implementation that incorporates our synchronization minimization techniques is never less than the throughput of a corresponding unoptimized implementation; that is, we never accept an opportunity to reduce synchronization overhead if it constrains execution in such a way that throughput is decreased. Thus, the techniques discussed in this chapter are particularly relevant to embedded DSP applications, where the price of synchronization is high, and accurate execution time estimates are often available, but guarantees on these execution times do not exist due to infrequent events such as cache misses, interrupts, and error handling.

In the next two chapters, we discuss a third software-based technique, called resynchronization, for reducing synchronization overhead in application-specific multiprocessors.
This chapter discusses a technique, called resynchronization, for reducing synchronization overhead in application-specific multiprocessor implementations. The technique applies to arbitrary collections of dedicated, programmable, or configurable processors, such as combinations of programmable DSPs, ASICs, and FPGA subsystems. Resynchronization is based on the concept of redundant synchronization operations, which was defined in the previous chapter. The objective of resynchronization is to introduce new synchronizations in such a way that the number of original synchronizations that consequently become redundant is significantly greater than the number of new synchronizations.
Intuitively, resynchronization is the process of adding one or more new synchronization edges and removing the redundant edges that result. Figure 10.1(a) illustrates how this concept can be used to reduce the total number of synchronizations in a multiprocessor implementation. Here, the dashed edges represent synchronization edges. Observe that if the new synchronization edge d0(C, H) is inserted, then two of the original synchronization edges, (B, G) and (E, J), become redundant. Since redundant synchronization edges can be removed from the synchronization graph to yield an equivalent synchronization graph, we see that the net effect of adding the synchronization edge d0(C, H) is to reduce the number of synchronization edges that need to be implemented by 1. Figure 10.1(b) shows the synchronization graph that results from inserting the resynchronization edge d0(C, H) into Figure 10.1(a), and then removing the redundant synchronization edges that result.

Definition 10.1 gives a formal definition of resynchronization. This considers resynchronization only "across" feedforward edges. Resynchronization that includes inserting edges into SCCs is also possible; however, in general, such resynchronization may decrease the estimated throughput (see Theorem 10.1 at
the end of Section 10.2). Thus, for our objectives, it must be verified that each new synchronization edge introduced in an SCC does not decrease the estimated throughput. To avoid this complication, which requires a check of significant complexity (O(|V||E|log2(|V|)), where (V, E) is the modified synchronization graph, using the Bellman-Ford algorithm described in Section 3.13.2) for each candidate resynchronization edge, we focus only on "feedforward" resynchronization in this chapter. Future research will address combining the insights developed here for feedforward resynchronization with efficient techniques to estimate the impact that a given feedback resynchronization edge has on the estimated throughput.

Opportunities for feedforward resynchronization are particularly abundant in the dedicated hardware implementation of dataflow graphs. If each actor is mapped to a separate piece of hardware, as in the VLSI dataflow arrays of Kung, then for any application graph that is acyclic, every communication channel between two units will have an associated feedforward synchronization edge. Due to increasing circuit integration levels, such isomorphic mapping of dataflow subsystems into hardware is becoming attractive for a growing family of applications. Feedforward synchronization edges often arise naturally in multiprocessor software implementations as well.
Figure 10.1. An example of resynchronization.
A software example is reviewed in detail in Section 10.5.
Definition 10.1: Suppose that G = (V, E) is a synchronization graph, and F = {e1, e2, ..., en} is the set of all feedforward edges in G. A resynchronization of G is a finite set R = {e1', e2', ..., em'} of edges that are not necessarily contained in E, but whose source and sink vertices are in V, such that (a) e1', e2', ..., em' are feedforward edges in the HSDFG G* = (V, ((E − F) + R)); and (b) G* preserves G, that is, ρ_G*(src(ei), snk(ei)) ≤ delay(ei) for all i ∈ {1, 2, ..., n}. Each member of R that is not in E is called a resynchronization edge of the resynchronization R, G* is called the resynchronized graph associated with R, and this graph is denoted by Ψ(R, G).

If we let G denote the graph in Figure 10.1, then the set of feedforward edges is F = {(B, G), (E, J), (E, C), (H, I)}; R = {d0(C, H), (E, C), (H, I)} is a resynchronization of G; Figure 10.1(b) shows the HSDFG

G* = (V, ((E − F) + R));

and from Figure 10.1(b), it is easily verified that F, R, and G* satisfy conditions (a) and (b) of Definition 10.1.

Typically, resynchronization is meaningful only in the context of synchronization graphs that are not deadlocked, that is, synchronization graphs that do not contain any delay-free cycles, or equivalently, that do not have an infinite maximum cycle mean. In the remainder of this chapter and throughout Chapter 11, we are concerned only with deadlock-free synchronization graphs. Thus, unless otherwise stated, we assume the absence of delay-free synchronization graph cycles. In practice, this assumption is not a problem, since delay-free cycles can be detected efficiently [Kar78].
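Conditions (a) and (b) of Definition 10.1 can be checked directly. The sketch below is our own illustration (the graph representation, the helper names, and the use of networkx are assumptions, not the book's notation; every edge is assumed to carry a "delay" attribute). It builds G* = (V, (E − F) + R), confirms that every member of R is a feedforward edge of G*, and confirms that G* preserves G.

import networkx as nx

def is_feedforward(g: nx.DiGraph, edge) -> bool:
    # An edge is a feedforward edge iff its endpoints lie in different SCCs.
    scc_of = {}
    for i, comp in enumerate(nx.strongly_connected_components(g)):
        for v in comp:
            scc_of[v] = i
    u, v = edge
    return scc_of[u] != scc_of[v]

def preserves(g_star: nx.DiGraph, g: nx.DiGraph) -> bool:
    # Condition (b): rho_{G*}(src(e), snk(e)) <= delay(e) for every edge e of G.
    for u, v, data in g.edges(data=True):
        try:
            rho = nx.shortest_path_length(g_star, u, v, weight="delay")
        except nx.NetworkXNoPath:
            return False
        if rho > data["delay"]:
            return False
    return True

def is_resynchronization(g: nx.DiGraph, R) -> bool:
    # Check Definition 10.1 for a candidate edge set R (new edges get zero delay).
    F = [e for e in g.edges if is_feedforward(g, e)]
    g_star = g.copy()
    g_star.remove_edges_from(F)
    g_star.add_edges_from((u, v, {"delay": 0}) for (u, v) in R)
    return all(is_feedforward(g_star, e) for e in R) and preserves(g_star, g)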
This section reviews a number of useful properties of synchronization redundancy and resynchronization that we will apply throughout the developments of this chapter and Chapter 11.
Lemma: Suppose that G = (V, E) is a synchronization graph and s is a redundant synchronization edge in G. Then there exists a simple path p in G directed from src(s) to snk(s) such that p does not contain s, and Delay(p) ≤ delay(s).

Proof: Let G' = (V, (E − {s})) denote the synchronization graph that results when we remove s from G. Then from Definition 9.2, there exists a path p' in G' directed from src(s) to snk(s) such that

Delay(p') ≤ delay(s).   (10-1)

Now observe that every edge in G' is also contained in G, and thus, G contains the path p'. If p' is a simple path, then we are done. Otherwise, p' can be expressed as a concatenation

p' = (q0, C1, q1, C2, q2, ..., Cn, qn),   (10-2)

where each qi is a simple path, at least one qi is non-empty, and each Cj is a (not necessarily simple) cycle. Since valid synchronization graphs cannot contain delay-free cycles (Section 10.1), we must have Delay(Ck) ≥ 1 for 1 ≤ k ≤ n. Thus, since each Ci originates and terminates at the same actor, the path p'' = (q0, q1, ..., qn) is a simple path directed from src(s) to snk(s) such that Delay(p'') < Delay(p'). Combining this last inequality with (10-1) yields

Delay(p'') ≤ delay(s).   (10-3)

Furthermore, since p' is contained in G, it follows from the construction of p'' that p'' must also be contained in G. Finally, since p' is contained in G', G' does not contain s, and the set of edges contained in p'' is a subset of the set of edges contained in p', we have that p'' does not contain s. QED.

Lemma 10.1: Suppose that G and G' are synchronization graphs such that G' preserves G, and p is a path in G from actor x to actor y. Then there is a path p' in G' from x to y such that Delay(p') ≤ Delay(p), and tr(p) ⊆ tr(p'), where tr(φ) denotes the set of actors traversed by the path φ.
Thus, if a synchronization graph G' preserves another synchronization graph G and p is a path in G from actor x to actor y, then there is at least one path p' in G' such that 1) the path p' is directed from x to y; 2) the cumulative delay on p' does not exceed the cumulative delay on p; and 3) every actor that is traversed by p is also traversed by p' (although p' may traverse one or more actors that are not traversed by p). For example, in Figure 10.1(a), if we let x = B, y = I, and

p = ((B, G), (G, H), (H, I)),   (10-4)

then a corresponding path p' in Figure 10.1(b) confirms Lemma 10.1 for this example. Here tr(p) = {B, G, H, I} and tr(p') = {A, B, C, G, H, I}.
Proof of Lemma 10.1: Let p = (e1, e2, ..., en). By definition of the preserves relation, each ei that is not a synchronization edge in G is contained in G'. For each ei that is a synchronization edge in G, there must be a path pi in G' from src(ei) to snk(ei) such that Delay(pi) ≤ delay(ei). Let e(i1), e(i2), ..., e(im), with i1 < i2 < ... < im, denote the synchronization edges of G traversed by p, and let p̂ denote the path obtained from p by replacing each such edge with its corresponding path in G'. Clearly, p̂ is a path in G' from x to y, and since Delay(pi) ≤ delay(ei) holds whenever ei is a synchronization edge, it follows that Delay(p̂) ≤ Delay(p). Furthermore, from the construction of p̂, it is apparent that every actor that is traversed by p is also traversed by p̂. QED.

The following lemma states that if a resynchronization contains a resynchronization edge e such that there is a delay-free path in the original synchronization graph from the source of e to the sink of e, then e must be redundant in the resynchronized graph.

Lemma: Suppose that G is a synchronization graph, R is a resynchronization of G, and (x, y) is a resynchronization edge such that ρ_G(x, y) = 0. Then (x, y) is redundant in Ψ(R, G). Thus, a minimal resynchronization (fewest number of elements) has the property that ρ_G(x', y') > 0 for each resynchronization edge (x', y').
Proof: Let p denote a minimum-delay path from x to y in G. Since (x, y) is a resynchronization edge, (x, y) is not contained in G, and thus, p traverses at least three actors. From Lemma 10.1, it follows that there is a path p' in Ψ(R, G) from x to y such that

Delay(p') = 0,   (10-6)

and p' traverses at least three actors. Thus,

Delay(p') ≤ delay((x, y)),   (10-7)

and p' ≠ ((x, y)). Furthermore, p' cannot properly contain (x, y). To see this, observe that if p' contains (x, y) but p' ≠ ((x, y)), then from (10-6), it follows that there exists a delay-free cycle in G (that traverses x), and hence that our assumption of a deadlock-free schedule (Section 10.1) is violated. Thus, we conclude that (x, y) is redundant in Ψ(R, G). QED.
As a consequence of Lemma 10.1, the estimated throughput of a given synchronization graph is always less than or equal to that of every synchronization graph that it preserves.

Theorem 10.1: If G is a synchronization graph, and G' is a synchronization graph that preserves G, then λmax(G') ≥ λmax(G).
Proof: Suppose that C is a critical cycle in G. Lemma 10.1 guarantees that there is a cycle C' in G' such that a) Delay(C') ≤ Delay(C), and b) the set of actors that are traversed by C is a subset of the set of actors traversed by C'. Now clearly, b) implies that

Σ over v traversed by C' of t(v) ≥ Σ over v traversed by C of t(v),   (10-8)

and this observation together with a) implies that the cycle mean of C' is greater than or equal to the cycle mean of C. Since C is a critical cycle in G, it follows that λmax(G') ≥ λmax(G). QED.
Thus, any saving in synchronization cost obtained by rearranging synchronization edges may come at the expense of a decrease in estimated throughput. As implied by Definition 10.1, we avoid this complication by restricting our attention to feedforward synchronization edges. Clearly, resynchronization that rearranges only feedforward synchronization edges cannot decrease the estimated throughput since no new cycles are introduced and no existing cycles are altered. Thus, with the form of resynchronization that is addressed in this chapter, any decrease in synchronization cost that we obtain is not diminished by a degradation of the estimated throughput.
The resynchronization problem is the problem of finding a resynchronization with the fewest number of synchronization edges. In Section 10.4, it is formally shown that the resynchronization problem is NP-hard, which means that it is unlikely that efficient algorithms can be devised to solve the problem exactly, and thus, for practical use, we should search for good heuristic solutions [GJ79]. In this section, we explain the intuition behind this result. To establish the NP-hardness of the resynchronization problem, we examine a special case that occurs when there are exactly two SCCs, which we call the pairwise resynchronization problem, and we derive a polynomial-time reduction from the classic set-covering problem [CLR92], a well-known NP-hard problem, to the pairwise resynchronization problem. In the set-covering problem, one is given a finite set X and a family T of subsets of X, and asked to find a minimal (fewest number of members) subfamily Ts ⊆ T such that

the union of all t ∈ Ts equals X.
:Given a synchronization graph G , let ( x , , x,) be a sync~roni-
,and let ( y , ,y,) be an ordered pair of actors in G .We say that es (x17 x,) in G if
p(”,, y1) + Po‘),,)2‘ delaY((x17 ’ Thus,everysynchronizationedgesubsumes itself, and intuitively, if ( x , , x,) is a synchronization edge, then (y , , y , ) subsumes ( x , , x,) if and only if a zero-delay synchronization edge directed from y , to y2 makes ( x , , x,) redundant. The following fact is easily verified from Definitions 10.1 and 10.2. :Suppose that G is a synchronization graph that contains exactly two SCCs, F is the set of feedforward edges in G , and F’ is a resynchronization of G . Then for each e E F , there exists e’ E F’ such that (src(e’), snk( e’)) subsumes e in G .
An intuitive correspondence between the pairwise resynchronization problem and the set covering problem can be derived from Fact 10.2. Suppose that G is a synchronization graph with exactly two SCCs, Cl and C, ,such that each feedforward edge is directed from a member of C, to a member of C,. We start by viewing the set F of feedforward edges in G as the finite set that we wish to cover, and with each member p of {( x , y ) I ( x E C , , y E C , ) },we associate the subset of F defined by ~ ( p=){e E F1 ( p subsumes e ) } .Thus, ~ ( p is) the set of feedforward edges of G whose corresponding synchronizations can be eliminated if we implement a zero-delay synchronization edge directed from the first vertex of the ordered pair p to the second vertex of p . Clearly then, {el’, e,’, ...,e,,’} is a resynchronization if and only if each e E F is contained in at least one X( (src( ei’), snk( e;’))) - that is, and ifonly if src( ei’), snk( e;’)))I 1 S i S n } covers F. Thus, solving the pairwise resynchronization problem for G is equivalent to finding a minimal cover for F given the family of subsets { ~ ( x ,y ) I( x E Cl, y E C,)} .
{x((
Figure 10.2 helps to illustrate this intuition. Suppose that we are given the set X = {x1, x2, x3, x4} and the family of subsets T = {t1, t2, t3}, where t1 = {x1, x3}, t2 = {x1, x2}, and t3 = {x2, x4}. To construct an instance of the pairwise resynchronization problem, we first create two vertices and an edge directed between these vertices for each member of X; we label each of the edges created in this step with the corresponding member of X. Then for each t ∈ T, we create two vertices vsrc(t) and vsnk(t). Next, for each relation xi ∈ tj (there are six such relations in this example), we create two delayless edges: one directed from the source of the edge corresponding to xi to vsrc(tj), and another directed from vsnk(tj) to the sink of the edge corresponding to xi. This last step has the effect of making each
(vsrc(ti), vsnk(ti)) subsume exactly those edges that correspond to members of ti; in other words, after this construction, χ((vsrc(ti), vsnk(ti))) = ti, for each i. Finally, for each edge created in the previous step, we create a corresponding feedback edge oriented in the opposite direction, and having a unit delay.
Figure 10.2. (a) An instance of the pairwise resynchronization problem that is derived from an instance of the set-covering problem; (b) the HSDFG that results from a solution to this instance of pairwise resynchronization.
Figure 10.2(a) shows the synchronization graph that results from this construction process. Here, it is assumed that each vertex corresponds to a separate processor; the associated unit-delay self-loop edges are not shown to avoid clutter. Observe that the graph contains two SCCs, the SCC ({src(xi)} ∪ {vsrc(ti)}) and the SCC ({snk(xi)} ∪ {vsnk(ti)}), and that the set of feedforward edges is the set of edges that correspond to members of X. Now, recall that a major correspondence between the given instance of set covering and the instance of pairwise resynchronization defined by Figure 10.2(a) is that χ((vsrc(ti), vsnk(ti))) = ti, for each i. Thus, if we can find a minimal resynchronization of Figure 10.2(a) such that each edge in this resynchronization is directed from some vsrc(tk) to the corresponding vsnk(tk), then the associated tk's form a minimal cover of X. For example, it is easy, albeit tedious, to verify that the resynchronization illustrated in Figure 10.2(b),

    {d0(vsrc(t1), vsnk(t1)), d0(vsrc(t3), vsnk(t3))},

is a minimal resynchronization of Figure 10.2(a), and from this, we can conclude that {t1, t3} is a minimal cover for X. From inspection of the given sets X and T, it is easily verified that this conclusion is correct. This example illustrates how an instance of pairwise resynchronization can be constructed (in polynomial time) from an instance of set covering, and how a solution to this instance of pairwise resynchronization can easily be converted into a solution of the set-covering instance. The formal proof of the NP-hardness of pairwise resynchronization that is given in the following section is a generalization of the example in Figure 10.2.
In this section, the NP-hardness of the resynchronization problem is established. This result is derived by reducing an arbitrary instance of the set-covering problem, a well-known NP-hard problem, to an instance of the pairwise resynchronization problem, which is a special case of the resynchronization problem that occurs when there are exactly two SCCs. The intuition behind this reduction is explained in Section 10.3 above. Suppose that we are given an instance (X, T) of set covering, where X is a finite set, and T is a family of subsets of X that covers X. Without loss of generality, we assume that
T does not contain a proper nonempty subset T' that satisfies

    (∪_{t ∈ (T - T')} t) ∩ (∪_{t ∈ T'} t) = ∅.      (10-9)
We can assume this without loss of generality because if this assumption does not hold, then we can apply the construction below to each “independent subfamily”
separately, and then combine the results to get a minimal cover for X. The following steps specify how we construct an HSDFG from (X, T). Except where stated otherwise, no delay is placed on the edges that are instantiated.

1. For each x ∈ X, instantiate two vertices vsrc(x) and vsnk(x), and instantiate an edge e(x) directed from vsrc(x) to vsnk(x).
2. For each t ∈ T:
   (a) Instantiate two vertices vsrc(t) and vsnk(t).
   (b) For each x ∈ t:
       * Instantiate an edge directed from vsrc(x) to vsrc(t).
       * Instantiate an edge directed from vsrc(t) to vsrc(x), and place one delay on this edge.
       * Instantiate an edge directed from vsnk(t) to vsnk(x).
       * Instantiate an edge directed from vsnk(x) to vsnk(t), and place one delay on this edge.

3. For each vertex v that has been instantiated, instantiate an edge directed from v to itself, and place one delay on this edge.
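For illustration, the construction can be written out as a short routine; the sketch below (illustrative only) uses a hypothetical edge-list representation in which each edge is a (source, sink, delay) triple.

    def build_pairwise_instance(X, T):
        # Returns (vertices, edges) for the HSDFG constructed in steps 1-3.
        vertices, edges = set(), []
        for x in X:                                       # step 1
            vertices |= {("vsrc", x), ("vsnk", x)}
            edges.append((("vsrc", x), ("vsnk", x), 0))   # the edge e(x)
        for i, t in enumerate(T):                         # step 2
            ti = "t%d" % i
            vertices |= {("vsrc", ti), ("vsnk", ti)}
            for x in t:
                edges.append((("vsrc", x), ("vsrc", ti), 0))
                edges.append((("vsrc", ti), ("vsrc", x), 1))
                edges.append((("vsnk", ti), ("vsnk", x), 0))
                edges.append((("vsnk", x), ("vsnk", ti), 1))
        for v in vertices:                                # step 3: unit-delay self loops
            edges.append((v, v, 1))
        return vertices, edges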
Observe from our construction that whenever x ∈ X is contained in t ∈ T, there is an edge directed from vsrc(x) (vsnk(t)) to vsrc(t) (vsnk(x)), and there is also an edge (having unit delay) directed from vsrc(t) (vsnk(x)) to vsrc(x) (vsnk(t)). Thus, from the assumption stated in (10-9), it follows that {vsrc(z) | z ∈ (X ∪ T)} forms one SCC, {vsnk(z) | z ∈ (X ∪ T)} forms another SCC, and F = {e(x) | x ∈ X} is the set of feedforward edges.
Let G denote the HSDFG that we have constructed, and as in Section 10.3, define χ(p) = {e ∈ F | p subsumes (src(e), snk(e))} for each ordered pair of vertices p = (y1, y2) such that y1 is contained in the source SCC of G, and y2 is contained in the sink SCC of G. Clearly, G gives an instance of the pairwise resynchronization problem.

Observation 1: By construction of G, observe that {x ∈ X | (vsrc(t), vsnk(t)) subsumes (vsrc(x), vsnk(x))} = t for all t ∈ T. Thus, for all t ∈ T, χ(vsrc(t), vsnk(t)) = {e(x) | x ∈ t}.
Observation 2: For each x ∈ X, all input edges of vsrc(x) have unit delay on them. It follows that for any vertex y in the sink SCC of G, χ(vsrc(x), y) ⊆ {e ∈ F | src(e) = vsrc(x)} = {e(x)}.
Observation 3: For each t ∈ T, the only vertices in G that have a delay-free path to vsrc(t) are those vertices contained in {vsrc(x) | x ∈ t}. It follows that for any vertex y in the sink SCC of G, χ(vsrc(t), y) ⊆ χ(vsrc(t), vsnk(t)) = {e(x) | x ∈ t}.
Now suppose that F' = {f1, f2, ..., fm} is a minimal resynchronization of G. For each i ∈ {1, 2, ..., m}, exactly one of the following two cases must apply:

Case 1: vsrc(fi) = vsrc(x) for some x ∈ X. In this case, we pick an arbitrary t ∈ T that contains x, and we set vi = vsrc(t) and wi = vsnk(t). From Observation 2, it follows that

    χ((src(fi), snk(fi))) ⊆ {e(x)} ⊆ χ(vi, wi).

Case 2: vsrc(fi) = vsrc(t) for some t ∈ T. We set vi = vsrc(t) and wi = vsnk(t). From Observation 3, we have

    χ((src(fi), snk(fi))) ⊆ χ(vi, wi).

From our definition of the vi's and wi's, {d0(vi, wi) | i ∈ {1, 2, ..., m}} is a minimal resynchronization of G.

Observation 4: Each (vi, wi) is of the form (vsrc(t), vsnk(t)), where t ∈ T.
Now, for each i ∈ {1, 2, ..., m}, we define

    Zi = {x ∈ X | (vi, wi) subsumes (vsrc(x), vsnk(x))}.

Claim: {Z1, Z2, ..., Zm} covers X.
Proof: From Observation 4, we have that for each Zi, there exists a t ∈ T such that Zi = {x ∈ X | (vsrc(t), vsnk(t)) subsumes (vsrc(x), vsnk(x))}. Thus, each Zi is a member of T. Also, since {d0(vi, wi) | i ∈ {1, 2, ..., m}} is a resynchronization of G, each member of {(vsrc(x), vsnk(x)) | x ∈ X} must be preserved by some (vi, wi), and thus each x ∈ X must be contained in some Zi. QED.

Claim: {Z1, Z2, ..., Zm} is a minimal cover for X.
Proof (by contraposition): Suppose there exists a cover {Y1, Y2, ..., Ym'} (among the members of T) for X, with m' < m. Then, each x ∈ X is contained in some Yi, and from Observation 1, (vsrc(Yi), vsnk(Yi)) subsumes e(x). Thus, {(vsrc(Yi), vsnk(Yi)) | i ∈ {1, 2, ..., m'}} is a resynchronization of G. Since m' < m, it follows that F' = {f1, f2, ..., fm} is not a minimal resynchronization of G. QED.
In summary, we have shown how to convert an arbitrary instance (X, T) of the set-covering problem into an instance G of the pairwise resynchronization problem, and we have shown how to convert a solution F' = {f1, f2, ..., fm} of this instance of pairwise resynchronization into a solution {Z1, Z2, ..., Zm} of (X, T). It is easily verified that all of the steps involved in deriving G from (X, T), and in deriving {Z1, Z2, ..., Zm} from F', can be performed in polynomial time. Thus, from the NP-hardness of set covering [CLR92], we can conclude that the pairwise resynchronization problem is NP-hard.
A heuristic framework for the pairwise resynchronization problem emerges naturally from the relationship that was established in Section 10.3 between set covering and pairwise resynchronization. Given an arbitrary algorithm COVER that solves the set-covering problem, and given an instance of pairwise resynchronization that consists of two SCCs C1 and C2, and a set S of feedforward synchronization edges directed from members of C1 to members of C2, this heuristic framework first computes the subset

    χ((u, v)) = {e ∈ S | ρ(src(e), u) + ρ(v, snk(e)) ≤ delay(e)}

for each ordered pair of actors (u, v) that is contained in the set

    T ≡ {(u', v') | u' is in C1 and v' is in C2},

and then applies the algorithm COVER to the instance of set covering defined by the set S together with the family of subsets {χ((u', v')) | (u', v') ∈ T}. A resynchronization for the given instance of pairwise resynchronization can then be derived by taking {d0(u, v) | χ((u, v)) is a member of the cover returned by COVER}.
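Expressed as code, the framework looks roughly like the following sketch; the set-covering routine is passed in as a parameter, and the names and data structures are hypothetical.

    def pairwise_resynchronize(S, C1, C2, rho, delay, cover_algorithm):
        # S: feedforward synchronization edges (src, snk) directed from C1 to C2.
        # rho[a][b]: minimum path delay from a to b (float("inf") if no path).
        # cover_algorithm(universe, subsets) returns the keys of a subfamily
        # of `subsets` whose union covers `universe`.
        chi = {}
        for u in C1:
            for v in C2:
                chi[(u, v)] = {e for e in S
                               if rho[e[0]][u] + rho[v][e[1]] <= delay[e]}
        chosen = cover_algorithm(set(S), chi)       # ordered pairs (u, v) forming a cover of S
        return [(u, v, 0) for (u, v) in chosen]     # zero-delay resynchronization edges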
The resynchronization derived in this manner is the solution returned by the heuristic framework. From the correspondence between set covering and pairwise resynchronization that is outlined in Section 10.3, it follows that the quality of a resynchronization obtained by the heuristic framework is determined entirely by the quality of the solution computed by the set-covering algorithm that is employed; that is, if the solution computed by COVER is X% worse (X% more subfamilies) than an optimal set-covering solution, then the resulting resynchronization will be X% worse (X% more synchronization edges) than an optimal resynchronization of the given instance of pairwise resynchronization.

The application of the heuristic framework for pairwise resynchronization to each pair of SCCs, in some arbitrary order, in a general synchronization graph yields a heuristic framework for the general resynchronization problem. However, a major limitation of this extension to general synchronization graphs arises from its inability to consider resynchronization opportunities that involve paths that traverse more than two SCCs, and paths that contain more than one feedforward synchronization edge. Thus, in general, the quality of the solutions obtained by this approach will be worse than the quality of the solutions that are derived by the particular set-covering heuristic that is employed, and roughly, this discrepancy can be expected to increase as the number of SCCs increases relative to the number of synchronization edges in the original synchronization graph.

For example, Figure 10.3 shows the synchronization graph that results from a six-processor schedule of a synthesizer for plucked-string musical instruments in 11 voices based on the Karplus-Strong technique. Here, exc represents the excitation input, each vi represents the computation for the ith voice, and the actors marked with '+' signs specify adders. Execution time estimates for the actors are shown in the table at the bottom of the figure. In this example, the only pair of distinct SCCs that have more than one synchronization edge between them is the pair consisting of the SCC containing {exc, v1} and the SCC containing v2, v3, five addition actors, and the actor labeled out. Thus, the best result that can be derived from the heuristic extension for general synchronization graphs described above is a resynchronization that optimally rearranges the synchronization edges between these two SCCs in isolation, and leaves all other synchronization edges unchanged. Such a resynchronization is illustrated in Figure 10.4. This synchronization graph has a total of nine synchronization edges, which is only one less than the number of synchronization edges in the original graph. In contrast, it is shown in the following subsection that with a more flexible approach to resynchronization, the total synchronization cost of this example can be reduced to only five synchronization edges.
This subsection presents a more global approach to resynchronization, called Algorithm Global-resynchronize, which overcomes the major limitation of the pairwise approach discussed in Section 10.5.1. Algorithm Global-resynchronize is based on the simple greedy approximation algorithm for set covering that repeatedly selects a subset that covers the largest number of remaining elements, where a remaining element is an element that is not contained in any of the subsets that have already been selected. In [Joh74, Lov75] it is shown that this set-covering technique is guaranteed to compute a solution whose cardinality is no greater than (ln(|X|) + 1) times that of the optimal solution, where X is the set that is to be covered.

To adapt this set-covering technique to resynchronization, we construct an instance of set covering by choosing the set X, the set of elements to be covered, to be the set of feedforward synchronization edges, and choosing the family of subsets to be

    {χ((v1, v2)) | v1, v2 ∈ V and ρ_G(v2, v1) = ∞},      (10-10)

where G = (V, E) is the input synchronization graph. The constraint ρ_G(v2, v1) = ∞ in (10-10) ensures that inserting the resynchronization edge (v1, v2) does not introduce a cycle, and thus that it does not introduce deadlock or reduce the estimated throughput.

Figure 10.3. The synchronization graph that results from a six-processor schedule of a music synthesizer based on the Karplus-Strong technique.
Algorithm Global-resynchronize assumes that the input synchronization graph is reduced (a reduced synchronization graph can be derived efficiently, for example, by using the redundant synchronization removal technique discussed in the previous chapter). The algorithm determines the family of subsets specified by (10-10), chooses a member of this family that has maximum cardinality, inserts the corresponding delayless resynchronization edge, removes all synchronization edges that it subsumes, and updates the values ρ_G(x, y) for the new synchronization graph that results. This entire process is then repeated on the new synchronization graph, and it continues until it arrives at a synchronization graph for which the computation defined by (10-10) produces the empty set; that is, the algorithm terminates when no more resynchronization edges can be added. Figure 10.5 gives a pseudocode specification of this algorithm (with some straightforward modifications to improve the running time).

Figure 10.4. The synchronization graph that results from applying the heuristic framework based on pairwise resynchronization to the example of Figure 10.3.

Figure 10.5. Pseudocode specification of Algorithm Global-resynchronize. Input: a reduced synchronization graph G = (V, E); output: an alternative reduced synchronization graph that preserves G.
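A minimal sketch of the greedy procedure specified by Figure 10.5 is given below. It is illustrative only, not the exact pseudocode of the figure: the helper that recomputes minimum path delays over the full synchronization graph is assumed, and the loop stops when no candidate subsumes at least two synchronization edges.

    INF = float("inf")

    def global_resynchronize(V, sync_edges, delay, compute_min_delays):
        # V: actors; sync_edges: set of feedforward synchronization edges (src, snk);
        # delay[e]: delay of edge e; compute_min_delays(sync_edges, delay) returns
        # rho[a][b], the minimum path delay over the whole current synchronization
        # graph (intraprocessor edges, self loops, and synchronization edges).
        while True:
            rho = compute_min_delays(sync_edges, delay)
            best, best_subsumed = None, set()
            for v1 in V:
                for v2 in V:
                    if v1 == v2 or rho[v2][v1] != INF:
                        continue                    # edge (v1, v2) would create a cycle
                    subsumed = {e for e in sync_edges
                                if rho[e[0]][v1] + rho[v2][e[1]] <= delay[e]}
                    if len(subsumed) > len(best_subsumed):
                        best, best_subsumed = (v1, v2), subsumed
            if best is None or len(best_subsumed) < 2:
                return sync_edges                   # no candidate reduces the edge count
            sync_edges = (sync_edges - best_subsumed) | {best}
            delay[best] = 0                         # insert the delayless resynchronization edge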
To analyze the complexity of Algorithm Global-resynchronize, the following definition is useful.

Definition 10.3: Suppose that G is a synchronization graph. DC(G) denotes the number of distinct ordered vertex pairs (x, y) in G that satisfy ρ_G(x, y) = 0. That is,

    DC(G) = |S(G)|, where S(G) = {(x, y) | ρ_G(x, y) = 0}.      (10-11)
The following lemma shows that as long as the input synchronization graph is reduced, the resynchronization operations performed in Algorithm Global-resynchronize always yield a reduced synchronization graph.

Lemma 10.3: Suppose that G = (V, E) is a reduced synchronization graph; (x, y) is an ordered pair of vertices in G such that (x, y) ∉ E, ρ_G(y, x) = ∞, and |χ(x, y)| ≥ 1. Let G' denote the synchronization graph obtained by inserting d0(x, y) into G and removing all members of χ(x, y); that is, G' = (V, E'), where

    E' = (E - χ(x, y)) + {d0(x, y)}.

Then G' is a reduced synchronization graph. In other words, G' does not contain any redundant synchronizations. Furthermore, DC(G') > DC(G).
Proof: We prove the first part of this lemma by contraposition. Suppose that there exists a redundant synchronization edge s in G', and first suppose that s = (x, y). Then from Fact 10.1, there exists a path p in G' directed from x to y such that

    Delay(p) = 0 and p does not contain (x, y).      (10-12)

Also, observe that from the definition of E',

    E' - {d0(x, y)} ⊆ E.      (10-13)

It follows from (10-12) and (10-13) that G also contains the path p. Now let (x', y') be an arbitrary member of χ(x, y). Then

    ρ_G(src((x', y')), x) + ρ_G(y, snk((x', y'))) ≤ delay((x', y')).      (10-14)

Since G contains the path p, we have ρ_G(x, y) = 0, and thus, from the triangle inequality (3-4) together with (10-14),

    ρ_G(src((x', y')), snk((x', y'))) ≤ delay((x', y')).      (10-15)
We conclude that (x', y') is redundant in G, which violates the assumption that G is reduced.
If, on the other hand, s ≠ (x, y), then from Fact 10.1, there exists a simple path p_s ≠ (s) in G' directed from src(s) to snk(s) such that

    Delay(p_s) ≤ delay(s).      (10-16)

Also, it follows from (10-13) that G contains s. Since G is reduced, the path p_s must contain the edge (x, y) (otherwise s would be redundant in G). Thus, p_s can be expressed as a concatenation p_s = ((p1, ((x, y)), p2)), where either p1 or p2 may be empty, but not both. Furthermore, since p_s is a simple path, neither p1 nor p2 contains (x, y). Hence, from (10-13), we are guaranteed that both p1 and p2 are also contained in G. Now from (10-16), we have

    Delay(p1) + Delay(p2) ≤ delay(s).      (10-17)

Furthermore, from the definition of p1 and p2,

    ρ_G(src(s), x) ≤ Delay(p1) and ρ_G(y, snk(s)) ≤ Delay(p2).      (10-18)

Combining (10-17) and (10-18) yields

    ρ_G(src(s), x) + ρ_G(y, snk(s)) ≤ delay(s),      (10-19)

which implies that s ∈ χ(x, y). But this violates the assumption that G' does not contain any edges that are subsumed by (x, y) in G. This concludes the proof of the first part of Lemma 10.3.

It remains to be shown that DC(G') > DC(G). Now, from Lemma 10.1 and Definition 10.3, it follows that
    S(G) ⊆ S(G').      (10-20)

Also, from the first part of Lemma 10.3, which has already been proven, we know that G' is reduced. Thus, from Lemma 10.2, we have

    (x, y) ∉ S(G).      (10-21)

But, clearly from the construction of G', ρ_G'(x, y) = 0, and thus,

    (x, y) ∈ S(G').      (10-22)

From (10-20), (10-21), and (10-22), it follows that S(G) is a proper subset of S(G'). Hence, DC(G') > DC(G). QED.
Clearly from Lemma 10.3, each time Algorithm Global-resynchronize performs a resynchronization operation (an iteration of the while loop in Figure 10.5), the number of ordered vertex pairs (x, y) that satisfy ρ_G(x, y) = 0 is increased by at least one. Thus, the number of iterations of the while loop in Figure 10.5 is bounded above by |V|². The complexity of one iteration of the while loop is dominated by the computation in the pair of nested for loops, and the computation of one iteration of the inner loop is dominated by the time required to compute χ(x, y) for a specific actor pair (x, y). Assuming ρ_G(x', y') is available for all x', y' ∈ V, the time to compute χ(x, y) is O(s'), where s' is the number of feedforward synchronization edges in the current synchronization graph. Since the number of feedforward synchronization edges never increases from one iteration of the while loop to the next, it follows that the time-complexity of the overall algorithm is O(s|V|⁴), where s is the number of feedforward synchronization edges in the input synchronization graph. In practice, however, the number of resynchronization steps (while-loop iterations) is usually much lower than |V|² since the constraints on the introduction of cycles severely limit the number of resynchronization steps. Thus, the O(s|V|⁴) bound can be viewed as a very conservative estimate.
As specified in Figure 10.5, Algorithm Global-resynchronize proceeds as long as a resynchronization edge can be found that subsumes at least two existing synchronization edges. However, in general it may be advantageous to continue the resynchronization process even if each resynchronization candidate subsumes at most one synchronization edge. This is because although such a resynchronization candidate does not lead to an immediate reduction in synchronization cost, its insertion may lead to future resynchronization opportunities in which the number of synchronization edges can be reduced. Figures 10.6 and 10.7 illustrate a simple example. In the synchronization graph shown in Figure 10.6(a), there are 5 synchronization edges: (B, C), (D, F), (G, F), (A, E), and a fifth edge directed from A (see the figure). Self-loop edges incident to actors A, C, F, and one other actor (each of these four actors executes on a separate processor) are omitted from the illustration for clarity. It is easily verified that no resynchronization candidate in Figure 10.6(a) subsumes more than one synchronization edge. If we terminate the resynchronization process at this point, we must accept a synchronization cost of 5 synchronization edges.
However, suppose that we insert the resynchronization edge shown in Figure 10.6(b), which subsumes (B, C), and then remove the subsumed edge to arrive at the synchronization graph of Figure 10.6(b). In this graph, resynchronization candidates exist that subsume up to two synchronization edges each.
For example, insertion of the resynchronization edge (F, E) allows us to remove synchronization edges (G, F) and (A, E). The resulting synchronization graph, shown in Figure 10.7(a), contains only four synchronization edges.
Figure 10.6. An example in which inserting a resynchronization edge that subsumes only one existing synchronization edge eventually leads to a reduction in the total number of synchronizations.
Alternatively, from Figure 10.6(b), we could insert the resynchronization edge (C, E) and remove both (D, F) and (A, E). This gives us the synchronization graph of Figure 10.7(d), which also contains four synchronization edges.
Figure 10.7. A continuation of the example in Figure 10.6.
This is the solution derived by an actual implementation of Algorithm Global-resynchronize [BSL96b] when it is applied to the graph of Figure 10.6(a).
Figure 10.8 shows the optimized synchronization graph that is obtained when Algorithm Global-resynchronize is applied to the example of Figure 10.3 (using the implementation discussed in [BSL96b]). Observe that the total number of synchronization edges has been reduced from 10 to 5. The total number of "resynchronization steps" (number of while-loop iterations) required by the heuristic to complete this resynchronization is 7. Table 10.1 shows the relative throughput improvement delivered by the optimized synchronization graph of Figure 10.8 over the original synchronization graph as the shared memory access time varies from 1 to 10 processor clock cycles. The assumed synchronization protocol is WS, and the back-off time for each simulation is obtained by the experimental procedure discussed in Section 9.5.

Figure 10.8. The optimized synchronization graph that is obtained when Algorithm Global-resynchronize is applied to the example of Figure 10.3.
The second and fourth columns show the average iteration period for the original synchronization graph and the resynchronized graph, respectively. The average iteration period, which is the reciprocal of the average throughput, is the average number of time units required to execute an iteration of the synchronization graph. From the sixth column, we see that the resynchronized graph consistently attains a throughput improvement of 22% to 26%. This improvement includes the effect of reduced overhead for maintaining synchronization variables and reduced contention for shared memory. The third and fifth columns of Table 10.1 show the average number of shared memory accesses per iteration of the synchronization graph. Here we see that the resynchronized solution consistently obtains at least a 30% improvement over the original synchronization graph. Since accesses to shared memory typically require significant amounts of energy, particularly for a multiprocessor system that is not integrated on a single chip, this reduction in the average rate of shared memory accesses is especially useful when low power consumption is an important implementation issue.

    Shared memory access time    Decrease in iteration period
    1                            22%
    2                            26%
    3                            26%
    4                            26%
    5                            22%
    6                            24%
    7                            22%
    8                            22%
    9                            22%
    10                           24%

Table 10.1. Performance comparison between the resynchronized solution and the original synchronization graph for the example of Figure 10.3.
The simulation is written in C making use of a package called CSIM [Sch88] that allows concurrently running processes to be modeled. Each CSIM process is "created," after which it runs concurrently with the other processes in the simulation. Processes communicate and synchronize through events and mailboxes (which are FIFO queues of events between two processes). Time delays are specified by the function hold. Holding for an appropriate time causes the process to be put into an event queue, and the process "wakes up" when the simulation time has advanced by the amount specified by the hold statement. Passage of time is modeled in this fashion. In addition, CSIM allows specification of facilities, which can be accessed by only one process at a time. Mutual exclusion of access to shared resources is modeled in this fashion. For the multiprocessor simulation, each processor is made into a process, and synchronization is attained by sending and receiving messages from mailboxes. The shared bus is made into a facility. Polling of the mailbox for checking the presence of data is done by first reserving the bus, then checking for the message count on that particular mailbox; if the count is greater than zero, data can be read from shared memory, or else the processor backs off for a certain duration, and then resumes polling. When a processor sends data, it increments a counter in shared memory, and then writes the data value. When a processor receives, it first polls the corresponding counter, and if the counter is non-zero, it proceeds with the read; otherwise, it backs off for some time and then polls the counter again. Experimentally determined back-off times are used for each value of the memory access time. For a send, the processor checks if the corresponding buffer is full or not. For the simulation, all buffers are sized equal to 5; these sizes can of course be jointly minimized to reduce buffer memory. Polling time is defined as the time required to access the bus and check the counter value.
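The same polling protocol can be sketched with a discrete-event package in Python. The fragment below uses SimPy in place of CSIM, and all constants and process structures are hypothetical, so it is only an approximation of the simulation set-up described above.

    import simpy

    ACCESS_TIME, BACKOFF, BUFFER_SIZE = 2, 10, 5    # assumed values

    def receiver(env, bus, counters, ch):
        # Poll the counter for channel ch; back off while it is zero.
        while True:
            with bus.request() as req:              # reserve the shared bus
                yield req
                yield env.timeout(ACCESS_TIME)      # read the counter
                ready = counters[ch] > 0
            if ready:
                with bus.request() as req:
                    yield req
                    yield env.timeout(ACCESS_TIME)  # read the data value
                    counters[ch] -= 1
            else:
                yield env.timeout(BACKOFF)          # back off, then poll again

    def sender(env, bus, counters, ch):
        while True:
            yield env.timeout(20)                   # "compute", then send
            while counters[ch] >= BUFFER_SIZE:      # buffer full: back off
                yield env.timeout(BACKOFF)
            with bus.request() as req:
                yield req
                yield env.timeout(2 * ACCESS_TIME)  # write the counter, then the data
                counters[ch] += 1

    env = simpy.Environment()
    bus = simpy.Resource(env, capacity=1)           # the shared-bus facility
    counters = {0: 0}
    env.process(sender(env, bus, counters, 0))
    env.process(receiver(env, bus, counters, 0))
    env.run(until=500)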
In this section, it is shown that although optimal resynchronization is intractable for general synchronization graphs, a broad class of synchronization graphs exists for which optimal resynchronizations can be computed using an efficient polynomial-time algorithm.
Definition: Suppose that C is an SCC in a synchronization graph G, and x is an actor in C. We say that x is an input hub of C if for each feedforward synchronization edge e in G whose sink actor is in C, we have ρ_C(x, snk(e)) = 0. Similarly, x is an output hub of C if for each feedforward synchronization edge e in G whose source actor is in C, we have ρ_C(src(e), x) = 0. We say that C is linkable if there exist actors x, y in C such that x is an input hub, y is an output hub, and ρ_C(x, y) = 0. A synchronization graph is chainable if each SCC is
linkable. For example, consider the SCC in Figure 10.9(a), and assume that the dashed edges represent the synchronization edges that connect this SCC with other SCCs. This SCC has exactly one input hub, actor A, and exactly one output hub, actor F, and since ρ(A, F) = 0, it follows that the SCC is linkable. However, if we remove the edge (C, F), then the resulting graph (shown in Figure 10.9(b)) is not linkable since it does not have an output hub. A class of linkable SCCs that occur commonly in practical synchronization graphs are those SCCs that correspond to only one processor, such as the SCC shown in Figure 10.9(c). In such cases, the first actor executed on the processor is always an input hub and the last actor executed is always an output hub. In the remainder of this section, we assume that for each linkable SCC, an input hub x and output hub y are selected such that ρ(x, y) = 0, and these actors are referred to as the selected input hub and the selected output hub of the associated SCC. Which input hub and output hub are chosen as the selected ones makes no difference to our discussion of the techniques in this section as long as they are selected so that ρ(x, y) = 0. An important property of linkable synchronization graphs is that if C1 and C2 are distinct linkable SCCs, then all synchronization edges directed from C1 to C2 are subsumed by the single ordered pair (l1, l2), where l1 denotes the selected output hub of C1 and l2 denotes the selected input hub of C2. Furthermore, if there exists a path between two SCCs C1', C2' of the form ((o1, i2), (o2, i3), ..., (o_{n-1}, i_n)), where o1 is the selected output hub of C1', i_n is the selected input hub of C2', and there exist distinct SCCs, none equal to C1' or C2', whose selected input hubs and selected output hubs are, respectively, i_k and o_k for k = 2, 3, ..., (n - 1), then all synchronization edges between C1' and C2' are redundant.
Figure 10.9. An illustration of input and output hubs for synchronization graphs.

From these properties, an optimal resynchronization for a chainable synchronization graph can be constructed efficiently by computing a topological sort of the SCCs, instantiating a zero-delay synchronization edge from the selected output hub of the ith SCC in the topological sort to the selected input hub of the (i+1)th SCC, for i = 1, 2, ..., (n - 1), where n is the total number of SCCs, and then removing all of the redundant synchronization edges that result. For example, if this algorithm is applied to the chainable synchronization graph of Figure 10.10(a), then the synchronization graph of Figure 10.10(b) is obtained, and the number of synchronization edges is reduced from 4 to 2. This chaining technique can be viewed as a form of pipelining, where each SCC in the output synchronization graph corresponds to a pipeline stage. As discussed in Chapter 5, pipelining can be used to increase the throughput in multiprocessor DSP implementations through improved parallelism. However, in the form of pipelining that is associated with chainable synchronization graphs, the load of each processor is unchanged, and the estimated throughput is not affected (since no new cyclic paths are introduced), and thus, the benefit to the overall throughput of the chaining technique arises chiefly from the optimal reduction of synchronization overhead. The time-complexity of the optimal algorithm discussed above for resynchronizing chainable synchronization graphs is O(v²), where v is the number of synchronization graph actors.
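A sketch of this chaining procedure is given below; it assumes, hypothetically, that the SCC decomposition, the selected hubs, and a redundant-synchronization removal routine (such as the one from the previous chapter) are available as helpers.

    def chain_resynchronize(sccs, selected_input_hub, selected_output_hub,
                            topological_sort, remove_redundant_edges, graph):
        # sccs: the SCCs of the chainable synchronization graph.
        # selected_input_hub[c] / selected_output_hub[c]: the selected hubs of SCC c.
        order = topological_sort(sccs, graph)
        for i in range(len(order) - 1):
            src_hub = selected_output_hub[order[i]]
            snk_hub = selected_input_hub[order[i + 1]]
            graph.add_edge(src_hub, snk_hub, delay=0)   # zero-delay synchronization edge
        return remove_redundant_edges(graph)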
It is easily verified that the original synchronization graph for the music synthesis example of Section 10.5.2, shown in Figure 10.3, is chainable. Thus, the chaining technique presented in Section 10.6.1 is guaranteed to produce an optimal resynchronization for this example, and since no feedback synchronization edges are present, the number of synchronization edges in the resynchronized solution is guaranteed to be equal to one less than the number of SCCs in the original synchronization graph; that is, the optimized synchronization graph contains 6 - 1 = 5 synchronization edges. From Figure 10.8, we see that this is precisely the number of synchronization edges in the synchronization graph that results from the implementation of Algorithm Global-resynchronize that was discussed in Section 10.5.2.

Figure 10.10. An illustration of an algorithm for optimal resynchronization of chainable synchronization graphs. The dashed edges are synchronization edges.

However, Algorithm Global-resynchronize does not always produce optimal results for chainable synchronization graphs. For example, consider the synchronization graph shown in Figure 10.11(a), which corresponds to an eight-processor schedule in which each of several subsets of actors, including {I}, {J}, {G, K}, and {B}, is assigned to a separate processor. The dashed edges are synchronization edges, and the solid edges connect actors that are assigned to the same processor. The total number of synchronization edges is 14. Now it is easily verified that actor K is both an input hub and an output hub for the SCC {C, G, H, J, K}, and similarly, actor L is both an input and output hub for the SCC {A, D, E, F, L}. Thus, we see that the overall synchronization graph is chainable. It is easily verified that the chaining technique developed in Section 10.6.1 uniquely yields the optimal resynchronization illustrated in Figure 10.11(b), which contains only 11 synchronization edges.
In contrast, the quality of the resynchronization obtained for Figure 10.11(a) by Global-resynchronize depends on the order in which the actors are traversed by each of the two nested loops in Figure 10.5. For example, if both loops traverse the actors in alphabetical order, then Global-resynchronize obtains the sub-optimal solution shown in Figure 10.11(c), which contains 12 synchronization edges. However, actor traversal orders exist for which Global-resynchronize achieves optimal resynchronizations of Figure 10.11(a); if both loops traverse the actors in one such order, then Global-resynchronize yields the same resynchronized graph that is computed uniquely by the chaining technique of Section 10.6.1 (Figure 10.11(b)). It is an open question whether or not, given an arbitrary chainable synchronization graph, actor traversal orders always exist with which Global-resynchronize arrives at optimal resynchronizations. Furthermore, even if such traversal orders are always guaranteed to exist, it is doubtful that they can, in general, be computed efficiently.
The chaining technique developed in Section 10.6.1 can be generalized to optimally resynchronize a somewhat broader class of synchronization graphs. This class consists of all synchronization graphs for which each source SCC has an output hub (but not necessarily an input hub), each sink SCC has an input hub (but not necessarily an output hub), and each internal SCC is linkable. In this generalization, the internal SCCs are pipelined as in the previous algorithm, and then for each source SCC, a synchronization edge is inserted from one of its output hubs to the selected input hub of the first SCC in the pipeline of internal SCCs, and for each sink SCC, a synchronization edge is inserted to one of its input hubs from the selected output hub of the last SCC in the pipeline of internal SCCs. If there are no internal SCCs, then the sink SCCs are pipelined by selecting one input hub from each SCC, and joining these input hubs with a chain of synchronization edges. Then a synchronization edge is inserted from an output hub of each source SCC to an input hub of the first SCC in the chain of sink SCCs.

Figure 10.11. A chainable synchronization graph for which Algorithm Global-resynchronize fails to produce an optimal solution.
In addition to guaranteed optimality, another important advantage of the chaining technique for chainable synchronization graphs is its relatively low time-complexity (O(v²) versus O(s v⁴) for Global-resynchronize), where v is the number of synchronization graph actors, and s is the number of feedforward synchronization edges. The primary disadvantage is, of course, its restricted applicability. An obvious solution is to first check if the general form of the chaining technique (described above in Section 10.6.3) can be applied, apply the chaining technique if the check returns an affirmative result, or apply Algorithm Global-resynchronize if the check returns a negative result. The check must determine whether or not each source SCC has an output hub, each sink SCC has an input hub, and each internal SCC is linkable. This check can be performed in O(n³) time, where n is the number of actors in the input synchronization graph, using a straightforward algorithm. A useful direction for further investigation is a deeper integration of the chaining technique with Algorithm Global-resynchronize for general (not necessarily chainable) synchronization graphs.
Synchronization rearrangement has also been studied in the context of minimizing synchronization overhead for hardware synthesis of digital circuitry, and significant differences in the models prevent these techniques from applying to the context of self-timed HSDFG implementation. In the graphical hardware model used in that work, the constraint graph model, each vertex corresponds to a separate hardware device and edges have arbitrary weights that specify sequencing constraints. When the source vertex has bounded execution time, a positive weight w(e) (forward constraint) imposes the constraint

    start(snk(e)) ≥ w(e) + start(src(e)),      (10-24)

while a negative weight (backward constraint) implies

    start(snk(e)) ≤ w(e) + start(src(e)).      (10-25)
If the source vertex has unbounded execution time, the forward and backward constraints are relative to the completion time of the source vertex. In contrast, in the synchronization graph model, multiple actors can reside on the same processing element (implying zero synchronization cost between them), and the timing constraints always correspond to the case where w(e) is positive and equal to the execution time of src(e). The implementation models, and associated implementation cost functions, are also significantly different. A constraint graph is implemented using a scheduling technique called relative scheduling, which can roughly be viewed as intermediate between self-timed and static scheduling. In relative scheduling, the constraint graph vertices that have unbounded execution time, called anchors, are used as reference points against which all other vertices are scheduled: for each vertex v, an offset f_i is specified for each anchor a_i that affects the activation of v, and v is scheduled to occur once f_i clock cycles have elapsed from the completion of a_i, for each i. In the implementation of a relative schedule, each anchor has attached control circuitry that generates offset signals, and each vertex has a synchronization circuit that asserts an activate signal when all relevant offset signals are present. The resynchronization optimization is driven by a cost function that estimates the total area of the synchronization circuitry, where the offset circuitry area estimate for an anchor is a function of the maximum offset, and the synchronization circuitry estimate for a vertex is a function of the number of offset signals that must be monitored. As a result of the significant differences in both the scheduling models and the implementation models, the techniques developed for resynchronizing constraint graphs do not extend in any straightforward manner to the resynchronization of synchronization graphs for self-timed multiprocessor implementation, and the solutions that we have discussed for synchronization graphs are significantly different in structure from those reported in [F 92]. For example, the fundamental relationships that have been established between set covering and the resynchronization of self-timed HSDFG schedules have not emerged in the context of constraint graphs.
This chapter has discussed a post-optimization called resynchronization for self-timed, multiprocessor implementations of DSP algorithms. The goal of resynchronization is to introduce new synchronizations in such a way that the
number of additional synchronizations that become redundant exceeds the number of new synchronizations that are added, and thus the net synchronization cost is reduced. It was shown that optimal resynchronization is intractable by deriving a reduction from the classic set-covering problem. However, a broad class of systems was defined for which optimal resynchronization can be performed in polynomial time. This chapter also discussed a heuristic algorithm for resynchronization of general systems that emerges naturally from the correspondence to set covering. The performance of an implementation of this heuristic was demonstrated on a multiprocessor schedule for a music synthesis system. The results demonstrate that the heuristic can efficiently reduce synchronization overhead and improve throughput significantly.
Chapter 10 introduced the concept of resynchronization, a post-optimization for static multiprocessor schedules in which extraneous synchronization operations are introduced in such a way that the number of original synchronizations that consequently become redundant significantly exceeds the number of additional synchronizations. Redundant synchronizations are synchronization operations whose corresponding sequencing requirements are enforced completely by other synchronizations in the system. The amount of run-time overhead required for synchronization can be reduced significantly by eliminating redundant synchronizations [Sha89, BSL97]. Thus, effective resynchronization reduces the net synchronization overhead in the implementation of a multiprocessor schedule, and improves the overall throughput. However, since additional serialization is imposed by the new synchronizations, resynchronization can produce a significant increase in latency. In Chapter 10, we discussed fundamental properties of resynchronization and we studied the problem of optimal resynchronization under the assumption that arbitrary increases in latency can be tolerated ("maximum-throughput resynchronization"). Such an assumption is valid, for example, in a wide variety of simulation applications. This chapter discusses the problem of computing an optimal resynchronization among all resynchronizations that do not increase the latency beyond a prespecified upper bound Lmax. This study of resynchronization is based in the context of self-timed execution of iterative dataflow specifications, which is an implementation model that has been applied extensively for digital signal processing systems. Latency constraints become important in interactive applications such as video conferencing, games, and telephony, where latency beyond a certain point becomes annoying to the user. This chapter demonstrates how to obtain the bene-
fits of resynchronization while maintaining a specified latency constraint.
This section introduces a number of useful properties that pertain to the process by which resynchronization can make certain synchronization edges in the original synchronization graph become redundant. The following definition is fundamental to these properties.

Definition 11.1: If G is a synchronization graph, s is a synchronization edge in G, R is a resynchronization of G, and s is not contained in R, then we say that R eliminates s. If R eliminates s, s' ∈ R, and there is a path p from src(s) to snk(s) in Ψ(R, G) such that p contains s' and Delay(p) ≤ delay(s), then we say that s' contributes to the elimination of s.
A synchronization edge s can be eliminated if a resynchronization creates a path p from src(s) to snk(s) such that Delay(p) ≤ delay(s). In general, the path p may contain more than one resynchronization edge, and thus, it is possible that none of the resynchronization edges allows us to eliminate s "by itself". In such cases, it is the contribution of all of the resynchronization edges within the path p that enables the elimination of s. This motivates the choice of terminology in Definition 11.1. An example is shown in Figure 11.1. The following two facts follow immediately from Definition 11.1.

Fact 11.1: Suppose that G is a synchronization graph, R is a resynchronization of G, and r is a resynchronization edge in R. If r does not contribute to the elimination of any synchronization edges, then (R - {r}) is also a resynchronization of G. If r contributes to the elimination of one and only one synchronization edge s, then (R - {r} + {s}) is a resynchronization of G.

Fact 11.2: Suppose that G is a synchronization graph, R is a resynchronization of G, s is a synchronization edge in G, and s' is a resynchronization edge in R such that delay(s') > delay(s). Then s' does not contribute to the elimination of s.
For example, let G denote the synchronization graph in Figure 11.2(a). Figure 11.2(b) shows a resynchronization R of G. In the resynchronized graph of Figure 11.2(b), the resynchronization edge (x4, y3) does not contribute to the elimination of any of the synchronization edges of G, and thus Fact 11.1 guarantees that R' = R - {(x4, y3)}, illustrated in Figure 11.2(c), is also a resynchronization of G. In Figure 11.2(c), it is easily verified that (x5, y4) contributes to the elimination of exactly one synchronization edge, the edge (x5, y5), and from Fact 11.1, we have that R'' = R' - {(x5, y4)} + {(x5, y5)}, illustrated in Figure 11.2(d), is also a resynchronization of G.
Figure 11.1. An illustration of Definition 11.1. Here each processor executes a single actor. A resynchronization of the synchronization graph in (a) is illustrated in (b). In this resynchronization, the resynchronization edges (V, X) and (X, W) both contribute to the elimination of (V, W).
As discussed in Section 10.2, resynchronization cannot decrease the estimated throughput since it manipulates only the feedforward edges of a synchronization graph. Frequently in real-time DSP systems, latency is also an important issue, and although resynchronization does not degrade the estimated throughput, it generally does increase the latency. This section defines the latency-constrained resynchronization problem for self-timed multiprocessor systems.
Figure 11.2. Properties of resynchronization.
Definition 11.2: Suppose G0 is an application graph, G is a synchronization graph that results from a multiprocessor schedule for G0, x is an execution source in G (an actor that has no input edges or has nonzero delay on each input edge), and y is an actor in G other than x. We define the latency by L_G(x, y) ≡ end(y, 1 + ρ_G0(x, y)). We refer to x as the latency input associated with this measure of latency, and we refer to y as the latency output. (Recall from Chapter 4 that start(v, k) and end(v, k) denote the times at which invocation k of actor v commences and completes execution; note also that start(x, 1) = 0 since x is an execution source.)
Intuitively, the latency is the time required for the first invocation of the latency input to influence the associated latency output, and thus the latency corresponds to the critical path in the dataflow implementation to the first output invocation that is influenced by the input. This interpretation of the latency as the critical path is widely used in VLSI signal processing [Kun88, Mad95]. In general, the latency can be computed by performing a simple simulation of the ASAP execution for G through the (1 + ρ_G0(x, y))th execution of y. Such a simulation can be performed as a functional simulation of an HSDFG that has the same topology (vertices and edges) as G, and that maintains the simulation time of each processor in the values of data tokens. Each initial token (delay) is initialized to have the value 0, since these tokens are all present at time 0. Then, a data-driven simulation is carried out. In this simulation, an actor may execute whenever it has sufficient data, and the value of the output token produced by the invocation of any actor z in the simulation is given by

    max({{v1, v2, ..., vn}}) + t(z),
where {{v1, v2, ..., vn}} is the set of token values consumed during the actor execution. In such a simulation, the ith token value produced by an actor z gives the completion time of the ith invocation of z in the ASAP execution of G. Thus, the latency can be determined as the value of the (1 + ρ_G0(x, y))th output token produced by y. With careful implementation of the functional simulator described above, the latency can be determined in O(d × max(|V|, S)) time, where d = 1 + ρ_G0(x, y) and S denotes the number of synchronization edges in G. The simulation approach described above is similar to approaches described in [TTL95]. For a broad class of synchronization graphs, latency can be analyzed even more efficiently during resynchronization. This is the class of synchronization graphs in which the first invocation of the latency output is influenced by the first invocation of the latency input. Equivalently, it is the class of graphs that contain at least one delayless path in the corresponding application graph directed from
the latency input to the latency output. For transparent synchronization graphs, we can directly apply well-known longest-path based techniques for computing latency. 1.3: Suppose that Go is an application graph, x is a source actor in an actor in Go that is not identical to x .If pc,(x, y ) = 0 , then we t with respect to latency input x and latency output y
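The data-driven computation of the latency translates into a short recurrence: end(z, k) is the maximum, over the input edges of z, of the completion time of the corresponding earlier invocation of the source actor, plus t(z). The sketch below (with hypothetical data structures) computes L_G(x, y) this way.

    def latency(in_edges, exec_time, y, d):
        # in_edges[z]: list of (w, delta) pairs, one per edge w -> z carrying delta delays;
        # exec_time[z]: execution time t(z); d = 1 + rho_G0(x, y).
        memo = {}
        def end(z, k):
            # Completion time of the k-th invocation of z in the ASAP execution.
            if k <= 0:
                return 0                       # initial tokens are present at time 0
            if (z, k) not in memo:
                start = max((end(w, k - delta) for (w, delta) in in_edges[z]),
                            default=0)         # z fires when all of its inputs are available
                memo[(z, k)] = start + exec_time[z]
            return memo[(z, k)]
        return end(y, d)                       # L_G(x, y) = end(y, 1 + rho_G0(x, y))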
n graph that corresponds to a multiprocessor schedule for G,,,we also say that G is t ~ ~ ~ s If a synchronization graph is transp~entwith respect to a latency input/ output pair, thenthe latency can be computedefficiently using longest pathcalculations on an acyclic graph that is derived from the input synchroni~ationgraph G . This acyclic graph, which we call the jci( G ) , is constructed by removing all edges from G that have nonzero-delay; adding a vertex V , which represents the beginning of execution; setting t(v) = 0 ;and adding delayless edges from V to each source actor (other than V )of the partial construction until the only source actor that remains is V .Figure 11.3 illustrates the derivation of fi(G) , Given two vertices x and y in fci( G) such that there is a path in B(C) from x to y ,we denote the sum of the execution times along a path from x to y that has maximum cumulative execution timeby 7'j(G)(x,y ) .That is,
Figure 11.3. An example usedto illustrate the construction of $(G) .The graphon the b o ~ o mis $(G) if G is the top graph.
7&)(X,
y)
=
t ( z ) ( p is a pathfrom x to y in$(G))
mm( p traverses z
If there is no path from x to y, then we define T_fi(G)(x, y) to be -∞. Note that for all x, y, T_fi(G)(x, y) < +∞, since fi(G) is acyclic. The values T_fi(G)(x, y) for all pairs x, y can be computed in O(n³) time, where n is the number of actors in G, by using a simple adaptation of the Floyd-Warshall algorithm described in Section 3.13.3.
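A sketch of this computation is shown below: the first routine builds fi(G) from an edge list, and the second fills in T_fi(G)(x, y) for all pairs with a Floyd-Warshall-style longest-path recurrence. The data structures and names are hypothetical.

    NEG_INF = float("-inf")

    def first_iteration_graph(actors, edges, exec_time):
        # edges: list of (src, snk, delay). Keep only delayless edges, then add a
        # zero-execution-time vertex "v" with delayless edges to every remaining source.
        fi_edges = [(a, b) for (a, b, delay) in edges if delay == 0]
        has_pred = {b for (_, b) in fi_edges}
        fi_actors = list(actors) + ["v"]
        exec_time = dict(exec_time, v=0)
        fi_edges += [("v", a) for a in actors if a not in has_pred]
        return fi_actors, fi_edges, exec_time

    def longest_path_times(fi_actors, fi_edges, exec_time):
        # T[x][y] = maximum cumulative execution time over paths from x to y in fi(G).
        T = {x: {y: NEG_INF for y in fi_actors} for x in fi_actors}
        for x in fi_actors:
            T[x][x] = exec_time[x]
        for (a, b) in fi_edges:
            T[a][b] = max(T[a][b], exec_time[a] + exec_time[b])
        for k in fi_actors:
            for x in fi_actors:
                for y in fi_actors:
                    if T[x][k] > NEG_INF and T[k][y] > NEG_INF:
                        # t(k) is counted in both segments; subtract one copy
                        T[x][y] = max(T[x][y], T[x][k] + T[k][y] - exec_time[k])
        return T

For a transparent graph, the latency can then be read off from the computed table, as established by the theorem below.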
Lemma: Suppose that G0 is an HSDFG that is transparent with respect to latency input x and latency output y, Gs is the synchronization graph that results from a multiprocessor schedule for G0, and G is a resynchronization of Gs. Then ρ_G(x, y) = 0, and thus T_fi(G)(x, y) ≥ 0 (i.e., T_fi(G)(x, y) ≠ -∞).

Proof: Since G0 is transparent, there is a delayless path p in G0 from x to y. Let (u1, u2, ..., un), where x = u1 and y = un, denote the sequence of actors traversed by p. From the semantics of the HSDFG G0, it follows that for 1 ≤ i ≤ (n - 1), the edge (ui, ui+1) is delayless, and thus ρ_Gs(ui, ui+1) = 0, since each such edge corresponds either to a delayless intraprocessor edge or to a delayless synchronization edge in Gs. Since the resynchronization G preserves Gs, it follows that ρ_G(ui, ui+1) = 0 for each i, and hence ρ_G(x, y) = 0. QED.
The following theorem gives an efficient means for computing the latency L_G for transparent synchronization graphs.

Theorem: Suppose that G is a synchronization graph that is transparent with respect to latency input x and latency output y. Then

    L_G(x, y) = T_fi(G)(v, y).
Proof: By induction, we show that for every actor w in fi(G),

    end(w, 1) = T_fi(G)(v, w),      (11-3)

which clearly implies the desired result. First, let mt(w) denote the maximum number of actors that are traversed by a path in fi(G) (over all paths in fi(G)) that starts at v and terminates at w. If mt(w) = 1, then clearly w = v. Since both the LHS and RHS of (11-3) are identically equal to t(v) = 0 when w = v, we have that (11-3) holds whenever mt(w) = 1. Now suppose that (11-3) holds whenever mt(w) ≤ k, for some k ≥ 1, and consider the scenario mt(w) = k + 1. Clearly, in the self-timed (ASAP) execution of G, invocation w1, the first invocation of w, commences as soon as all invocations in the set
    Z = {z1 | z ∈ Pw}

have completed execution, where z1 denotes the first invocation of actor z, and Pw is the set of predecessors of w in fi(G). All members z ∈ Pw satisfy mt(z) ≤ k, since otherwise mt(w) would exceed (k + 1). Thus, from the induction hypothesis, we have

    start(w, 1) = max(end(z, 1) | z ∈ Pw) = max(T_fi(G)(v, z) | z ∈ Pw),      (11-4)

which implies that

    end(w, 1) = t(w) + max(T_fi(G)(v, z) | z ∈ Pw).

But the RHS of (11-4) is clearly equal to T_fi(G)(v, w) - t(w), by definition of T_fi(G), and thus we have that end(w, 1) = T_fi(G)(v, w). We have shown that (11-3) holds for mt(w) = 1, and that whenever it holds for mt(w) = k ≥ 1, it must hold for mt(w) = (k + 1). Thus, (11-3) holds for all values of mt(w).

In the context of resynchronization, the main benefit of transparent synchronization graphs is that the change in latency induced by adding a new synchronization edge (a "resynchronization operation") can be computed in O(1) time, given T_fi(G)(a, b) for all actor pairs (a, b). We will discuss this further in Section 11.5. Since many practical application graphs contain delayless paths from input to output and these graphs admit a particularly efficient means for computing latency, the first implementation of latency-constrained resynchronization was targeted to the class of transparent synchronization graphs [BSL96a]. However, the overall resynchronization framework described in this chapter does not depend on any particular method for computing latency, and thus, it can be fully applied to general graphs (with a moderate increase in complexity) using the ASAP simulation approach mentioned above. This framework can also be applied to subclasses of synchronization graphs other than transparent graphs for which efficient techniques for computing latency are discovered.

Definition: An instance of the latency-constrained resynchronization problem consists of a synchronization graph G with latency input x and latency output y, and a latency constraint Lmax ≥ L_G(x, y). A solution to such an instance is a resynchronization R such that 1) L_{Ψ(R, G)}(x, y) ≤ Lmax, and 2) no resynchronization of G that results in a latency less than or equal to Lmax has smaller cardinality than R.
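As noted above, once the values T_fi(G)(a, b) are available, the effect of a candidate resynchronization edge on the latency can be checked in constant time. The following minimal sketch assumes, as in the theorem above, that the latency of a transparent graph equals T_fi(G)(v, y); the names are hypothetical.

    def latency_after_insertion(T, v, y, u, w):
        # T[a][b] = T_fi(G)(a, b), with float("-inf") when there is no path.
        # A delayless resynchronization edge (u, w) can only add new paths, so the
        # new latency is the larger of the old latency and the best path through (u, w).
        return max(T[v][y], T[v][u] + T[w][y])

    # A candidate (u, w) is acceptable under the latency constraint L_max exactly when
    # latency_after_insertion(T, v, y, u, w) <= L_max.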
Given a synchronization graph G with latency input x and latency output y, and a latency constraint Lmax, we say that a resynchronization R of G is a
latency-constrained resynchronization (LCR) if L_{Ψ(R, G)}(x, y) ≤ Lmax. Thus, the latency-constrained resynchronization problem is the problem of determining a minimal LCR.
Generally, resynchronization can be viewed as complementary to the Convert-to-SC-graph optimization defined in Chapter 10: resynchronization is performed first, followed by Convert-to-SC-graph. Under severe latency constraints, it may not be possible to accept the solution computed by Convert-to-SC-graph, in which case the feedforward edges that emerge from the resynchronized solution must be implemented with FFS. In such a situation, Convert-to-SC-graph can be attempted on the original (before resynchronization) graph to see if it achieves a better result than resynchronization without Convert-to-SC-graph. However, for transparent synchronization graphs that have only one source SCC and only one sink SCC, the latency is not affected by Convert-to-SC-graph, and thus, for such systems, resynchronization and Convert-to-SC-graph are fully complementary. This is fortunate since such systems arise frequently in practice. Trade-offs between latency and throughput have been studied by Potkonjak and Srivastava in the context of transformations for dedicated implementation of linear computations [PS94]. Because this work is based on synchronous implementations, it does not address the synchronization issues and opportunities that we encounter in the self-timed dataflow context.
This section shows that the latency-constrained resynchronization problem is NP-hard even for the very restricted subclass of synchronization graphs in which each SCC corresponds to a single actor, and all synchronization edges have zero delay. As with the maximum-throughput resynchronization problem discussed in Chapter 10, the intractability of this special case of latency-constrained resynchronization can be established by a reduction from set covering. To illustrate this reduction, we suppose that we are given the set X = {x1, x2, x3, x4} and the family of subsets T = {t1, t2, t3}, where t1 = {x1, x3}, t2 = {x1, x2}, and t3 = {x2, x4}. Figure 11.4 illustrates the instance of latency-constrained resynchronization that we derive from the instance of set covering specified by (X, T). Here, each actor corresponds to a single processor and the self-loop edge for each actor is not shown. The numbers beside the actors specify the actor execution times, and the latency constraint is Lmax = 103. In the graph of Figure 11.4, which we denote by G, the edges labeled ex1, ex2, ex3, ex4 correspond respectively to the members x1, x2, x3, x4 of the set X in the set-covering instance, and the vertex pairs (resynchronization candidates) (v, st1), (v, st2), (v, st3) correspond to the members of T. For each relation
xi ∈ tj, an edge exists that is directed from stj to sxi. The latency input and latency output are defined to be in and out, respectively, and it is assumed that G is transparent. The synchronization graph that results from an optimal resynchronization of G is shown in Figure 11.5, with redundant resynchronization edges removed. Since the resynchronization candidates (v, st1), (v, st3) were chosen to obtain the solution shown in Figure 11.5, this solution corresponds to the solution of (X, T) that consists of the subfamily {t1, t3}. A correspondence between
the set-covering instance (X,7') and the
Figure 11.4. An instance of latency-constrained resynchronization that is derived from an instance of the set-covering problem.
instance of latency-constrained resynchronization defined by Figure 11.4 arises from two properties of the construction described above: (xi in tj in the set-covering instance) is equivalent to ((v, stj) subsumes exi in G); and if R is an optimal LCR of G, then each resynchronization edge in R is of the form (v, sti), i in {1, 2, 3}, or of the form
(stj, sxi), where xi is not in tj.    (11-5)

Figure 11.5. The synchronization graph that results from a solution to the instance of latency-constrained resynchronization shown in Figure 11.4 (L_max = 103).
The first observation is immediately apparent from inspection of Figure 11.4. A proof of the second observation follows.
Proof of Observation 6: We must show that no other resynchronization edges can be contained in an optimal LCR of G. Figure 11.6 specifies arguments with which we can discard all possibilities other than those given in (11-5). In the matrix shown in Figure 11.6(a), each entry specifies an index into the list of arguments given in Figure 11.6(b). For each of the six categories of arguments, except for #6, the reasoning is either obvious or easily understood from inspection of Figure 11.4. A proof of argument #6 follows shortly within this same section. For example, edge (v, z) cannot be a resynchronization edge in R because the edge already exists in the original synchronization graph; an edge of the form (sxi, w) cannot be in R because there is a path in G from w to each sxi; (z, w) cannot be in R since otherwise there would be a path from in to out that traverses v, z, w, st1, sx1, and thus the latency would be increased to at least 204; (in, z) cannot be in R from Lemma 10.2 since rho_G(in, z) = 0; and (v, v) cannot be in R since otherwise there would be a delayless self loop. Three of the entries in Figure 11.6 point to multiple argument categories. For example, if xi is in tj, then (sxi, stj) introduces a cycle, and if xi is not in tj, then (sxi, stj) cannot be contained in R because it would increase the latency beyond L_max. The entries in Figure 11.6 marked OK are simply those that correspond to (11-5), and thus we have justified Observation 6.
In the proof of Observation 6, we deferred the proof of argument #6 for Figure 11.6. A proof of this argument follows.
Proof of Argument #6 in Figure 11.6: By contraposition, we show that (w, z) cannot contribute to the elimination of any synchronization edge of G, and thus, from Fact 11.1, it follows from the optimality of R that (w, z) is not in R. Suppose that (w, z) contributes to the elimination of some synchronization edge s. Then

rho_G(src(s), w) = rho_G(z, snk(s)) = 0,    (11-6)

where s satisfies (11-7). From the matrix in Figure 11.6, we see that no resynchronization edge can have z as the source vertex. Thus, snk(s) is in {z, out}. Now, if snk(s) = z, then s = (v, z), and thus from (11-6), there is a zero-delay path from v to w in G. However, the existence of such a path in G implies the existence of a path from in to out that traverses actors v, w, sti, sxi, which in turn implies that L_G(in, out) >= 104, and thus that R is not a valid LCR.
a. Assumes that xi is not in tj; otherwise 1 applies.
1. Exists in G.
2. Introduces a cycle.
3. Increases the latency beyond L_max.
4. rho_G(a1, a2) = 0 (Lemma 10.2).
5. Introduces a delayless self loop.
6. Proof is given below.
Figure 11.6. Arguments that support Observation 6.
On the other hand, if snk(s) = out, then src(s) is in {z, sx1, sx2, sx3, sx4}. Now, from (11-6), src(s) = z implies the existence of a zero-delay path from z to w in G, which implies the existence of a path from in to out that traverses v, w, z, st1, sx1, which in turn implies that L_max >= 204. On the other hand, if src(s) = sxi for some i, then since from Figure 11.6 there are no resynchronization edges that have an sxi as the source, it follows from (11-6) that there must be a zero-delay path in G from out to w. The existence of such a path, however, implies the existence of a cycle in G since rho_G(w, out) = 0. Thus, snk(s) = out implies that R is not an LCR. QED.
The following observation states that a resynchronization edge of the form (stj, sxi) contributes to the elimination of exactly one synchronization edge, which is the edge exi.
Observation 7: Suppose that R is an optimal LCR of G, and suppose that e = (stj, sxi) is a resynchronization edge in R, for some i in {1, 2, 3, 4}, j in {1, 2, 3} such that xi is not in tj. Then e contributes to the elimination of one and only one synchronization edge, which is exi.
Proof: Since R is an optimal LCR, we know that e must contribute to the elimination of at least one synchronization edge (from Fact 11.1). Let s be some synchronization edge such that e contributes to the elimination of s. Then

rho_R(G)(src(s), stj) = rho_R(G)(sxi, snk(s)) = 0.    (11-8)

Now from Figure 11.6, it is apparent that there are no resynchronization edges in R that have sxi or out as their source actor. Thus, from (11-8), snk(s) = sxi or snk(s) = out. Now, if snk(s) = out, then src(s) = sxk for some k != i, or src(s) = z. However, since no resynchronization edge has a member of {sx1, sx2, sx3, sx4} as its source, we must (from (11-8)) rule out src(s) = sxk. Similarly, if src(s) = z, then from (11-8) there exists a zero-delay path in R(G) from z to stj, which in turn implies that L_R(G)(in, out) >= 140. But this is not possible since the assumption that R is an LCR guarantees that L_R(G)(in, out) <= 103. Thus, we conclude that snk(s) != out, and thus, that snk(s) = sxi. Now (snk(s) = sxi) implies that (a) s = exi, or (b) s = (stk, sxi) for some k such that xi is in tk (recall that xi is not in tj, and thus, that k != j). If s = (stk, sxi), then from (11-8), rho_R(G)(stk, stj) = 0. It follows that for any member xl in tj, there is a zero-delay path in R(G) that traverses stk, stj, and sxl. Thus, s = (stk, sxi) does not hold since otherwise L_R(G)(in, out) >= 140. Thus, we are left with only possibility (a): s = exi. QED.
Now, suppose that we are given an optimal LCR R of G. From Observation 7 and Fact 11.1, we have that for each resynchronization edge (stj, sxi) in R, we can replace this resynchronization edge with exi and obtain another optimal LCR. Thus, from Observation 6, we can efficiently obtain an optimal LCR R' such that all resynchronization edges in R' are of the form (v, stj). For each xi in X such that

R' contains a resynchronization edge (v, stj) with xi in tj,    (11-9)

we have that exi is not in R'. This is because R' is assumed to be optimal, and thus R'(G) contains no redundant synchronization edges. For each xi in X for which (11-9) does not hold, we can replace exi with any (v, stj) that satisfies xi in tj, and since such a replacement does not affect the latency, we know that the result will be another optimal LCR for G. In this manner, if we repeatedly replace each exi that does not satisfy (11-9), then we obtain an optimal LCR R'' such that
each resynchronization edge in R'' is of the form (v, stj),    (11-10)

and for each xi in X, there exists a resynchronization edge (v, stj) in R'' such that xi is in tj.    (11-11)
It is easily verified that the set of synchronization edges eliminated by R'' is {exi | xi in X}. Thus, the set T'' = {tj | (v, stj) is a resynchronization edge in R''} is a cover for X, and the cost (number of synchronization edges) of the resynchronization R'' is (N - |X| + |T''|), where N is the number of synchronization edges in the original synchronization graph. Now, it is also easily verified (from Figure 11.4) that given an arbitrary cover T' for X, the resynchronization defined by
adding the delayless edges {(v, stj) | tj in T'} and removing the edges {exi | xi in X}    (11-12)

is also a valid LCR of G, and that the associated cost is (N - |X| + |T'|). Thus, it follows from the optimality of R'' that T'' must be a minimal cover for X, given the family of subsets T. To summarize, we have shown how from the particular instance (X, T) of set-covering, we can construct a synchronization graph G such that from a solution to the latency-constrained resynchronization problem instance defined by G, we can efficiently derive a solution to (X, T). This example of the reduction from set-covering to latency-constrained resynchronization is easily generalized to an arbitrary set-covering instance (X', T'). The generalized construction of the initial synchronization graph G is specified by the steps listed in Figure 11.7.
The main task in establishing a general correspondence between latency-constrained resynchronization and set-covering is generalizing Observation 6 to
apply to all constructions that follow the steps in Figure 11.7. This generalization is not conceptually difficult (although it is rather tedious) since it is easily verified that all of the arguments in Figure 11.6 hold for the general construction. Similarly, the reasoning that justifies converting an optimal LCR for the construction into an optimal LCR of the form implied by (11-10) and (11-11) extends in a straightforward fashion to the general construction.
Although latency-constrained resynchronization is NP-hard, the problem becomes tractable for systems that consist of only two processors, that is, synchronization graphs in which there are two SCCs and each SCC is a simple cycle. This reveals a pattern of complexity that is analogous to the classic nonpreemptive processor scheduling problem with deterministic execution times, in which the problem is also intractable for general systems, but an efficient greedy algorithm suffices to yield optimal solutions for two-processor systems in which the execution times of all tasks are identical [Jr.76]. However, for latency-constrained resynchronization, the tractability for two-processor systems does not depend on any
Figure 11.7. A procedure for constructing an instance I_lcr of latency-constrained resynchronization from an instance I_sc of set-covering such that a solution to I_lcr yields a solution to I_sc. The procedure instantiates actors v, w, z, in, and out together with their associated edges; an actor st for each t in T' and an actor sx for each x in X' (with the indicated execution times); the edges ex = d0(v, sx) and d0(sx, out) for each x in X'; the edge d0(w, st) for each t in T'; and, for each x in t, the edge d0(st, sx).
constraints on the task (actor) execution times. Two-processor optimality results in multiprocessor scheduling have also been reported in the context of a stochastic model for parallel computation in which tasks have random execution times and communication patterns [Nic89].
An instance of the two-processor latency-constrained resynchronization (2LCR) problem consists of: a set of source processor actors x1, x2, ..., xp with associated execution times {t(xi)}, such that each xi is the i-th actor scheduled on the processor that corresponds to the source SCC of the synchronization graph; a set of sink processor actors y1, y2, ..., yq with associated execution times {t(yi)}, such that each yi is the i-th actor scheduled on the processor that corresponds to the sink SCC of the synchronization graph; a set of non-redundant synchronization edges S = {s1, s2, ..., sn} such that for each si, src(si) is in {x1, x2, ..., xp} and snk(si) is in {y1, y2, ..., yq}; and a latency constraint L_max, which is a positive integer. A solution to such an instance is a minimal resynchronization R that satisfies L_R(G)(x1, yq) <= L_max. In the remainder of this section, we denote the synchronization graph corresponding to the generic instance of 2LCR by G~.
We assume that delay(si) = 0 for all si, and we refer to the subproblem that results from this restriction as delayless 2LCR. This section demonstrates an algorithm that solves the delayless 2LCR problem in O(N^2) time, where N is the number of vertices in G~. An extension of this algorithm to the general 2LCR problem (arbitrary delays can be present) is also given.
An efficient polynomial-time solution to delayless 2LCR can be obtained by reducing the problem to a special case of set-covering called interval covering, in which we are given an ordering w1, w2, ..., wN of the members of X (the set that must be covered), such that the collection of subsets T consists entirely of subsets of the form {wa, wa+1, ..., wb}, 1 <= a <= b <= N. Thus, while general set-covering involves covering a set from a collection of subsets, interval covering amounts to covering an interval from a collection of subintervals.

Interval covering can be solved in O(|X| |T|) time by a simple procedure that first selects the subset {w1, w2, ..., wb1}, where

b1 = max({b | w1, wb in t for some t in T});

then selects any subset of the form {wa2, wa2+1, ..., wb2}, a2 <= b1 + 1, where

b2 = max({b | wb1+1, wb in t for some t in T});

then selects any subset of the form {wa3, wa3+1, ..., wb3}, a3 <= b2 + 1, where

b3 = max({b | wb2+1, wb in t for some t in T});

and so on until bn = N.
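As a concrete illustration of this greedy procedure, the following Python sketch (ours, not from the original presentation) represents each subset by the index pair (a, b) of its endpoints in the ordering w1, ..., wN.

def interval_cover(n, intervals):
    # intervals: list of (a, b) pairs with 1 <= a <= b <= n, each denoting
    # the subset {w_a, w_{a+1}, ..., w_b}.  Returns a list of chosen pairs
    # covering {w_1, ..., w_n}, or None if no cover exists.
    cover = []
    next_uncovered = 1
    while next_uncovered <= n:
        # among subsets that reach the first uncovered element, pick one
        # that extends furthest to the right (the "max" rule above)
        best = None
        for (a, b) in intervals:
            if a <= next_uncovered <= b and (best is None or b > best[1]):
                best = (a, b)
        if best is None:
            return None        # the interval cannot be covered
        cover.append(best)
        next_uncovered = best[1] + 1
    return cover

# Example: {w_1, ..., w_8} is covered by the first and third subintervals.
print(interval_cover(8, [(1, 4), (3, 6), (5, 8)]))    # [(1, 4), (5, 8)]

The outer loop executes at most |X| times and each pass scans the subset list once, which gives the O(|X| |T|) bound quoted above.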
To relate 2LCR to interval covering, we start with the following observations.
Observation 8: Suppose that R is a resynchronization of G~, r is in R, and r contributes to the elimination of synchronization edge s. Then r subsumes s. Thus, the set of synchronization edges that r contributes to the elimination of is simply the set of synchronization edges that are subsumed by r.
Proof: This follows immediately from the restriction that there can be no resyn-
chronization edges directed from a yj to an xi (feedforward resynchronization): in the resynchronized graph R(G~), there can be at most one synchronization edge in any path directed from src(s) to snk(s). QED.
Observation 9: If R is a resynchronization of G~, then

L_R(G~)(x1, yq) = max({t_pred(src(s')) + t_succ(snk(s')) | s' in R}),

where t_pred(xi) = t(x1) + t(x2) + ... + t(xi) for i = 1, 2, ..., p, and t_succ(yi) = t(yi) + t(yi+1) + ... + t(yq) for i = 1, 2, ..., q.
Proof: Given a synchronization edge (xa, yb) in R, there is exactly one delayless path from x1 to yq that contains (xa, yb), and the set of vertices traversed by this path is {x1, x2, ..., xa, yb, yb+1, ..., yq}. The desired result follows immediately. QED.
Now, corresponding to each of the source processor actors xi that satisfies t_pred(xi) + t(yq) <= L_max, we define an ordered pair of actors (a "resynchronization candidate") by
vi = (xi, yj), where j is the smallest sink index satisfying t_pred(xi) + t_succ(yj) <= L_max.    (11-13)

Consider the example shown in Figure 11.8. Here, we assume that t(z) = 1 for each actor z, and L_max = 10. From (11-13), we can compute the resynchronization candidates vi for this example.
If vi exists for a given xi, then d0(vi) can be viewed as the best resynchronization edge that has xi as the source actor, and thus, to construct an optimal LCR, we can select the set of resynchronization edges entirely from among the vi's. This is established by the following two observations.
Observation 10: Suppose that R is an LCR of G~, and suppose that (xa, yb) is a delayless synchronization edge in R such that (xa, yb) != va. Then (R - {(xa, yb)} + {d0(va)}) is an LCR of G~.
Proof: Let va = (xa, yc) and R' = (R - {(xa, yb)} + {d0(va)}), and observe that va exists, since t_pred(xa) + t_succ(yb) <= L_max (because (xa, yb) is in R and R is an LCR).
From Observation 8 and the assumption that (xa, yb) is delayless, the set of synchronization edges that (xa, yb) contributes to the elimination of is simply the set of synchronization edges that are subsumed by (xa, yb). Now, if s is a synchronization edge that is subsumed by (xa, yb), then

rho_G~(src(s), xa) = rho_G~(yb, snk(s)) = 0.    (11-16)
Figure 11.8. An instance of delayless, two-processor latency-constrained resynchronization. In this example, the execution times of all actors are identically equal to unity.
From the definition of va, we have that c <= b, and thus, that rho_G~(yc, yb) = 0. It follows from (11-16) that

rho_G~(src(s), xa) = rho_G~(yc, snk(s)) = 0,

and thus, that va subsumes s. Hence, va subsumes all synchronization edges that (xa, yb) contributes to the elimination of, and we can conclude that R' is a valid resynchronization of G~. From the definition of va, we know that t_pred(xa) + t_succ(yc) <= L_max, and since R is an LCR, we have from Observation 9 that R' is an LCR. QED.
From Fact 11.2 and the assumption that the members of S are all delayless, an optimal LCR of G~ consists only of delayless synchronization edges. Thus, from Observation 10, we know that there exists an optimal LCR that consists only of members of the form d0(vi). Furthermore, from Observation 9, we know that a collection V of vi's is an LCR if and only if the union of chi(v) over all v in V equals {s1, s2, ..., sn}, where chi(v) is the set of synchronization edges that are subsumed by v. The following observation completes the correspondence between 2LCR and interval covering.
Observation 11: Let s1', s2', ..., sn' be the ordering of s1, s2, ..., sn specified by

(xa = src(si'), xb = src(sj'), a < b) implies (i < j).    (11-18)
Let vj = (xj, yl), and suppose that i, k, and m are positive integers with i < k < m such that si' and sm' are both in chi(vj). Then sk' is also in chi(vj); that is, each chi(vj) is of the form {sa', sa+1', ..., sb'} with respect to the ordering specified by (11-18).

Proof of Observation 11: Since sm' is in chi(vj) and, from (11-18), src(sk') executes no later than src(sm') on the source processor, we have

rho_G~(src(sk'), xj) = 0.    (11-21)

Now clearly

rho_G~(snk(si'), snk(sk')) = 0,    (11-22)

since otherwise rho_G~(snk(sk'), snk(si')) = 0, and thus (from (11-18)) sk' subsumes si', which contradicts the assumption that the members of S are not redundant. Finally, since si' is in chi(vj), we know that rho_G~(yl, snk(si')) = 0. Combining this with (11-22) yields

rho_G~(yl, snk(sk')) = 0,    (11-23)
and (11-21) and (11-23) together yield that sk' is in chi(vj). QED.
From Observation 11 and the preceding discussion, we conclude that an optimal LCR of G~ can be obtained by the following steps: (a) construct the ordering s1', s2', ..., sn' specified by (11-18); (b) for i = 1, 2, ..., p, determine whether or not vi exists, and if it exists, compute vi; (c) compute chi(vj) for each value of j such that vj exists; (d) find a minimal cover C for S given the family of subsets {chi(vj) | vj exists}; and (e) define the resynchronization R = {vj | chi(vj) is in C}.
Steps (a), (b), and (e) can clearly be performed in O(N) time, where N is the number of vertices in G~. If the algorithm outlined in Section 11.4.1 is employed for step (d), then from the discussion in Section 11.4.1 and Observation 12(e) in Section 11.4.3, it can easily be verified that the time complexity of step (d) is O(N^2). Step (c) can also be performed in O(N^2) time using the observation that if vi = (xi, yj), then

chi(vi) = {(xa, yb) in S | a <= i and b >= j},    (11-24)

where S = {s1, s2, ..., sn} is the set of synchronization edges in G~. Thus, we
have the following result: polynomial-time solutions (quadratic in the number of synchronization graph vertices) exist for the delayless, two-processor latency-constrained resynchronization problem.
Note that solutions more efficient than the O(N^2) approach described above may exist.
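The steps (a)-(e) can be assembled into a short procedure. The sketch below is ours; it assumes that the synchronization edges are given as (source index, sink index) pairs and are non-redundant, that t_pred and t_succ are dictionaries indexed from 1 holding the cumulative execution times defined above, and that the interval_cover routine sketched earlier in this section is available.

def delayless_2lcr(S, t_pred, t_succ, L_max):
    # S: delayless synchronization edges as (a, b) pairs, meaning (x_a, y_b).
    # t_pred[i]: t(x_1) + ... + t(x_i); t_succ[j]: t(y_j) + ... + t(y_q).
    p, q = max(t_pred), max(t_succ)
    S = sorted(S)                          # step (a): the ordering of (11-18)
    v = {}                                 # step (b): candidates v_i = (x_i, y_j)
    for i in range(1, p + 1):
        for j in range(1, q + 1):
            if t_pred[i] + t_succ[j] <= L_max:
                v[i] = (i, j)              # smallest feasible sink index j
                break
    # step (c): chi(v_i) = {(a, b) in S : a <= i and b >= j}   (cf. (11-24))
    chi = {i: [(a, b) for (a, b) in S if a <= i and b >= v[i][1]] for i in v}
    # step (d): for non-redundant S, each chi(v_i) is contiguous in the
    # ordering of step (a), so greedy interval covering applies directly.
    intervals = [(S.index(es[0]) + 1, S.index(es[-1]) + 1, i)
                 for i, es in chi.items() if es]
    cover = interval_cover(len(S), [(lo, hi) for lo, hi, _ in intervals])
    if cover is None:
        return None
    # step (e): retain one candidate v_i per interval selected by the cover
    by_interval = {(lo, hi): i for lo, hi, i in intervals}
    return [v[by_interval[iv]] for iv in cover]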
From (11-20), we see that there are two possible solutions that can result if we apply Steps (a)-(e) to Figure 11.8(a) and use the technique described earlier for interval covering. These solutions correspond to the interval covers C1 = {chi(v3), chi(v7)} and C2 = {chi(v3), chi(v8)}. The synchronization graph that results from the interval cover C1 is shown in Figure 11.8(b).
If delays exist on one or more edges of the original synchronization graph, then the correspondence defined in the previous subsection between 2LCR and interval covering does not necessarily hold. For example, consider the synchronization graph in Figure 11.9. Here, the numbers beside the actors specify execution times; a "D" on top of an edge specifies a unit delay; the latency input and latency output are respectively x1 and y8; and the latency constraint is L_max = 12. It is easily verified that vi exists for i = 1, 2, ..., 6, and from (11-13), we obtain
the corresponding resynchronization candidates. Now, if we order the synchronization edges as specified by (11-18), then

si' = (xi, yi+4) for i = 1, 2, 3, 4, and si' = (xi, yi-4) for i = 5, 6, 7, 8,    (11-25)

and if the correspondence between delayless 2LCR and interval covering defined in the previous section were to hold for general 2LCR, then we would have that each subset chi(vi) is of the form

{sa', sa+1', ..., sb'}, 1 <= a <= b <= 8.    (11-26)
However, computing the subsets chi(vi), we obtain the sets listed in (11-27), and these subsets are clearly not all consistent with the form specified in (11-26). Thus, the algorithm developed in Section 11.4.2 does not apply directly to handle delays. However, the technique developed in Section 11.4.2 can be extended to the general 2LCR problem in polynomial time. This extension is based on separating the subsumption relationships between the vi's and the synchronization edges into two categories, referred to as 1-subsumption and 2-subsumption, according to the relative positions of the source actors of vi and of the subsumed edge s = (xk, yl).
For example, in Figure 11.9, some of the vi's subsume unit-delay synchronization edges (such as (x7, y3) and (x8, y4)) in the first manner, while others subsume them in the second manner.
Observation 12: Assuming the same notation for a generic instance of 2LCR that was defined in the previous subsection, the initial synchronization graph G~ satisfies the following conditions: (a) Each synchronization edge has at most one unit of delay (delay(si) is in {0, 1}).
(b) If (xi, yj) is a zero-delay synchronization edge and (xk, yl) is a unit-delay synchronization edge, then i < k. (c) If vi 1-subsumes a unit-delay synchronization edge (xj, yl), then vi also 1-subsumes all unit-delay synchronization edges s that satisfy src(s) = x(j+n), n > 0. (d) If vi 2-subsumes a unit-delay synchronization edge (xj, yl), then vi also 2-subsumes all unit-delay synchronization edges s that satisfy src(s) = x(j-n), n > 0. (e) If (xi, yj) and (xk, yl) are both distinct zero-delay synchronization edges, or they are both distinct unit-delay synchronization edges, then i != k and (i < k) if and only if (j < l).
The other parts can be verified easily from the structure of G~, including the assumption that no synchronization edge in G~ is redundant. We omit the details.
Resynchronizations for instances of general 2LCR can be partitioned into two categories: category A consists of all resynchronizations that contain at least one synchronization edge having nonzero delay, and category B consists of all resynchronizations that consist entirely of delayless synchronization edges. An optimal category A solution (a category A solution whose cost is less than or equal to the cost of all category A solutions) can be derived by simply applying the optimal solution described in Subsection 11.4.2 to "rearrange" the delayless resynchronization edges, and then replacing all synchronization edges that have nonzero delay with a single unit-delay synchronization edge directed from xp, the last actor scheduled on the source processor, to y1, the first actor scheduled on the sink processor. We refer to this approach as Algorithm A. An example is shown in Figure 11.10. For the instance of Figure 11.10(a), the constraint that all synchronization edges have zero delay is too restrictive to permit a globally optimal solution. Here, the latency constraint is assumed to be L_max = 2. Under this constraint, it is easily seen that no zero-delay resynchronization edges can be added without violating
the latency constraint. However, if we allow resynchronization edges that have delay, then we can apply Algorithm A to achieve a cost of two synchronization edges. The resulting synchronization graph, with redundant synchronization
Figure 11.10. An example in which constraining all resynchronization edges to be delayless makes it impossible to derive an optimal resynchronization (L_max = 2).
edges removed, is shown in Figure 11.10(b). Observe that this resynchronization is an LCR since only delayless synchronization edges affect the latency of a transparent synchronization graph. Now suppose that G~ (our generic instance of 2LCR) contains at least one unit-delay synchronization edge, suppose that Gb is an optimal category B solution for G~, and let Rb denote the set of resynchronization edges in Gb. Let ud(G~) denote the set of synchronization edges in G~ that have unit delay, and let (xk1, yl1), (xk2, yl2), ..., (xkM, ylM) denote the ordering of the members of ud(G~) that corresponds to the order in which the source actors execute on the source processor; that is, (i < j) implies (ki < kj).
Define

Isubs(G~, Gb) = {s in ud(G~) | there is no (z1, z2) in Rb such that (z1, z2) 1-subsumes s in G~}.

If Isubs(G~, Gb) is not empty, define

r = min({j | (xkj, ylj) is in Isubs(G~, Gb)}).    (11-28)
Suppose (xm, ym') is in Isubs(G~, Gb). Then by definition of r, m' >= lr, and thus rho_G~(ylr, ym') = 0. Furthermore, since xm and x1 execute on the same processor, rho_G~(xm, x1) <= 1. Hence,

rho_G~(xm, x1) + rho_G~(ylr, ym') <= 1 = delay(xm, ym'),

so we have that (x1, ylr) subsumes (xm, ym') in G~. Since (xm, ym') is an arbitrary member of ud(G~), we conclude that

every member of Isubs(G~, Gb) is subsumed by (x1, ylr).    (11-29)
Now, if Gamma = (ud(G~) - Isubs(G~, Gb)) is not empty, then define

u = max({j | (xkj, ylj) is in Gamma}),    (11-30)

and suppose (xm, ym') is in Gamma. By definition of u, m <= ku, and thus rho_G~(xm, xku) = 0. Furthermore, since ym' and yq execute on the same processor, rho_G~(yq, ym') <= 1. Hence,

rho_G~(xm, xku) + rho_G~(yq, ym') <= 1 = delay(xm, ym'),

and we have that

every member of Gamma is subsumed by (xku, yq).    (11-31)
Observe also that from the definitions of r and u, and from Observation 12, we have

(Gamma is empty) implies (r = 1),    (11-32)

(Isubs(G~, Gb) is empty) implies (u = M),    (11-33)

and

(Isubs(G~, Gb) and Gamma are both non-empty) implies (u = r - 1).    (11-34)
Now we define the synchronization graph Z(G~) by

Z(G~) = (V, (E - ud(G~)) + P),    (11-35)

where V and E are the sets of vertices and edges in G~; P = {d0(x1, ylr), d0(xku, yq)} if both Isubs(G~, Gb) and Gamma are non-empty; P = {d0(x1, ylr)} if Gamma is empty; and P = {d0(xku, yq)} if Isubs(G~, Gb) is empty.

Theorem 11.3: Gb is a resynchronization of Z(G~).
Proof: The set of synchronization edges in Z(G~) is E0 + P, where E0 is the set of delayless synchronization edges in G~. Since Gb is a resynchronization of G~, it suffices to show that for each e in P,

rho_Gb(src(e), snk(e)) = 0.    (11-36)

If Isubs(G~, Gb) is non-empty, then from (11-28) (the definition of r) and Observation 12(f), there must be a delayless synchronization edge e' in Gb such that snk(e') = yw for some w <= lr. Thus,

rho_Gb(x1, ylr) <= rho_Gb(x1, src(e')) + rho_Gb(snk(e'), ylr) = 0 + rho_Gb(yw, ylr) = 0,

and we have that (11-36) is satisfied for e = (x1, ylr).
Similarly, if Gamma is non-empty, then from (11-30) (the definition of u) and from the definition of 2-subsumes, there exists a delayless synchronization edge e' in Gb such that src(e') = xw for some w >= ku. Thus,

rho_Gb(xku, yq) <= rho_Gb(xku, src(e')) + rho_Gb(snk(e'), yq) = rho_Gb(xku, xw) + 0 = 0,

and hence, we have that (11-36) is satisfied for e = (xku, yq).
From the definition of P, it follows that (11-36) is satisfied for every e in P. QED.

Corollary 11.1: The latency of Z(G~) is not greater than L_max. That is, L_Z(G~)(x1, yq) <= L_max.
Proof: From Theorem 11.3, we know that Gb preserves Z(G~). Thus, from Lemma 10.1, it follows that L_Z(G~)(x1, yq) <= L_Gb(x1, yq). Furthermore, from the assumption that Gb is an optimal category B LCR, we have L_Gb(x1, yq) <= L_max. We conclude that L_Z(G~)(x1, yq) <= L_max. QED.

Theorem 11.3, along with (11-32)-(11-34), tells us that an optimal category B LCR of G~ is always a resynchronization of (1) a synchronization graph of the form
(V, (E - ud(G~)) + {d0(x1, ylj), d0(xk(j-1), yq)}) for some j in {2, ..., M}; or (2) a synchronization graph of the form (V, (E - ud(G~)) + {d0(x1, yl1)}); or (3) a synchronization graph of the form (V, (E - ud(G~)) + {d0(xkM, yq)}).
Thus, from Corollary 11.1, an optimal resynchronization can be computed by examining each of the (M + 1) = (|ud(G~)| + 1) synchronization graphs defined by (1)-(3), computing an optimal LCR for each of these graphs whose latency is no greater than L_max, and returning one of the optimal LCRs that has the fewest number of synchronization edges. This is straightforward since these graphs contain only delayless synchronization edges, and thus the algorithm of Section 11.4.2 can be used.
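For concreteness, this enumeration can be organized as in the Python sketch below (ours). The routines unit_delay_edges, build_candidate_graph, latency, sync_edges, t_pred, t_succ, apply_resynchronization, and sync_edge_count are hypothetical helpers for constructing the graphs of forms (1)-(3), measuring latency, and counting synchronization edges; delayless_2lcr is the procedure sketched earlier in this chapter.

def algorithm_B(G, L_max):
    # Enumerate the (M + 1) candidate graphs obtained by replacing the
    # unit-delay edges ud(G) as in forms (1)-(3), and keep the cheapest LCR.
    M = len(unit_delay_edges(G))                 # hypothetical helper
    best = None
    for k in range(M + 1):
        Zk = build_candidate_graph(G, k)         # hypothetical: the k-th Z(G)
        if latency(Zk) > L_max:
            continue                             # skip graphs violating L_max
        R = delayless_2lcr(sync_edges(Zk), t_pred(Zk), t_succ(Zk), L_max)
        if R is None:
            continue
        cost = sync_edge_count(apply_resynchronization(Zk, R))
        if best is None or cost < best[0]:
            best = (cost, R, Zk)
    return best                                  # (cost, resynchronization, graph)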
Recall the example of Figure 11.9. Here,

ud(G~) = {(x5, y1), (x6, y2), (x7, y3), (x8, y4)},

and the set of synchronization graphs that correspond to (1)-(3) are shown in Figures 11.11 and 11.12. The latencies of the graphs in Figure 11.11(a)-(c) and Figure 11.12(a)-(b) are respectively 14, 13, 12, 13, and 14. Since L_max = 12, we only need to compute an optimal LCR for the graph of Figure 11.11(c) (from Corollary 11.1). This is done by first removing redundant edges from the graph (yielding the graph in Figure 11.13(b)) and then applying the algorithm developed in Section 11.4.2. For the synchronization graph of Figure 11.13(b) and L_max = 12, it is easily verified that the resynchronization candidates computed from (11-13) include v1 = (x1, y3), v2 = (x2, y4), and v6 = (x6, y8). If we let v1, v2, ..., v6 denote these candidates,
Figure 11.11. An illustration of Algorithm B.
then we have the subsumed-edge sets chi(vi) given in (11-38).
From (11-38), the algorithm outlined in Section 11.4.1 for interval covering can be applied to obtain an optimal resynchronization. This results in the resynchronization R = {v1, v3, v6}. The resulting synchronization graph is shown in Figure 11.13(c). Observe that the number of synchronization edges has been reduced from 8 to 3, while the latency has increased from 10 to L_max = 12. Also, none of the original synchronization edges in G~ are retained in the resynchronization. We say that Algorithm B for general 2LCR is the approach of constructing the (|ud(G~)| + 1) synchronization graphs corresponding to (1)-(3), computing an optimal LCR for each of these graphs whose latency is no greater than L_max, and returning one of the optimal LCRs that has the fewest number of syn-
Figure 11.12. Continuation of the illustration of Algorithm B in Figure 11.11.
Figure 11.13. Optimal LCR example.
chronization edges. We have shown previously that Algorithm B leads to an optimal LCR under the constraint that all resynchronization edges have zero delay. Thus, given an instance of general 2LCR, a globally optimal solution can be derived by applying Algorithm A and Algorithm B and retaining the better of the resulting two solutions. The time complexity of this two-phased approach is dominated by the complexity of Algorithm B, which is O(|ud(G~)| N^2) (a factor of |ud(G~)| greater than the complexity of the technique for delayless 2LCR that was developed in Section 11.4.2), where N is the number of vertices in G~. Since |ud(G~)| <= N from Observation 12(e), the complexity is O(N^3).
Polynomial-time solutions exist for the general two-processor latency-constrained resynchronization problem.
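A minimal driver for this two-phased approach, under the same assumptions and hypothetical helpers as the Algorithm B sketch above (with algorithm_A as its category A counterpart), could read:

def general_2lcr(G, L_max):
    # Apply both algorithms and retain the solution with fewer sync edges.
    solutions = [s for s in (algorithm_A(G, L_max), algorithm_B(G, L_max))
                 if s is not None]
    return min(solutions, key=lambda s: s[0]) if solutions else None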
The example in Figure 11.10 shows how it is possible for Algorithm A to produce a better result than Algorithm B. Conversely, the ability of Algorithm B to outperform Algorithm A can be demonstrated through the example of Figure 11.9. From Figure 11.13(c), we know that the result computed by Algorithm B has a cost of 3 synchronization edges. The result computed by Algorithm A can be derived by applying interval covering to the subsets specified in (11-27) with all of the unit-delay edges (s5', s6', s7', s8') removed; the resulting subsets are given in (11-39). A minimal cover for (11-39) is achieved by {chi(v2), chi(v3), chi(v6)}, and the corresponding synchronization graph computed by Algorithm A is shown in Figure 11.14. This solution has a cost of 4 synchronization edges, which is one greater than that of the result computed by Algorithm B for this example.
In Section 10.5, we discussed a heuristic called Global-resynchronize for the maximum-throughput resynchronization problem, which is the problem of determining an optimal resynchronization under the assumption that arbitrary increases in latency can be tolerated. In this section, we extend Algorithm Global-resynchronize to derive an efficient heuristic that addresses the latency-constrained resynchronization problem for general synchronization graphs. Given an input synchronization graph G, Algorithm Global-resynchronize operates by first computing the family of subsets specified in (11-40).
After computing the family of subsets specified by (11-40), Algorithm
Global-resynchronize chooses a member of this family that has maximum cardinality, inserts the corresponding delayless resynchronization edge, and removes all synchronization edges that become redundant as a result of inserting this resynchronization edge. To extend this technique for maximum-throughput resynchronization to the latency-constrained resynchronization problem, we simply replace the subset computation in (11-40) with the restricted family given in (11-41), which admits only those candidate edges (v1, v2) for which L' <= L_max,
Figure 11.14. The solution derived by Algorithm A when it is applied to the example of Figure 11.9.
where L' is the latency of the synchronization graph (V, E + {(v1, v2)}) that results from adding the resynchronization edge (v1, v2) to G. A pseudocode specification of the extension of Global-resynchronize to the latency-constrained resynchronization problem, called Algorithm Global-LCR, is shown in Figure 11.15.
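One pass of the candidate selection implied by (11-41) can be sketched as follows (in Python, with names of our choosing); actors, introduces_cycle, latency_with_edge, and chi are hypothetical helpers for enumerating actor pairs, detecting cycles, evaluating L', and computing the set of synchronization edges subsumed by a candidate edge.

def best_lcr_candidate(G, L_max):
    # Evaluate every ordered actor pair (v1, v2) as a potential delayless
    # resynchronization edge, subject to the latency constraint of (11-41).
    best, best_gain = None, 0
    for v1 in actors(G):
        for v2 in actors(G):
            if v1 is v2 or introduces_cycle(G, v1, v2):
                continue
            if latency_with_edge(G, v1, v2) > L_max:     # the L' test
                continue
            gain = len(chi(G, (v1, v2)))   # sync edges the new edge would subsume
            if gain > best_gain:
                best, best_gain = (v1, v2), gain
    return best       # None if no admissible candidate exists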
In Section 11.2, we mentioned that transparent synchronization graphs are advantageous for performing latency-constrained resynchronization. If the input synchronization graph is transparent, then assuming that T_fi(G)(x, y) has been determined for all x, y in V, L' in Algorithm Global-LCR can be computed in O(1) time from

L' = max({L_G, T_fi(G)(nu, v1) + T_fi(G)(v2, oL)}),    (11-42)

where nu is the source actor in fi(G), oL is the latency output, and L_G is the latency of G. Furthermore, T_fi(G)(x, y) can be updated in the same manner as rho_G. That is, once the resynchronization edge best is chosen, we have that for each (x, y) in (V union {nu}) x (V union {nu}),
T_new(x, y) = max({T_fi(G)(x, y), T_fi(G)(x, src(best)) + T_fi(G)(snk(best), y)}),    (11-43)
where T_new denotes the maximum cumulative execution time between actors in the first iteration graph after the insertion of the edge best in G. The computations in (11-43) can be performed by inserting the simple loop shown in Figure 11.16 at the end of the while loop block in Algorithm Global-LCR. Thus, as with the computation of rho_G, the Bellman-Ford algorithm need only be invoked once, at the beginning of Algorithm Global-LCR, to initialize T_fi(G)(x, y). This loop can be inserted immediately before or after the loop that updates rho_G.
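Rendered directly in Python, the update of (11-43) might look like the following sketch (names ours), where Tfi maps actor pairs to the maximum cumulative execution times T_fi(G)(.,.), unreachable pairs are assumed to map to minus infinity, and best is the resynchronization edge just inserted.

def update_Tfi(Tfi, nodes, best):
    # nodes: all actors of the first-iteration graph, including its source actor.
    # best = (src, snk): the newly inserted delayless resynchronization edge.
    src, snk = best
    for x in nodes:
        for y in nodes:
            via_best = Tfi[(x, src)] + Tfi[(snk, y)]
            if via_best > Tfi[(x, y)]:
                Tfi[(x, y)] = via_best          # cf. (11-43)
    return Tfi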
In Section 10.5, it was shown that Algorithm Global-resynchronize has O(s_ff n^4) time complexity, where n is the number of actors in the input synchronization graph, and s_ff is the number of feedforward synchronization edges. Since the longest-path quantities T_fi(G)(.,.) can be computed initially in O(n^3) time and updated in O(n^2) time, it is easily verified that the O(s_ff n^4) bound also applies to the customization of Algorithm Global-LCR to transparent synchroni-
zation graphs. In general, whenever the nested loops in Figure 11.15 dominate the computation, the O(s_ff n^4) complexity is maintained as long as (L'(x, y) <= L_max) can be evaluated in O(1) time. For general (not necessarily
Function Global-LCR
Input: a reduced synchronization graph G = (V, E).
Output: an alternative reduced synchronization graph that preserves G.
compute rho_G(x, y) for all actor pairs x, y in V
complete = FALSE
while not (complete)
    best = NULL, M = 0
    for each candidate edge (x, y) with L'(x, y) <= L_max and |chi((x, y))| > M
        best = (x, y), M = |chi((x, y))|
    if best = NULL then complete = TRUE
    else
        E = E - chi(best) + {d0(best)}; G = (V, E)
        for x, y in V    /* update rho_G */
            rho_G(x, y) = min(rho_G(x, y), rho_G(x, src(best)) + rho_G(snk(best), y))
end while
return G

Figure 11.15. A heuristic for latency-constrained resynchronization.
transparent) synchronization graphs, we can use the functional simulation approach described in Section 11.2 to determine L'(x, y) in O(d x max({n, s})) time, where d = 1 + rho_G(x, y), and s denotes the number of synchronization edges in G. This yields a correspondingly larger running time bound for general synchronization graphs.
The complexity bounds derived above are based on a general upper bound of n^2, derived in Section 10.5, on the total number of resynchronization steps (while loop iterations). However, this n^2 bound can be viewed as a very conservative estimate since, in practice, constraints on the introduction of cycles severely limit the number of possible resynchronization steps. Thus, on practical graphs, we can expect significantly lower average-case complexity than these worst-case bounds.
Figure 11.17 shows the synchronization graph that results from a six-processor schedule of a synthesizer for plucked-string musical instruments in 11 voices based on the Karplus-Strong technique, as shown in Section 10.5. In this example, exc and out are respectively the latency input and latency output, and the latency is 170. There are ten synchronization edges shown, and none of these is redundant. Figure 11.18 shows how the number of synchronization edges in the result computed by the heuristic changes as the latency constraint varies. If just over 50 units of latency can be tolerated beyond the original latency of 170, then the heuristic is able to eliminate a single synchronization edge. No further improvement can be obtained unless roughly another 50 units are allowed, at which point the number of synchronization edges drops to 8, and then down to 7 for an additional 8 time units of allowable latency. If the latency constraint is weakened to
Figure 11.16. Pseudocode to update T_fi(G) for use in the customization of Algorithm Global-LCR to transparent synchronization graphs.
Figure 11.17. The synchronization graph that results from a six-processor schedule of a music synthesizer based on the Karplus-Strong technique.
Figure 11.18. Performance of the heuristic on the example of Figure 11.17.
382, just over twice the original latency, then the heuristic is able to reduce the number of synchronization edges to 6. No further improvement is achieved over the relatively long range of (383-644). When L_max >= 645, the minimal cost of 5 synchronization edges for this system is attained, which is half that of the original synchronization graph. Figure 11.19 and Table 11.1 show how the average iteration period (the reciprocal of the average throughput) varies with different memory access times for various resynchronizations of Figure 11.17. Here, the column of Table 11.1 and the plot of Figure 11.19 labeled A represent the original synchronization graph (before resynchronization); column/plot label B represents the resynchronized result corresponding to the first break-point of Figure 11.18 (L_max = 221, 9 synchronization edges); label C corresponds to the second break-point of Figure 11.18 (L_max = 268, 8 synchronization edges); and so on for labels D, E,
Figure 11.19. Average iteration period (reciprocal of average throughput) vs. memory access time for various latency-constrained resynchronizations of the music synthesis example in Figure 11.17.
and F, whose associated synchronization graphs have 7, 6, and 5 synchronization edges, respectively. Thus, as we go from label A to label F, the number of synchronization edges in the resynchronized solution decreases monotonically. However, as seen in Figure 11.19, the average iteration period need not exactly follow this trend. For example, even though synchronization graph A has one synchronization edge more than graph B, the iteration period curve for graph B lies slightly above that of A. This is because the simulations shown in the figure model a shared bus, and take bus contention into account. Thus, even though graph B has one less synchronization edge than graph A, it entails higher bus contention, and hence results in a higher average iteration period. A similar

Table 11.1. Performance results for the resynchronization of Figure 11.17. The first column gives the memory access time; "I" stands for "average iteration period" (the reciprocal of the average throughput); and "M" stands for "memory accesses per graph iteration."
anomaly is seen between graph C and graph D, where graph D has one less synchronization edge than graph C, but still has a higher average iteration period. However, we observe such anomalies only within highly localized neighborhoods in which the number of synchronization edges differs by only one. Overall, in a global sense, the figure shows a clear trend of decreasing iteration period with loosening of the latency constraint and reduction of the number of synchronization edges. It is difficult to model bus contention analytically, and for precise performance data we must resort to a detailed simulation of the shared bus system. Such a simulation can be used as a means of verifying that the resynchronization optimization does not result in a performance degradation due to higher bus contention. Experimental observations suggest that this needs to be done only for cases where the number of synchronization edges removed by resynchronization is small compared to the total number of synchronization edges (i.e., when the resynchronized solution is within a localized neighborhood of the original synchronization graph). Figure 11.20 shows that the average number of shared memory accesses
Figure 11.20. Average number of shared memory accesses per iteration for various latency-constrained resynchronizations of the music synthesis example.
per graph iteration decreases consistently with loosening of the latency constraint. As mentioned in Chapter 10, such reduction in shared memory accesses is relevant when power consumption is an important issue, since accesses to shared memory often require significant amounts of energy. Figure 11.21 illustrates how the placement of synchronization edges changes as the heuristic is able to attain lower synchronization costs. Note that the synchronization graphs computed by the heuristic are not necessarily identical over any of the L_max ranges in Figure 11.18 in which the number of synchronization edges is constant. In fact, they can be significantly different. This is because even when there are no resynchronization candidates available that can reduce the net synchronization cost (that is, no resynchronization candidates for which |chi(.)| > 1), the heuristic attempts to insert resynchronization edges for the purpose of increasing the connectivity; this increases the chance that subsequent resynchronization candidates will be generated for which |chi(.)| > 1, as discussed in Chapter 10. For example, Figure 11.23 shows the synchronization graph computed when L_max is just below the amount needed to permit the minimal solution, which requires only five synchronization edges (solution F). Comparison with the graph shown in Figure 11.21(d) shows that even though these solutions have the same synchronization cost, the heuristic had much more room to pursue further resynchronization opportunities with L_max = 644, and thus, the graph of Figure 11.23 is more similar to the minimal solution than it is to the solution of Figure 11.21(d).
Earlier, we mentioned that the O(s_ff n^4) bound, and the corresponding bound for general synchronization graphs, are conservative since they are based on an n^2 bound on the number of iterations of the while loop in Figure 11.15, while in practice, the actual number of while loop iterations can be expected to be much less than n^2. This claim is supported by the music synthesis example, as shown in the graph of Figure 11.22. Here, the X-axis corresponds to the latency constraint L_max, and the Y-coordinates give the number of while loop iterations that were executed by the heuristic. We see that between 5 and 13 iterations were required for each execution of the algorithm, which is not only much less than n^2 = 484, it is even less than n. This suggests that perhaps a significantly tighter bound on the number of while loop iterations can be derived.
This chapter has discussed the problem of latency-constrained resynchronization for self-timed implementation of iterative dataflow specifications. Given an upper bound L_max on the allowable latency, the objective of latency-constrained resynchronization is to insert extraneous synchronization operations in such a way that a) the number of original synchronizations that con-
Figure 11.21. Synchronization graphs computed by the heuristic for different values of L_max.
Figure 11.22. Number of resynchronization iterations versus L_max for the example of Figure 11.17.
sequently become redundant significantly exceeds the number of new synchronizations, and b) the serialization imposed by the new synchronizations does not increase the latency beyond L_max. To ensure that the serialization imposed by resynchronization does not degrade the throughput, the new synchronizations are restricted to lie outside of all cycles in the final synchronization graph. In this chapter, it has been shown that optimal latency-constrained resynchronization is NP-hard even for a very restricted class of synchronization graphs. Furthermore, an efficient, polynomial-time algorithm has been demonstrated that computes optimal latency-constrained resynchronizations for two-processor systems; and the heuristic presented in Section 10.5 for maximum-throughput resynchronization has been extended to address the problem of latency-constrained resynchronization for general n-processor systems. Through an example of a music synthesis system, we have illustrated the ability of this extended heuristic to systematically trade off between synchronization overhead and latency. The techniques developed in this chapter and Chapter 10 can be used as a post-processing step to improve the performance of any of the large number of static multiprocessor scheduling techniques for dataflow specifications, such as
Figure 11.23. The synchronization graph computed by the heuristic for L_max = 644.
those described in [BPFC94, CS95, GGD94, Hoa92, LAAG94, PM91, Pri91,
The previous three chapters have developed several software-based techniques for minimizing synchronization overhead in a self-timed multiprocessor implementation. After all of these optimizations are completed on a given application graph, we have a final synchronization graph G_s = (V, E_int U E_s) that preserves G_ipc. Since the synchronization edges in G_s are the ones that are finally implemented, it is advantageous to calculate the self-timed buffer bound B_fb as a final step after all the transformations on G_s are completed, instead of using G_ipc itself to calculate these bounds. This is because addition of the edges in the Convert-to-SC-graph and Resynchronize steps may reduce these buffer bounds. It is easily verified that removal of edges cannot change the buffer bounds in (9-1) as long as the synchronizations in G_ipc are preserved. Thus, in the interest of obtaining the minimum possible shared buffer sizes, we compute the bounds using the optimized synchronization graph. The following theorem tells us how to compute the self-timed buffer bounds from G_s.

Theorem 12.1: If G_s preserves G_ipc and the synchronization edges in G_s are implemented, then for each feedback communication edge e in G_ipc, the self-timed buffer bound of e, B_fb(e), which is an upper bound on the number of data tokens that can be present on e, is given by

B_fb(e) = rho_Gs(snk(e), src(e)) + delay(e).    (12-1)
Proof: By Lemma 7.1, if there is a path p from snk(e) to src(e) in G_s, then
start(src(e), k) >= end(snk(e), k - Delay(p)).
(12-2)
Taking p to be an arbitrary minimum-delay path from snk(e) to src(e) in G_s, we get

start(src(e), k) >= end(snk(e), k - rho_Gs(snk(e), src(e))).    (12-3)
That is, src(e) cannot be more than rho_Gs(snk(e), src(e)) iterations "ahead" of snk(e). Thus, there can never be more than rho_Gs(snk(e), src(e)) tokens on e beyond the initial number of tokens, delay(e). Since the initial number of tokens on e was delay(e), the size of the buffer corresponding to e is bounded above by

B_fb(e) = rho_Gs(snk(e), src(e)) + delay(e).    QED.
The quantities rho_Gs(snk(e), src(e)) can be computed by solving the all-pairs shortest path problem on the synchronization graph (for example, using Dijkstra's algorithm, Section 3.13.1) in time O(|V|^3).
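A sketch of this computation is given below (in Python, under representations of our choosing); it uses the Floyd-Warshall recurrence for the all-pairs minimum path delays, which also meets the O(|V|^3) bound, and assumes edges are supplied as (src, snk, delay) triples.

def self_timed_buffer_bounds(vertices, sync_edges, feedback_ipc_edges):
    # rho[u][v]: minimum path delay from u to v in the synchronization graph G_s.
    INF = float('inf')
    rho = {u: {v: (0 if u == v else INF) for v in vertices} for u in vertices}
    for (u, v, d) in sync_edges:
        rho[u][v] = min(rho[u][v], d)
    for w in vertices:                            # all-pairs shortest delays
        for u in vertices:
            for v in vertices:
                if rho[u][w] + rho[w][v] < rho[u][v]:
                    rho[u][v] = rho[u][w] + rho[w][v]
    # Theorem 12.1: B_fb(e) = rho(snk(e), src(e)) + delay(e)
    return {(u, v): rho[v][u] + d for (u, v, d) in feedback_ipc_edges}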
To present a unified view of multiprocessor implementation issues in a concrete manner that can contribute to the development of future multiprocessor implementation tools, we introduce a flexible framework for combining arbitrary multiprocessor scheduling algorithms for iterative dataflow graphs, including the diverse set that we discussed in Chapter 5, with algorithms for optimizing IPC and synchronization costs of a given schedule, such as those covered in Chapters 9-11.
A pseudocode outline of this framework is depicted in Figure 12.1. In Step 1, an arbitrary multiprocessor scheduling algorithm is applied to construct a parallel schedule for the input dataflow graph. From the resulting parallel schedule, the IPC graph and the initial synchronization graph models are derived in Steps 2 and 3. Then, in Steps 4-8, a series of transformations is attempted on the synchronization graph. First, Algorithm RemoveRedundantSynchs detects and removes all of the synchronization edges in G_s whose associated synchronization functions are guaranteed by other synchronization edges in the graph, as described in Section 9.7. Step 5 then applies resynchronization to the "reduced" graph that emerges from Step 4, and incorporates any applicable latency constraints. Step 6 inserts new synchronization edges to convert the synchronization graph into a strongly connected graph so that the efficient BBS protocol can be used uniformly. Step 7 applies the DetermineDelays algorithm discussed in Section 9.9 to determine an efficient placement of delays on the new edges. Finally, Step 8
removes any synchronization edges that have become redundant as a result of the conversion to a strongly connected graph. After Step 8 is complete, we have a set of IPC buffers (corresponding to the IPC edges of G_ipc) and a set of synchronization points (the synchronization edges of the transformed version of G_s). The main task that remains before mapping the given parallel schedule into an implementation is the determination of
Function ImplementMultiprocessorSchedule
Input: an iterative dataflow graph specification G of a DSP application.
Output: an optimized synchronization graph G_s, an IPC graph G_ipc, and IPC buffer sizes {B(e) | e is an IPC edge in G_ipc}.
1. Apply a multiprocessor scheduling algorithm to construct a parallel schedule for G onto the given target multiprocessor architecture. The parallel schedule specifies the assignment of individual tasks to processors, and the order in which tasks execute on each processor.
2. Extract G_ipc from G and the parallel schedule constructed in Step 1.
3. Initialize G_s = G_ipc.
4. G_s = RemoveRedundantSynchs(G_s)
5. G_s = Resynchronize(G_s)    /* incorporating any applicable latency constraints */
6. G_s = Convert-to-SC-graph(G_s)
7. G_s = DetermineDelays(G_s)
8. G_s = RemoveRedundantSynchs(G_s)
9. Calculate the buffer size B(e) for each IPC edge e in G_ipc:
   a) Compute rho_Gs(snk(e), src(e)), the total delay on a minimum-delay path in G_s directed from snk(e) to src(e).
   b) Set B(e) = rho_Gs(snk(e), src(e)) + delay(e).

Figure 12.1. A framework for synthesizing multiprocessor implementations.
the buffer sizes, that is, the amount of memory that must be allocated for each IPC edge. From Theorem 12.1, we can compute these buffer sizes from G_s and G_ipc by the procedure outlined in Step 9 of Figure 12.1. As we have discussed in Chapter 5, optimal derivation of parallel schedules is intractable, and a wide variety of useful heuristic approaches have emerged, with no widely accepted "best choice" among them. In contrast, the technique that we discussed in Section 9.7 for removing redundant synchronizations (Steps 4 and 8) is both optimal and of low computational complexity. However, as discussed in Chapters 10 and 11, optimal resynchronization is intractable, and although some efficient resynchronization heuristics have been developed, the resynchronization problem is very complex, and experimentation with alternative algorithms may be desirable. Similarly, the problems associated with Steps 6 and 7 are also significantly complex to perform in an optimal manner, although no result on their intractability has been derived so far.
Thus, at present, as with the parallel scheduling problem, tool developers are not likely to agree on any single "best" algorithm for each of the implementation sub-problems surrounding Steps 5, 6, and 7. For example, some tool designers may wish to experiment with various evolutionary algorithms or other iterative/probabilistic search techniques on one or more of the sub-problems [Dre98]. The multiprocessor implementation framework defined in Figure 12.1 addresses the inherent complexity and diversity of the sub-problems associated with multiprocessor implementation of dataflow graphs by implementing a natural decomposition of the self-timed synchronization problem into a series of well-defined sub-problems, and providing a systematic method for combining arbitrary algorithms that address the sub-problems in isolation.
This section has integrated the software-based synchronization techniques developed in Chapters 9-11 into a single framework for the automated derivation of self-timed multiprocessor implementations. The input to this framework is an HSDFG representation of an application. The output is a processor assignment and execution ordering of application sub-tasks; an IPC graph G_ipc = (V, E_ipc), which represents buffers as communication edges; a strongly connected synchronization graph G_s = (V, E_int U E_s), which represents synchronization constraints; and a set of shared-memory buffer sizes {B_fb(e) | e is an IPC edge in G_ipc}.
(12-4)
A code generator can accept G_ipc and G_s from the output of the MinimizeSynchCost framework, allocate a buffer in shared memory of size B_fb(e) for each communication edge e specified by G_ipc, and generate synchronization code for the synchronization edges represented in G_s. These synchronizations may be implemented using the bounded buffer synchronization (BBS) protocol. The resulting synchronization cost is 2n_s, where n_s is the number of synchronization edges in the synchronization graph G_s that is obtained after all optimizations are completed.
This book has explored techniques that minimize inter-processor communication and synchronization costs in statically scheduled multiprocessors for DSP. The main underlying theme is that communication and synchronization in statically scheduled hardware is fairly predictable, and this predictability can be exploited to achieve our aims of low overhead parallel implementation at low hardware cost. The first technique described was the ordered transactions strategy, where the idea is to predict the order of processor accesses to shared resources and enforce this order at run time. An application of this idea to a shared bus multiprocessor was described, where the sequence of accesses to shared memory is pre-determined at compile time and enforced at run time by a controller implemented in hardware. A prototype of this architecture, called the ordered memory access architecture, demonstrates how low overhead IPC can be achieved at low hardware cost for the class of DSP applications that can be specified as SDF graphs, provided good compile time estimates of execution times exist. We also introduced the IPC graph model for modeling self-timed schedules. This model was used to show that we can determine a particular transaction order such that enforcing this order at run time does not sacrifice performance when actual execution times of tasks are close to their compile time estimates. When actual running times differ from the compile time estimates, the computation performed is still correct, but the performance (throughput) may be affected. We described how to quantify such effects of run time variations in execution times on the throughput of a given schedule. The ordered transactions approach also extends to graphs that include constructs with data-dependent firing behavior. We discussed how conditional constructs and data-dependent iteration constructs can be mapped to the OMA architecture, when the number of such control constructs is small, a reasonable assumption for most DSP algorithms. Finally, we described techniques for minimizing synchronization costs in
a self-timed implementation that can be achieved by systematically manipulating the synchronization points in a given schedule; the IPC graph construct was used for this purpose. The techniques described include determining when certain synchronization points are redundant, transforming the IPC graph into a strongly connected graph, and then sizing buffers appropriately such that checks for buffer overflow by the sender can be eliminated. We also outlined a technique we call resynchronization, which introduces new synchronization points in the schedule with the objective of minimizing the overall synchronization cost.
The work presented in this book leads to several open problems and directions for further research. Mapping a general BDF graph onto the OMA architecture to make the best use of our ability to switch between bus access schedules at run time is a topic that requires further study. Techniques for multiprocessor scheduling of BDF graphs could build upon the quasi-static scheduling approach, which restricts itself to certain types of dynamic constructs that need to be identified (for example as conditional constructs or data-dependent iterations) before scheduling can proceed. Assumptions regarding statistics of the Boolean tokens (e.g., the proportion of TRUE values that a control token assumes during the execution of the schedule) would be required for determining multiprocessor schedules for BDF graphs. The OMA architecture applies the ordered transactions strategy to a shared bus multiprocessor. If the interprocessor communication bandwidth requirements for an application are higher than what a single shared bus can support, a more elaborate interconnect, such as a crossbar or a mesh topology, may be required. If the processors in such a system run a self-timed schedule, the communication pattern is again periodic and we can predict this pattern at compile time. We can then determine the states that the crossbar in such a system cycles through, or we can determine the sequence of settings for the switches in the mesh topology. The fact that this information can be determined at compile time should make it possible to simplify the hardware associated with these interconnect mechanisms, since the associated switches need not be configured at run time. Exactly how this compile time information can be made use of for simplifying the hardware in such interconnects is an interesting problem for further study. In the techniques we proposed in Chapters 9 through 11 for minimizing synchronization costs, no assumptions regarding bounds on execution times of actors in the graph were made. A direction for further work is to incorporate timing guarantees: for example, hard upper and lower execution time bounds, as Dietz, Zaafrani, and O'Keefe use in [DZO92]; and handling of a mix of actors, some of which have guaranteed execution time bounds, and others that have no such guarantees, as Filo, Ku, Coelho Jr., and De Micheli do in [FKJM93]. Such
FUTURE RESEARCH DIRECTIONS
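The following sketch (an illustrative assumption, not a scheme from the book) shows one simple way such compile time information might be used: the compile-time transaction order is folded into a periodic sequence of crossbar settings that a controller could step through, instead of configuring the switch at run time. The transfer list, port numbering, and the greedy packing of non-conflicting transfers into a single state are all hypothetical choices.

# Sketch (illustrative assumption): deriving a periodic sequence of crossbar states
# from a compile-time transaction order. Each transfer is a (source processor,
# destination processor) pair listed in the order fixed at compile time; consecutive
# transfers that use disjoint input and output ports are packed into one crossbar
# state, assuming they may proceed concurrently. A controller would then simply
# cycle through the resulting state sequence once per schedule period.

from typing import Dict, List, Tuple

Transfer = Tuple[int, int]          # (src_processor, dst_processor)
CrossbarState = Dict[int, int]      # input port -> output port

def crossbar_state_sequence(transfers: List[Transfer]) -> List[CrossbarState]:
    states: List[CrossbarState] = []
    current: CrossbarState = {}
    for src, dst in transfers:
        # A port conflict forces the crossbar to move on to a new state.
        if src in current or dst in current.values():
            states.append(current)
            current = {}
        current[src] = dst
    if current:
        states.append(current)
    return states

# One period of a hypothetical compile-time transaction order.
period = [(0, 1), (2, 3), (1, 0), (3, 2), (0, 2)]
print(crossbar_state_sequence(period))
# [{0: 1, 2: 3, 1: 0, 3: 2}, {0: 2}]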
In the techniques we proposed in Chapters 9 through 11 for minimizing synchronization costs, no assumptions regarding bounds on the execution times of actors in the graph were made. A direction for further work is to incorporate timing guarantees: for example, hard upper and lower execution time bounds, as Dietz, Zaafrani, and O'Keefe use in [DZO92], and handling of a mix of actors, some with guaranteed execution time bounds and others with no such guarantees, as Filo, Ku, Coelho Jr., and De Micheli do in [FKAJM93]. Such guarantees could be used to detect situations in which data will always be available before it is needed for consumption by another processor. Execution time guarantees can also be used to compute tighter buffer size bounds. As a simple example, consider Figure 13.1. Here, the analysis of Section 9.4 yields a buffer size bound of 3 for the edge (A, B), since 3 is the minimum path delay of a cycle that contains (A, B). However, if t(A) and t(B), the execution times of actors A and B, are guaranteed to be equal to the same constant, then it is easily verified that a buffer size of 1 suffices for (A, B). Systematically applying execution time guarantees to derive lower buffer size bounds appears to be a promising direction for further work.

Several useful directions for further work emerge from the concept of self-timed resynchronization described in Chapters 10 and 11. These include investigating whether efficient techniques can be developed that consider resynchronization opportunities within strongly connected components, rather than just across feedforward edges. There may also be considerable room for improvement over the resynchronization heuristics that we have discussed, which are straightforward adaptations of an existing set-covering algorithm. In particular, it would be useful to explore ways to best integrate the heuristics for general synchronization graphs with the optimal chaining method for a restricted class of graphs, and it may be interesting to search for properties of practical synchronization graphs that could be exploited in addition to the correspondence with set covering. The extension of Sarkar's concept of counting semaphores [Sar89] to self-timed, iterative execution, and the incorporation of extended counting semaphores within the framework of self-timed resynchronization, are also interesting directions for further work.
Figure 13.1. An example of how execution time guarantees can be used to reduce buffer size bounds.
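To make the bound used in this example concrete, the sketch below (hypothetical code; the graph shown merely echoes the structure suggested by Figure 13.1) computes the buffer size bound for a feedback edge as the minimum total delay over cycles containing that edge, i.e., the delay on the edge plus the minimum-delay path from its sink back to its source.

# Sketch (hypothetical code): the buffer size bound discussed above. For an edge
# e = (u, v) with delay d in an IPC/synchronization graph, the bound is the minimum
# total delay over all cycles containing e, which equals d plus the minimum-delay
# path from v back to u. In the example graph below, the sole cycle through (A, B)
# carries 3 delays, so the bound is 3, even though a buffer of size 1 suffices when
# t(A) and t(B) are guaranteed to be equal.

import heapq

def min_path_delay(graph, src, dst):
    # Dijkstra over nonnegative edge delays; graph[u] = [(v, delay), ...].
    dist, heap = {src: 0}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist.get(dst, float("inf"))

def buffer_bound(graph, edge):
    u, v, delay = edge
    return delay + min_path_delay(graph, v, u)

g = {"A": [("B", 0)], "B": [("C", 1)], "C": [("A", 2)]}
print(buffer_bound(g, ("A", "B", 0)))   # 3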
Another interesting problem is applying the synchronization minimization techniques to graphs that contain dynamic constructs. Suppose we schedule a graph that contains dynamic constructs using a quasi-static approach, or a more general approach if one becomes available. Is it still possible to employ the synchronization optimization techniques we discussed in Chapters 9 through 11? The first step would be to obtain an IPC graph equivalent for the quasi-static schedule that represents the control constructs a processor may execute as part of that schedule. If we can show that the conditions we established for a synchronization operation to be redundant (in Section 9.7) hold for all execution paths in the quasi-static schedule, then we can identify redundant synchronization points in the schedule; a sketch of this check appears below. It may also be possible to extend the strongly-connected-graph and resynchronization transformations to handle graphs containing conditional constructs; these issues require further investigation. In addition, the quasi-static scheduling approaches that have been proposed (e.g., Ha's techniques [HL97]) do not take into account the communication overhead of broadcasting control tokens to all the processors in the system. Multiprocessor scheduling approaches that do take this overhead into account are an interesting problem for future study.
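A minimal sketch of that check, under the assumption that the quasi-static schedule has been expanded into one IPC graph per execution path, is shown below; is_redundant stands for the per-graph redundancy test sketched earlier in this chapter, and all names are illustrative.

# Sketch (illustrative assumption): a synchronization edge may be removed from a
# quasi-static schedule only if the redundancy condition holds on the IPC graph of
# every execution path. `path_graphs` is a hypothetical list of graphs obtained by
# fixing each outcome of the control constructs; `is_redundant` is the per-graph test.

def redundant_in_quasi_static(path_graphs, edge, is_redundant):
    return all(is_redundant(g, edge) for g in path_graphs)

# Toy usage: two execution paths (TRUE and FALSE branches of one construct);
# the edge is redundant only on the TRUE-branch graph, so it must be retained.
true_graph, false_graph = {"kind": "TRUE"}, {"kind": "FALSE"}
check = lambda g, e: g["kind"] == "TRUE"
print(redundant_in_quasi_static([true_graph, false_graph], ("A", "B", 0), check))   # False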
[ABU91]
. Bic, and T. Ungerer. Evolution of dataflow computers. In Advanced Topics in Data-Flow Computing. Prentice Hall, 1991.
[ABZ92]
. Burnett, and B. A. Zimmerman. Operational versus definitional: A perspective on programming paradigms. IEEE Computer Magazine, 25(9), September 1992.
[ACD74]
T. L. Adam, K. M. Chandy, and J. R. Dickson. A comparison of list schedules for parallel processing systems. Communications of the ACM, 17(12):685-690, December 1974.
[Ack82]
W. B. Ackerman. Data flow languages. IEEE Computer Magazine, 15(2), February 1982.
[AHU87]
A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms. Addison-Wesley, 1987.
[AK87]
J. R. Allen and K. Kennedy. Automatic translation of FORTRAN programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4), October 1987.
[AN881
A. Aiken and A. Nicolau. Optimal loop parallelization. In Proceedings of the A CM Conference on Programming .LanguageDesign and Implementation, 1988.
[AN90]
Arvind and R. S. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. IEEE Transactions on Computers, 39(3), March 1990.
[ASI+98]
A. Abnous, K. Seno, Y. Ichikawa, M. Wan, and J. Rabaey. Evaluation of a low-power reconfigurable DSP architecture. In Proceedings of the Reconfigurable Architectures Workshop, 1998.

[A+87]
M.Annaratone et al. The Warp computer: Architecture, implementation, and performance. IEEE Transactions on Computers, C-36(12), December 1987.
[A+98]
S. Aiello et al. Extending a monoprocessor real-time system in a
DSP-based multiprocessing environment. In Proceedings of the
Internatio~al Conference on Acoustics, Speech, and Signal Processing, 1998.
[BB911
A. Benveniste and G. Berry. The synchronous approach to reactive and real-time systems. Proceedings of the IEEE, 79(9): 12701282, September 1991.
[BCOQ92]
F. Baccelli, G. Cohen, G. J. Olsder, and J. Quadrat. Synchronization and Linearity. John Wiley & Sons, Inc., 1992.
[BDW85]
J. Beetem, M. Denneau, and D. Weingarten. The GF11 supercomputer. In International Symposium on Computer Architecture, June 1985.
[BELP94]
C. Bilsen, M.Engels, R. Lauwereins,and J. A. Peperstraete. Static scheduling ofmulti-rate and cyclo-static DSP-applications. In P~oceedings ofthe International ~ o r k s h o p o nVLSI Signal Processing, 1994.
[BHCF95]
S. Banerjee, T. Hamada, P. M. Chau, and R. D. Fellman. Macro pipelining based scheduling on high performance heterogeneous multiprocessor systems. IEEE ~ransactions on Signal Processing, 43(6):1468-1484, June 1995.
[BHLM94]
J. T. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt. Ptolemy: A framework for simulating and prototyping heterogeneous systems. International Journal of Computer Simulation, January 1994.
[BHMSV84]
R. K. Brayton, G. D. Hachtel, C. T. McMullen, and A. L. Sangiovanni-Vincentelli. Logic Minimization Algorithms for VLSI Synthesis. Kluwer Academic Publishers, 1984.

[BL89]
J. Bier and E. A. Lee. Frigg: A simulation environment for multiprocessor DSP system development. In Proceedings of the International Conference on Computer Design, pages 280-283, October 1989.
[BL91]
B. Barrera and E. A. Lee. Multirate signal processing in Comdisco's SPW. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, April 1991.
[BL93]
S. S. Bhattacharyya and E. A. Lee. Scheduling synchronous dataflow graphs for efficient looping. Journal of VLSI Signal Processing, 1993.

[BL94]
S. S. Bhattacharyya and E. A. Lee. Memory management for dataflow programming of multirate signal processing algorithms. IEEE Transactions on Signal Processing, 42(5):1190-1201, May 1994.

[Bla87]
J. Blazewicz. Selected topics in scheduling theory. In Surveys in Combinatorial ~ptimization.North Holland Mathematica Studies, 1987.
[BML961
S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, 1996.
[Bok881
S. H. Bokhari. Partitioning problems in parallel, pipelined, and distributed computing. IEEE Transactions on C o ~ p u t e r s , 37(1):48--57,January 1988.
[Bor88]
G. Borriello. Combining events and data-flow graphs in behavioral synthesis. In Proceedings of the International Conference on on Computer-Aided Design,pages 56-59, 1988.
[BPFC94]
S. Banerjee, D. Picker, D. Fellman, and P. M. Chau. Improved scheduling of signal flow graphs onto multiprocessor systems through an accurate network modeling technique. In Proceedings of the International Workshop on VLSI Signal Processing, 1994.
[Bry861
R. E. Bryant. Graph based algorithms for boolean function manipulation. IEEE Transactions on Computers, 35(8):677--691, August 1986.
[BSL9O]
J. Bier, S. Sriram, andE. A. Lee. Aclass of multiprocessor architectures for real-time DSP. In Proceedings of the Inte~ational Workshop on VLSI Signal Processing,November 1990.
[BSL96a]
S. S. Bhattacharyya, S. Sriram,and E. A.Lee.Latency-constrained resynchronization for multiprocessor DSP implementation. In Proceedings of the Inte~ational Conference on Application Specific Systems, Architectures, and Processors, August 1996. Chicago, Illinois.
[BSL96b]
S. S. Bhattacharyya, S. Sriram, and E. A. Lee. Self-timed resynchronization: A post-optimization for static multiprocessor schedules. In Proceedings of the Inte~ationalParallel Processing S y ~ p o s i u mApril , 1996. Honolulu, Hawaii.
[BSL97]
S. S. Bhattacharyya, S. Sriram, and E. A. Lee. Optimizing synchronization in multiprocessor DSP systems. ZEEE Transactions on Signal Processing,45(6), June 1997.
[B+88]
S. Borkar et al. iWarp: An integrated solution to high-speed parallel computing. In Proceedings of Supercomputing, 1988.
[Buc93]
J. T. Buck. Scheduling Dynamic Dataflow Graphs with Bounded Memory Using the Token Flow Model. Ph.D. thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, September 1993.

G. Cohen, D. Dubois, and J. Quadrat. A linear system theoretic view of discrete event processes and its use for performance evaluation in manufacturing. IEEE Transactions on Automatic Control, March 1985.
R. A. Cuninghame-Green. Minimax Algebra. Lecture Notes in Economics and Mathematical Systems. Springer-Verlag, 1979.

[Cha84]
. Chase. A pipelined data flow architecture for digital signal processing: The NEC µPD7281. In Proceedings of the International Workshop on VLSI Signal Processing, November 1984.

P. Chretienne. Timed event graphs: A complete study of their controlled executions. In International Workshop on Timed Petri Nets, 1985.
P. Chretienne. Task scheduling over distributed memory machines. In Proceedings of the International Workshop on Parallel and Distributed Algorithms, 1989.

T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press.

D. C. Chen and J. M. Rabaey. A reconfigurable multiprocessor IC for rapid prototyping of algorithm-specific high-speed DSP data paths. IEEE Journal of Solid State Circuits, 27(12), December 1992.
L.-F. Chao and E. Sha. Unfolding and retiming data-flow DSP programs for RISC multiprocessor scheduling. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, April 1992.
. Silva. Structural techniques and performance bounds of stochastic Petri net models. In Advances in Petri Nets. Springer-Verlag, 1993.

L. Chao and E. H. Sha. Static scheduling for synthesis of DSP algorithms on various models. Journal of VLSI Signal Processing, pages 207-223, 1995.

[De94]
A. Dasdan and R. K. Gupta. Faster maximum and minimum mean cycle algorithms for system performance analysis. IEEE Transactions on Computer-Aided Design, October 1998.

S. M. H. de Groot, S. Gerez, and O. Herrmann. Range-chart-guided iterative data-flow graph scheduling. IEEE Transactions on Circuits and Systems, pages 351-364, May 1992.

A. Izatt, and G. ... Conference on Computer Vision and Pattern Recognition.
[Dij59]
E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269-271, 1959.

Academic Publishers, 1998.

R. Davoli et al. Parallel computing in networks of workstations with Paralex. IEEE Transactions on Parallel and Distributed Systems, 7(4), April 1996.

R. Durrett. Probability: Theory and Examples. Brooks/Cole, 1991.

[DZO92]
H. G. Dietz, A. Zaafrani, and M. T. O'Keefe. Static scheduling for barrier MIMD architectures. Journal of Supercomputing, 1992.

[EK72]
J. Edmonds and R. M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the Association for Computing Machinery, pages 248-264, April 1972.

The Pentium(R) processor with MMX technology. In Proceedings of the IEEE Computer Society International Conference, 1997.

[EL90]
H. El-Rewini and T. G. Lewis. Scheduling parallel program tasks onto arbitrary target machines. Journal of Parallel and Distributed Computing, pages 138-153, 1990.

[FKAJM93]
D. Filo, D. C. Ku, C. N. Coelho Jr., and G. De Micheli. Interface optimization for concurrent systems under timing constraints. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1(3), September 1993.

D. Filo, D. C. Ku, and G. De Micheli. Optimizing the control-unit through the resynchronization of operations. INTEGRATION, the VLSI Journal, pages 231-258, 1992.
[Fly661
M. J. Flynn. Very high-speed computing systems. Proceedings of the IEEE, December 1966.
[F+97]
R. Fromm et al. The energy efficiency of IRAM architectures. In ~nternationalSymposium on Computer Architecture,June 1997.
[GB911
J. Gaudiot and L. Bic, editors. Advanced Topics in ~ a t a ~ l o w Computing. Prentice Hall, 199 1.
[Ger951
S. Van Gerven. Multiple beam broadband beamforming: Filter design andreal-time i~plementation.In Proceedings of the IEEE ASSP ~ o r k s h o pon Applications of Signal Processing to Audio and Acoustics, 1995.
[GGA92]
K. Guttag, R. J. Grove, and J. R. Van Aken. Asingle-chip multiprocessor for multimedia: the MVP, IEEE Computer Graphics and Applications, 12(6), November 1992.
[GGD94]
R. Govindarajan, G. R. Gao, and P. Desai. Minimizing memory requirements in rate-optimal schedules, In Proceedings of the International Conference on Application Specific Array Processors, August 1994.
[GJ79]
M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[GMN96]
B, Gunther, G. Milne, and L. Narasimhan. Assessing document relevance with run-time reconfigurable machines. Proceedings In of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 10-17, April 1996.
[Gra691
R. L. Graham.Boundsonmultiprocessingtiminganomalies. S I A Journal ~ of Applied ~ a t h17(2):416"429, , March 1969.
[Gri88]
C. M. Grinstead. Cycle lengths in A%*. SIAM Journal on ~ a t r i x
Analysis, October 1988.
[GS92]
F. GasperoniandUweSchweigelshohn.Schedulingloops on parallel processors: A simple algorithm with close to optimum performance. In Proceedings of the International Conferenceon Vector & Parallel Processors, September 1992.
[G+91]
M. Gokhale et al. Building and using a highly programmable logic array. IEEE Computer Magazine, 24(1):81-89, January 1991.
[G+92]
A. Gunzinger et al. Architecture and realization of a multi signal processor system. In Proceedings of the International Conference on Application SpecificArray Processors, pages 327-340, 1992.
[GVNG94]
D. J. Gajski, F.Vahid, S. Narayan, and J. Gong. Specification and Designof Embedded Systems. Prentice Hall, 1994.
[GW92]
B. Greer and J. Webb. Real-time supercomputing on iWarp. In Proceedings of the SPIE, 1992.
[GY92]
A. Gerasoulis and T. Yang. A comparison of clustering heuristics for scheduling directed graphs on multiprocessors. Journal of Parallel and Distributed Computing, 16:276-291, 1992.
[Ha92]
S. Ha. Compile Time Scheduling of Dataflow Program Graphs with Dynamic Constructs. Ph.D. thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, April 1992.

[Hal93]
N. Halbwachs. Synchronous Programming of Reactive Systems. Kluwer Academic Publishers, 1993.
[Hav91]
B. R. Haverkort. Approximate performabilityanalysis using generalized stochastic Petri nets. In Proceedings of the Inte~ational ~ o r k s h o pon Petri Nets and Perjiormance Models, pages 176185, 1991.
[HCA89]
J. J. Hwang, Y. C. Chow, and F.D. Anger. Scheduling precedence graphsinsystemswith inter-processor communication times. S I A Journal ~ of Computing, 18(2):244-257, April 1989.
[HCRP91]
N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud. The synchronous data flow programming language LUSTRE. Proceedings of the IEEE, September 1991.
In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 105-
Hilgenstock, and P. Pirsch. Design of a development system for multimedia applications based on an array. In Proceedings of the International Conference on Electronics, Circuits, and Systems, pages 1151-1154, 1996.

[HL97]
S. Ha and E. A. Lee. Compile-time scheduling of dynamic constructs in dataflow program graphs. IEEE Transactions on Computers, 46(7), July 1997.
M. Hartmann and J. B. Orlin. Finding minimum cost to time ratio cycles with small integral transit times. Networks, September 1993.

L. Hammond and K. Olukotun. Considerations in the design of Hydra: A multiprocessor-on-a-chip microarchitecture. Technical Report CSL-TR-98-749, Stanford University Computer Systems Lab, February 1998.
Engineering and Computer Sciences, University of ... In Proceedings of the International Conference on Application Specific Array Processors, August 1992.

et al. Synthesis of synchronous communication in a multiprocessor architecture. Journal of VLSI Signal Processing, 6:289-299, 1993.

Parallel sequencing and assembly line problems.
April 1997. [Joh74]
D. S. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, pages 256-278, 1974.
[Jr.76]
E. G. Coffman, Jr. Computer and Job Shop Scheduling Theory. John Wiley & Sons, Inc., 1976.
[KA96]
Y. K. Kwok and I. Ahmad. Dynamic critical path scheduling: An effective technique for allocating task graphs to multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1996.

[Kar78]
R. M. Karp. A characterization of the minimum cycle mean in a digraph. Discrete Mathematics, 23, 1978.

S. J. Kim and J. C. Browne. A general approach to mapping of parallel computations upon multiprocessor architectures. In Proceedings of the International Conference on Parallel Processing, 1988.
[Kim88]
Kim. A General Approach to Multiprocessor Scheduling. Ph.D. thesis, Department of Computer Science, University of Texas at Austin, 1988.

A. Kalavade and E. A. Lee. A hardware/software codesign methodology for DSP applications. IEEE Design and Test of Computers Magazine, 10(3):16-28, September 1993.
[KLL87]
S. Y. Kung, P. S. Lewis, and S. C. Lo. Performance analysis and optimization of VLSI dataflow arrays. Journal of Parallel and Distributed Computing, pages 592-618, 1987.

R. M. Karp and R. E. Miller. Properties of a model for parallel computations: Determinacy, termination, queueing. SIAM Journal of Applied Math, 14(6), November 1966.

D. C. Ku and G. De Micheli. Relative scheduling under timing constraints: Algorithms for high-level synthesis of digital circuits. IEEE Transactions on Computer-Aided Design, 11(6):696-718, June 1992.

W. H. Kohler. A preliminary evaluation of the critical path method for scheduling tasks on multiprocessor systems. IEEE Transactions on Computers, pages 1235-1238, December 1975.
[Koh901
rable Syste~ for oh. A ~ e c o n ~ ~ ~ ~ultiprucessor
BIBLIOGRAPHY
havioral Simulation.Ph.D. thesis, Department ofElectrical EngineeringandComputerSciences,UniversityofCalifornia at Berkeley, June 1990. [Km871
B. Kruatrachue. Static Task Scheduling and Grain Packing in Parallel Processing Systems. Ph.D. thesis, Department of Computer Science, Oregon State University, 1987.
[KS83]
K. Karplus andA. Strong, Digital synthesis of plucked-string and drum timbres. Computer Music Journal,7(2):56-69, 1983.
[Kun88]
S . Y. Kung. VLSZ Arrays Processors. Prentice Hall, Englewood Cliffs, N.J., 1988.
[LAAG94]
G. Liao, E. R. Altman, V.K. Agarwal, and G. R. Gao. A cornparative study of DSP multiprocessor list scheduling heuristics. In Proceedings of the Hawaii Inte~ationalConference on System Sciences, 1994.
[Lam861
L. Lamport. The mutual exclusion problem: Part I and II. Journal of the Association for Computing Machinery, 33(2):313-348, April 1986.
[Lam881
M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM Conference on Programming Language Design and Zmplementation, pages 318328, June 1988.
[Lam891
M. Lam. A Systolic Array ~ptimizingCompiler. Kluwer Academic Publishers, 1989.
[Lap9 11
P. D. Lapsley. Host interface and debugging of dataflow DSP systems. Master’s thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, December 1991.
[Law761
E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, 1976.
[LE3901
E. A. Lee and J. C. Bier. Architectures for statically scheduled dataflow. Journal of Parallel and DistributedComputing, 10:333-348, December 1990.
[LBSL94]
P. Lapsley, J. Bier, A. Shoham, and E. A. Lee. DSP Processor ~undamentals.Berkeley Design Technology, Inc., 1994.
[LDK98]
S. Y. Liao, S. Devadas, and K. Keutzer. Code density optimization for embedded DSP processors using data compression techniques. IEEE Transactions on Computer-Aided Design, 17(7):601-608, July 1998.

[LDK+95]
S. Liao, S. Devadas, K. Keutzer, S. Tjiang, and A. Wang. Code optimization techniques for embedded DSP microprocessors. In Proceedings of the Design Automation Conference,June 1995.
[LEAP941
R. Lauwereins, M. Engels, M. Ade, and J. A. Peperstraete. Grape-II: Graphical rapid prototyping environment for digital signal processing systems. In Proceedings of the International Conference on Signal Processing Applications and Technology, 1994.
[Lee861
E. A. Lee. A Coupled Hardware and S o f ~ a r eArchitecture for Programmable DSPs.Ph.D. thesis, Department of Electrical EngineeringandComputerSciences,University of California at Berkeley, May 1986.
[Lee88a]
E. A. Lee. Programmable DSP architectures, Part I. IEEE ASSP Magazine, 5(4), October 1988.
[Lee88b]
E. A. Lee. Recurrences, iteration, and conditionals in statically scheduled block diagram languages. InProceedings ofthe International ~ o r k s h o pon VLSI Signal Processing, 1988.
[Lee9 1]
E. A. Lee. Consistencyin dataflow graphs. IEEE Transactions on Parallel and Distri~utedSystems, 2(2), April 1991.
[Lee931
E. A. Lee. Representing and exploiting data parallelism using multidimensional dataflow diagrams. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 453-456, April 1993.
[Lee961
R. B. Lee. Subword parallelism with MAX2. ZEEE Micro, 16(4), August 1996.
[Lei921
F.T.Leighton. Introduction to Parallel Algorithmsand Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers Inc., 1992.
[LH89]
E. A. Lee and S. Ha. Scheduling strategies for multiprocessor real time DSP. In Global Telecommunications Conference, November 1989.
[LLG+92]
D.Lenoski, J. Laudon, K. Gharachorloo, W. D.Weber,and J. Hennessey. The Stanford DASH multiprocessor. IEEE Computer ~ a g a z i n eMarch , 1992.
[LM87]
E. A. Lee and D. G. Messerschmitt. Static scheduling of synchronous dataflow programs for digital signal processing. IEEE Transactions on Computers, February 1987.
Li and S. Malik. Performance analysis of embedded software using implicit path enumeration. In Proceedings of the Design Automation Conference, 1995.

E. Lemoine and D. Merceron. Run-time reconfiguration of FPGA for scanning genomic databases. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 90-98, April 1996.
[Lou931
J. Lou. Application development on the Intel Warp system. In ~roceedingsof the SPIE, 1993.
[Lov75]
L. Lovasz. On the ratio of optimal integral and fractional covers. Discrete ~ a t h e ~ a t i cpages s , 383-390, 1975.
[LP81]
H. R. Lewis and C. H. Papadimitriou. Elements of the Theory of Computation. Prentice Hall, 1981.

[LP82]
H. R. Lewis and C. H. Papadimitriou. Elements of the Theory of Computation. Prentice Hall, 1982.
[LP951
E. A. Lee andT.M. Parks. Dataflow process networks.Proceedings ofthe IEEE9pages 773-799, May 1995.
[LP981
W, Liu and V.K, Prasanna. ~tilizingthe power of high-performance computing.IEEE Signal Processing 100, September 1998.
[LS91]
C. E. Leiserson and J.B. Saxe. Retiming synchronous circuitry. Algorithmica, pages 5-35, 1991.
V. Madisetti. VLSI Digital Signal Processors. IEEE Press, 1995.

E. Mirsky and A. DeHon. MATRIX: A reconfigurable computing device with configurable instruction distribution and deployable resources. In Proceedings of the Hot Chips Symposium, August 1997.

D. G. Messerschmitt. Breaking the recursive bottleneck. In J. K. Skwirzynski, editor, Performance Limits in Communication Theory and Practice. Kluwer Academic Publishers, 1988.

J. J. Thompson, et al. Heuristics for scheduling DAGs onto processors. In Proceedings of the International Parallel Processing Symposium, pages 446-451, 1994.
F. Moussavi and D. G. Messerschmitt. Statistical memory management for digital signal processing. In Proceedings ofthe International Symposium on Circuits and Systems, pages 1011-1 014, ay 1992.
[M01821
M. K. Molloy. Performance analysis using stochastic Petri nets. IEEE ~ransactionson Computers, September 1982.
[Mot89]
Motorola Inc. DSP96002 IEEE Floating-Point Dual-Port Processor User's Manual, 1989.

[Mot90]
Motorola Inc. DSP96000ADS Application Development System Reference Manual, 1990.
G. Mouney. Parallel solution of linear ODE’S: Implementation on transputer networks. Concurrent Systems Engineering Series, 1996. N. Morgan et al. The ring array processor: A multiprocessing peripheral for connectionist applications. J o u ~ a olf Parallel and ~ i s t r i ~ u t eComputing9 d 14(3):248-259, March 1992.
[Mur89]
T.Murata. Petri nets: Properties, analysis, and applications. Proceedings of the IEEE, pages 39-58, January 1989.
[Nic89]
D. M. Nicol. Optimal partitioning of random programs across two processors. IEEE Transactions on Computers, 15(2):134-141, 1989.
[OlS89]
G. J. Olsder. Performance analysis of data-driven networks. In J. McCanny, J. McWhiter, and E. Swartzlander Jr., editors, Systolic Array Processors; Contributions by Speakers at the International Conference on Systolic Arrays. Prentice Hall, 1989.
[ORVK90]
G. J. Olsder, J. A. C. Resing, R. E. De Vries, and Discrete event systems with stochastic processing times. IEEE ~ransactionson Automatic Control, 35(3):299-302, March 1990.
[O+96]
K. Olukotun et al. The case for a single-chip multiprocessor. ACM SIGPLAN Notices, 31(9):2-11, September 1996.
[Ous94]
J.K. Ousterhout. An Introduction to Tcl and Tk. Addison-Wesley, 1994.
[Pap90]
G. M. Papadopoulos. Monsoon: A dataflow computing architecture suitable for intelligent control. In Proceedings of the 5th IEEE International Symposium on Intelligent Control, 1990.

A. Papoulis. Probability, Random Variables and Stochastic Processes. McGraw-Hill, 1991.

[PBL95]
J. L. Pino, S. S. Bhattacharyya, and E. A. Lee. A hierarchical multiprocessor scheduling system for DSP applications. In Proceedings of the ZEEE Asilomar Conference on Signals, Systems, and Computers, November 1995.
[Pet811
J.L. Peterson. Petri Net Theory and the odel ling of Systems. Prentice Hall, 1981.
[pH961
D. A. Patterson and J.L. Hennessey. Computer Architecture : a ~uantitativeApproach. Morgan Kaufmann Publishers Inc., second edition, 1996,
[PHLB95]
J.Pino, S. Ha, E. A. Lee, and J. T. Buck. Software synthesis for DSP using Ptolemy. Journal of VLSI Signal Processing, 9(l), January 1995.
[PLN92]
D. B. Powell, E. A. Lee, and W. C. Newman. Direct synthesis of optimized DSP assembly code from signal flow block diagrams. In Proceedings of the Inte~ationalConference on Acoustics, Speech, and Signal Processing, March 1992.
[PM911
K. K. Parhi and D. G. Nlesserschmitt. Static rate-optimal scheduling of iterative data-flowprogramsviaoptimum unfolding. IEEE Transactions on Computers, 40(2): 178-194,February 1991.
[Pm871
M. Prastein. Precedence-co~strainedscheduling with minimum time and communication. Master’s thesis, University ofIllinois at Urbana-Champaign, 1987.
[Pri91]
H. Printz. Automatic Mapping of Large Signal Processing Systems to a Parallel Machine. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, May 1991.
[Pri92]
H. Printz. Compilation of narrowband spectral detection systems for linear MIMD machines. In P~oceedingsof the Znternational Conference on App~icationSpecific Array Processors, August 1992.
[PS941
M. Potkonjac andM.B. Srivastava. Behavioral synthesis of high performance, and low power application specific processors for
linear computations. In Proceedings of the International Conference on Application Specific Array Processors, pages45-56, 1994. [P+97]
D. Patterson et al. A case for intelligent RAM: IRAM. IEEE Micro, April 1997.
[Pto98]
Department of Electrical Engineering and Computer Sciences, University ofCalifornia at Berkeley. The Almagest: A Manual for Ptolemy, 1998.
[Pur971
S. Purcell. Mpact 2 media processor, balanced 2X performance. In Proceedings of SPIE, 1997.
[PY90]
C. Papadimitriou and M. Yannakakis. Toward an architecture-independent analysis of parallel algorithms. SIAM J o u ~ a olf Computing, pages 322-328, 1990.
[Rao85]
S. Rao. Regular Iterative Algorithms and their Implementation on Processor Arrays. Ph.D. thesis, Stanford University, October 1985.

[RCG72]
C. V. Ramamoorthy, K. M. Chandy, and M. J. Gonzalez. Optimal scheduling strategies in multiprocessor systems. IEEE Transactions on Computers, February 1972.
[RCHP91]
J. M. Rabaey, C. Chu, P. Hoang, and M. Potkonjak. Fast prototyping of datapath intensive architectures. IEEE Design and Test of Computers Magazine, 8(2):40-51, June 1991.

D. Regenold. A single-chip multiprocessor DSP solution for communications applications. In Proceedings of the IEEE International ASIC Conference and Exhibit, pages 437-440, 1994.
[Rei681
R. Reiter. Scheduling parallel computations. Journal of the Association for C o m p u t i n ~ M a c h ~October n e ~ , 1968.
[RH801
C, V. Rarnamoorthy and G. S. Ho. Performance evaluation of asynchronous concurrent systems using Petri nets. IEEE Transactions on Software Engineering, SE-6(5):440-449, September 1980.
[RN81]
M. Renfors and Y. Neuvo. The maximum sampling rate of digital filters under hardware speed constraints. IEEE Transactions on Circuits and Systems, March 1981.
[RPM92]
S. Ritz, M. Pankert, and H. Meyr. High level software synthesis for signal processing systems. In Proceedings of the International Conference on Application Specific Array Processors, August 1992.

S. Rajsbaum and M. Sidi. On the performance of synchronized programs in distributed networks with random processing times and transmission delays. IEEE Transactions on Parallel and Distributed Systems, 5(9), September 1994.

[RS98]
S. Rathnam and G. Slavenburg. Processing the new world of interactive media. IEEE Signal Processing Magazine, 15(2), March 1998.

S. Ramaswamy, S. Sapatnekar, and P. Banerjee. A framework for exploiting task and data parallelism on distributed memory multicomputers. IEEE Transactions on Parallel and Distributed Systems, 8(11), November 1997.
[Sar88]
V. Sarkar. Synchronization using counting semaphores. In Proceedings of the International Symposium on Supercomputing, 1988.
[Sar891
V. Sarkar. Partitioning and Scheduling Parallel Programs for Multiprocessors. MIT Press, 1989.
N. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli. Graph algorithms for clock schedule optimization. In Proceedings of the International Conference on Computer-Aided Design, pages 132-136, 1992.

[Sch88]
H. Schwetman. Using CSIM to model complex systems. In Proceedings of the 1988 Winter SimulutionConference,pages 246" 253,1988.
[Sha89]
P. L. Shaffer. Minimization of interprocessor synchronization in multiprocessors with shared andprivate memory. In Proceedings of the InternationulConference on Purallel ~ r o c e s s ~ n1989. g,
[Sha98]
M. El Sharkawy. Multiprocessor 3D sound system. In Proceedings of the Midwest Symposium on Circuits and Systems, 1998.
[SHL+97]
D. Shoemaker, F. Honore, P. LoPresti, C. Metcalf, and A unified system for scheduled com~unication.In Proceedings of the 1nte~utionulConference on Parallel and DistributedPro-
cessing ~echniques and Applications,July 1997,
[SI85]
D. A. Schwartz and T. P. Barnwell III. Cyclo-static solutions: Optimal multiprocessor realizations of recursive algorithms. In Proceedings of the International Workshop on VLSI Signal Processing, pages 117-128, June 1985.
[Sih91]
G. C. Sih. Multiprocessor Scheduling to Account for Interprocessor Communication. Ph.D. thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, April 1991.
[SL901
G. C. Sih and E. A.Lee. Scheduling to account for interprocessor co~munication within interconnection-constrained processor networks. In Proceedings of the International Conference on Parallel Processing, 1990.
[SL93a]
G. C. Sih and E. A. Lee. A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. IEEE Transactions on Parallel and Distributed Systems, 4(2):75-87, February 1993.
[SL93b]
G. C. Sih and E. A. Lee. Declustering: A new multiprocessor scheduling technique. ZEEE Transactions on Parallel and Distri~utedSystems, 4(6), June 1993.
[SL94]
S. Sriram and E. A. Lee. Statically scheduling com~unicationresources in multiprocessor DSP architectures. In Proceedings of the IEEE Asilomar Conferenceon Signals, Systems, and Computers, November 1994.
[Sri92]
M. B. Srivastava. Rapid Prototyping of Hardware and Software in a Unified Framework. Ph.D. thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, June 1992.

[Sri95]
S. Sriram. Minimizing Communication and Synchronization Overhead in Multiprocessors for Digital Signal Processing. Ph.D. thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, 1995.
[St0771
H. S. Stone. ~ultiprocessorscheduling with the aid of network flow algorithms. IEEE Transactions on Software Engineering, 3( 1):85--93,January 1977.
[Sto91]
A. Stolzle. A Real Time Large Vocabulary Connected Speech Recognition System. Ph.D. thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, December 1991.

[SW92]
R. R. Shively and L. J. Wu. Application and packaging of the
AT&T DSP3 parallel signal processor. In Proceedings of the International Conference on Application Specific Array Processors, pages 316-326, 1992.

[Tex98]
Texas Instruments. TMS320C62x/C67x CPU and Instruction Set Reference Guide, March 1998.
[TONL96]
M. Tremblay, J. M. O'Connor, V. Narayanan, and H. Liang. VIS speeds new media processing. IEEE Micro, 16(4), August 1996.
A.Trihandoyo et al. Real-time speech recognition
architecture for a multi-channel interactive voice response system. In Pro-
ceedings of the International Conference on Acoustics, Speech, and Signal Processing,1995.
[TTL95]
J. Teich, L. Thiele, and E. A. Lee. Modeling and simulation of heterogeneous real-time systems based on a deterministic discrete event model. In Proceedings of the InternationaZ Symposium on Systems S~nthesis, pages 156-1 6 1,1995,
[Vai93]
P. P. Vaidyanathan. Multirate Systems and Filter Banks. Prentice Hall, 1993.
[VLS86]
VLSI CAD Group, Stanford University. Thor Tutorial, 1986.
[VPS90]
M. Veiga, J. Parera, and J. Santos. Programming DSP systemson multiprocessor architectures. In Proceedings ofthe Internationa~ Conference on Acoustics, Speech, and Signal Processing,April 1990.
[V+96]
J. E. Vuillemin et al. Programmable active memories:Reconfigurable systems come of age. IEEE Transactions on Very Large Scale Integration(VLSI) Systems,4( l), March 1996.
[WLR98]
A. Y. Wu, K. J. R. Liu, and A. Raghupathy. System architecture of an adaptive reconfigurable DSP computing engine. IEEE Transactions on Circuits and Systems for Video Technology, February 1998.
[W+97]
E. Waingold et al. Baring it all to software: Raw machines. IEEE Computer Magazine, pages 86-93, September 1997.
[YG94]
T. Yang and A. Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems, 5(9):951-967, September 1994.
[YM96]
J. S. Yu and P. C. Mueller. On-line Cartesian space obstacle avoidance scheme for robot arms. Mathematics and Computers in Simulation, August 1996.
[Yu84]
W. Yu. LU Decomposition on a Multiprocessing System with Communication Delay. Ph.D. thesis, University of California at Berkeley, 1984.
[YW93]
L. YaoandC.M.Woodside. Iterative decompositionandaggregation of stochastic marked graphPetri nets. In G. Rosenberg, editor, Advances in Petri Nets 1993. Springer-Verlag, 1993.
[ZKM94]
V, Zivojnovic, H. Koerner,and H. Meyr.Multiprocessorscheduling with a-priori node assignment. InProceedings of theInternational Workshop on VLSI Signal Processing,1994.
[ZRM94]
V. Zivojnovic, S. Ritz, and H. Meyr. Retiming of DSP programs for optimum vectorization. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, April 1994.

[ZS89]
A. Zaky and P. Sadayappan. Optimal static scheduling of sequential loops on multiprocessors. In Proceedings of the International Conference on Parallel Processing, pages 130-137, 1989.
[ZVSM95]
V. Zivojnovic, J. M. Velarde, C. Schlager, and H. Meyr. DSPSTONE: A DSP-oriented benchmarking methodology. In Proceedings of the International Conference on Signal Processing Applications and Technology, 1995.
branch actors 91
chain-structured graph 44
clustering algorithms 87
communication edges 182
complexity of algorithms 45
computation graph 33
computation graphs 32
connected component 44
connected graph 44
constraint graph 242
contributes to the elimination of 246
convex function 150
critical path 58
dead-end path of a graph 43
deadlocked synchronization graph 215
declustering 91
delays 34
Dijkstra's algorithm 48
dominant sequence clustering 89
dynamic critical path 86
dynamic level 86
dynamic level scheduling 85
earliest actor-processor mapping 85
earliest task first 84
elimination of synchronization edges 246
estimated throughput 143
ETF 80, 84
execution profile for a dynamic construct 65
feedback edge 182
implementation on the OMA architecture 132
first-iteration graph 250
forward constraint 242
fully-connected interconnection network 76
functional parallelism 56
fundamental cycle of a graph 43
Gantt chart 58
Graham's bound for list scheduling anomalies 83
Graph data structures 31
highest level first with estimated times 84
homogeneous SDF graph 36
homogeneous synchronous dataflow graphs (HSDF)
Branch 91
idle time 58
ILP in programmable architectures 15
initial tokens 34
input edge of a vertex in a graph 43
input hub 237
instruction 14
Instruction level parallelism 14
internalization 89
interprocessor communication (IPC)
Karplus-Strong algorithm 129
latency-constrained resynchronization
latency-constrained resynchronization problem
merge actor 92
minimum-delay path 43
Ordered Transactions Strategy 101
origination of a path in a graph 43
output edge of a vertex in a graph 43
output hub 237
overlapped schedule 61
pairwise resynchronization problem 218
partially scheduled graph 89
path delay 43
polynomial time algorithms 45
precedence constraint 68
priority list 80
processor assignment step 56, 89
QMF filter bank 131
ready 80
ready-list scheduling 80
reconfigurable computing 25
Reduced Instruction Set Computer 16
reduced synchronization graph 191
redundant synchronization edge 191
relative scheduling 243
repetitions vector 35
resynchronization 215
resynchronization edge 215
resynchronization problem 218
resynchronized graph 215
retiming 61
scheduling problem 76
scheduling state 86
selected input hub 237
selected output hub 237
self-timed buffer bound 184
self-timed scheduling (ST) 62
set covering problem 218
Set-covering problem 46
Shortest and longest paths in graphs 47
Single Chip Multiprocessors 23
Solving difference constraints 50
Stone's Assignment Algorithm 76
strongly connected component (SCC) of a graph 44
strongly connected graph 44
subsumption of a synchronization edge 219
Sub-word parallelism 17
Superscalar processors 14
synchronization graph 42, 188
Synchronous 7
Synchronous Dataflow 7
Synchronous languages 42
task graph 76
termination of a path in a graph 43
The Bellman-Ford algorithm 48
The Floyd-Warshall algorithm 49
tokens 6
topological sort 44
topology matrix 35
TPPI 91
Transaction 114
transaction controller 107
transaction order 102
Transaction Order Controller 114
transitive closure 91
transparent 250
two-path parallelism instance 91
two-processor latency-constrained resynchronization 261
unbounded buffer synchronization 185
Undecidable problems 8
unfolding 61
Unfolding HSDF graphs 69
vertex cover 46
VLIW processors 14
well-ordered graph 44