Embedded Multiprocessors: Scheduling and Synchronization
SUNDARARAJAN SRIRAM
SHUVRA S. BHATTACHARYYA
MARCEL DEKKER, INC.
NEW YORK · BASEL
2000 00-0~2900 This book is printed on acid-free paper.
Marcel Dekker, Inc. 270 Madison Avenue, New York, NY 10016 tel: 212-696-9000; fax: 212-685-4540
Marcel Dekker AG Hutgasse 4, Postfach 812, CH-4001 Basel, Switzerland tel: 41-61-261-8482; fax: 41-61-261-8896
The publisher offers discounts on this book when ordered in bulk quantities. For more information, write to Special Sales/Professional Marketing at the headquarters address above.
Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher. Current printing (last digit) 10 9 8 7 6 5 4 3 2 1
To my parents, and Uma
Sundararajan Sriram

To Arundhati
Shuvra S. Bhattacharyya
Over the past 50 years, digital signal processing has evolved as a major engineering discipline. The fields of signal processing have grown from the origin of fast Fourier transform and digital filter design to statistical spectral analysis and array processing, and image, audio, and multimedia processing, and shaped developments in high-performance VLSI signal processor design. Indeed, there are few fields that enjoy so many applications: signal processing is everywhere in our lives. When one uses a cellular phone, the voice is compressed, coded, and modulated using signal processing techniques. As a cruise missile winds along hillsides searching for the target, the signal processor is busy processing the images taken along the way. When we are watching a movie in HDTV, millions of audio and video data are being sent to our homes and received with unbelievable fidelity. When scientists compare samples, fast pattern recognition techniques are being used. On and on, one can see the impact of signal processing in almost every engineering and scientific discipline.

Because of the immense importance of signal processing and the fast-growing demands of business and industry, this series on signal processing serves to report up-to-date developments and advances in the field. The topics of interest include but are not limited to the following:

    Signal theory and analysis
    Statistical signal processing
    Speech and audio processing
    Image and video processing
    Multimedia signal processing and technology
    Signal processing for communications
    Signal processing architectures and VLSI design

I hope this series will provide the interested audience with high-quality, state-of-the-art signal processing literature through research monographs, edited books, and rigorously written textbooks by experts in their fields.
K. J. Ray Liu
Embedded systems are computers that are not first and foremost computers. They are pervasive, appearing in automobiles, telephones, pagers, consumer electronics, toys, aircraft, trains, security systems, weapons systems, printers, modems, copiers, thermostats, manufacturing systems, appliances, etc. A technically active person today probably interacts regularly with more embedded systems than conventional computers. This is a relatively recent phenomenon. Not so long ago automobiles depended on finely tuned mechanical systems for the timing of ignition and its synchronization with other actions. It was not so long ago that modems were finely tuned analog circuits.

Embedded systems usually encapsulate domain expertise. Even small software programs may be very sophisticated, requiring deep understanding of the domain and of supporting technologies such as signal processing. Because of this, such systems are often designed by engineers who are classically trained in the domain, for example, in internal combustion engines or in communication theory. They have little background in the theory of computation, parallel computing, and concurrency theory. Yet they face one of the most difficult problems addressed by these disciplines, that of coordinating multiple concurrent activities in real time, often in a safety-critical environment. Moreover, they face these problems in a context that is often extremely cost-sensitive, mandating optimal designs, and time-critical, mandating rapid designs.

Embedded software is unique in that parallelism is routine. Most modems and cellular telephones, for example, incorporate multiple programmable processors. Moreover, embedded systems typically include custom digital and analog hardware that must interact with the software, usually in real time. That hardware operates in parallel with the processor that runs the software, and the software must interact with it much as it would interact with another software process running in parallel. Thus, in having to deal with real-time issues and parallelism, the designers of embedded software face on a daily basis problems that occur only in esoteric research in the broader field of computer science.
Computer scientists refer to use of physically distinct computational resources (processors) as “parallelism,” and to the logical property that multiple activities occur at the same time as “concurrency.” Parallelism implies concurrency, but the reverse is not true. Almost all operating systems deal with concurrency, which is managed by multiplexing multiple processes or threads on a single processor. A few also deal with parallelism, for example by mapping threads onto physically distinct processors.

Typical embedded systems exhibit both concurrency and parallelism, but their context is different from that of general-purpose operating systems in many ways. In embedded systems, concurrent tasks are often statically defined, largely unchanging over the lifetime of the system. A cellular phone, for example, has distinct modes of operation (dialing, talking, standby, etc.), and in each mode of operation a well-defined set of tasks is concurrently active (speech encoding, etc.). The static structure of the concurrency permits much more detailed analysis and optimization than would be possible in a more dynamic environment. This book is about such analysis and optimization. The ordered transaction strategy, for example, leverages that relatively static structure of embedded software to dramatically reduce the synchronization overhead of communication between processors. It recognizes that embedded software is intrinsically less predictable than hardware and more predictable than general-purpose software. Indeed, minimizing synchronization overhead by exploiting static information about the application is the major theme of this book.

In general-purpose computation, communication is relatively expensive. Consider for example the interface between the audio hardware and the software of a typical personal computer today. Because the transaction costs are extremely high, data is extensively buffered, resulting in extremely long latencies. A path from the microphone of a PC into the software and back out to the speaker typically has latencies of hundreds of milliseconds. This severely limits the utility of the audio hardware of the computer. Embedded systems cannot tolerate such latencies.

A major theme of this book is communication between components. The methods given in the book are firmly rooted in manipulable and tractable formalisms, and yet are directly applied to hardware design. The closely related IPC (interprocessor communication) graph and synchronization graph models, introduced in Chapters 7 and 9, capture the essential properties of this communication. By use of graph-theoretic properties of IPC and synchronization graphs, optimization problems are formulated and solved. For example, the notion of resynchronization, where explicit synchronization operations are minimized through manipulation of the synchronization graph, proves to be an effective optimization tool.

In some ways, embedded software has more in common with hardware than with traditional software. Hardware is highly parallel. Conceptually, hardware is an assemblage of components that operate continuously or discretely in time and interact via synchronous or asynchronous communication. Software is an assemblage of components that trade off use of a CPU, operating sequentially, and communicating by leaving traces of their (past and completed) execution on a stack or in memory. Hardware is temporal. In the extreme case, analog hardware operates in a continuum, a computational medium that is totally beyond the reach of software. Communication is not just synchronous; it is physical and fluid. Software is sequential and discrete. Concurrency in software is about reconciling sequences. Concurrency in hardware is about reconciling signals. This book examines parallel software from the perspective of signals, and identifies joint hardware/software designs that are particularly well-suited for embedded systems.

The primary abstraction mechanism in software is the procedure (or the method in object-oriented designs). Procedures are terminating computations. The primary abstraction mechanism in hardware is a module that operates in parallel with the other components. These modules represent non-terminating computations. These are very different abstraction mechanisms. Hardware modules do not start, execute, complete, and return. They just are. In embedded systems, software components often have the same property. They do not terminate. Conceptually, the distinction between hardware and software, from the perspective of computation, has only to do with the degree of concurrency and the role of time.
An application with a large amount of concurrency and a heavy temporal content might well be thought of using the abstractions that have been successful for hardware, regardless of how it is implemented. An application that is sequential and ignores time might well be thought of using the abstractions that have succeeded for software, regardless of how it is implemented. The key problem becomes one of identifying the appropriate abstractions for representing the design. This book identifies abstractions that work well for the joint design of embedded software and the hardware on which it runs.

The intellectual content in this book is high. While some of the methods it describes are relatively simple, most are quite sophisticated. Yet examples are given that concretely demonstrate how these concepts can be applied in practical hardware architectures. Moreover, there is very little overlap with other books on parallel processing. The focus on application-specific processors and their use in embedded systems leads to a rather different set of techniques. I believe that this book defines a new discipline. It gives a systematic approach to problems that engineers previously have been able to tackle only in an ad hoc manner.
Edward A. Lee
Professor
Department of Electrical Engineering and Computer Sciences
University of California at Berkeley
Berkeley, California
Software implementation of compute-intensive multimedia applications such as video conferencing systems, set-top boxes, and wireless mobile terminals and base stations is extremely attractive due to the flexibility, extensibility, and potential portability of programmable implementations. However, the data rates involved in many of these applications tend to be very high, resulting in relatively few processor cycles available per input sample for a reasonable processor clock rate. Employing multiple processors is usually the only means for achieving the requisite compute cycles without moving to a dedicated ASIC solution. With the levels of integration possible today, one can easily place four to [...] digital signal processors on a single die; such an integrated multiprocessor strategy is a promising approach for tackling the complexities associated with future systems-on-a-chip. However, it remains a significant challenge to develop software solutions that can effectively exploit such multiprocessor implementation platforms.

Due to the great complexity of implementing multiprocessor software, and the severe performance constraints of multimedia applications, the development of automatic tools for mapping high level specifications of multimedia applications into efficient multiprocessor realizations has been an active research area for the past several years. Mapping an application onto a multiprocessor system involves three main operations: assigning tasks to processors, ordering tasks on each processor, and determining the time at which each task begins execution. These operations are collectively referred to as scheduling the application on the given architecture. A key aspect of the multiprocessor scheduling problem for multimedia system implementation that differs from classical scheduling contexts is the central role of interprocessor communication: the efficient management of data transfer between communicating tasks that are assigned to different processors.
Since the overall costs of interprocessor communication can have a dramatic impact on execution speed and power consumption, effective handling of interprocessor communication is crucial to the development of cost-effective multiprocessor implementations.

This book reviews important research in three key areas related to multiprocessor implementation of multimedia systems, and exposes important synergies between efforts related to these areas. Our areas of focus are the incorporation of interprocessor communication costs into multiprocessor scheduling decisions; a modeling methodology, called the “synchronization graph,” for multiprocessor system performance analysis; and the application of the synchronization graph model to the development of hardware and software optimizations that can significantly reduce the interprocessor communication overhead of a given schedule.

More specifically, this book reviews, in a unified manner, several important multiprocessor scheduling strategies that effectively incorporate the consideration of interprocessor communication costs, and highlights the variety of techniques employed in these multiprocessor scheduling strategies to take interprocessor communication into account. The book reviews a body of research performed by the authors on modeling implementations of multiprocessor schedules, and on the use of these modeling techniques to optimize interprocessor communication costs. A unified framework is then presented for applying arbitrary scheduling strategies in conjunction with the application of alternative optimization algorithms that address specific subproblems associated with implementing a given schedule. We provide several examples of practical applications that demonstrate the relevance of the techniques described in this book.

We are grateful to the Signal Processing Series Editor Professor K. J. Ray Liu (University of Maryland, College Park) for his encouragement of this project, and to Executive Acquisition Editor B. Clark (Marcel Dekker, Inc.) for his coordination of the effort. It was a privilege for both of us to be students of Professor Edward A. Lee (University of California at Berkeley). Edward provided a truly inspiring research environment during our doctoral studies, and gave valuable feedback while we were developing many of the concepts that underlie this book. We also acknowledge helpful proofreading assistance from [...] Chandrachoodan, Mukul Khandelia, and Vida Kianzad (University of Maryland at College Park); and enlightening discussions with [...] and Dick Stevens (U. S. Naval Research Laboratory), and Praveen [...] (Angeles Design Systems). Financial support (for S. S. Bhattacharyya) for the development of this book was provided by the National Science Foundation.

Sundararajan Sriram
Shuvra S. Bhattacharyya
CONTENTS

Chapter 1
1.1 Multiprocessor DSP systems
1.2 Application-specific multiprocessors
1.3 Exploitation of parallelism
1.4 Dataflow modeling for DSP design
1.5 Utility of dataflow for DSP
1.6 Overview

Chapter 2
2.1 Parallel architecture classifications
2.2 Exploiting instruction level parallelism
    2.2.1 ILP in programmable DSP processors
    2.2.2 Sub-word parallelism
    2.2.3 [...] processors
2.3 Dataflow DSP architectures
2.4 Systolic and wavefront arrays
2.5 Multiprocessor DSP architectures
2.6 Single chip multiprocessors
2.7 Reconfigurable computing
2.8 Architectures that exploit predictable IPC
2.9 Summary

Chapter 3
3.1 Graph data structures
3.2 Dataflow graphs
3.3 Computation graphs
3.4 Petri nets
3.5 Synchronous dataflow
3.6 Analytical properties of SDF graphs
3.7 Converting a general SDF graph into a homogeneous SDF graph
3.8 Acyclic precedence expansion graph
3.9 Application graph
3.10 Synchronous languages
3.11 HSDFG concepts and notations
3.12 Complexity of algorithms
3.13 Shortest and longest paths in graphs
    3.13.1 Dijkstra’s algorithm
    3.13.2 The Bellman-Ford algorithm
    3.13.3 The Floyd-Warshall algorithm
3.14 Solving difference constraints using shortest paths
3.15 Maximum cycle mean
3.16 Summary

Chapter 4
4.1 Task-level parallelism and data parallelism
4.2 Static versus dynamic scheduling strategies
4.3 Fully-static schedules
4.4 Self-timed schedules
4.5 Dynamic schedules
4.6 Quasi-static schedules
4.7 Schedule notation
4.8 Unfolding HSDF graphs
4.9 Execution time estimates and static schedules
4.10 Summary

Chapter 5
5.1 Problem description
5.2 Stone’s assignment algorithm
5.3 List scheduling algorithms
    5.3.1 Graham’s bounds
    5.3.2 The basic algorithms: HLFET and ETF
    5.3.3 The mapping heuristic
    5.3.4 Dynamic level scheduling
    5.3.5 Dynamic critical path scheduling
5.4 Clustering algorithms
    5.4.1 Linear clustering
    5.4.2 Internalization
    5.4.3 Dominant sequence clustering
    5.4.4 Declustering
5.5 Integrated scheduling algorithms
5.6 Pipelined scheduling
5.7 Summary

Chapter 6
6.1 The ordered-transactions strategy
6.2 Shared bus architecture
6.3 Interprocessor communication mechanisms
6.4 Using the ordered-transactions approach
6.5 Design of an ordered memory access multiprocessor
    6.5.1 High level design description
    6.5.2 A modified design
6.6 Design details of a prototype
    6.6.1 Top level design
    6.6.2 Transaction order controller
    6.6.3 Host interface
    6.6.4 Processing element
    6.6.5 FPGA circuitry
    6.6.6 Shared memory
    6.6.7 Connecting multiple boards
6.7 Hardware and software implementation
    6.7.1 Board design
    6.7.2 Software interface
6.8 Ordered [...] and parameter control
6.9 Application examples
    6.9.3 1024 point complex Fast Fourier Transform (FFT)
6.10 Summary

Chapter 7
7.1 Inter-processor communication graph (G_ipc)
7.2 Execution time estimates
7.3 Ordering constraints viewed as edges added to G_ipc
7.4 Periodicity
7.5 Optimal order
7.6 Effects of changes in execution times
    7.6.1 Deterministic case
    7.6.2 Modeling run-time variations in execution times
    7.6.3 Bounds on the average iteration period
    7.6.4 Implications for the ordered transactions schedule
7.7 Summary

Chapter 8
8.1 The Boolean dataflow model
    8.1.1 Scheduling
8.2 Parallel implementation on shared memory machines
    8.2.1 General strategy
    8.2.2 Implementation on the OMA
    8.2.3 Improved mechanism
    8.2.4 Generating the annotated bus access list
8.3 Data-dependent iteration
8.4 Summary

Chapter 9
9.1 Barrier MIMD technique
9.2 Redundant synchronization removal in non-iterative dataflow
9.3 Analysis of self-timed execution
    9.3.1 Estimated throughput
9.4 Strongly connected components and buffer size bounds
9.5 Synchronization model
    9.5.1 Synchronization protocols
    9.5.2 The synchronization graph G_s
9.6 A synchronization cost metric
9.7 Removing redundant synchronizations
    9.7.1 The independence of redundant synchronizations
    9.7.2 Removing redundant synchronizations
    9.7.3 Comparison with Shaffer’s approach
    9.7.4 An example
9.8 Making the synchronization graph strongly connected
    9.8.1 Adding edges to the synchronization graph
9.9 Insertion of delays
    9.9.1 Analysis of DetermineDelays
    9.9.2 Delay insertion example
    9.9.3 Extending the algorithm
    9.9.4 Complexity
    9.9.5 Related work
9.10 Summary

Chapter 10
10.1 Definition of resynchronization
10.2 Properties of resynchronization
10.3 Relationship to set covering
10.4 Intractability of resynchronization
10.5 Heuristic solutions
    10.5.1 Applying set-covering techniques to pairs of SCCs
    10.5.2 A more flexible approach
    10.5.3 Unit-subsumption resynchronization edges
    10.5.4 Example
    10.5.5 Simulation approach
10.6 Chainable synchronization graphs
    10.6.1 Chainable synchronization graph SCCs
    10.6.2 Comparison to the Global-Resynchronize heuristic
    10.6.3 A generalization of the chaining technique
    10.6.4 Incorporating the chaining technique
10.7 Resynchronization of constraint graphs for relative scheduling
10.8 Summary

Chapter 11
11.1 Elimination of synchronization edges
11.2 Latency-constrained resynchronization
11.3 Intractability of LCR
11.4 Two-processor systems
    11.4.1 Interval covering
    11.4.2 Two-processor latency-constrained resynchronization
    11.4.3 Taking delays into account
11.5 A heuristic for general synchronization graphs
    11.5.1 Customization to transparent synchronization graphs
    11.5.2 Complexity
    11.5.3 Example
11.6 Summary

Chapter 12
12.1 Computing buffer sizes
12.2 A framework for self-timed implementation
12.3 Summary
Chapter 1

The focus of this book is the exploration of architectures and design methodologies for application-specific parallel systems in the general domain of embedded applications in digital signal processing (DSP). Such multiprocessors typically consist of one or more central processing units (micro-controllers or programmable digital signal processors), and one or more application-specific hardware components (implemented as custom application specific integrated circuits or reconfigurable logic such as field programmable gate arrays (FPGAs)). Such embedded multiprocessor systems are becoming increasingly common today in applications ranging from digital audio/video equipment to portable devices such as cellular phones and personal digital assistants. With increasing levels of integration, it is now feasible to integrate such heterogeneous systems entirely on a single chip. The design task of such multiprocessor systems-on-a-chip is complex, and the complexity will only increase in the future. One of the critical issues in the design of embedded multiprocessors is managing communication and synchronization overhead between the heterogeneous processing elements. This book discusses systematic techniques aimed at reducing this overhead in multiprocessors that are designed to be application-specific. The scope of this book includes both hardware techniques for minimizing this overhead based on compile time analysis, as well as software techniques for strategically designing synchronization points in a multiprocessor implementation with the objective of reducing synchronization overhead. The techniques presented here apply to DSP algorithms that involve predictable control structure; the precise domain of applicability of these techniques will be formally stated shortly.

Applications in signal, image, and video processing require large computing power and have real-time performance requirements. The computing engines in such applications tend to be embedded as opposed to general-purpose. Custom VLSI implementations are usually preferred in such high throughput applications. However, custom approaches have the well known problems of long design cycles (the advances in high-level VLSI synthesis notwithstanding) and low flexibility in the final implementation. Programmable solutions are attractive in both of these respects: the programmable core needs to be verified for correctness only once, and design changes can be made late in the design cycle by modifying the software program. Although verifying the embedded software to be run on a programmable part is also a hard problem, in most situations changes late in the design cycle (and indeed even after the system design is completed) are much easier and cheaper to make in the case of software than in the case of hardware.

Special processors are available today that employ an architecture and an instruction set tailored towards signal processing. Such software programmable integrated circuits are called “Digital Signal Processors” (DSP chips or DSPs for short). The special features that these processors employ are discussed extensively by Lapsley, Bier, Shoham and Lee [LBSL94]. However, a single processor, even a DSP, often cannot deliver the performance required by some applications. In these cases, use of multiple processors is an attractive solution, where both the hardware and the software make use of the application-specific nature of the task to be performed.

For multiprocessor implementation of embedded real-time DSP applications, reducing interprocessor communication (IPC) costs and synchronization costs becomes particularly important, because there is usually a premium on processor cycles in these situations. For example, consider processing of video images in a video-conferencing application. Video-conferencing typically involves Quarter-CIF (Common Intermediate Format) images; this format specifies data rates of 30 frames per second, with each frame containing 144 lines and 176 pixels per line. The effective sampling rate of the Quarter-CIF video signal is 0.76 Megapixels per second. The highest performance programmable DSP processor available as of this writing (1999) has a cycle time of 5 nanoseconds; this allows about 260 instruction cycles per processor for processing each sample of the video signal sampled at 0.76 MHz. In a multiprocessor scenario, IPC can potentially waste these precious processor cycles, negating some of the benefits of using multiple processors. In addition to processor cycles, IPC wastes power since it involves access to shared resources such as memories and busses. Thus reducing IPC costs also becomes important from a power consumption perspective for portable devices.
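The arithmetic behind the Quarter-CIF figures can be checked with a short calculation. The sketch below is illustrative only: the constant and function names are our own, and it assumes one instruction per cycle on a single processor with the 5 nanosecond cycle time quoted in the text.

```python
# Quarter-CIF parameters quoted in the text.
FRAMES_PER_SECOND = 30
LINES_PER_FRAME = 144
PIXELS_PER_LINE = 176

# Cycle time of the fastest programmable DSP cited in the text (1999).
CYCLE_TIME_NS = 5.0


def pixel_rate_hz() -> float:
    """Effective sampling rate of the video signal, in pixels per second."""
    return float(FRAMES_PER_SECOND * LINES_PER_FRAME * PIXELS_PER_LINE)


def cycles_per_sample(cycle_time_ns: float = CYCLE_TIME_NS) -> float:
    """Processor cycles available per pixel on a single processor."""
    clock_hz = 1e9 / cycle_time_ns  # a 5 ns cycle corresponds to 200 MHz
    return clock_hz / pixel_rate_hz()


if __name__ == "__main__":
    print(f"pixel rate: {pixel_rate_hz() / 1e6:.2f} Mpixels/s")
    print(f"cycles per sample: {cycles_per_sample():.0f}")
```

The pixel rate works out to 760,320 pixels per second and the budget to roughly 263 cycles per pixel, consistent with the "about 260" stated above. Any cycles spent on IPC come out of this budget.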
Over the past few years several companies have offered boards consisting of multiple DSPs. More recently, semiconductor companies have been offering chips that integrate multiple DSP engines on a single die. Examples of such integrated multiprocessor DSPs include commercially available products such as the Texas Instruments TMS320C80 multi-DSP [GGV92], Philips Trimedia processor [RS98], and the Adaptive Solutions CNAPS processor. The Hydra research at Stanford [HO98] is another example of an effort focussed on single-chip multiprocessors.

Multiprocessor DSPs are likely to be increasingly popular in the future for a variety of reasons. First, VLSI technology today enables one to “stamp” 4-5 standard DSPs onto a single die; this trend is certain to continue in the coming years. Such an approach is expected to become increasingly attractive because it reduces the testing time for the increasingly complex VLSI systems of the future.

Second, since such a device is programmable, tooling and testing costs of building an ASIC (application-specific integrated circuit) for each different application are saved by using such a device for many different applications. This advantage of DSPs is going to be increasingly important as circuit integration levels continue their dramatic ascent.

Third, although there has been reluctance in adopting automatic compilers for embedded DSPs, such parallel DSP products make the use of automated tools feasible; with a large number of processors per chip, one can afford to give up some processing power to the inefficiencies in the automatic tools. In addition, new techniques are being researched to make the process of automatically mapping a design onto multiple processors more efficient; the research results discussed in this book are attempts in that direction. This situation is analogous to how logic designers have embraced automatic logic synthesis tools in recent years: logic synthesis tools and VLSI technology have improved to the point that the chip area saved by manual design over automated design is not worth the extra design time involved. One can afford to “waste” a few gates, just as one can afford to waste a limited amount of processor cycles to compilation inefficiencies in a multiprocessor DSP system.

Finally, a proliferation of telecommunication standards and signal formats, often giving rise to multiple standards for the very same application, makes software implementation extremely attractive. Examples of applications in this category include set-top boxes capable of recognizing a variety of audio/video formats and compression standards, modems supporting multiple standards, multi-mode cellular phones and base stations that work with multiple cellular standards, multimedia workstations that are required to run a variety of different multimedia software products, and programmable audio/video codecs. Integrated multiprocessor DSP systems provide a very flexible software platform for this rapidly-growing family of applications.
A natural generalization of such fully-programmable, multiprocessor integrated circuits is the class of multiprocessor systems that consists of an arbitrary, possibly heterogeneous, collection of programmable processors as well as a set of zero or more custom hardware elements on a single chip. Mapping applications onto such an architecture is then a hardware/software codesign problem. However, the problems of interprocessor communication and synchronization are, for the most part, identical to those encountered in fully-programmable systems. In this book, when we refer to a “multiprocessor,” we imply an architecture that, as described above, may be comprised of different types of programmable processors, and may include custom hardware elements. Additionally, the multiprocessor systems that we address in this book may be packaged in a single integrated circuit chip, or may be distributed across multiple chips. All of the techniques that we present in this book apply to this general class of parallel processing architectures.
Although this book addresses a broad range of parallel architectures, it focuses on the design of such architectures in the context of specific, well-defined families of applications. We focus on application-specific parallel processing instead of applying the ideas in general purpose parallel systems because such systems are typically components of embedded applications, and the computational characteristics of embedded applications are fundamentally different from those of general-purpose systems. General purpose parallel computation involves user-programmable computing devices, which can be conveniently configured for a wide variety of purposes, and can be re-configured any number of times as the user’s needs change. Computation in an embedded application, however, is usually one-time programmed by the designer of that embedded system (a digital cellular radio handset, for example) and is not meant to be programmable by the end user. Also, the computation in embedded systems is specialized (the computation in a cellular radio handset involves specific DSP functions such as speech compression, channel equalization, modulation, etc.), and the designers of embedded multiprocessor hardware typically have specific knowledge of the applications that will be developed on the platforms that they develop. In contrast, architects of general purpose computing systems cannot afford to customize their hardware too heavily for any specific class of applications. Thus, only designers of embedded systems have the opportunity to accurately predict and optimize for the specific application subsystems that will be executing on the hardware that they develop. However, if only general purpose implementation techniques are used in the development of an embedded system, then the designers of that embedded system lose this opportunity.
Furthermore, embedded applications face very different constraints compared to general purpose computation. Non-recurring design costs, competitive time-to-market constraints, limitations on the amount and placement of memory, constraints on power consumption, and real-time performance requirements are a few examples. Thus for an embedded application, it is critical to apply techniques for design and implementation that exploit the special characteristics of the application in order to optimize for the specific set of constraints that must be satisfied. These techniques are naturally centered around design methodologies that tailor the hardware and software implementation to the particular application.
Parallel computation has of course been a topic of active research in computer science for the past several decades. Whereas parallelism within a single processor has been successfully exploited (instruction-level parallelism), the problem of partitioning a single user program onto multiple such processors is yet to be satisfactorily solved. Although the hardware for the design of multiple processor machines (the memory, interconnection network, input/output subsystems, etc.) has received much attention, efficient partitioning of a general program (written in C, for example) across a given set of processors arranged in a particular configuration is still an open problem. The need to detect parallelism from within the overspecified sequencing in popular imperative languages such as C, the need to manage overhead due to communication and synchronization between processors, and the requirement of dynamic load balancing for some programs (an added source of overhead) complicate the partitioning problem for a general program. If we turn from general purpose computation to application-specific domains, however, parallelism is often easier to identify and exploit. This is because much more is known about the computational structure of the functionality being implemented. In such cases, we do not have to rely on the limited ability of automated tools to deduce this high-level structure from generic, low-level specifications (for instance, from a general purpose programming language such as C). Instead, it may be possible to employ specialized computational models, such as one of the numerous variants of dataflow and finite state machine models, that expose relevant structure in our targeted applications, and greatly facilitate the manual or automatic derivation of optimized implementations. Such specification models will be unacceptable in a general-purpose context due to their limited applicability, but they present a tremendous opportunity to the designer of embedded applications.
The use of specialized computational models, particularly dataflow-based models, is especially prevalent in the DSP domain.
Similarly, focusing on a particular application domain may inspire the discovery of highly streamlined system architectures. For example, one of the most extensively studied families of application-specific parallel processors is the class of systolic array architectures [Kun88][Rao85]. These architectures consist of regularly arranged arrays of processors that communicate locally, onto which a certain class of applications, specified in a mathematical form, can be systematically mapped. Systolic arrays are further discussed in Chapter 2.
The necessary elements in the study of application-specific computer architectures are: 1) a clearly defined set of problems that can be solved using the particular application-specific approach, 2) a formal mechanism for specification of these applications, and 3) a systematic approach for designing hardware and software from such a specification. In this book we focus on embedded signal, image, and video signal processing applications, and a specification model called Synchronous Dataflow that has proven to be very useful for design of such applications.
Dataflow is a well-known programming model in which a program is represented as a set of tasks with data precedences. Figure 1.1 shows an example of a dataflow graph, where computation tasks (actors) A, B, C, and D are represented as circles, and arrows (or arcs) between actors represent FIFO (first-in-first-out) queues that direct data values from the output of one computation to the input of another. Figure 1.2 shows the semantics of a dataflow graph. Actors consume data (or tokens, represented as bullets in Figure 1.2) from their inputs, perform computations on them (fire), and produce a certain number of tokens on their outputs. The functions performed by the actors define the overall function of the dataflow graph; for example in Figure 1.1, A and B could be data sources, C could be a simple addition operation, and D could be a data sink. Then the function of the dataflow graph would be simply to output the sum of two input tokens.
Figure 1.1. An example of a dataflow graph.
Figure 1.2. [...] "firing".
Dataflow graphs are a very useful specification mechanism for signal processing systems since they capture the intuitive expressivity of block diagrams, flow charts, and signal flow graphs, while providing the formal semantics needed for system design and analysis tools. The applications we focus on are those that can be described by Synchronous Dataflow (SDF) [LM87] and its extensions; we will discuss the formal semantics of this computational model in detail in Chapter 3. SDF in its pure form can only represent applications that involve no decision making at the task level. Extensions of SDF (such as the Boolean dataflow (BDF) model [Lee91][Buc93]) allow control constructs, so that data-dependent control flow can be expressed in such models. These models are significantly more powerful in terms of expressivity, but they give up some of the useful analytical properties possessed by the SDF model. For instance, Buck shows that it is possible to simulate any Turing machine in the BDF model [Buc93]. The BDF model can therefore compute all Turing computable functions, whereas this is not possible in the case of the SDF model. We further discuss the Boolean dataflow model in Chapter 8.
In exchange for the limited expressivity of an SDF representation, we can efficiently check conditions such as whether a given SDF graph deadlocks, and whether it can be implemented using a finite amount of memory. No such general procedures can be devised for checking the corresponding conditions (deadlock behavior and bounded memory usage) for a computation model that can simulate any given Turing machine. This is because the problems of determining if any given Turing machine halts (the halting problem), and determining whether it will use less than a given amount of memory (or tape), are undecidable; that is, no general algorithm exists to solve these problems in finite time.
In this work, we first focus on techniques that apply to SDF applications, and we will propose extensions to these techniques for applications that can be specified essentially as SDF, but augmented with a limited number of control constructs (and hence fall into the BDF model). SDF has proven to be a useful model for representing a significant class of DSP algorithms; several computer-aided design tools for DSP have been developed around SDF and closely related models. Examples of commercial tools based on SDF are the Signal Processing Worksystem (SPW) from Cadence [PLN92][BL91], and COSSAP from Synopsys [RPM92]. Tools developed at various universities that use SDF and related models include Ptolemy [PHLB95a], the Warp compiler [Pri92], DESCARTES [...M92], GRAPE [LEAP94], and the Graph Compiler [VPS90]. Figure 1.3 shows an example of a system specified as a block diagram in Cadence SPW.
Figure 1.3. A block diagram specification of an F[...] system in Cadence Signal Processing Worksystem (SPW).
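The firing semantics described for the graph of Figure 1.1 can be sketched in a few lines of Python. This is only an illustrative model, not code from any of the tools named above: the function name `run_adder_graph`, the arc names, and the choice of one token per firing are assumptions made for the example. Each arc is a FIFO queue, and the adder C is enabled only when a token is present on each of its inputs.

```python
from collections import deque


def run_adder_graph(a_values, b_values):
    """Execute the graph A -> C <- B, C -> D until no actor can fire.

    A and B are sources, C adds one token from each input, D is a sink.
    (Hypothetical helper written for illustration only.)
    """
    a_pending, b_pending = deque(a_values), deque(b_values)
    arc_ac, arc_bc, arc_cd = deque(), deque(), deque()  # FIFO arcs
    received = []  # tokens consumed by the sink D

    fired = True
    while fired:
        fired = False
        if a_pending:                      # source A fires: produce one token
            arc_ac.append(a_pending.popleft())
            fired = True
        if b_pending:                      # source B fires: produce one token
            arc_bc.append(b_pending.popleft())
            fired = True
        if arc_ac and arc_bc:              # C needs a token on each input
            arc_cd.append(arc_ac.popleft() + arc_bc.popleft())
            fired = True
        if arc_cd:                         # sink D fires: consume one token
            received.append(arc_cd.popleft())
            fired = True
    return received


print(run_adder_graph([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```

If one source supplies fewer tokens than the other, C simply stops firing and the surplus tokens remain queued on the arc, which is the data-driven behavior the FIFO model implies.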
The SDF model is popular because it has certain analytical properties that are useful in practice; we will discuss these properties and how they arise later in the book. The most important property of SDF graphs in the context of this book is that it is possible to effectively exploit parallelism in an algorithm specified as an SDF graph by scheduling computations in the SDF graph onto multiple processors at compile or design time rather than at run-time. Given such a schedule that is determined at compile time, we can extract information from it with a view towards optimizing the final implementation. In this book we present techniques for minimizing synchronization and inter-processor communication overhead in statically (i.e., compile time) scheduled multiprocessors in which the program is derived from a dataflow graph specification. The strategy is to model run-time execution of such a multiprocessor to determine how processors communicate and synchronize, and then to use this information to optimize the final implementation.
As mentioned before, dataflow models such as SDF (and other closely related models) have proven to be useful for specifying applications in signal processing and communications, with the goal of both simulation of the algorithm at the functional or behavioral level, and synthesis from such a high level specification to a software description (e.g., a C program) or a hardware description (e.g., HDL) or a combination thereof. The descriptions thus generated can then be compiled down to the final implementation, e.g., an embedded processor or an ASIC. One of the reasons for the popularity of such dataflow based models is that they provide a formalism for block-diagram based visual programming, which is a very intuitive specification mechanism for DSP; the expressivity of the SDF model sufficiently encompasses a significant class of DSP applications, including multirate applications that involve upsampling and downsampling operations. An equally important reason for employing dataflow is that such a specification exposes parallelism in the program. It is well known that imperative programming styles such as C and FORTRAN tend to over-specify the control structure of a given computation, and compilation of such specifications onto parallel architectures is known to be a hard problem. Dataflow on the other hand imposes minimal data-dependency constraints in the specification, potentially enabling a compiler to detect parallelism very effectively. The same argument holds for hardware synthesis, where it is also important to be able to specify and exploit concurrency.
The SDF model has also proven to be useful for compiling DSP applications on single processors. Programmable digital signal processing chips tend to have special instructions such as a single cycle multiply-accumulate (for filtering functions), modulo addressing (for managing delay lines), and bit-reversed addressing (for FFT computation). DSP chips also contain built in parallel functional units that are controlled from fields in the instruction (such as parallel moves from memory to registers combined with an ALU operation). It is difficult for automatic compilers to optimally exploit these features; executable code generated by commercially available compilers today utilizes one-and-a-half to two times the program memory that a corresponding hand optimized program requires, and results in two to three times higher execution time compared to hand-optimized code [ZVSM95]. There are however significant research efforts underway that are narrowing this gap. For example, see [LDK95][SM~97]. Moreover, some of the newer DSP architectures such as the Texas Instruments TMS320C6x are more compiler friendly than past DSP architectures; automatic compilers for these processors often rival hand optimized assembly code for many standard DSP benchmarks.
Block diagram languages based on models such as SDF have proven to be a bridge between automatic compilation and hand coding approaches; a library of reusable blocks in a particular programming language is hand coded, and this library then constitutes the set of atomic SDF actors. Since the library blocks are reusable, one can afford to carefully optimize and fine tune them. The atomic blocks are fine to medium grain in size; an atomic actor in the SDF graph may implement anything from a filtering function to a two input addition operation. The final program is then automatically generated by concatenating code corresponding to the blocks in the program according to the sequence prescribed by a schedule.
This approach is mature enough that there are commercial tools available today, for example the SPW and COSSAP tools mentioned earlier, that employ this technique. Powerful optimization techniques have been developed for generating sequential programs from SDF graphs that optimize for metrics such as program and data memory usage, the run-time efficiency of buffering code, and context switching overhead between sub-tasks [BM~96].
Scheduling is a fundamental operation that must be performed in order to implement SDF graphs on both uniprocessors as well as multiprocessors. Uniprocessor scheduling simply refers to determining a sequence of execution of actors such that all precedence constraints are met and all the buffers between actors (corresponding to arcs) return to their initial states. Multiprocessor scheduling involves determining the mapping of actors to available processors, in addition to determining the sequence in which actors execute. We discuss the issues involved in multiprocessor scheduling in subsequent chapters.
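As a hedged illustration of uniprocessor scheduling in the sense just described, the sketch below builds one sequential schedule for a small SDF graph. It relies on the standard SDF balance equations (for each arc, firings of the producer times tokens produced must equal firings of the consumer times tokens consumed), which this chapter does not state explicitly, so treat them as an assumption imported from the SDF literature. The function names, the arc tuple layout, and the example graph are all hypothetical; the graph is assumed to be connected, and `math.lcm` with several arguments requires Python 3.9 or later.

```python
from fractions import Fraction
from math import lcm


def repetitions(actors, arcs):
    """Smallest positive integers q with q[src] * produced == q[dst] * consumed.

    arcs: list of (src, dst, produced, consumed, initial_tokens).
    Assumes a connected graph; raises ValueError if the rates are inconsistent.
    """
    q = {actors[0]: Fraction(1)}
    changed = True
    while changed:                      # propagate ratios along the arcs
        changed = False
        for src, dst, prod, cons, _ in arcs:
            if src in q and dst not in q:
                q[dst] = q[src] * prod / cons
                changed = True
            elif dst in q and src not in q:
                q[src] = q[dst] * cons / prod
                changed = True
    for src, dst, prod, cons, _ in arcs:
        if q[src] * prod != q[dst] * cons:
            raise ValueError("inconsistent sample rates")
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}


def sequential_schedule(actors, arcs):
    """Fire each actor q[actor] times, respecting token availability."""
    remaining = repetitions(actors, arcs)
    tokens = [init for *_, init in arcs]        # current buffer contents
    schedule = []
    while any(remaining.values()):
        for a in actors:
            enabled = remaining[a] > 0 and all(
                tokens[i] >= cons
                for i, (_, dst, _, cons, _) in enumerate(arcs) if dst == a)
            if enabled:
                for i, (src, dst, prod, cons, _) in enumerate(arcs):
                    if dst == a:
                        tokens[i] -= cons       # consume inputs
                    if src == a:
                        tokens[i] += prod       # produce outputs
                remaining[a] -= 1
                schedule.append(a)
                break
        else:
            raise RuntimeError("deadlock: no actor can fire")
    # One full iteration returns every buffer to its initial state.
    assert tokens == [init for *_, init in arcs]
    return schedule


# Hypothetical multirate chain: A produces 2, B consumes 3 and produces 1, C consumes 2.
example_arcs = [("A", "B", 2, 3, 0), ("B", "C", 1, 2, 0)]
print(sequential_schedule(["A", "B", "C"], example_arcs))
# ['A', 'A', 'A', 'B', 'B', 'C']
```

The greedy choice of the first enabled actor yields a valid schedule but not necessarily one that minimizes buffer memory or code size; those are exactly the metrics that the optimization techniques cited above address.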
Overview
The following chapter describes examples of application specific multiprocessors used for signal processing applications. Chapter 3 lays down the formal notation and definitions used in the remainder of this book for modeling run-time synchronization and interprocessor communication. Chapter 4 describes scheduling models that are commonly employed when scheduling dataflow graphs on multiple processors. Chapter 5 describes scheduling algorithms that attempt to maximize performance while accurately taking interprocessor communication costs into account.
Chapters 6 and 7 describe a hardware based technique for minimizing IPC and synchronization costs; the key idea in these chapters is to predict the pattern of processor accesses to shared resources and to enforce this pattern during run-time. We present the hardware design and implementation of a four processor machine, the Ordered Memory Access Architecture (OMA). The OMA is a shared bus multiprocessor that uses shared memory for IPC. The order in which processors access shared memory for the purpose of communication is predetermined at compile time and enforced by a bus controller on the board, resulting in a low-cost IPC mechanism without the need for explicit synchronization. This scheme is termed the Ordered Transactions strategy. In Chapter 7, we present a graph theoretic scheme for modeling run-time synchronization behavior of multiprocessors using a structure we call the [...] that takes into account the processor assignment and ordering constraints that a self-timed schedule specifies. We also discuss the effect of run-time variations in execution times of tasks on the performance of a multiprocessor implementation.
In Chapter 8, we discuss ideas for extending the Ordered Transactions strategy to models more powerful than SDF, for example, the Boolean dataflow (BDF) model. The strategy here is to assume we have only a small number of control constructs in the SDF graph and explore techniques for this case.
The domain of applicability of compile time optimization techniques can be extended to programs that display some dynamic behavior in this manner, without having to deal with the complexity of tackling the general BDF model.
The ordered memory access approach discussed in Chapters 6 to 8 requires special hardware support. When such support is not available, we can utilize a set of software-based approaches to reduce synchronization overhead. These techniques for reducing synchronization overhead consist of efficient algorithms that minimize the overall synchronization activity in the implementation of a given self-timed schedule. A straightforward multiprocessor implementation of a dataflow specification often includes redundant synchronization points, i.e., the objective of a certain set of synchronizations is guaranteed as a side effect of other synchronization points in the system. Chapter 9 discusses efficient algorithms for detecting and eliminating such redundant synchronization operations. We also discuss a graph transformation, called [...], that allows the use of more efficient synchronization protocols.
It is also possible to reduce the overall synchronization cost of a self-timed implementation by adding synchronization points between processors that were not present in the schedule specified originally. In Chapter 10, we discuss a technique, called resynchronization, for systematically manipulating synchronization points in this manner. Resynchronization is performed with the objective of improving throughput of the multiprocessor implementation. Frequently in real-time signal processing systems, latency is also an important issue, and although resynchronization improves the throughput, it generally degrades (increases) the latency. Chapter 10 addresses the problem of resynchronization under the assumption that an arbitrary increase in latency is acceptable. Such a scenario arises when the computations occur in a feedforward manner, e.g., audio/video decoding for playback from media such as Digital Versatile Disk (DVD), and for a wide variety of simulation applications. Chapter 11 examines the relationship between resynchronization and latency, and addresses the problem of optimal resynchronization when only a limited increase in latency is tolerable. Such latency constraints are present in interactive applications such as video conferencing and telephony, where beyond a certain point the latency becomes annoying to the user. In voice telephony, for example, the round trip delay of the speech signal is kept below about 100 milliseconds to achieve acceptable quality.
The ordered memory access strategy discussed in Chapters 6 through 8 can be viewed as a hardware approach that optimizes for IPC and synchronization overhead in statically scheduled multiprocessor implementations. The synchronization optimization techniques of Chapters 9 through 12, on the other hand, operate at the level of a scheduled parallel program by altering the synchronization structure of a given schedule to minimize the synchronization overhead in the final implementation. Throughout the book, we illustrate the key concepts by applying them to examples of practical systems.
elements could themselves be self-contained processors that exploit parallelism within themselves. In the latter case, we can view the parallel program as being split into multiple threads of computation, where each thread is assigned to a processing element. The processing element itself could be a traditional von Neumann-type Central Processing Unit (CPU), sequentially executing instructions fetched from a central instruction storage, or it could employ instruction-level parallelism (ILP) to realize high performance by executing in parallel multiple instructions in its assigned thread.
The interconnection mechanism between processors is clearly crucial to the performance of the machine on a given application. For fine-grained and instruction level parallelism support, communication often occurs through a simple mechanism such as a multi-ported register file. For machines composed of more sophisticated processors, a large variety of interconnection mechanisms have been employed, ranging from a simple shared bus to 3-dimensional meshes and hyper-trees [Lei92]. Embedded applications often employ simple structures such as hierarchical busses or small crossbars.
The two main flavors of ILP are superscalar and VLIW (Very Long Instruction Word) [PH96]. Superscalar processors (e.g., the Intel Pentium processor) contain multiple functional units (ALUs, floating point units, etc.); instructions are brought into the machine sequentially and are scheduled dynamically by the processor hardware onto the available functional units. Out-of-order execution of instructions is also supported. VLIW processors, on the other hand, rely on a compiler to statically schedule instructions onto functional units; the compiler determines exactly what operation each functional unit performs in each instruction cycle. The "long instruction word" arises because the instruction word must specify the control information for all the functional units in the machine.
Clearly, a VLIW model is less flexible than a superscalar approach; however, the implementation cost of VLIW is also significantly less because dynamic scheduling need not be supported in hardware. For this reason, several modern DSP processors have adopted the VLIW approach; at the same time, as discussed before, the regular nature of DSP algorithms lends itself well to the static scheduling approach employed in VLIW machines. We will discuss some of these machines in detail in the following sections.
Given multiple processors capable of executing autonomously, the program threads running on the processors may be tightly or loosely coupled to one another. In a tightly coupled architecture the processors may run in lock step executing the same instructions on different data sets (e.g., systolic arrays), or they may run in lock step, but operate on different instruction sequences (similar to VLIW). Alternatively, processors may execute their programs independent of one another, only communicating or synchronizing when necessary. Even in this case there is a wide range of how closely processors are coupled, which can range from a shared memory model where the processors may share the same memory address space to a "network of workstations" model where autonomous machines communicate in a coarse-grained manner over a local area network.
In the following sections, we discuss application-specific parallel processors that exemplify the many variations in parallel architectures discussed thus far. We will find that these machines employ tight coupling between processors; these machines also attempt to exploit the predictable run-time nature of the targeted applications, by employing architectural techniques such as VLIW, and employing processor interconnections that reflect the nature of the targeted application set. Also, these architectures rely heavily upon static scheduling techniques for their performance.
DSP processors have incorporated ILP techniques since inception; the key innovation in the very first DSPs was a single cycle multiply-accumulate unit. In addition, almost all DSP processors today employ an architecture that includes multiple internal busses allowing multiple data fetches in parallel with an instruction fetch in a single instruction cycle; this is also known as a "Harvard" architecture. Figure 2.1 shows an example of a modern DSP processor (Texas Instruments TMS320C54x DSP) containing multiple address and data busses, and parallel address generators. Since filtering is the key operation in most DSP algorithms, modern programmable DSP architectures provide highly specialized support for this function. For example, a multiply-and-accumulate operation may be performed in parallel with two data fetches from data memory (for fetching the signal sample and the filter coefficient); in addition, an update of two address registers (potentially including modulo operations to support circular buffers and delay lines), and an instruction fetch can also be done in the same cycle. Thus there are many atomic operations performed in parallel in a single cycle; this allows a finite impulse response (FIR) filter implementation using only one DSP instruction cycle per filter tap. For example, Figure 2.2 shows the assembly code for the inner loop of an FIR filter implementation on a TMS320C54x DSP. The MAC instruction is repeated for each tap in the filter; for each repetition this instruction fetches the coefficient and data pointed to by address registers AR2 and AR3, multiplies and accumulates them into the "A" accumulator, and postincrements the address registers.
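To make the per-tap work concrete, the following Python sketch models the FIR inner loop described above: one multiply-accumulate per tap, with a modulo address update that implements the delay line as a circular buffer. It is a behavioral illustration only and is not the assembly code of Figure 2.2; the function names and the buffer layout are assumptions chosen for the example. On the DSP, each loop iteration below corresponds to a single instruction cycle, whereas here it is ordinary sequential code.

```python
def fir_step(coeffs, delay_line, head, sample):
    """Process one input sample; return (output, new_head).

    delay_line is a circular buffer holding the len(coeffs) most recent samples.
    """
    n = len(coeffs)
    delay_line[head] = sample          # overwrite the oldest sample
    acc = 0                            # plays the role of the accumulator
    idx = head
    for c in coeffs:                   # one multiply-accumulate per filter tap
        acc += c * delay_line[idx]
        idx = (idx - 1) % n            # modulo address update (circular buffer)
    return acc, (head + 1) % n


def fir_filter(coeffs, samples):
    """Run the filter over a sequence of samples, starting from a zeroed delay line."""
    delay_line = [0] * len(coeffs)
    head, out = 0, []
    for x in samples:
        y, head = fir_step(coeffs, delay_line, head, x)
        out.append(y)
    return out


# The impulse response of an FIR filter is its coefficient sequence.
print(fir_filter([1, 2, 3], [1, 0, 0, 0]))  # [1, 2, 3, 0]
```

The modulo operation on `idx` is what the hardware address generators provide for free; without it, every new sample would require shifting the whole delay line in memory.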
DSP processors have a complex instruction set and follow a philosophy very different from "Reduced Instruction Set Computer" architectures that are prevalent in the general purpose high performance microprocessor domain. The advantages of a complex instruction set are compact object code and deterministic performance, while the price of supporting a complex instruction set is lower compiler efficiency and lesser portability of the software. The constraint of low power, and the high performance-to-cost ratio requirement for embedded applications, has resulted in very different evolution paths for DSP processors compared to general-purpose processors. Whether these paths eventually converge in the future remains to be seen.
Figure 2.1. A simplified view of the [...].
Sub-word parallelism refers to the ability to divide a wide ALU into narrower slices so that multiple operations on a smaller data type can be performed on the same datapath in an SIMD fashion (Figure 2.3). Several general purpose microprocessors employ a multimedia-enhanced instruction set that exploits sub-word parallelism to achieve higher performance on multimedia applications that require a smaller precision. The "MMX Technology"-enhanced Intel Pentium processor [E...] is a well-known general purpose CPU with an enhanced instruction set to handle throughput intensive "media" processing. The MMX instructions allow a 64-bit ALU to be partitioned into 8-bit slices, providing sub-word parallelism. The 8-bit ALU slices work in parallel in an SIMD fashion. The Pentium can perform operations such as addition, subtraction, and logical operations on eight 8-bit samples (e.g., image pixels) in a single cycle. It also can perform data movement operations such as single cycle swapping of bytes within words, packing smaller sized words into a 64-bit register, etc. Operations such as four 8-bit multiplies (with or without saturation), shifts within sub-words, and sum of products of sub-words may all be performed in a single cycle. Similarly enhanced microprocessors have been developed by Sun Microsystems (the "VIS" instruction set for the SPARC processor [TO...]) and Hewlett-Packard (the MAX instructions for the PA-RISC processor). The VIS instruction set includes a capability for performing absolute difference (for image compression applications). The MAX instructions include a sub-word average, shift and add, and fairly generic permute instructions that change the positions of the sub-words within a 64-bit word boundary in a very flexible manner. The permute instructions are especially useful for efficiently aligning data within a 64-bit word before employing an instruction that operates on multiple sub-words. DSP processors such as the TMS320C60 and TMS320C80, and the Philips Trimedia, also support sub-word parallelism. Exploiting sub-word parallelism clearly requires extensive static or compile time analysis, either manually or by a compiler.
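The idea of operating on several sub-words inside one wide register can be imitated in software, which may help clarify what the partitioned ALU does in hardware. The sketch below adds four bytes packed into a 32-bit word; the function names are hypothetical and the code does not correspond to any particular instruction set. The truncating (wrap-around) version uses a well-known masking trick so that carries never cross byte boundaries, while the saturating version is written lane by lane as a plain reference model.

```python
HIGH_BITS = 0x80808080   # most significant bit of each byte lane
LOW_BITS = 0x7F7F7F7F    # remaining seven bits of each byte lane


def add_bytes_wrap(x, y):
    """Add four packed bytes with wrap-around (truncation) in each lane."""
    # Add the low 7 bits of every lane; a carry can reach bit 7 but not the next lane.
    low_sum = (x & LOW_BITS) + (y & LOW_BITS)
    # Fold in the top bits with XOR, discarding the carry out of each lane.
    return (low_sum ^ ((x ^ y) & HIGH_BITS)) & 0xFFFFFFFF


def add_bytes_saturate(x, y):
    """Add four packed bytes, clamping each lane at 0xFF (reference model)."""
    result = 0
    for shift in (0, 8, 16, 24):
        lane = ((x >> shift) & 0xFF) + ((y >> shift) & 0xFF)
        result |= min(lane, 0xFF) << shift
    return result


x, y = 0x01FF7F80, 0x01010180
print(hex(add_bytes_wrap(x, y)))      # 0x2008000
print(hex(add_bytes_saturate(x, y)))  # 0x2ff80ff
```

In hardware the same effect is obtained by breaking the carry chain of the adder at the sub-word boundaries, so no masking is needed and all lanes complete in one cycle.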
As discussed before, the lower cost of the compiler-scheduled approach employed in VLIW machines compared to the hardware scheduling employed in superscalar processors makes VLIW a good candidate for a DSP architecture. It is therefore no surprise that several semiconductor manufacturers have recently announced VLIW-based signal processor products. The Philips Trimedia processor, for example, is geared towards video signal processing, and employs a VLIW engine. The Trimedia processor also has special hardware for handling various standard video formats. In addition, hardware modules for highly specialized functions such as Variable Length Decoding (used for MPEG video decoding), and color and format conversion, are also provided. Trimedia also has instructions that exploit sub-word parallelism among byte-sized samples within a 32-bit word.
byte lanes: a + e | b + f | c + g | d + h
Figure 2.3. Example of sub-word parallelism: addition of bytes within a 32-bit register (saturation or truncation could be specified).
The Chromatic MPACT architecture [Pur97] uses an interesting hardware/software partitioned solution to provide a programmable platform for PC-based multimedia. The target applications are graphics, audio/video processing, and video games. The key idea behind Chromatic's multimedia solution is to use some amount of processing capability in the native x86 CPU, and use the MPACT processor for accelerating certain functions when multiple applications are operated simultaneously (e.g., when a FAX message arrives while a teleconferencing session is in operation).
Finally, the Texas Instruments TMS320C6x DSP [Tex98] is a high performance, general purpose DSP that employs a VLIW architecture. The C6x processor is designed around eight functional units that are grouped into two identical sets of four functional units each (see Figure 2.4). These functional units are the D unit for memory load/store and add/subtract operations; the M unit for multiplication; the L unit for addition/subtraction, logical and comparison operations; and the S unit for shifts in addition to add/subtract and logical operations. Each set of four functional units has its own register file, and a bypass is provided for accessing each half of the register file by either set of functional units. Each functional unit is controlled by a 32-bit instruction field; the instruction word for the processor therefore has a length between 32 bits and 256 bits, depending on how many functional units are actually active in a given cycle. Features such as predicated instructions allow conditional execution of instructions; this allows one to avoid branching when possible, a very useful feature considering the deep pipeline of the C6x.
Several multiprocessors geared towards signal processing are based on the dataflow architecture principles introduced by Dennis [Den80]; these machines deviate from the traditional von Neumann model of a computer. Notable among these are the Hughes Data Flow Multiprocessor [GB91], the Texas Instruments Data Flow Signal Processor [Gri84], and the AT&T Enhanced Modular Signal Processor [Blo86]. The first two perform the processor assignment step at compile time (i.e., tasks are assigned to processors at compile time) and tasks assigned to a processor are scheduled on it dynamically; the AT&T EMSP performs even the assignment of tasks to processors at run-time. The main steps involved in scheduling tasks on multiple processors are discussed fully in Chapter 4. Each of these machines employs elaborate hardware to implement dynamic scheduling within processors, and employs expensive communication networks to route tokens generated by actors assigned to one processor to tasks on other processors that require these tokens. In most DSP applications, however, such dynamic scheduling is unnecessary since compile time predictability makes static scheduling techniques viable. Eliminating dynamic scheduling results in much simpler hardware without an undue performance penalty.
Another example of an application-specific dataflow architecture is the [...] [Cha84], which is a single chip processor geared towards image processing. Each chip contains one functional unit; multiple such chips can be connected together to execute programs in a pipelined fashion. The actors are statically assigned to each processor, and actors assigned to a given processor are scheduled on it dynamically. The primitives that this chip supports (convolution, bit manipulations, accumulation, etc.) are specifically designed for image processing applications.
Systolic arrays consist of processors that are locally connected and may be arranged in different interconnection topologies: mesh, ring, torus, etc. The term "systolic" arises because all processors in such a machine run in lock-step, alternating between a computation step and a communication step. The model followed is usually SIMD (Single Instruction Multiple Data). Systolic arrays can execute a certain class of problems that can be specified as "Regular Iterative Algorithms (RIA)" [Rao85]; systematic techniques exist for mapping an algorithm specified in RIA form onto dedicated processor arrays in an optimal fashion. Optimality includes metrics such as processor and communication link utilization, scalability with the problem size, and achieving the best [...] for a given number of processors. Several numerical computation problems were found to fall into the RIA category: linear algebra, matrix operations, singular value decomposition, etc. (see [Lei92] for interesting systolic array implementations of a variety of different numerical problems). Only highly regular computations can be specified in the RIA form; this makes the applicability of systolic arrays somewhat restrictive.
Wavefront arrays are similar to systolic arrays except that processors are not under the control of a global clock [Kun88]. Communication between processors is asynchronous or self-timed; a handshake between processors ensures run-time synchronization. Thus processors in a wavefront array can be complex and the arrays themselves can consist of a large number of processors without incurring the associated problems of clock skew and global synchronization. The flexibility of wavefront arrays over systolic arrays comes at the cost of [...].
The Warp project at Carnegie Mellon University [A+87] is an example of a programmable systolic array, as opposed to a dedicated array designed for one specific application. Processors are arranged in a linear array and communicate with their neighbors [...]. Programs are written for this computer [...]. The Warp project also led to the iWarp design, which has a more elaborate inter-processor communication mechanism. An iWarp node is a single VLSI component composed of a computation engine and a communication engine. The computation agent consists of an integer and logical unit as well as a floating-point add and multiply unit. Each unit is capable of running independently, connected to a multi-ported register file. The communication agent connects to its neighbors via four bidirectional communication links, and provides the interface to support message passing type communication between cells as well as word-based systolic communication. The iWarp nodes can therefore be connected in various single and two dimensional topologies. Various image processing applications (FFT, image smoothing, computer vision) and matrix algorithms ([...] decomposition) have been reported for this machine [Lou93].
Next, we discuss multiprocessors that make use of multiple off-the-shelf programmable DSP chips. An example of such a system is the SMART architecture [Koh90], a reconfigurable bus-based design comprised of AT&T DSP32C processors and custom VLSI components for routing data between processors. Clusters of processors may be connected onto a common bus, or may form a linear array with neighbor-to-neighbor communication. This allows the multiprocessor to be reconfigured depending on the communication requirement of the particular application being mapped onto it. Scheduling and code generation for this machine is done by an automatic parallelizing compiler [HJ92].
The DSP3 multiprocessor [SW92] is comprised of AT&T DSP32C processors connected in a mesh configuration. The mesh interconnect is implemented using custom VLSI components for data routing. Each processor communicates with four of its adjacent neighbors through this router, which consists of input and output queues, and a crossbar that is configurable under program control. Data packets contain headers that indicate the ID of the destination processor.
The Ring Array Processor (RAP) system [M+92] uses TI TMS320C30 processors connected in a ring topology. This system is designed specifically for speech-recognition applications based on artificial neural networks. The RAP system consists of several boards that are attached to a host workstation, and acts as a co-processor for the host. The unidirectional pipelined ring topology employed for interprocessor communication was found to be ideal for the particular algorithms that were to be mapped to this machine. The ring structure is similar to the SMART array, except that no processor ID is included with the data, and processor reads and writes into the ring are scheduled at compile time. The ring is used to broadcast data from one processor to all the others during one
APPLICATION-SPECIFIC MULTIPROCESSORS
phase of the neural network algorithm, and is used to shift data from processor to processor in a pipelined fashion in the second phase.
Several modern off-the-shelf DSP processors provide special support for multiprocessing. Examples include the Texas Instruments TMS320C40 (C40), Motorola DSP96000, Analog Devices ADSP-21060 "SHARC", as well as the Inmos (now owned by SGS Thomson) Transputer line of processors. The DSP96000 processor is a floating point DSP that supports two independent busses, one of which can be used for local accesses and the other for inter-processor communication. The C40 processor is also a floating point processor with two sets of busses; in addition it has six 8-bit bidirectional ports for interprocessor communication. The ADSP-21060 is a floating point DSP that provides six bidirectional serial links for interprocessor communication. The Transputer is a CPU with four serial links for interprocessor communications. Owing to the ease with which these processors can be interconnected, a number of multi-DSP machines have been built around the C40, the DSP96000, the SHARC, and the Transputer. Examples of multi-DSP machines composed of DSP96000s include MUSIC [G+92], which targets neural network applications, as well as the architecture described in Chapter 6; C40 based parallel processors have been designed for beamforming applications [Ger95] and machine vision [DIB96], among others; ADSP-21060 based multiprocessors include speech-recognition applications [T+95], applications in nuclear physics [A+98], and digital music [Sha98]; and machines built around Transputers have targeted applications in scientific computation [Mou96] and robotics [YM96].
Modern VLSI technology enables multiple CPUs to be placed on a single die, to yield a multiprocessor system-on-a-chip. Olukotun et al. [O+96] present an interesting study that concludes that going to a multiple processor solution is a better path to high performance than going to higher levels of instruction level parallelism (using a superscalar approach, for example). Systolic arrays have been proposed as ideal candidates for application-specific multiprocessor on chip implementations; however, as pointed out before, the class of applications targeted by systolic arrays is limited. We discuss next some interesting single chip multiprocessor architectures that have been designed and built to date.
The Texas Instruments TMS320C80 (Multimedia Video Processor) [GGV92] is an example of a single chip multi-DSP. It consists of four DSP cores, and a RISC processor for control-oriented applications. Each DSP core has its own local memory and some amount of shared RAM. Every DSP can access the shared memory in any one of the four DSPs through an interconnection network. A powerful transfer controller is responsible for moving data on-chip, and also [...] graphics applications. Data transfers are all [...]. [...] processor designed [...] video [...]. The PE9 consists of nine [...] instruction level parallelism by means of four individual processing units, which can perform multiple arithmetic operations each cycle. Thus the [...] is a highly parallel architecture that exploits parallelism at multiple levels.
Embedded single-chip multiprocessors may be composed of heterogeneous processors. For example, many consumer devices today ([...] controllers, etc.) are composed of two processors: one is a DSP that handles signal processing tasks, while the other is a microcontroller such as an ARM. Such a two-processor system is increasingly found in embedded applications because of the types of architectural optimization used in each processor. The microcontroller has an efficient interrupt-handling capability, and is more
amenable to compilation from a high-level language; however, it lacks the multiply-accumulate performance of a DSP processor. The microcontroller is thus ideal for performing user interface and protocol processing type functions that are somewhat asynchronous in nature, while the DSP is more suited to signal processing tasks that tend to be synchronous and predictable. Even though new DSP processors boasting microcontroller capabilities have been introduced recently (e.g., the Hitachi SH-DSP and the TI TMS320C27x series), an ARM/DSP two processor solution is expected to be popular for embedded signal processing/control applications in the near future. A good example of such an architecture is described in [Reg94]; this part uses two DSP processors along with a microcontroller to implement audio processing and voice band modem functions in software.
Reconfigurable computers are another approach to application-specific computing that has received significant attention lately. Reconfigurable computing is based on implementing a function in hardware using configurable logic (e.g., a field programmable gate array or FPGA), or higher-level building blocks that can be easily configured and reconfigured to provide a range of different functions. Building a dedicated circuit for a given function can result in large speedups; examples of such functions are bit manipulation in applications such as cryptography and compression; bit-field extraction; highly regular computations such as Fourier and Discrete Cosine Transforms; pseudo random number generation; compact lookup tables, etc. One strategy that has been employed for building configurable computers is to build the machine entirely out of reconfigurable logic; examples of such machines, used for applications such as DNA sequence matching, finite field arithmetic, and encryption, are discussed in [G+91][~~95][GMN96][~+96].
A second and more recent approach to reconfigurable architectures is to augment a programmable processor with configurable logic. In such an architecture, functions best suited to hardware implementation are mapped to the FPGA to take advantage of the resulting speedup, and functions more suitable to software (e.g., control dominated applications, and floating point intensive computation) can make use of the programmable processor. The Garp processor [HW97], for example, combines a Sun UltraSPARC core with an FPGA that serves as a reconfigurable functional unit. Special instructions are defined for configuring the FPGA, and for transferring data between the FPGA and the processor. The authors demonstrate a 24x speedup over a Sun UltraSPARC machine for an encryption application. In [HFHK97] the authors describe a similar architecture, called Chimaera, that augments a RISC processor with an FPGA. In the Chimaera architecture, the reconfigurable unit has access to the processor register
file; in the Garp architecture the processor is responsible for directly reading from and writing data to the reconfigurable unit through special instructions that are added to the native instruction set of the RISC processor. Both architectures include special instructions in the processor for sending commands to the reconfigurable unit.
Figure 2.7. A processor augmented with an FPGA-based accelerator [HW97][HFHK97].
Another example of a reconfigurable architecture is Matrix [MD97], which attempts to combine the efficiency of processors on irregular, heavily multiplexed tasks with the efficiency of FPGAs on highly regular tasks. The Matrix architecture allows selection of the granularity according to application needs. It consists of an array of basic functional units (BFUs) that may be configured either as functional units (add, multiply, etc.), or as control for another BFU. Thus one can configure the array into parts that function in SIMD mode under common control, where each such partition runs an independent thread in an MIMD mode.
In [ASI+98] the authors describe the idea of domain-specific processors that achieve low power dissipation for the small class of applications they are optimized for. These processors, augmented with general purpose processors, yield a practical trade-off between flexibility, power and performance. The authors estimate that such an approach can reduce the power utilization of speech coding implementations by over an order of magnitude compared to an implementation using only a general purpose DSP processor.
PADDI (Programmable Arithmetic Devices for DIgital signal processing) is another reconfigurable architecture that consists of an array of high performance execution units (EXUs) with localized register files, connected via a flexible interconnect mechanism [CR92]. The EXUs perform arithmetic functions such as add, subtract, shift, compare, accumulate, etc. The entire array is controlled by a hierarchical control structure: a central sequencer broadcasts a global control word, which is then decoded locally by each EXU to determine its action. The local EXU decoder ("nanostore") handles local control, for example the selection of operands and program branching.
Finally, Wu and Liu [WLR98] describe a reconfigurable processing unit that can be used as a building block for a variety of video signal processing functions including FIR, IIR, and adaptive filters, and discrete transforms such as DCT. An array of processing units along with an interconnection network is used to implement any one of these functions, yielding throughput comparable to custom ASIC designs but with much higher flexibility and potential for adaptive operation.
As we will discuss in Chapter 4, compile time scheduling is very effective for a large class of applications in signal processing and scientific computing. Given such a schedule, we can obtain information about the pattern of inter-processor communication that occurs at run-time. This compile time information can be exploited by the hardware architecture to achieve efficient communication between processors. We exploit this fact in the strategy discussed in Chapter [...]. In this section we discuss related work in this area of employing compile time information about inter-processor communication coupled with enhancements to the hardware architecture with the objective of reducing IPC and synchronization overhead.
Determining the pattern of processor communications is relatively straightforward in SIMD implementations. Techniques applied to systolic arrays in fact use the regular communication pattern to determine an optimal interconnect topology for a given algorithm. An interesting architecture in this context is the GF11 machine built at IBM [BDW85]. The GF11 is an SIMD machine in which processors are interconnected using a Benes network (Figure 2.8), which allows the GF11 to support a variety of different interprocessor communication topologies rather than a fixed topology. Benes networks are non-blocking, i.e., they can provide one-to-one connections from all the network inputs to the network outputs simultaneously according to any specified permutation. These networks achieve the functional capability of a full crossbar switch with much simpler hardware. The drawback, however, is that in a Benes network, computing the switch settings needed to achieve a particular permutation involves a somewhat complex algorithm [Lei92]. In the GF11, this problem is solved by precomputing the switch settings based on the program to be executed on the array. A central controller is responsible for reconfiguring the Benes network at run-time based on these predetermined switch settings. Interprocessor communication in the GF11 is synchronous with respect to computations in the processors, similar to systolic arrays. The GF11 has been used for scientific computing, e.g., calculations in quantum physics, finite element analysis, LU decomposition, and other applications.
An example of a mesh connected parallel processor that uses compile time information at the hardware level is the NuMesh system at MIT [SHL+97]. In this system, it is assumed that the communication pattern (the source and destination of each message) and the communication bandwidth required can be extracted from the parallel program specification. Some amount of dynamic execution is supported by the architecture. Each processing node in the mesh gets a communication schedule which it follows at run-time. If the compile time estimates of bandwidth requirements are accurate, the architecture realizes efficient, hot-spot free, low-overhead communication. Incorrect bandwidth estimates or dynamic execution are not catastrophic, but these do cause lower performance.
The RAW machine is another example of a parallel processor whose interconnections are configured statically. The processing elements are tiled in a mesh topology; each element consists of a RISC-like processor, with configurable logic that implements special instructions and configurable data widths. Switches enforce a compile-time determined static communication pattern, allowing dynamic switching when necessary. Implementing the static communication pattern reduces synchronization overhead and network congestion. A compiler is responsible for partitioning the program into threads mapped onto each processor, configuring the reconfigurable logic on each processor, and routing communications statically.
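As a rough illustration of the claim that a Benes network offers crossbar-like connectivity with much simpler hardware, the sketch below counts switching elements for both. It assumes the textbook construction of a Benes network with N inputs (N a power of two), built from 2x2 switches arranged in 2 log2 N - 1 stages of N/2 switches each; the function names are hypothetical, and the actual size and construction of the GF11 network are not taken from the text.

```python
def benes_switch_count(n):
    """2x2 switches in a textbook Benes network with n inputs (n a power of two)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    log_n = n.bit_length() - 1      # exact log2 for a power of two
    stages = 2 * log_n - 1          # number of switch stages
    return stages * (n // 2)        # n/2 switches in every stage


def crossbar_crosspoint_count(n):
    """Crosspoints in a full n-by-n crossbar."""
    return n * n


for n in (8, 64, 512):
    print(n, benes_switch_count(n), crossbar_crosspoint_count(n))
```

A 2x2 switch and a crossbar crosspoint are not identical units of hardware, so the comparison is only indicative: the switch count grows as N log N, while the crosspoint count grows as N squared.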
In this chapter we discussed various types of application-specific multiprocessors employed for signal processing. Although these machines employ parallel processing techniques well known in general purpose computing, the predictable nature of the computations allows for simplified system architectures. It is often possible to configure processor interconnects statically to make use of compile time knowledge of inter-processor communication patterns. This allows for low overhead interprocessor communication and synchronization mechanisms that employ a combination of simple hardware support and software techniques applied to programs running on the processors. We explore these ideas further in the following chapters.
In this chapter we introduce terminology and definitions used in the remainder of the book, and formalize the dataflow model that was introduced intuitively in Chapter 1. We also briefly introduce the concept of algorithmic complexity, and discuss various shortest and longest path algorithms in weighted directed graphs along with their associated complexity. These algorithms are used extensively in subsequent chapters.
To start with, we define the difference of two arbitrary sets $S_1$ and $S_2$ by $S_1 - S_2 = \{ s \in S_1 \mid s \notin S_2 \}$, and we denote the number of elements in a finite set $S$ by $|S|$. Also, if $r$ is a real number, then we denote the smallest integer that is greater than or equal to $r$ by $\lceil r \rceil$.
A directed graph is an ordered pair $(V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. Each edge is an ordered pair $(v_1, v_2)$, where $v_1, v_2 \in V$. If $e = (v_1, v_2) \in E$, we say that $e$ is directed from $v_1$ to $v_2$; $v_1$ is the source vertex of $e$, and $v_2$ is the sink vertex of $e$. We refer to the source and sink vertices of a graph edge $e \in E$ by $src(e)$ and $snk(e)$. In a directed graph we cannot have two or more edges that have identical source and sink vertices. A generalization of a directed graph is a directed multigraph, in which two or more edges may have the same source and sink vertices.
Figure 3.1(a) shows an example of a directed graph, and Figure 3.1(b) shows an example of a directed multigraph. The vertices are represented by circles and the edges are represented by arrows between the circles. Thus, the vertex set of the directed graph of Figure 3.1(a) is $\{A, B, C, D\}$, and the edge set is { [...], $(D, B)$, [...] }.
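The distinction between a directed graph and a directed multigraph can be made concrete with a small data structure. The following is a minimal sketch, not a representation used in the book: the `Edge` class, the `label` field used to tell parallel edges apart, and the example vertices are all hypothetical and do not reproduce Figure 3.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str        # source vertex, src(e)
    snk: str        # sink vertex, snk(e)
    label: int = 0  # hypothetical tag that distinguishes parallel edges


def is_directed_graph(edges):
    """True if no two edges share both their source and their sink vertex."""
    pairs = [(e.src, e.snk) for e in edges]
    return len(pairs) == len(set(pairs))


# A directed graph: every (source, sink) pair occurs at most once.
graph_edges = [Edge("A", "B"), Edge("B", "C"), Edge("C", "A")]

# A directed multigraph: two distinct edges go from A to B.
multigraph_edges = [Edge("A", "B", 0), Edge("A", "B", 1), Edge("B", "C")]

print(is_directed_graph(graph_edges), is_directed_graph(multigraph_edges))
```

Storing edges as first-class objects rather than as a set of vertex pairs is what allows parallel edges to coexist, which matters later because dataflow graphs are multigraphs.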
A dataflow graph is a directed multigraph, where the vertices (actors) represent computation and edges (arcs) represent FIFO (first-in-first-out) queues that direct data values from the output of one computation to the input of another. Edges thus represent data precedences between computations. Actors consume data (or tokens) from their inputs, perform computations on them (fire), and produce certain numbers of tokens on their outputs.
Programs written in high-level functional languages such as pure LISP, and in dataflow languages such as Id and Lucid, can be directly converted into dataflow graph representations; such a conversion is possible because these languages are designed to be free of side-effects, i.e., programs in these languages cannot contain global variables or data structures, and functions in these languages cannot modify their arguments [Ack82]. Also, since it is possible to simulate any Turing machine in one of these languages, questions such as deadlock (or equivalently, terminating behavior) and determining maximum buffer sizes [...] become undecidable. [...] the specified computation in hardware or software.
One such restricted model (and in fact one of the earliest graph-based computation models) is the computation graph of Karp and Miller, where the authors establish that their computation graph model is determinate, i.e., the sequences of tokens produced on the edges of a given computation graph are unique, and do not depend on the order in which the actors in the graph fire, as long as all data dependencies are respected by the firing order. The authors also provide an algorithm that, based on topological and algebraic properties of the graph, determines whether the computation specified by a given computation graph will eventually terminate. Because of the latter property, computation graphs clearly cannot simulate all Turing machines, and hence are not as expressive as a general dataflow language like Lucid or pure LISP. Computation graphs provide some of the theoretical foundations for the SDF model to be discussed in detail in Section [...].
Another model of computation relevant to dataflow is the Petri net model [Pet81][Mur89]. A Petri net consists of a set of transitions, which are analogous to actors in dataflow, and a set of places that are analogous to arcs. Each transition has a certain number of input places and output places connected to it. Places may contain one or more tokens. A Petri net has the following semantics: a transition fires when all its input places have one or more tokens and, upon firing, it produces a certain number of tokens on each of its output places.
A large number of different kinds of Petri net models have been proposed in the literature for modeling different types of systems. Some of these Petri net models have the same expressive power as Turing machines: for example, if transitions are allowed to possess "inhibit" inputs (if a place corresponding to such an input to a transition contains a token, then that transition is not allowed to fire) then a Petri net can simulate any Turing machine (p. 201 in [Pet81]). Others (depending on topological restrictions imposed on how places and transitions can be interconnected) are equivalent to finite state machines, and yet others are similar to SDF graphs. Some extended Petri net models allow a notion of time, to model execution times of computations. There is also a body of work on stochastic extensions of timed Petri nets that are useful for modeling uncertainties in computation times. We will touch upon some of these Petri net models again in Chapter 4. Finally, there are Petri nets that distinguish between different classes of tokens in the specification (colored Petri nets), so that tokens can have information associated with them. We refer to [Pet81][Mur89] for details on the extensive variety of Petri nets that have been proposed over the years.
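The firing rule described above, including the "inhibit" inputs, can be sketched in a few lines. This is an illustrative model only: it assumes an ordinary Petri net in which firing removes exactly one token from each input place, and the function names, place names, and marking are hypothetical.

```python
def is_enabled(marking, inputs, inhibitors=()):
    """A transition may fire if every input place holds at least one token
    and every inhibit place is empty."""
    return (all(marking.get(p, 0) >= 1 for p in inputs)
            and all(marking.get(p, 0) == 0 for p in inhibitors))


def fire(marking, inputs, outputs, inhibitors=()):
    """Return the marking after firing: one token is removed from each input
    place (an assumption of this sketch) and outputs[p] tokens are added to
    each output place p."""
    if not is_enabled(marking, inputs, inhibitors):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place in inputs:
        new_marking[place] -= 1
    for place, count in outputs.items():
        new_marking[place] = new_marking.get(place, 0) + count
    return new_marking


m0 = {"p1": 1, "p2": 2, "p3": 0}
m1 = fire(m0, inputs=["p1", "p2"], outputs={"p3": 2})
print(m1)
```

After the single firing, place p1 is empty, so the same transition is no longer enabled; a transition with p1 as an inhibit input, on the other hand, becomes enabled only once p1 has been emptied.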
The particular restricted dataflow model we are mainly concerned with in this book is the SDF (Synchronous Data Flow) model proposed by Lee and Messerschmitt [LM87]. The SDF model poses restrictions on the firing of actors: the number of tokens produced (consumed) by an actor on each output (input) edge is a fixed number that is known at compile time. The number of tokens produced and consumed by each SDF actor on each of its edges is annotated in illustrations of an SDF graph by numbers at the arc source and sink respectively. In an actual implementation, arcs represent buffers in physical memory. The arcs in an SDF graph may contain initial tokens, which we also refer to as delays. Arcs with delays can be interpreted as data dependencies across iterations of the graph; this concept will be formalized in the following chapter when we discuss scheduling models. We will represent delays using bullets on the edges of the SDF graph; we indicate more than one delay on an edge by a number alongside the bullet. An example of an SDF graph is illustrated in Figure 3.2.
DSP applications typically represent computations on an indefinitely long data sequence; therefore the SDF graphs we are interested in for the purpose of signal processing must execute in a non-terminating fashion. Consequently, we must be able to obtain periodic schedules for SDF representations, which can then be run as infinite loops using a finite amount of physical memory. Unbounded buffers imply a sample rate inconsistency, and deadlock implies that all actors in the graph cannot be iterated indefinitely. Thus for our purposes, correctly constructed SDF graphs are those that can be scheduled periodically using a finite amount of memory. The main advantage of imposing restrictions on the SDF model (over a general dataflow model) lies precisely in the ability to determine whether or not an arbitrary SDF graph has a periodic schedule that neither
BACKGROUND TERMINOLOGY AND NOTATION
deadlocks nor requires unbounded buffer sizes [LM87]. The buffer sizes required to implement arcs in SDF graphs can be determined at compile time (recall that this is not possible for a general dataflow model); consequently, buffers can be allocated statically, and the run-time overhead associated with dynamic memory allocation is avoided. The existence of a periodic schedule that can be inferred at compile time implies that a correctly constructed SDF graph entails no run-time scheduling overhead.
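To see how buffer sizes can be fixed at compile time, one can simply replay a candidate periodic schedule and record the largest number of tokens each arc ever holds. The sketch below does this for a three-actor graph; the production and consumption rates, the absence of delays, and the schedule string are assumptions made for illustration and are not read from Figure 3.2.

```python
def simulate(edges, schedule):
    """Fire actors in the given order and track buffer occupancy.

    edges maps an edge name to (src, snk, produced, consumed, delay).
    Returns (final token counts, peak token counts) per edge.
    """
    tokens = {name: e[4] for name, e in edges.items()}
    peak = dict(tokens)
    for actor in schedule:
        # Firing rule: every input edge must hold enough tokens.
        for name, (src, snk, prod, cons, delay) in edges.items():
            if snk == actor and tokens[name] < cons:
                raise RuntimeError(f"{actor} cannot fire: edge {name} lacks tokens")
        for name, (src, snk, prod, cons, delay) in edges.items():
            if snk == actor:
                tokens[name] -= cons
            if src == actor:
                tokens[name] += prod
            peak[name] = max(peak[name], tokens[name])
    return tokens, peak


# Hypothetical rates: A produces 2 and B consumes 3 on AB; 1 and 1 on AC.
EDGES = {"AB": ("A", "B", 2, 3, 0), "AC": ("A", "C", 1, 1, 0)}
final, peak = simulate(EDGES, "AABABCCC")
print(final, peak)
```

Because the final token counts equal the initial ones, this schedule can be repeated indefinitely, and the peak counts give statically allocatable buffer sizes for this particular schedule. A different valid schedule may need smaller or larger buffers.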
This section briefly describes some useful properties of SDF graphs; for a more detailed and rigorous treatment, please refer to the work of Lee and Messerschmitt [LM87][Lee86]. An SDF graph is compactly represented by its topology matrix. The topology matrix, referred to henceforth as $\Gamma$, represents the SDF graph structure; this matrix contains one column for each vertex, and one row for each edge in the SDF graph. The $(i, j)$th entry in the matrix corresponds to the number of tokens produced by the actor numbered $j$ onto the edge numbered $i$. If the $j$th actor consumes tokens from the $i$th edge, i.e., the $i$th edge is incident into the $j$th actor, then the $(i, j)$th entry is negative. Also, if the $j$th actor neither produces nor consumes any tokens from the $i$th edge, then the $(i, j)$th entry is set to zero. For example, the topology matrix $\Gamma$ for the SDF graph in Figure 3.2 is:
[...] (3-1)
where the actors $A$, $B$, and $C$ are numbered 1, 2, and 3 respectively; the edges $(A, B)$ and $(A, C)$ are numbered 1 and 2 respectively. A useful property of $\Gamma$ is stated by the following theorem.
Theorem 3.1: A connected SDF graph with $s$ vertices that has consistent sample rates is guaranteed to have $rank(\Gamma) = s - 1$, which ensures that $\Gamma$ has a null space.
Proof: See [LM87]. This can easily be verified for (3-1).
This fact is utilized to determine the repetitions vector for an SDF graph. The repetitions vector $q$ for an SDF graph with $s$ actors numbered 1 to $s$ is a column vector of length $s$, with the property that if each actor $i$ is invoked a number of times equal to the $i$th entry of $q$, then the number of tokens on each edge of the SDF graph remains unchanged. Furthermore, $q$ is the smallest integer vector for which this property holds.
Clearly, the repetitions vector is very useful for generating infinite schedules for SDF graphs by indefinitely repeating a finite length schedule, while maintaining small buffer sizes between actors. Also, $q$ will only exist if the SDF graph has consistent sample rates. The conditions for the existence of $q$ are determined by Theorem 3.1 coupled with the following theorem.
Theorem 3.2: The repetitions vector for an SDF graph with consistent sample rates is the smallest integer vector in the nullspace of its topology matrix. That is, $q$ is the smallest integer vector such that $\Gamma q = 0$.
Proof: See [LM87].
The vector $q$ can be easily obtained by solving a set of linear equations; these are called balance equations, since they represent the constraint that the number of samples produced and consumed on each edge of the SDF graph be the same after each actor fires a number of times equal to its corresponding entry in the repetitions vector. For the example of Figure 3.2, from (3-1),
$q = [3 \;\; 2 \;\; 3]^T$. (3-2)
Clearly, if actors $A$, $B$, and $C$ are invoked 3, 2, and 3 times respectively, the number of tokens on the edges remains unaltered (no token on $(A, B)$, and [...] token on $(A, C)$). Thus, the repetitions vector in (3-2) brings the SDF graph back to its "initial state".
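The balance equations can be solved mechanically. The following sketch is one simple way to do so and is not claimed to be the authors' procedure: it propagates rational firing rates across the edges of a connected graph, reports an inconsistency if two edges disagree, and scales the result to the smallest integer vector. The production and consumption rates are assumed values chosen to be consistent with the invocation counts 3, 2, and 3 quoted above; the actual rates of Figure 3.2 are not reproduced here.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetitions_vector(actors, edges):
    """edges is a list of (src, snk, produced, consumed).
    Returns {actor: count}, or None if the sample rates are inconsistent."""
    rate = {actors[0]: Fraction(1)}
    pending = [actors[0]]
    while pending:
        a = pending.pop()
        for src, snk, prod, cons in edges:
            # Balance equation for this edge: rate[src] * prod == rate[snk] * cons
            if src == a:
                other, r = snk, rate[a] * prod / cons
            elif snk == a:
                other, r = src, rate[a] * cons / prod
            else:
                continue
            if other not in rate:
                rate[other] = r
                pending.append(other)
            elif rate[other] != r:
                return None  # two edges demand different relative rates
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Clear denominators, then divide out any common factor.
    scale = reduce(lambda x, y: x * y // gcd(x, y),
                   (r.denominator for r in rate.values()), 1)
    counts = {a: int(rate[a] * scale) for a in actors}
    common = reduce(gcd, counts.values())
    return {a: counts[a] // common for a in actors}


# Hypothetical rates: A produces 2 and B consumes 3 on (A, B); 1 and 1 on (A, C).
q = repetitions_vector(["A", "B", "C"], [("A", "B", 2, 3), ("A", "C", 1, 1)])
print(q)
```

With these assumed rates, the topology matrix would have rows (2, -3, 0) and (1, 0, -1), and multiplying it by the computed vector gives zero on both edges, as the nullspace characterization requires.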
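An SDF graph in which every actor consumes and produces only one token from each of its inputs and outputs is called a homogeneous SDF graph (HSDFG). An HSDFG actor fires when it has one or more tokens on all its input edges; it consumes one token from each input edge when it fires, and produces one token on all its output edges. An HSDFG is very similar to a marked graph in the context of Petri nets: transitions in the marked graph correspond to actors in the HSDFG, places correspond to edges, and initial tokens (or initial marking) of the marked graph correspond to initial tokens (or delays) in HSDFGs.
The repetitions vector defined in the previous section can be used to convert a general SDF graph into an HSDFG. We outline this transformation [...]. Each actor $A$ in the SDF graph $G$ is represented in the HSDFG by a number of vertices (invocations) of $A$ [...]. For an edge $(A, B)$ in $G$, let $n_A$ represent the number of tokens produced on the edge when $A$ fires, and let $n_B$ represent the number of tokens consumed from the edge when $B$ fires. Since each HSDFG actor produces and consumes only one token from each of its edges, each HSDFG vertex that corresponds to an actor $A$ for which $(A, B)$ is an output edge must now be the source vertex for $n_A$ edges. Each of these [...] consumes [...] the original [...]; let us call these output and input ports respectively. The $k$th sample generated [...].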
An SDF graph that is not an HSDFG can always be converted into an equivalent HSDFG [Lee86]. The resulting HSDFG has a larger number of actors than the original SDF graph. It in fact has a number of actors equal to the sum of the entries in the repetitions vector. In the worst case, the SDF to HSDFG transformation may result in an exponential increase in the number of actors (see [...] for an example of a family of SDF graphs in which this blowup occurs). Such a transformation, however, appears to be necessary when constructing periodic multiprocessor schedules from multirate SDF graphs, although there has been some work on reducing the complexity of the HSDFG that results from transforming a given SDF graph by applying graph clustering techniques to that SDF graph.
An SDF graph converted into an HSDFG for the purposes of multiprocessor scheduling can be further converted into an Acyclic Precedence Expansion Graph (APEG) by removing from the HSDFG arcs that contain initial tokens (delays). Recall that arcs with initial tokens on them represent dependencies between successive iterations of the dataflow graph. An APEG is therefore useful for constructing multiprocessor schedules that, for algorithmic simplicity, do not attempt to overlap multiple iterations of the dataflow graph by exploiting precedence constraints across iterations. Figure 3.5 shows an example of an APEG. Note that the precedence constraints present in the original HSDFG of Figure 3.4 are maintained by this APEG, as long as each iteration of the graph is completed before the next iteration begins.
Figure 3.3. Expansion of an edge in an SDF graph into multiple edges in the equivalent HSDFG. Note the input and output ports on the vertices of the HSDFG.
Figure 3.4. HSDFG obtained by expanding the SDF graph in Figure 3.2.
Figure 3.5. APEG obtained from the HSDFG in Figure 3.4.
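The expansion into an HSDFG and the subsequent reduction to an APEG can be sketched as follows. This is a commonly used token-by-token construction and is offered as an assumption about the details, not as the book's exact procedure: the k-th token produced on an SDF edge in one iteration is traced from the invocation that produces it to the invocation that consumes it, and any wrap-around into a later iteration becomes a delay on the HSDFG edge. The rates, the single delay on (A, C), and the repetitions vector are hypothetical inputs.

```python
def expand_to_hsdfg(edges, q):
    """edges: list of (src, snk, produced, consumed, delay); q: repetitions vector.
    Returns HSDFG edges as ((src, i), (snk, j), delay) with 1-based invocations."""
    hsdfg = []
    for src, snk, prod, cons, delay in edges:
        for k in range(q[src] * prod):        # tokens produced in one iteration
            i = k // prod                     # producing invocation of src
            j_abs = (delay + k) // cons       # consuming invocation, counted from start
            j = j_abs % q[snk]                # invocation index within an iteration
            iterations_later = j_abs // q[snk]
            hsdfg.append(((src, i + 1), (snk, j + 1), iterations_later))
    return hsdfg


def to_apeg(hsdfg_edges):
    """Drop every HSDFG edge that carries initial tokens (delays)."""
    return [e for e in hsdfg_edges if e[2] == 0]


SDF_EDGES = [("A", "B", 2, 3, 0), ("A", "C", 1, 1, 1)]
Q = {"A": 3, "B": 2, "C": 3}
hsdfg = expand_to_hsdfg(SDF_EDGES, Q)
apeg = to_apeg(hsdfg)
vertices = {v for e in hsdfg for v in e[:2]}
print(len(vertices), len(hsdfg), len(apeg))
```

The result is a multigraph: in this example two parallel edges connect the first invocation of A to the first invocation of B, one per token. The number of HSDFG vertices equals the sum of the entries of the repetitions vector, in line with the statement above.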
Since we are concerned with multiprocessor schedules, we assume that we start with an application represented as a homogeneous SDF graph henceforth, unless we state otherwise. This of course results in no loss of generality because a general SDF graph is converted into a homogeneous graph for the purposes of multiprocessor scheduling anyway. In Chapter 8 we discuss how the ideas that apply to HSDF graphs can be extended to graphs containing actors that display data-dependent behavior (i.e., dynamic actors).
An SDF representation of an algorithm (for example, [...], or a Fast Fourier Transform) is called an application graph. For example, Figure 3.7(a) shows an SDF representation of a two-channel multirate filter bank that consists of a pair of analysis filters followed by synthesis filters. This graph can be transformed into an equivalent HSDFG, which represents the application graph for the two-channel filter bank, as shown in Figure 3.7(b). Algorithms that map applications specified as SDF graphs onto single and multiple processors take the equivalent application graph as input. Such algorithms will be discussed in Chapters 4 and 5. Chapter 7 will discuss how the performance of a multiprocessor system after scheduling can be modeled by another HSDFG called the interprocessor communication graph, or IPC graph. The IPC graph is derived from the original application graph and the given parallel schedule. Furthermore, Chapters 9 to 11 will discuss how a third HSDFG, called the synchronization graph, can be used to analyze and optimize the synchronization structure of a multiprocessor system. The full interaction of the application graph, IPC graph, and synchronization graph, and also the formal definitions of these graphs, will then be further elaborated in Chapters 7 through 11.
Figure 3.7. (a) SDF graph representing a two-channel filter bank. (b) Application graph.
SDF should not be confused with synchronous languages (e.g., LUSTRE, SIGNAL, and ESTEREL), which have very different semantics from SDF. Synchronous languages have been proposed for formally specifying and modeling reactive systems, i.e., systems that constantly react to stimuli from a given physical environment. Signal processing systems fall into the reactive category, and so do control and monitoring systems, communication protocols, man-machine interfaces, etc. In synchronous languages, variables are possibly infinite sequences of data of a certain type. Associated with each such sequence is a conceptual (and sometimes explicit) notion of a clock signal. In LUSTRE, each variable is explicitly associated with a clock, which determines the instants at which the value of that variable is defined. SIGNAL and ESTEREL do not have an explicit notion of a clock. The clock signal in LUSTRE is a sequence of Boolean values, and a variable in a LUSTRE program assumes its $n$th value when its corresponding clock takes its $n$th TRUE value. Thus we may relate one variable with another by means of their clocks. In ESTEREL, on the other hand, clock ticks are implicitly defined in terms of instants when the reactive system corresponding to an ESTEREL program receives (and reacts to) external events. All computations in a synchronous language are defined with respect to these clocks.
In contrast, the term "synchronous" in the SDF context refers to the fact that SDF actors produce and consume fixed numbers of tokens, and these numbers are known at compile time. This allows us to obtain periodic schedules for SDF graphs such that the average rates of firing of actors are fixed relative to one another. We will not be concerned with synchronous languages, although these languages have a close and interesting relationship with dataflow models used for specification of signal processing algorithms [LP95].
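The relationship between a LUSTRE variable and its clock can be modeled in a few lines. The sketch below is a Python model of the rule stated above, not LUSTRE code: the function name and the example sequences are hypothetical, and instants at which the variable is undefined are marked with None.

```python
def place_on_clock(values, clock):
    """Align a variable's values with its Boolean clock.

    The n-th value is assumed at the instant of the clock's n-th TRUE;
    at every other instant the variable is undefined (None here).
    """
    remaining = iter(values)
    return [next(remaining, None) if tick else None for tick in clock]


clock = [True, False, True, True, False]
timeline = place_on_clock([10, 20, 30], clock)
print(timeline)
```

Two variables that share the same clock are defined at exactly the same instants, which is what allows one variable to be related to another through their clocks.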
BACKGROUND TERMINOLOGY AND NOTATION
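An HSDFG is a directed multigraph $(V, E)$ in which each edge $e \in E$ is directed from its source vertex $\mathrm{src}(e)$ to its sink vertex $\mathrm{snk}(e)$, and has an associated non-negative integer delay (the number of initial tokens on $e$), denoted by $\mathrm{delay}(e)$. We say that $e$ is an output edge of $\mathrm{src}(e)$, and that $e$ is an input edge of $\mathrm{snk}(e)$. We will also use the notation $(v_i, v_j)$ for an edge directed from $v_i$ to $v_j$. The delay on the edge is denoted by $\mathrm{delay}((v_i, v_j))$ or simply $\mathrm{delay}(v_i, v_j)$.

A path in $(V, E)$ is a finite, non-empty sequence $(e_1, e_2, \ldots, e_n)$, where each $e_i$ is a member of $E$, and $\mathrm{snk}(e_1) = \mathrm{src}(e_2)$, $\mathrm{snk}(e_2) = \mathrm{src}(e_3)$, ..., $\mathrm{snk}(e_{n-1}) = \mathrm{src}(e_n)$. We say that the path $p = (e_1, e_2, \ldots, e_n)$ contains each $e_i$ and each subsequence of $(e_1, e_2, \ldots, e_n)$; $p$ is directed from $\mathrm{src}(e_1)$ to $\mathrm{snk}(e_n)$; and each member of $\{\mathrm{src}(e_1), \mathrm{src}(e_2), \ldots, \mathrm{src}(e_n), \mathrm{snk}(e_n)\}$ is on $p$. The path $p$ originates at vertex $\mathrm{src}(e_1)$ and terminates at vertex $\mathrm{snk}(e_n)$. A dead-end path is a path that terminates at a vertex that has no successors. That is, $p = (e_1, e_2, \ldots, e_n)$ is a dead-end path if for all $e \in E$, $\mathrm{src}(e) \ne \mathrm{snk}(e_n)$. A path that is directed from a vertex to itself is called a cycle, and a simple cycle is a cycle of which no proper subsequence is a cycle.

If $(p_1, p_2, \ldots, p_k)$ is a finite sequence of paths such that $p_i = (e_{i,1}, e_{i,2}, \ldots, e_{i,n_i})$ for $1 \le i \le k$, and $\mathrm{snk}(e_{i,n_i}) = \mathrm{src}(e_{i+1,1})$ for $1 \le i \le (k - 1)$, then we define the concatenation of $(p_1, p_2, \ldots, p_k)$, denoted $\langle (p_1, p_2, \ldots, p_k) \rangle$, by

$\langle (p_1, p_2, \ldots, p_k) \rangle = (e_{1,1}, \ldots, e_{1,n_1}, e_{2,1}, \ldots, e_{2,n_2}, \ldots, e_{k,1}, \ldots, e_{k,n_k})$.

Clearly, $\langle (p_1, p_2, \ldots, p_k) \rangle$ is a path from $\mathrm{src}(e_{1,1})$ to $\mathrm{snk}(e_{k,n_k})$.

If $p = (e_1, e_2, \ldots, e_n)$ is a path in an HSDFG, then we define the path delay of $p$, denoted $\mathrm{Delay}(p)$, by

$\mathrm{Delay}(p) = \sum_{i=1}^{n} \mathrm{delay}(e_i)$.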
Since the delays on all HSDFG edges are restricted to be non-negative, it is easily seen that between any two vertices $x, y \in V$, either there is no path directed from $x$ to $y$, or there exists a (not necessarily unique) minimum-delay path between $x$ and $y$. Given an HSDFG $G$, and vertices $x, y$ in $G$, we define $\rho_G(x, y)$ to be equal to the path delay of a minimum-delay path from $x$ to $y$ if there exist one or more paths from $x$ to $y$, and equal to $\infty$ if there is no path from $x$ to $y$. If $G$ is understood, then we may drop the subscript and simply write “$\rho$” in place of “$\rho_G$”. It is easily seen that minimum delay path lengths satisfy the following inequality for any vertices $x, y, z$ in $G$:

$\rho_G(x, z) \le \rho_G(x, y) + \rho_G(y, z)$.
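The definition of the minimum path delay can be made concrete with a short sketch. The Python fragment below is an illustration only, not part of the original text: the function name `min_delay` and the edge-list representation are assumptions made here. Because all delays are non-negative, a Dijkstra-style search suffices; since a path must contain at least one edge, the value computed for a vertex paired with itself is the minimum delay of a cycle through that vertex.

```python
import heapq
import math

def min_delay(edges, x, y):
    """Return rho(x, y): the smallest total delay over all paths from x to y.

    `edges` is a list of (src, snk, delay) triples with non-negative delays;
    parallel edges are allowed. Returns math.inf when no path exists.
    """
    out = {}
    for src, snk, delay in edges:
        out.setdefault(src, []).append((snk, delay))
    best = {}                                   # settled vertex -> minimum path delay
    heap = [(d, v) for v, d in out.get(x, [])]  # start from the output edges of x
    heapq.heapify(heap)
    while heap:
        d, v = heapq.heappop(heap)
        if v in best:
            continue                            # already settled with a smaller delay
        best[v] = d
        for w, delay in out.get(v, []):
            if w not in best:
                heapq.heappush(heap, (d + delay, w))
    return best.get(y, math.inf)

# Hypothetical four-vertex HSDFG used only for illustration.
example_edges = [("A", "B", 0), ("B", "C", 1), ("A", "C", 2),
                 ("C", "A", 1), ("C", "D", 0)]
print(min_delay(example_edges, "A", "C"))   # 1, via A -> B -> C
print(min_delay(example_edges, "D", "A"))   # inf, D has no output edges
```

The inequality above can be spot-checked on such an example by evaluating `min_delay` over all triples of vertices.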
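By a subgraph of $(V, E)$, we mean the directed graph formed by any $V' \subseteq V$ together with the set of edges $\{e \in E \mid \mathrm{src}(e), \mathrm{snk}(e) \in V'\}$. We denote the subgraph associated with the vertex-subset $V'$ by $\mathrm{subgraph}(V')$. We say that $(V, E)$ is strongly connected if for each pair of distinct vertices $x, y$, there is a path directed from $x$ to $y$ and there is a path directed from $y$ to $x$. We say that a subset $V' \subseteq V$ is strongly connected if $\mathrm{subgraph}(V')$ is strongly connected. A strongly connected component (SCC) of $(V, E)$ is a strongly connected subset $V' \subseteq V$ such that no strongly connected subset of $V$ properly contains $V'$. If $V'$ is an SCC, then when there is no ambiguity, we may also say that $\mathrm{subgraph}(V')$ is an SCC. If $C_1$ and $C_2$ are distinct SCCs in $(V, E)$, we say that $C_1$ is a predecessor SCC of $C_2$ if there is an edge directed from some vertex in $C_1$ to some vertex in $C_2$; $C_1$ is a successor SCC of $C_2$ if $C_2$ is a predecessor SCC of $C_1$. An SCC is a source SCC if it has no predecessor SCC; and an SCC is a sink SCC if it has no successor SCC. An edge $e$ is a feedforward edge of $(V, E)$ if it is not contained in an SCC, or equivalently, if it is not contained in a cycle; an edge that is contained in at least one cycle is called a feedback edge.

A sequence of vertices $(v_1, v_2, \ldots, v_k)$ is a chain that joins $v_1$ and $v_k$ if $v_{i+1}$ is adjacent to $v_i$ for $i = 1, 2, \ldots, (k - 1)$. We say that a directed multigraph $(V, E)$ is connected if for any pair of distinct members $A$, $B$ of $V$, there is a chain that joins $A$ and $B$. Given a directed multigraph $G = (V, E)$, there is a unique partition (unique up to reordering of the members of the partition) $V_1, V_2, \ldots, V_n$ such that for $1 \le i \le n$, $\mathrm{subgraph}(V_i)$ is connected; and for each $e \in E$, $\mathrm{src}(e), \mathrm{snk}(e) \in V_j$ for some $j$. Thus, each $V_i$ can be viewed as a maximal connected subset of $V$, and we refer to each $V_i$ as a connected component of $G$.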
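A topological sort of an acyclic directed multigraph $(V, E)$ is an ordering $v_1, v_2, \ldots, v_{|V|}$ of the members of $V$ such that for each $e \in E$,

$((\mathrm{src}(e) = v_i) \text{ and } (\mathrm{snk}(e) = v_j)) \Rightarrow (i < j)$;

that is, the source vertex of each edge occurs earlier in the ordering than the sink vertex. An acyclic directed multigraph is said to be well-ordered if it has only one topological sort, and we say that an $n$-vertex well-ordered directed multigraph is chain-structured if it has $(n - 1)$ edges.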
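As an illustration of these definitions, the following sketch computes a topological sort with Kahn's algorithm and reports whether the sort is unique. It is not taken from the original text; the function name and the `(src, snk)` edge representation are assumptions. The uniqueness test relies on the observation that the ordering is unique exactly when, at every step, only one vertex has all of its predecessors already placed.

```python
from collections import deque

def topological_sort(vertices, edges):
    """Return (order, unique) for an acyclic directed multigraph.

    `edges` is a list of (src, snk) pairs; parallel edges are allowed.
    Raises ValueError if the graph contains a cycle.
    """
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for src, snk in edges:
        successors[src].append(snk)
        indegree[snk] += 1
    ready = deque(v for v in vertices if indegree[v] == 0)
    order, unique = [], True
    while ready:
        if len(ready) > 1:
            unique = False          # more than one vertex could be placed next
        v = ready.popleft()
        order.append(v)
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle")
    return order, unique

# A three-vertex chain with one extra edge has exactly one topological sort.
print(topological_sort(["A", "B", "C"], [("A", "B"), ("B", "C"), ("A", "C")]))
```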
For elaboration on any of the graph-theoretic concepts presented in this section, we refer the reader to Cormen, Leiserson, and Rivest.
AG
one of these
A polynomial transformation from “B” to “A” implies that a polynomial time algorithm to solve “A” can be used to solve “B” in polynomial time, and if “B” is NP-complete then the transformation implies that “A” is at least as complex as any NP-complete problem. Such a problem is called NP-hard.

We illustrate this concept with a simple example. Consider the set-covering problem, where we are given a collection $C$ of subsets of a finite set $S$, and a positive integer $k \le |C|$. The problem is to find out if there is a subset $C' \subseteq C$ such that $|C'| \le k$ and each element of $S$ belongs to at least one set in $C'$. By finding a polynomial transformation from a known NP-complete problem to the set-covering problem we can prove that the set-covering problem is NP-hard. For this purpose, we choose the vertex cover problem, where we are given a graph $G = (V, E)$ and a positive integer $k \le |V|$, and the problem is to determine if there exists a subset of vertices $V' \subseteq V$ such that $|V'| \le k$ and for each edge $e \in E$ either $\mathrm{src}(e) \in V'$ or $\mathrm{snk}(e) \in V'$. The subset $V'$ is said to be a vertex cover of $G$. The vertex cover problem is known to be NP-complete, and by transforming it to the set-covering problem in polynomial time, we can show that the set-covering problem is NP-hard.

Given an instance of vertex cover, we can convert it into an instance of set covering by first letting $S$ be the set of edges $E$. Then for each vertex $v \in V$, we construct the subset of edges $T_v = \{e \in E \mid v = \mathrm{src}(e) \text{ or } v = \mathrm{snk}(e)\}$. The set $\{T_v \mid v \in V\}$ forms the collection $C$. Clearly, this transformation can be done in time at most linear in the number of edges of the input graph, and the resulting $C$ has size equal to $|V|$. Our transformation ensures that $V'$ is a vertex cover for $G$ if and only if $\{T_v \mid v \in V'\}$ is a set cover for the set of edges $E$. Now, we may use a solution of set cover to solve the transformed problem, since a vertex cover with $|V'| \le k$ exists if and only if a corresponding set cover with $|C'| \le k$ exists for $E$. Thus, the existence of a polynomial time algorithm for set cover implies the existence of a polynomial time algorithm for vertex cover. This proves that set cover is NP-hard.
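The transformation just described is simple enough to sketch directly. The code below is an illustrative assumption-laden sketch rather than anything from the original text: the function names are hypothetical, edges are identified by their index in the input list, and the set-cover check is a brute-force (exponential time) search included only so that the equivalence can be observed on tiny instances.

```python
from itertools import combinations

def vertex_cover_to_set_cover(vertices, edges):
    """Build the set-cover instance: S is the set of edges (by index),
    and T[v] is the subset of edges that have v as source or sink."""
    universe = set(range(len(edges)))
    subsets = {v: {i for i, (a, b) in enumerate(edges) if v in (a, b)}
               for v in vertices}
    return universe, subsets

def has_set_cover(universe, subsets, k):
    """Brute-force test: do at most k of the subsets cover the universe?"""
    names = list(subsets)
    for size in range(k + 1):
        for chosen in combinations(names, size):
            covered = set().union(*(subsets[v] for v in chosen))
            if covered == universe:
                return True
    return False

# Path graph A - B - C - D: {B, C} is a vertex cover, no single vertex is.
path_edges = [("A", "B"), ("B", "C"), ("C", "D")]
S, T = vertex_cover_to_set_cover(["A", "B", "C", "D"], path_edges)
print(has_set_cover(S, T, 1), has_set_cover(S, T, 2))   # False True
```

Only `vertex_cover_to_set_cover` corresponds to the polynomial transformation; the brute-force solver stands in for whatever set-cover procedure one assumes to exist.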
It can easily be shown that the set cover problem is also NP-complete by showing that it belongs to the class NP. However, since a formal discussion of complexity classes is beyond the scope of this book, we refer the interested reader to standard texts on the subject for a comprehensive discussion of complexity classes and the definition of the class NP.

In summary, by finding a polynomial transformation from a problem that is known to be NP-complete to a given problem, we can prove that the given problem is NP-hard. This implies that a polynomial time algorithm to solve the given problem in all likelihood does not exist, and if such an algorithm does exist, a major breakthrough in complexity theory would be required to find it. This provides justification for solving such problems using suboptimal polynomial time heuristics. It should be pointed out that a polynomial transformation of an NP-complete problem to a given problem, if it exists, is often quite involved, and is not necessarily as straightforward as in the case of the set-covering example discussed here.

In Chapter 10, we use the concepts outlined in this section to show that a particular synchronization optimization problem is NP-hard by reducing the set-covering problem to the synchronization optimization problem. We then discuss efficient heuristics to solve that problem.
There is a rich history of work on shortest path algorithms, and there are many variants and special cases of these problems (depending, for example, on the topology of the graph, or on the values of the edge weights) for which efficient algorithms have been proposed. In what follows we focus on the most general and, from the point of view of this book, most useful shortest path algorithms.
Consider a weighted, directed graph $G = (V, E)$ with real valued edge weights $w(u, v)$ for each edge $(u, v) \in E$. The single-source shortest path problem finds a path with minimum weight (defined as the sum of the weights of the edges on the path) from a given vertex $s \in V$ to all other vertices $u \in V$, $u \ne s$, whenever at least one path from $s$ to $u$ exists. If no such path exists, then the shortest path weight is set to $\infty$.

The two best known algorithms for the single-source shortest path problem are Dijkstra's algorithm and the Bellman-Ford algorithm. Dijkstra's algorithm is applicable to graphs with non-negative weights ($w(u, v) \ge 0$). The running time of this algorithm is $O(|V|^2)$. The Bellman-Ford algorithm solves the single-source shortest path problem for graphs that may have negative edge weights; the Bellman-Ford algorithm detects the existence of negative weight cycles reachable from $s$ and, if such cycles are detected, it reports that no solution to the shortest path problem exists. If a negative weight cycle is reachable from $s$, then clearly we can reduce the weight of any path by traversing this negative cycle one or more times. Thus, no finite solution to the shortest path problem exists in this case. An interesting fact to note is that for graphs containing negative cycles, the problem of determining the weight of the shortest simple path between two vertices is NP-hard. A simple path is defined as one that does not visit the same vertex twice, i.e., a simple path does not include any cycles.

The all-pairs shortest path problem computes the shortest path between all pairs of vertices in a graph. Clearly, the single-source problem can be applied repeatedly to solve the all-pairs problem. However, a more efficient algorithm based on dynamic programming, the Floyd-Warshall algorithm, may be used to solve the all-pairs shortest path problem in $O(|V|^3)$ time. This algorithm solves the all-pairs problem in the absence of negative cycles.

The corresponding longest path problems may be solved using the shortest path algorithms. One straightforward way to do this is to simply negate all edge weights (i.e., use the edge weights $w'(u, v) = -w(u, v)$) and apply the Bellman-Ford algorithm for the single-source problem. If all the edge weights are positive, determining the weight of the longest simple path becomes NP-hard for a general graph containing cycles reachable from the source vertex.

In the following sections, we briefly describe the shortest path algorithms discussed thus far. We describe the algorithms in pseudo-code, and assume we only need the weight of the longest or shortest path; these algorithms also yield the actual path, but we do not need this information for the purposes of this book. Also, we will not delve into the correctness proofs for these algorithms, but will refer the reader to standard texts for a detailed discussion of these graph algorithms.
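The pseudo-code for the algorithm is shown in Figure 3.8. Each vertex is extracted once and each edge is examined once. A straightforward implementation of extracting the minimum element takes $O(|V|)$ time for each iteration of the loop, so the algorithm can be implemented in $O(|V|^2)$ time. A more clever implementation of the minimum extraction step leads to a modified implementation of the algorithm with a lower running time.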
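One common way to speed up the minimum-extraction step is a binary heap. The sketch below is an assumption of this kind, not the book's own implementation: it uses Python's `heapq` with lazy deletion of stale heap entries, which gives a running time of $O((|V| + |E|) \log |V|)$. The function name and the `(u, v, w)` edge triples are illustrative choices.

```python
import heapq
import math

def dijkstra(vertices, edges, source):
    """Single-source shortest path weights for non-negative edge weights.

    `edges` is a list of (u, v, w) triples. Returns a dict d with d[v] equal
    to the shortest path weight from `source` to v (math.inf if unreachable).
    """
    adjacent = {v: [] for v in vertices}
    for u, v, w in edges:
        if w < 0:
            raise ValueError("Dijkstra's algorithm needs non-negative weights")
        adjacent[u].append((v, w))
    d = {v: math.inf for v in vertices}
    d[source] = 0
    heap = [(0, source)]
    done = set()
    while heap:
        du, u = heapq.heappop(heap)     # extract the vertex with minimum d
        if u in done:
            continue                    # stale entry left by an earlier update
        done.add(u)
        for t, w in adjacent[u]:
            if du + w < d[t]:           # d(t) <- min(d(t), d(u) + w(u, t))
                d[t] = du + w
                heapq.heappush(heap, (d[t], t))
    return d

# Hypothetical example graph.
g = [("s", "a", 4), ("s", "b", 1), ("b", "a", 2), ("a", "c", 1), ("b", "c", 5)]
print(dijkstra(["s", "a", "b", "c"], g, "s"))
```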
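The Bellman-Ford algorithm solves the single-source shortest path problem even when edge weights are negative, provided there are no negative cycles reachable from the designated source vertex; it detects such cycles when these are present. The nested For loop in the algorithm determines its complexity, which is $O(|V||E|)$. This algorithm is based on the dynamic programming technique.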
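A minimal runnable sketch of the Bellman-Ford procedure follows. It is written here for illustration and is not the book's pseudo-code: the function name, the returned `(d, negative_cycle)` pair, and the early exit when a pass changes nothing are assumptions of this sketch.

```python
import math

def bellman_ford(vertices, edges, source):
    """Return (d, negative_cycle) for a graph given as (u, v, w) edge triples.

    d[v] is the shortest path weight from `source` to v; it is meaningful
    only when negative_cycle is False.
    """
    d = {v: math.inf for v in vertices}
    d[source] = 0
    for _ in range(len(vertices) - 1):      # at most |V| - 1 relaxation passes
        changed = False
        for u, v, w in edges:
            if d[u] + w < d[v]:
                d[v] = d[u] + w
                changed = True
        if not changed:
            break                           # distances have already converged
    # An edge that can still be relaxed implies a reachable negative cycle.
    negative_cycle = any(d[u] + w < d[v] for u, v, w in edges)
    return d, negative_cycle

# Hypothetical example with one negative edge but no negative cycle.
g = [("s", "a", 4), ("s", "b", 5), ("b", "a", -3), ("a", "c", 2)]
print(bellman_ford(["s", "a", "b", "c"], g, "s"))
```

The same routine can be applied to longest path problems by negating the edge weights first, with the caveat about positive cycles noted earlier.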
Next, consider the all-pairs shortest path problem. One simple method of solving this is to apply the single-source problem to all vertices in the graph; this takes $O(|V|^2 |E|)$ time using the Bellman-Ford algorithm. The Floyd-Warshall algorithm improves upon this. A pseudo-code specification of this algorithm is given in Figure 3.10. The triply nested loop in this algorithm clearly implies a complexity of $O(|V|^3)$. This algorithm is based upon dynamic programming: at the $k$th iteration of the outermost loop, the shortest path from the vertex numbered $i$ to the vertex numbered $j$ is determined among all paths that do not visit any vertex numbered higher than $k$. Again, we leave it to standard texts for a formal
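Input: Weighted directed graph $G = (V, E)$, with non-negative edge weight $w(e)$ for each $e \in E$, and a source vertex $s \in V$.
Output: $d(v)$, the weight of the shortest path from $s$ to each vertex $v \in V$.
1. Initialize $d(s) = 0$, and $d(v) = \infty$ for all other vertices.
2. $Q \leftarrow V$
3. Extract $u \in Q$ such that $d(u) = \min(d(v) \mid v \in Q)$, and remove $u$ from $Q$.
4. For each edge $(u, t) \in E$: $d(t) \leftarrow \min(d(t), d(u) + w(u, t))$
5. If $Q$ is not empty, go to Step 3.
Figure 3.8. Dijkstra's algorithm.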
proof of correctness.
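The dynamic programming recurrence behind the Floyd-Warshall algorithm can be sketched in a few lines. This is an illustrative version written for this discussion, with vertices numbered from 0 rather than 1 and a hypothetical function name; it assumes, as the text does, that the graph has no negative cycles.

```python
import math

def floyd_warshall(n, edges):
    """All-pairs shortest path weights for vertices numbered 0 .. n-1.

    `edges` is a list of (i, j, w) triples. Returns an n x n matrix A in
    which A[i][j] is the shortest path weight from i to j (math.inf if none).
    """
    A = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i, j, w in edges:
        A[i][j] = min(A[i][j], w)       # keep the lightest of any parallel edges
    for k in range(n):                  # allow vertex k as an intermediate stop
        for i in range(n):
            for j in range(n):
                if A[i][k] + A[k][j] < A[i][j]:
                    A[i][j] = A[i][k] + A[k][j]
    return A

# Hypothetical four-vertex example with one negative edge.
A = floyd_warshall(4, [(0, 1, 3), (1, 2, -2), (0, 2, 5), (2, 3, 1)])
print(A[0][2], A[0][3])   # 1 2
```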
As discussed in subsequent chapters, feasible schedules are obtained as solutions of a system of difference constraints. Such constraints are of the form
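Function SingleSourceShortestPath
Input: Weighted directed graph $G = (V, E)$, with edge weight $w(e)$ for each $e \in E$, and a source vertex $s \in V$.
Output: $d(v)$, the weight of the shortest path from $s$ to each vertex $v \in V$, or else a Boolean indicating the presence of negative cycles reachable from $s$.
1. Initialize $d(s) = 0$, and $d(v) = \infty$ for all other vertices.
2. Repeat $|V| - 1$ times:
      For each edge $(u, v) \in E$: $d(v) \leftarrow \min(d(v), d(u) + w(u, v))$
3. For each edge $(u, v) \in E$:
      If $d(v) > d(u) + w(u, v)$: Set NegativeCyclesExist = TRUE
Figure 3.9. The Bellman-Ford algorithm.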
$x_i - x_j \le c_{ij}$,
where the $x_i$ are unknowns to be determined, and the $c_{ij}$ are given; this problem is a special case of linear programming. The data precedence constraints between actors in a dataflow graph often lead to a system of difference constraints, as we shall see later. Such a system of inequalities can be solved using shortest path algorithms, by transforming the difference constraints into a constraint graph. This graph consists of a number of vertices equal to the number of variables $x_i$,
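Input: Weighted directed graph $G = (V, E)$, with edge weight $w(e)$ for each $e \in E$.
Output: $d(u, v)$, the weight of the shortest path from each vertex $u \in V$ to each vertex $v \in V$.
1. Let $|V| = n$; number the vertices $1, 2, \ldots, n$.
2. Let $A$ be an $n \times n$ matrix; set $A(i, j)$ as the weight of the edge from the vertex numbered $i$ to the vertex numbered $j$. If no such edge exists, $A(i, j) = \infty$. Also, $A(i, i) = 0$.
3. For $k = 1, 2, \ldots, n$: For $i = 1, 2, \ldots, n$: For $j = 1, 2, \ldots, n$:
      $A(i, j) \leftarrow \min(A(i, j), A(i, k) + A(k, j))$
4. For vertices $u, v \in V$ with enumeration $u \leftarrow i$ and $v \leftarrow j$, set $d(u, v) = A(i, j)$.
Figure 3.10. The Floyd-Warshall algorithm.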
and for each difference constraint $x_i - x_j \le c_{ij}$, the graph contains an edge $(v_j, v_i)$ with edge weight $c_{ij}$. An additional vertex $v_0$ is also included, with zero weight edges directed from $v_0$ to all other vertices in the graph. The solution to the system of difference constraints is then simply given by the weights of the shortest paths from $v_0$ to all other vertices in the graph. That is, setting each $x_i$ to be the weight of the shortest path from $v_0$ to $v_i$ results in a feasible solution to the set of difference constraints. A feasible solution exists if, and only if, there are no negative cycles in the constraint graph. The difference constraints can therefore be solved using the Bellman-Ford algorithm. The reason for adding $v_0$ is to ensure that negative cycles in the graph, if present, are reachable from the source vertex. This in turn ensures that given $v_0$ as the source vertex, the Bellman-Ford algorithm will determine the existence of a feasible solution.

For example, consider a set of inequalities in three variables whose constraint graph is shown in Figure 3.11. A feasible solution is obtained by computing the shortest paths from $v_0$ to each $x_i$; thus $x_1 = -1$, $x_2 = -3$, and $x_3 = 0$ is a feasible solution. Clearly, given such a feasible solution, if we add the same constant to each $x_i$, we obtain another feasible solution. We make use of such solutions of difference constraints in Chapter 7.
Figure 3.11. Constraint graph.
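The constraint-graph construction can be sketched as follows. This code is an illustration written for this section, not the authors' implementation; the function name, the `(i, j, c)` encoding of a constraint $x_i - x_j \le c$, and the sample inequalities are all hypothetical (in particular, the sample system is not the one drawn in Figure 3.11). Vertex 0 plays the role of the additional source vertex.

```python
import math

def solve_difference_constraints(num_vars, constraints):
    """Solve constraints x_i - x_j <= c, given as (i, j, c) with 1 <= i, j <= num_vars.

    Returns a list [x_1, ..., x_n] that satisfies every constraint, or None
    when the constraint graph contains a negative cycle (no feasible solution).
    """
    # Zero-weight edges from the extra source vertex 0 to every variable vertex.
    edges = [(0, i, 0) for i in range(1, num_vars + 1)]
    # Constraint x_i - x_j <= c becomes an edge from v_j to v_i with weight c.
    edges += [(j, i, c) for i, j, c in constraints]
    d = [math.inf] * (num_vars + 1)
    d[0] = 0
    for _ in range(num_vars):               # |V| - 1 Bellman-Ford passes
        for u, v, w in edges:
            if d[u] + w < d[v]:
                d[v] = d[u] + w
    if any(d[u] + w < d[v] for u, v, w in edges):
        return None                         # negative cycle detected
    return d[1:]

# Hypothetical system: x1 - x2 <= 3, x2 - x3 <= -2, x1 - x3 <= 2.
print(solve_difference_constraints(3, [(1, 2, 3), (2, 3, -2), (1, 3, 2)]))
```

Because every variable vertex is reachable from vertex 0 through a zero-weight edge, all returned values are finite and at most zero.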
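As we shall see in subsequent chapters, the maximum achievable throughput for a given HSDFG is related to the maximum cycle mean of that graph. The maximum cycle mean of an HSDFG $G$ is defined as

$\mathrm{MCM}(G) = \max_{\text{cycle } C \text{ in } G} \left\{ \frac{T(C)}{\mathrm{Delay}(C)} \right\}$,

where $T(C)$ is the sum of the execution times $t(v)$ over all actors $v$ on the cycle $C$.

... a comprehensive overview of ... maximum cycle mean. Out of these, ... appears to have the most efficient ...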
In this chapter we covered background relating to dataflow models of computation. We discussed conversion of a general SDF graph into a homogeneous SDF graph, and generation of an Acyclic Precedence Expansion Graph (APEG). We briefly described asymptotic notation and the notion of NP-complete problems. We described some useful shortest path algorithms that are used extensively in the following chapters, and defined the maximum cycle mean. This background will be used extensively in the remainder of this book.
This chapter discusses parallel scheduling of application graphs. The performance metric of interest for evaluating schedules is the average iteration period $T$: the average time it takes for all the actors in the graph to be executed once. Equivalently, we could use the throughput $T^{-1}$ (i.e., the number of iterations of the graph executed per unit time) as a performance metric. Thus an optimal schedule is one that minimizes $T$.
In the execution of a dataflow graph, actors fire when a sufficient number of tokens are present at their inputs. A dataflow graph therefore lends itself naturally to functional parallelism, where the problem is to assign tasks to multiple processors. Systolic and wavefront arrays, on the other hand, exploit data parallelism, where the data set is partitioned among multiple processors executing the same program. Ideally, we would like to exploit data parallelism along with functional parallelism within the same parallel programming framework. Such a combined framework is currently an active research topic; several parallel languages have been proposed recently that allow a programmer to specify both data as well as functional parallelism [BH98][RSB97]. Ramaswamy et al. [RSB97] propose a hierarchical Macro Dataflow Graph representation of programs written in FORTRAN. Atomic nodes at the lowest level of the hierarchy represent tasks that are run in a data parallel fashion on a specified number of processors. The nodes themselves are run concurrently, utilizing functional parallelism. The work of Printz [Pri91] on geometric scheduling, and the Multidimensional SDF model proposed by Lee in [Lee93], are two other promising approaches for combining data and functional parallelism.
strategy is essentially a self-timed approach where the order in which processors communicate is determined at compile time, and the target hardware enforces the predetermined transaction order during run-time. Such a strategy leads to a low overhead interprocessor communication mechanism. We will discuss this model in greater detail in the following two chapters.

The trade-off between the generality of the applications that can be targeted by a particular scheduling model, and the run-time overhead and implementation complexity entailed by that model, is shown in Figure 4.1. We discuss these scheduling strategies in detail in the following sections.
In the fully-static strategy, the exact firing time of each actor is assumed to be known at compile time. Such a scheduling style is used in conjunction with the systolic array architectures discussed in Section 2.4, for scheduling
Figure 4.1. Trade-off of generality against run-time overhead and implementation complexity.
processors discussed in Section 2.2.3, and also in high-level synthesis of applications that consist only of operations with guaranteed worst-case execution times [De 94]. Under a fully-static schedule, all processors run in lock step; the operation each processor performs on each clock cycle is predetermined at compile time and is enforced at run-time either implicitly (by the program each processor executes, perhaps augmented with “nop”s or idle cycles for correct timing) or explicitly (by means of a program sequencer, for example).

A fully-static schedule of a simple HSDFG $G$ is illustrated in Figure 4.2. The fully-static schedule is schematically represented as a Gantt chart, which indicates the processors along the vertical axis, and time along the horizontal axis. The actors are represented as rectangles with horizontal length equal to the execution time of the actor. The left edge of each rectangle in the Gantt chart corresponds to the starting time of the corresponding actor. The Gantt chart can be viewed as a processor-time plane; scheduling can then be viewed as a mechanism to tile this plane while minimizing total schedule length, or equivalently minimizing idle time (“empty spaces” in the tiling process). Clearly, the fully-static strategy is viable only if actor execution time estimates are accurate and data-independent, or if tight worst-case estimates are available for these execution times.

As shown in Figure 4.2, two different types of fully-static schedules arise, depending on how successive iterations of the HSDFG are treated. Execution times of all actors are assumed to be one time unit in this example. The fully-static schedule in Figure 4.2(b) represents a blocked schedule: successive iterations of the HSDFG in a blocked schedule are treated separately so that each iteration is completed before the next one begins. A more elaborate blocked schedule on five processors is shown in Figure 4.3.
The HSDFG is scheduled as if it executes for only one iteration, i.e., inter-iteration dependencies are ignored; this schedule is then repeated to get an infinite periodic schedule for the HSDFG. The length of the blocked schedule determines the average iteration period $T$. The scheduling problem is then to obtain a schedule that minimizes $T$ (which is also called the makespan of the schedule). A lower bound on $T$ for a blocked schedule is simply the length of the critical path of the graph, which is the longest delay-free path in the graph.

Ignoring the inter-iteration dependencies when scheduling an application graph is equivalent to the classical multiprocessor scheduling problem for an Acyclic Precedence Expansion Graph (APEG). As discussed in Section 3.8, the APEG is obtained from the given application graph by eliminating all edges with delays on them (edges with delays represent dependencies across iterations) and replacing multiple edges that are directed between the same two vertices in the same direction with a single edge. This replacement is done because such multiple edges represent identical precedence constraints; these edges are taken into account individually during buffer assignment, however. Optimal multiprocessor scheduling of an acyclic graph is known to be NP-hard, and a number of heuristics have been proposed for this problem. One of the earliest, and still popular, solutions to this problem is list-scheduling, first proposed by Hu [Hu61]. List-scheduling is a greedy approach: whenever a task is ready to run, it is scheduled
Figure 4.2. Fully-static schedule. (a) HSDFG and acyclic precedence graph. (b) Blocked schedule. (c) Overlapped schedule.
as soon as a processor is available to run it. Tasks are assigned priorities, and among the tasks that are ready to run at any instant, the task with the highest priority is executed first. Various researchers have proposed different priority mechanisms for list-scheduling [ACD74], some of which use critical-path-based methods [Koh75][Bla87] ([Bla87] summarizes a large number
Figure 4.3. Panels: (a) HSDFG; (c) fully-static execution.
N iterations of
is described in detail in Section 4.8.
can be computed efficiently and optimally in polynomial time [P~91][GS92]. Overlapped scheduling heuristics have not been as extensively studied as blocked schedules. The main work in this area is by Lam [Lam88], and deGroot [dGH92], who propose a modified list-scheduling heuristic that explicitly constructs an overlapped schedule. Another work related to overlapped scheduling is the “cyclo-static scheduling” approach proposed by Schwartz. This approach attempts to optimally tile the processor-time plane to obtain the best possible schedule. The search involved in this process has a worst-case complexity that is exponential in the size of the input graph, although it appears that the complexity is manageable in practice, at least for small examples [SI85].
The fully-static approach introduced in the previous section cannot be used when actors have variable execution times; the fully-static approach requires precise knowledge of actor execution times to guarantee sender-receiver synchronization. It is possible to use worst-case execution times and still employ a fully-static strategy, but this requires tight worst-case execution time estimates that may not be available to us. An obvious strategy for solving this problem is to introduce explicit synchronization whenever processors communicate. This leads to the self-timed scheduling (ST) strategy in the scheduling taxonomy of Lee and Ha [LH89]. In this strategy we first obtain a fully-static schedule using techniques that will be discussed in Chapter 5, making use of the execution time estimates. After computing the fully-static schedule (Figure 4.4(b)), we simply discard the timing information that is not required, and only retain the processor assignment and the ordering of actors on each processor as specified by the fully-static schedule (Figure 4.4(c)).

Each processor is assigned a sequential list of actors, some of which are send and receive actors, which it executes in an infinite loop. When a processor executes a communication actor, it synchronizes with the processor(s) it communicates with. Exactly when a processor executes each actor depends on when, at run-time, all input data for that actor is available, unlike the fully-static case where no such run-time check is needed. Conceptually, the processor sending data writes data into a FIFO buffer, and blocks when that buffer is full; the receiver, on the other hand, blocks when the buffer it reads from is empty. Thus flow control is performed at run-time. The buffers may be implemented using shared memory, or using hardware FIFOs between processors. In a self-timed strategy, processors run sequential programs and communicate when they execute the communication primitives embedded in their programs, as shown schematically in Figure 4.4(c).
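The blocking send and receive behavior described above can be imitated in software with a bounded queue. The sketch below is only an analogy, under the assumption that each processor is modeled by a Python thread and the FIFO buffer by `queue.Queue`; the function name, the actor labels, and the buffer size are hypothetical and do not describe any particular hardware.

```python
import queue
import threading

def run_self_timed(num_iterations, buffer_size=2):
    """Model two processors that communicate through one bounded FIFO."""
    fifo = queue.Queue(maxsize=buffer_size)
    received = []

    def processor_1():                  # fires actor A, then a send actor
        for k in range(num_iterations):
            token = ("A", k)            # stand-in for the data produced by A
            fifo.put(token)             # sender blocks while the buffer is full

    def processor_2():                  # fires a receive actor, then actor B
        for _ in range(num_iterations):
            token = fifo.get()          # receiver blocks while the buffer is empty
            received.append(token)      # stand-in for firing B on the token

    threads = [threading.Thread(target=processor_1),
               threading.Thread(target=processor_2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received

print(run_self_timed(5))
```

Neither thread knows the other's timing: the order of actors on each processor is fixed in advance, while the instants at which they fire are settled at run-time by the buffer state, which is the essence of the self-timed style.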
The multiple DSP machines that we discussed in Section 2.5 all employ some form of self-timed scheduling. Clearly, general purpose parallel machines can also be programmed using the self-timed scheduling style, since these machines provide mechanisms for run-time synchronization and flow control.
A self-timed scheduling strategy is robust with respect to changes in execution times of actors, because sender-receiver synchronization is performed at run-time. Such a strategy, however, implies higher IPC costs compared to the fully-static strategy because of the need for synchronization (e.g., using semaphore management). In addition the self-timed scheduling strategy faces arbitration costs: the fully-static schedule guarantees mutually exclusive access of shared communication resources, whereas shared resources need to be arbitrated at run-time in the self-timed schedule. Consequently, whereas IPC in the fully-static schedule simply involves reading and writing from shared memory (no synchronization or arbitration needed), implying a cost of a few processor cycles for IPC, the self-timed scheduling strategy requires of the order of tens of processor cycles, unless special hardware is employed for run-time flow control. Run-time flow control allows variations in execution times of tasks; in
Figure 4.4. Steps in a self-timed scheduling strategy. (a) HSDFG. (b) Fully-static schedule. (c) Self-timed implementation (schematic).
addition, it also simplifies the compiler software, since the compiler no longer ... [Lam88], that could potentially use fully-static scheduling, still choose to implement such run-time flow control (at the expense of additional hardware) for the resulting software simplicity. ... presents an interesting discussion of the trade-off involved between hardware complexity and ... when we consider dynamic flow control implemented in hardware versus flow control enforced by a compiler
ection 2. l, where an les instructions in the The dataflow
munication. Embedded signal processing systems will usually not require this type of scheduling owing to the run-time overhead and complexity involved, and the availability of compile time information that makes static scheduling techniques practical.
Actors that exhibit data-dependent execution time usually do so because they include one or more data-dependent control structures, for example conditionals and data-dependent iterations. In such a case, if we have some knowledge about the statistics of the control variables (the number of iterations a loop will go through, or the Boolean value of the control input to an if-then-else type construct), it is possible to obtain a static schedule that optimizes the average execution time of the overall computation. The key idea here is to define an execution profile for each actor in the dataflow graph. An execution profile for a dynamic construct consists of the number of processors assigned to it, and a local schedule of that construct on the assigned processors; the profile essentially defines the shape that a dynamic actor takes in the processor-time plane. In case the actor execution is data-dependent, an exact profile cannot be pre-determined at compile time. In such a case, the profile is chosen by making use of statistical information about the actor, e.g., average execution time, probability distribution of control variables, etc. Such an approach is called quasi-static scheduling [Lee88b]. Figure 4.5 shows a quasi-static strategy applied to a conditional construct (adapted from [Lee88b]).
Ha [HL97] has applied the quasi-static approach to dataflow constructs representing data-dependent iteration, recursion, and conditionals, where optimal profiles are computed assuming knowledge of the probability density functions of the data-dependent variables that influence the profile. The data-dependent constructs must be identified in a given dataflow graph, either manually or automatically, before Ha’s techniques can be applied. These techniques make the simplifying assumption that the control tokens for different dynamic actors are independent of one another, and that each control stream consists of tokens that take TRUE or FALSE values randomly and are independent and identically distributed (i.i.d.) according to statistics known at compile time.
Ha’s quasi-static approach constructs a blocked schedule for one iteration of the dataflow graph. The dynamic constructs are scheduled in a hierarchical fashion; each dynamic construct is scheduled on a certain number of processors, and is then converted into a single node in the graph and is assigned a certain execution profile. When scheduling the remainder of the graph, the dynamic construct is treated as an atomic block, and its execution profile is used to determine how to schedule the remaining actors around it; the profile helps in tiling actors in the processor-time plane with the objective of minimizing the overall schedule length. Such a hierarchical scheme effectively handles nested control constructs, e.g., nested conditionals. The locally optimal decisions made for the dynamic constructs are shown to be effective when the variability in a dynamic construct is small. We will return to quasi-static schedules again in Chapter 8.
To model execution times of actors (and to perform static scheduling), we associate an execution time (a non-negative integer) with each actor in the HSDFG; $t(v)$ assigns an execution time to each actor $v$ (the actual execution time can be interpreted as $t(v)$ cycles of a base clock). Interprocessor communication costs are represented by assigning execution times to the send and receive actors. The values $t(v)$ may be set equal to execution time estimates when exact execution times are not available, in which case results of the computations that make use of these values (e.g., the iteration period $T$) are compile-time estimates.

Recall that actors in an HSDFG are executed essentially infinitely. Each firing of an actor is called an invocation of that actor. An iteration of the HSDFG corresponds to one invocation of every actor in the HSDFG. A schedule specifies processor assignment, actor ordering and firing times of actors, and these may be done at compile-time or at run-time, depending on the scheduling strategy being employed. To specify firing times, we let the function $\mathrm{start}(v, k) \in \mathbb{Z}^{+}$ represent the time at which the $k$th invocation of the actor $v$ starts. Correspondingly, the function $\mathrm{end}(v, k) \in \mathbb{Z}^{+}$ represents the time at which the $k$th execution of the actor $v$ completes, at which point $v$ produces data tokens at its output edges. Since we are interested in the $k$th execution of each actor for $k = 0, 1, 2, \ldots$, we set $\mathrm{start}(v, k) = 0$ and $\mathrm{end}(v, k) = 0$ for $k < 0$ as the “initial conditions”. If the $k$th invocation of an actor $v$ takes $t(v)$ time units to complete for all $k$, then we can claim:

$\mathrm{end}(v, k) = \mathrm{start}(v, k) + t(v)$.
Recall that a fully-static schedule specifies a processor assignment, actor ordering on each processor, and also the precise firing times of actors. We use the following notation for a fully-static schedule:

A fully-static schedule $S$ (for $P$ processors) specifies a triple:

$S = \{\sigma_p(v), \sigma_t(v), T_{FS}\}$,

where $\sigma_p(v) \in \{1, 2, \ldots, P\}$ is the processor assignment, and $T_{FS}$ is the iteration period. A fully-static schedule specifies the firing times $\mathrm{start}(v, k)$ of all actors, and since we want a finite representation for an infinite schedule, a fully-static schedule is constrained to be periodic:

$\mathrm{start}(v, k) = \sigma_t(v) + k T_{FS}$;

$\sigma_t(v)$ is thus the starting time of the first execution of actor $v$ (i.e., $\mathrm{start}(v, 0) = \sigma_t(v)$). Clearly, the throughput for such a schedule is $T_{FS}^{-1}$.
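The $\sigma_p(v)$ function and the $\sigma_t(v)$ values are chosen so that all data precedence constraints and resource constraints are met. We define precedence constraints as follows: an edge $(v_j, v_i) \in E$ in an HSDFG $(V, E)$ represents the (data) precedence constraint

$\mathrm{start}(v_i, k) \ge \mathrm{end}(v_j, k - \mathrm{delay}((v_j, v_i)))$, for all $k \ge \mathrm{delay}((v_j, v_i))$.   (4-1)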
The above definition arises because each actor consumes one token from each of its input edges when it fires. Since there are already $\mathrm{delay}(e)$ tokens on each incoming edge $e$ of actor $v$, and invocation $k$ of $v$ is its $(k + 1)$th firing (invocations are indexed from $k = 0$), another $(k + 1 - \mathrm{delay}(e))$ tokens must be produced on $e$ before invocation $k$ of $v$ can begin. Thus the actor $\mathrm{src}(e)$ must have completed its invocation with index $(k - \mathrm{delay}(e))$ before $v$ can begin invocation $k$. Indexing invocations from $k = 0$ rather than $k = 1$ is done for notational convenience. Any schedule that satisfies all the precedence constraints specified by edges in an HSDFG $G$ is called an admissible schedule for $G$ [Rei68]. A valid execution of an HSDFG corresponds to a set of firing times $\{\mathrm{start}(v_i, k)\}$ that correspond to an admissible schedule. That is, a valid execution respects all data precedences specified by the HSDFG.

For the purposes of the techniques presented in this book, we are only interested in the precedence relationships between actors in the HSDF graph. In a general HSDFG one or more pairs of vertices can have multiple edges connecting them in the same “direction”; in other words a general HSDFG is a directed multigraph (Section 3.1). Such a multigraph often arises when a multirate SDF graph is converted into an HSDFG. Multiple edges between the same pair of vertices in the same direction are redundant as far as precedence relationships are concerned. Suppose there are multiple edges from vertex $v_i$ to $v_j$, and amongst these edges, the minimum edge delay is equal to $d_{min}$. Then, if we replace all of these edges by a single edge with delay equal to $d_{min}$, it is easy to verify that this single edge maintains the precedence constraints for all of the edges that were directed from $v_i$ to $v_j$. Thus a general HSDFG may be preprocessed into a form where the source and sink vertices uniquely identify an edge in the graph, and we may represent an edge $e \in E$ by the ordered pair $(\mathrm{src}(e), \mathrm{snk}(e))$. That is, an HSDFG that is a directed multigraph may be transformed into a directed graph such that the precedence constraints of the original HSDFG are maintained by the transformation. Such a transformation is illustrated in Figure 4.6. The multiple edges are taken into account individually when buffers are assigned to the arcs in the graph.
We perform such a transformation to avoid needless clutter in analyzing HSDFGs, and to reduce the running time of algorithms that operate on HSDFGs.
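The multigraph-to-graph preprocessing step can be sketched in a few lines of Python. This is a minimal illustration, assuming that edges are given as hypothetical `(src, snk, delay)` triples; the function name and the data layout are not taken from the text.

```python
def collapse_parallel_edges(edges):
    """Collapse parallel edges of an HSDFG multigraph.

    edges: iterable of (src, snk, delay) triples (assumed representation).
    Returns a dict {(src, snk): d_min}, keeping only the minimum delay among
    all edges that share the same source and sink, since that single edge
    implies the precedence constraints of all the others.
    """
    collapsed = {}
    for src, snk, delay in edges:
        key = (src, snk)
        if key not in collapsed or delay < collapsed[key]:
            collapsed[key] = delay
    return collapsed


# Three parallel edges A -> B with delays 2, 0 and 3, plus one edge B -> A.
multigraph = [("A", "B", 2), ("A", "B", 0), ("A", "B", 3), ("B", "A", 1)]
simple = collapse_parallel_edges(multigraph)
print(simple)
```

Edges in opposite directions are kept separately, because only edges in the same direction are redundant with respect to precedence.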
In a self-timed scheduling strategy, we determine a fully-static schedule using the execution time estimates, but we retain only the processor assignment and the ordering of actors on each processor as specified by that schedule; we discard the precise timing information specified in the fully-static schedule. Although we may start out by setting $start(v, 0) = \sigma_t(v)$, the subsequent $start(v, k)$ values are determined at run-time based on the availability of data at the input of each actor. The average iteration period of a self-timed schedule [...]. We analyze the evolution of a self-timed schedule [...]
As we discussed in Section 4.3, in some cases it is advantageous to unfold a graph by a certain unfolding factor, say $N$, and schedule $N$ iterations of the graph together in order to exploit inter-iteration parallelism more effectively. In this section, we describe the unfolding transformation.
An HSDFG $G = (V, E)$ unfolded $N$ times represents $N$ iterations of the original graph $G$; the unfolding transformation therefore results in another HSDFG $G^N = (V^N, E^N)$ that contains $N$ copies of each of the vertices of $G$. Let $v_i^0, v_i^1, \ldots, v_i^{N-1}$ denote the $N$ copies in $V^N$ of a vertex $v_i \in V$. From the definition of $G^N$, it is obvious that:

$start(v_i^l, m) = start(v_i, mN + l)$ and $end(v_i^l, m) = end(v_i, mN + l)$, for all $m \ge 0$ and $0 \le l < N$. (4-2)

Also, $G^N$ maintains exactly the same precedence constraints as $G$, and therefore the edges $E^N$ must reflect the same inter- and intra-iteration precedence constraints as the edges $E$.

Figure 4.6. Transforming an HSDFG that is a directed multigraph into one that is a directed graph while maintaining precedence constraints.

For the precedence constraint in $G$ represented by the edge $(v_j, v_i) \in E$, there will be a set of one or more edges in $E^N$ that represents the same precedence constraint in $G^N$. The construction of $E^N$ is as follows. From (4-1), an edge $(v_j, v_i) \in E$ represents the precedence constraint:

$start(v_i, k) \ge end(v_j, k - d)$, for all $k \ge d$,

where we write $d$ for $delay((v_j, v_i))$. Now, we can let $k = mN + l$ with $0 \le l < N$, and write

$k - d = \left(m + \lfloor (l - d)/N \rfloor\right) N + ((l - d) \bmod N)$, (4-4)

where $(x \bmod y)$ equals the value of $x$ taken modulo $y$, and $\lfloor x/y \rfloor$ equals the quotient (rounded down) obtained when $x$ is divided by $y$. Then, (4-1) can be written as

$start(v_i, mN + l) \ge end\left(v_j, \left(m + \lfloor (l - d)/N \rfloor\right) N + ((l - d) \bmod N)\right)$. (4-5)

We now consider two cases:

1. If $d \le l < N$, then (4-5) may be combined with (4-2) to yield:

$start(v_i^l, m) \ge end(v_j^{\,l - d}, m)$. (4-6)

2. If $0 \le l < d$, then (4-5) may be combined with (4-2) to yield:

$start(v_i^l, m) \ge end\left(v_j^{\,(l - d) \bmod N}, m + \lfloor (l - d)/N \rfloor\right)$. (4-7)

Equations (4-6) and (4-7) are summarized as follows. For each edge $(v_j, v_i) \in E$, there is a set of $N$ edges in $E^N$, which is the edge set of $G^N$. In particular, $E^N$ contains edges from $v_j^{\,l - d}$ to $v_i^l$, each with delay zero, for values of $l$ such that $d \le l < N$. In addition, $E^N$ contains edges from $v_j^{\,(l - d) \bmod N}$ to $v_i^l$, each with delay $-\lfloor (l - d)/N \rfloor$, for values of $l$ such that $0 \le l < d$. Note that if $delay((v_j, v_i)) = 0$, then there is a zero-delay edge from each $v_j^l$ to $v_i^l$ for $0 \le l < N$.

Figure 4.7 shows an example of the unfolding transformation. Figure 4.8 lists an algorithm that may be used for unfolding. Note that this algorithm has a complexity of $O(N|E|)$. When constructing a schedule for an unfolded graph, the processor assignment $\sigma_p$ and the actor starting times $\sigma_t$ are defined for all vertices of the unfolded graph (i.e., $\sigma_p$ and $\sigma_t$ are defined for $N$ invocations of each actor); $T_{FS}$ is the iteration period for the unfolded graph, and the average iteration period for the original graph is then $T_{FS}/N$. In what follows, we assume we are dealing with the unfolded graph, and we refer only to the iteration period and throughput of the unfolded graph, if unfolding is in fact employed, with the understanding that these quantities can be scaled by the unfolding factor to obtain the corresponding quantities for the original graph.
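The construction of the unfolded edge set can be sketched as follows. This is a minimal Python illustration of the two cases summarized above, not the algorithm of Figure 4.8; the dictionary-based graph representation and the function name `unfold` are assumptions made for the example.

```python
def unfold(edges, N):
    """Build the edge set of the N-fold unfolding of an HSDFG.

    edges: dict {(vj, vi): delay} for the original graph (assumed layout).
    Returns a dict {((vj, r), (vi, l)): delay}, where (v, l) stands for the
    l-th copy of vertex v in the unfolded graph.
    """
    unfolded = {}
    for (vj, vi), d in edges.items():
        for l in range(N):
            # Python's divmod uses the floor quotient and a non-negative
            # remainder, which matches the floor and mod in (4-4).
            q, r = divmod(l - d, N)
            # For d <= l: q == 0, so the new edge has delay zero.
            # For l < d:  q < 0, so the new edge has delay -q > 0.
            unfolded[((vj, r), (vi, l))] = -q
    return unfolded


# Hypothetical two-actor cycle: A -> B with no delay, B -> A with one delay.
g3 = unfold({("A", "B"): 0, ("B", "A"): 1}, 3)
for (src, snk), delay in sorted(g3.items()):
    print(src, "->", snk, "delay", delay)
```

Each original edge yields exactly $N$ edges, so the loop runs $N|E|$ times, and the delays of the $N$ copies of an edge sum to the delay of the original edge.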
Figure 4.7. Example of an unfolding transformation: HSDFG $G$ is unfolded by a factor of 3 to obtain the unfolded HSDFG $G^3$.
[Figure 4.8: algorithm for unfolding an HSDFG]

We assume that we have reasonably good estimates of actor execution times available to us at compile time to enable us to exploit static scheduling techniques; however, these estimates need not be exact, and execution times of actors may even be data-dependent. Thus we allow actors that have different execution times from one iteration of the HSDFG to the next, as long as these variations are small or rare. This is typically the case when estimates are available for the execution times, and actual execution times are close to the corresponding estimates with high probability, but deviations from the estimates of (effectively) arbitrary magnitude occasionally occur due to phenomena such as cache misses, interrupts, user inputs, or error handling. Consequently, tight worst-case execution time bounds cannot generally be determined for such operations; however, reasonably good execution time estimates can in fact be obtained for these operations, so that static assignment and ordering techniques are viable. For such applications self-timed scheduling is ideal, because the performance penalty due to lack of dynamic load balancing is overcome by the much smaller run-time scheduling overhead involved when static assignment and ordering is employed.

The estimates for execution times of actors can be obtained by several different mechanisms. The most straightforward method is for the programmer to provide these estimates while developing the library of primitive blocks (actors). In this method, the programmer specifies the execution time estimates for each actor as a mathematical function of the parameters associated with that actor (e.g., number of filter taps for an FIR filter, or the block size of a block operation such as an FFT). This strategy is used in the Ptolemy system [Pto98], for example, and is especially effective for libraries in which the primitives are written in the assembly language of the target processor. The programmer can provide a good estimate for blocks written in such a low-level library by counting the number of processor cycles each instruction consumes, or by profiling the block on an instruction-set simulator.

It is more difficult to estimate execution times for blocks that contain control constructs such as data-dependent iterations and conditionals within their body, and when the target processor employs pipelining and caching. Also, it is difficult, if not impossible, for the programmer to provide reasonably accurate estimates of execution times for blocks written in a high-level language (as in the C code generation library in Ptolemy).
The solution adopted in the GRAPE system [LEP90] is to automatically estimate these execution times by compiling the block (if necessary) and running it by itself in a loop on an instruction-set simulator for the target processor. To take into account data-dependent execution behavior, different input data sets can be provided for the block during simulation. Either the worst-case or the average-case execution time is used as the final estimate.

The estimation procedure employed by GRAPE is obviously time-consuming; in fact, estimation turns out to be the most time-consuming step in the GRAPE design flow. Analytical techniques can be used instead to reduce this estimation time; for example, Li and Malik [LM95] have proposed algorithms for estimating the execution time of embedded software. Their estimation technique, which forms a part of a tool called cinderella, consists of two components: 1) determining the sequence of instructions in the program that results in maximum execution time (program path analysis) and 2) modeling the target processor to determine how much time the worst-case sequence determined in step 1 takes to execute (micro-architecture modeling). The target processor model also takes the effect of instruction pipelines and cache activity into account. The input to the tool is a generic C program with annotations that specify the loop bounds (i.e., the maximum number of iterations for which a loop runs). Although the problem is formulated as an integer linear program (ILP), the claim is that practical inputs to the tool can be efficiently analyzed using a standard ILP solver. The advantage of this approach, therefore, is the efficient manner in which estimates are obtained as compared to simulation.

It should be noted that the program path analysis component of the Li and Malik technique is, in general, an undecidable problem; therefore, for these techniques to function, the programmer must ensure that his or her program does not contain pointer references, dynamic data structures, recursion, etc., and must provide bounds on all loops. Li and Malik's technique also depends on the accuracy of the processor model, although one can expect good models to eventually evolve for DSP chips and microcontrollers that are popular in the market.

The problem of estimating execution times of blocks is central for us to be able to effectively employ compile-time design techniques. This problem is an important area of research in itself; the strategies employed in Ptolemy and GRAPE, and those proposed by Li and Malik, are useful techniques, and we expect better estimation techniques to be developed in the future.
In this chapter, we discussed various scheduling models for dataflow graphs on multiprocessor architectures that differ in whether scheduling decisions are made at compile time or at run-time. The scheduling decisions are actor assignment, actor ordering, and determination of the exact firing times of each actor. A fully-static strategy lies at one extreme, in which all of the scheduling decisions are made at compile time, whereas a dynamic strategy makes all scheduling decisions at run-time. The trade-off involved is the low complexity of static techniques against the greater generality and tolerance to data-dependent behavior in dynamic strategies. For dataflow-oriented signal processing applications, the availability of compile-time information makes static techniques very attractive. A self-timed strategy is commonly employed for such applications, where actor assignment and ordering are fixed at compile time (or system design time) but the exact firing time of each actor is determined at run-time, in a data-driven fashion. Such a strategy is easily implemented in practical systems through sender-receiver synchronization during interprocessor communication.
In this chapter, we focus on techniques that are used in self-timed scheduling algorithms to handle IPC costs. Since a tremendous variety of scheduling algorithms has been developed to date, it is not possible here to provide comprehensive coverage of the field. Instead, we highlight some of the most fundamental developments to date in IPC-conscious multiprocessor scheduling strategies for HSDFGs.
To date, most of the research on scheduling HSDFGs has focused on the problem of minimizing the schedule makespan $\mu$, which is the time required to execute all actors in the HSDFG once. When a schedule is executed repeatedly, for example by being encapsulated within an infinite loop, as would typically be the case for a DSP application, the resulting throughput is equal to the reciprocal $(1/\mu)$ of the schedule makespan if all processors synchronize (perform a "barrier synchronization" as described later in Section 9.1) at the end of each schedule iteration. The throughput can often be improved beyond $(1/\mu)$ by abandoning global barrier synchronization, and implementing a self-timed execution of the schedule, as described in Section 4.4. To further exploit parallelism between graph iterations, one may employ the technique of unfolding, which is discussed in Chapter 4.

To model the transit time of interprocessor communication data in a multiprocessor system (for example, the time to write and read data values to and from shared memory), HSDFG edges are typically weighted by an estimate of the delay to transmit and receive the associated data if the source and sink actors of the edge are assigned to different processors. Such estimates are similar to the execution time estimates that we use to model the run-time of individual dataflow actors, as discussed in Section 4.9.

In this chapter, we are concerned primarily with efficient scheduling of HSDFGs in which an IPC cost is associated with [...] problem of constructing minimum-makespan [...] given target multiprocessor architecture [...]. As this definition suggests, solutions to the scheduling problem are heavily dependent on the underlying target architecture. In an attempt to decompose this problem and separate target-specific aspects from aspects of the problem that are fundamental to the structure of the input HSDFG, some researchers have applied a two-phased approach, pioneered by Sarkar. The first phase involves scheduling the input HSDFG [...] which consists of an infinite number of [...] interconnection network [...] can perform interproces- [...]
Clearly, the complexity of this second phase of the scheduling process is [...] uniprocessor target or [...] chain-structured processor interconnection topology [...] focused on the derivation of effective heuristic [...]. Since the inter-iteration dependencies represented [...] relevant in the context of minimum-makespan scheduling, [...] graph is often referred to [...] that for each $e \in E$ we [...] application is specified [...] (Section 3.8) obtained by [...]
A classic algorithm for computing an assignment of actors to processors based on network flow principles was developed by Stone [Sto77]. This algorithm is designed for heterogeneous multiprocessor systems, and its goal is to map actors to processors so that the sum of computation time and time spent on IPC is minimized. More specifically, suppose that we are given a target multiprocessor architecture consisting of $n$ (possibly heterogeneous) processors $P_1, P_2, \ldots, P_n$; a set of actors $A_1, A_2, \ldots, A_m$; a set of actor execution times $\{t_i(A_j)\}$, where for each $i \in \{1, 2, \ldots, n\}$ and each $j \in \{1, 2, \ldots, m\}$, $t_i(A_j)$ gives the execution time of actor $A_j$ on processor $P_i$; and a set of inter-actor communication costs $\{C_{ij}\}$, where $C_{ij} = 0$ if actors $A_i$ and $A_j$ do not exchange data, and otherwise, $C_{ij}$ gives the cost of exchanging data between $A_i$ and $A_j$ if $A_i$ and $A_j$ are assigned to different processors. The goal of Stone's assignment algorithm is to compute an assignment

$F : \{A_1, A_2, \ldots, A_m\} \rightarrow \{P_1, P_2, \ldots, P_n\}$ (5-1)

such that the net computation and communication cost

$cost(F) = \sum_{j=1}^{m} t_{F(A_j)}(A_j) + \sum_{i < j,\ F(A_i) \ne F(A_j)} C_{ij}$ (5-2)

is minimized, where $t_{F(A_j)}(A_j)$ denotes the execution time of $A_j$ on the processor to which $F$ assigns it.

Note that minimizing (5-2) is not equivalent to minimizing the makespan. For example, if the set of target processors is homogeneous, then an optimal solution with respect to (5-2) results from simply assigning all actors to a single processor.

The core of the algorithm is an elegant approach for transforming a given instance $I$ of the assignment problem into an instance $Z(I) = (V(I), E(I))$ of the minimum-weight cutset problem. For example, for a two-processor system, two vertices $p_1$ and $p_2$ are created in $Z(I)$ corresponding to the two heterogeneous target processors, and a vertex $a_i$ is created for each actor $A_i$. For each $A_i$, an undirected edge in $Z(I)$ is instantiated between $a_i$ and $p_1$, and the weight of this edge is set to the execution time $t_2(A_i)$ of $A_i$ on $P_2$. This edge models the execution time cost of actor $A_i$ that results if $a_i$ and $p_1$ lie on opposite sides of the cutset that is computed for $Z(I)$. Similarly, an edge $(a_i, p_2)$ is instantiated with weight $t_1(A_i)$. Finally, for each pair of vertices $a_i$ and $a_j$ such that $C_{ij} > 0$, an edge $(a_i, a_j)$ in $Z(I)$ is instantiated with weight $C_{ij}$.

From a minimum-weight cutset in $Z(I)$ that separates $p_1$ and $p_2$, an optimal solution to the heterogeneous processor assignment problem can easily be derived. Specifically, if $R \subseteq E(I)$ is a minimum-weight cutset that separates $p_1$ and $p_2$, an optimal assignment can be derived from $R$ by:

$F(A_i) = P_1$ if $(a_i, p_1) \notin R$, and $F(A_i) = P_2$ otherwise. (5-3)

The net computation and communication cost of this assignment is simply

$cost(F) = \sum_{e \in R} w(e)$, (5-4)

where $w(e)$ denotes the weight of edge $e$ in $Z(I)$.
An illustration of Stone's Algorithm is shown in Figures 5.1 and 5.2. Figure 5.2(a) shows the actor interaction structure of the input application; Figure 5.2(b) specifies the actor execution times; and Figure 5.2(c) gives all non-zero communication costs $C_{ij} \ne 0$. The associated instance of the minimum-weight cutset problem that is derived by Stone's Algorithm is depicted in Figure 5.1(a), and a minimum-weight cutset is shown in Figure 5.1(b). The optimal assignment that results from this cutset is given by (5-6).

For the two-processor case ($n = 2$), a variety of efficient algorithms can be used to derive a minimum-weight cutset for Stone's construction. When the target architecture contains more than two processors ($n > 2$), the weight of each edge is set to a weighted sum of the values $t_j(A_i)$, and an optimal assignment is derived by computing a minimum $n$-way cutset in $Z(I)$. When $n \ge 4$, Stone's approach becomes computationally intractable.

Figure 5.1. (a) The instance of the minimum-weight cutset problem that is derived from the example of Figure 5.2. (b) An illustration of a solution to this instance of the minimum-weight cutset problem.

Figure 5.2. An example that is used to illustrate Stone's Algorithm for computing heterogeneous processor assignments.
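The two-processor construction can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the minimum-weight cutset is found with a plain augmenting-path maximum-flow routine (one of many possible choices, not one prescribed by the text), the function names are hypothetical, the vertex names `"p1"` and `"p2"` are reserved for the processor vertices, and the four-actor data set is invented for the example rather than taken from Figure 5.2.

```python
from collections import deque


def stone_two_processor(t1, t2, comm):
    """Two-processor assignment via a minimum-weight cut (sketch).

    t1[a], t2[a]: execution time of actor a on P1 / P2.
    comm[(a, b)]: cost incurred if a and b run on different processors.
    Returns (assignment dict, weight of the minimum cut).
    """
    cap = {}

    def add(u, v, w):
        # An undirected edge is modeled as capacity w in both directions.
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + w
        cap[v][u] = cap[v].get(u, 0) + w

    for a in t1:
        add(a, "p1", t2[a])  # cut when a ends up on the p2 side
        add(a, "p2", t1[a])  # cut when a ends up on the p1 side
    for (a, b), c in comm.items():
        add(a, b, c)

    def reachable():
        # Breadth-first search from p1 over edges with residual capacity.
        parent = {"p1": None}
        queue = deque(["p1"])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    total = 0
    while True:
        parent = reachable()
        if "p2" not in parent:
            break
        path, v = [], "p2"
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
        total += bottleneck

    p1_side = reachable()  # vertices still connected to p1 after the cut
    return {a: ("P1" if a in p1_side else "P2") for a in t1}, total


def cost(F, t1, t2, comm):
    """Net computation plus communication cost of assignment F."""
    run = sum(t1[a] if F[a] == "P1" else t2[a] for a in F)
    ipc = sum(c for (a, b), c in comm.items() if F[a] != F[b])
    return run + ipc


# Hypothetical instance (not the data of Figure 5.2).
t1 = {"A1": 5, "A2": 2, "A3": 4, "A4": 6}
t2 = {"A1": 10, "A2": 9, "A3": 4, "A4": 3}
comm = {("A1", "A2"): 6, ("A2", "A3"): 4, ("A3", "A4"): 5, ("A1", "A3"): 2}
F, cut_weight = stone_two_processor(t1, t2, comm)
print(F, cut_weight, cost(F, t1, t2, comm))
```

The weight of the cut equals the cost of the derived assignment, which is the property the construction is designed to provide; for small instances this can be checked against exhaustive enumeration of all assignments.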
Although Stone's algorithm has high intuitive appeal and has had considerable influence on the SDF scheduling community, the most effective algorithms known today for self-timed scheduling of SDF graphs have jointly considered both the assignment and ordering sub-problems. The approaches used in these joint algorithms fall into two broad categories: approaches that are driven by iterative, list-based mapping of individual tasks, and those that are based on constructing clusters of tasks that are to be executed on the same processor. These categories of scheduling techniques are discussed in the following three sections.
In classical list scheduling, a priority list $L$ of actors is constructed; a global time clock $c_G$ is maintained; and each task $T$ is eventually mapped into a time interval on some processor (the time intervals for two distinct actors assigned to the same processor cannot overlap). The priority list $L$ is a linear ordering $(v_1, v_2, \ldots, v_{|V|})$ of the actors in the input task graph $G = (V, E)$ such that for any pair of distinct actors $v_i$ and $v_j$, $v_i$ is to be given higher scheduling priority than $v_j$ if and only if $i < j$. Each actor is mapped to an available processor as soon as it becomes the highest-priority actor according to $L$ among all actors that are ready. An actor is ready if it has not yet been mapped, but its predecessors have all been mapped and have all completed by time $t$, where $t$ is the current value of $c_G$. For self-timed implementation, actors on each processor are ordered according to the order of their associated time intervals.

An important generalization of list scheduling, which we call ready-list scheduling, has been formalized by Printz. Ready-list scheduling maintains the list scheduling convention that a schedule is constructed by repeatedly scheduling ready actors, but eliminates the notion of a static priority list and a global time clock. Thus, the only list that is fundamental to the scheduling process is the list of actors that are ready at a given scheduling step. [...] of effective ready-list algorithms for [...] scheduling problem [...]
To be effective when IPC costs are not negligible, a list-scheduling or ready-list algorithm must incorporate the latencies associated with IPC operations. This involves either explicitly scheduling IPC operations onto the communication resources of the target architecture as the scheduling process progresses, or incorporating estimates of the time that it takes for data that is produced by an actor on one processor to be available for consumption by an actor that has been assigned to another processor. In either case, an additional constraint is imposed on the earliest possible starting times of actors that depend on the arrival of data.
Many properties of list scheduling have been established for a special case of the scheduling problem, which we call ideal scheduling. In ideal scheduling, the target multiprocessor architecture consists of $n$ processors that are homogeneous (the execution time of an actor is independent of the processor it is assigned to), and IPC is performed in zero time. Although issues of heterogeneous processing times and IPC cost are avoided, the ideal scheduling problem is intractable [...].

When list scheduling is applied to an instance of the ideal scheduling problem and a given priority list for the problem instance, the resulting schedule is not necessarily unique, and generally depends on the details of the particular list scheduling algorithm that is used. Specifically, the schedule depends on the processor selection scheme that is used when more than one processor is available at a given scheduling step. For example, consider the simple task graph in Figure 5.3(a), and suppose that [...] the target multiprocessor architecture consists of two processors. If list scheduling is applied to this example with priority list [...], then any one of the four schedules illustrated in Figure 5.3(b) may result, depending on the processor selection scheme.
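Let $ISP(G, t, n)$ denote the instance of the ideal scheduling problem that consists of a task graph $G = (V, E)$; actor execution times (on each processor in the target architecture) $\{t(v) \mid v \in V\}$; and a target, zero-IPC architecture that consists of $n$ identical processors. Then, given a list-scheduling algorithm $A$ for $ISP(G, t, n)$ and a priority list $L = (v_1, v_2, \ldots, v_{|V|})$, we define $S_A(G, t, n, L)$ to be the schedule produced by $A$ when it is applied to $ISP(G, t, n)$ with priority list $L$, and we define $\mu(S_A(G, t, n, L))$ to be the makespan of this schedule. The set

$\Sigma(G, t, n, L) \equiv \{S_A(G, t, n, L) \mid A \text{ is a list scheduling algorithm}\}$ (5-7)

is the set of schedules that can be produced when list scheduling is applied to $ISP(G, t, n)$ with priority list $L$. For the example of Figure 5.3, this set contains the four schedules illustrated in Figure 5.3(b).

Figure 5.3. An example that illustrates the dependence of list scheduling on processor selection (for a given priority list).

It is easily shown that schedules produced by list-scheduling algorithms on a given instance $ISP(G, t, n)$ with a given priority list $L$ all have the same makespan. That is,

$\left|\{\mu(S) \mid S \in \Sigma(G, t, n, L)\}\right| = 1$. (5-9)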
This property of uniform makespan does not generally hold, however, if we allow heterogeneous processors in the target architecture or if we incorporate non-zero IPC costs.

Clearly, effective construction of the priority list $L$ is critical to achieving high-quality results with list scheduling. Graham has shown that when arbitrary priority lists are allowed, it is possible for the list scheduling approach to produce unusual results. In particular, it is possible that increasing the number of processors, reducing the execution time of one or more actors, or relaxing the precedence constraints in an SDF graph (removing one or more edges) can all cause a list scheduling algorithm to produce results that are worse (longer total execution time) than those obtained when the algorithm is applied with the original number of processors, the original set of SDF edges, or the original execution times, respectively [Gra69].

Graham, however, has established a tight bound on the anomalous performance degradation that can be encountered with list scheduling [Gra69]. This result is summarized by the following theorem.

Suppose that $G = (V, E)$ is a task graph; $L$ is a priority list for $G$; $n$ and $n'$ are positive integers; $t : V \rightarrow \{0, 1, 2, \ldots\}$ and $t' : V \rightarrow \{0, 1, 2, \ldots\}$ are assignments of non-negative integers to members of $V$ (sets of actor execution times) such that for each $v \in V$, $t'(v) \le t(v)$; $E' \subseteq E$; $S \in \Sigma(G, t, n, L)$; and $S' \in \Sigma(G', t', n', L)$, where $G' = (V, E')$. Then

$\dfrac{\mu(S')}{\mu(S)} \le 1 + \dfrac{n - 1}{n'}$, (5-10)

and this is the tightest possible bound.

Graham has also established a tight bound on the variation in list scheduling performance that can be encountered when different priority lists are used for the same instance of the ideal scheduling problem.

Suppose that $G = (V, E)$ is a task graph; $L$ and $L'$ are priority lists for $G$; $n$ is a positive integer; $t : V \rightarrow \{0, 1, 2, \ldots\}$ is an assignment of execution times to members of $V$; $S \in \Sigma(G, t, n, L)$; and $S' \in \Sigma(G, t, n, L')$. Then

$\dfrac{\mu(S')}{\mu(S)} \le 2 - \dfrac{1}{n}$, (5-11)

and this is the tightest possible bound.
The level of an actor $A$ in an acyclic SDF graph $G$ is defined to be the length of the longest directed path in $G$ that originates at $A$. Here, the length of a path is taken to be the sum of the execution times of the actors on the path. Intuitively, actors with high level values need to be scheduled early since long sequences of computation depend on their completion. One of the earliest and most widely-used list-scheduling algorithms is the HLFET (highest level first with estimated times) algorithm. In this algorithm, $L$ is created by sorting the actors in decreasing order of their levels. HLFET is guaranteed to produce an optimal result if there are only two processors, both processors are identical, and all tasks have identical execution times. For the general ideal scheduling problem (any finite number of processors is allowed, execution times need not be identical, and IPC costs are uniformly zero), HLFET has been proven to frequently produce near-optimal schedules [ACD74].

Early strategies for incorporating IPC costs into list scheduling include the algorithm of Yu [Yu84]. Yu's algorithm, a modification of HLFET scheduling, repeatedly selects the ready actor that has the highest level, and schedules it on the processor that can finish its execution at the earliest time. The earliest finishing time of a ready actor $A$ on processor $P$ depends both on the time intervals that have already been scheduled on $P$ and on the IPC time required for the data required from the predecessors of $A$ to arrive at $P$.

In contrast to these early algorithms, the ETF (earliest task first) algorithm of Hwang, Chow, Anger, and Lee uses the level metric only as a tie-breaking criterion. At each scheduling step in ETF, the value $t_e(A, P)$, the earliest time at which actor $A$ can commence execution on processor $P$, is computed for every ready actor $A$ and every target processor $P$. If an actor-processor pair $(A^*, P^*)$ uniquely minimizes $t_e(A, P)$, then $A^*$ is scheduled to execute on $P^*$ starting at time $t_e(A^*, P^*)$; otherwise, the tie is resolved by selecting the actor-processor pair that has the highest level.
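As an illustration of the level metric and of level-driven list scheduling, the following Python sketch computes levels and then builds a schedule for the ideal scheduling problem (identical processors, zero IPC). It is a simplified sketch, not a reproduction of any published HLFET implementation: the function names, the successor-list graph layout, the lowest-index processor selection rule, and the example graph are all assumptions, and execution times are assumed to be strictly positive.

```python
def levels(succ, t):
    """Level of each actor: longest path (sum of execution times) from it."""
    memo = {}

    def level(v):
        if v not in memo:
            memo[v] = t[v] + max((level(w) for w in succ.get(v, [])), default=0)
        return memo[v]

    return {v: level(v) for v in t}


def hlfet(succ, t, n):
    """List scheduling on n identical processors with zero IPC cost.

    succ: dict actor -> list of successors (acyclic); t: positive exec times.
    Returns ({actor: (processor, start)}, makespan).
    """
    lvl = levels(succ, t)
    preds = {v: [] for v in t}
    for u, ws in succ.items():
        for w in ws:
            preds[w].append(u)
    order = sorted(t, key=lambda v: -lvl[v])  # priority list, highest level first
    finish, schedule = {}, {}
    free_at = [0] * n  # time at which each processor becomes idle
    clock = 0
    while len(schedule) < len(t):
        idle = [p for p in range(n) if free_at[p] <= clock]
        for v in order:
            if not idle:
                break
            if v in schedule:
                continue
            # v is ready when all predecessors have completed by the clock.
            if all(u in finish and finish[u] <= clock for u in preds[v]):
                p = idle.pop(0)  # assumed rule: lowest-numbered idle processor
                schedule[v] = (p, clock)
                finish[v] = clock + t[v]
                free_at[p] = finish[v]
        pending = [f for f in finish.values() if f > clock]
        if pending:
            clock = min(pending)  # advance to the next completion event
    return schedule, max(finish.values())


# Hypothetical task graph: A and B feed C; D is independent.
succ = {"A": ["C"], "B": ["C"]}
t = {"A": 3, "B": 2, "C": 2, "D": 4}
schedule, makespan = hlfet(succ, t, 2)
print(levels(succ, t), schedule, makespan)
```

A different processor selection rule in place of `idle.pop(0)` can change which processor each actor lands on, which is the non-uniqueness discussed earlier for a fixed priority list.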
El-Rewini and Lewis have proposed a list-scheduling algorithm, called the mapping heuristic (MH), which attempts to account for IPC contention within an arbitrary, multi-hop interconnection network. Since maintaining and applying a precise accounting of traffic within such a network can be computationally expensive, the MH algorithm has been devised to maintain an approximate view of the network state $\Sigma(t)$, from which reasonable estimates of communication delay can be derived for scheduling purposes. At any given scheduling time step $t$, $\Sigma(t)$ incorporates three $|P| \times |P|$ matrices $H$, $L$, and $D$, where $P$ is the set of processors in the target multiprocessor architecture. Given $p_1, p_2 \in P$, $H(p_1, p_2)$ gives the number of hops between $p_1$ and $p_2$ in the interconnection network; $L(p_1, p_2)$ gives the preferred outgoing communication channel of $p_1$ that should be used when communicating data to $p_2$; and $D(p_1, p_2)$ gives the communication delay between $p_1$ and $p_2$ that arises due to contention with other IPC operations in the system.

In the MH algorithm, actors are first prioritized by a static, modified level metric that incorporates the communication costs that are assigned to the task graph edges. For an actor $x$, the modified level of $x$ is the longest path length in the task graph that originates at $x$, where the length of a path is taken as the sum of the actor execution times and edge communication costs along the path. A list-scheduling loop is then carried out in which, at any given scheduling time step $t$, an actor that maximizes the modified level is selected from among the actors that are ready at $t$. Processor selection is then achieved by assigning the selected actor to the processor that allows the earliest estimated completion time. The estimated completion time is derived from the network state approximation $\Sigma(t)$.

El-Rewini and Lewis observe that there is a significant trade-off between the frequency with which the network state approximation $\Sigma(t)$ is updated (which affects the accuracy of the approximation), and the time complexity of the resulting scheduling algorithm. The MH algorithm addresses this trade-off by updating $\Sigma(t)$ only when a scheduled actor begins sending IPC data to a successor actor that is scheduled on another processor, or when the IPC data associated with a task graph edge $(x, y)$ arrives at the processor that $y$ is assigned to.

Loosely speaking, the priority mechanism in the MH algorithm is the converse of that employed in ETF. In MH, the "earliest actor-processor mapping" is used as a tie-breaking criterion, while the modified level is used as the primary priority function. Note also that when the target processor set is homogeneous, selecting an actor-processor pair that minimizes the starting time is equivalent to selecting a pair that minimizes completion time, while this equivalence does not necessarily hold for a heterogeneous architecture. Thus, the concept of "earliest actor-processor mapping" that is employed by ETF is different from that in MH only in the heterogeneous processor case.
In the DLS (dynamic level scheduling) algorithm of Sih and Lee, the use of levels in traditional HLFET scheduling is replaced by a measure of scheduling priority that is to be continually re-evaluated as the schedule is constructed [SL93a]. Sih and Lee demonstrated that such a concept is preferable because the "scheduling affinity" between actor-processor pairs depends not only on longest paths in the task graph, but also on the current scheduling state, which includes the actor/time-interval pairs that have already been scheduled on processing resources, and the IPC operations that have already been scheduled on the communication resources. As with ETF, the DLS algorithm also abandons the use of the global scheduling clock $c_G$, and allows all target processors to be considered as candidates in every scheduling step (instead of just those processors that are idle at the current value of $c_G$). With the elimination of $c_G$, Sih's metric for prioritizing actors can be formulated as the dynamic level

$DL(A, P, \Sigma) = SL(A) - \max(D(A, P, \Sigma), F(P, \Sigma))$, (5-12)

where $\Sigma$ represents the scheduling state at the current scheduling step; $SL(A)$ denotes the conventional (static) level of actor $A$; $D(A, P, \Sigma)$ denotes the earliest time at which all data required by actor $A$ can arrive at processor $P$; and $F(P, \Sigma)$ gives the completion time of the last actor that is presently assigned to $P$.

While the incorporation of scheduling state by the DLS algorithm represents an important advancement in IPC-conscious scheduling, the formulation of the dynamic level in (5-12) contains a subtle limitation, which was observed by Kwok and Ahmad [KA96]. This limitation arises because the relative contributions of the two components in (5-12) (the static level and the data arrival time) to the dynamic level metric vary as the scheduling process progresses. Early in the scheduling process, the static levels are usually high, since the actors considered generally have relatively many topological descendants, and similarly, data arrival times are low, since the scheduling of actors on each processor begins at the origin of the time axis and progresses towards increasing values of time. As more and more scheduling steps are carried out, the static level parameters of ready actors will decrease steadily (implying lower influence on the dynamic level), and the data arrival times will increase (implying higher influence). Thus, the relative weighting of the static level and the data arrival time is not constant, but rather can vary strongly between different scheduling steps. This variation is not taken into account in the DLS algorithm.
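A single selection step driven by the dynamic level can be sketched as follows. The sketch assumes that the static levels, data arrival times, and processor completion times for the current scheduling state have already been computed and are passed in as plain dictionaries; the function names and the numeric values are hypothetical.

```python
def dynamic_level(static_level, data_ready, proc_free):
    """Static level minus the earliest time the actor could start on the
    processor, i.e. the later of data arrival and processor availability."""
    return static_level - max(data_ready, proc_free)


def pick_pair(ready, procs, SL, D, F):
    """Return the ready actor / processor pair with the highest dynamic level.

    SL[a]: static level of actor a.
    D[(a, p)]: earliest time all data needed by a can arrive at p.
    F[p]: completion time of the last actor currently assigned to p.
    """
    return max(
        ((a, p) for a in ready for p in procs),
        key=lambda ap: dynamic_level(SL[ap[0]], D[ap], F[ap[1]]),
    )


# Hypothetical scheduling state with two ready actors and two processors.
SL = {"A": 10, "B": 7}
F = {"P1": 4, "P2": 0}
D = {("A", "P1"): 2, ("A", "P2"): 6, ("B", "P1"): 0, ("B", "P2"): 2}
best = pick_pair(["A", "B"], ["P1", "P2"], SL, D, F)
print(best, dynamic_level(SL["A"], D[("A", "P1")], F["P1"]))
```

Here every processor is a candidate for every ready actor, in line with the removal of the global clock; a busy processor is simply penalized through its completion time.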
Motivated partly by their observation on the limitations of dynamic level scheduling, Kwok and Ahmad have developed an alternative variation of list scheduling, called the DCP (dynamic critical path) algorithm, that also dynamically re-evaluates actor priorities [KA96]. The DCP algorithm is motivated by the observation that the set of critical paths in a task graph can change from one scheduling step to the next, where the critical path is defined to be a directed path along which the sum of computation and communication times is maximized.

For example, consider the task graph depicted in Fig. 5.4. Here, the number beside each actor gives the execution time of the actor, and each numeric edge weight gives the IPC cost associated with the edge. Initially, in this graph, the critical path passes through $B$ and $C$, and the length of this path is 16 time units. If the first two scheduling steps map $B$ and its predecessor on this path to the same processor (e.g., to minimize the starting time of $B$), then the weight of the edge between them in Fig. 5.4 effectively changes to zero. The critical path in the new "partially scheduled" graph thus becomes a different path, which has a length of 14 time units. Because critical paths can change in this manner as the scheduling process progresses, the critical path of the partially scheduled graph is called the dynamic critical path.

The DCP algorithm operates by repeatedly selecting and scheduling actors on the dynamic critical path, and updating the partially scheduled graph as scheduling decisions are made. An elaborate processor selection scheme is also incorporated to map the actor selected at each scheduling step. This scheme not only considers the arrival of required data on each candidate processor, but also takes into account the possible starting times of the task graph successors of the selected actor.
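The way a critical path shifts once two actors share a processor can be illustrated with a small Python sketch. The graph below is hypothetical (it is not the graph of Fig. 5.4, although its numbers were chosen to give path lengths of 16 and 14), and the function name and data layout are assumptions for the example.

```python
def critical_path(t, ipc, placed):
    """Longest path by computation plus communication time (sketch).

    t: actor -> execution time.
    ipc: (u, v) -> IPC cost of edge u -> v (acyclic graph).
    placed: partial map actor -> processor; an edge whose endpoints are
    both placed on the same processor contributes zero IPC cost.
    Returns (length, path as a tuple of actors).
    """
    succ = {}
    for (u, v) in ipc:
        succ.setdefault(u, []).append(v)

    def edge_cost(u, v):
        same = u in placed and v in placed and placed[u] == placed[v]
        return 0 if same else ipc[(u, v)]

    memo = {}

    def best(u):
        # Longest path that starts at u, as (length, path).
        if u not in memo:
            tails = [(edge_cost(u, v) + best(v)[0], best(v)[1])
                     for v in succ.get(u, [])]
            length, path = max(tails, default=(0, ()))
            memo[u] = (t[u] + length, (u,) + path)
        return memo[u]

    return max(best(u) for u in t)


# Hypothetical task graph with two paths from A to C.
t = {"A": 2, "B": 3, "C": 4, "D": 4}
ipc = {("A", "B"): 4, ("B", "C"): 3, ("A", "D"): 2, ("D", "C"): 2}

before = critical_path(t, ipc, placed={})
after = critical_path(t, ipc, placed={"A": "P1", "B": "P1"})
print(before)
print(after)
```

Before any mapping, the path through `B` dominates; once `A` and `B` are co-located, the edge between them costs nothing and the path through `D` becomes critical.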
Clustering algorithms for multiprocessor scheduling operate by incrementally constructing groupings, called clusters, of actors that are to be executed on the same processor. Clustering and list scheduling can be used in a complementary fashion. Typically, clustering is applied to focus the efforts of a list scheduler on effective processor assignments. When used efficiently, clustering can significantly enhance the results produced by a list scheduler (and a variety of other scheduling techniques).

Figure 5.4. An illustration of "dynamic" critical paths in multiprocessor scheduling.

A scheduling algorithm (such as a list scheduler) processes a clustered HSDFG by constraining the vertices of $V$ that are encompassed by each cluster to be assigned to the same processor. More than one cluster may be mapped by the scheduling algorithm to execute on the same processor; thus, a sequence of clustering operations does not necessarily specify a complete processor assignment, even when the target processors are all homogeneous.

The net result of a clustering algorithm is to identify a family of disjoint subsets $M_1, M_2, \ldots, M_k \subseteq V$ such that the underlying scheduling algorithm is forced to avoid IPC costs between any pair of actors that are members of the same $M_i$. In the remainder of this section, we examine a variety of algorithms for computing such a family of subsets.
In the Linear Clustering Algorithm of Kim and Browne, longest paths in the input task graph are iteratively identified and clustered until every edge in the graph is either encompassed by a cluster or is incident to a cluster at both its source and sink. The path length metric is based on a function of the computation and communication along a given path. If $p = (e_1, e_2, \ldots, e_n)$ is a task graph path, then the value of Kim and Browne's path length metric for p is given by

$$\omega_1 \sum_{x \in T_p} t(x) + (1 - \omega_1)\left[\omega_2 \sum_{i=1}^{n} c(e_i) + (1 - \omega_2) \sum_{x \in T_p} c_{adj}(x)\right],$$

where $T_p$ is the set of actors traversed by p; $c(e)$ is the IPC cost associated with an edge e; $c_{adj}(x)$ is the total IPC cost between an actor $x \in T_p$ and actors that are not contained in $T_p$; and the “normalization factors” $\omega_1$ and $\omega_2$ are parameters of the algorithm.

Kim and Browne do not give a systematic technique for determining the normalization factors that should be used with the Linear Clustering Algorithm. Indeed, the derivation of the most appropriate normalization factors based on characteristics of the input task graph and the target multiprocessor architecture appears to be an interesting direction for further study. When $\omega_1 = 0.5$ and $\omega_2 = 1$, linear clustering reduces to clustering of critical paths.
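To make the role of the normalization factors concrete, the following sketch evaluates a path length metric of the weighted form w1 · (computation) + (1 − w1) · [w2 · (IPC along the path) + (1 − w2) · (IPC between path actors and actors off the path)]. The exact form of the metric is an assumption on our part, as are the function name and the small task graph, which is hypothetical.

```python
def linear_clustering_metric(path, exec_time, ipc_cost, w1, w2):
    """Weighted path length for a path given as an ordered list of actors."""
    on_path = set(path)
    path_edges = list(zip(path, path[1:]))
    computation = sum(exec_time[x] for x in path)
    path_ipc = sum(ipc_cost[e] for e in path_edges)
    # IPC on edges that connect an actor on the path to an actor off the path.
    adjacent_ipc = sum(c for (u, v), c in ipc_cost.items()
                       if (u in on_path) != (v in on_path))
    return w1 * computation + (1 - w1) * (w2 * path_ipc + (1 - w2) * adjacent_ipc)

# Hypothetical task graph.
exec_time = {"A": 2, "B": 3, "C": 2, "D": 4}
ipc_cost = {("A", "B"): 5, ("B", "C"): 1, ("A", "D"): 2, ("D", "C"): 1}

critical_like = linear_clustering_metric(["A", "B", "C"], exec_time, ipc_cost, 0.5, 1.0)
with_adjacent = linear_clustering_metric(["A", "B", "C"], exec_time, ipc_cost, 0.5, 0.5)
print(critical_like, with_adjacent)
```

With w1 = 0.5 and w2 = 1 the metric is exactly half the ordinary path length (computation plus IPC), so ranking paths by it is equivalent to ranking them by critical path length.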
Sarkar’s internalization algorithm [Sar89] for graph clustering is based on determining a set of clustering operations that do not degrade the performance of a task graph on a machine with boundless processing resources. In internalization, the task graph edges are first sorted in decreasing order of their associated IPC costs. The edges in this list are then traversed according to this ordering. When each edge e is visited in this traversal, an estimate $T_e$ of the parallel execution time is computed with the source and sink vertices of e constrained to execute on the same processor. This estimate is derived for an unbounded number of processors and a fully-connected communication network. If $T_e$ does not exceed the parallel execution time estimate of the current clustered graph, then the current clustered graph is modified by merging the source and sink of e into the same cluster. An important strength of internalization is its simplicity, which makes it easily adaptable to accommodate additional scheduling objectives beyond minimizing execution time. For example, a hierarchical scheduling framework for multirate DSP systems has been developed using Sarkar’s clustering technique as a substrate [PBL95]. This hierarchical framework provides a systematic method for combining multiprocessor scheduling algorithms that minimize execution time with uniprocessor scheduling techniques that optimize a target program’s code and data memory requirements.
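The edge-by-edge structure of internalization can be sketched as follows. This is a simplified illustration rather than Sarkar's implementation: the execution time estimate used here (one processor per cluster, actors of a cluster executed one at a time in a fixed topological order) is our own assumption, as are the function names and the hypothetical task graph.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def estimate_parallel_time(exec_time, ipc_cost, cluster_of):
    """Makespan estimate with one processor per cluster and zero intra-cluster IPC."""
    preds = {a: [] for a in exec_time}
    for (u, v) in ipc_cost:
        preds[v].append(u)
    finish, cluster_free = {}, {}
    for a in TopologicalSorter(preds).static_order():
        ready = 0
        for u in preds[a]:
            comm = 0 if cluster_of[u] == cluster_of[a] else ipc_cost[(u, a)]
            ready = max(ready, finish[u] + comm)
        # Actors that share a cluster share a processor, so they cannot overlap.
        start = max(ready, cluster_free.get(cluster_of[a], 0))
        finish[a] = start + exec_time[a]
        cluster_free[cluster_of[a]] = finish[a]
    return max(finish.values())

def internalize(exec_time, ipc_cost):
    cluster_of = {a: a for a in exec_time}  # each actor starts in its own cluster
    best = estimate_parallel_time(exec_time, ipc_cost, cluster_of)
    # Visit edges in decreasing order of IPC cost.
    for (u, v) in sorted(ipc_cost, key=ipc_cost.get, reverse=True):
        if cluster_of[u] == cluster_of[v]:
            continue
        merged = {a: (cluster_of[u] if c == cluster_of[v] else c)
                  for a, c in cluster_of.items()}
        t = estimate_parallel_time(exec_time, ipc_cost, merged)
        if t <= best:  # accept the merge only if the estimate does not get worse
            cluster_of, best = merged, t
    return cluster_of, best

# Hypothetical task graph.
exec_time = {"A": 2, "B": 3, "C": 2, "D": 4}
ipc_cost = {("A", "B"): 5, ("B", "C"): 1, ("A", "D"): 1, ("D", "C"): 1}
clusters, makespan = internalize(exec_time, ipc_cost)
print(clusters, makespan)
```

For this graph the sketch merges A, B, and C into one cluster and leaves D on its own, since pulling D into the same cluster would serialize it with B and lengthen the estimate.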
The dominant sequence clustering (DSC) algorithm of Yang and Gerasoulis incorporates principles similar to those used in the DCP algorithm, but applies these principles under the methodology of clustering rather than list scheduling [YG94]. As with DCP, a “partially scheduled graph” (PSG) is repeatedly examined and updated as scheduling steps are carried out. The IPC costs of intra-cluster edges in the PSG are all zero; other IPC costs are the same as the corresponding costs in the task graph. Additionally, the DSC algorithm inserts new intra-cluster edges into the PSG so that a linear (total) ordering of actors is always maintained within each cluster. Initially, each actor in the task graph is assigned to its own cluster. Each clustering step selects a task graph actor that has not been selected in any previous clustering step, and determines whether or not to merge the selected actor with one of its predecessors in the PSG. The selection process is based on a priority function $f(A)$, which is defined to be the length of the longest path (computation and communication time) in the PSG that traverses actor A. This priority function fully captures the concept of dynamic critical paths: an actor maximizes $f$ if, and only if, it lies on a critical path of the PSG.
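The priority function can be computed as the sum of two longest-path quantities: the longest path that ends just before an actor starts, and the longest path that begins with that actor. The sketch below illustrates this decomposition; it is not the DSC implementation, the function name is ours, and the PSG shown is hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def dsc_priorities(exec_time, psg_cost):
    """f(a) = length of the longest PSG path that traverses actor a."""
    preds = {a: [] for a in exec_time}
    succs = {a: [] for a in exec_time}
    for (u, v) in psg_cost:
        preds[v].append(u)
        succs[u].append(v)
    order = list(TopologicalSorter(preds).static_order())
    top = {}     # longest path that ends just before a starts executing
    for a in order:
        top[a] = max((top[u] + exec_time[u] + psg_cost[(u, a)] for u in preds[a]),
                     default=0)
    bottom = {}  # longest path that starts with a (including a's execution time)
    for a in reversed(order):
        bottom[a] = exec_time[a] + max((psg_cost[(a, v)] + bottom[v] for v in succs[a]),
                                       default=0)
    return {a: top[a] + bottom[a] for a in exec_time}

# Hypothetical PSG: every actor is still in its own cluster.
exec_time = {"A": 2, "B": 3, "C": 2, "D": 4}
psg_cost = {("A", "B"): 5, ("B", "C"): 1, ("A", "D"): 2, ("D", "C"): 1}
f = dsc_priorities(exec_time, psg_cost)
print(f)
```

Here A, B, and C all attain the maximum priority of 13, which is the critical path length of this hypothetical PSG, while D, which lies off the critical path, has priority 11.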
At a given clustering step, an actor is selected if it is “free,” which means that all of its PSG predecessors have been selected in previous clustering steps, and it maximizes $f$ over all free actors. Thus, if a free actor exists that is on a dynamic critical path, then an actor on the dynamic critical path will be selected. However, it is possible that none of the free actors are on the PSG critical path. In such cases, the selected actor is not on the critical path (in contrast, the DCP algorithm always selects actors that are on the dynamic critical path). Once an actor A is “selected,” its predecessors are sorted in decreasing order of the sum $t(x) + c((x, A)) + \lambda(x)$, where $t(x)$ is the execution time of predecessor $x$, $c((x, A))$ is the IPC cost of edge $(x, A)$, and $\lambda(x)$ is the length of the longest directed path in the PSG that terminates at $x$. A set of one or more predecessors $x_1, x_2, \ldots, x_n$ is then chosen from the head of this sorted list such that “zeroing” (setting the IPC cost to zero) the associated output edges $(x_1, A), (x_2, A), \ldots, (x_n, A)$ minimizes the value of $\lambda(A)$, and hence $f(A)$, in the new PSG that results from clustering the subset $\{x_1, x_2, \ldots, x_n, A\}$ of PSG vertices.

The DSC algorithm was designed with low computational complexity as the primary objective. The algorithm achieves a time complexity of $O((N + E)\log N)$, where $N$ is the number of task graph actors, and $E$ is the number of edges. In contrast, linear clustering is $O(N(N + E))$ [GY92]; internalization is an $O(E(N + E))$ algorithm; ETF is $O(PN^2)$, where $P$ is the number of target processors; DLS is $O(N^3 P\, g(P))$, where $g$ is the complexity of the data routing algorithm that is used to compute $D$; DCP is $O(N^3)$; and the Declustering Algorithm, discussed in Section 5.4.4 below, is likewise more costly than DSC. As with internalization, DSC is designed for a fully connected network containing an unbounded number of processors, and for practical, processor-constrained systems it can be used as a preprocessing or intermediate compilation phase. As discussed in Section 5.1, optimal scheduling in the presence of IPC costs is intractable even for fully connected, infinite processor systems, and thus, given the polynomial complexity of DSC and internalization, we cannot expect guaranteed optimality from these algorithms. However, DSC is shown to be optimal for a number of non-trivial sub-classes of task graphs [YG94].
Sih and Lee have developed a clustering approach called declustering that is based on examining pairs of paths in the task graph to systematically determine which instances of parallelism should be preserved during the clustering process [SL93b]. Rather than exhaustively examining all pairs of paths (in general, a task that is hopelessly time-consuming), the Declustering technique focuses on paths that originate at branch actors, which are actors that have multiple successors. Branch actors are examined in increasing order of their static levels. Examination of a branch actor B begins by sorting its successors in decreasing order of their static levels. The two successors $C_1$ and $C_2$ at the head of this list (highest static levels) are then categorized as being either an Nbranch (“non-intersecting branch”) or an Ibranch (“intersecting branch”) pair. To perform this categorization, it is necessary to compute the transitive closures of $C_1$ and $C_2$. The transitive closure of an actor X in a task graph G, denoted $TC(X)$, is simply the set of actors Y such that there is a delayless path in G directed from X to Y. Given the transitive closures $TC(C_1)$ and $TC(C_2)$, the successor pair $(C_1, C_2)$ is an Nbranch instance if

$$TC(C_1) \cap TC(C_2) = \emptyset, \qquad (5\text{-}16)$$

and otherwise (if the transitive closures have non-empty intersection), $(C_1, C_2)$ is an Ibranch instance. Intuitively, the transitive closure is relevant to the derivation of parallel schedules since two actors can execute in parallel (execute over overlapping segments of time) if, and only if, neither actor is in the transitive closure of the other.

Once the branch-actor successor pair $(C_1, C_2)$ is categorized as being an Ibranch or Nbranch instance, a two-path parallelism instance (TPPI) is derived from it to determine an effective means for capturing the parallelism associated with $(C_1, C_2)$ within a clustering framework. If $(C_1, C_2)$ is an Nbranch instance, then the TPPI associated with $(C_1, C_2)$ is the subgraph formed by combining a longest path (in terms of cumulative execution time) from $C_1$ to any task graph sink actor (an actor that has no output edges), a longest path from $C_2$ to any task graph sink actor, the associated branch actor B, and the connecting edges $(B, C_1)$ and $(B, C_2)$. For example, consider the task graph shown in Figure 5.5(a); for simplicity, assume that the execution time of each actor is unity; observe that the set of branch actors in this graph includes T and U, and consider the TPPI computation associated with branch actor T. The successors of this branch actor, U and V, satisfy $TC(U) = \{W, X\}$ and $TC(V) = \{Y\}$. Thus, we have $TC(U) \cap TC(V) = \emptyset$, which indicates that for branch actor T, the successor pair (U, V) is an Nbranch instance. The TPPI associated with this branch instance is shown in Figure 5.5(b).

If $(C_1, C_2)$ is an Ibranch instance, then the TPPI associated with $(C_1, C_2)$ is derived by first selecting an actor, called the merge actor, from the intersection $TC(C_1) \cap TC(C_2)$ that has maximum static level. The TPPI is then formed by combining a longest path from $C_1$ to the merge actor, a longest path from $C_2$ to the merge actor, the associated branch actor B, and the connecting edges $(B, C_1)$ and $(B, C_2)$. Among the branch actors in Figure 5.5(a), only one has an Ibranch instance associated with it. The corresponding TPPI, derived from merge actor U, is shown in Figure 5.5(c).

After a TPPI is identified, an optimal schedule of the TPPI onto a two-processor architecture is derived. Because of the restricted structure of TPPI topologies, such an optimal schedule can be computed efficiently. Furthermore, depending on whether the TPPI corresponds to an Ibranch or an Nbranch instance, and on whether the optimal two-processor schedule utilizes both target processors, the optimal schedule can be represented by removing zero, one, or two edges, called cut arcs, from the TPPI: after removing the cut arcs from the TPPI, the (one or two) connected components in the resulting subgraph give the processor assignment associated with the optimal two-processor schedule.

The declustering algorithm repeatedly applies the branch actor analysis discussed above for all branch actors in the task graph, and keeps track of all cut arcs that are found during this traversal of branch actors. After the traversal is complete, all cut arcs are temporarily removed from the task graph, and the connected components of the resulting graph are clustered. These clusters are then combined in a pairwise fashion, two clusters at a time, to produce a hierarchy of two-actor clusters. Careful graph analysis is used to guide this hierarchy formation so as to preserve the most useful instances of parallelism for as large a depth as possible within the cluster hierarchy. Then, during the subsequent phases of the Declustering Algorithm, the cluster hierarchy is systematically broken down and scheduled to match the characteristics of the target multiprocessor architecture. For full details on these phases, the reader is encouraged to consult [Sih9 SL93b].
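The Nbranch/Ibranch categorization rests on a simple reachability computation, sketched below. The graph is hypothetical (loosely inspired by the example in the text, not a transcription of Figure 5.5), all edges are assumed to be delayless, and the function names are ours.

```python
def transitive_closure(succs, x):
    """Set of actors reachable from x along directed paths (x itself excluded)."""
    seen, stack = set(), list(succs.get(x, ()))
    while stack:
        y = stack.pop()
        if y not in seen:
            seen.add(y)
            stack.extend(succs.get(y, ()))
    return seen

def classify_branch(succs, c1, c2):
    """Categorize a successor pair of a branch actor as an Nbranch or Ibranch."""
    common = transitive_closure(succs, c1) & transitive_closure(succs, c2)
    return "Ibranch" if common else "Nbranch"

# Hypothetical task graph given as successor lists.
succs = {
    "Q": ["R", "S"], "R": ["U"], "S": ["U"],
    "T": ["U", "V"], "U": ["W", "X"], "V": ["Y"],
}
print(classify_branch(succs, "U", "V"))  # successors of branch actor T
print(classify_branch(succs, "R", "S"))  # successors of branch actor Q
```

In this hypothetical graph the paths leaving T through U and V never meet again, so (U, V) is an Nbranch pair, whereas the paths leaving Q through R and S reconverge at U, so (R, S) is an Ibranch pair.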
Figure 5.5. An illustration of TPPIs in the Declustering Algorithm.

Due in part to the high complexity of the assignment and ordering problems in the presence of IPC costs, independent comparisons on subsets of algorithms developed for the scheduling problem consistently reveal that no single algorithm dominates as a clear “best choice” that handles most applications better than all of the other algorithms (for example, see [LAAG94]). Thus, an important challenge facing tool designers for application-specific multiprocessor implementation is the development of efficient methods for integrating the variety of algorithm innovations in IPC-conscious scheduling so that their advantages can be combined in a systematic manner. One example of an initial effort in this direction is the DS (dynamic selection) strategy [LAAG94]. DS is a list scheduling algorithm that compares the number of available processors to the number of executable (ready) actors at each scheduling step. Depending on the outcome of this comparison, either one step of the DLS algorithm is invoked to complete the current scheduling step, or a minor variation of HLFET is applied to complete the step. This algorithm was motivated by experiments that revealed certain “regions of operation” in which scheduling algorithms exhibit particularly strong or weak performance compared to others. The performance of DS is shown to be significantly better than that of DLS or HLFET alone.
Pipelined scheduling algorithms attempt to efficiently partition a task graph into stages, assign groups of processors to stages, and construct schedules for each pipeline stage. Under such a scheduling model, the slowest pipeline stage determines the throughput of the multiprocessor implementation. In general, pipelining can significantly improve the throughput beyond what is achievable by the classical (minimum-makespan) scheduling problem; however, this improvement in throughput may come at the expense of a significant increase in latency (e.g., over the latency that is achievable by employing a minimum-makespan schedule). Research on pipelined scheduling is at a significantly less mature state than research on the classical problem defined in Section 5.1. Due to its high relevance to DSP and multimedia applications, we expect that in the coming years, there will be increasing activity in the area of pipelined scheduling.

Bokhari developed fundamental results on the mapping of task graphs into pipelined schedules. Bokhari demonstrated an efficient, optimal algorithm for mapping a chain-structured task graph onto a linear chain of processors (the chain pipelining problem). The algorithm is based on an innovative data structure, called the layered assignment graph, for modeling the chain pipelining problem. Figure 5.6 illustrates an instance of the chain pipelining problem, and the corresponding layered assignment graph. The task graph to be scheduled; the linearly-connected target multiprocessor architecture; and the layered assignment graph associated with the given task graph and target architecture are shown in Figures 5.6(a), 5.6(b) and 5.6(c), respectively. The number above each task graph actor $A_i$ in Figure 5.6(a) gives the execution time $t(A_i)$ of $A_i$, and the number above each task graph edge gives the IPC cost between the associated source and sink actors if the source and sink are mapped to successive stages in the linear chain of target processors. If the source and sink actors of an edge are mapped to the same stage (processor), then the communication cost is taken to be zero.

Figure 5.6. An instance of the chain pipelining problem ((a) and (b)), and the associated layered assignment graph.

Given an arbitrary chain-structured task graph $(V, E)$ consisting of actors $X_1, X_2, \ldots, X_n$, in which each edge connects an actor $X_i$ to its successor $X_{i+1}$, the layered assignment graph contains one layer of vertices $S_a$ for each target processor $P_a$. Each vertex in $S_a$ is a triple $(a, b, c)$, which corresponds to the assignment of actors $X_b, X_{b+1}, \ldots, X_c$ to processor $P_a$. The set $S_a$ represents the set of all valid assignments of actor subsets to processor $P_a$ under the chain pipelining scheduling model. Edges in the layered assignment graph model compatibility relationships between elements of successive $S_a$'s, and weights on these edges model the computation and communication costs associated with specific processor assignments. For each pair of vertices $v = (a, b, c)$ and $v' = (a', b', c')$ in “adjacent” layers that satisfy $b' = c + 1$, an edge $e$ directed from $v$ to $v'$ is instantiated in the layered assignment graph. The weight $w(e)$ assigned to $e$ is the total computation time of the subset of actors associated with the assignment $v$ plus the IPC cost between the last actor, $X_c$, associated with $v$ and the first actor, $X_{b'}$, associated with $v'$. In other words,

$$w(e) = \sum_{i=b}^{c} t(X_i) + c((X_c, X_{b'})), \qquad (5\text{-}19)$$

where $c(e)$ denotes the IPC cost associated with edge e in the input task graph.

From the above formulations for the construction of the layered assignment graph, the graph illustrated in Figure 5.6(c) is easily seen to be the layered assignment graph associated with Figures 5.6(a-b). For clarity, the layer identifier $a$ is omitted from the label of each vertex $(a, b, c)$, and instead, the grouping of vertices into layers is designated by the annotations “Layer 1,” “Layer 2,” and “Layer 3” on the right side of Figure 5.6(c). Each dead-end path in Figure 5.6(c) that originates at a vertex in Layer 1 represents a possible solution to the chain pipelining problem for Figure 5.6(a-b). For example, the two-edge path that passes through vertex (2,2,3) corresponds to the processor assignment illustrated in Figure 5.7(a), and a single-edge path that ends at vertex (2,3,4) corresponds to the assignment shown in Figure 5.7(b).

In general, the processing rate, or throughput, of the pipelined implementation associated with a given dead-end path $p = (e_1, e_2, \ldots, e_k)$ in a layered assignment graph, with $src(e_1) \in S_1$, is given by

$$\frac{1}{\max\{w(e_i) \mid 1 \le i \le k\}}; \qquad (5\text{-}20)$$

that is, the throughput is simply the reciprocal of the maximum weight (computation plus communication) of an edge in the path p. An edge in p that achieves this maximum weight is called a bottleneck edge. Thus, the chain pipelining problem reduces to the problem of computing a minimum bottleneck dead-end path in the layered assignment graph that originates in Layer 1. Referring back to the example of Figure 5.6, the throughputs of the assignments corresponding to Figures 5.7(a) and 5.7(b) are easily seen to be 1/9 and 1/11, respectively, and thus, Figure 5.7(a) leads to a more efficient implementation. However, two other paths in Figure 5.6(c) both lead to more efficient pipelined implementations. Both of these paths achieve the maximum achievable throughput of 1/7 for Figure 5.6(a-b). The associated processor assignments are shown in Figure 5.8(a-b), respectively. Bokhari observed that computing minimum bottleneck paths in layered assignment graphs can be performed in polynomial time by applying an adaptation by Edmonds and Karp [EK72] of Dijkstra’s shortest path algorithm [Dij59].
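For small instances, the minimum bottleneck objective can be explored with a direct dynamic program over the positions at which the chain is cut. The sketch below is our own formulation, not Bokhari's layered-graph algorithm or the Edmonds-Karp adaptation: it assumes that each stage is charged its computation time plus the IPC cost of the edge leaving it (as in the edge weights described above), that the last stage has no outgoing IPC, and that processors may be left unused. The chain data are hypothetical and are not taken from Figure 5.6.

```python
from functools import lru_cache

def best_chain_pipelining(exec_time, ipc_cost, num_procs):
    """Return (bottleneck, stage_ends) for a chain of actors 0..n-1.

    ipc_cost[i] is the IPC cost of the edge from actor i to actor i+1.
    stage_ends lists the index of the last actor assigned to each stage.
    """
    n = len(exec_time)
    prefix = [0]
    for t in exec_time:
        prefix.append(prefix[-1] + t)

    def stage_cost(b, c):
        # Computation of actors b..c plus the IPC on the edge leaving actor c.
        comm = ipc_cost[c] if c < n - 1 else 0
        return prefix[c + 1] - prefix[b] + comm

    @lru_cache(maxsize=None)
    def solve(b, procs_left):
        best = (stage_cost(b, n - 1), (n - 1,))  # put all remaining actors in one stage
        if procs_left > 1:
            for c in range(b, n - 1):
                rest, ends = solve(c + 1, procs_left - 1)
                candidate = max(stage_cost(b, c), rest)
                if candidate < best[0]:
                    best = (candidate, (c,) + ends)
        return best

    return solve(0, num_procs)

# Hypothetical four-actor chain mapped onto at most three processors.
bottleneck, stage_ends = best_chain_pipelining([4, 3, 2, 5], [2, 1, 3], 3)
print(bottleneck, stage_ends, "throughput = 1/%d" % bottleneck)
```

For this hypothetical chain the best assignment places the first actor alone on the first processor, the second actor alone on the second, and the last two actors on the third, giving a bottleneck of 7 time units.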
Using the Edmonds-Karp adaptation allows an optimal chain pipelining to be computed in $O(m^2 n^4)$ time, where m is the number of processors in the target multiprocessor architecture, and n is the number of actors in the chain-structured task graph. However, Bokhari has devised a significantly more efficient algorithm that exploits the layered structure of his assignment graph model. Using this technique, optimal solutions to the chain pipelining problem can be computed in $O(mn^3)$ time. Bokhari developed a number of extensions to his algorithm for chain pipelining. These compute optimal solutions for host-satellite pipeline systems under various restrictions on the application structure [Bok88]. A host-satellite system consists of an arbitrary number of independent chain pipelining systems that have access to a single, shared host processor. Heuristics for more general forms of the pipelined scheduling problem (for example, for pipelined scheduling that considers arbitrary task graph topologies, and more general classes of target multiprocessor architectures) have been developed by Hoang and Rabaey; Banerjee et al. [BHCF95]; and Liu and Prasanna [LP98]. A more detailed discussion of pipelined scheduling techniques is beyond the scope of this book. For further elaboration on this topic, the reader is encouraged to consult the aforementioned references.

Figure 5.7. Two possible chain pipelining implementations for the system shown in Figure 5.6. Both of these processor assignments are suboptimal.

Figure 5.8. Two alternative chain pipelining implementations for the system shown in Figure 5.6. Both of these solutions attain the maximum achievable throughput.
This chapter has surveyed IPC-conscious scheduling techniques for application-specific multiprocessors, and has emphasized the broad range of fundamental, graph-theoretic insights that have been established in the context of IPC-conscious scheduling. More specifically, we have reviewed key algorithmic developments in four areas relevant to the scheduling of HSDFGs: static assignment, list scheduling, clustering, and pipelined scheduling. For minimum-makespan scheduling, techniques that jointly address the assignment and ordering sub-problems are typically much more effective than techniques that are based on static assignment algorithms, and thus, list scheduling and clustering have received significantly more attention than static assignment techniques in this problem domain. However, no particular list scheduling or clustering algorithm has emerged as a clear, widely-accepted best algorithm that outperforms all other algorithms on most applications. Moreover, there is recent evidence that techniques to systematically integrate different scheduling strategies can lead to algorithms that significantly outperform the algorithms on which such integrated algorithms are based. The development of such integrated algorithms appears to be a promising direction for further work on scheduling. Another recent trend that is relevant to application-specific multiprocessor implementation is the investigation of pipelined scheduling strategies, which focus on throughput as the key performance metric. In this chapter, we have outlined fundamental results that apply to restricted versions of the pipelined scheduling problem. The development of algorithms that address more general forms of pipelined scheduling, which are commonly encountered in the design of application-specific multiprocessors, is currently an active research area.
The self-timed scheduling strategy described in Chapter 4 introduces synchronization checks when processors communicate; such checks permit variations in actor execution times, but they also imply run-time synchronization and arbitration costs. In this chapter we present a scheduling model called ordered-transactions that alleviates some of these costs, and in doing so, trades off some of the run-time flexibility afforded by the self-timed approach. The ordered-transactions strategy was first proposed by Bier, Lee, and Sriram [LB90]. In this chapter, we describe the idea behind the ordered-transactions strategy; then we discuss the design and hardware implementation of a shared-bus multiprocessor that makes use of this strategy to achieve low-cost interprocessor communication using simple hardware. The software environment for this board, for application specification, scheduling, and object code generation for the DSP processors, is provided by the Ptolemy system developed at the University of California at Berkeley [Pto98].
In the ordered-transactions strategy, we first obtain a fully-static schedule using the execution time estimates, but we discard the precise timing information specified in the fully-static schedule; as in the self-timed schedule, we retain the processor assignment ($\sigma_P$) and the actor ordering on each processor as specified by $\sigma_T$; in addition, we also retain the order in which processors communicate with one another, and we enforce this order at run-time. We formalize the concept of transaction order in the following.

Suppose there are k inter-processor communication points $(s_1, r_1), (s_2, r_2), \ldots, (s_k, r_k)$, where each $(s_i, r_i)$ is a send-receive pair, in the fully-static schedule that we obtain as a first step in the construction of a self-timed schedule. Let R be the set of receive actors, and S be the set of send actors; that is, $R = \{r_1, r_2, \ldots, r_k\}$ and $S = \{s_1, s_2, \ldots, s_k\}$. We define a transaction order to be a sequence $O = (v_1, v_2, \ldots, v_{2k})$, where $\{v_1, v_2, \ldots, v_{2k}\} = S \cup R$ (each communication actor is present in the sequence O). We say that a transaction order O (as defined above) is imposed on a multiprocessor if at run-time the send and receive actors are forced to execute in the sequence specified by O. That is, if $O = (v_1, v_2, \ldots, v_{2k})$, then imposing O means ensuring the constraints

$$end(v_1, j) \le start(v_2, j),\ end(v_2, j) \le start(v_3, j),\ \ldots,\ end(v_{2k-1}, j) \le start(v_{2k}, j), \quad \forall j \ge 0,$$

where j denotes the iteration index.

Thus, the ordered-transactions schedule is essentially a self-timed schedule with the added transaction order constraints specified by O. The transaction order constraints must satisfy the data precedence constraints of the HSDFG being scheduled; we call such a transaction order an admissible transaction order. One simple mechanism to obtain an admissible transaction order is as follows. After a fully-static schedule is obtained using the execution time estimates, an admissible transaction order is obtained from the $\sigma_T$ function by setting the transaction order to $(v_1, v_2, \ldots, v_{2k})$, where $\sigma_T(v_1) \le \sigma_T(v_2) \le \cdots \le \sigma_T(v_{2k})$.
An admissible transaction order can therefore be determined by sorting the set of communication actors ($S \cup R$) according to their start times $\sigma_T$. Figure 6.1 shows an example of how such an order could be derived from a given fully-static schedule. This fully-static schedule corresponds to the HSDFG and schedule illustrated in Chapter 4 (Figure 4.3). Such an order is clearly not the only admissible transaction order; other orderings of the same communication actors can also satisfy all precedence constraints, and hence be admissible. In the next chapter we will discuss how to choose a good transaction order, which turns out to be close to optimal under certain reasonable assumptions. For the purposes of this chapter, we will assume a given admissible transaction order, and defer the details of how to choose a good transaction order to the next chapter.

The transaction order is enforced at run-time by a controller implemented in hardware. The main advantage of ordering interprocessor transactions is that it allows us to restrict access to communication resources statically, based on the communication pattern determined at compile time. Since communication resources are typically shared between processors, the need for run-time arbitration of these resources, as well as the need for sender-receiver synchronization, is eliminated by ordering processor accesses to them; this results in an efficient IPC mechanism at low hardware cost. We have built a prototype four-processor DSP board, called the Ordered Memory Access (OMA) architecture, that demonstrates the ordered-transactions concept. The OMA prototype board utilizes shared memory and a single shared bus for IPC: the sender writes data to a particular shared memory location that is allocated at compile time, and the receiver reads that location. In this multiprocessor, a very simple controller on the board enforces the pre-determined transaction order at run-time, thus eliminating the need for run-time bus arbitration or semaphore synchronization. This results in efficient IPC (comparable to the fully-static strategy) at relatively low hardware cost. As in the self-timed scenario, the ordered-transactions strategy is tolerant of variations in execution times of actors, because the transaction order enforces correct sender-receiver synchronization; however, this strategy is more constrained than self-timed scheduling, which allows the order in which communication actors fire to vary at run-time. The ordered-transactions strategy, therefore, falls in between the fully-static and self-timed strategies in that, like the self-timed strategy, it is tolerant of variations in execution times and, like the fully-static strategy, it has low communication and synchronization costs. These performance issues will be discussed quantitatively in the following chapter; the remainder of this chapter describes the hardware and software implementation of the OMA prototype.
Figure 6.1. One possible transaction order derived from a fully-static schedule.
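Deriving a transaction order by sorting communication actors on their fully-static start times takes only a few lines. In the sketch below, the start times and actor names are hypothetical (they are not read off Figure 6.1), and the check that every send precedes its matching receive is only a simple necessary condition for admissibility, not a full test against the precedence constraints of the graph.

```python
def transaction_order(start_time, comm_actors):
    """Sort send/receive actors by their start times in the fully-static schedule."""
    return sorted(comm_actors, key=lambda v: start_time[v])

def sends_precede_receives(order, pairs):
    """Necessary condition only: each send appears before its matching receive."""
    position = {v: i for i, v in enumerate(order)}
    return all(position[s] < position[r] for s, r in pairs)

# Hypothetical start times of three send-receive pairs.
start_time = {"s1": 3, "r1": 4, "s2": 6, "r2": 7, "s3": 5, "r3": 9}
pairs = [("s1", "r1"), ("s2", "r2"), ("s3", "r3")]

order = transaction_order(start_time, start_time.keys())
print(order, sends_precede_receives(order, pairs))
```

Because a receive in a valid fully-static schedule cannot start before its matching send, an order obtained this way places each send ahead of its receive.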
The OMA architecture uses a single shared bus and shared memory for inter-processor communication. This kind of shared memory architecture is attractive for embedded multiprocessor implementations owing to its relative simplicity and low hardware cost, and to the fact that it is moderately scalable: a fully interconnected processor topology, for example, would not only be much more expensive than a shared bus topology, but would also suffer from its limited scalability. Bus bandwidth limits scalability in shared bus multiprocessors, but for medium throughput applications (digital audio, music, etc.), a single shared bus provides sufficient bandwidth (on the order of 100 MBytes/s). One solution to the scalability problem is the use of multiple busses and hierarchies of busses, for which the ideas behind the OMA architecture directly apply. The reader is referred to Lee and Bier [LB90] for how the OMA concept is extended to such hierarchical bus structures. Although in this book we apply the ordered-transactions strategy to a single shared bus architecture, the synchronization optimization techniques described in Chapters 9 through 11 are applicable to more general platforms and are not restricted to medium throughput applications. From Figure 4.4 we recall that the self-timed scheduling strategy falls naturally into a message-passing paradigm that is implemented by the send and receive primitives inserted in the HSDFG. Accordingly, the shared memory in an architecture implementing such a scheduling strategy is used solely for message passing: the send primitive corresponds to writes to shared memory locations, and the receive primitive corresponds to reads from shared memory. Thus the shared memory is not used for storing shared data structures or for storing shared program code. In a self-timed strategy we can further ensure, at compile time, that each shared memory location is written to by only one processor. 
One way of doing this is to simply assign distinct shared buffers to each of the send primitives; this is the scheme implemented in the multiprocessor DSP code generation domain in the Ptolemy environment [Pto98].
Let us now consider the implementation of IPC in self-timed schedules on such a shared bus multiprocessor. The sender has to write into shared memory, which involves arbitration costs: it has to request access to the shared bus, and the access must be arbitrated by a bus arbiter. Once the sender obtains access to shared memory, it needs to perform a synchronization check on the shared memory location to ensure that the receiver has read the data that was written in the previous iteration, to avoid overwriting previously written data. Such synchronization is typically implemented using a semaphore mechanism; the sender waits until a semaphore is reset before writing to a shared memory location, and upon writing that shared memory location, it sets that semaphore (the semaphore could be a bit in shared memory, one bit for each send operation in the parallel schedule). The receiver, on the other hand, busy-waits until the semaphore is set before reading the shared memory location, and resets the semaphore after completing the read operation. It can easily be verified that this simple protocol guarantees correct sender-receiver synchronization, and, even though the semaphore bits have multiple writers, no atomic test-and-set operation is required of the hardware. In summary, the operations of the sender are: request bus, wait for arbitration, busy-wait until the semaphore is in the correct state, write the shared memory location if the semaphore is in the correct state, and then release the bus. The corresponding operations for the receiver are: request bus, wait for arbitration, busy-wait on the semaphore, read the shared memory location if the semaphore is in the correct state, and release the bus. The IPC costs are therefore due to bus arbitration time and due to semaphore checks. If no special hardware support is employed for IPC, such overhead consumes on the order of tens of instruction cycles, and also expends power, an important concern for portable applications. In addition, semaphore checks consume shared bus bandwidth. An example of this is a four-processor Motorola DSP56000-based shared bus system designed by Dolby Labs for digital audio processing applications. In this machine, processors communicate through shared memory, and a central bus arbiter resolves bus request conflicts between processors. When a processor gets the bus, it performs a semaphore check, and continues with the shared memory transaction if the semaphore is in the correct state. It explicitly releases the bus after completing the shared memory transaction. A receive and a send together consume 30 instruction cycles, even if the semaphores are in their correct state and the processor gets the bus immediately upon request. 
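The semaphore handshake described above can be mimicked in software to see why no atomic test-and-set is needed: the sender only ever sets the bit and the receiver only ever resets it. The sketch below is an illustration under our own assumptions, with Python threads standing in for the two processors and a plain object standing in for one shared memory location; bus requests and arbitration are not modeled, and all names are hypothetical.

```python
import threading
import time

class SharedLocation:
    """One shared-memory word plus its semaphore bit."""
    def __init__(self):
        self.full = False  # set by the sender, reset by the receiver
        self.data = None

def send(loc, value):
    while loc.full:        # busy-wait until the previous value has been read
        time.sleep(0)      # yield so the other thread can make progress
    loc.data = value
    loc.full = True        # signal that new data is available

def receive(loc):
    while not loc.full:    # busy-wait until the sender has written
        time.sleep(0)
    value = loc.data
    loc.full = False       # signal that the location may be overwritten
    return value

def run_transfer(samples):
    loc = SharedLocation()
    received = []
    receiver = threading.Thread(
        target=lambda: received.extend(receive(loc) for _ in samples))
    receiver.start()
    for x in samples:
        send(loc, x)
    receiver.join()
    return received

print(run_transfer(list(range(10))))
```

Every sample passes through the single shared location in order and without being overwritten, at the cost of the repeated semaphore polling that the ordered-transactions approach is designed to avoid.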
Such high cost of communication forces the scheduler to insert few interprocessor communication nodes possible, which in turn limits the amount of parallelism that can be extracted from the algorithm. One solution to this problem is to send more thanone data sample when processor gets access to the bus; the arbitration and sync~onizationcosts are then amortized over several data samples. A mannerhasbeen proposed by ~ivojinovic, 91 is used to move delays in t FG such that data canbe transferred in blocks, instead of one sample at time. Several issues need to be taken care of before the vectorization strategy can be employed. First, retiming SDFCs to be done very carefully: moving delays across actors can change FC causing undesirable transients inthe algorithm the initial state of the implementation. This can potentially be solved by including preamble code to compute the value of the sample co~espondingto the delay when that delay is moved across actors. This, however, results in increased code size, and other
associated code generation complications. Second, the work of Zivojinovic et al. does not apply uniformly to all HSDFGs: if there are tight cycles in the graph that need to be partitioned among processors, the samples simply cannot be “vectorized”. Thus, the presence of a tight cycle precludes arbitrary blocking of data. Third, vectorizing samples leads to increased latency in the implementation; some signal processing tasks such as interactive speech are sensitive to delay, and hence the delay introduced due to blocking of data may be unacceptable. Finally, the problem of vectorizing data in HSDFGs into blocks, even with the above limitations, appears to be fundamentally hard; the algorithms proposed by Zivojinovic et al. have exponential worst-case run-times. Code generated currently by the Ptolemy system does not support blocking (or vectorizing) of data for many of the above reasons.
Another possible solution is to use special hardware. One could provide a full interconnection network, thus obviating the need to go through shared memory. Semaphores could be implemented in hardware. One could use multi-ported memories. Needless to say, this solution is not favorable because of cost and potentially higher power consumption, especially when targeting embedded applications. A general-purpose shared bus machine, the Sequent Balance for example, will typically use caches between the processor and the bus. Caches lead to increased shared memory bandwidth due to the averaging effect provided by block fetches and due to probabilistic memory access speedup due to cache hits.
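Returning to the vectorization idea discussed above, the trade-off between amortized overhead and added latency can be made concrete with a small calculation. This is a sketch under stated assumptions: the split of the 30-cycle send/receive cost into a fixed part (28 cycles of arbitration and semaphore handling) and a per-word part (2 cycles) is hypothetical and chosen only so that a block of one sample reproduces the 30-cycle figure quoted earlier; the function names are illustrative.

```python
def cycles_per_sample(fixed_overhead, cycles_per_word, block_size):
    """Average IPC cost per sample when block_size samples share one bus tenure."""
    return fixed_overhead / block_size + cycles_per_word


def added_latency_in_samples(block_size):
    """The first sample of a block waits for the remaining block_size - 1 samples."""
    return block_size - 1


# Assumed split of the 30-cycle cost: 28 cycles fixed, 2 cycles per word.
FIXED, PER_WORD = 28, 2
table = [(b, cycles_per_sample(FIXED, PER_WORD, b), added_latency_in_samples(b))
         for b in (1, 4, 14)]
```

Under these assumptions the per-sample cost falls from 30 cycles to 9 and then 4 cycles as the block grows from 1 to 4 to 14 samples, while the added latency grows linearly with the block size, which is why delay-sensitive tasks may not tolerate blocking.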
In signal processing and other real-time applications, however, there are stringent requirements for deterministic performance guarantees as opposed to probabilistic speedup. In fact, the unpredictability in task execution times introduced by the use of caches may be a disadvantage for static scheduling techniques that utilize compile-time estimates of task execution times to make scheduling decisions (we recall the discussion in Section 4.9 on techniques for estimating task execution times). In addition, due to the deterministic nature of most signal processing problems (and many scientific computation problems), shared data can be deterministically prefetched because information about when particular blocks of data are required by a particular processor can often be predicted by a compiler. This feature has been studied in prior work whose authors propose memory allocation schemes that exploit predictability in the memory access pattern in DSP algorithms; such a “smart allocation” scheme alleviates some of the memory bandwidth problems associated with high-throughput applications. Processors with caches can cache semaphores locally, so that busy-waiting can be done local to the processor without having to access the shared bus, hence saving the bus bandwidth normally expended on semaphore checks. Such a procedure, however, requires special hardware (a snooping cache controller, for
example) to maintain cache coherence; the cost of such hardware usually makes it prohibitive in embedded scenarios. Thus, for the embedded signal, image, and video signal processing applications that are the primary focus of this book, we argue that caches do not often have a significant role to play, and we claim that the ordered-transactions approach discussed previously provides a cost-effective solution for minimizing IPC overhead in implementing self-timed schedules.
The ordered-transactions strategy, we recall, operates on the principle of determining (at compile time) the order in which processor communications occur, and enforcing that order at run-time. For a shared bus implementation, this translates into determining the sequence of shared memory (or, equivalently, shared bus) accesses at compile time and enforcing this predetermined order at run-time. This strategy, therefore, involves no run-time arbitration; processors are simply granted the bus according to the predetermined access order. When a processor obtains access to the bus, it performs the necessary shared memory transaction, and releases the bus; the bus is then granted to the next processor in the ordered list. The task of maintaining ordered access to shared memory is done by a transaction controller. When the processors are downloaded with code, the controller too is loaded with the predetermined access order list. At run-time the controller simply grants bus access to processors according to this list, granting access to the next processor in the list when the current bus owner releases the bus. Such a mechanism is robust with respect to variations in execution times of the actors; the functionality of the system is unaffected by poor estimates of these execution times, although the real-time performance obviously suffers, as in any scheduling strategy that involves static ordering and assignment. We will show that if we are able to perform accurate compile-time analysis, then the new transaction ordering constraints do not significantly impact performance. Also, no arbitration needs to be done since the transaction controller grants exclusive access to the bus to each processor. In addition, no semaphore synchronization needs to be performed, because the transaction ordering constraints respect data precedences in the algorithm; when a processor accesses a shared memory location and is correspondingly allowed access to it, the data accessed by that processor is certain to be valid.
As a result, under an ordered-transactions scenario, a send (receive) operation always occupies the shared bus for only one shared memory write (read) cycle. This reduces contention for the bus and reduces the number of shared memory accesses required for each IPC operation by at least a factor of two and possibly much more, depending on the
amount of polling required in a conventional arbitration-based shared bus implementation. The performance of this scheme depends on how accurately the execution times of the actors are known at compile time. If these compile-time estimates are reasonably accurate, then an access order can be obtained such that a processor gains access to shared memory whenever necessary. Otherwise, a processor may have to idle until it gets a bus grant, or, even worse, a processor when granted the bus may not complete its transaction immediately, thus blocking other processors from accessing the bus. This problem would not arise in normal arbitration schemes, because dynamic reordering of independent shared memory accesses is possible. We will quantify these performance issues in the next chapter, where we show that when reasonably good estimates of actor execution times are available, forcing a run-time access order does not in fact sacrifice performance significantly.
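The effect of inaccurate execution time estimates on a fixed access order can be illustrated with a short simulation. This is a minimal sketch, not the analysis of the next chapter: `ordered_bus_trace` is a hypothetical helper, each transaction is assumed to occupy the bus for one time unit, and `ready_time` stands for the moment at which the corresponding processor actually reaches its send or receive.

```python
def ordered_bus_trace(access_list, ready_time, cycle=1):
    """Replay a compile-time access order against actual ready times.

    access_list[i] is the processor ID of the i-th transaction in the order list;
    ready_time[i] is when that processor is actually ready to perform it.
    Returns one (processor, granted_at, started_at, released_at) tuple per transaction.
    """
    now = 0
    trace = []
    for proc, ready in zip(access_list, ready_time):
        granted = now                 # bus is granted as soon as the previous owner releases it
        started = max(granted, ready) # the bus sits idle if the new owner is not ready yet
        now = started + cycle         # one shared memory cycle, then release
        trace.append((proc, granted, started, now))
    return trace


order = [0, 1, 0, 2]
good = ordered_bus_trace(order, ready_time=[0, 1, 2, 3])  # estimates were accurate
late = ordered_bus_trace(order, ready_time=[0, 5, 2, 3])  # processor 1 runs late
```

In the second run the order of transactions is unchanged, so functionality is preserved, but processors 0 and 2 are held up behind the late transaction of processor 1; an arbiter that could reorder independent accesses would not impose this delay.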
with floating point level block diagram of a
processors are connected to the shared bus, and shared memory resides on the
the assertion. After a processor obtains access to the shared bus, it performs a single shared memory operation (send or receive) and releases the bus. The transaction controller detects the release of the bus and steps through its ordered list, granting the bus to the next processor in its list. The cost of transfer of one word of data between processors is 3 instruction cycles in the ideal case where the sender and the receiver obtain access to the shared bus immediately upon request; two of these correspond to a shared memory write (by the sender) and a shared memory read (by the receiver), and an extra instruction cycle is expended in bus release by the sender and acquisition by the receiver. Such low-overhead interprocessor communication is obtained with the transaction controller providing the only additional hardware support. As described in a subsequent section, this controller can be implemented with very simple hardware.
In the design discussed above, processor-to-processor communication occurs through a central shared memory; two transactions (one write and one read) must occur over the shared bus for each sender-receiver pair. This situation can be improved by distributing the shared memory among processors, as shown in Figure 6.3, where each processor is assigned shared memory in the form of hardware FIFO buffers. Writes to each FIFO are accomplished through the shared bus; the sender simply writes to the FIFO of the processor to which it
Figure 6.2. Block diagram of the OMA prototype.
wants to send data by using the appropriate shared memory address. Use of a FIFO implies that the receiver must know the exact order in which data is written into its input queue. This, however, is guaranteed by the ordered-transactions strategy. Thus replacing a RAM (random access memory) based shared memory with distributed FIFOs does not alter the functionality of the design. The sender need only block when the receiving queue is full, which can be accomplished in hardware by using the ‘Transfer Acknowledge (TA)’ signal on the DSP96002; a device can insert an arbitrary number of wait states in the processor memory cycle by de-asserting the TA line. Whenever a particular FIFO is accessed, its ‘Buffer Full’ line is enabled onto the TA line of the processors (Figure 6.4). Thus a full FIFO automatically blocks the processor trying to write into it, and no polling needs to be done by the sender. At the receiving end, reads are local to a processor, and do not consume shared bus bandwidth. The receiver can be made to either poll the FIFO empty line to check for an empty queue, or one can use the same TA signal mechanism to block processor reads from an empty queue. The TA mechanism will then use the local (“A”) bus control signals (“A” bus TA signal, “A” address bus, etc.). This is illustrated in Figure 6.4.
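The blocking behavior of a full FIFO can be modeled cycle by cycle. The following is a behavioral sketch only, not a description of the DSP96002 bus timing: `run_fifo` and its parameters are hypothetical, the sender is assumed to attempt one write per cycle, and the receiver is assumed to read only on the cycles listed in `read_cycles`.

```python
from collections import deque


def run_fifo(depth, samples, read_cycles, max_cycles=100):
    """Model a sender writing into a bounded FIFO that stalls it when full.

    Returns (received_values, sender_wait_states).
    """
    fifo = deque()
    to_send = deque(samples)
    received = []
    wait_states = 0
    for cycle in range(max_cycles):
        # Receiver side: a local read that does not use the shared bus.
        if cycle in read_cycles and fifo:
            received.append(fifo.popleft())
        # Sender side: 'Buffer Full' drives TA, so a write to a full FIFO
        # simply stretches the bus cycle by one wait state.
        if to_send:
            if len(fifo) >= depth:
                wait_states += 1
            else:
                fifo.append(to_send.popleft())
        if not to_send and not fifo:
            break
    return received, wait_states


result = run_fifo(depth=2, samples=[1, 2, 3, 4], read_cycles={3, 4, 5, 6})
```

With a two-entry FIFO and a receiver that starts reading at cycle 3, the sender is stalled for one wait state and the data still arrives in the order in which it was written, without any polling by the sender.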
THE ORDERED-TRANSACTIONS STRATEGY
Use of such a distributed shared memory mechanism has several advantages. First, the shared bus traffic is effectively halved, because only writes need to go through the shared bus. Second, in the design of Figure 6.2, if a processor that is granted the bus is delayed in completing its shared memory access, all other processors waiting for the bus get stalled; this does not happen for half the transactions in the modified design of Figure 6.3 because receiver reads are local. Thus there is more tolerance to variations in the time at which a receiver reads data sent to it. Last, a processor can broadcast data to all (or any subset) of the processors in the system by simultaneously writing to more than one FIFO buffer. Such a broadcast is not possible with a central shared memory. The modified design, however, involves significantly higher hardware cost than the design proposed in Section 6.5.1. As a result, the prototype discussed in the following sections (Sections 6.6 to 6.9) was built around the central shared memory design and not the FIFO-based design. In addition, the DSP96002 processor has an on-chip host interface unit that can be used as a FIFO; therefore, the potential advantage of using distributed FIFOs can still
Figure 6.4. Details of the “TA” line mechanism (only one processor is shown).
be evaluated to some degree by using the on-chip host interface even in the absence of external FIFO hardware. Simulation models were written for both the above designs using the Thor hardware simulator [Tho86] under the Frigg multiprocessor simulator system. Frigg allows the Thor simulator to communicate with a timing-driven functional simulator for the processor provided by Motorola. The Motorola simulator also simulates Input/Output operations of the pins of the processor, and Frigg interfaces the signals on the pins to the rest of the Thor simulation; as a result, hardware associated with each processor (memories, decoding logic, etc.) and interaction between processors can be simulated using Frigg. This allows the functionality of the entire system to be verified by running actual programs on the processor simulators. This model was not used for performance evaluation of the OMA prototype, however, because with just a four-processor system the cycle-by-cycle Frigg simulation was far too slow, even for very simple programs. A higher-level (behavioral) simulation would be more useful than a cycle-by-cycle simulation for the purposes of performance evaluation, although such a high-level simulation was not carried out on the OMA prototype. The remainder of this chapter describes the hardware and software design of the OMA board prototype.
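A prototype of the OMA architecture has been designed as a single-board design: the board comprises four DSP96002 processors, and the transaction controller is implemented on a Xilinx FPGA. The Xilinx chip also handles the host interface functions, and implements a simple I/O mechanism. A hierarchical description of the hardware design follows.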
The top-level design is shown in Figure 6.5. At the top level, there are four “processing element” blocks that consist of the processor, local memory, local address decoder, and some glue logic. Address, data, and control busses from the PE blocks are connected to form the shared bus. Shared memory is connected to this bus; address decoding is done by the “shared address decoder” PAL (programmable array logic) chip. A central clock generator provides a common clock signal. The Xilinx FPGA implements the transaction controller and a simple I/O interface, and is also used to implement latches and buffers during bootup. A fast static RAM stores the bus access order in the form of processor identifications (IDs). The sequence of processor IDs is stored in this “schedule RAM”, and this determines the bus access order. An external latch is used to store the processor ID read from the schedule RAM. This ID is then decoded to obtain the processor bus grants.
A subset of the 32 shared bus address lines connect to the Xilinx chip, for addressing the I/O registers and other internal registers. All 32 lines from the shared data bus are connected to the Xilinx. The shared data bus can be accessed from the external connector (the “right side” connector in Figure 6.5) only through the Xilinx chip. This feature can be made use of when connecting multiple OMA boards: shared busses from different boards can be made into one contiguous bus, or they can be left disconnected, with communication between busses occurring via asynchronous “bridges” implemented on the Xilinx FPGAs. We discuss this further in Section 6.6.7. Connectors on both ends of the board bring out the shared bus in its entirety. Both left and right side connectors follow the same format, so that multiple boards can be easily connected together. Shared control and address busses are buffered before they go off board via the connectors, and the shared data bus is buffered within the Xilinx. The DSP96000 processors have on-chip emulation (“OnCE” in Motorola terminology) circuitry for debugging purposes, whereby a serial interface to the OnCE port of a processor can be used for in-circuit debugging. On the OMA board, the OnCE ports of the four processors are multiplexed and brought out as a single serial port; a host may select any one of the four OnCE ports and communicate with it through a serial interface. We discuss the design details of the individual components of the prototype system next.
The task of the transaction order controller is to enforce the predetermined bus access order at run-time. A given transaction order determines the sequence of processor bus accesses that must be enforced at run-time. We refer to this sequence of bus accesses by the term bus access order list. Since the bus access order list is program-dependent, the controller must possess memory into which this list is downloaded after the scheduling and code generation steps are completed, and when the transaction order that needs to be enforced is determined. The controller must step through the access order list, and must loop back to the first processor ID in the list when it reaches the end. In addition, the controller must be designed to effectively use the bus arbitration logic present on-chip, to conserve hardware.
The bus grant (BG) signal on the DSP chip is used to allow the processor to perform a shared bus access, and the bus request (BR) signal is used to tell the controller when a processor completes its shared bus access.
Each of the two ports on the DSP96002 has its own set of arbitration signals; the BG and BR signals are the most relevant signals for the OMA design, and these signals are relevant only for the processor port connected to the shared bus. As the name suggests, the BG line (which is an input to the processor) must be asserted before a processor can begin a bus cycle: the processor is forced to wait for BG to be asserted before it can proceed with the instruction that requires access to the bus. Whenever an external bus cycle needs to be performed, a processor asserts its BR signal, and this signal remains asserted until an instruction that does not access the shared bus is executed. We can therefore use the BR signal to determine when a shared bus owner has completed its usage of the shared bus (Figure 6.6).
The rising edge of the BR line is used to detect when a processor releases the bus. To reduce the number of signals going from the processors to the controller, we multiplexed the BR signals from all processors onto a common BR signal. The current bus owner has its BR output enabled onto this common BR signal; this provides sufficient information to the controller because the controller only needs to observe the BR line from the current bus owner. This arrangement is shown in Figure 6.6 (b); the controller grants access to a processor by asserting the corresponding BG line, and then it waits for a rising edge on the common BR line. On receiving a positive-going edge on this line, it grants the bus to the next processor in its list.
One straightforward implementation of the above functionality is to use a counter addressing a RAM that stores the access order list in the form of processor IDs. We call this counter the schedule counter, and the memory that stores the processor IDs is called the schedule RAM. Decoding the output of the RAM provides the required BG lines. The counter is incremented at the beginning of a processor transaction by the negative-going edge of the common BR signal, and the output of the RAM is latched at the positive-going edge of BR, thus granting the bus to the next processor as soon as the current processor completes its shared memory transaction. The counter is reset to zero after it reaches the end of the list (i.e., the counter counts modulo the bus access list size). This is shown in Figure 6.7. Incrementing the counter as soon as BR goes low ensures enough time for the counter outputs and the RAM outputs to stabilize. For a 33 MHz processor with zero wait states, the BR width is a minimum of 60 nanoseconds. Thus the counter incrementing and the RAM access must both finish before this time.
We therefore need a fast counter and fast static RAM for the schedule memory. The width of the counter determines the maximum allowable size of the access list (a counter width of n bits implies a maximum list size of 2^n); a wider counter, however, implies a slower counter. If, for a certain width, the counter (implemented on the Xilinx part in our case) turns out to be too slow, i.e., the output of the schedule memory will not stabilize at least one latch setup period before the positive-going edge of BR arrives, then wait states have to be inserted in the processor bus cycle to delay the positive edge of BR. We found that a 10-bit-wide counter does not require any wait states, and allows a maximum of 1024 processor IDs in the access order list.
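[Figure 6.7: a counter (count up) addresses the schedule RAM, in which each address contains one access list entry (address : processor ID); the RAM output is latched, and the latch output is decoded into the bus grant lines (BG0, ...).]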
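The counter, schedule RAM, latch, and decoder arrangement described above can be sketched as a small behavioral model. This is an illustration only and not the Xilinx implementation: the class and method names are hypothetical, signal polarity and timing are not modeled, and granting the first list entry at start-up is an assumption.

```python
class TransactionController:
    """Behavioral sketch of a counter-addressed schedule RAM with a latch and decoder."""

    def __init__(self, schedule_ram, list_size, num_procs):
        self.ram = schedule_ram         # processor IDs in bus access order
        self.list_size = list_size      # the counter counts modulo this size
        self.num_procs = num_procs
        self.counter = 0
        self.latch = self.ram[0]        # assumed: first entry is granted at start-up

    def br_falling_edge(self):
        """Owner begins its bus cycle: advance the counter early so the RAM output can settle."""
        self.counter = (self.counter + 1) % self.list_size

    def br_rising_edge(self):
        """Owner releases the bus: latch the next processor ID."""
        self.latch = self.ram[self.counter]

    def bus_grants(self):
        """Decode the latched ID into one grant line per processor (True = granted)."""
        return [pid == self.latch for pid in range(self.num_procs)]


def max_list_size(counter_width_bits):
    """An n-bit counter can address at most 2**n schedule entries."""
    return 2 ** counter_width_bits


ctrl = TransactionController(schedule_ram=[0, 2, 1, 0, 3], list_size=5, num_procs=4)
owners = []
for _ in range(7):
    owners.append(ctrl.latch)   # processor currently holding the bus
    ctrl.br_falling_edge()      # its transaction starts
    ctrl.br_rising_edge()       # its transaction ends; next owner is granted
```

Running seven transactions against a five-entry list shows the wrap-around to the first entry, and `max_list_size(10)` reproduces the 1024-entry limit of a 10-bit counter.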
A single bus access list implies we can only enforce one bus access pattern at run-time. In order to allow for some run-time flexibility, we have implemented the OMA controller using a presettable counter. The processor that currently owns the bus can preset this counter by writing to a certain shared memory location. This causes the controller to jump to another location in the schedule memory, allowing multiple bus access schedules to be maintained in the schedule RAM and switching between them at run-time depending on the outcome of computations in the program. The counter appears as an address in the shared memory map of the processors. The presettable counter mechanism is shown in Figure 6.8. An arbitrary number of lists may, in principle, be maintained in the schedule memory. This feature can be used to support algorithms that display data dependency in their execution. For example, a dataflow graph with a conditional construct will, in general, require a different access schedule for each outcome of the conditional. One of two different SDF subgraphs is executed in this case, depending on the branch outcome, and the processor that determines the branch outcome can be assigned the task of presetting the counter, making it branch to the access list of the appropriate SDF subgraph. The access controller behaves as in Figure 6.8 (b). We discuss the use of this presettable feature in detail later in the book.
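The branching behavior of the presettable counter can be sketched as follows. This is a hypothetical model rather than the circuit of Figure 6.8: the schedule RAM layout, the addresses `LIST_A` and `LIST_B`, the convention that the last owner of each list presets the counter back to address 0, and the folding of the preset write into the owner's own transaction are all assumptions made for illustration.

```python
class PresettableController:
    """Schedule counter that the current bus owner may overwrite (hypothetical model)."""

    def __init__(self, schedule_ram):
        self.ram = schedule_ram
        self.counter = 0

    def current_owner(self):
        return self.ram[self.counter]

    def release(self, preset_to=None):
        """Step to the next entry, or jump if the owner wrote a new counter value."""
        if preset_to is None:
            self.counter = (self.counter + 1) % len(self.ram)
        else:
            self.counter = preset_to


# Assumed layout: address 0 holds the processor that evaluates the conditional,
# addresses 1-2 hold the access list for one branch, addresses 3-5 for the other.
RAM = [0, 1, 2, 2, 3, 1]
LIST_A, LIST_B = 1, 3
END_OF_LIST = {2, 5}   # owners at these addresses preset the counter back to 0


def run_iteration(ctrl, take_a):
    """Return the sequence of bus owners for one iteration of the conditional."""
    owners = []
    while True:
        addr = ctrl.counter
        owners.append(ctrl.current_owner())
        if addr == 0:
            ctrl.release(preset_to=LIST_A if take_a else LIST_B)
        elif addr in END_OF_LIST:
            ctrl.release(preset_to=0)
            return owners
        else:
            ctrl.release()


ctrl = PresettableController(RAM)
first = run_iteration(ctrl, take_a=True)
second = run_iteration(ctrl, take_a=False)
```

The two iterations follow different access lists from the same schedule RAM, selected at run-time by the processor that knows the branch outcome.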
Figure 6.8. Presettable counter implementation.
The function of the host interface is to allow downloading programs onto the OMA board, controlling the board, setting parameters of the application being run, and debugging from a host workstation. The host for the OMA board connects to the shared bus through the Xilinx chip, via one of the shared bus connectors. Since part of the host interface is configured inside the Xilinx, different hosts (32 bit, 16 bit) with different handshake mechanisms can be used with the board. The host that is being used for the prototype is a Motorola DSP56000-based DSP board called the S-56X card, manufactured by Ariel Corp [Ari91]. The S-56X card is designed to fit into one of the Sbus slots in a Sun Sparc workstation; a user-level process can communicate with the S-56X card via a Unix device driver. Thus the OMA board too can be controlled (via the S-56X card) by a user process running on the workstation. The host interface configuration is depicted in Figure 6.9. Unlike the DSP56000 processors, the DSP96002 processors do not have built-in serial ports, so the S-56X board is also used as a serial I/O processor for the OMA board. It essentially performs serial-to-parallel conversion of data, buffering of data, and interrupt management. The Xilinx on the OMA board implements the necessary transmit and receive registers, and synchronization flags; we discuss the details of the Xilinx circuitry in Section 6.6.5. The S-56X card communicates with the Sparc Sbus using DMA (direct memory access). A part of the DSP56000 bus and control signals are brought out of the S-56X card through another Xilinx FPGA on the S-56X. For the purpose of interfacing the S-56X board with the OMA board, the Xilinx on the S-56X card is configured to bring out 16 bits of data and 5 bits of address from the DSP56000 processor onto the cable connected to the OMA (see Figure 6.9). In addition, the I/O port (the SSI port) is also brought out, for interface with I/O devices such as A/D and D/A convertors. By making the DSP56000 write to appropriate memory locations, the 5 bits of address and 16 bits of data going into the OMA may be set and strobed for a read or a write, to or from the OMA board. In other words, the OMA board occupies certain locations in the DSP56000 memory map; host communication is done by reading and writing to these memory locations.
Figure 6.9. Host
Each processing element consists of a DSP96002 processor, local memory, a local address decoder, and some address decoding logic. [...] is very similar to the design of the [...] Development System) board. [...] brought out into a 96-pin euro-connector. This connector can be used for local memory expansion; we have used it for providing a local I/O interface to the processing element (as an alternative to using the shared bus for I/O). The “A” port of the processor forms the local bus, connecting to local memory and address decoding logic. [...] also contains address buffers, and logic to set up the bootup mode [...] the processor connected to the shared bus.
[...] communicating with one another [...] on each processor can be [...]
Chapter 6
Section 6.5.2, except that the FIFO is internal to each processor.
As mentioned previously, the XC3090 Xilinx FPGA is used to implement the transaction controller as well as a simple I/O interface. It is also configured to provide latches and buffers for addressing the Host Interface (HI) ports on the DSP96002 during bootup and downloading of code onto the processors. For this to work, the Xilinx is first configured to implement the bootup- and download-related circuitry, which consists of latches to drive the shared address bus and to access the schedule memory. After downloading code onto the processors, and downloading the bus access order into the schedule RAM, the Xilinx chip is reconfigured to implement the transaction controller and the I/O interface. Thus the process of downloading and running a program requires configuring the Xilinx chip twice. There are several possible ways in which
a Xilinx part may be programmed. [...] bitmap downloaded byte-wise by the host (the S-56X card). The bitmap file, generated and stored as a binary file, is read in by a function implemented in the qdm software ([...] which describes the software interface), and the bytes are written into the appropriate memory location on the S-56X card, which strobes these bytes into the Xilinx. The user can reset and reconfigure the Xilinx [...] the Xilinx control pins by writing to a “Xilinx configuration latch” on the OMA board. The configuration pins of the Xilinx chip are manipulated by writing different values into this latch.
We use two different Xilinx circuits, one during bootup and the other during run-time. The Xilinx configuration during bootup helps eliminate some glue logic that would otherwise be required to latch and decode address and data from the S-56X host. This configuration allows the host to read and write from any of the HI ports of the processors, and also to access the schedule memory and the shared memory on board. The run-time configuration on the Xilinx consists of the transaction controller implemented as a presettable counter. The counter can be preset through the shared bus. It addresses an external fast RAM (8 nanosecond access time) that contains processor IDs corresponding to the bus access schedule. Output from the schedule memory is externally latched and decoded to yield the BG grant lines (Figure 6.7).
A schematic of the Xilinx configuration at run-time is given in Figure 6.11. This configuration is for I/O with an S-56X (16-bit data) host, although it can easily be modified to work with a 32-bit host. The S-56X board reads data from the Transmit (Tx) register and writes into the Receive (Rx) register on the Xilinx. These registers are memory-mapped to the shared bus, such that any processor that possesses the bus may write to the Tx register or read from the Rx register. For a 16-bit host, two transactions are required to perform a read or write with the 32-bit Tx and Rx registers. The processors themselves need only one bus access to load or unload data from the I/O interface. Synchronization on the S-56X (host) side is done by polling status bits that indicate an Rx empty flag (if true, the host performs a write, otherwise it busy-waits) and a Tx full flag (if true, the host performs a read, otherwise it busy-waits). On the OMA side, synchronization is done by the use of the TA (transfer acknowledge) pin on the processors. When a processor attempts to read Rx or write Tx, the appropriate status flags are enabled onto the TA line, and wait states are automatically inserted in the processor bus cycle whenever the TA line is not asserted, which in our implementation translates to wait states whenever the status flags are false. Thus, processors do not have the overhead of polling the I/O status flags; an I/O transaction is identical to a normal bus access, with zero or more wait states inserted automatically. The DSP56000 processor on the S-56X card is responsible for performing I/O with the actual (possibly asynchronous) data source and acts as the interrupt processor for the OMA board, relieving the board of tasks such as interrupt servicing and data buffering. This of course has the downside that the S-56X host needs to be dedicated as an I/O unit for the OMA processor board, and limits other tasks that could potentially run on the host.
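The asymmetry between the 16-bit host and the 32-bit processors on the Rx register can be sketched in software. This is a hypothetical model, not the Xilinx circuit: the class name, the convention that the host writes the high half first, and the representation of a TA wait state as a returned flag are all assumptions.

```python
class RxRegister:
    """32-bit receive register filled by a 16-bit host and read by a processor in one access."""

    def __init__(self):
        self.value = 0
        self.empty = True     # status bit polled by the host
        self._high = None     # first half written by the host, if any

    def host_write16(self, half):
        """Host side: returns False if the register is still full and the host must keep polling."""
        if not self.empty:
            return False
        if self._high is None:
            self._high = half & 0xFFFF            # assumed order: high half first
        else:
            self.value = (self._high << 16) | (half & 0xFFFF)
            self._high = None
            self.empty = False                    # register now holds a complete word
        return True

    def processor_read32(self):
        """Processor side: returns (wait, value); wait=True stands for an inserted wait state."""
        if self.empty:
            return True, None
        self.empty = True
        return False, self.value


rx = RxRegister()
early = rx.processor_read32()        # nothing written yet: the bus cycle would be stretched
rx.host_write16(0x1234)              # first host transaction
rx.host_write16(0xABCD)              # second host transaction completes the word
refused = rx.host_write16(0x0001)    # register full: host must busy-wait
word = rx.processor_read32()         # single 32-bit access by the processor
```

The host needs two transactions and explicit polling per word, whereas the processor either completes its single access or is transparently stalled.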
[...] memory modules are provided, so that up to 512 [...] can reside on board. The memory must have an access time of 25 ns to achieve zero-wait-state operation.
Several features have been included in the design to facilitate connecting together multiple OMA boards. The connectors on either end of the shared bus are compatible, so that boards may be connected together in a linear fashion (Figure 6.12). As mentioned before, the shared data bus goes to the “right side connector” through the Xilinx chip. By configuring the Xilinx to “short” the external and internal shared data busses, processors on different boards can be made to share one contiguous bus. Alternatively, busses can be “cleaved” on the Xilinx chip, with communication between busses implemented on the Xilinx via an asynchronous mechanism (e.g., read and write latches synchronized by “full” and “empty” flags). This concept is similar to the idea used in the SMART processor array [Koh90], where the processing elements are connected to a switchable bus: when the bus switches are open, processors are connected only to their neighbors (forming a linear processor array), and when the switches are closed, processors are connected onto a contiguous bus. Thus the SMART array allows formation of clusters of processors that reside on a common bus; these clusters then communicate with adjacent clusters. When we connect multiple OMA boards together, we get a similar effect: in the “shorted” configuration processors on different boards connect to a single bus, whereas in the “cleaved” configuration processors on different boards reside on separate busses, and neighboring boards communicate through an asynchronous interface. Figure 6.12 illustrates the above scheme. The highest bits of the shared address bus are used as the “board ID” field. Memory, processor Host Interface ports, configuration latches, etc. decode the board ID field to determine if a shared memory or host access is meant for them. Thus, a total of 8 boards can be hooked onto a common bus in this scheme.
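The board ID decoding can be illustrated with a few bit operations. This is a sketch under an explicit assumption: a limit of 8 boards suggests a 3-bit field, so the example takes the board ID to be the top 3 bits of a 32-bit shared address; the constant and function names are illustrative.

```python
ADDRESS_BITS = 32    # width of the shared address bus
BOARD_ID_BITS = 3    # assumed: 8 boards need 3 bits
OFFSET_BITS = ADDRESS_BITS - BOARD_ID_BITS


def board_id(address):
    """Extract the board ID field from the highest address bits."""
    return (address >> OFFSET_BITS) & ((1 << BOARD_ID_BITS) - 1)


def local_offset(address):
    """Remaining low-order bits, interpreted within the selected board."""
    return address & ((1 << OFFSET_BITS) - 1)


def make_address(board, offset):
    """Compose a shared bus address that targets a given board."""
    return (board << OFFSET_BITS) | offset


def is_for_me(address, my_board):
    """What a memory or Host Interface decoder on board my_board would check."""
    return board_id(address) == my_board


addr = make_address(5, 0x100)
```

Each device on a board compares the board ID field with its own board number and ignores accesses addressed to other boards, which is what allows several boards to share one contiguous bus.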
We used single-sided through-hole printed circuit board technology for the OMA prototype. The printed circuit board design was done using the SIERA system developed in Professor Brodersen’s group at the University of California at Berkeley [Sri92]. Under this system, a design is entered hierarchically using a netlist language called SDL (Structure Description Language). Geometric placement of components can be easily specified in the SDL netlist itself. A ‘tiling’ feature is also provided to ease compact fitting of components. The SDL files were written in a modular fashion; the schematics hierarchy is shown in Figure 6.12. The SIERA design manager (DMoct) was then used to translate the netlists into an input file acceptable by Racal, a commercial PCB layout tool, which was then used to auto-route the board. Figure 6.13 shows a photograph of the board.
As discussed earlier, we use an S-56X card attached to a Sparc host for the OMA board. The Xilinx chip on the S-56X card is configured to provide 16 bits of data and 5 bits of address. We use the qdm [Lap91] software as an interface for the S-56X board; qdm is a debugger/monitor that has several useful built-in routines for controlling the S-56X board; for example, data can be written and read from any location in the DSP56000 address space through function calls in qdm. Another useful feature of qdm is that it uses Tcl, an embeddable, extensible, shell-like interpreted command language [Ous94]. Tcl provides a set of built-in functions (such as an expression evaluator, variables, control-flow statements, etc.) that can be executed via user commands typed at its textual interface, or from a specified command file. Tcl can be extended with application-specific commands; in our case, these commands correspond to the debugging/monitor commands implemented in qdm as well as commands specific to the OMA. Another useful feature of Tcl is the scripting facility it provides; sequences of commands can be conveniently integrated into scripts, which are in turn executed by issuing a single command. Some functions specific to the OMA hardware that have been compiled into qdm are the following:
[Figure 6.12. Connecting multiple [...]. Annotations: busses on different boards connected together, to have more than four processors on a single bus; processors on separate busses with handshake between busses, which helps in scalability of the system.]
[...] file.bit: load the OMA Xilinx with the configuration specified by file.bit
[...] proc#: load bootstrap monitor code into the specified processor
[...] file.lod proc#: load a DSP96002 .lod file into the specified processor
[...]: load the OMA bus access schedule memory
These functions use existing qdm functions for reading and writing values to the DSP56000 memory locations that are mapped to the OMA interface. Each processor is programmed through its Host Interface via the shared bus. First, a monitor program (omamon.lod) consisting of interrupt routines is loaded and run on the selected processor. Code is then loaded into processor memory by writing address and data values into the HI port and interrupting the processor. The interrupt routine on the processor is responsible for inserting data into the specified memory location. The S-56X host forces different interrupt routines, for specifying which of the three (X, Y, or P) memories the address refers to and for specifying a read or a write to or from that location. This scheme is similar to that employed in downloading code onto the S-56X card. Status and control registers on the OMA board are memory mapped to the S-56X address space and can be accessed to reset, reboot, monitor, and debug the board. Tcl scripts were written to simplify commands that are used most often (e.g., ‘change y:fff0 [...]’ was aliased to ‘omareset’). A Ptolemy multiprocessor hardware target [Pro9...] was written for the OMA board, for automatic partitioning, code generation, and execution of an algorithm from a block diagram specification. A simple heterogeneous multiprocessor target was also written in Ptolemy for the OMA and S-56X combination; this target generates DSP56000 code for the S-56X card, and generates DSP96000 multiprocessor code for the OMA.
A mechanism has been implemented for performing I/O over the shared bus. We make use of the fact that I/O for DSP applications is periodic; samples (or blocks of samples) typically arrive at constant, periodic intervals, and the processed output is again required (by, say, a digital-to-analog convertor) at periodic intervals. With this observation, it is in fact possible to schedule the I/O operations within the multiprocessor schedule, and consequently determine when, relative to the other shared bus accesses due to IPC, the shared bus is required for I/O. This allows us to include bus accesses for I/O in the bus access order list. In our particular implementation, I/O is implemented as shared address locations that address the Tx and Rx registers (Section 6.6.5), which in turn communicate with [...]. A processor accesses these registers as if they were part of shared memory. It obtains access to these registers when the transaction controller grants access to the shared bus; bus grants for the purpose of I/O are taken into account when constructing the access order list. Thus accesses to shared I/O resources can be ordered much as accesses to the shared bus and memory are.
The ordered memory access strategy can also be applied to run-time parameter control. By run-time parameter control we mean controlling parameters in the algorithm (gain of some component, bit-rate of a coder, pitch of synthesized music sounds, etc.) while the algorithm is running in real time on the hardware. Such a feature is obviously very useful and sometimes indispensable. Usually, one associates such parameter control with an asynchronous user input: the user changes a parameter (ideally by means of a suitable user interface) and this change causes an interrupt to occur on a processor; the interrupt handler then performs the appropriate operations that cause the parameter change that the user requested. For the OMA architecture, however, unpredictable interrupts are not desirable, as was noted earlier in this chapter; on the other hand, shared I/O and IPC are relatively inexpensive owing to the ordered-transactions mechanism. Parameter control is therefore implemented in the following fashion. The host handles user interrupts; whenever a parameter is altered, the host receives an interrupt and modifies a particular location in its memory. The OMA board, on the other hand, receives the contents of this location in every iteration, whether or not the location was actually modified. Thus the OMA processors never "see" a user-created interrupt; they in essence update the parameter corresponding to the value stored in this location in every iteration of the dataflow graph. Since reading in the value of this location once per iteration costs only a small number of instruction cycles, the overhead involved in this scheme is minimal. An added practical advantage of the above scheme is that the Tcl/Tk primitives that have been implemented in Ptolemy [Pto98] can be directly used with the OMA board for parameter control purposes.
This section discusses several applications that are implemented using the OMA hardware and software described above.
The Karplus-Strong algorithm [KS83] is a well known approach for synthesizing the sound of a plucked string. The basic idea is to pass a noise source through a feedback loop containing a delay, a low pass filter, and a multiplier with a gain of less than one. The delay in the loop determines the pitch of the generated sound, and the multiplier gain determines the rate of decay. Multiple voices can be generated and combined by implementing one feedback loop for each voice and then adding the outputs from all the loops. If we want to generate sound at a sampling rate of 44.1 KHz (the compact disc sampling rate), we can implement 7 voices on a single processor in real time using blocks from the Ptolemy code generation library. These 7 voices consume 370 instruction cycles out of the 380 instruction cycles available. On the OMA board, we implemented 28 voices [...]; the specification has four hierarchical blocks, each consisting of 7 copies of the basic feedback loop, one for each voice. The outputs are added together, and this sum is fed to a digital-to-analog convertor after being converted into fixed-point representation from floating point representation.
Figure 6.14. Hierarchical specification of the Karplus-Strong algorithm in 28 voices.
A schedule for this application is shown in Figure 6.15. The makespan for this schedule is 377 instruction cycles, which is just within the maximum allowable limit of 380. This schedule uses relatively few pairs of sends and receives, and is therefore not communication-intensive. Even so, a higher IPC cost than the three instruction cycles the OMA architecture affords us would not allow this schedule to execute in real time at a 44.1 KHz sampling rate, because there is only a three-instruction-cycle margin between the makespan of this schedule and the maximum allowable makespan. To schedule this application, we employed Hu-level scheduling along with manual assignment of some of the blocks.
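The feedback loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Ptolemy block implementation used on the OMA board: the function names `karplus_strong` and `mix_voices`, the two-point averaging low pass filter, and the default gain of 0.996 are all assumptions made for the example.

```python
import random

def karplus_strong(pitch_hz, sample_rate=44100, num_samples=2000,
                   gain=0.996, seed=0):
    """Generate one plucked-string voice as a list of floats (hypothetical helper)."""
    rng = random.Random(seed)
    # The loop delay (in samples) sets the pitch of the voice.
    delay_len = int(round(sample_rate / pitch_hz))
    # The delay line starts out filled with a noise burst (the "pluck").
    line = [rng.uniform(-1.0, 1.0) for _ in range(delay_len)]
    out = []
    prev = 0.0
    for n in range(num_samples):
        cur = line[n % delay_len]
        # Two-point average acts as the low pass filter; gain < 1 sets the decay rate.
        line[n % delay_len] = gain * 0.5 * (cur + prev)
        prev = cur
        out.append(cur)
    return out

def mix_voices(pitches, **kwargs):
    """One feedback loop per voice; the outputs of all loops are added."""
    voices = [karplus_strong(p, seed=i, **kwargs) for i, p in enumerate(pitches)]
    return [sum(samples) for samples in zip(*voices)]
```

Calling `mix_voices([220.0, 330.0, 441.0])` produces a three-voice mixture; on the OMA board the corresponding sum would additionally be converted to fixed point before reaching the convertor.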
A Quadrature Mirror Filter (QMF) bank consists of a set of analysis filters used to decompose a signal (usually audio) into frequency bands, and a bank of synthesis filters used to reconstruct the decomposed signal [Vai93]. In the analysis bank, a filter pair is used to decompose the signal into high pass and low pass components, which are then decimated by a factor of two. The low pass component is then decomposed again into low pass and high pass components, and this process proceeds recursively. The synthesis bank performs the complementary operation of upsampling, filtering, and combining the high pass and low pass components; this process is again performed recursively to reconstruct the input signal. Figure 6.16(a) shows a block diagram of a synthesis filter bank followed by an analysis bank.
QMF filter banks are designed such that the analysis bank cascaded with the synthesis bank yields a transfer function that is a pure delay (i.e., has unity response except for a delay between the input and the output). Such filter banks are also called perfect-reconstruction filter banks, and they find applications in high quality audio compression; each frequency band is quantized according to its energy content and its perceptual importance. Such a coding scheme is employed in the audio portion of the [...]. We implemented a perfect-reconstruction filter bank to decompose audio from a compact disc player into [...] bands. The synthesis bank was implemented together with the analysis part. There are a total of 36 filters, of [...] taps each. This is shown hierarchically in Figure [...]. Delay blocks are required in the first [...] output paths of the analysis bank to compensate for the delay through successive stages of the analysis filter bank.
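To make the recursive analysis/synthesis structure concrete, the following sketch builds a tree-structured filter bank from the two-tap Haar filter pair. The Haar pair is chosen here purely because it is the shortest perfect-reconstruction pair; it is an assumption of this example and not the filter set used in the OMA implementation, and all function names are hypothetical. With Haar filters applied blockwise the overall delay happens to be zero, so no compensating delay blocks are needed in this toy version.

```python
import math

S = 1.0 / math.sqrt(2.0)

def analyze(x):
    """Split x (even length) into low pass and high pass bands, each decimated by two."""
    low = [S * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    high = [S * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return low, high

def synthesize(low, high):
    """Upsample and combine one low/high pair; exact inverse of analyze()."""
    out = []
    for l, h in zip(low, high):
        out.append(S * (l + h))
        out.append(S * (l - h))
    return out

def analysis_bank(x, levels):
    """Recursively decompose the low pass branch; returns high bands, then the final low band."""
    bands = []
    low = x
    for _ in range(levels):
        low, high = analyze(low)
        bands.append(high)
    bands.append(low)
    return bands

def synthesis_bank(bands):
    """Recombine bands from the coarsest stage back up to the input rate."""
    low = bands[-1]
    for high in reversed(bands[:-1]):
        low = synthesize(low, high)
    return low
```

For a 16-sample input and three levels, `analysis_bank` returns bands of length 8, 4, 2 and 2, and `synthesis_bank` reproduces the input up to floating point rounding.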
There are 1010 instruction cycles of computation per sample in this example. Using Sih's Dynamic Level (DL) scheduling heuristic, we were able to achieve an average iteration period of 366 instruction cycles. The schedule that is actually constructed (Gantt chart of Figure [...]) operates on a block of [...] samples, because this number of samples is needed for each actor in the graph to fire at least once; this makes manual scheduling very difficult. We found that the DL heuristic performs close to 2[...] better than the classic Hu-level heuristic in this example, although the DL heuristic takes longer to compute the schedule compared to Hu-level.
Figure 6.16. (a) [...] delay blocks [...]. (b) [...] for a [...] band analysis and synthesis on four processors (using Sih's DL heuristic [Sih91]).
If the processors had independent access to the shared memory (if the shared memory were 4-ported, for example), we could achieve an ideal speedup of four, because the computation on each processor is independent of the others except for data input and output. For this example, data partitioning, shared memory allocation, scheduling, and tuning of the assembly program were done by hand, using the 256-point complex FFT block in the Ptolemy CG96 domain as a building block. The Gantt chart for the hand-generated schedule, including IPC costs through the OMA controller, is shown in Figure 6.17.
In this chapter, we discussed the ideas behind the ordered-transactions scheduling strategy. This strategy combines compile time analysis of the IPC pattern with simple hardware support to minimize interprocessor communication overhead. We discussed the hardware design and implementation details of a prototype shared-bus multiprocessor, the Ordered Memory Access (OMA) architecture, that uses the ordered-transactions scheduling strategy to statically assign the sequence of processor accesses to shared memory. External I/O and user-specified control inputs can also be taken into account when scheduling accesses to the shared bus. We also discussed the software interface details of the prototype and illustrated some applications that were implemented on the OMA prototype.
Figure 6.17 (annotations): 1024 complex values read by each processor; write result (256 complex values).
In this chapter the limits of the ordered-transactions scheduling strategy are systematically analyzed. Recall that the self-timed schedule is obtained by first generating a fully-static schedule, with firing times $\sigma_t(v)$, and then ignoring the exact firing times specified by the fully-static schedule; the fully-static schedule itself is derived using compile time estimates of actor execution times. As defined in the previous chapter, the ordered-transactions strategy is essentially the self-timed strategy with added ordering constraints that force processors to communicate in an order predetermined at compile time. The questions addressed in this chapter are: What exactly are we sacrificing by imposing such a restriction? Is it possible to choose a transaction order such that this penalty is minimized? What is the effect of variations of task (actor) execution times on the throughput achieved by a self-timed strategy and by an ordered transactions strategy?
The effect of imposing a transaction order on a self-timed schedule is best illustrated by the following example. Let us assume that we use the dataflow graph and its schedule that was introduced in Chapter 4 (Figure [...]), and that we enforce the transaction order (obtained by sorting the $\sigma_t$ values) of Figure 6.1; we reproduce these for convenience in Figure 7.1 (a) and (b). If we observe how the schedule "evolves" as it is executed in a self-timed manner (essentially a simulation in time of when each processor executes actors assigned to it), we get the "unfolded" schedule of Figure 7.2; successive iterations of the HSDFG overlap in a natural manner. This is of course an idealized scenario where IPC costs are ignored; we do so to avoid unnecessary detail in the diagram, since IPC costs can be included in our analysis in a straightforward manner. Note that the self-timed schedule in Figure 7.2 eventually settles to a periodic pattern consisting of two iterations of the HSDFG; the average iteration period under the self-timed schedule is 9 units. The average iteration period (which we will refer to as $T_{ST}$) for such an idealized (zero IPC cost) self-timed schedule represents a lower bound on the iteration period achievable by any schedule that maintains the same processor assignment and actor ordering. This is because the only run-time constraint on processors that the self-timed schedule imposes is due to data dependencies: each processor executes actors assigned to it (including the communication actors) according to the compile-time-determined order. An actor at the head of this ordered list is executed as soon as data is available for it. Any other schedule that maintains the same processor assignment and actor ordering, and respects data precedences in G, cannot result in an execution where actors fire earlier than they do in the idealized self-timed schedule. In particular, the overlap of successive iterations of the HSDFG in the idealized self-timed schedule ensures that $T_{ST} \le T_{FS}$ in general.
The self-timed schedule allows reordering among IPCs at run-time. In fact, we observe from Figure 7.2 that once the self-timed schedule settles into a periodic pattern, IPCs in successive iterations are ordered differently: in the first iteration, the order in which IPCs occur is indeed the transaction order shown in Figure 7.1 (b).
However, once the schedule settles into a periodic pattern, the order alternates between two different sequences from one iteration to the next.
In contrast, if we impose the transaction order in Figure 7.1(b), which enforces the order
$$(s_1, r_1, s_2, r_2, s_3, r_3, s_4, r_4, s_5, r_5, s_6, r_6),$$
the resulting ordered transactions schedule evolves as shown in Figure 7.3. Notice that enforcing this schedule introduces idle time (hatched rectangles); as a result, $T_{OT}$, the average iteration period for the ordered transactions schedule, is 10 units, which (as expected) is larger than the iteration period of the ideal self-timed schedule with zero arbitration and synchronization overhead, $T_{ST}$ (9 units), but is smaller than $T_{FS}$ (11 units). In general $T_{FS} \ge T_{OT} \ge T_{ST}$: the self-timed schedule only has assignment and ordering constraints, the ordered transactions schedule has the transaction ordering constraints in addition to the constraints in the self-timed schedule, whereas the fully-static schedule has exact timing constraints that subsume the constraints in the self-timed and ordered transactions schedules. The question we would like to answer is: is it possible to choose the transaction ordering more intelligently than the straightforward $\sigma_t$-sorted order chosen in Figure 7.1(b)?
As a first step towards determining how such a "best" possible access order might be obtained, we attempt to model the self-timed execution itself and try to determine the precise effect (e.g., increase in the iteration period) of adding transaction ordering constraints. Note again that as the schedule evolves in a self-timed manner in Figure 7.2, it eventually settles into a periodically repeating pattern that spans two iterations of the dataflow graph, and the average iteration period, $T_{ST}$, is 9. We would like to determine these properties of self-timed schedules analytically without having to resort to simulation.
In the self-timed strategy a schedule S specifies the actors assigned to each processor, including the IPC actors send and receive, and specifies the order in which these actors must be executed. At run-time each processor executes the actors assigned to it in the prescribed order. When a processor executes a send it writes into a certain buffer of finite size, and when it executes a receive, it reads from a corresponding buffer. It checks for buffer overflow (on a send) and buffer underflow (on a receive) before it performs communication operations; it blocks, or suspends execution, when it detects one of these conditions.
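The blocking behavior of send and receive on a finite buffer can be illustrated with Python's standard `queue.Queue`, whose `put` blocks while the queue is full and whose `get` blocks while it is empty. This is only a software analogy under our own assumptions (two threads standing in for two processors, a hypothetical `run_pair` helper); it is not how the OMA hardware implements IPC.

```python
import queue
import threading

def run_pair(num_tokens, buffer_size):
    """Run one sender and one receiver over a finite buffer; return received tokens."""
    buf = queue.Queue(maxsize=buffer_size)  # finite IPC buffer
    received = []

    def sender():
        for k in range(num_tokens):
            buf.put(k)  # "send": blocks while the buffer is full

    def receiver():
        for _ in range(num_tokens):
            received.append(buf.get())  # "receive": blocks while the buffer is empty

    threads = [threading.Thread(target=sender), threading.Thread(target=receiver)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return received
```

Whatever the relative speeds of the two threads, the tokens arrive in the order in which they were sent, and neither side ever overruns the buffer.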
We model a self-timed schedule using an HSDFG $G_{ipc} = (V, E_{ipc})$ derived from the application graph $G = (V, E)$. The graph $G_{ipc}$, which we will refer to as the interprocessor communication graph (IPC graph for short), models the fact that actors of G assigned to the same processor execute sequentially, and it models constraints due to interprocessor communication. For example, the self-timed schedule in Figure 7.1 (b) can be modeled by the IPC graph in Figure 7.4.
Figure 7.3. Schedule evolution when the transaction order of [...] is enforced (hatched regions: idle time due to the ordering constraint).
The IPC graph has the same vertex set V as G, corresponding to the set of actors in G. The self-timed schedule specifies the actors assigned to each processor, and the order in which they execute. For example, in Figure 7.1, processor 1 executes one actor and then E, repeatedly. We model this in $G_{ipc}$ by drawing a cycle around the vertices corresponding to these two actors, and placing a delay on the edge from E back to the first actor. The delay-free edge into E represents the fact that the k th execution of the first actor precedes the k th execution of E, and the edge from E back to the first actor, with a delay, represents the fact that the k th execution of the first actor can occur only after the (k - 1) th execution of E has completed. Thus if actors $v_1, v_2, \ldots, v_n$ are assigned to the same processor in that order, then $G_{ipc}$ would have a cycle $((v_1, v_2), (v_2, v_3), \ldots, (v_{n-1}, v_n), (v_n, v_1))$, with $\mathrm{delay}((v_n, v_1)) = 1$ (because $v_1$ is executed first). If there are P processors in the schedule, then we have P such cycles, one corresponding to each processor. The additional edges due to these constraints are shown as dashed arrows in Figure 7.4.
As mentioned before, edges in G that cross processor boundaries after scheduling represent interprocessor communication. Communication actors (send and receive) are inserted for each such edge; these are shown in Figure 7.1.
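The construction of the IPC graph from a self-timed schedule can be sketched as follows. The representation is our own assumption for illustration: each processor is given as a list of actor names in execution order (with send and receive actors already inserted), each edge is a `(source, destination, delay)` triple, and the function name `build_ipc_graph` is hypothetical.

```python
def build_ipc_graph(proc_order, comm_edges):
    """Return the IPC graph edges as (src, dst, delay) triples.

    proc_order: one list of actors per processor, in execution order.
    comm_edges: (send_actor, receive_actor, delay) triples for the
                edges that cross processor boundaries.
    """
    edges = []
    for actors in proc_order:
        # Delay-free edges enforce sequential execution on one processor.
        for a, b in zip(actors, actors[1:]):
            edges.append((a, b, 0))
        # The wrap-around edge carries one delay: iteration k of the first
        # actor waits for iteration k - 1 of the last actor.
        edges.append((actors[-1], actors[0], 1))
    # Interprocessor communication (send -> receive) edges.
    edges.extend(comm_edges)
    return edges
```

For a schedule with P processors this yields exactly P processor cycles plus one edge per send/receive pair.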
Figure 7.4. The IPC graph for the schedule in Figure 7.1 (legend: critical cycle, send, receive).
The IPC graph has the same semantics as an HSDFG, and its execution models the execution of the corresponding self-timed schedule. The following definitions are useful to formally state the constraints represented by the IPC graph. Time is modeled as an integer that can be viewed as a multiple of a base clock. Recall that the function $\mathrm{start}(v, k)$ represents the time at which the k th execution of actor v starts in the self-timed schedule. The function $\mathrm{end}(v, k)$ represents the time at which the k th execution of the actor v ends and v produces data tokens at its output edges, and we set $\mathrm{start}(v, k) = 0$ and $\mathrm{end}(v, k) = 0$ for $k < 0$ as the "initial conditions". The [...] values are specified by the schedule.
Recalling Definition 4.2, as per the semantics of an HSDFG, each edge $(v_j, v_i)$ of $G_{ipc}$ represents the following data dependence constraint:
$$\mathrm{start}(v_i, k) \ge \mathrm{end}(v_j, k - \mathrm{delay}((v_j, v_i))) \quad \text{for all } (v_j, v_i) \in E_{ipc}, \text{ for all } k \ge \mathrm{delay}((v_j, v_i)). \tag{7-4}$$
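The constraints in (7-4) are due both to communication edges (representing synchronization between processors) and to edges that represent sequential execution of actors assigned to the same processor. Also, to model execution times of actors we associate an execution time $t(v)$ with each vertex of the IPC graph; $t(v)$ assigns a positive integer execution time to each actor v (which can be interpreted as cycles of a base clock). Interprocessor communication costs can be represented by assigning execution times to the send and receive actors. Now, we may substitute $\mathrm{end}(v_j, k) = \mathrm{start}(v_j, k) + t(v_j)$ in (7-4) to obtain
$$\mathrm{start}(v_i, k) \ge \mathrm{start}(v_j, k - \mathrm{delay}((v_j, v_i))) + t(v_j) \quad \text{for all } (v_j, v_i) \in E_{ipc}. \tag{7-5}$$
In the self-timed schedule, actors fire as soon as data is available at all their input edges. Such an "as soon as possible" (ASAP) firing pattern implies:
$$\mathrm{start}(v_i, k) = \max\left(\left\{ \mathrm{start}(v_j, k - \mathrm{delay}((v_j, v_i))) + t(v_j) \;\middle|\; (v_j, v_i) \in E_{ipc} \right\}\right). \tag{7-6}$$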
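The as-soon-as-possible firing rule can be turned directly into a small simulation of the self-timed schedule. The sketch below is our own illustration, assuming the IPC graph is given as `(src, dst, delay)` triples and execution times as a dictionary; the function name `simulate_self_timed` is hypothetical. Following the initial conditions above, the end time of any execution with a negative index is taken to be zero.

```python
def simulate_self_timed(edges, t, iterations):
    """Return {actor: [start(actor, 0), start(actor, 1), ...]} under ASAP firing."""
    preds = {v: [] for v in t}
    for src, dst, delay in edges:
        preds[dst].append((src, delay))
    memo = {}

    def end(v, k):
        # Initial condition: executions with k < 0 end at time 0.
        return 0 if k < 0 else start(v, k) + t[v]

    def start(v, k):
        # An actor starts when its latest input token becomes available.
        if (v, k) not in memo:
            memo[(v, k)] = max([end(u, k - d) for u, d in preds[v]], default=0)
        return memo[(v, k)]

    result = {v: [] for v in t}
    for k in range(iterations):  # iteration-major order keeps recursion shallow
        for v in t:
            result[v].append(start(v, k))
    return result
```

For a two-actor loop with edges `("A", "B", 0)` and `("B", "A", 1)` and execution times 2 and 3, successive starts of A are 5 time units apart, which is the sum of the execution times on the cycle divided by its single delay. The recursion terminates only for graphs without delay-free cycles.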
In contrast, recall that in the fully-static schedule we would force actors to fire periodically according to
$$\mathrm{start}(v, k) = \sigma_t(v) + k\,T_{FS}. \tag{7-7}$$
The IPC graph has the same semantics as a marked graph in Petri net theory [Pet81][RCG80]: the transitions of a marked graph correspond to the nodes of the IPC graph, the places of a marked graph correspond to edges, and the initial marking of a marked graph corresponds to initial tokens on the edges. The IPC graph is also similar to Reiter's computation graph [Rei68]. The same properties hold for it, and we state some of the relevant properties here. The proofs listed here are similar to the proofs for the corresponding properties in marked graphs and computation graphs in the references above.
Lemma 7.1: The number of tokens in any cycle of the IPC graph is always conserved over all possible valid firings of actors in the graph, and is equal to the path delay of that cycle.
Proof: For each cycle C in the IPC graph, the number of tokens on C can only change when actors that are on it fire, because actors not on C remove and place tokens only on edges that are not part of C. If
$$C = ((v_1, v_2), (v_2, v_3), \ldots, (v_{n-1}, v_n), (v_n, v_1)), \tag{7-8}$$
and any actor $v_k$ ($1 \le k \le n$) fires, then exactly one token is moved from the edge $(v_{k-1}, v_k)$ to the edge $(v_k, v_{k+1})$, where $v_0 \equiv v_n$ and $v_{n+1} \equiv v_1$. This conserves the total number of tokens on C.
Definition: An HSDFG G is said to be deadlocked if at least one of its actors cannot fire an infinite number of times in any valid sequence of firings of actors in G. Thus, when executing a valid schedule for a deadlocked HSDFG, some actor fires only a finite number of times, and is never enabled to fire subsequently.
Lemma 7.2: An HSDFG G (in particular, an IPC graph) is free of deadlock if and only if it does not contain delay-free cycles.
Proof: Suppose there is a delay-free cycle $C = ((v_1, v_2), (v_2, v_3), \ldots, (v_n, v_1))$ in G. By Lemma 7.1 none of the edges $(v_1, v_2), \ldots, (v_n, v_1)$ can contain tokens during any valid execution of G. Then each of the actors $v_1, \ldots, v_n$ has at least one input that never contains any data. Thus, none of the actors on C are ever enabled to fire, and hence G is deadlocked. Conversely, suppose G is deadlocked, i.e., there is one actor $v_1$ that never fires after a certain sequence of firings of actors in G. Thus, after this sequence of firings, there must be an input edge $(v_2, v_1)$ that never contains data. This implies that the actor $v_2$ in turn never gets enabled to fire, which in turn implies that there must be an edge $(v_3, v_2)$ that never contains data. In this manner we can trace a path $p = ((v_n, v_{n-1}), \ldots, (v_3, v_2), (v_2, v_1))$, with $|V|$ edges, back from $v_1$ to $v_n$ that never contains data on its edges after a certain sequence of firings of actors in G. Since G contains only $|V|$ actors, p must visit some actor twice, and hence must contain a cycle C. Since the edges of p do not contain data, C is a delay-free cycle.
Definition: A schedule S is said to be deadlocked if after a certain finite time at least one processor blocks (on a buffer full or buffer empty condition) and stays blocked.
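The deadlock condition in the lemma suggests a simple compile-time check: keep only the zero-delay edges and look for a directed cycle among them. The following is a minimal sketch using a depth-first search; the edge representation `(src, dst, delay)` and the function name `has_delay_free_cycle` are assumptions made for this example.

```python
def has_delay_free_cycle(actors, edges):
    """True if the graph restricted to zero-delay edges contains a directed cycle."""
    succ = {v: [] for v in actors}
    for src, dst, delay in edges:
        if delay == 0:
            succ[src].append(dst)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {v: WHITE for v in actors}

    def visit(v):
        color[v] = GREY
        for w in succ[v]:
            if color[w] == GREY:  # back edge: a delay-free cycle exists
                return True
            if color[w] == WHITE and visit(w):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in actors)
```

A processor cycle built with one delay on its wrap-around edge passes this check; removing that delay makes the function report a deadlock.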
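If the specified schedule is deadlock-free then the corresponding IPC graph is deadlock-free. This is because a deadlocked IPC graph would imply that a set of processors depend on data from one another in a cyclic manner, which in turn implies a schedule that displays deadlock.
Lemma 7.3: The asymptotic iteration period for a strongly connected IPC graph G, when actors execute as soon as data is available at all inputs, is given by:
$$T = \max_{\text{cycle } C \text{ in } G} \left\{ \frac{\sum_{v \text{ on } C} t(v)}{\mathrm{Delay}(C)} \right\}. \tag{7-9}$$
Note that $\mathrm{Delay}(C) > 0$ for an IPC graph constructed from an admissible schedule. This result has been proved in so many different contexts ([...]) that we do not present another proof of this fact here. The quotient in (7-9),
$$\frac{\sum_{v \text{ on } C} t(v)}{\mathrm{Delay}(C)}, \tag{7-10}$$
is called the cycle mean of the cycle C. The entire quantity on the right hand side of (7-9) is called the maximum cycle mean of the strongly connected IPC graph G. If the IPC graph contains more than one SCC, then different SCCs may have different asymptotic iteration periods, depending on their individual maximum cycle means. In such a case, the iteration period of the overall graph (and hence the self-timed schedule) is the maximum over the maximum cycle means of all the SCCs of $G_{ipc}$, because the execution of the schedule is constrained by the slowest component in the system. Henceforth, we will define the maximum cycle mean as follows.
Definition: The maximum cycle mean of an IPC graph $G_{ipc}$, denoted by $\mathrm{MCM}(G_{ipc})$, is the maximal cycle mean over all strongly connected components of $G_{ipc}$. That is,
$$\mathrm{MCM}(G_{ipc}) = \max_{\text{cycle } C \text{ in } G_{ipc}} \left\{ \frac{\sum_{v \text{ on } C} t(v)}{\mathrm{Delay}(C)} \right\}. \tag{7-11}$$
Note that $\mathrm{MCM}(G)$ may be a non-integer rational quantity. We will use the term MCM instead of $\mathrm{MCM}(G)$ when the graph being referred to is clear from the context. A fundamental cycle in $G_{ipc}$ whose cycle mean is equal to MCM is called a critical cycle of $G_{ipc}$. Thus the throughput of the system of processors executing a particular self-timed schedule is equal to the corresponding $1/\mathrm{MCM}$ value.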
For example, in Figure 7.4, $G_{ipc}$ has one SCC, and its maximal cycle mean is 9 time units, in agreement with the average iteration period of the self-timed schedule in Figure 7.2. This corresponds to the critical cycle ((B, E), (E, [...]), [...], ([...], B)). We have not included IPC costs in this calculation, but these can be included in a straightforward manner by appropriately setting the execution times of the send and receive actors.
As explained in Section 3.15, the maximum cycle mean can be calculated in time that is polynomial in $|V|$ and $|E_{ipc}|$ and logarithmic in the sum of $t(v)$ over all actors in the HSDFG.
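For small graphs the maximum cycle mean in (7-9) can be checked by brute force, enumerating every simple cycle and taking the largest ratio of total execution time to total delay. The sketch below does exactly that and is exponential in the worst case; it is meant only as an illustration and is not the efficient algorithm of Section 3.15. The edge format `(src, dst, delay)` and the function name are our own assumptions.

```python
from fractions import Fraction

def max_cycle_mean(edges, t):
    """Brute-force MCM: max over simple cycles of sum(t) / sum(delay)."""
    succ = {v: [] for v in t}
    for src, dst, delay in edges:
        succ[src].append((dst, delay))
    index = {v: i for i, v in enumerate(t)}
    best = None

    def extend(root, v, time_sum, delay_sum, on_path):
        nonlocal best
        for w, d in succ[v]:
            if w == root:  # closed a simple cycle through root
                total_delay = delay_sum + d
                if total_delay == 0:
                    raise ValueError("delay-free cycle: the graph deadlocks")
                mean = Fraction(time_sum, total_delay)
                if best is None or mean > best:
                    best = mean
            elif index[w] > index[root] and w not in on_path:
                # Only visit higher-indexed vertices so that each cycle is
                # enumerated once, rooted at its lowest-indexed vertex.
                extend(root, w, time_sum + t[w], delay_sum + d, on_path | {w})

    for root in t:
        extend(root, root, t[root], 0, {root})
    return best  # None if the graph has no cycle
```

Exact rational arithmetic is used because, as noted above, the maximum cycle mean need not be an integer.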
If we only have execution time estimates available instead of exact values, and we set the $t(v)$ values in the previous section to be these estimated values, then we obtain the estimated iteration period by calculating MCM. Henceforth we will assume that we know the estimated throughput $1/\mathrm{MCM}$, calculated by setting the $t(v)$ values to the available timing estimates. As discussed in Chapter 1, for most practical scenarios, we can only assume such compile time estimates, rather than clock-cycle accurate execution time estimates. In fact, this is the reason we had to rely on self-timed scheduling, and we proposed the ordered transaction strategy as a means of achieving efficient IPC despite the fact that we do not assume knowledge of exact actor execution times. Section 4.9 discusses estimation techniques for actor execution times.
An edge $(v_j, v_i)$ with zero delay represents the constraint $\mathrm{start}(v_i, k) \ge \mathrm{end}(v_j, k)$. The ordering constraints can therefore be expressed as a set of edges between communication actors. For example, the constraints $O = (s_1, r_1, s_2, r_2, s_3, r_3, s_4, r_4, s_5, r_5, s_6, r_6)$ applied to the IPC graph of Figure 7.4 are represented by the graph in Figure 7.5. If we call these additional ordering constraint edges $E_{OT}$ (solid arrows in Figure 7.5), then the graph $(V, E_{ipc} \cup E_{OT})$ represents the constraints in the ordered transactions schedule, as it evolves in Figure 7.3. Thus, the maximum cycle mean of $(V, E_{ipc} \cup E_{OT})$ represents the effect of adding the ordering constraints. The critical cycle C of this graph is drawn in Figure 7.5; it is different from the critical cycle in Figure 7.4 because of the added transaction ordering constraints. Ignoring communication costs, the MCM is 10 units, which was also observed from the evolution of the transaction constrained schedule in Figure 7.3.
Figure 7.5. Transaction ordering constraints (legend: receive).
The problem of finding an "optimal" transaction order can therefore be stated as: determine a transaction order O such that the resultant constraint edges $E_{OT}$ do not increase the MCM.
We noted earlier that as the self-timed schedule in Figure 7.2 evolves, it eventually settles into a periodic repeating pattern that spans two iterations of the dataflow graph. It can be shown that a self-timed schedule always settles down into a periodic execution pattern; in [BCOQ92] the authors show that the firing times of transitions in a marked graph are asymptotically periodic. Interpreted in our notation, for any strongly connected HSDFG there exist K and N such that
$$\mathrm{start}(v_i, k + N) = \mathrm{start}(v_i, k) + \mathrm{MCM}(G_{ipc}) \times N$$
for all $v_i \in V$ and for all $k > K$. Thus, after a "transient" that lasts K iterations, the self-timed schedule evolves into a periodic pattern. The periodic pattern itself spans N iterations; we call N the periodicity. The periodicity depends on the number of delays in the critical cycles of $G_{ipc}$, and it can be high [...]. For example, the IPC graph of Figure 7.4 has one critical cycle with two delays on it, which gives a periodicity of two for the schedule in Figure 7.2. The "transient" region defined by K (which is [...] in Figure 7.2) can also be exponential.
The effect of transients followed by a periodic regime is essentially due to properties of longest paths in weighted directed graphs. These effects have been studied in the context of instruction scheduling [...], as-soon-as-possible firing of transitions in Petri nets [...], and determining clock schedules for sequential logic circuits [...]. In [...] the authors note that if instructions in an iterative program (represented as a dependency graph) are scheduled in an as-soon-as-possible fashion, a pattern of parallel instructions "emerges", and the authors show how determining this pattern (essentially by simulation) leads to [...] parallelization. In [...] a technique for determining [...] is proposed. In [...] the author studies periodic firing patterns of transitions in Petri nets. The iterative algorithms for determining clock schedules in [...] have convergence properties similar to the transients in self-timed schedules (their algorithm converges when an equivalent self-timed schedule reaches a periodic regime).
Returning to the problem of determining the optimal transaction order, one possible scheme is to derive the transaction order from the repeating pattern that the self-timed schedule settles into. That is, instead of using the transaction order of Figure 6.1, if we enforce the transaction order that repeats over two iterations in the evolution of the self-timed schedule of Figure 7.2, the ordered transactions schedule would "mimic" the self-timed schedule exactly, and we would obtain an ordered transactions schedule that performs as well as the ideal self-timed schedule, and yet involves low IPC costs in practice. However, as pointed out above, the number of iterations that the repeating pattern spans depends on the critical cycles of $G_{ipc}$, and it can be exponential in the size of the HSDFG [BCOQ92]. In addition the "transient" region before the schedule settles into a repeating pattern can also be exponential. Consequently, the memory requirements for the controller that enforces the transaction order can be prohibitively large in certain cases; in fact, even for the example of Figure 7.2, the doubling of the controller memory that such a strategy entails may be unacceptable. We therefore restrict ourselves to determining and enforcing a transaction order that spans only one iteration of the HSDFG; in the following section we show that there is no sacrifice in imposing such a restriction and we discuss how such an "optimal" transaction order is obtained.
In this section we show how to determine an order O* on the IPCs in the schedule such that imposing O* yields an ordered transactions schedule that has an iteration period within one unit of the ideal self-timed schedule ($T_{ST} \le T_{OT} \le \lceil T_{ST} \rceil$). Thus, imposing the order we determine results in essentially no loss in performance over an unrestrained schedule, and at the same time we get the benefit of cheaper IPC. Our approach to determining the transaction order O* is to modify a given fully-static schedule so that the resulting fully-static schedule has $T_{FS}$ equal to $\lceil T_{ST} \rceil$, and then to derive the transaction order from that modified schedule. Intuitively it appears that, for a given processor assignment and ordering of actors on processors, the self-timed approach performs better than the fully-static or ordered transactions approach simply because it allows successive iterations to overlap. The following result, however, tells us that it is always possible to modify any given fully-static schedule so that it performs nearly as well as its self-timed counterpart. Stated more precisely:
Theorem 7.1: Given a fully-static schedule $S \equiv \{\sigma_p(v), \sigma_t(v), T_{FS}\}$, let $T_{ST}$ be the average iteration period for the corresponding self-timed schedule (as mentioned before, $T_{FS} \ge T_{ST}$). Suppose $T_{FS} > T_{ST}$; then there exists a valid fully-static schedule S' that has the same processor assignment as S, the same order of execution of actors on each processor, but an iteration period of $\lceil T_{ST} \rceil$. That is, $S' \equiv \{\sigma_p(v), \sigma'_t(v), \lceil T_{ST} \rceil\}$ where, if actors $v_i$, $v_j$ are on the same processor (i.e., $\sigma_p(v_i) = \sigma_p(v_j)$), then $\sigma_t(v_i) > \sigma_t(v_j) \Rightarrow \sigma'_t(v_i) > \sigma'_t(v_j)$. Furthermore, S' is obtained by solving the following set of linear inequalities for $\sigma'_t$:
$$\sigma'_t(v_j) - \sigma'_t(v_i) \le \lceil T_{ST} \rceil \times \mathrm{delay}((v_j, v_i)) - t(v_j) \quad \text{for each edge } (v_j, v_i) \text{ in } G_{ipc}. \tag{7-14}$$
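Proof: Let S' have a period equal to T. Then, under the schedule S', the k th starting time of actor $v_i$ is given by
$$\mathrm{start}(v_i, k) = \sigma'_t(v_i) + kT. \tag{7-15}$$
Also, data precedence constraints imply (as in (7-5))
$$\mathrm{start}(v_i, k) \ge \mathrm{start}(v_j, k - \mathrm{delay}((v_j, v_i))) + t(v_j) \quad \text{for all } (v_j, v_i) \in E_{ipc}. \tag{7-16}$$
Substituting (7-15) in (7-16), we have
$$\sigma'_t(v_i) + kT \ge \sigma'_t(v_j) + (k - \mathrm{delay}((v_j, v_i)))\,T + t(v_j) \quad \text{for all } (v_j, v_i) \in E_{ipc}.$$
That is,
$$\sigma'_t(v_j) - \sigma'_t(v_i) \le T \times \mathrm{delay}((v_j, v_i)) - t(v_j) \quad \text{for all } (v_j, v_i) \in E_{ipc}. \tag{7-17}$$
Note that the construction of $G_{ipc}$ ensures that processor assignment constraints are automatically met: if $\sigma_p(v_i) = \sigma_p(v_j)$ and $v_i$ is to be executed immediately after $v_j$, then there is an edge $(v_j, v_i)$ in $G_{ipc}$. The relations in (7-17) represent a system of $|E_{ipc}|$ inequalities in $|V|$ unknowns (the quantities $\sigma'_t(v_i)$).
The system of inequalities in (7-17) is a difference constraint problem that can be solved in polynomial time ($O(|E_{ipc}||V|)$) using the Bellman-Ford shortest-path algorithm, described in Section 3.14. Recall that a feasible solution to a given set of difference constraints exists if and only if the corresponding constraint graph does not contain a negative weight cycle; this is equivalent to the condition
$$T \ge \max_{\text{cycle } C \text{ in } G_{ipc}} \left\{ \frac{\sum_{v \text{ on } C} t(v)}{\mathrm{Delay}(C)} \right\},$$
and, from (7-9), this is equivalent to
$$T \ge T_{ST}. \tag{7-18}$$
If we set $T = \lceil T_{ST} \rceil$, then the right hand sides of the system of inequalities in (7-17) are integers, and the Bellman-Ford algorithm yields integer solutions for $\sigma'_t(v)$. This is because the weights on the edges of the constraint graph, which are equal to the right hand sides of the difference constraints, are integers if T is an integer; consequently, the shortest paths calculated on the constraint graph are integers.
Thus $S' \equiv \{\sigma_p(v), \sigma'_t(v), T = \lceil T_{ST} \rceil\}$ is a valid fully-static schedule.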
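The post-processing step in the proof can be sketched as a direct application of Bellman-Ford to the difference constraints (7-17). In the sketch below, each IPC graph edge $(v_j, v_i)$ with delay d contributes a constraint-graph edge from $v_i$ to $v_j$ of weight $T \times d - t(v_j)$, and all distances start at zero, which plays the role of a virtual source connected to every vertex. The function name `skew_start_times`, the edge format, and the final shift that makes the earliest start time zero are assumptions of this example rather than details given in the text.

```python
def skew_start_times(edges, t, period):
    """Solve sigma'(vj) - sigma'(vi) <= period * delay - t[vj] for every edge.

    edges: (vj, vi, delay) triples of the IPC graph; t: execution times.
    Returns a dict of non-negative start times, or None if infeasible
    (i.e. the period is smaller than the maximum cycle mean).
    """
    dist = {v: 0 for v in t}  # virtual source at distance 0 from every actor
    for _ in range(len(t)):
        changed = False
        for vj, vi, delay in edges:
            weight = period * delay - t[vj]
            if dist[vi] + weight < dist[vj]:  # relax constraint edge vi -> vj
                dist[vj] = dist[vi] + weight
                changed = True
        if not changed:
            break
    else:
        return None  # still relaxing after |V| passes: negative weight cycle
    earliest = min(dist.values())
    return {v: d - earliest for v, d in dist.items()}
```

For the two-actor loop with edges `("A", "B", 0)` and `("B", "A", 1)` and execution times 2 and 3, a period of 5 is feasible and a period of 4 is not, matching the condition $T \ge T_{ST}$.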
Theorem 7.1 essentially states that a fully-static schedule can be modified by skewing the relative starting times of processors so that the resulting schedule has an iteration period less than $T_{ST} + 1$; the resulting iteration period lies within one time unit of its lower bound for the specified processor assignment and actor ordering. It is possible to unfold the graph and generate a fully-static schedule with average period exactly $T_{ST}$, but the resulting increase in code size is usually not worth the benefit of (at most) one time unit decrease in the iteration period. Recall that a "time unit" is essentially the clock period; therefore, one time unit can usually be neglected. For example, the static schedule S corresponding to Figure 7.1 has $T_{FS} = 11 > T_{ST} = 9$ units. Using the procedure outlined in the proof of Theorem 7.1, we can skew the starting times of processors in the schedule S to obtain a schedule S', as shown in (7-16), that has a period equal to 9 units (Figure 7.6). Note that the processor assignment and actor ordering in the schedule of Figure 7.6 are identical to those of the schedule in Figure 7.1. The $\sigma'_t$ values are: 6, 0, 5 [...].
Theorem 7.1 may not seem useful at first sight: why not obtain a fully-static schedule that has a period of $\lceil T_{ST} \rceil$ to begin with, thus eliminating the post-processing step suggested in Theorem 7.1? Recall from Chapters 4 and 5 that a fully-static schedule is usually obtained using heuristic techniques that are either based on blocked non-overlapped scheduling (which use critical path based heuristics) [Sih91] or are based on overlapped scheduling techniques that employ list scheduling heuristics [dGH92][Lam88]. None of these techniques guarantee that the generated fully-static schedule will have an iteration period within one unit of the period achieved if the same schedule were run in a self-timed manner. Thus for a schedule generated using any of these techniques, we might be able to obtain a gain in performance, essentially for free, by performing the post-processing step suggested in Theorem 7.1. What we propose can therefore be added as an efficient post-processing step in existing schedulers. Of course, an exhaustive search procedure like the one proposed in [SI85] will certainly find the schedule S' directly.
We set the transaction order O* to be the transaction order suggested by the modified schedule S' (as opposed to the transaction order from S used in Figure 6.1). Thus for the example of Figure 7.1,
$$O^* = (\ldots, r_1, \ldots, r_3, s_2, \ldots, r_4, \ldots, r_5).$$
Imposing the transaction order $O^*$ as in Figure 7.6 results in $T_{OT}$ of 9 units, instead of the 10 units that we get if the transaction order of Figure 7.1(b) is used. Under the transaction order specified by S', $T_{ST} \le T_{OT} < T_{ST} + 1$; thus imposing the order $O^*$ ensures that the average period is within one unit of that of the unconstrained self-timed strategy. Again, unfolding may be required to obtain a transaction-ordered schedule that has a period exactly equal to $T_{ST}$, but the extra cost of a larger controller (to enforce the transaction ordering) outweighs the small gain of at most one unit reduction in the iteration period. Thus for all practical purposes $O^*$ is the optimal transaction order. The "optimality" is in the sense that the transaction order we determine statically is the best possible one, given the timing information available at compile time.
We recall that the execution times we use to determine the actor assignment and ordering in a self-timed schedule are compile time estimates, and we have been stating that static scheduling is advantageous when we have "reasonably good" compile time estimates of the execution times of actors. Also, intuitively we expect an ordered transactions schedule to be more sensitive to changes in execution times than an unconstrained self-timed schedule. In this section we attempt to formalize these notions by exploring the effect of changes in execution times of actors on the throughput achieved by a static schedule.

Compile time estimates of actor execution times may differ from their actual values at run-time for two reasons: errors in estimating the execution times of actors that otherwise have fixed execution times, and actors that display run-time variations in their execution times because of conditionals or data-dependent loops within them, for example. The first case is simple to model, and we will show in Section 7.6.1 how the throughput of a given self-timed schedule changes as a function of actor execution times. The second case is inherently difficult; how do we model run-time changes in execution times due to data dependencies, or due to events such as error handling, cache misses, and pipeline effects? In Section 7.6.2 below we briefly discuss a very simple model for such run-time variations; we assume actors have random execution times according to some known probability distribution. We conclude that analysis of even such a simple model for the expected value of the throughput is often intractable, and we discuss efficiently computable upper and lower bounds for the expected throughput.
Consider the IPC graph in Figure 7.7, which is the same IPC graph as in Figure 7.4 except that we have used a different execution time for actor H to make the example more illustrative. The number next to each actor represents the execution time of that actor. We let the execution time of actor C be $t(C) = t_C$, and we determine the iteration period as a function of $t_C$, denoted $T_{ST}(t_C)$. The iteration period is given by the maximum cycle mean (MCM). The function $T_{ST}(t_C)$ is shown in Figure 7.8. When $0 \le t_C \le 1$, a cycle through actor E is critical, and the MCM is constant at 7, since C is not on this cycle; when $1 \le t_C \le 9$, the cycle

$((B, s_1)(s_1, \ldots) \cdots (\ldots, E)(E, \ldots) \cdots (\ldots, r_3)(r_3, \ldots) \cdots (\ldots, r_5)(r_5, B))$

is critical, and since this cycle has two delays, the slope of $T_{ST}(t_C)$ is 0.5 in this region; finally, when $9 \le t_C$, the cycle $((C, s_5)(s_5, G)(G, C))$ becomes critical, and the slope is now one because there is only one delay on that cycle.

Thus the iteration period is a piecewise linear function of actor execution times. The slope of this function is zero if the actor is not on a critical cycle; otherwise it depends on the number of delays on the critical cycle(s) that the actor lies on. The slope is at most one (when the critical cycle containing the particular actor has a single delay on it). The iteration period is a convex function of actor execution times.

A function $f(x)$ is said to be convex over an interval $(a, b)$ if for every $x_1, x_2 \in (a, b)$ and $0 \le \lambda \le 1$,

$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$
Geometrically, if we plot a convex function $f(x)$ along $x$, a line drawn between two points on the curve lies above the curve (but it may overlap sections of the curve). It is easily verified geometrically that $T_{ST}(t_C)$ is convex: since this function is piecewise linear with a slope that is non-negative and non-decreasing, a line joining two points on it must lie above (but may coincide with) the curve. We can also plot $T_{ST}$ as a function of the execution times of more than one actor (e.g., $T_{ST}(t_A, t_B, \ldots)$); this function will be a convex surface consisting of intersecting planes. Slices of this surface along each variable look like Figure 7.8, which is a slice parallel to the $t_C$ axis, with the other execution times held constant ($t_A = 3$, $t_B = 3$, etc.).

The modeling described in this section is useful for determining how "sensitive" the iteration period is to fixed changes in execution times of actors, given a processor assignment and actor ordering. We observe that the iteration period increases linearly (with slope one) at worst, and does not change at all at best, when the execution time of an actor is increased beyond its compile time estimate.
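The piecewise linear, convex behavior described above can be reproduced with a small Python sketch. The graph below is hypothetical (it is not the graph of Figure 7.7, which is not reproduced here); its three cycles are chosen only so that the maximum cycle mean has the same breakpoints as in the discussion: constant at 7 up to $t_C = 1$, slope 0.5 up to $t_C = 9$, and slope one beyond. The function names are assumptions made for illustration, and the brute-force cycle enumeration is only suitable for very small graphs.

```python
def simple_cycles(edges):
    """Yield every simple cycle once, as a list of (src, dst, delay) edges."""
    adjacency = {}
    for src, dst, delay in edges:
        adjacency.setdefault(src, []).append((dst, delay))
    for start in sorted(adjacency):
        # Depth-first search that only visits nodes ordered after `start`,
        # so each cycle is reported from its smallest node only.
        stack = [(start, [], {start})]
        while stack:
            node, path, seen = stack.pop()
            for dst, delay in adjacency.get(node, []):
                step = path + [(node, dst, delay)]
                if dst == start:
                    yield step
                elif dst > start and dst not in seen:
                    stack.append((dst, step, seen | {dst}))


def max_cycle_mean(exec_time, edges):
    """Maximum over all cycles of (sum of execution times) / (sum of delays)."""
    means = []
    for cycle in simple_cycles(edges):
        delays = sum(delay for _, _, delay in cycle)
        if delays == 0:
            raise ValueError("delay-free cycle: the graph deadlocks")
        means.append(sum(exec_time[src] for src, _, _ in cycle) / delays)
    return max(means)


# Hypothetical graph: (source, destination, number of delays on the edge).
EDGES = [
    ("A", "E", 0), ("E", "A", 1),                  # one delay, C not on it
    ("C", "B", 0), ("B", "D", 1), ("D", "C", 1),   # two delays, contains C
    ("C", "G", 0), ("G", "C", 1),                  # one delay, contains C
]


def period(t_c):
    """Iteration period as a function of the execution time of actor C."""
    exec_time = {"A": 3, "E": 4, "B": 6, "D": 7, "G": 2, "C": t_c}
    return max_cycle_mean(exec_time, EDGES)


for t_c in (0, 1, 3, 9, 10):
    print(t_c, period(t_c))
```

Because the result is a maximum of linear functions of $t_C$, it is convex by construction, which is the same argument the text makes geometrically.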
The effect of variations in execution times of actors on the performance of statically scheduled hardware is inherently difficult to quantify, because these variations could occur due to a large number of factors (conditional branches or data-dependent loops within an actor, error handling, user interrupts, etc.) and because these variations could have a variety of different characteristics, from being periodic, to being dependent on the input statistics, to being completely random. As a result, thus far we have had to resort to statements like "for a static scheduling strategy to be viable, actors must not show significant variations in execution times." In this section we point out the issues involved in modeling the effects of variations in execution times of actors.

Figure 7.7. $G_{ipc}$, where actor C has execution time $t_C$, constant over all invocations of C.
A very simple model for actors with variable execution times is to assign to each actor an execution time that is a random variable (r.v.) with a discrete probability distribution (p.d.f.); successive invocations of each actor are assumed to be statistically independent, execution times of different actors are also assumed to be independent, and the statistics of the random execution times are assumed to be time-invariant. Thus, for example, an actor could have execution time $t_1$ with probability (w.p.) $p$ and execution time $t_2$ w.p. $(1 - p)$. The model is essentially that the actor flips a coin each time it is invoked to decide what its execution time should be for that invocation. Such a model could describe a data-dependent conditional branch, for example, but it is of course too simple to capture many real scenarios.
[Figure 7.8: plot of $T_{ST}(t_C)$ against $t_C$; the middle segment has slope 1/2.]
Dataflow graphs where actors have such random execution times have been studied by Olsder et al. [Ols89][ORVK90] in the context of modeling data-driven networks (also called wave-front arrays [KLL87]) where the multiply operations in the array display data-dependent execution times. The authors show that the behavior of such a system can be described by a discrete-time Markov chain. The idea behind this, briefly, is that such a system is described by a state space consisting of a set of state vectors $s$. Entries in each vector represent the $k$th starting time of each actor normalized with respect to one (arbitrarily chosen) actor:

$s = [\,\mathrm{start}(v_2, k) - \mathrm{start}(v_1, k),\ \ldots,\ \mathrm{start}(v_n, k) - \mathrm{start}(v_1, k)\,]^T$ (7-19)
The normalization (with respect to actor $v_1$ in the above case) is done to make the state space finite; the number of distinct values that the vector $s$ (as defined above) can assume is shown to be finite in [ORVK90]. The states of the Markov chain correspond to each of the distinct values of $s$. The average iteration period, which is defined as

$T = \lim_{K \to \infty} \dfrac{\mathrm{start}(v_i, K)}{K}$ (7-20)
can then be derived from the stationary distribution of the Markov chain. There are several technical issues involved in this definition of the average iteration period; for example, when does the limit in (7-20) exist, and how do we show that the limit is in fact the same for all actors (assuming that the HSDFG is strongly connected)? These questions are fairly non-trivial because the random process $\{\mathrm{start}(v_i, k)\}_k$ may not even be stationary. These questions have been answered rigorously, and it has been shown that:

$T = \lim_{K \to \infty} \dfrac{\mathrm{start}(v_i, K)}{K} = E[T] \quad \forall v_i \in V$ (7-21)

Thus the limit $T$ is in fact constant almost surely [Pap91].
The problem with such exact analysis, however, is the very large state space that can result. We found that for an IPC graph similar to Figure 7.4, assuming that only a single actor execution time is random (taking two different values based on a weighted coin flip), we could get several thousand states for the Markov chain. A graph with more vertices leads to an even larger state space. The size of the state space can be exponential in the number of vertices (exponential in $|V|$). Solving for the stationary distribution of such chains would require solving a set of linear equations equal in number to the number of states, which is highly compute intensive. Thus we conclude that this approach has limited use in determining the effects of varying execution times; even for unrealistically simple stochastic models, computation of exact solutions is prohibitive.

If we assume that all actors have exponentially distributed execution times, then the system can be analyzed using continuous-time Markov chains. This is done by exploiting the memoryless property of the exponential distribution: when an actor fires, the state of the system at any moment does not depend on how long that actor has spent executing its function; the state changes only when that actor completes execution. The number of states for such a system is equal to the number of different valid token configurations on the edges of the dataflow graph, where by "valid" we imply any token configuration that can be reached by a sequence of firings of enabled actors in the HSDFG. This is equal to the number of valid retimings [LS91] that exist for the HSDFG. This number, unfortunately, can again be exponential in the size of the HSDFG.

Analysis of such graphs with exponentially distributed execution times has been extensively studied in the area of stochastic Petri nets (large and comprehensive lists of references on Petri nets, a number of which focus on stochastic Petri nets, are available in the literature). There is a considerable body of work that attempts to cope with the state explosion problem. Some of these works attempt to divide a given Petri net into parts that can be solved separately (e.g., [VvV93]), some others propose simplified solutions when the graphs have particular structures (e.g., [CS93]), and others propose approximations for values such as the expected firing rate of transitions. None of these methods is general enough to handle even a significant class of graphs.

Again, exponentially distributed execution times for actors are clearly too crude an approximation to any realistic scenario to make the computations involved in exact calculations worthwhile.
As an alternative to determining the exact value of $E[T]$, we discuss how to determine efficiently computable bounds for it.

Given an HSDFG $G = (V, E)$ that has actors with random execution times, define $G_{ave} = (V, E)$ to be an equivalent graph with actor execution times equal to the expected values of their execution times in $G$.

Fact 7.1: [Dur91] (Jensen's inequality) If $f(x)$ is a convex function of $x$, then:

$E[f(x)] \ge f(E[x])$
In [RS94] the authors use Fact 7.1 to show that $E[T] \ge MCM(G_{ave})$. This follows from the fact that $MCM(G)$ is a convex function of the execution times of each of its actors. This result is especially interesting because of its generality; it is true no matter what the statistics of the actor execution times are (even the various independence assumptions we made can be relaxed!).

One might wonder what the relationship between $E[T]$ and $E[MCM(G)]$ might be. We can again use Fact 7.1, along with the fact that the maximum cycle mean is a convex function of actor execution times, to show the following:

$E[MCM(G)] \ge MCM(G_{ave})$

However, we cannot say anything about $E[T]$ in relation to $E[MCM(G)]$; there are IPC graphs where $E[T]$ exceeds $E[MCM(G)]$, and others where it is smaller.

If the execution times of actors are all bounded ($t_{min}(v) \le t(v) \le t_{max}(v)\ \forall v \in V$, e.g., if all actors have execution times uniformly distributed in some interval $[a, b]$), then we can say the following:

$MCM(G_{max}) \ge E[T] \ge MCM(G_{ave}) \ge MCM(G_{min})$ (7-23)

where $G_{max} = (V, E)$ is the same as $G$ except that the random actor execution times are replaced by their upper bounds $t_{max}(v)$, and similarly $G_{min} = (V, E)$ is the same as $G$ except that the random actor execution times are replaced by their lower bounds $t_{min}(v)$.

Equation (7-23) summarizes the useful bounds we know for the expected value of the iteration period for graphs that contain actors with random execution times. It should be noted that good upper bounds on $E[T]$ are not known. Rajsbaum and Sidi propose upper bounds for exponentially distributed execution times [RS94]; these upper bounds are typically more than twice the exact value of $E[T]$ and hence not very useful in practice. We attempted to simplify the Markov chain model (i.e., reduce the number of states) for the self-timed execution of a stochastic HSDFG by representing such an execution as a set of self-timed schedules of deterministic HSDFGs, between which the system makes transitions randomly. This representation reduces the number of states of the Markov chain to the number of different deterministic graphs that arise from the stochastic HSDFG. We were able to use this idea to determine an upper bound for $E[T]$; however, this bound also proved to be too loose in general (hence we omit the details of this construction here).
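As a rough numerical illustration of the bounds in (7-23), the following Python sketch simulates the self-timed execution of a small, hypothetical HSDFG (a fork from A to B and C, joined back at A through one-delay edges) in which only actor B has a random, two-valued execution time. The graph, the execution times, and the function name `self_timed_period` are assumptions chosen for illustration. For simplicity, the bounds $MCM(G_{ave})$ and $MCM(G_{max})$ are obtained by running the same simulation with deterministic execution times, relying on the fact that the self-timed period of a deterministic graph converges to its maximum cycle mean.

```python
import random


def self_timed_period(nodes, edges, sample_time, iterations):
    """Estimate the average iteration period of a self-timed execution.

    `nodes` must be listed in an order compatible with the zero-delay
    edges, so that a predecessor's finish time in iteration k is known
    before a zero-delay successor fires in the same iteration.
    """
    preds = {v: [(src, d) for src, dst, d in edges if dst == v] for v in nodes}
    end = {v: [] for v in nodes}      # end[v][k]: finish time of firing k
    for k in range(iterations):
        for v in nodes:
            # An actor fires as soon as all of its input tokens are available.
            ready = [end[src][k - d] for src, d in preds[v] if k - d >= 0]
            start = max(ready, default=0.0)
            end[v].append(start + sample_time(v))
    return max(times[-1] for times in end.values()) / iterations


NODES = ["A", "B", "C"]
EDGES = [("A", "B", 0), ("A", "C", 0), ("B", "A", 1), ("C", "A", 1)]

rng = random.Random(0)


def random_time(v):
    """B takes 2 or 6 time units with equal probability; A and C are fixed."""
    if v == "B":
        return rng.choice([2.0, 6.0])
    return {"A": 1.0, "C": 4.0}[v]


mean_time = {"A": 1.0, "B": 4.0, "C": 4.0}   # execution times in G_ave
max_time = {"A": 1.0, "B": 6.0, "C": 4.0}    # execution times in G_max

estimate = self_timed_period(NODES, EDGES, random_time, 2000)
lower = self_timed_period(NODES, EDGES, mean_time.get, 2000)
upper = self_timed_period(NODES, EDGES, max_time.get, 2000)
print(lower, estimate, upper)
```

For this particular graph every iteration takes $1 + \max(t_B, 4)$ time units, so the expected period is $1 + 0.5 \cdot 4 + 0.5 \cdot 6 = 6$, strictly between the lower bound 5 and the upper bound 7; the simulated estimate should therefore land close to 6.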
Intuitively, an ordered transactions schedule is more sensitive to variations in execution times, even though in a functional sense the computations performed using the ordered transactions schedule are robust with respect to execution time variations (the transaction order ensures correct sender-receiver synchronization). The ordering restriction makes the iteration period more dependent on execution time variations than the ideal ST schedule. This is apparent from our IPC graph model; the transaction ordering constraints add additional edges to $G_{ipc}$. For example, an IPC graph with transaction ordering constraints represented as dashed arrows is shown in Figure 7.9 (we use the transaction order $O^*$ determined in Section 7.5; communication times are not included). The graph for $T_{OT}(t_C)$ is now different and is plotted in Figure 7.10. Note that the $T_{OT}(t_C)$ curve for the ordered transactions schedule (solid) is "above" the corresponding curve for the unconstrained schedule (dashed): this shows precisely what we mean by an ordered transactions schedule being more sensitive to variations in execution times of actors. The "optimal" transaction order $O^*$ we determined ensures that the transaction constraints do not sacrifice throughput (ensures $T_{OT} = T_{ST}$) when actor execution times are equal to their compile time estimates; $O^*$ was calculated using $t_C = 3$ in Section 7.5, and sure enough, $T_{OT}(t_C) = T_{ST}(t_C)$ when $t_C = 3$.
Figure 7.9. IPC graph with transaction ordering constraints represented as dashed arrows.
Modeling using random variables for the ordered transactions schedule can again be done as before, and since we have more constraints in this schedule, the expected iteration period will in some cases be larger than that for a self-timed schedule.
In this chapter we presented a quantitative analysis of self-timed and ordered transactions schedules and showed how to determine the effects of imposing a transaction order on a self-timed schedule. If the actual execution times do not deviate significantly from the estimated values, the difference in performance of the self-timed and ordered transactions strategies is minimal. If the execution times do in fact vary significantly, then even a self-timed strategy is not practical; it then becomes necessary to use a more dynamic strategy such as static assignment or fully dynamic scheduling to make the best use of computing resources. Under the assumption that the variations in execution times are small enough so that a self-timed or an ordered transactions strategy is viable, it may be wiser to use the ordered transactions strategy rather than self-timed because of the more efficient interprocessor communication of the ordered transactions strategy. This is because a transaction order $O^*$ can be efficiently determined such that the ordering constraints do not sacrifice performance; if the execution times of actors are close to their estimates, the ordered transactions schedule with $O^*$ as the transaction order has an iteration period close to the minimum achievable period. Thus we make the best possible use of compile time information when we determine the transaction order $O^*$.

The complexities involved in modeling run-time variations in execution times of actors were also discussed; even highly simplified stochastic models are difficult to analyze precisely. We pointed out bounds that have been proposed in the Petri net literature for the value of the expected iteration period, and concluded that although a lower bound is available for this quantity for rather general stochastic models (using Jensen's inequality), tight upper bounds are not known to date, except for the trivial upper bound using the maximum execution times of actors.

Figure 7.10. $T_{ST}(t_C)$ and $T_{OT}(t_C)$ for the example of Figure 7.9.
The techniques of the previous chapters apply compile time analysis to static schedules for application graphs that have no decision-making at the dataflow graph (inter-task) level. This chapter considers graphs with data-dependent control flow. Recall that atomic actors in an SDF graph are allowed to perform data-dependent decision making within their body, as long as their input/output behaviour respects SDF semantics. We show how some of the ideas we explored previously can still be applied to dataflow graphs containing actors that display data-dependent firing patterns, and therefore are not SDF actors.
The Boolean dataflow (BDF) model was proposed by Lee [Lee91] and Buck [Buc93] for extending the SDF model to allow data-dependent control actors in the dataflow graph. BDF actors are allowed to contain a control input, and the number of tokens consumed and produced on the arcs of a BDF actor can be a two-valued function of a token consumed at the control input. Actors that follow SDF semantics, i.e., that consume and produce a fixed number of tokens on their arcs, are clearly a subset of the set of allowed BDF actors (SDF actors simply do not have any control inputs). Two basic dynamic actors in the BDF model are the SWITCH and SELECT actors shown in Figure 8.1. The SWITCH actor consumes one Boolean-valued control token and another input token; if the control token is TRUE, the input token is copied to the output labelled T, otherwise it is copied to the output labelled F. The SELECT actor performs the complementary operation; it reads an input token from its T input if the control token is TRUE, otherwise it reads from its F input; in either case, it copies the token to its output. Constructs such as conditionals and data-dependent iterations can easily be represented in a BDF graph, as illustrated in Figure 8.2. The vertices A, B, C, etc. in Figure 8.2 need not be atomic actors; they could also be arbitrary SDF subgraphs. A BDF graph allows SWITCH and SELECT actors to be connected in arbitrary topologies. Buck [Buc93] in fact shows that any Turing machine can be expressed as a BDF graph, and therefore the problems of determining whether such a graph deadlocks and whether it uses bounded memory are undecidable. Buck proposes heuristic solutions to these problems based on extensions of the techniques for SDF graphs to the BDF model.
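The token-level behavior of SWITCH and SELECT described above can be sketched in Python, with arcs modeled as FIFO queues. This is only an illustration of the firing rules, not an implementation from the text: the function names, the use of `collections.deque` for arcs, and the example if-then-else graph (which doubles non-negative inputs and negates negative ones) are all assumptions.

```python
from collections import deque


def switch(control, data, out_true, out_false):
    """Fire SWITCH once: consume one control token and one data token, and
    copy the data token to the T output if the control token is TRUE,
    otherwise to the F output."""
    token = data.popleft()
    (out_true if control.popleft() else out_false).append(token)


def select(control, in_true, in_false, out):
    """Fire SELECT once: consume one control token, then consume one token
    from the T input if it is TRUE (else from the F input) and copy it to
    the output."""
    source = in_true if control.popleft() else in_false
    out.append(source.popleft())


def if_then_else(xs):
    """Hypothetical conditional: TRUE branch doubles x, FALSE branch negates x."""
    data, out = deque(xs), deque()
    c_switch = deque(x >= 0 for x in xs)   # control tokens seen by SWITCH
    c_select = deque(c_switch)             # the same values, seen by SELECT
    t_arc, f_arc, t_done, f_done = deque(), deque(), deque(), deque()
    for _ in xs:
        switch(c_switch, data, t_arc, f_arc)
        if t_arc:                          # TRUE-branch actor fires
            t_done.append(2 * t_arc.popleft())
        if f_arc:                          # FALSE-branch actor fires
            f_done.append(-f_arc.popleft())
        select(c_select, t_done, f_done, out)
    return list(out)


print(if_then_else([3, -1, 0, -5]))
```

Note that the control stream has to be delivered to both SWITCH and SELECT, since each of them consumes one control token per firing.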
Buck presents techniques for statically scheduling BDF graphs on a single processor; his methods attempt to generate a sequential program without a dynamic scheduling mechanism, using control constructs where required. Because of the inherent undecidability of determining deadlock behaviour and bounded memory usage, these techniques are not always guaranteed to generate a static schedule, even if one exists; a dynamically scheduled implementation, where a run-time kernel decides which actors to fire, can be used when a static schedule cannot be found in a reasonable amount of time.

Automatic parallel scheduling of general BDF graphs is still an unsolved problem. One mechanism for scheduling graphs that contain SWITCH and SELECT actors is to generate an Acyclic Precedence Extension Graph (APEG), similar to the APEG generated for SDF graphs discussed in Section 3.8, for every possible assignment of the Boolean-valued control tokens in the BDF graph. For example, the if-then-else graph in Figure 8.2(a) could have two different APEGs, shown in Figure 8.3, and the APEGs thus obtained can be scheduled individually using a self-timed strategy; each processor now gets several lists of actors, one list for each possible assignment of the control tokens. The problem with this approach is that for a graph with $n$ different control tokens, there are $2^n$ possible distinct APEGs, each corresponding to an execution path in the graph. Such a set of APEGs can be compactly represented using the so-called Annotated Acyclic Precedence Graph (AAPG) of [Buc93], in which actors and arcs are annotated with conditions under which they exist in the graph. Buck uses the AAPG construct to determine whether a bounded-length uniprocessor schedule exists. In the case of multiprocessor scheduling, it is not clear how such an AAPG could be used to explore scheduling options for the different values that the control tokens could take, without explicitly enumerating all possible execution paths.
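The exponential growth mentioned above is easy to see by enumerating the control token assignments directly. The short Python sketch below is only illustrative; the function name and the token names are hypothetical, and each assignment it yields would correspond to one APEG that has to be scheduled separately.

```python
from itertools import product


def control_assignments(tokens):
    """Yield one dict per TRUE/FALSE assignment of the given control tokens."""
    for values in product((True, False), repeat=len(tokens)):
        yield dict(zip(tokens, values))


# One control token (the if-then-else case): two assignments, two APEGs.
print(list(control_assignments(["c"])))

# n control tokens: 2**n assignments, hence 2**n APEGs to schedule.
for n in (1, 2, 4, 8, 16):
    tokens = [f"c{i}" for i in range(n)]
    print(n, sum(1 for _ in control_assignments(tokens)))
```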
Figure 8.2. (a) Conditional (if-then-else) dataflow graph. The branch outcome is determined at run-time by actor B. (b) Graph representing data-dependent iteration. The termination condition for the loop is determined by actor D.
A useful body of work in parallel scheduling of dataflow graphs that have dynamic actors is the quasi-static scheduling approach, discussed in Section 4.6. In this work, techniques have been developed that statically schedule standard dynamic constructs such as data-dependent conditionals, data-dependent iterations, and recursion. Such a quasi-static scheduling approach clearly does not handle a general BDF graph, although it is a good starting point for doing so.
We will consider only the conditional and the iteration construct here. We assume that we are given a quasi-static schedule, obtained either manually or using the techniques that were described briefly in Section 4.6. We explore how the techniques proposed in the previous chapters for multiprocessors that utilize a self-timed scheduling strategy apply when we implement a quasi-static schedule on a multiprocessor. First, we propose an implementation of a quasi-static schedule on a shared memory multiprocessor, and then we show how we can implement the same program on the OMA architecture, using the hardware support provided in the OMA architecture prototype.

Figure 8.3. Acyclic precedence extension graphs (APEGs) corresponding to the if-then-else graph of Figure 8.2. (a) corresponds to the TRUE assignment of the control token, (b) to the FALSE assignment.
A quasi-static schedule ensures, by means of the execution profile, that the pattern of processor availability is identical regardless of how the data-dependent construct executes at run-time; in the case of the conditional construct this means that irrespective of which branch is actually taken, the pattern of processor availability after the construct completes execution is the same. This has to be ensured by inserting idle time on processors when necessary. Figure 8.4 shows a quasi-static schedule for a conditional construct. Maintaining the same pattern of processor availability allows static scheduling to proceed after the execution of the conditional; the data-dependent nature of the control construct can be ignored at that point. In Figure 8.4, for example, the scheduling of subgraph-1 can proceed independently of the conditional construct because the pattern of processor availability after this construct is the same regardless of the branch outcome; note that "nops" (idle processor cycles) have been inserted to ensure this.
Multiprocessor implementation of a quasi-static schedule directly, however, implies enforcing global synchronization after each dynamic construct in order to ensure a particular pattern of processor availability. We therefore use a mechanism similar to the self-timed strategy; we first determine a quasi-static schedule using the methods of Lee and Ha, and then discard the timing information and the restrictions of maintaining a processor availability profile. Instead, we only retain the assignment of actors to processors, the order in which they execute, and the conditions on the Boolean tokens in the system under which each actor should execute. Synchronization between processors is done at run-time whenever processors communicate. This scheme is analogous to constructing a self-timed schedule from a fully-static schedule, as discussed in Section 4.3. Thus the quasi-static schedule of Figure 8.4 can be implemented by the set of programs in Figure 8.5, for the three processors. Here, $r_{c1}$, $r_{c2}$, $r_1$, and $r_2$ are the receive actors, and $s_{c1}$, $s_1$, and $s_2$ are the send actors. The subscript "c" refers to actors that communicate control tokens.

The main difference between such an implementation and the self-timed implementation we discussed in earlier chapters is the control tokens. When a conditional construct is partitioned across more than one processor, the control token(s) that determine its behavior must be broadcast to all the processors that execute that construct. Thus, in Figure 8.4, the value c, which is computed by Processor 2 (since the actor that produces c is assigned to Processor 2), must be broadcast to the other two processors. In a shared memory machine this broadcast can be implemented by allowing the processor that evaluates the control token (Processor 2 in our example) to write its value to a particular shared memory location preassigned at compile time; the processor will then update this location once for each iteration of the graph. Processors that require the value of a particular control token simply read that value from shared memory, and the processor that writes the value of the control token needs to do so only once. In this way, actor executions can be conditioned upon the values of control tokens evaluated at run-time. In the previous chapters, we discussed synchronization associated with data transfer between processors. Synchronization checks must also be performed for the control tokens; the processor that writes the value of a token must not overwrite the shared memory location unless all processors requiring the value of that token have in fact read the shared memory location, and processors reading a control token must ascertain that the value they read corresponds to the current iteration rather than a previous iteration.

The need for broadcast of control tokens creates additional communication overhead that should ideally be taken into account during scheduling. The methods of Lee and Ha, and also prior research related to quasi-static scheduling that they refer to in their work, do not take this cost into account. Static multiprocessor scheduling applied to graphs with dynamic constructs, taking the costs of distributing control tokens into account, is thus an interesting problem for further study.

[Figure 8.4: quasi-static schedule for a conditional construct on three processors.]
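The two synchronization conditions for a broadcast control token (the writer must not overwrite a value that some reader still needs, and a reader must not consume a value from a previous iteration) can be sketched with Python threads standing in for processors. This is only a software analogy under stated assumptions: the class `ControlTokenSlot`, its fields, and the use of a condition variable are hypothetical and are not the mechanism of the shared memory prototype, which would use flags or counters in shared memory.

```python
import threading


class ControlTokenSlot:
    """One shared location holding a control token, written once per iteration."""

    def __init__(self, num_readers):
        self.num_readers = num_readers
        self.cond = threading.Condition()
        self.value = None
        self.iteration = -1     # iteration that the stored value belongs to
        self.reads_left = 0     # readers that have not yet read the stored value

    def write(self, iteration, value):
        with self.cond:
            # Do not overwrite until every reader has consumed the old value.
            self.cond.wait_for(lambda: self.reads_left == 0)
            self.value, self.iteration = value, iteration
            self.reads_left = self.num_readers
            self.cond.notify_all()

    def read(self, iteration):
        with self.cond:
            # Block until the stored value belongs to the requested iteration.
            self.cond.wait_for(lambda: self.iteration == iteration)
            value = self.value
            self.reads_left -= 1
            self.cond.notify_all()
            return value


def run_demo(iterations=50, num_readers=2):
    """One writer (the processor computing c) and several readers."""
    slot = ControlTokenSlot(num_readers)
    seen = [[] for _ in range(num_readers)]

    def writer():
        for k in range(iterations):
            slot.write(k, k % 3 == 0)    # hypothetical control token values

    def reader(index):
        for k in range(iterations):
            seen[index].append(slot.read(k))

    threads = [threading.Thread(target=writer)]
    threads += [threading.Thread(target=reader, args=(i,)) for i in range(num_readers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return seen
```

Calling `run_demo()` returns, for each reader, the sequence of control token values it observed; every reader should see exactly the sequence the writer produced, one value per iteration.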
Recall that the OMA architecture imposes an order in which shared memory is accessed by processors in the machine. This is done to implement the OT strategy, and is feasible because the pattern of processor communications in a self-timed schedule of an HSDFG is in fact predictable. What happens when we want to run a program derived from a quasi-static schedule, such as the parallel program in Figure 8.5, which was derived from the schedule in Figure 8.4? Clearly, the order of processor accesses to shared memory is no longer predictable; it depends on the outcome of run-time evaluation of the control token c. The quasi-static schedule of Figure 8.4 specifies the schedules for the TRUE and FALSE branches of the conditional. If the value of c were always TRUE, then we can determine from the quasi-static schedule that the transaction order would be

$(s_{c1}, r_{c1}, r_{c2}, s_1, r_1, \langle\text{access order for subgraph-1}\rangle)$

and if the value of c were always FALSE, the transaction order would be

$(s_{c1}, r_{c1}, r_{c2}, s_2, r_2, \langle\text{access order for subgraph-1}\rangle)$

Note that writing the control token c once to shared memory is enough, since the same shared location can be read by all processors requiring the value of c.

    Proc 1                  Proc 2                  Proc 3
    receive c (r_c1)        B                       D
    if (c)                  send c (s_c1)           receive c (r_c2)
        E                   C                       if (c)
        receive (r_1)       if (c)                      H
        F                       send (s_1)          else
    else                        G                       L
        I                   else                        send (s_2)
        receive (r_2)           K                   <code for subgraph-1>
    <code for subgraph-1>   <code for subgraph-1>

Figure 8.5. Programs on three processors for the quasi-static schedule of Figure 8.4.
In the OMA architecture, a possible strategy is to switch between these two access orders at run-time. This is enabled by the preset feature of the transaction controller (Section 6.6.2). Recall that the transaction controller is implemented as a presettable schedule counter that addresses memory containing the processor IDs corresponding to the bus access order. To handle conditional constructs, we derive one bus access list for each path in the program (two lists in our example), and the processor that determines the branch condition (processor 2 in our example) forces the controller to switch between access lists by loading the schedule counter with the appropriate value (address "7" in the bus access schedule of Figure 8.7). Note from Figure 8.7 that there are two points where the schedule counter can be set; one is at the completion of the TRUE branch, and the other is a jump into the FALSE branch. The branch into the FALSE path is best taken care of by processor 2, since it computes the value of the control token c, whereas the branch after the TRUE path (which bypasses the access list of the FALSE branch) is best taken care of by processor 1, since processor 1 already possesses the bus at the time when the counter needs to be loaded. The schedule counter load operations are easily incorporated into the sequential programs of processors 1 and 2.

[Figure 8.6: bus transactions (such as $s_1$, $r_1$) and the schedule for subgraph-1 on processors 1, 2, and 3.]

Figure 8.7. Bus access list that is stored in the schedule memory. The loading operation of the schedule counter, conditioned on the value of c, is also shown: processor 2 forces the controller to jump to the access list for the FALSE branch if c is FALSE, and processor 1 forces the controller to bypass the access list for the FALSE branch if c is TRUE.

The mechanism of switching between access orders works well when the number of control tokens is small. But if the number of such tokens is large, then this mechanism breaks down, even if we can efficiently compute a quasi-static schedule for the graph. To see why this is so, consider the graph in Figure 8.8, which contains $m$ conditional constructs in parallel paths going from the input to the output. The functions "$f_i$" and "$g_i$" are assumed to be subgraphs that are assigned to more than one processor. In Ha's hierarchical scheduling approach, each conditional is scheduled independently; once scheduled, it is converted into an atomic node in the hierarchy, and a profile is assigned to it. Scheduling of the other conditional constructs can then proceed based on these profiles. Thus, the scheduling complexity in terms of the number of parallel paths is $O(m)$ if there are $m$ parallel paths. If we implement the resulting quasi-static schedule in the manner stated in the previous section, and employ the OMA mechanism above, we would need one bus access list for every combination of the Booleans $b_1, \ldots, b_m$. This is because each $f_i$ and $g_i$ will have its own associated bus access list, which then has to be combined with the bus access lists of all the other branches to yield one list. For example, if all Booleans $b_i$ are TRUE, then all the $f_i$ are executed, and we get one access list. If $b_1$ is TRUE, and $b_2$ through $b_m$ are FALSE, then $f_1$ is executed, and $g_2$ through $g_m$ are executed. This corresponds to another bus access list. This implies $2^m$ bus access lists, one for each combination of $f_i$ and $g_i$ that execute, i.e., one for each possible execution path in the graph.
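The access-list switching mechanism described above can be sketched as a small simulation of a presettable schedule counter. The memory layout below is hypothetical (it is not the list of Figure 8.7): a common prefix for broadcasting c, one segment per branch, and a shared tail for subgraph-1. The constants and the function name are assumptions made for illustration, and the two counter loads are modeled inside the loop rather than being issued by processors 2 and 1.

```python
# Hypothetical schedule memory: processor IDs in bus access order.
ACCESS_LIST = [
    2, 1, 3,    # addresses 0-2: proc 2 writes c, procs 1 and 3 read c
    2, 1,       # addresses 3-4: TRUE branch  (send by proc 2, receive by proc 1)
    3, 1,       # addresses 5-6: FALSE branch (send by proc 3, receive by proc 1)
    1, 2, 3,    # addresses 7-9: accesses of subgraph-1
]
TRUE_START, FALSE_START, JOIN = 3, 5, 7


def grants_for_iteration(c):
    """Return the sequence of bus grants for one iteration, given the value of c."""
    grants, counter = [], 0
    while counter < len(ACCESS_LIST):
        grants.append(ACCESS_LIST[counter])
        counter += 1
        if counter == TRUE_START and not c:
            # The processor that computed c presets the counter:
            # jump into the access list of the FALSE branch.
            counter = FALSE_START
        elif counter == FALSE_START and c:
            # The last bus owner of the TRUE branch presets the counter:
            # bypass the access list of the FALSE branch.
            counter = JOIN
    return grants


def lists_needed(m):
    """Number of distinct access lists for m conditionals in parallel paths."""
    return 2 ** m


print(grants_for_iteration(True))
print(grants_for_iteration(False))
print([lists_needed(m) for m in (1, 2, 4, 8)])
```

With a single conditional only two counter loads are needed, but with $m$ conditionals in parallel paths the number of distinct lists grows as $2^m$, which is the breakdown the text describes.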
Although the idea of maintaining separate bus access lists simple mechanism for han const~cts,itcan sometimes be i~practical, in the example above. an alternative mechanism based on ~u~~~~~that handles arallel conditional constructs more effectively. e main idea behind masking to store an ID of along with the processor ID in the bus access list. The Boolean ID determines whether p ~ i c u l a bus r grant is “enabled.” This ows us to combine the access lists of the nodes through and g, through The bus grant co~esponding to each is tagged with the boolean ID of the corresponding b,, and an additional bit indicates that the bus grant is to be enabled when “RUE. Similarly, each bus grant co~espondingto the access list of gi is tagged with the ID of and an additional bit indicates that the bus grant must be enabled only if the correspon~ing control token has FALSE value. At run-time, the controller steps through the bus access list as before, but instead of simply granting the bus to the procesat the head of the list, it first checks that the control token corre§ponding to field of the list is in its correct state. If it is in the correct state for bus grant corresponding to an and FALSE for bus grant corresponding to gi), then the bus grant is performed, otherwise it masked. Thus the run-time values of the Booleans must be made available to the transaction controller for it to decide whether to mask a pa~icularbus grant or not. rant should be enabled by product the dataflow graph, and the completed conditionals in parallel branches of the g
Thus, in general we need to implement an annotated bus access list of the form $\{(c_1)ProcID_1, (c_2)ProcID_2, \ldots\}$; each bus access is annotated with a Boolean-valued condition $c_i$, indicating that the bus should be granted to the processor corresponding to $ProcID_i$ when $c_i$ evaluates to TRUE. Each $c_i$ could be an arbitrary product function of the Booleans $\{b_1, b_2, \ldots, b_n\}$ in the system and the complements of these Booleans (written $\bar{b}_i$, where the bar over a variable indicates its complement).
This scheme is implemented as shown in Figure 8.9. The schedule memory now contains two fields corresponding to each bus access: <Condition>:<ProcID>, instead of the <ProcID> field alone that we had before. The <Condition> field encodes a unique product $c_i$ associated with that particular bus access. In the prototype, we can use 3 bits for <ProcID>, and 5 bits for the <Condition> field. This would allow us to handle 8 processors and 32 product combinations of Booleans. There can be up to $3^n$ product terms in the worst case corresponding to $n$ Booleans in the system, because for each Boolean $b_i$ a product term could contain $b_i$, or it could contain its complement $\bar{b}_i$, or else $b_i$ could be a "don't care". It is unlikely that all $3^n$ possible product terms will be required in practice; we therefore expect such a scheme to be practical. The necessary product terms can be implemented within the controller at compile time, based on the bus access pattern of the particular dynamic dataflow graph to be executed.

Figure 8.9. A bus access mechanism that masks bus grants based on control tokens that are evaluated at run time. (Labels: shared address bus; conditions $c_1, \ldots$; signal indicating whether to mask the current bus grant (BG) or not; bus grants $BG_1$ through $BG_n$.)

In Figure 8.9, the flags $b_1, b_2, \ldots, b_n$ are 1-bit memory elements (flip-flops) that are memory mapped to the shared bus, and store the values of the Boolean control tokens in the system. The processor that computes the value of each control token updates the corresponding $b_i$ by writing to the shared memory location that maps to $b_i$. The product combinations $c_1, c_2, \ldots, c_m$ are just AND functions of the $b_i$'s and the complements of the $b_i$'s; for example, a condition could be $b_i \bar{b}_j$ for some pair of Booleans. As the schedule counter steps through the bus access list, the bus is actually granted only if the condition corresponding to that access evaluates to TRUE. Thus if an entry with condition $b_i \bar{b}_j$ and processor ID 1 appears at the head of the bus access list, then processor 1 receives a bus grant only if the control token $b_i$ is TRUE and $b_j$ is FALSE; otherwise the bus grant is masked and the schedule counter moves up to the next entry in the list. This scheme can be incorporated into the transaction controller in our architecture prototype, since the controller is implemented on an FPGA. The product terms $c_1, c_2, \ldots, c_m$ may be programmed into the FPGA at compile time; when we generate programs for the processors, we can also generate the annotated bus access list and a hardware description for the FPGA (in VHDL, say) that implements the required product terms.
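The masking step described above can be illustrated in software. The following is a minimal sketch, not the controller's actual hardware design: the names `Condition`, `condition_holds`, and `granted_processors` are hypothetical, a product term is modeled as a dictionary from Boolean names to required values, and one pass over the list stands in for one period of the schedule counter.

```python
from typing import Dict, List, Tuple

# A condition is a product term: {boolean_name: required_value}.
# An empty dict means "always enabled" (every Boolean is a don't-care).
Condition = Dict[str, bool]
Entry = Tuple[Condition, int]  # (condition, processor ID)


def condition_holds(cond: Condition, flags: Dict[str, bool]) -> bool:
    """AND of the literals in the product term, evaluated on the flag values."""
    return all(flags[name] == required for name, required in cond.items())


def granted_processors(access_list: List[Entry], flags: Dict[str, bool]) -> List[int]:
    """Step through the annotated list once; masked entries are skipped."""
    grants = []
    for cond, proc_id in access_list:
        if condition_holds(cond, flags):
            grants.append(proc_id)
    return grants


annotated = [
    ({}, 2),                         # unconditional grant
    ({"b1": True}, 1),               # enabled when b1 is TRUE
    ({"b1": False}, 3),              # enabled when b1 is FALSE
    ({"b1": True, "b2": False}, 0),  # product term b1 AND (NOT b2)
]

print(granted_processors(annotated, {"b1": True, "b2": False}))   # [2, 1, 0]
print(granted_processors(annotated, {"b1": False, "b2": False}))  # [2, 3]
```

In the hardware scheme each masked entry still consumes a bus cycle; the sketch only shows which grants survive for a given setting of the flags.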
Annotated bus access list
A straightforward, even if inefficient, mechanism for obtaining such a list is to use enumeration; we simply enumerate all possible combinations of Booleans in the system ($2^n$ combinations for $n$ Booleans), and determine the bus access sequence (sequence of ProcID's) for each combination. Each combination corresponds to an execution path in the graph, and we can estimate the time of occurrence of bus accesses corresponding to each combination from the quasi-static schedule. For example, bus accesses corresponding to one schedule period of the two execution paths in the quasi-static schedule of Figure 8. are shown along the time axis in Figure 8.10 (we have ignored the bus accesses corresponding to subgraph-1 to keep the illustration simple). The bus access schedules for each of the combinations can now be collapsed into one annotated list, as in Figure 8.10; the fact that accesses for each combination are ordered with respect to time allows us to enforce a global order on the accesses in the collapsed bus access list. The bus accesses in the list are annotated with their respective conditions.

The collapsed list obtained above can be used, as is, in the masked controller scheme; however, there is potential for optimizing this list. Note that the same transaction may appear in the access lists corresponding to different Boolean combinations, because a particular Boolean may be a "don't care" for that bus access. For example, the first bus accesses in Figure 8.10 appear in both execution paths, because they are independent of the value of $c$. In the worst case, a bus access that is independent of all the Booleans will end up appearing in the bus access lists of all the Boolean combinations. If these bus accesses appear contiguously in the collapsed bus access sequence, we can combine them into one. For example, "$(c)$ Proc2, $(\bar{c})$ Proc2" in the annotated schedule of Figure 8.10 can be combined into a single "Proc2" entry, which is not conditioned on any control token.

Figure 8.10. Bus access lists for $c$ = TRUE and $c$ = FALSE along the time axis, and the corresponding annotated list.

Consider another example: if we get contiguous entries "$(b_1 b_2)$ Proc3" and "$(b_1 \bar{b}_2)$ Proc3" in the collapsed list, we can replace the two entries with a single entry "$(b_1)$ Proc3". More generally, if the collapsed list contains a contiguous segment of the form
$$\{\ldots, (c_1)ProcID_k, (c_2)ProcID_k, \ldots, (c_l)ProcID_k, \ldots\},$$
each of the contiguous segments can be written as
$$\{\ldots, (c_1 + c_2 + \cdots + c_l)ProcID_k, \ldots\},$$
where the bus grant condition is an expression $(c_1 + c_2 + \cdots + c_l)$, which is a sum-of-products (SOP) function of the Booleans in the system. Two-level logic minimization can then be applied to determine a minimal representation of each of these expressions. Such two-level minimization can be done by using a logic minimization tool such as ESPRESSO [BHMSV84], which simplifies a given SOP expression into an SOP representation with a minimal number of product terms. Suppose the expression $(c_1 + c_2 + \cdots + c_l)$ can be minimized into another SOP expression $(c_1' + c_2' + \cdots + c_p')$, where $p \le l$. The segment
$$(c_1)ProcID_k, (c_2)ProcID_k, \ldots, (c_l)ProcID_k$$
can then be replaced with an equivalent segment of the form
$$(c_1')ProcID_k, (c_2')ProcID_k, \ldots, (c_p')ProcID_k.$$
This procedure results in a minimal set of contiguous appearances of a bus grant to the same processor. Another optimization that can be performed is to combine annotated bus access lists with the switching mechanism described earlier in this chapter. Suppose we have the following annotated bus access list:
$$\{\ldots, (b_1 \bar{b}_2)ProcID_i, (b_1 b_3)ProcID_j, (b_1 b_4 b_5)ProcID_k, \ldots\}.$$
Then, by "factoring" $b_1$ out, the above list may be equivalently written as
$$\{\ldots, (b_1)\{(\bar{b}_2)ProcID_i, (b_3)ProcID_j, (b_4 b_5)ProcID_k\}, \ldots\}.$$
Now, all three bus accesses may be skipped whenever the Boolean $b_1$ is FALSE by loading the schedule counter and forcing it to increment its count by three, instead of evaluating each access separately and skipping over each one individually. This strategy reduces overhead, because it costs an extra bus cycle to disable a bus access when the condition corresponding to that bus access evaluates to FALSE; by skipping over three bus accesses that we know are going to be disabled, we save three idle bus cycles. There is an added cost of one cycle for loading the schedule counter; the total savings in this example is therefore two bus cycles. One of the problems with the above approach is that it involves explicit enumeration of all possible combinations of Booleans, the complexity of which
limits the size of problems that can be tackled with this approach. An implicit mechanism for representing all possible execution paths is therefore desirable. One such mechanism is the use of Binary Decision Diagrams (BDDs), which have been used to efficiently represent and manipulate Boolean functions for the purpose of logic minimization [Bry86]. BDDs have been used to compactly represent large state spaces, and to perform operations implicitly over such state spaces when methods based on explicit techniques are infeasible. One difficulty encountered in applying BDDs to the problem of representing execution paths is that it is not obvious how precedence and ordering constraints can be encoded in a BDD representation. The execution paths corresponding to the various Boolean combinations can be represented using a BDD, but it isn't clear how to represent the ordering of bus accesses corresponding to the different execution paths.
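The merging of contiguous entries in the collapsed list, discussed earlier in this section, can be sketched as follows. This is only an illustration under stated assumptions: it applies the single rule $x b + x \bar{b} = x$ to neighbouring grants for the same processor and is not a substitute for a two-level minimizer such as ESPRESSO. The names `Product`, `combine`, and `merge_contiguous` are hypothetical. Two identical contiguous entries are deliberately left alone, since they represent two separate bus accesses on the same execution path.

```python
from typing import FrozenSet, List, Optional, Tuple

Literal = Tuple[str, bool]    # (Boolean name, required value)
Product = FrozenSet[Literal]  # product term; the empty set means unconditional
Entry = Tuple[Product, int]   # (condition, processor ID)


def combine(p: Product, q: Product) -> Optional[Product]:
    """Return x if p = x*b and q = x*(not b) for some Boolean b, else None."""
    diff = p ^ q  # literals present in exactly one of the two terms
    if len(diff) == 2:
        (name_a, val_a), (name_b, val_b) = diff
        if name_a == name_b and val_a != val_b:
            return p & q
    return None


def merge_contiguous(entries: List[Entry]) -> List[Entry]:
    """Merge neighbouring grants to the same processor where the rule applies."""
    out: List[Entry] = []
    for cond, proc in entries:
        out.append((cond, proc))
        # Keep merging the last two entries while that is possible.
        while len(out) >= 2 and out[-1][1] == out[-2][1]:
            merged = combine(out[-2][0], out[-1][0])
            if merged is None:
                break
            out[-2:] = [(merged, proc)]
    return out


collapsed = [
    (frozenset({("c", True)}), 2),
    (frozenset({("c", False)}), 2),
    (frozenset({("b1", True), ("b2", True)}), 3),
    (frozenset({("b1", True), ("b2", False)}), 3),
]
for cond, proc in merge_contiguous(collapsed):
    print(sorted(cond), proc)
```

On this input the two Proc2 entries collapse to a single unconditional entry and the two Proc3 entries collapse to one entry conditioned on b1 alone, matching the two examples in the text.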
A data-dependent iteration construct is shown in Figure 8.2(b). A quasi-static schedule for such a construct may look like the one in Figure 8.11. The nodes of Figure 8.2(b), including C and D, are assumed to be subgraphs rather than atomic actors. Such a quasi-static schedule can be implemented in a straightforward manner on the OMA architecture, provided that the data-dependent construct spans all the processors in the system. The bus access schedule corresponding to the iterated subgraph is simply repeated until the iteration construct terminates. The processor responsible for determining when the iteration terminates can be made to force the schedule counter to loop back until the termination condition is reached. This is shown in Figure 8.12.
Figure 8.11. Quasi-static schedule for the data-dependent iteration graph of Figure 8.2(b).
This chapter has dealt with extensions of the ordered-transactions approach to graphs with data-dependent control flow. The Boolean dataflow model was briefly reviewed, as was the quasi-static approach to scheduling conditional and data-dependent iteration constructs. A scheme was then described whereby the ordered-transactions approach could be used when such control constructs are included in the dataflow graph. In this scheme, bus access schedules are computed for each set of values that the control tokens in the graph evaluate to, and the bus access controller is made to select between these lists at run-time based on which set of values the control tokens actually take at any given time. This was also shown to be applicable to data-dependent iteration constructs. Such a scheme is feasible when the number of execution paths in the graph is small. A mechanism based on masking of bus accesses depending on run-time values of control tokens may be used for handling the case when there are multiple conditional constructs in "parallel."
Figure 8.12. A possible access order list corresponding to the quasi-static schedule of Figure 8.11. (Annotation: the processor that determines the termination condition of the iteration can also reinitialize the schedule counter.)
SYNCHRONIZATION IN SELF-TIMED SYSTEMS

The previous three chapters have been concerned with the ordered-transactions strategy, which is a hardware approach to reducing IPC and synchronization costs in self-timed schedules. In this chapter and the following chapters, we discuss software-based strategies for minimizing synchronization costs in the final implementation of a given self-timed schedule. These software-based techniques are widely applicable to shared-memory multiprocessors that consist of homogeneous or heterogeneous collections of processors, and they do not require the availability of hardware support for employing the OT approach or any other form of specialized hardware support.

Recall that the self-timed scheduling strategy introduces synchronization checks whenever processors communicate. A straightforward implementation of a self-timed schedule would require that for each inter-processor communication (IPC), the sending processor ascertain that the buffer it is writing to is not full, and the receiver ascertain that the buffer it is reading from is not empty. The processors block (suspend execution) when the appropriate condition is not met. Such sender-receiver synchronization can be implemented in many ways depending on the particular hardware platform under consideration: in shared-memory machines, such synchronization involves testing and setting semaphores in shared memory; in machines that support synchronization in hardware (such as barriers), special synchronization instructions are used; and in the case of systems that consist of a mix of programmable processors and custom hardware elements, synchronization is achieved by employing interfaces that support blocking reads and writes. In each type of platform, each IPC that requires a synchronization check costs performance, and sometimes extra hardware complexity. Semaphore checks cost execution time on the processors, synchronization instructions that
make use of special synchronization hardware such as barriers also cost execution time, and blocking interfaces between a programmable processor and custom hardware in a combined hardware/software implementation require more hardware than non-blocking interfaces [H+93]. In this chapter, we present algorithms and techniques that reduce the rate at which processors must access shared memory for the purpose of synchronization in multiprocessor implementations of SDF programs. One of the procedures we present, for example, detects when the objective of one synchronization operation is guaranteed as a side effect of other synchronizations in the system, thus enabling us to eliminate such superfluous synchronization operations. The optimization procedure that we propose can be used as a post-processing step to any static scheduling technique (for example, to any one of the techniques presented in Chapter 5) for reducing synchronization costs in the final implementation. As before, we assume that "good" estimates are available for the execution times of actors and that these execution times rarely display large variations, so that self-timed scheduling is viable for the applications under consideration. If additional timing information is available, such as guaranteed upper and lower bounds on the execution times of actors, it is possible to use this information to further optimize synchronizations in the schedule. However, use of such timing bounds will be left as future work; we mention this again in Chapter 13.
Among the prior work that is most relevant to this chapter is the barrier-MIMD principle of Dietz, Zaafrani, and O'Keefe, which is a combined hardware and software solution to reducing run-time synchronization overhead [DZO92]. In this approach, a shared-memory MIMD computer is augmented with hardware support that allows arbitrary subsets of processors to synchronize precisely with respect to one another by executing a synchronization operation called a barrier. If a subset of processors is involved in a barrier operation, then each processor in this subset will wait at the barrier until all other processors in the subset have reached the barrier. After all processors in the subset have reached the barrier, the corresponding processes resume execution in exact synchrony. In [DZO92], the barrier mechanism is applied to minimize synchronization overhead in a self-timed schedule with hard lower and upper bounds on the task execution times. The execution time ranges are used to detect situations where the earliest possible execution time of a task that requires data from another processor is guaranteed to be later than the latest possible time at which the required data is produced. When such an inference cannot be made, a barrier is instantiated between the sending and receiving processors. In addition to performing the required data synchronization, the barrier resets (to zero) the uncertainty between the relative execution times for the processors that are involved in the barrier, and thus enhances the potential for subsequent timing analysis to eliminate the need for explicit synchronizations.

The techniques of barrier MIMD do not apply to the problem that we address because they assume that a hardware barrier mechanism exists; they assume that tight bounds on task execution times are available; they do not address iterative, self-timed execution, in which the execution of successive iterations of the dataflow graph can overlap; and even for non-iterative execution, there is no obvious correspondence between an optimal solution that uses barrier synchronizations and an optimal solution that employs decoupled synchronization checks at the sender and receiver ends (directed synchronization). This last point is illustrated in Figure 9.1. Here, in the absence of execution time bounds, an optimal application of barrier synchronizations can be obtained by inserting two barriers, each across a pair of actors (one of the two barriers involves $A_4$). This is illustrated in Figure 9.1(c). However, the corresponding collection of directed synchronizations is not sufficient, since it does not guarantee that the data required by one of the actors from another processor is available before that actor begins execution.
In [Sha89], Shaffer presents an algorithm that minimizes the number of directed synchronizations in the self-timed execution of a dataflow graph. However, this work, like that of Dietz et al., does not allow the execution of successive iterations of the dataflow graph to overlap. It also avoids having to consider dataflow edges that have delay. The technique that we discuss in this chapter for removing redundant synchronizations can be viewed as a generalization of Shaffer's algorithm to handle delays and overlapped, iterative execution, and we will discuss this further in Section 9.7. The other major software-based techniques for synchronization optimization that we discuss in this book, namely handling the feedforward edges of the synchronization graph (to be defined in Section 9.5.2), discussed in Section 9.8, and "resynchronization", discussed in Chapters 10 and 11, are fundamentally different from Shaffer's technique since they address issues that are specific to the more general context of overlapped, iterative execution.

As discussed in Chapter 4, a multiprocessor executing a self-timed schedule is one where each processor is assigned a sequential list of actors, some of which are send and receive actors, which it executes in an infinite loop. When a processor executes a communication actor, it synchronizes with the processor(s) it communicates with. Thus, exactly when a processor executes each actor depends on when, at run-time, all input data for that actor is available, unlike the
fully-static case where no such run-time check is needed. In this chapter we use "processor" in slightly general terms: a processor could be a programmable component, in which case the actors mapped to it execute as software entities, or it could be a hardware component, in which case actors assigned to it are implemented and execute in hardware. See [KL93] for a discussion on combined hardware/software synthesis from a single dataflow specification. Examples of application-specific multiprocessors that use programmable processors and some form of static scheduling are described in [+88][Koh90], which were also discussed in Chapter 2.

Inter-processor communication between processors is assumed to take place via shared memory. Thus the sender writes to a particular shared memory location and the receiver reads from that location. The shared memory itself could be global memory between all processors, or it could be distributed between pairs of processors (as hardware FIFO queues or dual-ported memories, for example). Each inter-processor communication edge in an HSDFG translates into a buffer of a certain size in shared memory. Sender-receiver synchronization is also assumed to take place through flags in shared memory. Special hardware for synchronization (semaphores implemented in hardware, etc.) would be prohibitive for the multiprocessor machines and applications that we are considering. Interfaces between hardware and software are typically implemented using memory-mapped registers in the address space of the programmable processor (again a kind of shared memory), and synchronization is achieved using flags that can be tested and set by the programmable component; the same can be done by an interface controller on the hardware side [H+93].

Under the model above, the benefits of synchronization optimization become obvious. Each synchronization that is eliminated directly results in one less synchronization check, or, equivalently, one less shared memory access.
For example, where a processor would have to check a flag in shared memory before executing a receive primitive, eliminating that synchronization implies there is no longer a need for such a check. This translates to one less shared memory read. Such a benefit is especially significant for simplifying interfaces between a programmable component and a hardware component: a send or a receive without the need for synchronization implies that the interface can be implemented in a non-blocking fashion, greatly simplifying the interface controller. As a result, eliminating a synchronization directly results in simpler hardware in this case. Thus the metric for the optimizations we present in this chapter is the total number of accesses to shared memory that are needed for the purpose of synchronization in the final multiprocessor implementation of the self-timed schedule. This metric will be defined precisely in Section 9.6.
We model synchronization in a self-timed implementation using the IPC graph model introduced in the previous chapter. As before, an IPC graph $G_{ipc}(V, E_{ipc})$ is extracted from a given HSDFG $G$ and multiprocessor schedule; Figure 9.2 shows one such example, which we use throughout this chapter. We will find it useful to partition the edges of the IPC graph in the following manner: $E_{ipc} \equiv E_{int} \cup E_{comm}$, where $E_{comm}$ are the communication edges (shown dashed in Figure 9.2(d)) that are directed from the send to the receive actors in $G_{ipc}$, and $E_{int}$ are the "internal" edges that represent the fact that actors assigned to a particular processor (actors internal to that processor) are executed sequentially according to the order predetermined by the self-timed schedule. A communication edge $e \in E_{comm}$ in $G_{ipc}$ represents two functions: 1) reading and writing of data values into the buffer represented by that edge; and 2) synchronization between the sender and the receiver. As mentioned before, we assume the use of shared memory for the purpose of synchronization; the synchronization operation itself must be implemented using some kind of software protocol between the sender and the receiver. We discuss these synchronization protocols shortly.
Recall from Lemma 7.3 that the average iteration period corresponding to a self-timed schedule with an IPC graph $G_{ipc}$ is given by the maximum cycle mean of the graph, $MCM(G_{ipc})$. If we only have execution time estimates available instead of exact values, and we set the execution times of actors $t(v)$ to be equal to these estimated values, then we obtain the estimated iteration period by computing $MCM(G_{ipc})$. Henceforth we will assume that we know the estimated throughput, $1/MCM(G_{ipc})$, calculated by setting the $t(v)$ values to the available timing estimates. In all the transformations that we present in the rest of the chapter, we will preserve the estimated throughput by preserving the maximum cycle mean of $G_{ipc}$, with each $t(v)$ set to the estimated execution time of $v$. In the absence of more precise timing information, this is the best we can hope to do.
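To make the maximum cycle mean concrete, here is a minimal brute-force sketch. It assumes the usual definition (the maximum, over directed cycles, of the sum of actor execution times on the cycle divided by the total delay on the cycle) and enumerates simple cycles explicitly, so it is suitable only for small illustrative graphs. The function name `max_cycle_mean` and the example graph are hypothetical and are not taken from Figure 9.2.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, int]  # (source actor, sink actor, delay)


def max_cycle_mean(exec_time: Dict[str, float], edges: List[Edge]) -> float:
    """Brute-force MCM: max over simple cycles of sum(t(v)) / Delay(C)."""
    succ: Dict[str, List[Tuple[str, int]]] = {v: [] for v in exec_time}
    for u, v, d in edges:
        succ[u].append((v, d))
    order = {v: i for i, v in enumerate(exec_time)}
    best = 0.0

    def dfs(start, node, visited, time_sum, delay_sum):
        nonlocal best
        for nxt, d in succ[node]:
            if nxt == start:
                total_delay = delay_sum + d
                if total_delay == 0:
                    raise ValueError("zero-delay cycle: the graph deadlocks")
                best = max(best, time_sum / total_delay)
            elif nxt not in visited and order[nxt] > order[start]:
                # Only visit vertices "after" start so each cycle is seen once.
                dfs(start, nxt, visited | {nxt},
                    time_sum + exec_time[nxt], delay_sum + d)

    for s in exec_time:
        dfs(s, s, {s}, exec_time[s], 0)
    return best


t = {"A": 2, "B": 3, "C": 1}
g = [("A", "B", 0), ("B", "A", 1), ("B", "C", 0), ("C", "B", 2)]
mcm = max_cycle_mean(t, g)
print(mcm)      # 5.0, from the cycle A -> B -> A
print(1 / mcm)  # estimated throughput, 0.2
```

The cycle through A and B has execution time 5 and delay 1, while the cycle through B and C has execution time 4 and delay 2, so the first cycle determines the estimated iteration period.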
Figure 9.2. Self-timed execution.

In dataflow semantics, the edges between actors represent infinite buffers. Accordingly, the edges of the IPC graph are potentially buffers of infinite size. However, from Lemma 7.1, every feedback edge (an edge that belongs to a strongly connected component, and hence to some cycle) can only have a finite number of tokens at any time during the execution of the IPC graph. We will call this constant the self-timed buffer bound of that edge, and for a feedback edge $e$ we will represent this bound by $B_{fb}(e)$. Lemma 7.1 yields the following self-timed buffer bound:
$$B_{fb}(e) = \min(\{Delay(C) \mid C \text{ is a cycle that contains } e\}) \qquad (9\text{-}1)$$

Feedforward edges (edges that do not belong to any strongly connected component) have no such bound on buffer size; therefore for practical implementations we need to impose a bound on the sizes of these edges. For example, Figure 9.3(a) shows an IPC graph where the communication edge $(s, r)$ could be unbounded when the execution time of the producing actor is less than that of $B$. In practice, we need to bound the buffer size of such an edge; we will denote such an "imposed" bound for a feedforward edge $e$ by $B_{ff}(e)$. Since the effect of placing such a restriction includes "artificially" constraining $src(e)$ from getting more than $B_{ff}(e)$ invocations ahead of $snk(e)$, its effect on the estimated throughput can be modeled by adding to $G_{ipc}$ a reverse edge that has $m$ delays on it, where $m = B_{ff}(e) - delay(e)$ (the grey edge in Figure 9.3(b)). Since the addition of this edge introduces a new cycle in $G_{ipc}$, it has the potential to reduce the estimated throughput; to prevent such a reduction, $B_{ff}(e)$ must be chosen to be large enough so that the maximum cycle mean remains unchanged upon adding the reverse edge with $m$ delays.
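The self-timed buffer bound can be computed without enumerating cycles. The sketch below rests on one observation, stated here as an assumption of the sketch: the minimum-delay cycle through $e$ consists of $e$ itself plus a minimum-delay path from $snk(e)$ back to $src(e)$. Delays are non-negative, so Dijkstra's algorithm applies. The function names and the example graph are hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str, int]  # (src, snk, delay)


def min_delay_path(edges: List[Edge], source: str, target: str) -> Optional[int]:
    """Dijkstra over edge delays; returns None when target is unreachable."""
    succ: Dict[str, List[Tuple[str, int]]] = {}
    for u, v, d in edges:
        succ.setdefault(u, []).append((v, d))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d_u, u = heapq.heappop(heap)
        if u == target:
            return d_u
        if d_u > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, d in succ.get(u, []):
            if d_u + d < dist.get(v, float("inf")):
                dist[v] = d_u + d
                heapq.heappush(heap, (d_u + d, v))
    return None


def self_timed_buffer_bound(edges: List[Edge], e: Edge) -> Optional[int]:
    """B_fb(e) for a feedback edge; None when e lies on no cycle (feedforward)."""
    src, snk, delay = e
    back = min_delay_path(edges, snk, src)
    return None if back is None else delay + back


g = [("A", "B", 0), ("B", "A", 1), ("B", "C", 0), ("C", "B", 2), ("C", "D", 0)]
print(self_timed_buffer_bound(g, ("A", "B", 0)))  # 1
print(self_timed_buffer_bound(g, ("C", "B", 2)))  # 2
print(self_timed_buffer_bound(g, ("C", "D", 0)))  # None: feedforward edge
```

A result of `None` marks exactly the edges for which a bound $B_{ff}(e)$ would have to be imposed.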
Figure 9.3. An IPC graph with a feedforward edge.
Sizing buffers optimally such that the maximum cycle mean remains unchanged has been studied by Kung, Lewis, and Lo in [KLL87], where the authors propose an integer linear programming formulation of the problem, with the number of constraints equal to the number of fundamental cycles in the HSDFG (potentially an exponential number of constraints). An efficient heuristic procedure to determine $B_{ff}$ is to note that if
$$B_{ff}(e) \ge \left\lceil \Big(\sum_{x \in V} t(x)\Big) / MCM(G_{ipc}) \right\rceil$$
holds for each feedforward edge $e$, then the maximum cycle mean of the resulting graph does not exceed $MCM(G_{ipc})$. Then, a binary search on $B_{ff}(e)$ for each feedforward edge, while computing the maximum cycle mean at each search step and ascertaining that it does not exceed $MCM(G_{ipc})$, results in a buffer assignment for the feedforward edges. Although this procedure is efficient, it is suboptimal because the order in which the edges $e$ are chosen is arbitrary and may affect the quality of the final solution. As we will see in Section 9.8, however, imposing such a bound $B_{ff}$ is a naive approach for bounding buffer sizes, because such a bound entails an added synchronization cost. In Section 9.8 we show that there is a better technique for bounding buffer sizes; this technique achieves bounded buffer sizes by transforming the graph into a strongly connected graph by adding a minimal number of additional synchronization edges. Thus, in the final algorithm, it is not in fact necessary to use or compute these bounds $B_{ff}$.
We define two basic synchronization protocols for a communication edge, based on whether or not the length of the corresponding buffer is guaranteed to be bounded from the analysis presented in the previous section. Given an IPC graph $G$ and a communication edge $e$ in $G$, if the length of the corresponding buffer is not bounded (that is, if $e$ is a feedforward edge of $G$), then we apply a synchronization protocol called unbounded buffer synchronization (UBS), which guarantees that (a) an invocation of $snk(e)$ never attempts to read data from an empty buffer; and (b) an invocation of $src(e)$ never attempts to write data into the buffer unless the number of tokens in the buffer is less than some pre-specified limit $B_{ff}(e)$, which is the amount of memory allocated to the buffer as discussed in the previous section. On the other hand, if the topology of the IPC graph guarantees that the buffer length for $e$ is bounded by some value $B_{fb}(e)$ (the self-timed buffer bound of $e$), then we use a simpler protocol, called bounded buffer synchronization (BBS), that only explicitly ensures (a) above. Below, we outline the mechanics of the two synchronization protocols defined so far.

BBS. In this mechanism, a write pointer $wr(e)$ for $e$ is maintained on the processor that executes $src(e)$; a read pointer $rd(e)$ for $e$ is maintained on the processor that executes $snk(e)$; and a copy of $wr(e)$ is maintained in some shared memory location $sv(e)$. The pointers $rd(e)$ and $wr(e)$ are initialized to zero and $delay(e)$, respectively. Just after each execution of $src(e)$, the new data value produced onto $e$ is written into the shared memory buffer for $e$ at offset $wr(e)$; $wr(e)$ is updated by the operation $wr(e) \leftarrow (wr(e) + 1) \bmod B_{fb}(e)$; and $sv(e)$ is updated to contain the new value of $wr(e)$. Just before each execution of $snk(e)$, the value contained in $sv(e)$ is repeatedly examined until it is found to be not equal to $rd(e)$; then the data value residing at offset $rd(e)$ of the shared memory buffer for $e$ is read; and $rd(e)$ is updated by the operation $rd(e) \leftarrow (rd(e) + 1) \bmod B_{fb}(e)$.

UBS. This mechanism also uses the read/write pointers $rd(e)$ and $wr(e)$, and these are initialized the same way; however, rather than maintaining a copy of $wr(e)$ in the shared memory location $sv(e)$, we maintain a count (initialized to $delay(e)$) of the number of unread tokens that currently reside in the buffer. Just after $src(e)$ executes, $sv(e)$ is repeatedly examined until its value is found to be less than $B_{ff}(e)$; then the new data value produced onto $e$ is written into the shared memory buffer for $e$ at offset $wr(e)$; $wr(e)$ is updated as in BBS (except that the new value is not written to shared memory); and the count in $sv(e)$ is incremented. Just before each execution of $snk(e)$, the value contained in $sv(e)$ is repeatedly examined until it is found to be nonzero; then the data value residing at offset $rd(e)$ of the shared memory buffer for $e$ is read; the count in $sv(e)$ is decremented; and $rd(e)$ is updated as in BBS.

Note that we are assuming that there is enough shared memory to hold a separate buffer of size $B_{ff}(e)$ for each feedforward communication edge $e$ of $G_{ipc}$, and a separate buffer of size $B_{fb}(e)$ for each feedback communication edge $e$. When this assumption does not hold, smaller bounds must be imposed on some of the buffers, possibly for feedback edges as well as feedforward edges, and in general this may require some sacrifice in estimated throughput. Note that whenever a buffer bound smaller than $B_{fb}(e)$ is imposed on a feedback edge $e$, a protocol identical to UBS must be used for that edge. The problem of optimally choosing which edges should be subject to stricter buffer bounds when there is a shortage of shared memory, and the selection of these stricter bounds, is an interesting area for further investigation.

An important parameter in an implementation of UBS or BBS is the back-off time $T_b$. If a receiving processor finds that the corresponding IPC buffer is empty, then the processor releases the shared memory bus, and waits $T_b$ time units before requesting the bus again to re-check the shared memory synchronization variable. Similarly, a sending processor that finds the buffer full waits $T_b$ time units between successive accesses of the same synchronization variable. The back-off time can be selected experimentally by simulating the execution of the given synchronization graph (with the available execution time estimates) over a wide range of candidate back-off times, and selecting the back-off time that yields the highest simulated throughput.

As we discussed in the beginning of this chapter, some of the communication edges in $G_{ipc}$ need not have explicit synchronization, whereas others require synchronization, which needs to be implemented using either the UBS protocol or the BBS protocol. All communication edges also represent buffers in shared memory. Thus we divide the set of communication edges as follows: $E_{comm} \equiv E_s \cup E_r$, where the edges $E_s$ need explicit synchronization and the edges $E_r$ need no explicit synchronization. Recall that a communication edge $(v_j, v_i)$ of $G_{ipc}$ represents the synchronization constraint
$$start(v_i, k) \ge end(v_j, k - delay((v_j, v_i))) \quad \forall k > delay((v_j, v_i)).$$
Thus, before we perform any optimization on synchronizations, $E_s \equiv E_{comm}$ and $E_r \equiv \emptyset$, because every communication edge represents a synchronization point. However, in the following sections we describe how we can move certain edges from $E_s$ to $E_r$, thus reducing synchronization operations in the final implementation. After all synchronization optimizations have been applied, the communication edges of the IPC graph fall into either $E_s$ or $E_r$. At this point the edges $E_s \cup E_r$ in $G_{ipc}$ represent buffer activity, and must be implemented as buffers in shared memory, whereas the edges $E_s$ represent synchronization constraints, and are implemented using the UBS and BBS protocols introduced in the previous section. For the edges in $E_s$, the synchronization protocol is executed before the buffers corresponding to the communication edge are accessed, so as to ensure sender-receiver synchronization. For edges in $E_r$, however, no synchronization needs to be done before accessing the shared buffer. Sometimes we will also find it useful to introduce synchronization edges without actually communicating data between the sender and the receiver (for the purpose of ensuring finite buffers, for example), so that no shared buffers need to be assigned to these edges, but the corresponding synchronization protocol is invoked for these edges.
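The BBS and UBS protocols outlined in the previous section can be sketched in software as follows. This is a single-threaded illustration under several assumptions that are not part of the text: the class and method names are hypothetical; a failed check returns immediately instead of backing off for $T_b$ time units; the BBS buffer is given one spare slot so that equal read and write pointers always mean "empty"; and in a real multiprocessor the UBS count would need atomic updates, since both sides modify it. The `sync_accesses` counter tallies reads and writes of the shared synchronization variable.

```python
class BBSEdge:
    """Bounded buffer synchronization for a feedback edge (sketch)."""

    def __init__(self, bound: int, delay: int = 0):
        # Assumption of this sketch: one spare slot beyond B_fb(e), so that
        # wr == rd can only mean that the buffer is empty.
        self.slots = bound + 1
        self.buf = [None] * self.slots
        self.rd = 0             # read pointer, local to the receiver
        self.wr = delay         # write pointer, local to the sender
        self.sv = self.wr       # shared copy of the write pointer
        self.sync_accesses = 0

    def send(self, value) -> None:
        # No "buffer full" check: the graph topology guarantees the bound.
        self.buf[self.wr] = value
        self.wr = (self.wr + 1) % self.slots
        self.sv = self.wr       # one synchronization access (write)
        self.sync_accesses += 1

    def try_receive(self):
        self.sync_accesses += 1  # one synchronization access (read)
        if self.sv == self.rd:
            return False, None   # nothing to read yet; caller retries later
        value = self.buf[self.rd]
        self.rd = (self.rd + 1) % self.slots
        return True, value


class UBSEdge:
    """Unbounded buffer synchronization for a feedforward edge (sketch)."""

    def __init__(self, limit: int, delay: int = 0):
        self.limit = limit       # imposed bound B_ff(e)
        self.buf = [None] * limit
        self.rd = 0
        self.wr = delay % limit
        self.sv = delay          # shared count of unread tokens
        self.sync_accesses = 0

    def try_send(self, value) -> bool:
        self.sync_accesses += 1  # read the count
        if self.sv >= self.limit:
            return False         # buffer full; caller retries later
        self.buf[self.wr] = value
        self.wr = (self.wr + 1) % self.limit
        self.sv += 1             # write the count
        self.sync_accesses += 1
        return True

    def try_receive(self):
        self.sync_accesses += 1  # read the count
        if self.sv == 0:
            return False, None
        value = self.buf[self.rd]
        self.rd = (self.rd + 1) % self.limit
        self.sv -= 1             # write the count
        self.sync_accesses += 1
        return True, value


bbs = BBSEdge(bound=2)
bbs.send("x")
print(bbs.try_receive(), bbs.sync_accesses)  # (True, 'x') 2

ubs = UBSEdge(limit=1)
print(ubs.try_send("a"))   # True
print(ubs.try_send("b"))   # False: the buffer already holds B_ff(e) tokens
print(ubs.try_receive())   # (True, 'a')
```

When no retries occur, one token transfer touches the synchronization variable twice under BBS and four times under UBS, which is why UBS is the more expensive protocol.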
All optimizations that move edges from $E_s$ to $E_r$ must respect the synchronization constraints implied by $G_{ipc}$. If we ensure this, then we only need to implement the synchronization edges $E_s$. We call the graph $G_s = (V, E_{int} \cup E_s)$ the synchronization graph. The graph $G_s$ represents the synchronization constraints that must be ensured, and the algorithms we present for minimizing synchronization costs operate on $G_s$. Before any synchronization-related optimizations are performed, $G_s \equiv G_{ipc}$, because $E_{comm} \equiv E_s$ at this stage, but as we move communication edges from $E_s$ to $E_r$, $G_s$ has fewer and fewer edges. Thus moving edges from $E_s$ to $E_r$ can be viewed as removal of edges from $G_s$. Whenever we remove edges from $G_s$ we have to ensure, of course, that the synchronization graph $G_s$ at that step respects all the synchronization constraints of $G_{ipc}$, because we only implement synchronizations represented by the edges $E_s$ in $G_s$. The following theorem is useful to formalize the concept of when the synchronization constraints represented by one synchronization graph $G_s^1$ imply the synchronization constraints of another graph $G_s^2$. This theorem provides a useful constraint for synchronization optimization, and it underlies the validity of the main techniques that we will present in this chapter.

Theorem 9.1: The synchronization constraints in a synchronization graph $G_s^1 = (V, E_{int} \cup E_s^1)$ imply the synchronization constraints of the synchronization graph $G_s^2 = (V, E_{int} \cup E_s^2)$ if the following condition holds:
$$\forall \varepsilon \text{ s.t. } \varepsilon \in E_s^2,\ \varepsilon \notin E_s^1: \quad \rho_{G_s^1}(src(\varepsilon), snk(\varepsilon)) \le delay(\varepsilon);$$
that is, if for each edge $\varepsilon$ that is present in $G_s^2$ but not in $G_s^1$ there is a minimum delay path from $src(\varepsilon)$ to $snk(\varepsilon)$ in $G_s^1$ that has total delay of at most $delay(\varepsilon)$.

(Note that since the vertex sets for the two graphs are identical, it is meaningful to refer to $src(\varepsilon)$ and $snk(\varepsilon)$ as being vertices of $G_s^1$ even though there are edges $\varepsilon$ s.t. $\varepsilon \in E_s^2$, $\varepsilon \notin E_s^1$.)
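First we prove the following lemma.

Lemma: If there is a path $p = (e_1, e_2, e_3, \ldots, e_n)$ in $G_s^1$, then
$$start(snk(e_n), k) \ge end(src(e_1), k - Delay(p)).$$

Proof of Lemma: The following constraints hold along such a path $p$ (as per (4-1)):
$$start(snk(e_1), k) \ge end(src(e_1), k - delay(e_1)).$$
Similarly,
$$start(snk(e_2), k) \ge end(src(e_2), k - delay(e_2)).$$
Noting that $src(e_2)$ is the same as $snk(e_1)$, we get
$$start(snk(e_2), k) \ge end(snk(e_1), k - delay(e_2)).$$
Causality implies $end(v, k) \ge start(v, k)$, so we get
$$start(snk(e_2), k) \ge start(snk(e_1), k - delay(e_2)).$$
Substituting the constraint for $e_1$ into this relation,
$$start(snk(e_2), k) \ge end(src(e_1), k - delay(e_2) - delay(e_1)).$$
Continuing along $p$ in this manner, it can easily be verified that
$$start(snk(e_n), k) \ge end(src(e_1), k - delay(e_n) - delay(e_{n-1}) - \cdots - delay(e_1)),$$
that is,
$$start(snk(e_n), k) \ge end(src(e_1), k - Delay(p)).$$
QED.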
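Proof of Theorem 9.1: If $\epsilon \in E_s^2$ and $\epsilon \in E_s^1$, then the synchronization constraint due to the edge $\epsilon$ holds in both graphs. But for each $\epsilon$ s.t. $\epsilon \in E_s^2$, $\epsilon \notin E_s^1$, we need to show that the constraint due to $\epsilon$,
$start(snk(\epsilon), k) \ge end(src(\epsilon), k - delay(\epsilon))$,
holds in $G_s^1$ provided $\rho_{G_s^1}(src(\epsilon), snk(\epsilon)) \le delay(\epsilon)$, which implies there is at least one path $p = (e_1, e_2, \ldots, e_n)$ from $src(\epsilon)$ to $snk(\epsilon)$ in $G_s^1$ (that is, $src(e_1) = src(\epsilon)$ and $snk(e_n) = snk(\epsilon)$) such that $Delay(p) \le delay(\epsilon)$.
From the Lemma, existence of such a path $p$ implies
$start(snk(e_n), k) \ge end(src(e_1), k - Delay(p))$,
that is,
$start(snk(\epsilon), k) \ge end(src(\epsilon), k - Delay(p))$.
If $Delay(p) \le delay(\epsilon)$, then
$end(src(\epsilon), k - Delay(p)) \ge end(src(\epsilon), k - delay(\epsilon))$.
Substituting this in the preceding inequality, we get
$start(snk(\epsilon), k) \ge end(src(\epsilon), k - delay(\epsilon))$.
The above relation is identical to the constraint due to $\epsilon$, and this proves the Theorem. QED.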
The above theorem motivates the following definition.

Definition 9.1: If $G_s^1 = (V, E_{int} \cup E_s^1)$ and $G_s^2 = (V, E_{int} \cup E_s^2)$ are synchronization graphs with the same vertex-set, we say that $G_s^1$ preserves $G_s^2$ if $\forall \epsilon$ s.t. $\epsilon \in E_s^2$, $\epsilon \notin E_s^1$, we have $\rho_{G_s^1}(src(\epsilon), snk(\epsilon)) \le delay(\epsilon)$.

Thus, Theorem 9.1 states that the synchronization constraints of $(V, E_{int} \cup E_s^1)$ imply the synchronization constraints of $(V, E_{int} \cup E_s^2)$ if $(V, E_{int} \cup E_s^1)$ preserves $(V, E_{int} \cup E_s^2)$.

Given an IPC graph $G_{ipc}$, and a synchronization graph $G_s$ such that $G_s$ preserves $G_{ipc}$, suppose we implement the synchronizations corresponding to the synchronization edges of $G_s$. Then, the iteration period of the resulting system is determined by the maximum cycle mean of $G_s$ ($MCM(G_s)$). This is because the synchronization edges alone determine the interaction between processors; a communication edge without synchronization does not constrain the execution of the corresponding processors in any way.
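The preservation test of Definition 9.1 reduces to a shortest-path query in which edge delays act as weights. The following is a minimal Python sketch of that check, not code from the text: edges are represented as hypothetical `(src, snk, delay)` tuples, `min_path_delay` is an illustrative Dijkstra helper, and synchronization edges are assumed to join distinct vertices.

```python
import heapq
from math import inf

def min_path_delay(edges, source):
    """Dijkstra over non-negative delays; returns {vertex: minimum path delay}."""
    adj = {}
    for u, v, d in edges:
        adj.setdefault(u, []).append((v, d))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist.get(u, inf):
            continue  # stale heap entry
        for v, d in adj.get(u, []):
            if du + d < dist.get(v, inf):
                dist[v] = du + d
                heapq.heappush(heap, (du + d, v))
    return dist

def preserves(e_int, e_s1, e_s2):
    """True if (V, e_int + e_s1) preserves (V, e_int + e_s2) per Definition 9.1."""
    g1 = list(e_int) + list(e_s1)
    present = set(e_s1)
    for (u, v, d) in e_s2:
        if (u, v, d) in present:
            continue  # the constraint is implemented directly in the first graph
        # Assumption: u != v, so the empty path is never used as a witness.
        if min_path_delay(g1, u).get(v, inf) > d:
            return False
    return True

e_int = [("A", "B", 0), ("B", "C", 0)]
print(preserves(e_int, [], [("A", "C", 0)]))  # True: path A->B->C has delay 0
print(preserves(e_int, [], [("C", "A", 1)]))  # False: no path from C to A
```

Running one Dijkstra search per missing edge keeps the sketch short; a single all-pairs computation, as used later in the chapter, would be preferable when many edges are tested.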
We refer to each access of the shared memory "synchronization variable" of an edge $e$ by $src(e)$ and $snk(e)$ as a synchronization access to shared memory. If synchronization for $e$ is implemented using UBS, then we see that on average, 4 synchronization accesses are required for $e$ in each iteration period, while BBS implies 2 synchronization accesses per iteration period. We define the synchronization cost of a synchronization graph $G_s$ to be the average number of synchronization accesses required per iteration period. Thus, if $n_{ff}$ denotes the number of synchronization edges in $G_s$ that are feedforward edges, and $n_{fb}$ denotes the number of synchronization edges that are feedback edges, then the synchronization cost of $G_s$ can be expressed as $(4 n_{ff} + 2 n_{fb})$. In the remainder of this chapter, we develop techniques that apply the results and the analysis framework developed in the previous sections to minimize the synchronization cost of a self-timed implementation of an HSDFG without sacrificing the integrity of any inter-processor data transfer or reducing the estimated throughput.

Note that in the measure defined above of the number of shared memory accesses required for synchronization, some accesses to shared memory are not taken into account. In particular, the "synchronization cost" metric does not consider accesses to shared memory that are performed while the sink actor is waiting for the required data to become available, or the source actor is waiting for an "empty slot" in the buffer. The number of accesses required to perform these "busy-wait" or "spin-lock" operations is dependent on the exact relative execution times of the actor invocations. Since in the problem context under consideration, this information is not generally available to us, the best case number of accesses (the number of shared memory accesses required for synchronization assuming that IPC data on an edge is always produced before the corresponding sink invocation attempts to execute) is used as an approximation.

In the remainder of this chapter, we discuss two mechanisms for reducing synchronization accesses. The first (presented in Section 9.7) is the detection and removal of redundant synchronization edges, which are synchronization edges whose respective synchronization functions are subsumed by other synchronization edges, and thus need not be implemented explicitly. This technique essentially detects the set of edges that can be moved from the set $E_s$ to the set $E_r$. In Section 9.8, we examine the utility of adding additional synchronization edges to convert a synchronization graph that is not strongly connected into a strongly connected graph. Such a conversion allows us to implement all synchronization edges with BBS. We address optimization criteria in performing such a conversion, and we will show that the extra synchronization accesses required for such a conversion are always (at least) compensated by the number of synchronization accesses that are saved by the more expensive UBS synchronizations that are converted to BBS synchronizations. Chapters 10 and 11 discuss a mechanism, called resynchronization, for inserting synchronization edges in a way that the number of original synchronization edges that become redundant exceeds the number of new edges added.
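The synchronization cost metric defined above can be evaluated mechanically once each synchronization edge is classified. The sketch below is illustrative only and makes one assumption explicit: an edge is treated as a feedback edge when its sink can reach its source (so the edge lies in a strongly connected component), and as a feedforward edge otherwise. The names `reachable` and `synchronization_cost` and the `(src, snk)` edge pairs are hypothetical.

```python
def reachable(adj, start, target):
    """Depth-first search: True if target can be reached from start."""
    stack, seen = [start], {start}
    while stack:
        u = stack.pop()
        if u == target:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False

def synchronization_cost(internal_edges, sync_edges):
    """Return 4*n_ff + 2*n_fb for the given synchronization graph."""
    adj = {}
    for u, v in list(internal_edges) + list(sync_edges):
        adj.setdefault(u, []).append(v)
    # A synchronization edge (u, v) is a feedback edge if v can reach u.
    n_fb = sum(1 for u, v in sync_edges if reachable(adj, v, u))
    n_ff = len(sync_edges) - n_fb
    return 4 * n_ff + 2 * n_fb

# Two processors, each executing two actors in a loop.
internal = [("A", "B"), ("B", "A"), ("C", "D"), ("D", "C")]
print(synchronization_cost(internal, [("B", "C")]))              # 4 (one UBS edge)
print(synchronization_cost(internal, [("B", "C"), ("D", "A")]))  # 4 (two BBS edges)
```

In this small example, adding a second synchronization edge does not raise the cost, because both edges then lie on a cycle and can use the cheaper protocol.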
The first technique that we explore for reducing synchronization overhead is removal of redundant synchronization edges from the synchronization graph, i.e., finding a minimal set of edges $E_s$ that need explicit synchronization.

Definition: A synchronization edge is redundant in a synchronization graph $G$ if its removal yields a synchronization graph that preserves $G$. Equivalently, from Definition 9.1, a synchronization edge $e$ is redundant in the synchronization graph $G$ if there is a path $p \ne (e)$ in $G$ directed from $src(e)$ to $snk(e)$ such that $Delay(p) \le delay(e)$. The synchronization graph $G$ is reduced if $G$ contains no redundant synchronization edges.

Thus, the synchronization function associated with a redundant synchronization edge "comes for free" as a by-product of other synchronizations. Figure 9.4 shows an example of a redundant synchronization edge. Here, before executing actor $D$, the processor that executes $\{A, B, C, D\}$ does not need to synchronize with the processor that executes $\{E, F, G, H\}$ because, due to the synchronization edge $x_1$, the corresponding invocation of $F$ is guaranteed to complete before each invocation of $D$ is begun. Thus, $x_2$ is redundant in Figure 9.4 and can be removed from $E_s$ into the set $E_r$. It is easily verified that the path $p = ((F, G), (G, H), x_1)$ is directed from $src(x_2)$ to $snk(x_2)$, and has a path delay (zero) that is equal to the delay on $x_2$.
In this section, we discuss an efficient algorithm to optimally remove redundant synchronization edges from a synchronization graph.

The following theorem establishes that the order in which we remove redundant synchronization edges is not important; therefore all the redundant synchronization edges can be removed together.
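Theorem 9.2: Suppose that $G_s = (V, E_{int} \cup E_s)$ is a synchronization graph, $e_1$ and $e_2$ are distinct redundant synchronization edges in $G_s$ (i.e., these are edges that could be individually moved to $E_r$), and $\tilde{G}_s = (V, E_{int} \cup (E_s - \{e_1\}))$. Then $e_2$ is redundant in $\tilde{G}_s$. Thus both $e_1$ and $e_2$ can be moved into $E_r$ together.

Proof: Since $e_2$ is redundant in $G_s$, there is a path $p \ne (e_2)$ in $G_s$ directed from $src(e_2)$ to $snk(e_2)$ such that

$Delay(p) \le delay(e_2)$. (9-8)

Similarly, there is a path $p' \ne (e_1)$, contained in $G_s$, that is directed from $src(e_1)$ to $snk(e_1)$, and that satisfies

$Delay(p') \le delay(e_1)$. (9-9)

Now, if $p$ does not contain $e_1$, then $p$ exists in $\tilde{G}_s$, and we are done. Otherwise, let $p' = (x_1, x_2, \ldots, x_n)$; observe that $p$ is of the form

$p = (y_1, y_2, \ldots, y_{k-1}, e_1, y_k, y_{k+1}, \ldots, y_m)$;

and define

$p'' \equiv (y_1, y_2, \ldots, y_{k-1}, x_1, x_2, \ldots, x_n, y_k, y_{k+1}, \ldots, y_m)$.

Clearly, $p''$ is a path from $src(e_2)$ to $snk(e_2)$ in $\tilde{G}_s$. Also,

$Delay(p'') = Delay(p) - delay(e_1) + Delay(p') \le Delay(p)$ (from (9-9)) $\le delay(e_2)$ (from (9-8)). QED.

Figure 9.4. $x_2$ is an example of a redundant synchronization edge.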
Theorem 9.2 tells us that we can avoid implementing synchronization for all redundant synchronization edges, since the "redundancies" are not interdependent. Thus, an optimal removal of redundant synchronizations can be obtained by applying a straightforward algorithm that successively tests the synchronization edges for redundancy in some arbitrary sequence, and since computing the weight of the shortest path in a weighted directed graph is a tractable problem, we can expect such a solution to be practical.
Figure 9.5 presents an efficient algorithm, based on the ideas presented in the previous subsection, for optimal removal of redundant synchronization edges. In this algorithm, we first compute the path delay of a minimum-delay path from $x$ to $y$ for each ordered pair of vertices $(x, y)$; here, we assign a path delay of $\infty$ whenever there is no path from $x$ to $y$. This computation is equivalent to solving an instance of the well known all-pairs shortest path problem (see Section 3.13). Then, we examine each synchronization edge $e$, in some arbitrary sequence, and determine whether or not there is a path from $src(e)$ to $snk(e)$ that does not contain $e$, and that has a path delay that does not exceed $delay(e)$. This check for redundancy is equivalent to the check that is performed by the test in RemoveRedundantSynchs, because if $p$ is a path from $src(e)$ to $snk(e)$ that contains more than one edge and that contains $e$, then $p$ must contain a cycle $c$ such that $c$ does not contain $e$; and since all cycles must have positive path delay (see Chapter 7), the path delay of such a path $p$ must exceed $delay(e)$. Thus, if an edge satisfies the inequality in the test of RemoveRedundantSynchs, and $p^*$ is the corresponding minimum-delay path, then $p^*$ cannot contain $e$. This observation allows us to avoid having to recompute the shortest paths after removing a candidate redundant edge from $G_s$.

From the definition of a redundant synchronization edge, it is easily verified that the removal of a redundant synchronization edge does not alter any of the minimum-delay path values (path delays). That is, given a redundant synchronization edge $e_r$ in $G_s$, and two arbitrary vertices $x, y \in V$, if we let $\hat{G}_s = (V, E_{int} \cup (E_s - \{e_r\}))$, then $\rho_{\hat{G}_s}(x, y) = \rho_{G_s}(x, y)$. Thus, none of the minimum-delay path values computed in Step 1 need to be recalculated after removing a redundant synchronization edge in Step 3.

Observe that the complexity of the function RemoveRedundantSynchs is dominated by Step 1 and Step 3. Since all edge delays are non-negative, we can repeatedly apply Dijkstra's single-source shortest path algorithm (once for each vertex) to carry out Step 1 in $O(|V|^3)$ time; we discussed Dijkstra's algorithm in Section 3.13. A modification of Dijkstra's algorithm can be used to reduce the complexity of Step 1 to $O(|V|^2 \log_2(|V|) + |V||E|)$ [CLR92]. In Step 3, $|E|$ is an upper bound for the number of synchronization edges, and in the worst case, each vertex has an edge connecting it to every other member of $V$. Thus, the time-complexity of Step 3 is $O(|V||E|)$, and if we use the modification to Dijkstra's algorithm mentioned above for Step 1, then the time-complexity of RemoveRedundantSynchs is

$O(|V|^2 \log_2(|V|) + |V||E| + |V||E|) = O(|V|^2 \log_2(|V|) + |V||E|)$.

Function RemoveRedundantSynchs
Input: A synchronization graph $G_s = (V, E_{int} \cup E_s)$.
Output: The synchronization graph $G_s^* = (V, E_{int} \cup (E_s - E_r))$.
Figure 9.5. An algorithm that optimally removes redundant synchronization edges.
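The procedure described above (one all-pairs minimum-delay computation followed by one pass over the synchronization edges) can be sketched in Python as follows. This is an illustrative reading of the description, not a transcription of Figure 9.5: the `(src, snk, delay)` tuples, the function names, and the specific test over the other output edges of $src(e)$ are assumptions, as is the convention that the minimum path delay from a vertex to itself is 0. The sketch also assumes every cycle has positive delay and that no two edges are exact duplicates.

```python
import heapq
from math import inf

def all_pairs_min_delay(vertices, edges):
    """Step 1: run Dijkstra from every vertex; rho[x][y] is inf if no path exists."""
    adj = {v: [] for v in vertices}
    for u, v, d in edges:
        adj[u].append((v, d))
    rho = {}
    for s in vertices:
        dist = {v: inf for v in vertices}
        dist[s] = 0
        heap = [(0, s)]
        while heap:
            du, u = heapq.heappop(heap)
            if du > dist[u]:
                continue  # stale heap entry
            for v, d in adj[u]:
                if du + d < dist[v]:
                    dist[v] = du + d
                    heapq.heappush(heap, (du + d, v))
        rho[s] = dist
    return rho

def remove_redundant_synchs(vertices, e_int, e_s):
    """Return the synchronization edges that still need explicit synchronization."""
    edges = set(e_int) | set(e_s)
    rho = all_pairs_min_delay(vertices, edges)
    out = {v: [] for v in vertices}
    for e in edges:
        out[e[0]].append(e)
    e_r = set()                      # Step 2: redundant edges found so far
    for e in set(e_s):               # Step 3: test each synchronization edge once
        src, snk, delay = e
        for e_o in out[src]:
            # Is there a path that leaves src(e) on another edge and is short enough?
            if e_o != e and e_o[2] + rho[e_o[1]][snk] <= delay:
                e_r.add(e)
                break
    return set(e_s) - e_r

V = list("ABCDEFGH")
E_INT = [("A", "B", 0), ("B", "C", 0), ("C", "D", 0), ("D", "A", 1),
         ("E", "F", 0), ("F", "G", 0), ("G", "H", 0), ("H", "E", 1)]
E_S = [("B", "E", 0), ("A", "G", 0)]
print(remove_redundant_synchs(V, E_INT, E_S))  # {('B', 'E', 0)}
```

Here the edge from A to G is subsumed by the zero-delay path A, B, E, F, G, so only the edge from B to E needs an explicit synchronization. The minimum-delay values are computed once and never updated, which is exactly what the observation about unchanged path delays permits.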
In [Sha89], Shaffer presents an algorithm that minimizes the number of directed synchronizations in the self-timed execution of an HSDFG under the (implicit) assumption that the execution of successive iterations of the HSDFG are not allowed to overlap. In Shaffer's technique, a construction identical to the synchronization graph is used except that there is no feedback edge connecting the last actor executed on a processor to the first actor executed on the same processor, and edges that have delay are ignored since only intra-iteration dependencies are significant. Thus, Shaffer's synchronization graph is acyclic. RemoveRedundantSynchs can be viewed as an extension of Shaffer's algorithm to handle self-timed, iterative execution of an HSDFG; Shaffer's algorithm accounts for self-timed execution only within a graph iteration, and in general, it can be applied to iterative dataflow programs only if all processors are forced to synchronize between graph iterations.
In this subsection, we illustrate the benefits of removing redundant synchronizations through a practical example. Figure 9.6(a) shows an abstraction of a three channel, multi-resolution quadrature mirror filter (QMF) bank, which has applications in signal compression [Vai93]. This representation is based on the general (not homogeneous) SDF model, and accordingly, each edge is annotated with the number of tokens produced and consumed by its source and sink actors. Actors $A$ and $F$ represent the subsystems that, respectively, supply and consume data to/from the filter bank system; $B$ and $C$ each represents a parallel combination of decimating high and low pass FIR analysis filters; $D$ and $E$ represent the corresponding pairs of interpolating synthesis filters. The amount of delay on the edge directed from $B$ to $E$ is equal to the sum of the filter orders of $C$ and $D$. For more details on the application represented by Figure 9.6(a), we refer the reader to [Vai93].

To construct a periodic parallel schedule, we must first determine the number of times that each actor must be invoked in the periodic schedule, as described in Section 3.6. Next, we must determine the precedence relationships between the actor invocations. In determining the exact precedence relationships, we must take into account the dependence of a given filter invocation on not only the invocation that produces the token that is "consumed" by the filter, but also on the invocations that produce the $n$ preceding tokens, where $n$ is the order of the filter. Such dependence can easily be evaluated with an additional dataflow parameter on each actor input that specifies the number of past tokens that are accessed [Pri91].¹ Using this information, together with the invocation counts, we obtain the precedence relationships specified by the graph of Figure 9.6(b), in which the $i$th invocation of actor $N$ is labeled $N_i$, and each edge $e$ specifies that invocation $snk(e)$ requires data produced by invocation $src(e)$ $delay(e)$ iteration periods after the iteration period in which the data is produced.

Figure 9.6. (a) A multi-resolution QMF filter bank used to illustrate the benefits of removing redundant synchronizations. (b) The precedence graph for (a). (c) A self-timed, two-processor, parallel schedule for (a). (d) The initial synchronization graph for (c).
A self-timed schedule for Figure 9.6(b) that can be obtained from Hu's list scheduling method [Hu61] (described in Section 5.3.2) is specified in Figure 9.6(c), and the synchronization graph that corresponds to the IPC graph of Figure 9.6(b) and Figure 9.6(c) is shown in Figure 9.6(d). All of the dashed edges in Figure 9.6(d) are synchronization edges. If we apply Shaffer's method, which considers only those synchronization edges that do not have delay, we can eliminate the need for explicit synchronization along only one of the 8 synchronization edges (an edge directed to $B_2$). In contrast, if we apply RemoveRedundantSynchs, we can detect the redundancy of this edge as well as four additional redundant synchronization edges (two directed from $A_3$ and $A_4$ to invocations of $B$, and two directed from invocations of $B$). Thus, RemoveRedundantSynchs reduces the number of synchronizations from 8 down to 3, a reduction of 62%. Figure 9.7 shows the synchronization graph of Figure 9.6(d) after all redundant synchronization edges are removed. It is easily verified that the synchronization edges that remain in this graph are not redundant; explicit synchronizations need only be implemented for these edges.
Figure 9.7. The synchronization graph of Figure 9.6(d) after all redundant synchronization edges are removed.

¹ It should be noted that some SDF-based design environments choose to forgo parallelization across multiple invocations of an actor in favor of simplified code generation and scheduling. For example, in the GRAPE system, this restriction has been justified on the grounds that it simplifies inter-processor data management, reduces code duplication, and allows the derivation of efficient scheduling algorithms that operate directly on general SDF graphs without requiring the use of the acyclic precedence graph (APG) [BELP94].

In Section 9.5.1, we defined two different synchronization protocols: bounded buffer synchronization (BBS), which has a cost of 2 synchronization accesses per iteration period, and can be used whenever the associated edge is contained in a strongly connected component of the synchronization graph; and unbounded buffer synchronization (UBS), which has a cost of 4 synchronization accesses per iteration period. We pay the additional overhead of UBS whenever the associated edge is a feedforward edge of the synchronization graph.

One alternative to implementing UBS for a feedforward edge $e$ is to add synchronization edges to the synchronization graph so that $e$ becomes encapsulated in a strongly connected component; such a transformation would allow $e$ to be implemented with BBS. However, extra synchronization accesses will be required to implement the new synchronization edges that are inserted. In this section, we show that by adding synchronization edges through a certain simple procedure, the synchronization graph can be transformed into a strongly connected graph in a way that the overhead of implementing the extra synchronization edges is always compensated by the savings attained by being able to avoid the use of UBS. That is, the conversion to a strongly connected synchronization graph ensures that the total number of synchronization accesses required (per iteration period) for the transformed graph is less than or equal to the number of synchronization accesses required for the original synchronization graph. Through a practical example, we show that this transformation can significantly reduce the number of required synchronization accesses. Also, we discuss a technique to compute the delay that should be added to each of the new edges added in the conversion to a strongly connected graph. This technique computes the delays in a way that the estimated throughput of the IPC graph is preserved with minimal increase in the shared memory storage cost required to implement the communication edges.
Figure 9.8 presents an efficient algorithm for transforming a synchronization graph that is not strongly connected into a strongly connected graph. This algorithm simply "chains together" the source SCCs, and similarly, chains together the sink SCCs. The construction is completed by connecting the first SCC of the "source chain" to the last SCC of the sink chain with an edge that we call the sink-source edge. From each source or sink SCC, the algorithm selects a vertex that has minimum execution time to be the chain "link" corresponding to that SCC. Minimum execution time vertices are chosen in an attempt to minimize the amount of delay that must be inserted on the new edges to preserve the estimated throughput of the original graph. In Section 9.9, we discuss the selection of delays for the edges introduced by Convert-to-SC-graph.

Function Convert-to-SC-graph
Input: A synchronization graph $G$ that is not strongly connected.
Output: A strongly connected graph obtained by adding edges between the SCCs of $G$.
Generate an ordering $C_1, C_2, \ldots, C_m$ of the source SCCs of $G$, and similarly, generate an ordering $D_1, D_2, \ldots, D_n$ of the sink SCCs of $G$.
Select a vertex $v_1 \in C_1$ that minimizes $t(*)$ over $C_1$.
For $i = 2, 3, \ldots, m$
    Select a vertex $v_i \in C_i$ that minimizes $t(*)$ over $C_i$.
    Instantiate the edge $d_0(v_{i-1}, v_i)$.
End For
Select a vertex $w_1 \in D_1$ that minimizes $t(*)$ over $D_1$.
For $i = 2, 3, \ldots, n$
    Select a vertex $w_i \in D_i$ that minimizes $t(*)$ over $D_i$.
    Instantiate the edge $d_0(w_{i-1}, w_i)$.
End For
Instantiate the edge $d_0(w_n, v_1)$.
Figure 9.8. An algorithm for converting a synchronization graph that is not strongly connected into a strongly connected graph.

It is easily verified that algorithm Convert-to-SC-graph always produces a strongly connected graph, and that a conversion to a strongly connected graph cannot be attained by adding fewer edges than the number of edges added by Convert-to-SC-graph. Figure 9.9 illustrates a possible solution obtained by algorithm Convert-to-SC-graph. Here, the black dashed edges are the synchronization edges contained in the original synchronization graph, and the grey dashed edges are the edges that are added by Convert-to-SC-graph. The dashed edge labeled $e_s$ is the sink-source edge.

Figure 9.9. An illustration of a possible solution obtained by algorithm Convert-to-SC-graph.

Assuming the synchronization graph is connected, the number of feedforward edges $n_f$ must satisfy $(n_f \ge (n_c - 1))$, where $n_c$ is the number of SCCs. This follows from the fundamental graph theoretic fact that in a connected graph $(V^*, E^*)$, $|E^*|$ must be at least $(|V^*| - 1)$. Now, it is easily verified that the number of new edges introduced by Convert-to-SC-graph is equal to $(n_{src} + n_{snk} - 1)$, where $n_{src}$ is the number of source SCCs, and $n_{snk}$ is the number of sink SCCs. Thus, the number of synchronization accesses per iteration period, $S_+$, that is required to implement the edges introduced by Convert-to-SC-graph is $(2 \times (n_{src} + n_{snk} - 1))$, while the number of synchronization accesses, $S_-$, eliminated by Convert-to-SC-graph (by allowing the feedforward edges of the original synchronization graph to be implemented with BBS) equals $2 n_f$. It follows that the net change $(S_+ - S_-)$ in the number of synchronization accesses satisfies

$(S_+ - S_-) = 2(n_{src} + n_{snk} - 1) - 2 n_f \le 2(n_c - 1 - n_f) \le 2(n_c - 1 - (n_c - 1))$,

and thus, $(S_+ - S_-) \le 0$. We have established the following result.

Theorem 9.3: Suppose that $G$ is a synchronization graph, and $\hat{G}$ is the graph that results from applying algorithm Convert-to-SC-graph to $G$. Then the synchronization cost of $\hat{G}$ is less than or equal to the synchronization cost of $G$.

For example, without the edges added by Convert-to-SC-graph (the dashed grey edges) in Figure 9.9, there are 6 feedforward edges, which require $4 \times 6 = 24$ synchronization accesses per iteration period to implement. The addition of the dashed edges requires 8 synchronization accesses to implement these new edges, but allows us to use BBS for the original feedforward edges, which leads to a savings of $2 \times 6 = 12$ synchronization accesses for the original feedforward edges. Thus, the net effect achieved by Convert-to-SC-graph in this example is a reduction of the total number of synchronization accesses by $(12 - 8) = 4$.

As another example, consider Figure 9.10, which shows the synchronization graph topology (after redundant synchronization edges are removed) that results from a four-processor schedule of a synthesizer for plucked-string musical instruments in seven voices based on the Karplus-Strong technique. This algorithm was discussed in Chapter 3, as an example application that was implemented on the ordered memory access architecture prototype. This graph contains 6 synchronization edges (the dashed edges), all of which are feedforward edges, so the synchronization cost is $4 \times 6 = 24$ synchronization accesses per iteration period. Since the graph has one source SCC and one sink SCC, only one edge is added by Convert-to-SC-graph, and adding this edge reduces the synchronization cost to $2 \times 6 + 2 = 14$, a savings of about 42%. Figure 9.11 shows the topology of a possible solution computed by Convert-to-SC-graph on this example. Here, the dashed edges represent the synchronization edges in the synchronization graph returned by Convert-to-SC-graph.
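The chaining construction of Convert-to-SC-graph can be sketched in a few lines of Python. The code below is a minimal illustration under stated assumptions rather than the book's implementation: SCCs are found with Kosaraju's algorithm (the text does not prescribe a method), edges are hypothetical `(src, snk)` pairs with delays omitted because the new edges receive their delays later, the graph is assumed to be connected, and the SCC ordering is simply the sorted order of representative vertex names.

```python
def sccs(vertices, edges):
    """Kosaraju's algorithm; maps each vertex to a representative of its SCC."""
    adj = {v: [] for v in vertices}
    radj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)
    order, seen = [], set()

    def visit(u):                      # first pass: record finishing order
        seen.add(u)
        for w in adj[u]:
            if w not in seen:
                visit(w)
        order.append(u)

    for v in vertices:
        if v not in seen:
            visit(v)
    comp = {}

    def assign(u, c):                  # second pass on the reversed graph
        comp[u] = c
        for w in radj[u]:
            if w not in comp:
                assign(w, c)

    for v in reversed(order):
        if v not in comp:
            assign(v, v)
    return comp

def convert_to_sc_graph(vertices, edges, t):
    """Return the new edges: source chain, sink chain, then the sink-source edge."""
    comp = sccs(vertices, edges)
    ids = sorted(set(comp.values()))
    has_in = {comp[v] for u, v in edges if comp[u] != comp[v]}
    has_out = {comp[u] for u, v in edges if comp[u] != comp[v]}

    def link(c):                       # minimum execution time vertex of SCC c
        return min((v for v in vertices if comp[v] == c), key=lambda v: t[v])

    src_links = [link(c) for c in ids if c not in has_in]
    snk_links = [link(c) for c in ids if c not in has_out]
    new_edges = list(zip(src_links, src_links[1:]))
    new_edges += list(zip(snk_links, snk_links[1:]))
    new_edges.append((snk_links[-1], src_links[0]))
    return new_edges

V = ["A", "B", "C", "D", "E"]
EDGES = [("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "D")]
T = {"A": 2, "B": 1, "C": 3, "D": 3, "E": 1}
NEW = convert_to_sc_graph(V, EDGES, T)
print(NEW)  # [('A', 'B'), ('E', 'A')]
```

In this example there are two source SCCs and one sink SCC, so two edges are added, matching the count $n_{src} + n_{snk} - 1$. The sink SCC contains D and E, and E is chosen as its link because it has the smaller execution time.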
One important issue that remains to be addressed in the conversion of a synchronization graph $G_s$ into a strongly connected graph $\hat{G}_s$ is the proper insertion of delays so that $\hat{G}_s$ is not deadlocked, and does not have lower estimated throughput than $G_s$. The potential for deadlock and reduced estimated throughput arise because the conversion to a strongly connected graph must necessarily introduce one or more new fundamental cycles. In general, a new cycle may be delay-free, or its cycle mean may exceed that of the critical cycle in $G_s$. Thus, we may have to insert delays on the edges added by Convert-to-SC-graph. The location (edge) and magnitude of the delays that we add are significant since they affect the self-timed buffer bounds of the communication edges, as shown subsequently in this section. Since the self-timed buffer bounds determine the amount of memory that we allocate for the corresponding buffers, it is desirable to prevent deadlock and decrease in estimated throughput in a way that the sum of the self-timed buffer bounds over all communication edges is minimized. In this section, we outline a simple and efficient algorithm called DetermineDelays for addressing this problem. Algorithm DetermineDelays produces an optimal result if $G_s$ has only one source SCC or only one sink SCC; in other cases, the algorithm must be viewed as a heuristic. In practice, the assumptions under which we can expect an optimal result are frequently satisfied.

For simplicity in explaining the optimality result that has been established for Algorithm DetermineDelays, we first specify a restricted version of the algorithm that assumes only one sink SCC. After explaining the optimality of this restricted algorithm, we discuss how it can be modified to yield an optimal algorithm for the general single-source-SCC case, and finally, we discuss how it can be extended to provide a heuristic for arbitrary synchronization graphs.

Figure 9.12 outlines the restricted version of Algorithm DetermineDelays that applies when the synchronization graph $G_s$ has exactly one sink SCC. Here, BellmanFord is assumed to be an algorithm that takes a synchronization graph $Z$ as input, and repeatedly applies the Bellman-Ford algorithm discussed in Section 3.13 to return the cycle mean of the critical cycle in $Z$; if one or more cycles exist that have zero path delay, then BellmanFord returns $\infty$.

Figure 9.10. The synchronization graph, after redundant synchronization edges are removed, induced by a four-processor schedule of a music synthesizer based on the Karplus-Strong algorithm.

Figure 9.11. A possible solution obtained by applying Convert-to-SC-graph to the example of Figure 9.10.
Function DetermineDelays
Input: Synchronization graphs $G_s$ and $\hat{G}_s$, where $\hat{G}_s$ is the graph computed by Convert-to-SC-graph when applied to $G_s$. The ordering of source SCCs generated in Step 2 of Convert-to-SC-graph is denoted $C_1, C_2, \ldots, C_m$. For $i = 1, 2, \ldots, m-1$, $e_i$ denotes the edge instantiated by Convert-to-SC-graph from a vertex in $C_i$ to a vertex in $C_{i+1}$. The sink-source edge instantiated by Convert-to-SC-graph is denoted $e_0$.
Output: Non-negative integers $d_0, d_1, \ldots, d_{m-1}$ such that the estimated throughput when $delay(e_i) = d_i$, $0 \le i \le m-1$, equals the estimated throughput of $G_s$.

$X_0 = \hat{G}_s[e_0 \to \infty, \ldots, e_{m-1} \to \infty]$   /* set delays on each edge to be infinite */
$\lambda_{max} = BellmanFord(X_0)$   /* compute the max. cycle mean of $G_s$ */
Compute $d_{ub}$   /* an upper bound on the delay required for any $e_i$ */
For $i = 0, 1, \ldots, m-1$
    $\delta_i = MinDelay(X_i, e_i, \lambda_{max}, d_{ub})$
    $X_{i+1} = X_i[e_i \to \delta_i]$   /* fix the delay on $e_i$ to be $\delta_i$ */
End For
Return $\delta_0, \delta_1, \ldots, \delta_{m-1}$.

Function MinDelay($X$, $e$, $\lambda$, $B$)
Input: A synchronization graph $X$, an edge $e$ in $X$, a positive real number $\lambda$, and a positive integer $B$.
Output: Assuming $X[e \to B]$ has estimated throughput no less than $\lambda^{-1}$, determine the minimum $d \in \{0, 1, \ldots, B\}$ such that the estimated throughput of $X[e \to d]$ is no less than $\lambda^{-1}$.

Perform a binary search in the range $[0, 1, \ldots, B]$ to find the minimum value of $r \in \{0, 1, \ldots, B\}$ such that $BellmanFord(X[e \to r])$ returns a value less than or equal to $\lambda$. Return this minimum value of $r$.

Figure 9.12. An algorithm for determining the delays on the edges introduced by algorithm Convert-to-SC-graph.
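The two functions of Figure 9.12 can be sketched in Python as follows. Several choices here are assumptions made for illustration and are not taken from the text: the maximum cycle mean `lam_max` of the original graph is supplied by the caller; instead of computing a cycle mean, the test "no cycle mean exceeds $\lambda$" is performed by Bellman-Ford negative-cycle detection on edge weights $\lambda \cdot delay(e) - t(src(e))$; the upper bound is taken to be $\lceil \sum_v t(v) / \lambda_{max} \rceil$ and is also used in place of the infinite initial delays; and the example graph is a hypothetical one, not a transcription of any figure.

```python
from math import ceil

def exceeds(vertices, edges, t, lam):
    """True if some cycle has cycle mean greater than lam (or zero total delay)."""
    dist = {v: 0 for v in vertices}        # as if a virtual source reached all vertices
    for _ in range(len(vertices)):
        changed = False
        for u, v, d in edges:
            w = lam * d - t[u]             # a negative cycle <=> cycle mean > lam
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return False                   # converged: no negative cycle
    return True                            # still relaxing after |V| passes

def min_delay(vertices, edges, t, idx, lam, bound):
    """Binary search for the least delay on edges[idx] keeping cycle means <= lam."""
    lo, hi = 0, bound
    while lo < hi:
        mid = (lo + hi) // 2
        trial = list(edges)
        trial[idx] = (edges[idx][0], edges[idx][1], mid)
        if exceeds(vertices, trial, t, lam):
            lo = mid + 1
        else:
            hi = mid
    return lo

def determine_delays(vertices, edges, t, new_idx, lam_max):
    """new_idx lists the positions of e_0, e_1, ... within edges, in that order."""
    bound = ceil(sum(t.values()) / lam_max)   # assumed upper bound on any delay
    work = list(edges)
    for i in new_idx:                          # start from the largest useful delay
        work[i] = (work[i][0], work[i][1], bound)
    deltas = []
    for i in new_idx:
        d = min_delay(vertices, work, t, i, lam_max, bound)
        work[i] = (work[i][0], work[i][1], d)  # fix the delay on this edge
        deltas.append(d)
    return deltas

# Hypothetical graph: sink SCC p->q->r->s->p (one delay), source SCCs {a} and {b}.
V = ["a", "b", "p", "q", "r", "s"]
T = {v: 1 for v in V}
E = [("p", "q", 0), ("q", "r", 1), ("r", "s", 0), ("s", "p", 0),
     ("a", "p", 0), ("b", "s", 0),
     ("q", "a", 0),   # e_0, the sink-source edge
     ("a", "b", 0)]   # e_1, the edge chaining the two source SCCs
DELTAS = determine_delays(V, E, T, [6, 7], 4)
print(DELTAS)  # [1, 1]
```

In this example the cycle through $e_0$ alone has three vertices and needs one delay to keep its cycle mean at or below 4, and the five-vertex cycle through $e_0$ and $e_1$ needs two delays in total, so the sketch places one delay on each edge.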
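In developing the optimality properties of Algorithm DetermineDelays, we will use the following definitions.

Definition: If $G = (V, E)$ is a DFG; $(e_0, e_1, \ldots, e_{n-1})$ is a sequence of distinct members of $E$; and $\Delta_0, \Delta_1, \ldots, \Delta_{n-1} \in \{0, 1, \ldots, \infty\}$, then $G[e_0 \to \Delta_0, \ldots, e_{n-1} \to \Delta_{n-1}]$ denotes the DFG $(V, (E - \{e_0, e_1, \ldots, e_{n-1}\}) \cup \{e_0', e_1', \ldots, e_{n-1}'\})$, where each $e_i'$ is defined by $src(e_i') = src(e_i)$, $snk(e_i') = snk(e_i)$, and $delay(e_i') = \Delta_i$. Thus, $G[e_0 \to \Delta_0, \ldots, e_{n-1} \to \Delta_{n-1}]$ is simply the DFG that results from "changing the delay" on each $e_i$ to the corresponding new delay value $\Delta_i$.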
Definition: Suppose that $G$ is a synchronization graph that preserves $G_{ipc}$. An IPC sink-source path in $G$ is a minimum-delay path in $G$ directed from $snk(e)$ to $src(e)$, where $e$ is an IPC edge (in $G_{ipc}$).

The motivation for Algorithm DetermineDelays is based on the observations that the set of IPC sink-source paths introduced by Convert-to-SC-graph can be partitioned into $m$ non-empty subsets $P_0, P_1, \ldots, P_{m-1}$ such that each member of $P_i$ contains $e_0, e_1, \ldots, e_i$ (see Figure 9.12 for the specification of what the $e_i$ represent) and contains no other members of $\{e_0, e_1, \ldots, e_{m-1}\}$, and similarly, the set of fundamental cycles introduced by Convert-to-SC-graph can be partitioned into $W_0, W_1, \ldots, W_{m-1}$ such that each member of $W_i$ contains $e_0, e_1, \ldots, e_i$ and contains no other members of $\{e_0, e_1, \ldots, e_{m-1}\}$.

By construction, a nonzero delay on any of the edges $e_0, e_1, \ldots, e_i$ contributes to reducing the cycle means of all members of $W_i$. Algorithm DetermineDelays starts (iteration $i = 0$ of the For loop) by determining the minimum delay $\delta_0$ on $e_0$ that is required to ensure that none of the cycles in $W_0$ has a cycle mean that exceeds the maximum cycle mean $\lambda_{max}$ of $G_s$. Then (in iteration $i = 1$) the algorithm determines the minimum delay $\delta_1$ on $e_1$ that is required to guarantee that no member of $W_1$ has a cycle mean that exceeds $\lambda_{max}$, assuming that $delay(e_0) = \delta_0$.

Now, if $delay(e_0) = \delta_0$, $delay(e_1) = \delta_1$, and $\delta_1 > 0$, then for any positive integer $k \le \delta_1$, $k$ units of delay can be "transferred" from $e_1$ to $e_0$ without violating the property that no member of $(W_0 \cup W_1)$ contains a cycle whose cycle mean exceeds $\lambda_{max}$. However, such a transformation increases the path delay of each member of $P_0$ while leaving the path delay of each member of $P_1$ unchanged, and therefore such a transformation cannot reduce the self-timed buffer bound of any IPC edge. Furthermore, apart from transferring delay from $e_1$ to $e_0$, the only other change that can be made to $delay(e_0)$ or $delay(e_1)$, without introducing a member of $(W_0 \cup W_1)$ whose cycle mean exceeds $\lambda_{max}$, is to increase one or both of these values by some positive integer amount(s). Clearly, such a change cannot reduce the self-timed buffer bound on any IPC edge. Thus, we see that the values $\delta_0$ and $\delta_1$ computed by DetermineDelays for $delay(e_0)$ and $delay(e_1)$, respectively, optimally ensure that no member of $(W_0 \cup W_1)$ has a cycle mean that exceeds $\lambda_{max}$.

After computing these values, DetermineDelays computes the minimum delay $\delta_2$ on $e_2$ that is required for all members of $W_2$ to have cycle means less than or equal to $\lambda_{max}$, assuming that $delay(e_0) = \delta_0$ and $delay(e_1) = \delta_1$. Given the "configuration" $(delay(e_0) = \delta_0, delay(e_1) = \delta_1, delay(e_2) = \delta_2)$, transferring delay from $e_2$ to $e_1$ increases the path delay of all members of $P_1$, while leaving the path delay of each member of $(P_0 \cup P_2)$ unchanged; and transferring delay from $e_2$ to $e_0$ increases the path delay across $(P_0 \cup P_1)$, while leaving the path delay across $P_2$ unchanged. Thus, by an argument similar to that given to establish the optimality of $(\delta_0, \delta_1)$ with respect to $(W_0 \cup W_1)$, we can deduce that (1) the values computed by DetermineDelays for the delays on $e_0, e_1, e_2$ guarantee that no member of $(W_0 \cup W_1 \cup W_2)$ has a cycle mean that exceeds $\lambda_{max}$; and (2) for any other assignment of delays $(\delta_0', \delta_1', \delta_2')$ to $(e_0, e_1, e_2)$ that preserves the estimated throughput across $(W_0 \cup W_1 \cup W_2)$, and for any IPC edge $e$ such that an IPC sink-source path of $e$ is contained in $(P_0 \cup P_1 \cup P_2)$, the self-timed buffer bound of $e$ under the assignment $(\delta_0', \delta_1', \delta_2')$ is greater than or equal to the self-timed buffer bound of $e$ under the assignment $(\delta_0, \delta_1, \delta_2)$ computed by iterations $i = 0, 1, 2$ of DetermineDelays.
After extending this analysis successively to each of the remaining iterations $i = 3, 4, \ldots, m-1$ of the For loop in DetermineDelays, we arrive at the following result.

Theorem 9.4: Suppose that $G_s$ is a synchronization graph that has exactly one sink SCC; let $\hat{G}_s$ and $(e_0, e_1, \ldots, e_{m-1})$ be as in Figure 9.12; let $(d_0, d_1, \ldots, d_{m-1})$ be the result of applying DetermineDelays to $G_s$ and $\hat{G}_s$; and let $(d_0', d_1', \ldots, d_{m-1}')$ be any sequence of $m$ non-negative integers such that $\hat{G}_s[e_0 \to d_0', \ldots, e_{m-1} \to d_{m-1}']$ has the same estimated throughput as $G_s$. Then

$\Phi(\hat{G}_s[e_0 \to d_0', \ldots, e_{m-1} \to d_{m-1}']) \ge \Phi(\hat{G}_s[e_0 \to d_0, \ldots, e_{m-1} \to d_{m-1}])$,

where $\Phi(X)$ denotes the sum of the self-timed buffer bounds over all IPC edges in $G_{ipc}$ induced by the synchronization graph $X$.
Figure 9.13 illustrates a solution obtained from DetermineDelays. Here we assume that $t(v) = 1$ for each vertex $v$, and we assume that the set of IPC edges is $\{e_a, e_b\}$ (for clarity, we are assuming in this example that the IPC edges are present in the given synchronization graph). The grey dashed edges are the edges added by Convert-to-SC-graph. We see that $\lambda_{max}$ is determined by the cycle in the sink SCC of the original graph, and inspection of this cycle yields the value of $\lambda_{max}$. Also, we see that the set $W_0$ (the set of fundamental cycles that contain $e_0$ and do not contain $e_1$) consists of a single cycle that contains three edges. By inspection of this cycle, we see that the minimum delay on $e_0$ required to guarantee that its cycle mean does not exceed $\lambda_{max}$ is 1. Thus, the $i = 0$ iteration of the For loop in DetermineDelays computes $\delta_0 = 1$. Next, we see that $W_1$ consists of a single cycle that contains five edges, and we see that two delays must be present on this cycle for its cycle mean to be less than or equal to $\lambda_{max}$. Since one delay has been placed on $e_0$, DetermineDelays computes $\delta_1 = 1$ in the $i = 1$ iteration of the For loop. Thus, the solution determined by DetermineDelays for Figure 9.13 is $(\delta_0, \delta_1) = (1, 1)$; the resulting self-timed buffer bounds of $e_a$ and $e_b$ are, respectively, 1 and 2, and their sum is $2 + 1 = 3$.

Now $(2, 0)$ is an alternative assignment of delays on $(e_0, e_1)$ that preserves the estimated throughput of the original graph. However, in this assignment, the self-timed buffer bounds of $e_a$ and $e_b$ are identically equal to 2, so their sum is 4, one greater than the corresponding sum from the delay assignment $(1, 1)$ computed by DetermineDelays. Thus, if $\hat{G}_s$ denotes the graph returned by Convert-to-SC-graph for the example of Figure 9.13, we have that

$\Phi(\hat{G}_s[e_0 \to 2, e_1 \to 0]) > \Phi(\hat{G}_s[e_0 \to \delta_0, e_1 \to \delta_1])$,

where $\Phi(X)$ denotes the sum of the self-timed buffer bounds over all IPC edges induced by the synchronization graph $X$.

Figure 9.13. An example used to illustrate a solution obtained by algorithm DetermineDelays.
A~gorithmDeter~ineDeZayscan easily be modified to optimally handle general graphs that have only one SCC. Here, the algorithm s~ecification remains essentially the same, with the exception that for i 1 2, .) ( m denotes the edge directed from vertex in - i to vertex in D, where D, is the ordering of sink SCCs generated in tep of the corresponding invocation of Cunve~-tu-SC-graph still denotes the sink-source edge instantiated by Cunvert-tu-SC-graph),By adapting the reasoningbehind Theorem 9.4, it is easily verified that when it is applicable, this modified algorithm always yields an optimal solution. As far we are aware, there is no straight for war^ extension of Deterdays to general graphs (multiple source SCCs and multiple sink SCCs) guaranteed to yield optimal solutions. The fundamental problem for the eneral case is the inability to derive the partitions W O , W , , W,,,P, of the fundamental cycles ( P C sink-source paths) introduced by ~ U ~ v e r t - t u - S C - g rsuch ~ p h that each contains e,, and cone,where E, is the set of edges tains no other members of E, added by C u ~ v e r t - t u - ~ C - g rThe ~ ~ hexistence . of such pa~itionswas crucial to our development of Theorem 9.4 because it implied that once the minimum vale,. are successively computed9 " t r a n ~ f e ~ i n delay g ' ~ from some ues for eo, e,, to some e j is never beneficial. Figure 9.14 shows an example of synchronization graph that has multip~esource SCCs and multiple sink SCCs, and that does not induce partition of the desired form for the fundamental cycles.
e t e ~ i n e D e ~ a ycan s beextended to yield heuristics for the eneral case in which the original synchronization graph C, contains more than and more than one sink SCC. For example, if ( a l , a 2 , a k ) ne source denoteedges that were instantiated by C ~ n v e ~ - t u - ~ C - g r f"between" fph the CCs with each ai representing the i th edge created and similarly,
b l , b2t 6,) denote the sequence of edgesinstantiated between the sink thenalgorithm ~ e t e ~ i n e ~ e ~can abeyapplied s with the modi~cation that m k-tZ+l,and
where $e_0$ is the sink-source edge from Convert-to-SC-graph. The derivation of alternative heuristics for general synchronization graphs appears to be an interesting direction for further research. It should be noted, though, that practical synchronization graphs frequently contain either a single source SCC or a single sink SCC, or both, such as the example of Figure 9.10, so that algorithm DetermineDelays, together with its counterpart for graphs that have a single source SCC, forms a widely applicable solution for optimally determining the delays on the edges created by Convert-to-SC-graph.
Figure A synchronization graph, afterprocess in^ by such that there is no m -way partition WO, W,- of the fundamenta by that satisfies both (1). Each W , conEach W i does not contain any member tains e, et, ei+2, Here, the fundamental cycles introduced by y dashed edges are the edges instantiated by are e21 is easilyv ~ r i f i e that ~ these cycles cannot be decom~osed n if we are ~ l ~ o w to e dreorder thee, S.
re exist constants and such that and S for all edges e then the complexity of O( VI /Ellog,( V I ) ) (see Section 3.13.2); and we have
5
for all
and so that
TIVI DTIV/ Thus, each invocation of ~ i n ~ e l runs u y in
It follows that the complexity of DetermineDelays, and of any of the variations of DetermineDelays defined above, is $O(m |V| |E| (\log_2(|V|))^2)$, where $m$ is the number of edges instantiated by Convert-to-SC-graph. Since $m$ is determined by the number of source SCCs and the number of sink SCCs, it is obvious that $m \le |V|$. With this observation, and the observation that $|E| \le |V|^2$, we have that DetermineDelays and its variations are $O(|V|^4 (\log_2(|V|))^2)$. Furthermore, it is easily verified that the time complexity of DetermineDelays dominates that of Convert-to-SC-graph, so the time complexity of applying the two algorithms in succession is $O(|V|^4 (\log_2(|V|))^2)$.
Although the issue of deadlock does not explicitly arise in algorithm DetermineDelays, the algorithm does guarantee that the output graph is not deadlocked, assuming that the input graph is not deadlocked. This is because (from Lemma 7.1) deadlock is equivalent to the existence of a cycle that has zero path delay, and is thus equivalent to an infinite maximum cycle mean. Since DetermineDelays does not increase the maximum cycle mean, it follows that the algorithm cannot convert a graph that is not deadlocked into a deadlocked graph.
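The equivalence between deadlock and a cycle of zero path delay suggests a simple check, sketched below in Python. This is only an illustration under stated assumptions: the edge-list representation `(src, snk, delay)` and the function name `has_delay_free_cycle` are hypothetical and do not come from the text, and delays are assumed to be non-negative integers, so that a zero-delay cycle can consist only of zero-delay edges.

```python
from collections import defaultdict, deque

def has_delay_free_cycle(vertices, edges):
    """Return True if the graph has a cycle whose total delay is zero.

    edges: iterable of (src, snk, delay) with non-negative delays
    (hypothetical representation, not taken from the text).
    Because delays are non-negative, a zero-delay cycle uses only
    zero-delay edges, so we look for a cycle in that subgraph.
    """
    succ = defaultdict(list)
    indeg = {v: 0 for v in vertices}
    for src, snk, delay in edges:
        if delay == 0:
            succ[src].append(snk)
            indeg[snk] += 1
    # Kahn's algorithm: repeatedly remove vertices that have no
    # remaining zero-delay input edges.
    ready = deque(v for v in vertices if indeg[v] == 0)
    removed = 0
    while ready:
        v = ready.popleft()
        removed += 1
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    # Vertices that were never removed lie on, or downstream of,
    # a cycle of zero-delay edges.
    return removed < len(vertices)

# A three-actor cycle with one delay is not deadlocked ...
ok = [("A", "B", 0), ("B", "C", 0), ("C", "A", 1)]
# ... but removing that delay creates a delay-free cycle.
bad = [("A", "B", 0), ("B", "C", 0), ("C", "A", 0)]
print(has_delay_free_cycle("ABC", ok), has_delay_free_cycle("ABC", bad))
```

The check runs in time linear in the number of vertices and edges, which is consistent with the remark that such cycles can be detected efficiently, although the text itself may rely on a different procedure.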
Converting a mixed grain HSDFG that contains feedforward edges into a strongly connected graph has been studied by Zivojnovic, 941 in the context of retiming when the assignment of actors to processors is fixed beforehand. In this case, the objective is to retime the input graph so that the number of communication edges that have nonzero delay is maximized, and the conversion is performed to constrain the set of possible retimings in such a way that an integer linear programming formulation can be developed. The technique generates two dummy vertices that are connected by an edge; the sink vertices of the original graph are connected to one of the dummy vertices, while the other dummy vertex is connected to each source. It is easily verified that in self-timed execution, this scheme requires at least four more synchronization accesses per graph iteration than the method that we have proposed. We can obtain further relative savings if we succeed in detecting one or more beneficial resynchronization opportunities. The effect of Zivojnovic's retiming algorithm on synchronization overhead is unpredictable since, on one hand, a communication edge becomes "easier to make redundant" when its delay increases, while on the other hand, the edge becomes less useful in making other communication edges redundant since the path delay of all paths that contain the edge increases.
This chapter has developed two software strategies for minimizing synchronization overhead when implementing self-timed, iterative dataflow programs. These techniques rely on a graph-theoretic analysis framework based on two data structures called the interprocessor communication graph and the synchronization graph. This analysis framework allows us to determine the effects on throughput and buffer sizes of modifying the points in the target program at which synchronization functions are carried out. We have shown how this framework can be used to extend an existing technique, the removal of redundant synchronization edges, from non-iterative programs to the iterative case, and to develop a new method for reducing synchronization overhead: the conversion of the synchronization graph into a strongly connected graph so that a more efficient synchronization protocol can be used.
As before, the main premise of the techniques discussed in the chapter is that estimates are available for the execution times of actors such that the actual execution time of an actor exhibits large variation from its corresponding estimate only with very low frequency. Accordingly, our techniques have been devised to guarantee that if the actual execution time of each actor invocation is always equal to the corresponding execution time estimate, then the throughput of an implementation that incorporates our synchronization minimization techniques is never less than the throughput of a corresponding unoptimized implementation; that is, we never accept an opportunity to reduce synchronization overhead if it constrains execution in such a way that throughput is decreased. Thus, the techniques discussed in this section are particularly relevant to embedded applications, where the price of synchronization is high, and accurate execution time estimates are often available, but guarantees on these execution times do not exist due to infrequent events such as cache misses, interrupts, and error handling.
In the next two chapters, we discuss a third software-based technique, called resynchronization, for reducing synchronization overhead in application-specific multiprocessors.
This chapter discusses a technique, called resynchronization, for reducing synchronization overhead in application-specific multiprocessor implementations. The technique applies to arbitrary collections of dedicated, programmable or configurable processors, such as combinations of programmable DSPs, ASICs, and FPGA subsystems. Resynchronization is based on the concept of redundant synchronization operations, which was defined in the previous chapter. The objective of resynchronization is to introduce new synchronizations in such a way that the number of original synchronizations that consequently become redundant is significantly more than the number of new synchronizations.
Intuitively, resynchronization is the process of adding one or more new synchronization edges and removing the redundant edges that result. Figure 10.1(a) illustrates how this concept can be used to reduce the total number of synchronizations in a multiprocessor implementation. Here, the dashed edges represent synchronization edges. Observe that if the new synchronization edge (C, H) is inserted, then two of the original synchronization edges become redundant. Since redundant synchronization edges can be removed from the synchronization graph to yield an equivalent synchronization graph, we see that the net effect of adding the synchronization edge (C, H) is to reduce the number of synchronization edges that need to be implemented by 1. Figure 10.1(b) shows the synchronization graph that results from inserting the resynchronization edge (C, H) into Figure 10.1(a), and then removing the redundant synchronization edges that result.
Definition 10.1 gives a formal definition of resynchronization. This definition considers resynchronization only "across" feedforward edges. Resynchronization that includes inserting edges into SCCs is also possible; however, in general, such resynchronization may decrease the estimated throughput (see Theorem 10.1 at the end of Section 10.2). Thus, for our objectives, it must be verified that each new synchronization edge introduced in an SCC does not decrease the estimated throughput. To avoid this complication, which requires a check of significant complexity ($O(|V| |E| \log_2(|V|))$, where $(V, E)$ is the modified synchronization graph, using the Bellman-Ford algorithm described in Section 3.13.2) for each candidate resynchronization edge, we focus only on "feedforward" resynchronization in this chapter. Future research will address combining the insights developed here for feedforward resynchronization with efficient techniques to estimate the impact that a given resynchronization edge has on the estimated throughput.
Opportunities for feedforward resynchronization are particularly abundant in the dedicated hardware implementation of dataflow graphs. If each actor is mapped to a separate piece of hardware, as in the VLSI dataflow arrays of Kung, then for any application graph that is acyclic, every communication channel between two units will have an associated feedforward synchronization edge. Due to increasing circuit integration levels, such isomorphic mapping of dataflow subsystems into hardware is becoming attractive for a growing family of applications. Feedforward synchronization edges often arise naturally in multiprocessor software implementations as well. A software example is reviewed in detail in Section 10.5.
Definition 10.1: Suppose that $G = (V, E)$ is a synchronization graph, and $F \equiv \{e_1, e_2, \ldots, e_n\}$ is the set of all feedforward edges in $G$. A resynchronization of $G$ is a finite set $R \equiv \{e_1', e_2', \ldots, e_m'\}$ of edges that are not necessarily contained in $E$, but whose source and sink vertices are in $V$, such that a) $e_1', e_2', \ldots, e_m'$ are feedforward edges in the HSDFG $G^* \equiv (V, ((E - F) + R))$; and b) $G^*$ preserves $G$, that is, $\rho_{G^*}(src(e_i), snk(e_i)) \le delay(e_i)$ for all $i \in \{1, 2, \ldots, n\}$. Each member of $R$ that is not in $E$ is called a resynchronization edge of the resynchronization $R$, $G^*$ is called the resynchronized graph associated with $R$, and this graph is denoted by $\Psi(R, G)$.
If we let $G$ denote the graph in Figure 10.1, then $F$ is the set of feedforward edges of Figure 10.1(a), which includes (B, G), and $R$ is a resynchronization of $G$ that includes the edge (C, H). Figure 10.1(b) shows the HSDFG $G^*$, and from Figure 10.1(b), it is easily verified that $F$, $R$, and $G^*$ satisfy conditions (a) and (b) of Definition 10.1.
Typically, resynchronization is meaningful only in the context of synchronization graphs that are not deadlocked, that is, synchronization graphs that do not contain any delay-free cycles, or equivalently, that have nonzero estimated throughput. In the remainder of this chapter and throughout Chapter 11, we are concerned only with deadlock-free synchronization graphs. Thus, unless otherwise stated, we assume the absence of delay-free synchronization graph cycles. In practice, this assumption is not a problem, since delay-free cycles can be detected efficiently.
This section reviews a number of useful properties of synchronization redundancy and resynchronization that we will apply throughout the developments of this chapter and Chapter 11.
Fact 10.1: Suppose that $G = (V, E)$ is a synchronization graph and $s$ is a redundant synchronization edge in $G$. Then there exists a simple path $p$ in $G$ directed from $src(s)$ to $snk(s)$ such that $p$ does not contain $s$, and $Delay(p) \le delay(s)$.
Proof: Let $G' \equiv (V, (E - \{s\}))$ denote the synchronization graph that results when we remove $s$ from $G$. Then from Definition 9.2, there exists a path $p'$ in $G'$ directed from $src(s)$ to $snk(s)$ such that

$Delay(p') \le delay(s)$. (10-1)

Now observe that every edge in $G'$ is contained in $G$, and thus, $G$ contains the path $p'$. If $p'$ is a simple path, then we are done. Otherwise, $p'$ can be expressed as a concatenation

$\langle q_1, C_1, q_2, C_2, \ldots, q_{n-1}, C_{n-1}, q_n \rangle$, (10-2)

where each $q_i$ is a simple path, at least one $q_i$ is non-empty, and each $C_j$ is a (not necessarily simple) cycle. Since valid synchronization graphs cannot contain delay-free cycles (Section 10.1), we must have $Delay(C_k) \ge 1$ for every cycle $C_k$ in this concatenation. Thus, since each $C_i$ originates and terminates at the same actor, the path $p'' = \langle q_1, q_2, \ldots, q_n \rangle$ is a simple path directed from $src(s)$ to $snk(s)$ such that $Delay(p'') < Delay(p')$. Combining this last inequality with (10-1) yields

$Delay(p'') < delay(s)$. (10-3)

Furthermore, since $p'$ is contained in $G$, it follows from the construction of $p''$ that $p''$ must also be contained in $G$. Finally, since $p'$ is contained in $G'$, $G'$ does not contain $s$, and the set of edges contained in $p''$ is a subset of the set of edges contained in $p'$, we have that $p''$ does not contain $s$. QED.
Lemma 10.1: Suppose that $G$ and $G'$ are synchronization graphs such that $G'$ preserves $G$, and $p$ is a path in $G$ from actor $x$ to actor $y$. Then there is a path $p'$ in $G'$ from $x$ to $y$ such that $Delay(p') \le Delay(p)$, and $tr(p) \subseteq tr(p')$, where $tr(\varphi)$ denotes the set of actors traversed by the path $\varphi$.
Thus, if a synchronization graph $G'$ preserves another synchronization graph $G$, and $p$ is a path in $G$ from actor $x$ to actor $y$, then there is at least one path $p'$ in $G'$ such that 1) the path $p'$ is directed from $x$ to $y$; 2) the cumulative delay on $p'$ does not exceed the cumulative delay on $p$; and 3) every actor that is traversed by $p$ is also traversed by $p'$ (although $p'$ may traverse one or more actors that are not traversed by $p$). For example, for a suitable choice of $x$, $y$, and $p$ in Figure 10.1(a), the corresponding path $p'$ in Figure 10.1(b) confirms Lemma 10.1 for this example: every actor in $tr(p)$, which includes B, G, and H, is also traversed by $p'$.
Proof of Lemma 10.1: Let $p = (e_1, e_2, \ldots, e_n)$. By definition of the preserves relation, each $e_i$ that is not a synchronization edge in $G$ is contained in $G'$. For each $e_i$ that is a synchronization edge in $G$, there must be a path $p_i$ in $G'$ from $src(e_i)$ to $snk(e_i)$ such that $Delay(p_i) \le delay(e_i)$. Define the path $p'$ to be the concatenation obtained from $p$ by replacing each $e_i$ that is a synchronization edge in $G$ with the corresponding path $p_i$. Clearly, $p'$ is a path in $G'$ from $x$ to $y$, and since $Delay(p_i) \le delay(e_i)$ holds whenever $e_i$ is a synchronization edge, it follows that $Delay(p') \le Delay(p)$. Furthermore, from the construction of $p'$, it is apparent that every actor that is traversed by $p$ is also traversed by $p'$. QED.
The following lemma states that if a resynchronization contains a resynchronization edge $e$ such that there is a delay-free path in the original synchronization graph from the source of $e$ to the sink of $e$, then $e$ must be redundant in the resynchronized graph.
Lemma 10.2: Suppose that $G$ is a synchronization graph; $R$ is a resynchronization of $G$; and $(x, y)$ is a resynchronization edge such that $\rho_G(x, y) = 0$. Then $(x, y)$ is redundant in the resynchronized graph. Thus, a minimal resynchronization (fewest number of elements) has the property that $\rho_G(x', y') > 0$ for each resynchronization edge $(x', y')$.
Proof: Let $p$ denote a minimum-delay path from $x$ to $y$ in $G$. Since $(x, y)$ is a resynchronization edge, $(x, y)$ is not contained in $G$, and thus, $p$ traverses at least three actors. From Lemma 10.1, it follows that there is a path $p'$ in the resynchronized graph from $x$ to $y$ such that

$Delay(p') = 0$, (10-6)

and $p'$ traverses at least three actors. Thus,

$Delay(p') \le delay((x, y))$ (10-7)
and $p' \ne ((x, y))$. Furthermore, $p'$ cannot properly contain $(x, y)$. To see this, observe that if $p'$ contains $(x, y)$ but $p' \ne ((x, y))$, then from (10-6), it follows that there exists a delay-free cycle (that traverses $x$), and hence that our assumption of a deadlock-free schedule (Section 10.1) is violated. Thus, we conclude that $(x, y)$ is redundant in the resynchronized graph. QED.
As a consequence of Lemma 10.1, the estimated throughput of a given synchronization graph is always less than or equal to that of every synchronization graph that it preserves.
Theorem 10.1: If $G$ is a synchronization graph, and $G'$ is a synchronization graph that preserves $G$, then $\lambda_{max}(G') \ge \lambda_{max}(G)$.
Proof: Suppose that $C$ is a critical cycle in $G$. Lemma 10.1 guarantees that there is a cycle $C'$ in $G'$ such that a) $Delay(C') \le Delay(C)$, and b) the set of actors that are traversed by $C$ is a subset of the set of actors traversed by $C'$. Now clearly, b) implies that

$\sum_{v \text{ is traversed by } C'} t(v) \ge \sum_{v \text{ is traversed by } C} t(v)$, (10-8)

and this observation together with a) implies that the cycle mean of $C'$ is greater than or equal to the cycle mean of $C$. Since $C$ is a critical cycle in $G$, it follows that $\lambda_{max}(G') \ge \lambda_{max}(G)$. QED.
Thus, any saving in synchronization cost obtained by rearranging synchronization edges may come at the expense of a decrease in estimated throughput. As implied by Definition 10.1, we avoid this complication by restricting our attention to feedforward synchronization edges. Clearly, resynchronization that rearranges only feedforward synchronization edges cannot decrease the estimated throughput, since no new cycles are introduced and no existing cycles are altered. Thus, with the form of resynchronization that is addressed in this chapter, any decrease in synchronization cost that we obtain is not diminished by a degradation of the estimated throughput.
The resynchronization problem is the problem of finding a resynchronization with the fewest elements. In Section 10.4, it is formally shown that the resynchronization problem is NP-hard, which means that it is unlikely that efficient algorithms can be devised to solve the problem exactly, and thus, for practical use, we should search for good heuristic solutions. In this section, we explain the intuition behind this result. To establish the NP-hardness of the resynchronization problem, we examine a special case that occurs when there are exactly two SCCs, which we call the pairwise resynchronization problem, and we derive a polynomial-time reduction from the classic set-covering problem, a well-known NP-hard problem, to the pairwise resynchronization problem. In the set-covering problem, one is given a finite set $X$ and a family $T$ of subsets of $X$, and asked to find a minimal (fewest number of members) subfamily $T_s \subseteq T$ such that $\bigcup_{t \in T_s} t = X$.
A subfamily of $T$ is said to cover $X$ if each member of $X$ is contained in some member of the subfamily. Thus, the set-covering problem is the problem of finding a minimal cover.
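To make the definition of a minimal cover concrete, the following Python sketch finds one by exhaustive search over subfamilies of increasing size. It is an illustration only: the function name `minimal_cover`, the dictionary representation of $T$, and the example sets are assumptions made here, not taken from the text, and the search takes exponential time in the worst case, as one would expect for an NP-hard problem.

```python
from itertools import combinations

def minimal_cover(X, T):
    """Return the names of a smallest subfamily of T that covers X.

    X: iterable of elements; T: dict mapping a subset name to a set.
    (Hypothetical representation, chosen for illustration.)
    Returns None if T itself does not cover X.
    """
    X = set(X)
    names = list(T)
    # Try subfamilies in order of increasing size, so the first
    # cover found has the fewest members.
    for size in range(len(names) + 1):
        for chosen in combinations(names, size):
            covered = set().union(*(T[name] for name in chosen))
            if covered >= X:
                return list(chosen)
    return None

# Hypothetical instance: four elements, three subsets.
X = {1, 2, 3, 4}
T = {"t1": {1, 3}, "t2": {1, 2}, "t3": {2, 4}}
print(minimal_cover(X, T))  # ['t1', 't3']
```

No single subset covers all four elements here, and the first pair that does is `t1` together with `t3`.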
Definition 10.2: Given a synchronization graph $G$, let $(x_1, x_2)$ be a synchronization edge in $G$, and let $(y_1, y_2)$ be an ordered pair of actors in $G$. We say that $(y_1, y_2)$ subsumes $(x_1, x_2)$ in $G$ if

$\rho_G(x_1, y_1) + \rho_G(y_2, x_2) \le delay((x_1, x_2))$.

Thus, every synchronization edge subsumes itself, and intuitively, if $(x_1, x_2)$ is a synchronization edge, then $(y_1, y_2)$ subsumes $(x_1, x_2)$ if and only if a zero-delay synchronization edge directed from $y_1$ to $y_2$ makes $(x_1, x_2)$ redundant. The following fact is easily verified from Definitions 10.1 and 10.2.
Fact 10.2: Suppose that $G$ is a synchronization graph that contains exactly two SCCs, $F$ is the set of feedforward edges in $G$, and $R$ is a resynchronization of $G$. Then for each $e \in F$, there exists $e' \in R$ such that $(src(e'), snk(e'))$ subsumes $e$ in $G$.
An intuitive correspondence between the pairwise resynchronization problem and the set-covering problem can be derived from Fact 10.2. Suppose that $G$ is a synchronization graph with exactly two SCCs, $C_1$ and $C_2$, such that each feedforward edge is directed from a member of $C_1$ to a member of $C_2$. We start by viewing the set $F$ of feedforward edges in $G$ as the finite set that we wish to cover, and with each member $p$ of $\{(x, y) \mid (x \in C_1, y \in C_2)\}$, we associate the subset of $F$ defined by $\chi(p) \equiv \{e \in F \mid (p \text{ subsumes } e)\}$. Thus, $\chi(p)$ is the set of feedforward edges of $G$ whose corresponding synchronizations can be eliminated if we implement a zero-delay synchronization edge directed from the first vertex of the ordered pair $p$ to the second vertex of $p$. Clearly then, $\{e_1', e_2', \ldots, e_n'\}$ is a resynchronization if and only if each $e \in F$ is contained in at least one $\chi((src(e_i'), snk(e_i')))$, that is, if and only if $\{\chi((src(e_i'), snk(e_i'))) \mid 1 \le i \le n\}$ covers $F$. Thus, solving the pairwise resynchronization problem for $G$ is equivalent to finding a minimal cover for $F$ given the family of subsets $\{\chi((x, y)) \mid (x \in C_1, y \in C_2)\}$.
Figure 10.2 helps to illustrate this intuition. Suppose that we are given a finite set $X$ and a family $T$ of subsets $t_1, t_2, \ldots$ of $X$. To construct an instance of the pairwise resynchronization problem, we first create two vertices and an edge directed between these vertices for each member of $X$; we label each of the edges created in this step with the corresponding member of $X$. Then for each $t \in T$, we create two vertices $v_{src}(t)$ and $v_{snk}(t)$. Next, for each relation $x_i \in t_j$ (there are six such relations in this example), we create two delayless edges: one directed from the source of the edge corresponding to $x_i$ and directed to $v_{src}(t_j)$, and another directed from $v_{snk}(t_j)$ to the sink of the edge corresponding to $x_i$. This last step has the effect of making each pair $(v_{src}(t_i), v_{snk}(t_i))$ subsume exactly those edges that correspond to members of $t_i$; in other words, after this construction, $\chi((v_{src}(t_i), v_{snk}(t_i))) = t_i$, for each $i$. Finally, for each edge created in the previous step, we create a corresponding feedback edge oriented in the opposite direction, and having a unit delay.

Figure 10.2. (a) An instance of the pairwise resynchronization problem that is derived from an instance of the set-covering problem; (b) the HSDFG that results from a solution to this instance of pairwise resynchronization.
Figure 10.2(a) shows the synchronization graph that results from this construction process. Here, it is assumed that each vertex corresponds to a separate processor; the associated unit delay, self loop edges are not shown to avoid clutter. Observe that the graph contains two SCCs, one formed by the source vertices of the edges that correspond to members of $X$ together with the vertices $v_{src}(t_i)$, and the other formed by the sink vertices of these edges together with the vertices $v_{snk}(t_i)$, and that the set of feedforward edges is the set of edges that correspond to members of $X$. Now, recall that a major correspondence between the given instance of set covering and the instance of pairwise resynchronization defined by Figure 10.2(a) is that $\chi((v_{src}(t_i), v_{snk}(t_i))) = t_i$, for each $i$. Thus, if we can find a minimal resynchronization of Figure 10.2(a) such that each edge in this resynchronization is directed from some $v_{src}(t_k)$ to the corresponding $v_{snk}(t_k)$, then the associated $t_k$ form a minimum cover of $X$. For example, it is easy, albeit tedious, to verify that the resynchronization illustrated in Figure 10.2(b) is a minimal resynchronization of Figure 10.2(a), and from this, we can conclude that the associated subsets form a minimal cover for $X$. From inspection of the given sets $X$ and $T$, it is easily verified that this conclusion is correct.
This example illustrates how an instance of pairwise resynchronization can be constructed (in polynomial time) from an instance of set covering, and how a solution to this instance of pairwise resynchronization can easily be converted into a solution of the set-covering instance. The formal proof of the NP-hardness of pairwise resynchronization that is given in the following section is a generalization of the example in Figure 10.2.
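The construction described above can be sketched in Python and used to check the key property that each pair $(v_{src}(t), v_{snk}(t))$ subsumes exactly the edges that correspond to members of $t$. Everything concrete in this sketch is an assumption made for illustration: the vertex naming scheme, the function names, the convention that the minimum path delay from a vertex to itself is 0, and the example sets (chosen so that there are six membership relations, but not claimed to be the sets of Figure 10.2). The unit-delay self loops are omitted because they do not affect minimum path delays.

```python
INF = float("inf")

def build_instance(X, T):
    """Build the two-SCC graph; edges are (src, snk, delay) tuples."""
    vertices, edges = [], []
    for x in X:
        vertices += [("src", x), ("snk", x)]
        edges.append((("src", x), ("snk", x), 0))  # edge labeled x
    for name, members in T.items():
        vertices += [("vsrc", name), ("vsnk", name)]
        for x in members:
            # Two delayless edges for the relation "x is in t" ...
            edges.append((("src", x), ("vsrc", name), 0))
            edges.append((("vsnk", name), ("snk", x), 0))
            # ... and their unit-delay feedback counterparts.
            edges.append((("vsrc", name), ("src", x), 1))
            edges.append((("snk", x), ("vsnk", name), 1))
    return vertices, edges

def min_path_delays(vertices, edges):
    """All-pairs minimum path delay (Floyd-Warshall), rho[u][u] = 0."""
    rho = {u: {v: (0 if u == v else INF) for v in vertices}
           for u in vertices}
    for s, d, delay in edges:
        rho[s][d] = min(rho[s][d], delay)
    for k in vertices:
        for i in vertices:
            for j in vertices:
                if rho[i][k] + rho[k][j] < rho[i][j]:
                    rho[i][j] = rho[i][k] + rho[k][j]
    return rho

def chi(pair, X, rho):
    """Members x of X whose (zero-delay) edge is subsumed by pair."""
    y1, y2 = pair
    return {x for x in X
            if rho[("src", x)][y1] + rho[y2][("snk", x)] <= 0}

# Hypothetical set-covering instance with six membership relations.
X = [1, 2, 3, 4]
T = {"t1": {1, 3}, "t2": {1, 2}, "t3": {2, 4}}
vertices, edges = build_instance(X, T)
rho = min_path_delays(vertices, edges)
for name, members in T.items():
    assert chi((("vsrc", name), ("vsnk", name)), X, rho) == members
```

The assertions at the end pass because, for a member that is not in a given subset, every path from its source vertex to the subset's source vertex must use at least one unit-delay feedback edge.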
In this section, the NP-hardness of the resynchronization problem is established. This result is derived by reducing an arbitrary instance of the set-covering problem, a well-known NP-hard problem, to an instance of the pairwise resynchronization problem, which is a special case of the resynchronization problem that occurs when there are exactly two SCCs. The intuition behind this reduction is explained in Section 10.3 above.
Suppose that we are given an instance $(X, T)$ of set covering, where $X$ is a finite set, and $T$ is a family of subsets of $X$ that covers $X$. Without loss of generality, we assume that $T$ does not contain a proper nonempty subset $T'$ that satisfies

$\left( \bigcup_{t \in (T - T')} t \right) \cap \left( \bigcup_{t \in T'} t \right) = \emptyset$. (10-9)

We can assume this without loss of generality because if this assumption does not hold, then we can apply the construction below to each "independent subfamily" separately, and then combine the results to get a minimal cover for $X$.
The following steps specify how we construct an HSDFG from $(X, T)$. Except where stated otherwise, no delay is placed on the edges that are instantiated.
1. For each $x \in X$, instantiate two vertices $v_{src}(x)$ and $v_{snk}(x)$, and instantiate an edge $e(x)$ directed from $v_{src}(x)$ to $v_{snk}(x)$.
2. For each $t \in T$:
(a) Instantiate two vertices $v_{src}(t)$ and $v_{snk}(t)$.
(b) For each $x \in t$:
Instantiate an edge directed from $v_{src}(x)$ to $v_{src}(t)$.
Instantiate an edge directed from $v_{src}(t)$ to $v_{src}(x)$, and place one delay on this edge.
Instantiate an edge directed from $v_{snk}(t)$ to $v_{snk}(x)$.
Instantiate an edge directed from $v_{snk}(x)$ to $v_{snk}(t)$, and place one delay on this edge.
3. For each vertex $v$ that has been instantiated, instantiate an edge directed from $v$ to itself, and place one delay on this edge.
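Observe from our construction that whenever $x \in X$ is contained in $t \in T$, there is an edge directed from $v_{src}(x)$ ($v_{snk}(t)$) to $v_{src}(t)$ ($v_{snk}(x)$), and there is also an edge (having unit delay) directed from $v_{src}(t)$ ($v_{snk}(x)$) to $v_{src}(x)$ ($v_{snk}(t)$). Thus, from the assumption stated in (10-9), it follows that $\{v_{src}(z) \mid z \in (X + T)\}$ forms one SCC, $\{v_{snk}(z) \mid z \in (X + T)\}$ forms another SCC, and $F \equiv \{e(x) \mid x \in X\}$ is the set of feedforward edges.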
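Let $G$ denote the HSDFG that we have constructed, and as in Section 10.3, define $\chi(p) \equiv \{e \in F \mid (p \text{ subsumes } e)\}$ for each ordered pair of vertices $p = (y_1, y_2)$ such that $y_1$ is contained in the source SCC of $G$, and $y_2$ is contained in the sink SCC of $G$. Clearly, $G$ gives an instance of the pairwise resynchronization problem.
Observation 1: By construction of $G$, observe that $\{x \in X \mid ((v_{src}(t), v_{snk}(t)) \text{ subsumes } (v_{src}(x), v_{snk}(x)))\} = t$, for all $t \in T$. Thus, for all $t \in T$, $\chi((v_{src}(t), v_{snk}(t))) = \{e(x) \mid x \in t\}$.
Observation 2: For each $x \in X$, all input edges of $v_{src}(x)$ have unit delay on them. It follows that for any vertex $y$ in the sink SCC of $G$, $\chi((v_{src}(x), y)) \subseteq \{e(x)\}$.
Observation 3: For each $t \in T$, the only vertices in $G$ that have a delay-free path to $v_{src}(t)$ are those vertices contained in $\{v_{src}(x) \mid x \in t\}$. It follows that for any vertex $y$ in the sink SCC of $G$, $\chi((v_{src}(t), y)) \subseteq \chi((v_{src}(t), v_{snk}(t)))$.
Now suppose that $F' = \{f_1, f_2, \ldots, f_m\}$ is a minimal resynchronization of $G$. For each $i \in \{1, 2, \ldots, m\}$, exactly one of the following two cases must apply:
Case 1: $src(f_i) = v_{src}(x)$ for some $x \in X$. In this case, we pick an arbitrary $t \in T$ that contains $x$, and we set $v_i = v_{src}(t)$ and $w_i = v_{snk}(t)$. From Observation 2, it follows that $\chi((src(f_i), snk(f_i))) \subseteq \{e(x)\} \subseteq \chi((v_i, w_i))$.
Case 2: $src(f_i) = v_{src}(t)$ for some $t \in T$. We set $v_i = v_{src}(t)$ and $w_i = v_{snk}(t)$. From Observation 3, we have $\chi((src(f_i), snk(f_i))) \subseteq \chi((v_i, w_i))$.
Observation 4: From our definition of the $v_i$ and $w_i$, $\{(v_i, w_i) \mid (i \in \{1, 2, \ldots, m\})\}$ is a minimal resynchronization of $G$. Also, each $(v_i, w_i)$ is of the form $(v_{src}(t), v_{snk}(t))$, where $t \in T$.
Now, for each $i \in \{1, 2, \ldots, m\}$, we define $Z_i \equiv \{x \in X \mid ((v_i, w_i) \text{ subsumes } (v_{src}(x), v_{snk}(x)))\}$.
Claim: $\{Z_1, Z_2, \ldots, Z_m\}$ covers $X$. Proof: From Observation 4, we have that for each $Z_i$, there exists $t \in T$ such that $Z_i = t$. Thus, each $Z_i$ is a member of $T$. Also, since $\{(v_i, w_i) \mid (i \in \{1, 2, \ldots, m\})\}$ is a resynchronization of $G$, each member of $\{(v_{src}(x), v_{snk}(x)) \mid x \in X\}$ must be preserved by some $(v_i, w_i)$, and thus each $x \in X$ must be contained in some $Z_i$.
Claim: $\{Z_1, Z_2, \ldots, Z_m\}$ is a minimal cover for $X$. Proof (by contraposition): Suppose there exists a cover $\{Y_1, Y_2, \ldots, Y_{m'}\}$ (among the members of $T$) for $X$, with $m' < m$. Then, each $x \in X$ is contained in some $Y_j$, and from Observation 1, $(v_{src}(Y_j), v_{snk}(Y_j))$ subsumes $e(x)$. Thus, $\{(v_{src}(Y_i), v_{snk}(Y_i)) \mid (i \in \{1, 2, \ldots, m'\})\}$ is a resynchronization of $G$. Since $m' < m$, it follows that $F' = \{f_1, f_2, \ldots, f_m\}$ is not a minimal resynchronization of $G$. QED.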
In summary, we have shown how to convert an arbitrary instance $(X, T)$ of the set-covering problem into an instance $G$ of the pairwise resynchronization problem, and we have shown how to convert a solution of this instance of pairwise resynchronization into a solution of $(X, T)$. It is easily verified that all of the steps involved in deriving $G$ from $(X, T)$, and in deriving a solution of $(X, T)$ from a solution of $G$, can be performed in polynomial time. Thus, from the NP-hardness of set covering, we can conclude that the pairwise resynchronization problem is NP-hard.
A heuristic framework for the pairwise resynchronization problem emerges naturally from the relationship that was established in Section 10.3 between set covering and pairwise resynchronization. Given an arbitrary algorithm COVER that solves the set-covering problem, and given an instance of pairwise resynchronization that consists of two SCCs $C_1$ and $C_2$, and a set $S$ of feedforward synchronization edges directed from members of $C_1$ to members of $C_2$, this heuristic framework first computes the subset

$\chi((u, v)) \equiv \{e \in S \mid (\rho_G(src(e), u) = 0) \text{ and } (\rho_G(v, snk(e)) \le delay(e))\}$

for each ordered pair of actors $(u, v)$ that is contained in the set

$T \equiv \{(u', v') \mid (u' \text{ is in } C_1 \text{ and } v' \text{ is in } C_2)\}$,

and then applies the algorithm COVER to the instance of set covering defined by the set $S$ together with the family of subsets $\{\chi((u', v')) \mid ((u', v') \in T)\}$. If $\Xi$ denotes the solution returned by COVER, then a resynchronization for the given instance of pairwise resynchronization can be derived by

$\{(u, v) \mid (\chi((u, v)) \in \Xi)\}$.

This resynchronization is the solution returned by the heuristic framework. From the correspondence between set covering and pairwise resynchronization that is outlined in Section 10.3, it follows that the quality of a resynchronization obtained by the heuristic framework is determined entirely by the quality of the solution computed by the set-covering algorithm that is employed; that is, if the solution computed by COVER is worse (contains more subfamilies) than an optimal set-covering solution by some percentage, then the resulting resynchronization will be worse (contain more synchronization edges) by the same percentage than an optimal resynchronization of the given instance of pairwise resynchronization.
The application of the heuristic framework for pairwise resynchronization to each pair of SCCs, in some arbitrary order, in a general synchronization graph yields a heuristic framework for the general resynchronization problem. However, a major limitation of this extension to general synchronization graphs arises from its inability to consider resynchronization opportunities that involve paths that traverse more than two SCCs, and paths that contain more than one feedforward synchronization edge. Thus, in general, the quality of the solutions obtained by this approach will be worse than the quality of the solutions that are derived by the particular set-covering heuristic that is employed, and roughly, this discrepancy can be expected to increase as the number of SCCs increases relative to the number of synchronization edges in the original synchronization graph.
For example, Figure 10.3 shows the synchronization graph that results from a six-processor schedule of a synthesizer for plucked-string musical instruments in 11 voices based on the Karplus-Strong technique. Here, one actor represents the excitation input, one actor per voice represents the computation for that voice, and the actors marked with "+" signs specify adders. Execution time estimates for the actors are shown in the table at the bottom of the figure. In this example, only one pair of distinct SCCs has more than one synchronization edge between them; one of the SCCs in this pair contains five addition actors. Thus, the best result that can be derived from the heuristic extension for general synchronization graphs described above is a resynchronization that optimally rearranges the synchronization edges between these two SCCs in isolation, and leaves all other synchronization edges unchanged. Such a resynchronization is illustrated in Figure 10.4. This synchronization graph has a total of nine synchronization edges, which is only one less than the number of synchronization edges in the original graph. In contrast, it is shown in the following subsection that with a more flexible approach to resynchronization, the total synchronization cost of this example can be reduced to only five synchronization edges.
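The pairwise heuristic framework of this subsection can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the minimum path delays are assumed to be precomputed in a nested dictionary `rho`, each feedforward edge is a hypothetical `(src, snk, delay)` tuple, and the greedy rule that repeatedly picks the subset covering the most uncovered elements is used as one possible choice for COVER. The tiny two-SCC example at the end is also hypothetical.

```python
def chi(u, v, S, rho):
    """Edges e = (src, snk, delay) in S with rho[src][u] == 0 and
    rho[v][snk] <= delay, i.e. edges made redundant by a zero-delay
    edge from u to v."""
    return frozenset(e for e in S
                     if rho[e[0]][u] == 0 and rho[v][e[1]] <= e[2])

def greedy_cover(universe, family):
    """One possible COVER: greedy set covering. `family` maps a key
    to a set; returns the list of selected keys."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        key = max(family, key=lambda k: len(family[k] & uncovered))
        if not family[key] & uncovered:
            raise ValueError("family does not cover the universe")
        chosen.append(key)
        uncovered -= family[key]
    return chosen

def pairwise_resynchronize(C1, C2, S, rho, cover=greedy_cover):
    """Return actor pairs (u, v) to connect with zero-delay edges."""
    family = {(u, v): chi(u, v, S, rho) for u in C1 for v in C2}
    return cover(set(S), family)

# Hypothetical instance: SCC {a, b} with a->b (delay 0), b->a (delay 1);
# SCC {c, d} with c->d (delay 0), d->c (delay 1).
rho = {
    "a": {"a": 0, "b": 0}, "b": {"a": 1, "b": 0},
    "c": {"c": 0, "d": 0}, "d": {"c": 1, "d": 0},
}
S = [("a", "c", 0), ("b", "d", 0)]  # feedforward synchronization edges
print(pairwise_resynchronize(["a", "b"], ["c", "d"], S, rho))
# [('b', 'c')]
```

In this instance a single zero-delay edge from `b` to `c` subsumes both original feedforward edges, so two synchronizations are replaced by one. Because `cover` is a parameter, any other set-covering routine with the same interface could be substituted for the greedy rule.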
This subsection presents a more global approach to resynchronization, called Algorithm Global-resynchronize, which overcomes the major limitation of the pairwise approach discussed in Section 10.5.1. Algorithm Global-resynchronize is based on the simple greedy approximation algorithm for set covering that repeatedly selects a subset that covers the largest number of remaining elements, where a remaining element is an element that is not contained in any of the subsets that have already been selected. In [Joh74, Lov75] it is shown that this set-covering technique is guaranteed to compute a solution whose cardinality is no greater than $(\ln(|X|) + 1)$ times that of the optimal solution, where $X$ is the set that is to be covered.
To adapt this set-covering technique to resynchronization, we construct an instance of set covering by choosing the set $X$, the set of elements to be covered, to be the set of feedforward synchronization edges, and choosing the family of subsets to be

$T \equiv \{\chi((v_1, v_2)) \mid ((v_1, v_2 \in V) \text{ and } (\rho_G(v_2, v_1) = \infty))\}$, (10-10)

where $G = (V, E)$ is the input synchronization graph. The constraint $\rho_G(v_2, v_1) = \infty$ in (10-10) ensures that inserting the resynchronization edge $(v_1, v_2)$ does not introduce a cycle, and thus that it does not introduce deadlock or reduce the estimated throughput.

Figure 10.3. The synchronization graph that results from a six-processor schedule of a music synthesizer based on the Karplus-Strong technique.

Algorithm Global-resynchronize assumes that the input synchronization graph is reduced (a reduced synchronization graph can be derived efficiently, for example, by using the redundant synchronization removal technique discussed in the previous chapter). The algorithm determines the family of subsets specified by (10-10), chooses a member of this family that has maximum cardinality, inserts the corresponding delayless resynchronization edge, removes all synchronization edges that it subsumes, and updates the values $\rho_G(x, y)$ for the new synchronization graph that results. This entire process is then repeated on the new synchronization graph, and it continues until it arrives at a synchronization graph for which the computation defined by (10-10) produces the empty set, that is,
Figure 10.4. The synchronization graph that results from applying the heuristic framework based on pairwise resynchronization to the example of Figure 10.3.
Figure 10.5. Pseudocode for Algorithm Global-resynchronize. Input: a reduced synchronization graph G = (V, E). Output: an alternative reduced synchronization graph that preserves G.
the algorithm terminates when no more resynchronization edges can be added. Figure 10.5 gives a pseudocode specification of this algorithm (with some straightforward modifications to improve the running time).

To analyze the complexity of Algorithm Global-resynchronize, the following definition is useful. Suppose that $G$ is a synchronization graph. We denote by $DC(G)$ the number of distinct ordered vertex pairs $(x, y)$ in $G$ that satisfy $\rho_G(x, y) = 0$. That is, $DC(G) = |S(G)|$, where

$S(G) = \{(x, y) \mid \rho_G(x, y) = 0\}$. (10-11)
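The greedy set-covering strategy on which Algorithm Global-resynchronize is based can be sketched in isolation. The following is a minimal illustration only; the function name `greedy_set_cover` and the dictionary-based representation of the family of subsets are assumptions made for this example and do not come from the text.

```python
def greedy_set_cover(universe, subsets):
    """Greedy approximation for set covering.

    universe: iterable of elements to be covered (the set X).
    subsets:  dict mapping a (hypothetical) subset name to a set of elements.
    Returns the names of the chosen subsets, in the order chosen.
    """
    remaining = set(universe)   # elements not yet covered by a chosen subset
    chosen = []
    while remaining:
        # Select the subset that covers the largest number of remaining elements.
        best = max(subsets, key=lambda name: len(subsets[name] & remaining))
        if not subsets[best] & remaining:
            raise ValueError("the given subsets do not cover the universe")
        chosen.append(best)
        remaining -= subsets[best]
    return chosen


# Hypothetical instance: five elements and four candidate subsets.
example_universe = {1, 2, 3, 4, 5}
example_subsets = {"a": {1, 2, 3}, "b": {2, 4}, "c": {3, 4}, "d": {4, 5}}
example_cover = greedy_set_cover(example_universe, example_subsets)
```

In the resynchronization setting, the elements correspond to feedforward synchronization edges and each subset corresponds to the set of edges subsumed by one candidate resynchronization edge.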
The following lemma shows that as long as the input synchronization graph is reduced, the resynchronization operations performed in Algorithm Global-resynchronize always yield a reduced synchronization graph.

Lemma 10.3. Suppose that $G = (V, E)$ is a reduced synchronization graph, and $(x, y)$ is an ordered pair of vertices in $G$ such that $(x, y) \notin E$, $\rho_G(y, x) = \infty$, and $|\chi(x, y)| \ge 1$. Let $G'$ denote the synchronization graph obtained by inserting $d_0(x, y)$ into $G$ and removing all members of $\chi(x, y)$; that is, $G' = (V, E')$, where

$E' = (E - \chi(x, y)) \cup \{d_0(x, y)\}$.

Then $G'$ is a reduced synchronization graph. In other words, $G'$ does not contain any redundant synchronizations. Furthermore, $DC(G') > DC(G)$.

Proof. We prove the first part of this lemma by contraposition. Suppose that there exists a redundant synchronization edge $s$ in $G'$, and first suppose that $s = (x, y)$. Then from Fact 10.1, there exists a path $p$ in $G'$ directed from $x$ to $y$ such that

$Delay(p) = 0$ and $p$ does not contain $(x, y)$. (10-12)

Also, observe that from the definition of $E'$,

$(E' - \{d_0(x, y)\}) \subseteq E$. (10-13)

It follows from (10-12) and (10-13) that $G$ also contains the path $p$. Now let $(x', y')$ be an arbitrary member of $\chi(x, y)$. Then

$\rho_G(x', x) + \rho_G(y, y') \le delay(x', y')$. (10-14)

Since $G$ contains the path $p$, we have $\rho_G(x, y) = 0$, and thus, from the triangle inequality (3-4) together with (10-14),

$\rho_G(x', y') \le \rho_G(x', x) + \rho_G(x, y) + \rho_G(y, y') \le delay(x', y')$. (10-15)

We conclude that $(x', y')$ is redundant in $G$, which violates the assumption that $G$ is reduced.

If, on the other hand, $s \ne (x, y)$, then from Fact 10.1, there exists a simple path $p_s \ne (s)$ in $G'$ directed from $src(s)$ to $snk(s)$ such that

$Delay(p_s) \le delay(s)$. (10-16)

Also, it follows from (10-13) that $G$ contains $s$. Since $G$ is reduced, the path $p_s$ must contain the edge $(x, y)$ (otherwise $s$ would be redundant in $G$). Thus, $p_s$ can be expressed as a concatenation $p_s = \langle p_1, (x, y), p_2 \rangle$, where either $p_1$ or $p_2$ may be empty, but not both. Furthermore, since $p_s$ is a simple path, neither $p_1$ nor $p_2$ contains $(x, y)$. Hence, from (10-13), we are guaranteed that both $p_1$ and $p_2$ are also contained in $G$. Now from (10-16), we have

$Delay(p_1) + Delay(p_2) \le delay(s)$. (10-17)

Furthermore, from the definition of $\rho_G$,

$\rho_G(src(s), x) \le Delay(p_1)$ and $\rho_G(y, snk(s)) \le Delay(p_2)$. (10-18)

Combining (10-17) and (10-18) yields

$\rho_G(src(s), x) + \rho_G(y, snk(s)) \le delay(s)$, (10-19)

which implies that $s \in \chi(x, y)$. But this violates the assumption that $G'$ does not contain any edges that are subsumed by $(x, y)$ in $G$. This concludes the proof of the first part of Lemma 10.3.

It remains to be shown that $DC(G') > DC(G)$. Now, from an earlier lemma and Definition 10.3, it follows that

$S(G) \subseteq S(G')$. (10-20)

Also, from the first part of Lemma 10.3, which has already been proven, we know that $G'$ is reduced. Thus, from Lemma 10.2, we have

$(x, y) \notin S(G)$. (10-21)

But, clearly from the construction of $G'$, $\rho_{G'}(x, y) = 0$, and thus,

$(x, y) \in S(G')$. (10-22)

From (10-20), (10-21), and (10-22), it follows that $S(G)$ is a proper subset of $S(G')$. Hence, $DC(G') > DC(G)$. QED.
Clearly from Lemma 10.3, each time Algorithm Global-resynchronize performs a resynchronization operation (an iteration of the while loop of Figure 10.5), the number of ordered vertex pairs $(x, y)$ that satisfy $\rho_G(x, y) = 0$ is increased by at least one. Thus, the number of iterations of the while loop in Figure 10.5 is bounded above by $|V|^2$. The complexity of one iteration of the while loop is dominated by the computation in the pair of nested loops. The computation of one iteration of the inner loop is dominated by the time required to compute $\chi(x, y)$ for a specific actor pair $(x, y)$. Assuming $\rho_G(x', y')$ is available for all $x', y' \in V$, the time to compute $\chi(x, y)$ is $O(s_c)$, where $s_c$ is the number of feedforward synchronization edges in the current synchronization graph. Since the number of feedforward synchronization edges never increases from one iteration of the while loop to the next, it follows that the time-complexity of the overall algorithm is $O(s|V|^4)$, where $s$ is the number of feedforward synchronization edges in the input synchronization graph. In practice, however, the number of resynchronization steps (while-loop iterations) is usually much lower than $|V|^2$, since the constraints on the introduction of cycles severely limit the number of resynchronization steps. Thus, the $O(s|V|^4)$ bound can be viewed as a very conservative estimate.
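The overall structure of the heuristic can be sketched as follows. This is a simplified illustration under stated assumptions, not the pseudocode of Figure 10.5: the function names, the dictionary representation of the graph, and the four-actor example are hypothetical; every synchronization edge passed in is assumed to be a feedforward edge; the input graph is assumed to be reduced; and the path delays are simply recomputed from scratch after each step rather than updated incrementally.

```python
import itertools
import math

INF = math.inf


def min_delays(vertices, edges):
    """All-pairs minimum path delay rho(x, y), computed with Floyd-Warshall.

    edges maps (src, snk) to a nonnegative delay; rho(x, x) is taken to be 0.
    """
    rho = {(a, b): (0 if a == b else INF) for a in vertices for b in vertices}
    for (a, b), d in edges.items():
        rho[a, b] = min(rho[a, b], d)
    for k in vertices:
        for i in vertices:
            for j in vertices:
                if rho[i, k] + rho[k, j] < rho[i, j]:
                    rho[i, j] = rho[i, k] + rho[k, j]
    return rho


def subsumed(x, y, sync, edges, rho):
    """chi(x, y): synchronization edges subsumed by a delayless edge x -> y."""
    return {s for s in sync if rho[s[0], x] + rho[y, s[1]] <= edges[s]}


def global_resynchronize(vertices, edges, sync):
    """Greedy resynchronization sketch; returns (all edges, synchronization edges)."""
    edges, sync = dict(edges), set(sync)
    while True:
        rho = min_delays(vertices, edges)
        best, best_chi = None, set()
        for x, y in itertools.permutations(vertices, 2):
            # Skip existing edges and candidates that would create a cycle.
            if (x, y) in edges or rho[y, x] != INF:
                continue
            chi = subsumed(x, y, sync, edges, rho)
            if len(chi) > len(best_chi):
                best, best_chi = (x, y), chi
        if best is None:          # no candidate subsumes any edge: terminate
            return edges, sync
        for s in best_chi:        # remove the subsumed synchronization edges
            del edges[s]
            sync.discard(s)
        edges[best] = 0           # insert the delayless resynchronization edge
        sync.add(best)


# Hypothetical example: four actors on separate processors (self-loops omitted)
# and four delayless synchronization edges.
example_vertices = ["A", "B", "C", "D"]
example_sync = {("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")}
example_edges = {e: 0 for e in example_sync}
_, example_result = global_resynchronize(example_vertices, example_edges, example_sync)
```

For this small example the sketch replaces the four original synchronization edges with a three-edge chain. Because candidates are visited in a fixed order, ties between equally good candidates are broken by that order.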
It might seem natural to run Algorithm Global-resynchronize only as long as a resynchronization edge can be found that subsumes at least two existing synchronization edges. However, in general it may be advantageous to continue the resynchronization process even if each resynchronization candidate subsumes at most one synchronization edge. This is because although such a resynchronization candidate does not lead to an immediate reduction in synchronization cost, its insertion may lead to future resynchronization opportunities in which the number of synchronization edges can be reduced. Figures 10.6 and 10.7 illustrate a simple example. In the synchronization graph shown in Figure 10.6(a), there are 5 synchronization edges, among them (B, C). Self-loop edges incident to four of the actors, including C and F (each of these four actors executes on a separate processor), are omitted from the illustration for clarity. It is easily verified that no resynchronization candidate in Figure 10.6(a) subsumes more than one synchronization edge. If we terminate the resynchronization process at this point, we must accept a synchronization cost of 5 synchronization edges. However, suppose that we insert a resynchronization edge that subsumes (B, C), and then we remove the subsumed edge (B, C). We then arrive at the synchronization graph of Figure 10.6(b). In this graph, resynchronization candidates exist that subsume up to two synchronization edges each.
Figure 10.6. An example in which inserting a resynchronization edge that subsumes only one existing synchronization edge eventually leads to a reduction in the total number of synchronizations.
Alternatively, from Figure 10.6(b), we could insert the resynchronization edge (C, E) and remove both (A, E) and the subsumed synchronization edge directed from D. This gives us the synchronization graph of Figure 10.7(d), which contains four synchronization edges.
Figure 10.7. A continuation of the example in Figure 10.6.
This is the solution derived by an actual implementation of Algorithm Global-resynchronize [BSL96b] when it is applied to the graph of Figure 10.6(a).
Figure 10.8 shows the optimized synchronization graph that is obtained when Algorithm Global-resynchronize is applied to the example of Figure 10.3 (using the implementation discussed in [BSL96b]). Observe that the total number of synchronization edges has been reduced from 10 to 5. The total number of "resynchronization steps" (number of while-loop iterations) required by the heuristic to complete this resynchronization is 7.

Figure 10.8. The optimized synchronization graph that is obtained when Algorithm Global-resynchronize is applied to the example of Figure 10.3.

Table 10.1 shows the relative throughput improvement delivered by the optimized synchronization graph of Figure 10.8 over the original synchronization graph as the shared memory access time varies from 1 to 10 processor clock cycles. The back-off time for each simulation under the assumed synchronization protocol is obtained by the experimental procedure discussed in Section 9.5. The second and fourth columns show the average iteration period for the original synchronization graph and the resynchronized graph, respectively. The average iteration period, which is the reciprocal of the average throughput, is the average number of time units required to execute an iteration of the synchronization graph. From the sixth column, we see that the resynchronized graph consistently attains a throughput improvement over the original graph. This improvement includes the effect of reduced overhead for maintaining synchronization variables and reduced contention for shared memory. The third and fifth columns of Table 10.1 show the average number of shared memory accesses per iteration of the synchronization graph. Here we see that the resynchronized solution consistently obtains at least a 30% improvement over the original synchronization graph. Since accesses to shared memory typically require significant amounts of energy, particularly for a multiprocessor system that is not integrated on a single chip, this reduction in the average rate of shared memory accesses is especially useful when low power consumption is an important implementation issue.

Table 10.1. Performance comparison between the resynchronized solution and the original synchronization graph for the example of Figure 10.3. Columns: memory access time (1 to 10 clock cycles); iteration period and shared memory accesses per iteration for the original graph; iteration period and shared memory accesses per iteration for the resynchronized graph; decrease in iteration period.
The simulation is written in C, making use of a package called CSIM [Sch88] that allows concurrently running processes to be modeled. Each CSIM process is "created," after which it runs concurrently with the other processes in the simulation. Processes communicate and synchronize through events and mailboxes (which are FIFO queues of events between two processes). Time delays are specified by the function hold. Holding for an appropriate time causes the process to be put into an event queue, and the process "wakes up" when the simulation time has advanced by the amount specified by the hold statement. Passage of time is modeled in this fashion. In addition, CSIM allows specification of facilities, which can be accessed by only one process at a time. Mutual exclusion of access to shared resources is modeled in this fashion.

For the multiprocessor simulation, each processor is made into a process, and synchronization is attained by sending and receiving messages from mailboxes. The shared bus is made into a facility. Polling of the mailbox for checking the presence of data is done by first reserving the bus, then checking the message count on that particular mailbox; if the count is greater than zero, data can be read from shared memory, or else the processor backs off for a certain duration, and then resumes polling. When a processor sends data, it increments a counter in shared memory, and then writes the data value. When a processor receives, it first polls the corresponding counter, and if the counter is non-zero, it proceeds with the read; otherwise, it backs off for some time and then polls the counter again. Experimentally determined back-off times are used for each value of the memory access time. For a send, the processor checks whether the corresponding buffer is full. For the simulation, all buffers are sized equal to 5; these sizes can of course be jointly minimized to reduce buffer memory.

Polling time is defined as the time required to access the bus and check the counter value.
In this section, it is shown that although optimal resynchronization is intractable for general synchronization graphs, a broad class of synchronization graphs exists for which optimal resynchronizations can be computed using an efficient polynomial-time algorithm.
Suppose that $C$ is an SCC in a synchronization graph $G$, and $x$ is an actor in $C$. Then $x$ is an input hub of $C$ if for each feedforward synchronization edge $e$ in $G$ whose sink actor is in $C$, we have $\rho_C(x, snk(e)) = 0$. Similarly, $x$ is an output hub of $C$ if for each feedforward synchronization edge $e$ in $G$ whose source actor is in $C$, we have $\rho_C(src(e), x) = 0$. We say that $C$ is linkable if there exist actors $x, y$ in $C$ such that $x$ is an input hub, $y$ is an output hub, and $\rho_C(x, y) = 0$. A synchronization graph is chainable if each SCC is linkable.

For example, consider the SCC in Figure 10.9(a), and assume that the dashed edges represent the synchronization edges that connect this SCC with other SCCs. This SCC has exactly one input hub, actor A, and exactly one output hub, actor F, and since $\rho(A, F) = 0$, it follows that the SCC is linkable. However, if we remove the edge (C, F), then the resulting graph (shown in Figure 10.9(b)) is not linkable since it does not have an output hub. A class of linkable SCCs that occur commonly in practical synchronization graphs are those that correspond to only one processor, such as the SCC shown in Figure 10.9(c). In such cases, the first actor executed on the processor is always an input hub and the last actor executed is always an output hub.

In the remainder of this section, we assume that for each linkable SCC, an input hub $x$ and an output hub $y$ are selected such that $\rho(x, y) = 0$, and these actors are referred to as the selected input hub and the selected output hub of the associated SCC. Which input hub and output hub are chosen as the selected ones makes no difference to our discussion of the techniques in this section as long as they are selected so that $\rho(x, y) = 0$.

An important property of linkable synchronization graphs is that if $C_1$ and $C_2$ are distinct linkable SCCs, then all synchronization edges directed from $C_1$ to $C_2$ are subsumed by the single ordered pair $(o_1, i_2)$, where $o_1$ denotes the selected output hub of $C_1$ and $i_2$ denotes the selected input hub of $C_2$. Furthermore, if there exists a path between two SCCs $C_1'$, $C_2'$ of the form $((o_1, i_2), (o_2, i_3), \ldots, (o_{n-1}, i_n))$, where $o_1$ is the selected output hub of $C_1'$, $i_n$ is the selected input hub of $C_2'$, and there exist distinct SCCs $Z_1, \ldots, Z_{n-2}$, none equal to $C_1'$ or $C_2'$, such that for $k = 2, \ldots, (n-1)$, $i_k$ and $o_k$ are respectively the selected input hub and the selected output hub of $Z_{k-1}$, then all synchronization edges between $C_1'$ and $C_2'$ are redundant.
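The hub definitions above can be checked mechanically once the minimum path delays within an SCC are known. The following sketch is illustrative only: the function names, the edge representation, and the three-actor example (with external actors named X and Y) are assumptions made for this example, and the caller is assumed to supply only the feedforward synchronization edges that touch the SCC.

```python
import math

INF = math.inf


def min_delays(vertices, edges):
    """All-pairs minimum path delay within one SCC (Floyd-Warshall)."""
    rho = {(a, b): (0 if a == b else INF) for a in vertices for b in vertices}
    for (a, b), d in edges.items():
        rho[a, b] = min(rho[a, b], d)
    for k in vertices:
        for i in vertices:
            for j in vertices:
                rho[i, j] = min(rho[i, j], rho[i, k] + rho[k, j])
    return rho


def find_hubs(scc, scc_edges, feedforward):
    """Return (input hubs, output hubs, linkable?) for one SCC.

    scc:         list of actors in the SCC.
    scc_edges:   dict (src, snk) -> delay for edges inside the SCC.
    feedforward: list of (src, snk) feedforward synchronization edges
                 that enter or leave the SCC.
    """
    rho = min_delays(scc, scc_edges)
    members = set(scc)
    sinks = [snk for _, snk in feedforward if snk in members]
    sources = [src for src, _ in feedforward if src in members]
    # Input hub: a zero-delay path to the sink of every incoming edge.
    input_hubs = {x for x in scc if all(rho[x, v] == 0 for v in sinks)}
    # Output hub: a zero-delay path from the source of every outgoing edge.
    output_hubs = {x for x in scc if all(rho[v, x] == 0 for v in sources)}
    linkable = any(rho[x, y] == 0 for x in input_hubs for y in output_hubs)
    return input_hubs, output_hubs, linkable


# Hypothetical single-processor SCC: A, B, C execute in that order, and the
# loop edge from C back to A carries one delay.
example_scc = ["A", "B", "C"]
example_scc_edges = {("A", "B"): 0, ("B", "C"): 0, ("C", "A"): 1}
example_ff = [("X", "B"), ("B", "Y")]   # X and Y lie in other SCCs
example_in, example_out, example_linkable = find_hubs(
    example_scc, example_scc_edges, example_ff)
```

In this example the first actor executed on the processor (A) is an input hub and the last actor executed (C) is an output hub, in line with the observation made above for single-processor SCCs.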
Figure 10.9. An illustration of input and output hubs for synchronization graph SCCs.

From these properties, an optimal resynchronization for a chainable synchronization graph can be constructed efficiently by computing a topological sort of the SCCs, instantiating a zero-delay synchronization edge from the selected output hub of the $i$th SCC in the topological sort to the selected input hub of the $(i+1)$th SCC, for $i = 1, 2, \ldots, (n-1)$, where $n$ is the total number of SCCs, and then removing all of the redundant synchronization edges that result. For example, if this algorithm is applied to the chainable synchronization graph of Figure 10.10(a), then the synchronization graph of Figure 10.10(b) is obtained, and the number of synchronization edges is reduced from its original value of 4.

This chaining technique can be viewed as a form of pipelining, where each SCC in the output synchronization graph corresponds to a pipeline stage. As discussed in Chapter 5, pipelining can be used to increase the throughput in multiprocessor DSP implementations through improved parallelism. However, in the form of pipelining that is associated with chainable synchronization graphs, the load of each processor is unchanged, and the estimated throughput is not affected (since no new cyclic paths are introduced), and thus, the benefit to the throughput of the chaining technique arises chiefly from the optimal reduction of synchronization overhead. The time-complexity of the optimal algorithm discussed above for resynchronizing chainable synchronization graphs is polynomial in the number of synchronization graph actors.
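The chaining construction described above can be sketched with a standard topological sort. This is a minimal illustration under assumptions: the function name `chain_sccs`, the data layout, and the three-SCC example are hypothetical; the selected hubs are taken as given; and the subsequent removal of redundant synchronization edges is not shown.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def chain_sccs(selected_hubs, scc_edges):
    """Return the zero-delay synchronization edges that chain the SCCs.

    selected_hubs: dict SCC name -> (selected input hub, selected output hub).
    scc_edges:     set of (source SCC, sink SCC) pairs, one for each pair of
                   SCCs connected by at least one feedforward edge.
    """
    # TopologicalSorter expects a mapping from each node to its predecessors.
    predecessors = {name: set() for name in selected_hubs}
    for src, snk in scc_edges:
        predecessors[snk].add(src)
    order = list(TopologicalSorter(predecessors).static_order())
    # Connect the selected output hub of the i-th SCC to the selected input
    # hub of the (i+1)-th SCC, for i = 1, ..., n-1.
    return [(selected_hubs[a][1], selected_hubs[b][0])
            for a, b in zip(order, order[1:])]


# Hypothetical example with three SCCs whose topological order is unique.
example_hubs = {"C1": ("i1", "o1"), "C2": ("i2", "o2"), "C3": ("i3", "o3")}
example_scc_edges = {("C1", "C2"), ("C1", "C3"), ("C2", "C3")}
example_chain = chain_sccs(example_hubs, example_scc_edges)
```

With $n$ SCCs the chain contains $n - 1$ edges. When several topological orders exist, `static_order` returns one of them, and any of them yields a chain of the same length.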
It is easily verified that the original synchronization graph for the music synthesis example of Section 10.5.2, shown in Figure 10.3, is chainable. Thus, the chaining technique presented in Section 10.6.1 is guaranteed to produce an optimal resynchronization for this example, and since no feedback synchronization edges are present, the number of synchronization edges in the resynchronized solution is guaranteed to be equal to one less than the number of SCCs in the original synchronization graph; that is, the optimized synchronization graph contains 6 - 1 = 5 synchronization edges. From Figure 10.8, we see that this is precisely the number of synchronization edges in the synchronization graph that results from the implementation of Algorithm Global-resynchronize that was discussed in Section 10.5.2.

Figure 10.10. An illustration of an algorithm for optimal resynchronization of chainable synchronization graphs. The dashed edges are synchronization edges.

However, Algorithm Global-resynchronize does not always produce optimal results for chainable synchronization graphs. For example, consider the synchronization graph shown in Figure 10.11(a), which corresponds to an eight-processor schedule in which each processor executes its own subset of the actors (one processor, for instance, executes only actor B). The dashed edges are synchronization edges, and the remaining edges connect actors that are assigned to the same processor. The total number of synchronization edges is 14. Now it is easily verified that actor K is both an input hub and an output hub for the SCC that contains C and G, and similarly, a single actor is both an input hub and an output hub for the SCC that contains D. Thus, we see that the overall synchronization graph is chainable. It is easily verified that the chaining technique developed in Section 10.6.1 uniquely yields the optimal resynchronization illustrated in Figure 10.11(b), which contains only 11 synchronization edges.
In contrast, the quality of the resynchronization obtained for Figure 10.11(a) by Algorithm Global-resynchronize depends on the order in which the actors are traversed by each of the two nested loops in Figure 10.5. For example, if both loops traverse the actors in alphabetical order, then Global-resynchronize obtains the sub-optimal solution shown in Figure 10.11(c), which contains 12 synchronization edges. However, actor traversal orders exist for which Global-resynchronize achieves optimal resynchronizations of Figure 10.11(a). If both loops traverse the actors in such an order, then Global-resynchronize yields the same resynchronized graph that is computed uniquely by the chaining technique of Section 10.6.1 (Figure 10.11(b)). It is an open question whether or not, given an arbitrary chainable synchronization graph, actor traversal orders always exist with which Global-resynchronize arrives at optimal resynchronizations. Furthermore, even if such traversal orders are always guaranteed to exist, it is doubtful that they can, in general, be computed efficiently.
Figure 10.11. A chainable synchronization graph for which Algorithm Global-resynchronize fails to produce an optimal solution.

The chaining technique developed in Section 10.6.1 can be generalized to optimally resynchronize a somewhat broader class of synchronization graphs. This class consists of all synchronization graphs for which each source SCC has an output hub (but not necessarily an input hub), each sink SCC has an input hub (but not necessarily an output hub), and each internal SCC is linkable. In this case, the internal SCCs are pipelined as in the previous algorithm, and then for each source SCC, a synchronization edge is inserted from one of its output hubs to the selected input hub of the first SCC in the pipeline of internal SCCs, and for each sink SCC, a synchronization edge is inserted to one of its input hubs from the selected output hub of the last SCC in the pipeline of internal SCCs. If there are no internal SCCs, then the sink SCCs are pipelined by selecting one input hub from each SCC, and joining these input hubs with a chain of synchronization edges. Then a synchronization edge is inserted from an output hub of each source SCC to an input hub of the first SCC in the chain of sink SCCs.
In addition to guaranteed optimality, another important advantage of the chaining technique for chainable synchronization graphs is its relatively low time-complexity compared with the $O(s|V|^4)$ bound for Global-resynchronize, where $|V|$ is the number of synchronization graph actors and $s$ is the number of feedforward synchronization edges. The primary disadvantage is, of course, its restricted applicability. An obvious solution is to first check if the general form of the chaining technique (described above in Section 10.6.3) can be applied, apply the chaining technique if the check returns an affirmative result, or apply Algorithm Global-resynchronize if the check returns a negative result. The check must determine whether or not each source SCC has an output hub, each sink SCC has an input hub, and each internal SCC is linkable. This check can be performed in time polynomial in $n$, where $n$ is the number of actors in the input synchronization graph, using a straightforward algorithm. A useful direction for further investigation is a deeper integration of the chaining technique with Algorithm Global-resynchronize for general (not necessarily chainable) synchronization graphs.
Synchronization rearrangement has also been studied in the context of minimizing synchronization circuitry in the hardware synthesis of digital circuits [F 92], and significant differences in the underlying models prevent these techniques from applying to the context of self-timed multiprocessor implementation. In the graphical hardware model of [F 92], called the constraint graph model, each vertex corresponds to a separate hardware device and edges have arbitrary weights that specify sequencing constraints. When the source vertex has bounded execution time, a positive weight $w(e)$ (forward constraint) imposes the constraint

$start(snk(e)) \ge w(e) + start(src(e))$, (10-24)

while a negative weight $w(e)$ (backward constraint) implies

$start(snk(e)) \le |w(e)| + start(src(e))$. (10-25)

If the source vertex has unbounded execution time, the forward and backward constraints are relative to the completion time of the source vertex. In contrast, in the synchronization graph model, multiple actors can reside on the same processing element (implying zero synchronization cost between them), and the timing constraints always correspond to the case where $w(e)$ is positive and equal to the execution time of $src(e)$.

The implementation models and the associated implementation cost functions are also significantly different. A constraint graph is implemented using a scheduling technique called relative scheduling [F 92], which can roughly be viewed as intermediate between self-timed and fully-static scheduling. In relative scheduling, the constraint graph vertices that have unbounded execution time, called anchors, are used as reference points against which all other vertices are scheduled: for each vertex, an offset is specified for each anchor that affects the activation of that vertex, and the vertex is scheduled to occur once the specified number of clock cycles has elapsed from the completion of each such anchor. In the implementation of a relative schedule, each anchor has attached control circuitry that generates offset signals, and each vertex has a synchronization circuit that asserts a signal when all relevant offset signals are present. The resynchronization optimization is driven by a cost function that estimates the total area of the synchronization circuitry, where the offset circuitry area estimate for an anchor is a function of the maximum offset, and the synchronization circuitry estimate for a vertex is a function of the number of offset signals that must be monitored.

As a result of the significant differences in both the scheduling models and the implementation models, the techniques developed for resynchronizing constraint graphs do not extend in any straightforward manner to the resynchronization of synchronization graphs for self-timed multiprocessor implementation, and the solutions that we have discussed for synchronization graphs are significantly different in structure from those reported in [F 92]. For example, the fundamental relationships that have been established between set covering and the resynchronization of self-timed schedules have not emerged in the context of constraint graphs.
This chapter has discussed a post-optimization called resynchronization for self-timed, multiprocessor implementations of algorithms. The goal of resynchronization is to introduce new synchronizations in such a way that the number of additional synchronizations that become redundant exceeds the number of new synchronizations that are added, and thus the net synchronization cost is reduced. It was shown that optimal resynchronization is intractable by deriving a reduction from the classic set-covering problem. However, a broad class of systems was defined for which optimal resynchronization can be performed in polynomial time. This chapter also discussed a heuristic algorithm for resynchronization of general systems that emerges naturally from the correspondence to set covering. The performance of an implementation of this heuristic was demonstrated on a multiprocessor schedule for a music synthesis system. The results demonstrate that the heuristic can efficiently reduce synchronization overhead and improve throughput significantly.
Chapter 10 introduced the concept of resynchronization, a post-optimization for static multiprocessor schedules in which extraneous synchronization operations are introduced in such a way that the number of original synchronizations that consequently become redundant significantly exceeds the number of additional synchronizations. Redundant synchronizations are synchronization operations whose corresponding sequencing requirements are enforced completely by other synchronizations in the system. The amount of run-time overhead required for synchronization can be reduced significantly by eliminating redundant synchronizations [Sha89, BSL97]. Thus, effective resynchronization reduces the net synchronization overhead in the implementation of a multiprocessor schedule, and improves the overall throughput.

However, since additional serialization is imposed by the new synchronizations, resynchronization can produce a significant increase in latency. In Chapter 10, we discussed fundamental properties of resynchronization and we studied the problem of optimal resynchronization under the assumption that arbitrary increases in latency can be tolerated ("maximum-throughput resynchronization"). Such an assumption is valid, for example, in a wide variety of simulation applications. This chapter discusses the problem of computing an optimal resynchronization among all resynchronizations that do not increase the latency beyond a prespecified upper bound $L_{max}$. This study of resynchronization is based in the context of self-timed execution of iterative dataflow specifications, which is an implementation model that has been applied extensively for digital signal processing systems.

Latency constraints become important in interactive applications such as video conferencing, games, and telephony, where latency beyond a certain point becomes annoying to the user. This chapter demonstrates how to obtain the benefits of resynchronization while maintaining a specified latency constraint.
This section introduces a number of useful properties that pertain to the process by which resynchronization can make certain synchronization edges in the original synchronization graph become redundant. The following definition is fundamental to these properties.

Definition 11.1. If $G$ is a synchronization graph, $s$ is a synchronization edge in $G$ that is not redundant, $R$ is a resynchronization of $G$, and $s$ is not contained in the resynchronized graph, then we say that $R$ eliminates $s$. If $R$ eliminates $s$, $s' \in R$, and there is a path $p$ from $src(s)$ to $snk(s)$ in the resynchronized graph such that $p$ contains $s'$ and $Delay(p) \le delay(s)$, then we say that $s'$ contributes to the elimination of $s$.

A synchronization edge $s$ can be eliminated if a resynchronization creates a path $p$ from $src(s)$ to $snk(s)$ such that $Delay(p) \le delay(s)$. In general, the path $p$ may contain more than one resynchronization edge, and thus, it is possible that none of the resynchronization edges allows us to eliminate $s$ "by itself". In such cases, it is the contribution of all of the resynchronization edges within the path $p$ that enables the elimination of $s$. This motivates the choice of terminology in Definition 11.1. An example is shown in Figure 11.1.

The following two facts follow immediately from Definition 11.1.

Fact 11.1. Suppose that $G$ is a synchronization graph, $R$ is a resynchronization of $G$, and $r$ is a resynchronization edge in $R$. If $r$ does not contribute to the elimination of any synchronization edges, then $(R - \{r\})$ is also a resynchronization of $G$. If $r$ contributes to the elimination of one and only one synchronization edge $s$, then $((R - \{r\}) \cup \{s\})$ is a resynchronization of $G$.

Fact 11.2. Suppose that $G$ is a synchronization graph, $R$ is a resynchronization of $G$, $s$ is a synchronization edge in $G$, and $s'$ is a resynchronization edge in $R$ such that $delay(s') > delay(s)$. Then $s'$ does not contribute to the elimination of $s$.
For example, let $G$ denote the synchronization graph in Figure 11.2(a). Figure 11.2(b) shows a resynchronization $R$ of $G$. In the resynchronized graph of Figure 11.2(b), the resynchronization edge whose sink is $y_3$ does not contribute to the elimination of any of the synchronization edges of $G$, and thus Fact 11.1 guarantees that the set obtained by removing this edge from $R$, illustrated in Figure 11.2(c), is also a resynchronization of $G$. In Figure 11.2(c), it is easily verified that one of the remaining resynchronization edges contributes to the elimination of exactly one synchronization edge, and from Fact 11.1, we have that the set obtained by replacing that resynchronization edge with the synchronization edge it eliminates, illustrated in Figure 11.2(d), is also a resynchronization of $G$.
Figure 11.1. An illustration of Definition 11.1. Here each processor executes a single actor. A resynchronization of the synchronization graph in (a) is illustrated in (b). In this resynchronization, the resynchronization edge leaving V and the resynchronization edge entering W both contribute to the elimination of (V, W).
As discussed in Section 10.2, resynchronization cannot decrease the estimated throughput since it manipulates only the feedforward edges of a synchronization graph. Frequently in real-time DSP systems, latency is also an important issue, and although resynchronization does not degrade the estimated throughput, it generally does increase the latency. This section defines latency for self-timed multiprocessor systems.
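Suppose $G_0$ is an application graph, $G$ is a synchronization graph that results from a multiprocessor schedule for $G_0$, $x$ is an execution source (an actor that has no input edges or has nonzero delay on all input edges) in $G$, and $y$ is an actor in $G$ other than $x$. We define the latency from $x$ to $y$ by $L_G(x, y) = end(y, 1 + \rho_{G_0}(x, y))$. We refer to $x$ as the latency input associated with this measure of latency, and we refer to $y$ as the latency output.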
Intuitively, the latency is the time required for the first invocation of the latency input to influence the associated latency output, and thus the latency corresponds to the critical path in the dataflow implementation to the first output invocation that is influenced by the input. This interpretation of the latency as the critical path is widely used in VLSI signal processing [Kun88, Mad95].

In general, the latency can be computed by performing a simple simulation of the ASAP execution for $G$ through the $(1 + \rho_{G_0}(x, y))$th execution of $y$. Such a simulation can be performed as a functional simulation of an HSDFG $G_{sim}$ that has the same topology (vertices and edges) as $G$, and that maintains the simulation time of each processor in the values of data tokens. Each initial token (delay) in $G_{sim}$ is initialized to have the value 0, since these tokens are present at time 0. Then, a data-driven simulation of $G_{sim}$ is carried out. In this simulation, an actor may execute whenever it has sufficient data, and the value of the output token produced by the invocation of any actor $z$ in the simulation is given by

$$t(z) + \max(\{v \mid v \in S\}),$$

where $S$ is the set of token values consumed during the actor execution. In such a simulation, the $i$th token value produced by an actor $z$ gives the completion time of the $i$th invocation of $z$ in the ASAP execution of $G$. Thus, the latency can be determined as the value of the $(1 + \rho_{G_0}(x, y))$th output token produced by $y$. With careful implementation of the functional simulator described above, the latency can be determined in $O(d \times \max(\{|V|, s\}))$ time, where $d = 1 + \rho_{G_0}(x, y)$ and $s$ denotes the number of synchronization edges in $G$. The simulation approach described above is similar to approaches described in [TTL95].

For a broad class of synchronization graphs, latency can be analyzed even more efficiently during resynchronization. This is the class of synchronization graphs in which the first invocation of the latency output is influenced by the first invocation of the latency input. Equivalently, it is the class of graphs that contain at least one delayless path in the corresponding application graph directed from the latency input to the latency output. For transparent synchronization graphs, we can directly apply well-known longest-path based techniques for computing latency.

Suppose that $G_0$ is an application graph, $x$ is a source actor in $G_0$, and $y$ is an actor in $G_0$ that is not identical to $x$. If $\rho_{G_0}(x, y) = 0$, then we say that $G_0$ is transparent with respect to latency input $x$ and latency output $y$. If $G$ is a synchronization graph that corresponds to a multiprocessor schedule for $G_0$, we also say that $G$ is transparent.

1. Recall that $start(v, k)$ and $end(v, k)$ denote the times at which invocation $k$ of actor $v$ commences and completes execution. Also, note that $start(x, 1) = 0$ since $x$ is an execution source.
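The token-based ASAP simulation described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `asap_completion_times`, the `(src, snk, delay)` edge-list format, and the bound `n_iter` on the number of invocations per actor are hypothetical choices made for illustration.

```python
from collections import deque

def asap_completion_times(exec_time, edges, n_iter):
    """Data-driven simulation of an HSDFG (hypothetical helper, for illustration).

    exec_time: dict mapping each actor z to its execution time t(z).
    edges:     list of (src, snk, delay); delay is the number of initial tokens.
    n_iter:    number of invocations to simulate for every actor.
    Returns a dict mapping each actor to the list of its completion times.
    """
    # Each edge is a FIFO of token values; initial tokens carry the value 0.
    queues = [deque([0] * delay) for (_, _, delay) in edges]
    inputs = {a: [] for a in exec_time}
    outputs = {a: [] for a in exec_time}
    for idx, (src, snk, _) in enumerate(edges):
        outputs[src].append(idx)
        inputs[snk].append(idx)

    done = {a: [] for a in exec_time}
    progress = True
    while progress:
        progress = False
        for a in exec_time:
            # An actor may fire when every input edge holds at least one token.
            if len(done[a]) < n_iter and all(queues[i] for i in inputs[a]):
                consumed = [queues[i].popleft() for i in inputs[a]]
                # Output token value: t(z) + max of the consumed token values.
                finish = exec_time[a] + max(consumed, default=0)
                for i in outputs[a]:
                    queues[i].append(finish)
                done[a].append(finish)
                progress = True
    return done

# Three actors, each on its own processor; unit-delay self loops
# (an assumption of this example) serialize successive invocations.
times = asap_completion_times(
    {"x": 2, "a": 3, "y": 1},
    [("x", "a", 0), ("a", "y", 0), ("x", "x", 1), ("a", "a", 1), ("y", "y", 1)],
    n_iter=2,
)
print(times["y"])  # [6, 9]
```

In this small transparent example the first completion time of `y` is 6, which is the latency from `x` to `y` because `x` starts at time 0.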
If a synchronization graph is transparent with respect to a latency input/output pair, then the latency can be computed efficiently using longest-path calculations on an acyclic graph that is derived from the input synchronization graph $G$. This acyclic graph, which we denote by $fi(G)$, is constructed by removing all edges from $G$ that have nonzero delay; adding a vertex $v$, which represents the beginning of execution; setting $t(v) = 0$; and adding delayless edges from $v$ to each source actor (other than $v$) of the partial construction until the only source actor that remains is $v$. Figure 11.3 illustrates the derivation of $fi(G)$. Given two vertices $x$ and $y$ in $fi(G)$ such that there is a path in $fi(G)$ from $x$ to $y$, we denote the sum of the execution times along a path from $x$ to $y$ that has maximum cumulative execution time by $T_{fi(G)}(x, y)$. That is,
$$T_{fi(G)}(x, y) = \max\Big(\Big\{ \sum_{p \text{ traverses } z} t(z) \;\Big|\; p \text{ is a path from } x \text{ to } y \text{ in } fi(G) \Big\}\Big).$$

Figure 11.3. An example used to illustrate the construction of $fi(G)$. The graph on the bottom is $fi(G)$ if $G$ is the top graph.
If there is no path from $x$ to $y$, then we define $T_{fi(G)}(x, y)$ to be $-\infty$. Note that $T_{fi(G)}(x, y) < +\infty$ for all $x, y$, since $fi(G)$ is acyclic. The values $T_{fi(G)}(x, y)$ for all pairs $x, y$ can be computed in $O(n^3)$ time, where $n$ is the number of actors in $G$, by using a simple adaptation of the Floyd-Warshall algorithm described in Section 3.13.3.

Suppose that $G_0$ is an HSDFG that is transparent with respect to latency input $x$ and latency output $y$, $G_s$ is the synchronization graph that results from a multiprocessor schedule for $G_0$, and $G$ is a resynchronization of $G_s$. Then $\rho_G(x, y) = 0$, and thus $T_{fi(G)}(x, y) \ge 0$ (i.e., $T_{fi(G)}(x, y) \ne -\infty$).

Proof: Since $G_0$ is transparent, there is a delayless path $p$ in $G_0$ from $x$ to $y$. Let $(u_1, u_2, \ldots, u_n)$, where $x = u_1$ and $y = u_n$, denote the sequence of actors traversed by $p$. From the semantics of the HSDFG $G_0$, it follows that for $1 \le i < n$, either $u_i$ and $u_{i+1}$ execute on the same processor, with $u_i$ scheduled earlier than $u_{i+1}$, or there is a zero-delay synchronization edge in $G_s$ directed from $u_i$ to $u_{i+1}$. Thus, for $1 \le i < n$, we have $\rho_{G_s}(u_i, u_{i+1}) = 0$, and thus, that $\rho_{G_s}(x, y) = 0$. Since $G$ is a resynchronization of $G_s$, it follows from a lemma of Chapter 10 that $\rho_G(x, y) = 0$. QED.

The following theorem gives an efficient means for computing the latency $L_G$ for transparent synchronization graphs.
Theorem: Suppose that $G$ is a synchronization graph that is transparent with respect to latency input $x$ and latency output $y$. Then $L_G(x, y) = T_{fi(G)}(v, y)$.

Proof: By induction, we show that for every actor $w$ in $fi(G)$,

$$end(w, 1) = T_{fi(G)}(v, w), \quad (11\text{-}3)$$

which clearly implies the desired result.

First, let $mt(w)$ denote the maximum number of actors that are traversed by a path in $fi(G)$ (over all paths in $fi(G)$) that starts at $v$ and terminates at $w$. If $mt(w) = 1$, then clearly $w = v$. Since both the LHS and RHS of (11-3) are identically equal to $t(v) = 0$ when $w = v$, we have that (11-3) holds whenever $mt(w) = 1$.

Now suppose that (11-3) holds whenever $mt(w) \le k$, for some $k \ge 1$, and consider the scenario $mt(w) = k + 1$. Clearly, in the self-timed execution of $G$, invocation $w_1$, the first invocation of $w$, commences as soon as all invocations in the set

$$Z = \{z_1 \mid z \in P_w\}$$

have completed execution, where $z_1$ denotes the first invocation of actor $z$, and $P_w$ is the set of predecessors of $w$ in $fi(G)$. All members $z \in P_w$ satisfy $mt(z) \le k$, since otherwise $mt(w)$ would exceed $(k + 1)$. Thus, from the induction hypothesis, we have

$$start(w, 1) = \max(\{end(z, 1) \mid z \in P_w\}) = \max(\{T_{fi(G)}(v, z) \mid z \in P_w\}),$$

which implies that

$$end(w, 1) = \max(\{T_{fi(G)}(v, z) \mid z \in P_w\}) + t(w). \quad (11\text{-}4)$$

But, by definition of $T_{fi(G)}$, the RHS of (11-4) is clearly equal to $T_{fi(G)}(v, w)$, and thus we have that $end(w, 1) = T_{fi(G)}(v, w)$.
We have shown that (11-3) holds for $mt(w) = 1$, and that whenever it holds for $mt(w) = k \ge 1$, it must hold for $mt(w) = (k + 1)$. Thus, (11-3) holds for all values of $mt(w)$. QED.

In the context of resynchronization, the main benefit of transparent synchronization graphs is that the change in latency induced by adding a new synchronization edge (a "resynchronization operation") can be computed in $O(1)$ time, given $T_{fi(G)}(a, b)$ for all actor pairs $(a, b)$. We will discuss this further in Section 11.5. Since many practical application graphs contain delayless paths from input to output and these graphs admit a particularly efficient means for computing latency, the first implementation of latency-constrained resynchronization was targeted to the class of transparent synchronization graphs [BSL96a]. However, the overall resynchronization framework described in this chapter does not depend on any particular method for computing latency, and thus, it can be fully applied to general graphs (with a moderate increase in complexity) using the ASAP simulation approach mentioned above. This framework can also be applied to subclasses of synchronization graphs other than transparent graphs for which efficient techniques for computing latency are discovered.

An instance of the latency-constrained resynchronization problem consists of a synchronization graph $G$ with latency input $x$ and latency output $y$, and a latency constraint $L_{max} \ge L_G(x, y)$. A solution to such an instance is a resynchronization $R$ such that (1) $L_{R(G)}(x, y) \le L_{max}$, and (2) no resynchronization of $G$ that results in a latency less than or equal to $L_{max}$ has smaller cardinality than $R$.

Given a synchronization graph $G$ with latency input $x$ and latency output $y$, and a latency constraint $L_{max}$, we say that a resynchronization $R$ of $G$ is a latency-constrained resynchronization (LCR) if $L_{R(G)}(x, y) \le L_{max}$. Thus, the latency-constrained resynchronization problem is the problem of determining a minimal LCR.

Generally, resynchronization can be viewed as complementary to the Convert-to-SC-graph optimization defined in Chapter 10: resynchronization is performed first, followed by Convert-to-SC-graph. Under severe latency constraints, it may not be possible to accept the solution computed by Convert-to-SC-graph, in which case the feedforward edges that emerge from the resynchronized solution must be implemented with FFS. In such a situation, Convert-to-SC-graph can be attempted on the original (before resynchronization) graph to see if it achieves a better result than resynchronization without Convert-to-SC-graph. However, for transparent synchronization graphs that have only one source SCC and only one sink SCC, the latency is not affected by Convert-to-SC-graph, and thus, for such systems, resynchronization and Convert-to-SC-graph are fully complementary. This is fortunate since such systems arise frequently in practice.

Trade-offs between latency and throughput have been studied by Potkonjak and Srivastava in the context of transformations for dedicated implementation of linear computations [PS94]. Because this work is based on synchronous implementations, it does not address the synchronization issues and opportunities that we encounter in the self-timed dataflow context.
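The longest-path computation of latency for transparent graphs, discussed earlier in this section, can be illustrated with the following sketch. It is only an illustration under stated assumptions: the helper names `first_iteration_graph` and `longest_path_from`, the start-vertex label `"v_start"`, and the `(src, snk, delay)` edge format are hypothetical, and the sketch computes single-source longest paths in topological order rather than the all-pairs Floyd-Warshall adaptation mentioned in the text.

```python
import math

def first_iteration_graph(exec_time, edges, start="v_start"):
    """Build fi(G): drop nonzero-delay edges and add a zero-time start
    vertex with delayless edges to every actor left without a predecessor."""
    t = dict(exec_time)
    t[start] = 0
    succ = {a: [] for a in t}
    has_pred = set()
    for src, snk, delay in edges:
        if delay == 0:
            succ[src].append(snk)
            has_pred.add(snk)
    for a in exec_time:
        if a not in has_pred:
            succ[start].append(a)
    return t, succ

def longest_path_from(t, succ, src):
    """Return T(src, w) for every vertex w: the maximum cumulative execution
    time over paths from src to w, or -inf when no such path exists."""
    indeg = {a: 0 for a in t}
    for u in succ:
        for w in succ[u]:
            indeg[w] += 1
    ready = [a for a in t if indeg[a] == 0]
    T = {a: -math.inf for a in t}
    T[src] = t[src]
    while ready:  # process vertices in a topological order
        u = ready.pop()
        for w in succ[u]:
            if T[u] + t[w] > T[w]:
                T[w] = T[u] + t[w]
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return T

t, succ = first_iteration_graph(
    {"x": 2, "a": 3, "y": 1},
    [("x", "a", 0), ("a", "y", 0), ("x", "x", 1), ("a", "a", 1), ("y", "y", 1)],
)
T = longest_path_from(t, succ, "v_start")
print(T["y"])  # 6
```

For this three-actor chain the value `T["y"]` equals the completion time of the first invocation of `y` in a self-timed execution, as the theorem above states for transparent graphs.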
This section shows that the latency-constrained resynchronization problem is NP-hard even for the very restricted subclass of synchronization graphs in which each SCC corresponds to a single actor, and all synchronization edges have zero delay.

As with the maximum-throughput resynchronization problem, discussed in Chapter 10, the intractability of this special case of latency-constrained resynchronization can be established by a reduction from set-covering. To illustrate this reduction, we suppose that we are given the set $X = \{x_1, x_2, x_3, x_4\}$ and a family of subsets $T = \{t_1, t_2, t_3\}$, where each $t_j \subseteq X$. Figure 11.4 illustrates the instance of latency-constrained resynchronization that we derive from the instance of set-covering specified by $(X, T)$. Here, each actor corresponds to a single processor and the self loop edge for each actor is not shown. The numbers beside the actors specify the actor execution times, and the latency constraint is $L_{max} = 103$. In the graph of Figure 11.4, which we denote by $G$, the edges labeled $ex_1, ex_2, ex_3, ex_4$ correspond respectively to the members $x_1, x_2, x_3, x_4$ of the set $X$ in the set-covering instance, and the vertex pairs (resynchronization candidates) $(v, st_1)$, $(v, st_2)$, $(v, st_3)$ correspond to the members of $T$. For each relation $x_i \in t_j$, an edge exists that is directed from $st_j$ to $sx_i$. The latency input and latency output are defined to be $in$ and $out$ respectively, and it is assumed that $G$ is transparent.

The synchronization graph that results from an optimal resynchronization of $G$ is shown in Figure 11.5, with redundant resynchronization edges removed. Since two of the resynchronization candidates $(v, st_j)$ were chosen to obtain the solution shown in Figure 11.5, this solution corresponds to the solution of $(X, T)$ that consists of the two corresponding members of $T$.

Figure 11.5. The synchronization graph that results from a solution to the instance of latency-constrained resynchronization shown in Figure 11.4.

A correspondence between the set-covering instance $(X, T)$ and the instance of latency-constrained resynchronization defined by Figure 11.4 arises from two properties of the construction described above:

Observation 5: ($x_i \in t_j$ in the set-covering instance) $\Leftrightarrow$ ($(v, st_j)$ subsumes $ex_i$ in $G$).

Observation 6: If $R$ is an optimal LCR of $G$, then each resynchronization edge in $R$ is of the form

$$(v, st_i),\ i \in \{1, 2, 3\}, \quad \text{or of the form} \quad (st_j, sx_i),\ x_i \notin t_j. \quad (11\text{-}5)$$
The first observation is immediately apparent from inspection of Figure 11.4. A proof of the second observation follows.

Proof of Observation 6: We must show that no other resynchronization edges can be contained in an optimal LCR of $G$. Figure 11.6 specifies arguments with which we can discard all possibilities other than those given in (11-5). In the matrix shown in Figure 11.6(a), each entry specifies an index into the list of arguments given in Figure 11.6(b). For each of the six categories of arguments, except for #6, the reasoning is either obvious or easily understood from inspection of Figure 11.4. A proof of argument #6 follows shortly within this same section. For example, a candidate edge cannot be a resynchronization edge in $R$ if it already exists in the original synchronization graph (argument 1); an edge directed into $w$ cannot be in $R$ if there is a path in $G$ from $w$ to the source of that edge, since a cycle would result (argument 2); an edge cannot be in $R$ if it would create a path from $in$ to $out$ that traverses $w$ and some $st_i$ and thereby increases the latency to at least 204 (argument 3); an edge whose endpoints $a_1, a_2$ already satisfy $\rho_G(a_1, a_2) = 0$ is excluded by Lemma 10.2 (argument 4); and an edge from an actor to itself is excluded since otherwise there would be a delayless self loop (argument 5). Three of the entries in Figure 11.6 point to multiple argument categories. For example, an edge of the form $(sx_j, st_i)$ introduces a cycle if $x_j \in t_i$, and if $x_j \notin t_i$ it cannot be contained in $R$ because it would increase the latency beyond $L_{max}$.

The entries in Figure 11.6 marked OK are simply those that correspond to (11-5), and thus we have justified Observation 6. QED.

In the proof of Observation 6, we deferred the proof of Argument #6 for Figure 11.6. A proof of this argument follows.
Proof of Argument #6 in Figure 11.6: By contraposition, we show that $(w, z)$ cannot contribute to the elimination of any synchronization edge of $G$, and thus, from Fact 11.1, it follows from the optimality of $R$ that $(w, z) \notin R$. Suppose that $(w, z)$ contributes to the elimination of some synchronization edge $s$. Then

$$\rho_{\tilde{G}}(src(s), w) = \rho_{\tilde{G}}(z, snk(s)) = 0, \quad (11\text{-}6)$$

where $\tilde{G}$ denotes the resynchronized graph $R(G)$ (11-7).

From the matrix in Figure 11.6, we see that no resynchronization edge can have $z$ as the source vertex. Thus, $snk(s) \in \{z, out\}$. Now, if $snk(s) = z$, then $s$ is the synchronization edge of $G$ that is directed into $z$, and thus from (11-6), there is a zero-delay path from $src(s)$ to $w$ in $\tilde{G}$. However, the existence of such a path in $\tilde{G}$ implies the existence of a path from $in$ to $out$ that traverses $w$, some $st_j$, and some $sx_i$, which in turn implies that the latency from $in$ to $out$ is at least 104, and thus that $R$ is not a valid LCR.

On the other hand, if $snk(s) = out$, then $src(s) \in \{z, sx_1, sx_2, sx_3, sx_4\}$. Now, from (11-6), $src(s) = z$ implies the existence of a zero-delay path from $z$ to $w$ in $\tilde{G}$, which implies the existence of a path from $in$ to $out$ that traverses $z$ and $w$, which in turn implies that the latency is at least 204. On the other hand, if $src(s) = sx_i$ for some $i$, then since, from Figure 11.6, there are no resynchronization edges that have an $sx_i$ as the source, it follows from (11-6) that there must be a zero-delay path in $\tilde{G}$ from $out$ to $w$. The existence of such a path, however, implies the existence of a cycle in $\tilde{G}$, since $\rho_G(w, out) = 0$. Thus, $snk(s) = out$ implies that $R$ is not an LCR. QED.

Figure 11.6. Arguments used in the proof of Observation 6. Part (b) lists the following arguments: 1. Exists in $G$. 2. Introduces a cycle. 3. Increases the latency beyond $L_{max}$. 4. $\rho_G(a_1, a_2) = 0$ (Lemma 10.2). 5. Introduces a delayless self loop. 6. Proof is given below. A note on the matrix of part (a): assume that $x_i \notin t_j$; otherwise argument 1 applies.

The following observation states that a resynchronization edge of the form $(st_j, sx_i)$ contributes to the elimination of exactly one synchronization edge, which is the edge $ex_i$.
Observation 7: Suppose that $R$ is an optimal LCR of $G$, and suppose that $e = (st_j, sx_i)$ is a resynchronization edge in $R$, for some $i \in \{1, 2, 3, 4\}$, $j \in \{1, 2, 3\}$ such that $x_i \notin t_j$. Then $e$ contributes to the elimination of one and only one synchronization edge, namely $ex_i$.

Proof: Since $R$ is an optimal LCR, we know that $e$ must contribute to the elimination of at least one synchronization edge (from Fact 11.1). Let $s$ be some synchronization edge such that $e$ contributes to the elimination of $s$. Then

$$\rho_{R(G)}(src(s), st_j) = \rho_{R(G)}(sx_i, snk(s)) = 0. \quad (11\text{-}8)$$

Now from Figure 11.6, it is apparent that there are no resynchronization edges in $R$ that have $sx_i$ or $out$ as their source actor. Thus, from (11-8), $snk(s) = sx_i$ or $snk(s) = out$. Now, if $snk(s) = out$, then $src(s) = sx_k$ for some $k$, or $src(s) = z$. However, since no resynchronization edge has a member of $\{sx_1, sx_2, sx_3, sx_4\}$ as its source, we must (from (11-8)) rule out $src(s) = sx_k$. Similarly, if $src(s) = z$, then from (11-8) there exists a zero-delay path in $R(G)$ from $z$ to $st_j$, which in turn implies that $L_{R(G)}(in, out)$ is at least 140. But this is not possible, since the assumption that $R$ is an LCR guarantees that $L_{R(G)}(in, out) \le 103$. Thus, we conclude that $snk(s) \ne out$, and thus, that $snk(s) = sx_i$.

Now $snk(s) = sx_i$ implies that (a) $s = ex_i$ or (b) $s = (st_k, sx_i)$ for some $k$ such that $x_i \in t_k$ (recall that $x_i \notin t_j$, and thus, that $k \ne j$). If $s = (st_k, sx_i)$, then from (11-8), $\rho_{R(G)}(st_k, st_j) = 0$. It follows that for any member $x_l \in t_j$ there is a zero-delay path in $R(G)$ that traverses $st_k$, $st_j$, and $sx_l$. Thus, $s = (st_k, sx_i)$ does not hold, since otherwise $L_{R(G)}(in, out)$ would be at least 140. Thus, we are left with only possibility (a), $s = ex_i$. QED.

Now, suppose that we are given an optimal LCR $R$ of $G$. From Observation 7 and Fact 11.1, we have that for each resynchronization edge $(st_j, sx_i)$ in $R$ we can replace this resynchronization edge with $ex_i$ and obtain another optimal LCR. Thus from Observation 6, we can efficiently obtain an optimal LCR $R'$ such that all resynchronization edges in $R'$ are of the form $(v, st_i)$.

For each $x_i \in X$ such that

$$\exists t_j \ \big( (x_i \in t_j) \text{ and } ((v, st_j) \in R') \big), \quad (11\text{-}9)$$

we have that $ex_i \notin R'$. This is because $R'$ is assumed to be optimal, and thus, $R'(G)$ contains no redundant synchronization edges. For each $x_i \in X$ for which (11-9) does not hold, we can replace $ex_i$ with any $(v, st_j)$ that satisfies $x_i \in t_j$, and since such a replacement does not affect the latency, we know that the result will be another optimal LCR for $G$. In this manner, if we repeatedly replace each $ex_i$ that does not satisfy (11-9), then we obtain an optimal LCR $R''$ such that

each resynchronization edge in $R''$ is of the form $(v, st_i)$, (11-10)

and for each $x_i \in X$, there exists a resynchronization edge $(v, st_j)$ in $R''$ such that $x_i \in t_j$. (11-11)

It is easily verified that the set of synchronization edges eliminated by $R''$ is $\{ex_i \mid x_i \in X\}$. Thus, the set $T' \equiv \{t_j \mid (v, st_j) \text{ is a resynchronization edge in } R''\}$ is a cover for $X$, and the cost (number of synchronization edges) of the resynchronization $R''$ is $(N - |X| + |T'|)$, where $N$ is the number of synchronization edges in the original synchronization graph. Now, it is also easily verified (from Figure 11.4) that given an arbitrary cover $T_a$ for $X$, the resynchronization obtained from $R''$ by replacing the edges $(v, st_j)$, $t_j \in T'$, with the edges $(v, st_j)$, $t_j \in T_a$ (11-12), is also a valid LCR of $G$, and that the associated cost is $(N - |X| + |T_a|)$. Thus, it follows from the optimality of $R''$ that $T'$ must be a minimal cover for $X$, given the family of subsets $T$.

To summarize, we have shown how, from the particular instance $(X, T)$ of set-covering, we can construct a synchronization graph $G$ such that from a solution to the latency-constrained resynchronization problem instance defined by $G$, we can efficiently derive a solution to $(X, T)$. This example of the reduction from set-covering to latency-constrained resynchronization is easily generalized to an arbitrary set-covering instance. The generalized construction of the initial synchronization graph $G$ is specified by the steps listed in Figure 11.7. The main task in establishing a general correspondence between latency-constrained resynchronization and set-covering is generalizing Observation 6 to
Figure 11.7. A procedure for constructing an instance of latency-constrained resynchronization from an instance of set-covering, such that a solution to the former yields a solution to the latter. The procedure instantiates the actors $v$, $w$, $z$, $in$, and $out$ together with their associated edges; an actor labeled $st$ for each $t \in T$; an actor labeled $sx$ with execution time 60 for each $x \in X$; and, for each $x \in t$, the edge $d_0(st, sx)$.
constraints on the task (actor) execution times. Two-processor optimality results in multiprocessor scheduling have also been reported in the context of a stochastic model for parallel computation in which tasks have random execution times and communication patterns [Nic89].
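An instance of the two-processor latency-constrained resynchronization (2LCR) problem consists of a set of source processor actors $x_1, x_2, \ldots, x_p$ with associated execution times $\{t(x_i)\}$, such that each $x_i$ is the $i$th actor scheduled on the processor that corresponds to the source SCC of the synchronization graph; a set of sink processor actors $y_1, y_2, \ldots, y_q$ with associated execution times $\{t(y_i)\}$, such that each $y_i$ is the $i$th actor scheduled on the processor that corresponds to the sink SCC of the synchronization graph; a set of non-redundant synchronization edges $S = \{s_1, s_2, \ldots, s_n\}$ such that for each $s_i$, $src(s_i) \in \{x_1, x_2, \ldots, x_p\}$ and $snk(s_i) \in \{y_1, y_2, \ldots, y_q\}$; and a latency constraint $L_{max}$, which is a positive integer. A solution to such an instance is a minimal resynchronization $R$ that satisfies $L_{R(\tilde{G})}(x_1, y_q) \le L_{max}$. In the remainder of this section, we denote the synchronization graph corresponding to the generic instance of 2LCR by $\tilde{G}$.

We assume that $delay(s_i) = 0$ for all $s_i$, and we refer to the subproblem that results from this restriction as delayless 2LCR. This section demonstrates an algorithm that solves the delayless 2LCR problem in $O(N^2)$ time, where $N$ is the number of vertices in $\tilde{G}$. An extension of this algorithm to the general 2LCR problem (arbitrary delays can be present) is also given.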
An efficient polynomial-time solution to delayless 2LCR can be derived by reducing the problem to a special case of set-covering called interval covering, in which we are given an ordering $w_1, w_2, \ldots, w_N$ of the members of $X$ (the set that must be covered), such that the collection of subsets $T$ consists entirely of subsets of the form $\{w_a, w_{a+1}, \ldots, w_b\}$, $1 \le a \le b \le N$. Thus, while general set-covering involves covering a set from a collection of subsets, interval covering amounts to covering an interval from a collection of subintervals.

Interval covering can be solved in $O(|X||T|)$ time by a simple procedure that first selects the subset $\{w_1, w_2, \ldots, w_{b_1}\}$, where

$$b_1 = \max(\{b \mid (w_1, w_b \in t) \text{ for some } t \in T\});$$

then selects any subset of the form $\{w_{a_2}, w_{a_2+1}, \ldots, w_{b_2}\}$, $a_2 \le b_1 + 1$, where

$$b_2 = \max(\{b \mid (w_{b_1+1}, w_b \in t) \text{ for some } t \in T\});$$

then selects any subset of the form $\{w_{a_3}, w_{a_3+1}, \ldots, w_{b_3}\}$, $a_3 \le b_2 + 1$, where

$$b_3 = \max(\{b \mid (w_{b_2+1}, w_b \in t) \text{ for some } t \in T\});$$

and so on until $b_n = N$.
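The greedy procedure for interval covering can be written compactly. The following is a minimal sketch, assuming each subset is given as a pair `(a, b)` of 1-based positions standing for $\{w_a, \ldots, w_b\}$; the function name `interval_cover` and the convention of returning `None` for an uncoverable instance are illustrative choices, not part of the original text.

```python
def interval_cover(n, intervals):
    """Greedy cover of positions 1..n by closed intervals (a, b), 1 <= a <= b <= n.

    Returns the indices of the selected intervals in selection order,
    or None when some position cannot be covered.
    """
    chosen = []
    covered = 0  # positions 1..covered are already covered
    while covered < n:
        best = None
        for idx, (a, b) in enumerate(intervals):
            # Only intervals containing the first uncovered position qualify;
            # among those, keep the one that reaches farthest to the right.
            if a <= covered + 1 <= b:
                if best is None or b > intervals[best][1]:
                    best = idx
        if best is None:
            return None
        chosen.append(best)
        covered = intervals[best][1]
    return chosen

cover = interval_cover(8, [(1, 3), (2, 5), (4, 6), (6, 8), (5, 7)])
print(cover)  # [0, 2, 3]
```

Each pass of the outer loop scans all subsets once and extends the covered prefix by at least one position, which matches the $O(|X||T|)$ bound quoted above.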
To reduce delayless 2LCR to interval covering, we start with the following observations.

Observation 8: Suppose that $R$ is a resynchronization of $\tilde{G}$, $r \in R$, and $r$ contributes to the elimination of synchronization edge $s$. Then $r$ subsumes $s$. Thus, the set of synchronization edges that $r$ contributes to the elimination of is simply the set of synchronization edges that are subsumed by $r$.

Proof: This follows immediately from the restriction that there can be no resynchronization edges directed from a $y_j$ to an $x_i$ (feedforward resynchronization), and thus in $R(\tilde{G})$ there can be at most one synchronization edge in any path directed from $src(s)$ to $snk(s)$. QED.

Observation 9: If $R$ is a resynchronization of $\tilde{G}$, then

$$L_{R(\tilde{G})}(x_1, y_q) = \max(\{t_{pred}(src(s')) + t_{succ}(snk(s')) \mid s' \in R\}),$$

where $t_{pred}(x_i) \equiv \sum_{j \le i} t(x_j)$ for $i = 1, 2, \ldots, p$, and $t_{succ}(y_i) \equiv \sum_{j \ge i} t(y_j)$ for $i = 1, 2, \ldots, q$.
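Proof: Given a synchronization edge $(x_a, y_b) \in R$, there is exactly one delayless path in $R(\tilde{G})$ from $x_1$ to $y_q$ that contains $(x_a, y_b)$, and the set of vertices traversed by this path is $\{x_1, x_2, \ldots, x_a, y_b, y_{b+1}, \ldots, y_q\}$. The desired result follows immediately. QED.

Now, corresponding to each of the source processor actors $x_i$ that satisfies $t_{pred}(x_i) + t(y_q) \le L_{max}$, we define an ordered pair of actors (a "resynchronization candidate") by

$$v_i \equiv (x_i, y_j), \quad \text{where } j = \min(\{k \mid t_{pred}(x_i) + t_{succ}(y_k) \le L_{max}\}). \quad (11\text{-}13)$$

Consider the example shown in Figure 11.8. Here, we assume that $t(z) = 1$ for each actor $z$, and $L_{max} = 10$. Applying (11-13) with these values yields the resynchronization candidates for this example.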
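The latency expression of Observation 9 and the candidate definition can be tried out numerically. The sketch below is an illustration only: it uses 0-based indices, the function names are hypothetical, and the candidate rule (pair each $x_i$ with the earliest $y_j$ that still meets the latency bound) is this sketch's reading of the candidate definition rather than a quotation of it.

```python
from itertools import accumulate

def pred_succ_times(tx, ty):
    """t_pred[i] sums t(x_1..x_(i+1)); t_succ[j] sums t(y_(j+1)..y_q)."""
    t_pred = list(accumulate(tx))
    t_succ = list(accumulate(reversed(ty)))[::-1]
    return t_pred, t_succ

def two_processor_latency(tx, ty, sync_edges):
    """Latency from x_1 to y_q for delayless synchronization edges (i, j),
    taken as the maximum of t_pred + t_succ over the edges."""
    t_pred, t_succ = pred_succ_times(tx, ty)
    return max(t_pred[i] + t_succ[j] for i, j in sync_edges)

def resync_candidates(tx, ty, l_max):
    """Map each source index i to the smallest sink index j that keeps
    t_pred[i] + t_succ[j] within l_max; sources with no such j are omitted."""
    t_pred, t_succ = pred_succ_times(tx, ty)
    cands = {}
    for i in range(len(tx)):
        for j in range(len(ty)):
            if t_pred[i] + t_succ[j] <= l_max:
                cands[i] = j
                break
    return cands

tx, ty = [1, 1, 1, 1], [1, 1, 1, 1]
print(two_processor_latency(tx, ty, [(0, 0), (1, 1), (2, 2), (3, 3)]))  # 5
print(resync_candidates(tx, ty, 6))  # {0: 0, 1: 0, 2: 1, 3: 2}
```

Because `t_succ` never increases with `j`, the first index that satisfies the bound is also the smallest one, so the inner loop can stop at the first hit.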
If $v_i$ exists for a given $x_i$, then $d_0(v_i)$ can be viewed as the best resynchronization edge that has $x_i$ as the source actor, and thus, to construct an optimal LCR, we can select the set of resynchronization edges entirely from among the $v_i$'s. This is established by the following two observations.

Observation 10: Suppose that $R$ is an LCR of $\tilde{G}$, and suppose that $(x_a, y_b)$ is a delayless synchronization edge in $R$ such that $(x_a, y_b) \ne v_a$. Then $(R - \{(x_a, y_b)\} + \{d_0(v_a)\})$ is an LCR of $\tilde{G}$.

Proof: Let $v_a = (x_a, y_c)$ and $R' = (R - \{(x_a, y_b)\} + \{d_0(v_a)\})$, and observe that $v_a$ exists, since

$$((x_a, y_b) \in R) \Rightarrow (t_{pred}(x_a) + t_{succ}(y_b) \le L_{max}) \Rightarrow (t_{pred}(x_a) + t(y_q) \le L_{max}).$$

From Observation 8 and the assumption that $(x_a, y_b)$ is delayless, the set of synchronization edges that $(x_a, y_b)$ contributes to the elimination of is simply the set of synchronization edges that are subsumed by $(x_a, y_b)$. Now, if $s$ is a synchronization edge that is subsumed by $(x_a, y_b)$, then

$$\rho_{\tilde{G}}(src(s), x_a) + \rho_{\tilde{G}}(y_b, snk(s)) \le delay(s). \quad (11\text{-}16)$$

From the definition of $v_a$, we have that $c \le b$, and thus, that $\rho_{\tilde{G}}(y_c, y_b) = 0$. It follows from (11-16) that

$$\rho_{\tilde{G}}(src(s), x_a) + \rho_{\tilde{G}}(y_c, snk(s)) \le delay(s),$$

and thus, that $v_a$ subsumes $s$. Hence, $v_a$ subsumes all synchronization edges that $(x_a, y_b)$ contributes to the elimination of, and we can conclude that $R'$ is a valid resynchronization of $\tilde{G}$. From the definition of $v_a$, we know that $t_{pred}(x_a) + t_{succ}(y_c) \le L_{max}$, and thus, since $R$ is an LCR, we have from Observation 9 that $R'$ is an LCR. QED.

Figure 11.8. An instance of delayless, two-processor latency-constrained resynchronization. In this example, the execution times of all actors are identically equal to unity.
From Fact 11.2 and the assumption that the members of $S$ are all delayless, an optimal LCR of $\tilde{G}$ consists only of delayless synchronization edges. Thus from Observation 10, we know that there exists an optimal LCR that consists only of members of the form $d_0(v_i)$. Furthermore, from Observation 9, we know that a collection $V$ of $v_i$'s is an LCR if and only if

$$\bigcup_{v \in V} \chi(v) = \{s_1, s_2, \ldots, s_n\},$$

where $\chi(v)$ is the set of synchronization edges that are subsumed by $v$. The following observation completes the correspondence between 2LCR and interval covering.

Observation 11: Let $s_1', s_2', \ldots, s_n'$ be the ordering of $s_1, s_2, \ldots, s_n$ specified by

$$(src(s_i') = x_a,\ src(s_j') = x_b,\ a < b) \Rightarrow (i < j). \quad (11\text{-}18)$$

Then for each $j$ such that $v_j$ exists, $\chi(v_j)$ is of the form $\{s_a', s_{a+1}', \ldots, s_b'\}$ for some $1 \le a \le b \le n$.
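Proof of Observation 11: Let $v_j = (x_j, y_c)$, and suppose that $s_i', s_m' \in \chi(v_j)$ and that $k$ is a positive integer such that $i < k < m$. Then from (11-18), we know that $\rho_{\tilde{G}}(src(s_k'), src(s_m')) = 0$. Thus, since $s_m' \in \chi(v_j)$, we have that

$$\rho_{\tilde{G}}(src(s_k'), x_j) = 0. \quad (11\text{-}21)$$

Now clearly

$$\rho_{\tilde{G}}(snk(s_i'), snk(s_k')) = 0, \quad (11\text{-}22)$$

since otherwise $\rho_{\tilde{G}}(snk(s_k'), snk(s_i')) = 0$ and thus (from (11-18)) $s_k'$ subsumes $s_i'$, which contradicts the assumption that the members of $S$ are not redundant. Finally, since $s_i' \in \chi(v_j)$, we know that $\rho_{\tilde{G}}(y_c, snk(s_i')) = 0$. Combining this with (11-22) yields

$$\rho_{\tilde{G}}(y_c, snk(s_k')) = 0, \quad (11\text{-}23)$$

and (11-21) and (11-23) together yield that $s_k' \in \chi(v_j)$. QED.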
From Observation 11 and the preceding discussion, we conclude that an optimal LCR of $\tilde{G}$ can be obtained by the following steps.

(a) Construct the ordering $s_1', s_2', \ldots, s_n'$ specified by (11-18).
(b) For $i = 1, 2, \ldots, p$, determine whether or not $v_i$ exists, and if it exists, compute $v_i$.
(c) Compute $\chi(v_j)$ for each value of $j$ such that $v_j$ exists.
(d) Find a minimal cover $C$ for $S$ given the family of subsets $\{\chi(v_j) \mid v_j \text{ exists}\}$.
(e) Define the resynchronization $R = \{v_j \mid \chi(v_j) \in C\}$.

Steps (a), (b), and (e) can clearly be performed in time that does not exceed $O(N^2)$, where $N$ is the number of vertices in $\tilde{G}$. If the algorithm outlined in Section 11.4.1 is employed for step (d), then from the discussion in Section 11.4.1 and Observation 12(e) in Section 11.4.3, it can be easily verified that the time complexity of step (d) is $O(N^2)$. Step (c) can also be performed in $O(N^2)$ time using the observation that if $v_i = (x_i, y_j)$, then

$$\chi(v_i) = \{(x_a, y_b) \in S \mid a \le i \text{ and } b \ge j\},$$

where $S = \{s_1, s_2, \ldots, s_n\}$ is the set of synchronization edges in $\tilde{G}$. Thus, we have the following result.

Theorem: Polynomial-time solutions (quadratic in the number of synchronization graph vertices) exist for the delayless, two-processor latency-constrained resynchronization problem.

Note that solutions more efficient than the $O(N^2)$ approach described above may exist.
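Steps (a), (c), (d), and (e) can be combined into one short routine once the candidates of step (b) are known. The following sketch is hypothetical in its naming and data layout (0-based index pairs, candidates passed in as a dictionary from source index to sink index), and it assumes, as in the delayless case discussed above, that every subsumed set is contiguous in the ordering by source actor.

```python
def cover_with_candidates(sync_edges, cands):
    """Select a minimum set of candidate edges that subsume every
    delayless synchronization edge of a two-processor instance.

    sync_edges: list of (a, b) index pairs meaning the edge (x_a, y_b).
    cands:      dict i -> j meaning the candidate edge (x_i, y_j).
    Returns the chosen candidates as (i, j) pairs.
    """
    edges = sorted(sync_edges)  # step (a): order edges by source actor
    spans = []
    for i, j in cands.items():
        # Step (c): (x_i, y_j) subsumes (x_a, y_b) when a <= i and j <= b.
        hit = [n for n, (a, b) in enumerate(edges) if a <= i and j <= b]
        if hit:
            spans.append((hit[0], hit[-1], (i, j)))

    chosen, nxt = [], 0  # nxt: position of the first edge not yet subsumed
    while nxt < len(edges):  # step (d): greedy interval covering
        usable = [s for s in spans if s[0] <= nxt <= s[1]]
        if not usable:
            raise ValueError("no candidate subsumes edge %r" % (edges[nxt],))
        lo, hi, edge = max(usable, key=lambda s: s[1])
        chosen.append(edge)  # step (e): collect the resynchronization edges
        nxt = hi + 1
    return chosen

result = cover_with_candidates(
    [(0, 0), (1, 1), (2, 2), (3, 3)],
    {0: 0, 1: 0, 2: 1, 3: 2},
)
print(result)  # [(1, 0), (3, 2)]
```

In this small instance, with unit execution times and a latency bound of 6 assumed when the candidates were formed, four synchronization edges are replaced by two.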
From (11-20), we see that there are two possible solutions that can result if we apply Steps (a)-(e) to Figure 11.8(a) and use the technique described earlier for interval covering. These solutions correspond to two distinct interval covers. The synchronization graph that results from one of these interval covers is shown in Figure 11.8(b).
If delays exist on one or more edges of the original synchronization graph, then the correspondence defined in the previous subsection between 2LCR and interval covering does not necessarily hold. For example, consider the synchronization graph in Figure 11.9. Here, the numbers beside the actors specify execution times; a "D" on top of an edge specifies a unit delay; the latency input and latency output are respectively $x_1$ and $y_8$; and the latency constraint is $L_{max} = 12$. It is easily verified that $v_i$ exists for $i = 1, 2, \ldots, 6$, and from (11-13), we obtain the corresponding resynchronization candidates.

Now if we order the synchronization edges as specified by (11-18), then

$$s_i' = (x_i, y_{i+4}) \text{ for } i = 1, 2, 3, 4, \quad \text{and} \quad s_i' = (x_i, y_{i-4}) \text{ for } i = 5, 6, 7, 8, \quad (11\text{-}25)$$

and if the correspondence between delayless 2LCR and interval covering defined in the previous section were to hold for general 2LCR, then we would have that each subset $\chi(v_i)$ is of the form

$$\{s_a', s_{a+1}', \ldots, s_b'\},\ 1 \le a \le b \le 8. \quad (11\text{-}26)$$

However, computing the subsets $\chi(v_i)$ yields the collection referred to as (11-27), and these subsets are clearly not all consistent with the form specified in (11-26). Thus, the algorithm developed in Section 11.4.2 does not apply directly to handle delays.

However, the technique developed in Section 11.4.2 can be extended to the general 2LCR problem in polynomial time. This extension is based on separating the subsumption relationships between the $v_i$'s and the synchronization edges into two categories: if $v_i = (x_i, y_j)$ subsumes the synchronization edge $s = (x_k, y_l)$, then we say that $v_i$ 1-subsumes $s$ if $i < k$, and we say that $v_i$ 2-subsumes $s$ if $i \ge k$. For example, in Figure 11.9 one candidate 1-subsumes two of the synchronization edges, and another 2-subsumes two others, including an edge whose source is $x_5$.
Assuming the same notation for a generic instance of 2LCR that was defined in the previous subsection, the initial synchronization graph $\tilde{G}$ satisfies the following conditions.

Observation 12:
(a) Each synchronization edge has at most one unit of delay ($delay(s) \in \{0, 1\}$).
(b) If $(x_i, y_j)$ is a zero-delay synchronization edge and $(x_k, y_l)$ is a unit-delay synchronization edge, then $i < k$ and $j > l$.
(c) If $v_i$ 1-subsumes a unit-delay synchronization edge $(x_k, y_l)$, then $v_i$ also 1-subsumes all unit-delay synchronization edges $s$ that satisfy $src(s) = x_{k+n}$, $n > 0$.
(d) If $v_i$ 2-subsumes a unit-delay synchronization edge $(x_k, y_l)$, then $v_i$ also 2-subsumes all unit-delay synchronization edges $s$ that satisfy $src(s) = x_{k-n}$, $n > 0$.
(e) If $(x_i, y_j)$ and $(x_k, y_l)$ are both distinct zero-delay synchronization edges or they are both distinct unit-delay synchronization edges, then $i \ne k$ and $(i < k) \Leftrightarrow (j < l)$.
(f) If $(x_i, y_j)$ 1-subsumes a unit-delay synchronization edge $(x_k, y_l)$, then $j \le l$.

Proof outline: From the transparency property established earlier, we know that $\rho_{\tilde{G}}(x_1, y_q) = 0$. Thus, there exists at least one delayless synchronization edge in $\tilde{G}$. Let $e$ be one such delayless synchronization edge. Then it is easily verified from the structure of $\tilde{G}$ that for all $x_i, y_j$, there exists a path $p_{i,j}$ in $\tilde{G}$ directed from $x_i$ to $y_j$ such that $p_{i,j}$ contains $e$, $p_{i,j}$ contains no other synchronization edges, and $Delay(p_{i,j}) \le 2$. It follows that any synchronization edge $e'$ whose delay exceeds unity would be redundant in $\tilde{G}$. Thus, part (a) follows from the assumption that none of the synchronization edges in $\tilde{G}$ are redundant. The other parts can be verified easily from the structure of $\tilde{G}$, including the assumption that no synchronization edge in $\tilde{G}$ is redundant. We omit the details.

Resynchronizations for instances of general 2LCR can be partitioned into two categories: category A consists of all resynchronizations that contain at least one synchronization edge having nonzero delay, and category B consists of all resynchronizations that consist entirely of delayless synchronization edges.
An optimal category A solution (a category A solution whose cost is less than or equal to the cost of all category A solutions) can be derived by simply applying the optimal solution described in Subsection 11.4.2 to "rearrange" the delayless resynchronization edges, and then replacing all synchronization edges that have nonzero delay with a single unit-delay synchronization edge directed from $x_p$, the last actor scheduled on the source processor, to $y_1$, the first actor scheduled on the sink processor. We refer to this approach as Algorithm A.

An example is shown in Figure 11.10. For the instance of general 2LCR shown in Figure 11.10(a), the constraint that all resynchronization edges have zero delay is too restrictive to permit a globally optimal solution. Here, the latency constraint is assumed to be $L_{max} = 2$. Under this constraint, it is easily seen that no zero-delay resynchronization edges can be added without violating the latency constraint. However, if we allow resynchronization edges that have delay, then we can apply Algorithm A to achieve a cost of two synchronization edges. The resulting synchronization graph, with redundant synchronization edges removed, is shown in Figure 11.10(b). Observe that this resynchronization is an LCR since only delayless synchronization edges affect the latency of a transparent synchronization graph.

Figure 11.10. An example in which constraining all resynchronization edges to be delayless makes it impossible to derive an optimal resynchronization.

Now suppose that $\tilde{G}$ (our generic instance of 2LCR) contains at least one unit-delay synchronization edge, suppose that $G_b$ is an optimal category B solution for $\tilde{G}$, and let $R_b$ denote the set of resynchronization edges in $G_b$. Let $ud(\tilde{G})$ denote the set of synchronization edges in $\tilde{G}$ that have unit delay, and let $(x_{k_1}, y_{l_1}), (x_{k_2}, y_{l_2}), \ldots, (x_{k_M}, y_{l_M})$ denote the ordering of the members of $ud(\tilde{G})$ that corresponds to the order in which the source actors execute on the source processor; that is, $(i < j) \Rightarrow (k_i < k_j)$. Note from Observation 12(a) that $ud(\tilde{G})$ is the set of all synchronization edges in $\tilde{G}$ that are not delayless. Also, let $1subs(\tilde{G}, G_b)$ denote the set of unit-delay synchronization edges in $\tilde{G}$ that are 1-subsumed by resynchronization edges in $G_b$. That is,

$$1subs(\tilde{G}, G_b) \equiv \{s \in ud(\tilde{G}) \mid (\exists (z_1, z_2) \in R_b) \text{ s.t. } ((z_1, z_2) \text{ 1-subsumes } s \text{ in } \tilde{G})\}.$$

If $1subs(\tilde{G}, G_b)$ is not empty, define

$$r \equiv \min(\{j \mid (x_{k_j}, y_{l_j}) \in 1subs(\tilde{G}, G_b)\}). \quad (11\text{-}28)$$

Suppose $(x_{k_{m'}}, y_{l_{m'}}) \in 1subs(\tilde{G}, G_b)$. Then by definition of $r$, $m' \ge r$, and thus $\rho_{\tilde{G}}(y_{l_r}, y_{l_{m'}}) = 0$. Furthermore, since $x_{k_{m'}}$ and $x_1$ execute on the same processor, $\rho_{\tilde{G}}(x_{k_{m'}}, x_1) \le 1$. Hence

$$\rho_{\tilde{G}}(x_{k_{m'}}, x_1) + \rho_{\tilde{G}}(y_{l_r}, y_{l_{m'}}) \le 1 = delay(x_{k_{m'}}, y_{l_{m'}}),$$

so we have that $(x_1, y_{l_r})$ subsumes $(x_{k_{m'}}, y_{l_{m'}})$ in $\tilde{G}$. Since $(x_{k_{m'}}, y_{l_{m'}})$ is an arbitrary member of $1subs(\tilde{G}, G_b)$, we conclude that

every member of $1subs(\tilde{G}, G_b)$ is subsumed by $(x_1, y_{l_r})$. (11-29)

Now, if $\Gamma \equiv (ud(\tilde{G}) - 1subs(\tilde{G}, G_b))$ is not empty, then define

$$u \equiv \max(\{j \mid (x_{k_j}, y_{l_j}) \in \Gamma\}), \quad (11\text{-}30)$$

and suppose $(x_{k_{m'}}, y_{l_{m'}}) \in \Gamma$. By definition of $u$, $m' \le u$, and thus $\rho_{\tilde{G}}(x_{k_{m'}}, x_{k_u}) = 0$. Furthermore, since $y_{l_{m'}}$ and $y_1$ execute on the same processor, $\rho_{\tilde{G}}(y_1, y_{l_{m'}}) \le 1$. Hence,

$$\rho_{\tilde{G}}(x_{k_{m'}}, x_{k_u}) + \rho_{\tilde{G}}(y_1, y_{l_{m'}}) \le 1 = delay(x_{k_{m'}}, y_{l_{m'}}),$$

and we have that

every member of $\Gamma$ is subsumed by $(x_{k_u}, y_1)$. (11-31)

Observe also that from the definitions of $r$ and $u$, and from Observation 12(c),

$$(1subs(\tilde{G}, G_b) \ne \emptyset \text{ and } \Gamma \ne \emptyset) \Rightarrow (u = r - 1); \quad (11\text{-}32)$$

$$(\Gamma = \emptyset) \Rightarrow (r = 1); \quad (11\text{-}33)$$

and

$$(1subs(\tilde{G}, G_b) = \emptyset) \Rightarrow (u = M). \quad (11\text{-}34)$$

Now we define the synchronization graph $Z(\tilde{G})$ by

$$Z(\tilde{G}) \equiv (V, (E - ud(\tilde{G}) + P)), \quad (11\text{-}35)$$

where $V$ and $E$ are the sets of vertices and edges in $\tilde{G}$; $P = \{d_0(x_1, y_{l_r}), d_0(x_{k_u}, y_1)\}$ if both $1subs(\tilde{G}, G_b)$ and $\Gamma$ are non-empty; $P = \{d_0(x_1, y_{l_r})\}$ if $\Gamma$ is empty; and $P = \{d_0(x_{k_u}, y_1)\}$ if $1subs(\tilde{G}, G_b)$ is empty.

Theorem 11.3: $G_b$ is a resynchronization of $Z(\tilde{G})$.
Proofi The set of synchronization edges in 6) of delayless synchronization edges in 6 . Since it suffices to show that for each e E P , e))
Eo P , where
is the set resynchronization of 6 , (1 1-36)
0.
G,) is non-empty then from (1 1-28) (the definition of r and If Observation 12(f), there must be delayless synchronization edge e' in G, such e') for some W 5 Thus, that 0 and we have that (1 1-36) satisfied for e
0
ylr)
Similarly if l? is non-empty, then from ( l 1-30) (the definition of U and from the definition of there exists delayless synchronization edge e' in G, such that e') for some W k , Thus, Y,) hence, we have that (1 1-36)
fG,(s"k(e'), satisfied for e
PG,(Xk,r
0
0
From the definition of P it follows that (11-36) is satisfied for every The latency of
not greater than
That is,
Y,) FromTheorem11.3,weknow
that G, preserves Z( 6) Thus,from
Chapter
Lemma 10.l , it follows that y,) 5 LGb(xl,y,) Furthermore, from the assumption that Gb is an optimal category B LCR, we have LGb(xl,y,) 5 conclude that y,) Theorem 1 1.3,along with (1 1-32)-( 1 1-34), tells us that an optimal category B LCR of is always a resynchronization of (1) a sync~onizationgraph of the form
or
or
Thus, from Corollary 1 1.1, anoptimal resynchronization can be computed by examining each of the 1) 1) sync~onization graphs defined by (1)-(3), computing an optimal LCR for each of these graphs whose latency is no greater than L,,, and returning one of the optimal LCRs that has the fewest number of sync~onizationedges. This is straightfor~ardsince these graphs contain only delayless synchronization edges, and thus the algorithm of Section 11.4.2 can be used. Recall the example of Figure 1 &a). 1 Here, Y4) and the set of synchronization graphs that correspond to (1)-(3)are shown in Figures 1 1.1 1 and 1 1.12. The latencies of the graphs in Figure 1 1.1 (a)-(c) 1 and Fig12, we only ure 11.12(a-b) are respectively 14, 13, 12, 13, and 14. Since L,,, need to compute an optimal for the graph of Figure 1 1l(c) (from Corollary 1 1.1).This isdone by first removing redundant edges from the graph (yielding the graph in Figure 1l .13(b)) and then applying the algorithm developed in For the syn~hronization graph of Figure 11,13(b), and ~ection 11 L,,, 12 it is easily verified that the set of is {(x57
Y3),
7
Y4)t v3 v6
If we let
Y d t v4 y8)
(x47 Ys)
.l
then we have
Figure 11.12. Continuation of the illustration of Algorithm B in Figure 11.11.
Figure 11.13. Optimal LCR example.
From (11-38), the algorithm outlined in Section 11.4.1 for interval covering can be applied to obtain an optimal resynchronization. This results in the resynchronization $R = \{v_1, v_3, v_6\}$. The resulting synchronization graph is shown in Figure 11.13(c). Observe that the number of synchronization edges has been reduced from 8 to 3, while the latency has increased from 10 to $L_{max} = 12$. Also, none of the original synchronization edges are retained in the resynchronization.
We say that Algorithm B for general 2LCR is the approach of constructing the synchronization graphs corresponding to (1)-(3), computing an optimal LCR for each of these graphs whose latency is no greater than $L_{max}$, and returning one of the optimal LCRs that has the fewest number of synchronization edges. We have shown previously that Algorithm B leads to an optimal LCR. Thus, given an instance of general 2LCR, a globally optimal solution can be derived by applying Algorithm A and Algorithm B and retaining the best of the resulting two solutions. The time complexity of this two-phased approach is dominated by the complexity of Algorithm B, which is greater, by a factor determined by the size of $ud(G)$, than the complexity of the technique for delayless 2LCR that was developed in Section 11.4.2. Since the size of $ud(G)$ is bounded (Observation 12(e)), the overall complexity remains polynomial: polynomial-time solutions exist for the general two-processor latency-constrained resynchronization problem.
The example in Figure 11.10 shows how it is possible for Algorithm A to produce a better result than Algorithm B. Conversely, the ability of Algorithm B to outperform Algorithm A can be demonstrated through the example of Figure 11.9. From Figure 11.13(c), we know that the result computed by Algorithm B has a cost of 3 synchronization edges. The result computed by Algorithm A can be derived by applying interval covering to the subsets specified in (11-27) with all of the unit-delay edges ($s_5'$, $s_7'$, $s_8'$) removed, which gives the family of subsets (11-39). The synchronization graph that Algorithm A computes from a minimal cover for (11-39) is shown in Figure 11.14. This solution has a cost of 4 synchronization edges, which is one greater than that of the result computed by Algorithm B for this example.
Figure 11.14. The solution derived by Algorithm A when it is applied to the example of Figure 11.9.
In Section 10.5, we discussed a heuristic called Global-resynchronize for the maximum-throughput resynchronization problem, which is the problem of determining an optimal resynchronization under the assumption that arbitrary increases in latency can be tolerated. In this section, we extend Algorithm Global-resynchronize to derive an efficient heuristic that addresses the latency-constrained resynchronization problem for general synchronization graphs. Given an input synchronization graph $G$, Algorithm Global-resynchronize operates by first computing the family of subsets specified by (11-40). After computing this family of subsets, Algorithm Global-resynchronize chooses a member of this family that has maximum cardinality, inserts the corresponding delayless resynchronization edge, and removes all synchronization edges that become redundant as a result of inserting this resynchronization edge. To extend this technique for maximum-throughput resynchronization to the latency-constrained resynchronization problem, we simply restrict the subset computation in (11-40) to those candidate resynchronization edges $(v_1, v_2)$ that satisfy $L'(v_1, v_2) \le L_{max}$ (11-41), where $L'$ is the latency of the synchronization graph $(V, \{E + \{(v_1, v_2)\}\})$ that results from adding the resynchronization edge $(v_1, v_2)$ to $G$.
A pseudocode specification of the extension of Global-resynchronize to the latency-constrained resynchronization problem, called Algorithm Global-LCR, is shown in Figure 11.15.
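The selection step described above can be illustrated with a minimal Python sketch. It is not the pseudocode of Figure 11.15. It assumes that a candidate edge $(v_1, v_2)$ is admissible when no path leads from $v_2$ back to $v_1$, and that it makes a synchronization edge $s$ redundant when the minimum path delays satisfy $\rho(src(s), v_1) + \rho(v_2, snk(s)) \le delay(s)$. The function names, the tuple encoding of edges, and the `latency_after` callback (which stands in for $L'$) are all hypothetical.

```python
import math

def min_delay_paths(vertices, edges):
    """All-pairs minimum path delay (rho); math.inf means 'no path'."""
    rho = {x: {y: (0 if x == y else math.inf) for y in vertices}
           for x in vertices}
    for src, snk, delay in edges:
        rho[src][snk] = min(rho[src][snk], delay)
    for k in vertices:                      # Floyd-Warshall relaxation
        for x in vertices:
            for y in vertices:
                rho[x][y] = min(rho[x][y], rho[x][k] + rho[k][y])
    return rho

def pick_resync_edge(vertices, all_edges, sync_edges, latency_after, l_max):
    """Return (candidate edge, sync edges it makes redundant).

    latency_after(v1, v2) is a caller-supplied (hypothetical) function
    giving the latency L' of the graph with edge (v1, v2) added.
    """
    rho = min_delay_paths(vertices, all_edges)
    best_pair, best_cover = None, []
    for v1 in vertices:
        for v2 in vertices:
            # Skip candidates that would close a cycle (assumed condition).
            if v1 == v2 or rho[v2][v1] != math.inf:
                continue
            # Latency constraint added for the latency-constrained variant.
            if latency_after(v1, v2) > l_max:
                continue
            cover = [(s, t, d) for (s, t, d) in sync_edges
                     if rho[s][v1] + rho[v2][t] <= d]
            if len(cover) > len(best_cover):
                best_pair, best_cover = (v1, v2), cover
    return best_pair, best_cover

# Two processors: a -> b and c -> d, with sync edges (a, c) and (b, d).
vertices = ["a", "b", "c", "d"]
intra = [("a", "b", 0), ("c", "d", 0)]
sync = [("a", "c", 0), ("b", "d", 0)]
latency = lambda v1, v2: 12 if (v1, v2) == ("b", "c") else 10  # made-up values
print(pick_resync_edge(vertices, intra + sync, sync, latency, l_max=12))
```

In this toy graph the single edge (b, c) makes both original synchronization edges redundant, but only when the latency budget admits it. The sketch returns only the best candidate; whether to insert it is left to the caller.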
In Section 11.2, we mentioned that transparent synchronization graphs are advantageous for performing latency-constrained resynchronization. If the input synchronization graph is transparent, then assuming that $T_{fi(G)}(x, y)$ has been determined for all $x, y \in V$, $L'$ in Algorithm Global-LCR can be computed in $O(1)$ time from
$L'(v_1, v_2) = \max(\{T_{fi(G)}(\upsilon, v_1) + T_{fi(G)}(v_2, o_L),\ L_G\})$, (11-42)
where $\upsilon$ is the source actor in $fi(G)$, $o_L$ is the latency output, and $L_G$ is the latency of $G$.
Furthermore, $T_{fi(G)}(x, y)$ can be updated in the same manner as $\rho_G$. That is, once the resynchronization edge $(v_1, v_2)$ is chosen, we have that for each $(x, y) \in (V \times V)$,
$T_{new}(x, y) = \max(\{T_{fi(G)}(x, y),\ T_{fi(G)}(x, v_1) + T_{fi(G)}(v_2, y)\})$, (11-43)
where $T_{new}$ denotes the maximum cumulative execution time between actors in the first iteration graph after the insertion of the edge $(v_1, v_2)$ in $G$. The computations in (11-43) can be performed by inserting the simple loop shown in Figure 11.16 into Algorithm Global-LCR, at the end of the block that processes the chosen resynchronization edge. Thus, as with the computation of $\rho_G$, the Bellman-Ford algorithm need only be invoked once, at the beginning of the LCR algorithm, to initialize $T_{fi(G)}(x, y)$. This loop can be inserted immediately before or after the loop that updates $\rho_G$.
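As an illustration of this incremental update, the following Python sketch keeps a table `T[x][y]` of maximum cumulative execution times and refreshes it after one edge is inserted. It rests on two assumptions that are not spelled out in the text here: path lengths include the execution times of both end actors, and the inserted edge does not close a cycle in the first-iteration graph. The function names and the dictionary layout are hypothetical.

```python
NEG_INF = float("-inf")  # marks "no path" between two actors

def latency_after(T, source, v1, v2, latency_output, current_latency):
    """Latency if edge (v1, v2) were added, in O(1) from stored longest paths."""
    return max(T[source][v1] + T[v2][latency_output], current_latency)

def update_longest_paths(T, vertices, v1, v2):
    """Return the longest-path table after inserting the edge (v1, v2).

    Any new longest path uses the new edge exactly once, so it is a
    longest path into v1 followed by a longest path out of v2.
    """
    new_T = {x: dict(T[x]) for x in vertices}
    for x in vertices:
        for y in vertices:
            via_new_edge = T[x][v1] + T[v2][y]
            if via_new_edge > new_T[x][y]:
                new_T[x][y] = via_new_edge
    return new_T

# Toy graph: s -> a -> o and s -> b -> o, execution times s=1, a=2, b=3, o=1.
V = ["s", "a", "b", "o"]
T = {x: {y: NEG_INF for y in V} for x in V}
T["s"]["s"], T["a"]["a"], T["b"]["b"], T["o"]["o"] = 1, 2, 3, 1
T["s"]["a"], T["s"]["b"], T["a"]["o"], T["b"]["o"], T["s"]["o"] = 3, 4, 3, 4, 5

print(latency_after(T, "s", "a", "b", "o", current_latency=5))
T = update_longest_paths(T, V, "a", "b")
print(T["s"]["o"])
```

The update touches every actor pair once, so it costs time quadratic in the number of actors per inserted edge, whereas recomputing all longest paths from scratch would cost more.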
In Section 10.5, it was shown that Algorithm Global-resynchronize has $O(s_{ff} n^4)$ time-complexity, where $n$ is the number of actors in the input synchronization graph, and $s_{ff}$ is the number of feedforward synchronization edges. Since the longest path quantities $T_{fi(G)}$ can be computed once initially and then updated in $O(n^2)$ time per resynchronization step, it is easily verified that the $O(s_{ff} n^4)$ bound also applies to the customization of Algorithm Global-LCR to transparent synchronization graphs. In general, whenever the nested loops in Figure 11.15 dominate the computation of the while loop, the $O(s_{ff} n^4)$ complexity is maintained as long as $(L'(x, y) \le L_{max})$ can be evaluated in $O(1)$ time. For general (not necessarily transparent) synchronization graphs, we can use the functional simulation approach described in Section 11.2 to determine $L'(x, y)$ in $O(d \times \max(\{n, s\}))$ time, with $d$ as in Section 11.2 and where $s$ denotes the number of synchronization edges in $G$. This yields a running time for general synchronization graphs that exceeds the $O(s_{ff} n^4)$ bound by the cost of these latency evaluations.
The complexity bounds derived above are based on a general upper bound of $n^2$, which is derived in Section 10.5, on the total number of resynchronization steps (while loop iterations). However, this bound can be viewed as a very conservative estimate since, in practice, constraints on the introduction of cycles severely limit the number of possible resynchronization steps. Thus, on practical graphs, we can expect significantly lower average-case complexity than these worst-case bounds.
Figure 11.15. A heuristic for latency-constrained resynchronization (Function Global-LCR; input: a reduced synchronization graph $G = (V, E)$; output: an alternative reduced synchronization graph that preserves $G$).
Figure 11.16. Pseudocode to update $T_{fi(G)}$ for use in the customization of Algorithm Global-LCR to transparent synchronization graphs.
Figure 11.17 shows the synchronization graph that results from a six-processor schedule of a synthesizer for plucked-string musical instruments in 11 voices based on the Karplus-Strong technique, as shown in Section 10.5. In this example, out is the latency output, and the latency is 170. There are ten synchronization edges shown, and none of these is redundant. Figure 11.18 shows how the number of synchronization edges in the result computed by the heuristic changes as the latency constraint varies. If just over 50 units of latency can be tolerated beyond the original latency of 170, then the heuristic is able to eliminate a single synchronization edge. No further improvement can be obtained unless roughly another 50 units are allowed, at which point the number of synchronization edges drops to 8, and then down to 7 for an additional 8 time units of allowable latency. If the latency constraint is weakened to 382, just over twice the original latency, then the heuristic is able to reduce the number of synchronization edges to 6. No further improvement is achieved over the relatively long range of (383-644). When $L_{max} \ge 645$, the minimal cost of 5 synchronization edges for this system is attained, which is half that of the original synchronization graph.
Figure 11.17. The synchronization graph that results from a six-processor schedule of a music synthesizer based on the Karplus-Strong technique.
Figure 11.18. Performance of the heuristic on the example of Figure 11.17.
Figure 11.19 and Table 11.1 show how the average iteration period (the reciprocal of the average throughput) varies with different memory access times for various resynchronizations of Figure 11.17. Here, the column of Table 11.1 and the plot of Figure 11.19 labeled A represent the original synchronization graph (before resynchronization); column/plot label B represents the resynchronized result corresponding to the first break-point of Figure 11.18 ($L_{max} = 221$, 9 synchronization edges); label C corresponds to the second break-point of Figure 11.18 ($L_{max} = 268$, 8 synchronization edges); and so on for labels D, E, and F, whose associated synchronization graphs have 7, 6, and 5 synchronization edges, respectively. Thus, as we go from label A to label F, the number of synchronization edges in the resynchronized solution decreases monotonically. However, as seen in Figure 11.19, the average iteration period need not exactly follow this trend. For example, even though synchronization graph A has one synchronization edge more than graph B, the iteration period curve for graph B lies slightly above that of A. This is because the simulations shown in the figure model a shared bus, and take bus contention into account. Thus, even though graph B has one less synchronization edge than graph A, it entails higher bus contention, and hence results in a higher average iteration period. A similar anomaly is seen between graph C and graph D, where graph D has one less synchronization edge than graph C, but still has a higher average iteration period. However, we observe such anomalies only within highly localized neighborhoods in which the number of synchronization edges differs by only one. Overall, in a global sense, the figure shows a clear trend of decreasing iteration period with loosening of the latency constraint, and reduction of the number of synchronization edges.
Figure 11.19. Average iteration period (reciprocal of average throughput) vs. memory access time for various latency-constrained resynchronizations of the music synthesis example in Figure 11.17.
Table 11.1. Performance results for the resynchronization of Figure 11.17. The first column gives the memory access time; the remaining entries give the average iteration period (the reciprocal of the average throughput) and the memory accesses per graph iteration.
It is difficult to model bus contention analytically, and for precise performance data we must resort to a detailed simulation of the shared bus system. Such a simulation can be used as a means of verifying that the resynchronization optimization does not result in performance degradation due to higher bus contention. Experimental observations suggest that this needs to be done only for cases where the number of synchronization edges removed by resynchronization is small compared to the total number of synchronization edges (i.e., when the resynchronized solution is within a localized neighborhood of the original synchronization graph).
Figure 11.20 shows that the average number of shared memory accesses per graph iteration decreases consistently with loosening of the latency constraint. As mentioned earlier, such a reduction in shared memory accesses is relevant when power consumption is an important issue, since accesses to shared memory often require significant amounts of energy.
Figure 11.20. Average number of shared memory accesses per iteration for various latency-constrained resynchronizations of the music synthesis example.
Figure 11.21 illustrates how the placement of synchronization edges changes as the heuristic is able to attain lower synchronization costs. Note that synchronization graphs computed by the heuristic are not necessarily identical over any of the ranges in Figure 11.18 in which the number of synchronization edges is constant. In fact, they can be significantly different. This is because even when there are no resynchronization candidates available that can reduce the net synchronization cost (that is, no candidates that make more than one existing synchronization edge redundant), the heuristic attempts to insert resynchronization edges for the purpose of increasing the connectivity; this increases the chance that subsequent resynchronization candidates will be generated that do reduce the net cost, as discussed in Chapter 10. For example, Figure 11.23 shows the synchronization graph computed when $L_{max}$ is just below the amount needed to permit the minimal solution, which requires only five synchronization edges (solution F). Comparison with the graph shown in Figure 11.21(d) shows that even though these solutions have the same synchronization cost, the heuristic had much more room to pursue further resynchronization opportunities with $L_{max} = 644$, and thus the graph of Figure 11.23 is more similar to the minimal solution than it is to the solution of Figure 11.21(d).
Earlier, we mentioned that the $O(s_{ff} n^4)$ complexity expression and its counterpart for general synchronization graphs are conservative, since they are based on an $n^2$ bound on the number of iterations of the while loop in Figure 11.15, while in practice the actual number of while loop iterations can be expected to be much less than $n^2$. This claim is supported by the music synthesis example, as shown in the graph of Figure 11.22. Here, the X-axis corresponds to the latency constraint $L_{max}$, and the Y-coordinates give the number of while loop iterations that were executed by the heuristic. We see that between 5 and 13 iterations were required for each execution of the algorithm, which is not only much less than $n^2 = 484$, it is even less than $n$. This suggests that perhaps a significantly tighter bound on the number of while loop iterations can be derived.
Figure 11.21. Synchronization graphs computed by the heuristic for different values of $L_{max}$ (panels include $L_{max} = 221$ and $L_{max} = 268$).
Figure 11.22. Number of resynchronization iterations versus $L_{max}$ for the example of Figure 11.17 (plot; X-axis from 200 to 700, Y-axis from 5 to 13).
Figure 11.23. The synchronization graph computed by the heuristic for $L_{max} = 644$.
This chapter has discussed the problem of latency-constrained resynchronization for self-timed implementation of iterative dataflow specifications. Given an upper bound $L_{max}$ on the allowable latency, the objective of latency-constrained resynchronization is to insert extraneous synchronization operations in such a way that a) the number of original synchronizations that consequently become redundant significantly exceeds the number of new synchronizations, and b) the serialization imposed by the new synchronizations does not increase the latency beyond $L_{max}$. To ensure that the serialization imposed by resynchronization does not degrade the throughput, the new synchronizations are restricted to lie outside of all cycles in the final synchronization graph.
In this chapter, it has been shown that optimal latency-constrained resynchronization is NP-hard even for a very restricted class of synchronization graphs. Furthermore, an efficient, polynomial-time algorithm has been demonstrated that computes optimal latency-constrained resynchronizations for two-processor systems, and the heuristic presented in Section 10.5 for maximum-throughput resynchronization has been extended to address the problem of latency-constrained resynchronization for general n-processor systems. Through an example of a music synthesis system, we have illustrated the ability of this extended heuristic to systematically trade off between synchronization overhead and latency.
The techniques developed in this chapter and Chapter 10 can be used as a post-processing step to improve the performance of any of the large number of static multiprocessor scheduling techniques for dataflow specifications, such as those described in [BPFC94, CS95, GGD94, Hoa92, LAAG94, PM91, Pri91].
Chapter 12
The previous three chapters have developed several software-based techniques for minimizing synchronization overhead for a self-timed multiprocessor implementation. After all of these optimizations are completed on a given application graph, we have a final synchronization graph $G_s = (V, E_{int} \cup E_s)$ that preserves $G_{ipc}$. Since the synchronization edges in $G_s$ are the ones that are finally implemented, it is advantageous to calculate the self-timed buffer bound $B_{fb}$ as a final step after all the transformations on $G_s$ are completed, instead of using $G_{ipc}$ itself to calculate these bounds. This is because addition of edges in the strong-connectivity conversion and resynchronization steps may reduce these buffer bounds. It is easily verified that removal of edges cannot change the buffer bounds in (9-1) as long as the synchronizations in $G_{ipc}$ are preserved. Thus, in the interest of obtaining the minimum possible shared buffer sizes, we compute the bounds using the optimized synchronization graph. The following theorem tells us how to compute the self-timed buffer bounds from $G_s$.
Theorem 12.1: If $G_s$ preserves $G_{ipc}$ and the synchronization edges in $G_s$ are implemented, then for each feedback communication edge $e$ in $G_{ipc}$, the self-timed buffer bound of $e$ ($B_{fb}(e)$), an upper bound on the number of data tokens that can be present on $e$, is given by:
$B_{fb}(e) = \rho_{G_s}(snk(e), src(e)) + delay(e)$. (12-1)
Proof: By Lemma 7.1, if there is a path $p$ from $snk(e)$ to $src(e)$ in $G_s$, then
$start(src(e), k) \ge end(snk(e), k - Delay(p))$. (12-2)
Taking $p$ to be an arbitrary minimum-delay path from $snk(e)$ to $src(e)$ in $G_s$, we get
$start(src(e), k) \ge end(snk(e), k - \rho_{G_s}(snk(e), src(e)))$. (12-3)
That is, $src(e)$ cannot be more than $\rho_{G_s}(snk(e), src(e))$ iterations "ahead" of $snk(e)$. Thus there can never be more than $\rho_{G_s}(snk(e), src(e))$ tokens more than the initial number of tokens on $e$. Since the initial number of tokens on $e$ was $delay(e)$, the size of the buffer corresponding to $e$ is bounded above by $\rho_{G_s}(snk(e), src(e)) + delay(e)$.
The quantities $\rho_{G_s}(snk(e), src(e))$ can be computed using Dijkstra's algorithm (Section 3.13.1) to solve the all-pairs shortest path problem on the synchronization graph in $O(|V|^3)$ time.
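A minimal Python sketch of this computation is given below. It assumes the bound takes the form "minimum path delay from the sink of the edge back to its source, plus the delay on the edge itself", which is what the argument above leads to. For brevity it runs a single-source Dijkstra search per IPC edge rather than solving the all-pairs problem once. The function names and the `(src, snk, delay)` tuple encoding are hypothetical.

```python
import heapq
import math

def min_delay_from(source, adjacency):
    """Dijkstra over edge delays; returns minimum path delay to each vertex."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue                      # stale heap entry
        for v, delay in adjacency.get(u, []):
            nd = d + delay
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def buffer_bounds(sync_graph_edges, ipc_edges):
    """Self-timed buffer bound for each IPC edge, from the synchronization graph.

    Edges are (src, snk, delay) tuples. The result is math.inf when the
    synchronization graph has no path from snk back to src.
    """
    adjacency = {}
    for src, snk, delay in sync_graph_edges:
        adjacency.setdefault(src, []).append((snk, delay))
    bounds = {}
    for src, snk, delay in ipc_edges:
        rho = min_delay_from(snk, adjacency).get(src, math.inf)
        bounds[(src, snk)] = rho + delay
    return bounds

# One IPC edge A -> B with a feedback path B -> A carrying 3 delays.
g_s = [("A", "B", 0), ("B", "A", 3)]
print(buffer_bounds(g_s, [("A", "B", 0)]))
```

Adding a lower-delay return path to the synchronization graph (for instance B -> C -> A with one delay on each edge) lowers the computed bound, which is why the bounds are best computed after all graph transformations are finished.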
INTEGRATED SYNCHRONIZATION OPTIMIZATION
To present a unified view of multiprocessor implementation issues in a concrete manner that can contribute to the development of future multiprocessor implementation tools, we introduce a flexible framework for combining arbitrary multiprocessor scheduling algorithms for iterative dataflow graphs, including the diverse set that we discussed in Chapter 5, with algorithms for optimizing IPC and synchronization costs of a given schedule, such as those covered in Chapters 9-11.
A pseudocode outline of this framework is depicted in Figure 12.1. In Step 1, an arbitrary multiprocessor scheduling algorithm is applied to construct a parallel schedule for the input dataflow graph. From the resulting parallel schedule, the IPC graph and the initial synchronization graph models are derived in Steps 2 and 3. Then, in Steps 4-8, a series of transformations is attempted on the synchronization graph. First, Step 4 detects and removes all of the synchronization edges in $G_s$ whose associated synchronization functions are guaranteed by other synchronization edges in the graph, as described in Section 9.7. Step 5 then applies resynchronization to the "reduced" graph that emerges from Step 4, and incorporates any applicable latency constraints. Step 6 inserts new synchronization edges to convert the synchronization graph into a strongly connected graph so that the efficient BBS protocol can be used uniformly. Step 7 applies the algorithm discussed in Section 9.9 to determine an efficient placement of delays on the new edges. Finally, Step 8 removes any synchronization edges that have become redundant as a result of the conversion to a strongly connected graph.
Figure 12.1 (pseudocode outline):
Function ImplementMultiprocessorSchedule
Input: An iterative dataflow graph specification $G$ of a DSP application.
Output: An optimized synchronization graph $G_s$, an IPC graph $G_{ipc}$, and IPC buffer sizes $\{B_{fb}(e) \mid e$ is an IPC edge in $G_{ipc}\}$.
1. Apply a multiprocessor scheduling algorithm to construct a parallel schedule for $G$ onto the given target multiprocessor architecture. The parallel schedule specifies the assignment of individual tasks to processors, and the order in which tasks execute on each processor.
2. Extract $G_{ipc}$ from $G$ and the parallel schedule constructed in Step 1.
3. Initialize $G_s = G_{ipc}$.
4. Remove redundant synchronization edges from $G_s$ (Section 9.7).
5. Apply resynchronization to $G_s$, subject to any applicable latency constraints.
6. Insert synchronization edges to convert $G_s$ into a strongly connected graph.
7. Determine the placement of delays on the new edges (Section 9.9).
8. Remove any synchronization edges that have become redundant.
9. Calculate the buffer size $B_{fb}(e)$ for each IPC edge $e$ in $G_{ipc}$: a) compute $\rho_{G_s}(snk(e), src(e))$, the total delay on a minimum-delay path in $G_s$ directed from $snk(e)$ to $src(e)$; b) set $B_{fb}(e) = \rho_{G_s}(snk(e), src(e)) + delay(e)$.
After Step 8 is complete, we have a set of IPC buffers (corresponding to the IPC edges of $G_{ipc}$) and a set of synchronization points (the synchronization edges of the transformed version of $G_s$). The main task that remains before mapping the given parallel schedule into an implementation is the determination of the buffer size (the amount of memory that must be allocated) for each IPC edge. From Theorem 12.1, we can compute these buffer sizes from $G_{ipc}$ and $G_s$ by the procedure outlined in Step 9 of Figure 12.1.
As we have discussed in Chapter 5, optimal derivation of parallel schedules is intractable, and a wide variety of useful heuristic approaches have emerged, with no widely accepted "best choice" among them. In contrast, the technique that we discussed in Section 9.7 for removing redundant synchronizations (Steps 4 and 8) is both optimal and of low computational complexity. However, as discussed in Chapters 10 and 11, optimal resynchronization is intractable, and although some efficient resynchronization heuristics have been developed, the resynchronization problem is very complex, and experimentation with alternative algorithms may be desirable. Similarly, the problems associated with Steps 6 and 7 are also significantly complex to perform in an optimal manner, although no result on their intractability has been derived so far. Thus, at present, as with the parallel scheduling problem, tool developers are not likely to agree on any single "best" algorithm for each of the implementation sub-problems surrounding Steps 5, 6, and 7. For example, some tool designers may wish to experiment with various evolutionary algorithms or other iterative/probabilistic search techniques on one or more of the sub-problems [Dre98]. The multiprocessor implementation framework defined in Figure 12.1 addresses the inherent complexity and diversity of the sub-problems associated with multiprocessor implementation of dataflow graphs by implementing a natural decomposition of the self-timed synchronization problem into a series of well-defined sub-problems, and providing a systematic method for combining arbitrary algorithms that address the sub-problems in isolation.
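The decomposition into independently replaceable sub-problems can be sketched as a small Python driver. This is only an illustration of the plug-in structure, not an implementation of any of the algorithms: every stage is supplied by the caller, and the stage names (`schedule`, `extract_ipc`, `remove_redundant`, and so on) are hypothetical labels for Steps 1-9.

```python
def implement_multiprocessor_schedule(graph, stages):
    """Run Steps 1-9 using caller-supplied algorithms.

    `stages` maps hypothetical stage names to callables. Each graph
    transformation takes a synchronization graph and returns a new one,
    so alternative heuristics can be swapped in per step.
    """
    schedule = stages["schedule"](graph)               # Step 1
    g_ipc = stages["extract_ipc"](graph, schedule)     # Step 2
    g_s = g_ipc                                        # Step 3
    g_s = stages["remove_redundant"](g_s)              # Step 4
    g_s = stages["resynchronize"](g_s)                 # Step 5
    g_s = stages["make_strongly_connected"](g_s)       # Step 6
    g_s = stages["place_delays"](g_s)                  # Step 7
    g_s = stages["remove_redundant"](g_s)              # Step 8
    sizes = stages["buffer_sizes"](g_ipc, g_s)         # Step 9
    return schedule, g_ipc, g_s, sizes

# Stub stages that only record the order in which they are called.
trace = []

def stub(name):
    def run(g):
        trace.append(name)
        return list(g)  # return a copy, leaving the IPC graph untouched
    return run

stages = {
    "schedule": lambda g: "schedule-for-" + g,
    "extract_ipc": lambda g, s: ["e1", "e2"],
    "remove_redundant": stub("remove_redundant"),
    "resynchronize": stub("resynchronize"),
    "make_strongly_connected": stub("make_strongly_connected"),
    "place_delays": stub("place_delays"),
    "buffer_sizes": lambda g_ipc, g_s: {e: 1 for e in g_ipc},
}
print(implement_multiprocessor_schedule("G", stages))
print(trace)
```

Passing the stages as a dictionary is one simple way to let a tool try, for example, an evolutionary search for Step 5 while keeping the other steps fixed. A real tool would likely use typed interfaces for the graph objects instead of the placeholder lists used here.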
This section has integrated the software-based synchronization techniques developed in Chapters 9-11 into a single framework for the automated derivation of self-timed multiprocessor implementations. The input to this framework is an HSDFG representation of an application. The output is a processor assignment and execution ordering of application sub-tasks; an IPC graph $G_{ipc} = (V, E_{ipc})$, which represents buffers as communication edges; a strongly connected synchronization graph $G_s = (V, E_{int} \cup E_s)$, which represents synchronization constraints; and a set of shared-memory buffer sizes
$\{B_{fb}(e) \mid e$ is an IPC edge in $G_{ipc}\}$. (12-4)
A code generator can accept $G_{ipc}$ and $G_s$ from the output of the MinimizeSynchCost framework, allocate a buffer in shared memory of size $B_{fb}(e)$ for each communication edge $e$ specified by $G_{ipc}$, and generate synchronization code for the synchronization edges represented in $G_s$. These synchronizations may be implemented using the BBS synchronization protocol. The resulting synchronization cost is $2n_s$, where $n_s$ is the number of synchronization edges in the synchronization graph $G_s$ that is obtained after all optimizations are completed.
This book has explored techniques that minimize inter-processor communication and synchronization costs in statically scheduled multiprocessors for DSP. The main underlying theme is that communication and synchronization in statically scheduled hardware is fairly predictable, and this predictability can be exploited to achieve our aims of low overhead parallel implementation at low hardware cost. The first technique described was the ordered transactions strategy, where the idea is to predict the order of processor accesses to shared resources and enforce this order at run time. An application of this idea to a shared bus multiprocessor was described, where the sequence of accesses to shared memory is pre-determined at compile time and enforced at run time by a controller implemented in hardware. A prototype of this architecture, called the ordered memory access architecture, demonstrates how low overhead IPC can be achieved at low hardware cost for the class of DSP applications that can be specified as SDF graphs, provided good compile time estimates of execution times exist. We also introduced the IPC graph model for modeling self-timed schedules. This model was used to show that we can determine a particular transaction order such that enforcing this order at run time does not sacrifice performance when actual execution times of tasks are close to their compile time estimates. When actual running times differ from the compile time estimates, the computation performed is still correct, but the performance (throughput) may be affected. We described how to quantify such effects of run time variations in execution times on the throughput of a given schedule. The ordered transactions approach also extends to graphs that include constructs with data-dependent firing behavior. We discussed how conditional constructs and data-dependent iteration constructs can be mapped to the OMA architecture when the number of such control constructs is small, a reasonable assumption for most DSP algorithms.
Finally, we described techniques for minimizing synchronization costs in a self-timed implementation that can be achieved by systematically manipulating the synchronization points in a given schedule; the IPC graph construct was used for this purpose. The techniques described include determining when certain synchronization points are redundant, transforming the IPC graph into a strongly connected graph, and then sizing buffers appropriately such that checks for buffer overflow by the sender can be eliminated. We also outlined a technique we call resynchronization, which introduces new synchronization points in the schedule with the objective of minimizing the overall synchronization cost.
The work presented in this book leads to several open problems and directions for further research. Mapping a general BDF graph onto the OMA architecture to make the best use of our ability to switch between bus access schedules at run time is a topic that requires further study. Techniques for multiprocessor scheduling of BDF graphs could build upon the quasi-static scheduling approach, which restricts itself to certain types of dynamic constructs that need to be identified (for example, conditional constructs or data-dependent iterations) before scheduling can proceed. Assumptions regarding statistics of the Boolean tokens (e.g., the proportion of TRUE values that a control token assumes during the execution of the schedule) would be required for determining multiprocessor schedules for BDF graphs.
The OMA architecture applies the ordered transactions strategy to a shared bus multiprocessor. If the interprocessor communication bandwidth requirements for an application are higher than what a single shared bus can support, a more elaborate interconnect, such as a crossbar or a mesh topology, may be required. If the processors in such a system run a self-timed schedule, the communication pattern is again periodic and we can predict this pattern at compile time.
We can then determine the states that the crossbar in such a system cycles through, or we can determine the sequence of settings for the switches in the mesh topology. The fact that this information can be determined at compile time should make it possible to simplify the hardware associated with these interconnect mechanisms, since the associated switches need not be configured at run time. Exactly how this compile time information can be used to simplify the hardware in such interconnects is an interesting problem for further study.
In the techniques we proposed in Chapters 9 through 11 for minimizing synchronization costs, no assumptions regarding bounds on execution times of actors in the graph were made. A direction for further work is to incorporate timing guarantees, for example, hard upper and lower execution time bounds, as Dietz, Zaafrani, and O'Keefe use in [DZO92], and handling of a mix of actors, some of which have guaranteed execution time bounds and others that have no such guarantees, as Filo, Ku, Coelho Jr., and De Micheli do in [FKJM93]. Such guarantees could be used to detect situations in which data will always be available before it is needed for consumption by another processor. Also, execution time guarantees can be used to compute tighter buffer size bounds. As a simple example, consider Figure 13.1. Here, the analysis of Section 9.4 yields a buffer size $B_{fb}((A, B)) = 3$, since 3 is the minimum path delay of a cycle that contains $(A, B)$. However, if $t(A)$ and $t(B)$, the execution times of actors A and B, are guaranteed to be equal to the same constant, then it is easily verified that a buffer size of 1 will suffice for $(A, B)$. Systematically applying execution time guarantees to derive lower buffer size bounds appears to be a promising direction for further work.
Several useful directions for further work emerge from the concept of self-timed resynchronization described in Chapters 10 and 11. These include investigating whether efficient techniques can be developed that consider resynchronization opportunities within strongly connected components, rather than just across feedforward edges. There may be considerable room for improvement over the resynchronization heuristics that we have discussed, which are straightforward adaptations of an existing set-covering algorithm. In particular, it would be useful to explore ways to best integrate the heuristics for general synchronization graphs with the optimal chaining method for a restricted class of graphs, and it may be interesting to search for properties of practical synchronization graphs that could be exploited in addition to the correspondence with set covering. The extension of Sarkar's concept of counting semaphores [Sar89] to self-timed, iterative execution, and the incorporation of extended counting semaphores within the framework of self-timed resynchronization, are interesting directions for further work.
Figure 13.1. An example of how execution time guarantees can be used to reduce buffer size bounds.
Chapter 13
Another interesting problem is applying the synchronization minimization techniques to graphs that contain dynamic constructs. Suppose we schedule a graph that contains dynamic constructs using a quasi-static approach, or a more general approach if one becomes available. Is it still possible to employ the synchronization optimization techniques we discussed in Chapters 9-11? The first step to take would be to obtain an IPC graph equivalent for the quasi-static schedule that has a representation for the control constructs that a processor may execute as a part of the quasi-static schedule. If we can show that the conditions we established for a synchronization operation to be redundant (in Section 9.7) hold for all execution paths in the quasi-static schedule, then we could identify redundant synchronization points in the schedule. It may be possible to extend the strongly-connect and resynchronization transformations to handle graphs containing conditional constructs; these issues require further investigation. Also, the quasi-static scheduling approaches that have been proposed (e.g., Ha's techniques) do not take into account the communication overhead of broadcasting control tokens to all the processors in the system. Multiprocessor scheduling approaches that do take this overhead into account are an interesting problem for future study.
BIBLIOGRAPHY
[ABU91]
Bic, and Ungerer. Evolution of dataflow computers. In Prentice Hall, 1991.
[ABZ92]
Burnett, and B. A. Zimmerman. Operational versus definitional: A perspective on programming paradigms. IEEE 25(9), September 1992.
[ACD74]
Chandy, and J. R. Dickson. A comparison of parallel processing systems. of 17(12):685-690, December 1974.
[Ack82]
W. B. Ackerman. Data flow languages. IEEE 15(2), February 1982.
[AHU87]
A. Aho, J. E. Hopcroft, and D. Ullman. Addison-Wesley, 1987.
[AK87]
Allen and D. Kennedy. Automatic transformations of FORTRAN programs to vector form. 9(4), October 1987.
[AN88]
A. Aiken and A. Nicolau. Optimal loop parallelization. In of 1988.
[AN90]
Arvind and R. S. Nikhil. Executing a program on the token dataflow architecture. IEEE 39(3), March 1990.
[ASI+98]
A. Abnous, K. Seno, Y. Ichikawa, M. Wan, and J. Rabaey. Evaluation of a low-power reconfigurable DSP architecture. In 1998.
[A+87]
Annaratone et al. The Warp computer: Architecture, implementation, and performance. (12), December 1987.
[A+98]
S. Aiello et al. Extending a monoprocessor real-time system in a DSP-based multiprocessing environment. In 1998.
[BB91]
A. Benveniste and G. Berry. The synchronous approach to reactive and real-time systems. 79(9):1270-1282, September 1991.
[BCOQ92]
F. Baccelli, G. Cohen, G. J. Olsder, and Quadrat. John Wiley & Sons, Inc., 1992.
[BDW85]
J. Beetem, M. Denneau, and D. Weingarten. The GF11 supercomputer. In June 1985.
[BELP94]
C. Bilsen, M. Engels, R. Lauwereins, and J. A. Peperstraete. Static scheduling of multi-rate and cyclo-static DSP-applications. In 1994.
[BHCF95]
S. Banerjee, T. Hamada, P. M. Chau, and R. D. Fellman. Macro pipelining based scheduling on high performance heterogeneous multiprocessor systems. 43(6):1468-1484, June 1995.
J. Buck, S. E. A. Lee, and D. G. Messerschmitt. Ptolemy: A framework for simulating and prototyping heterogeneous systems. January 1994.
R. K. Brayton, G. D. Hachtel, C. McMullen, and A. L. Sangiovanni-Vincentelli. Kluwer Academic Publishers, 1984.
[BL89]
J. Bier and E. A. Lee. Frigg: A simulation environment for multiprocessor DSP system development. In pages 280-283, October 1989.
B. Barrera and E. A. Lee. Multirate signal processing in Comdisco’s SPW. In April 1991.
S. S. Bhattacharyya and Edward A. Lee. Scheduling synchronous dataflow graphs for efficient looping. 1993.
[BL94]
S. S. Bhattacharyya and E. A. Lee. Memory management for dataflow programming of multirate signal processing algorithms. Transactions on Signal Processing, 42(5):1190-1201, May 1994.
[Bla87]
J. Blazewicz. Selected topics in scheduling theory. In Surveys in Combinatorial Optimization. North Holland Mathematica Studies, 1987.
[BML96]
S. S. Bhattacharyya, K. Murthy, and A. Lee. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, 1996.
[Bok88]
S. H. Bokhari. Partitioning problems in parallel, pipelined, and distributed computing. Transactions on Computers, 37(1):48-57, January 1988.
[Bor88]
G. Borriello. Combining events and data-flow graphs in behavioral synthesis. In Proceedings of the International Conference on Computer-Aided Design, pages 56-59, 1988.
[BPFC94]
S. Banerjee, D. Picker, D. Fellman, and P. M. Chau. Improved scheduling of signal flow graphs onto multiprocessor systems through an accurate network modeling technique. In Proceedings of the International Workshop on Signal Processing, 1994.
[Bry86]
R. E. Bryant. Graph based algorithms for boolean function manipulation. Transactions on Computers, 35(8):677-691, August 1986.
[BSL90]
J. Bier, S. Sriram, and E. A. Lee. A class of multiprocessor architectures for real-time DSP. In Proceedings of the International Workshop on Signal Processing, November 1990.
[BSL96a]
S. S. Bhattacharyya, S. Sriram, and E. A. Lee. Latency-constrained resynchronization for multiprocessor DSP implementation. In Proceedings of the International Conference on Application Specific Systems, Architectures, and Processors, August 1996. Chicago, Illinois.
[BSL96b]
S. S. Bhattacharyya, S. Sriram, and E. A. Lee. Self-timed resynchronization: A post-optimization for static multiprocessor schedules. In Proceedings of the International Parallel Processing Symposium, April 1996. Honolulu, Hawaii.
[BSL97]
S. S. Bhattacharyya, S. Sriram, and E. A. Lee. Optimizing synchronization in multiprocessor DSP systems. Transactions on Signal Processing, 45(6), June 1997.
Borkar et al. iWarp: An integrated solution to high-speed parallel computing. In 1988.
T. Buck. Electrical Engineering California at Berkeley, September 1993.
Cohen, D. Dubois, and Quadrat. A linear system theoretic view of discrete event processes and its use for performance evaluation in manufacturing. March 1985.
R. Cunningham-Green. Minimax algebra. In Springer-Verlag, 1979.
[Cha84]
Chase. A pipelined data flow architecture for digital signal processing: The NEC μPD7281. In VLSI November 1984.
P. Chretienne. Timed event graphs: A complete study of their cutions. In
Chretienne. Task scheduling over distributed memory machines. In 1989.
and R. L. Rivest. to
Chen and J. Rabaey. A reconfigurable multiprocessor IC for rapid prototyping of algorithm-specific high-speed DSP data paths. 27(12), December 1992.
L. F. Chao and E. Sha. Unfolding and retiming data-flow DSP RISC multiprocessor scheduling. In April 1992.
Silva. Structural techniques and performance bounds of stochastic Petri net models. In Springer-Verlag, 1993.
L. Chao and E. H. Static scheduling for synthesis of DSP algorithms on various models. VLSI
pages 207-2~3, 1995. [De
Gupta. Faster maximum and minimum mean cycle algorithms for on October 1998.
root, S. Gerez, and guided iterative data-flow graph scheduling. on pages 351-364, May 1992. U, A.
Izatt, and 6.
[Dij59]
Academic Publishers, 1998. avoli et al. Parallel computing in networks of workstations with Paralex. on April 1996. rooksKole, 1991. tectures.
T. O’Keefe. Static scheduling of
Theoretical improvements in algorithmic efficiency for network flow algorithms. pages 248-264, April 1972.
agan. The Pentium(R) processor with 1997.
Lewis. Scheduling parallel pro onto arbitrary target machines. Journal of Parallel and Distributed Computing, pages 138-153, 1990.
[FKAJM93]
D. Filo, D. C. Ku, C. N. Coelho Jr., and G. De Micheli. Interface optimization for concurrent systems under timing constraints. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1, September 1993.
D. Filo, D. C. Ku, and G. De Micheli. Optimizing the control unit through the resynchronization of operations. INTEGRATION, the VLSI Journal, pages 231-258, 1992.
[Fly66]
M. J. Flynn. Very high-speed computing systems. Proceedings of the IEEE, December 1966.
[F+97]
R. Fromm et al. The energy efficiency of IRAM architectures. In International Symposium on Computer Architecture, June 1997.
[GB91]
J. Gaudiot and L. Bic, editors. Advanced Topics in Dataflow Computing. Prentice Hall, 1991.
[Ger95]
S. Van Gerven. Multiple beam broadband beamforming: Filter design and real-time implementation. In Proceedings of the IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, 1995.
[GGA92]
K. Guttag, R. J. Grove, and J. R. Van Aken. A single-chip multiprocessor for multimedia: the MVP. IEEE Computer Graphics and Applications, 12(6), November 1992.
[GGD94]
R. Govindarajan, G. R. Gao, and P. Desai. Minimizing memory requirements in rate-optimal schedules. In Proceedings of the International Conference on Application Specific Array Processors, August 1994.
[GJ79]
M. R. Garey and D. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. H. Freeman and Company, 1979.
[GMN96]
B. Gunther, G. Milne, and L. Narasimhan. Assessing document relevance with run-time reconfigurable machines. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 10-17, April 1996.
[Gra69]
R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal of Applied Math, 17(2):416-429, March 1969.
[Gri88]
C. M. Grinstead. Cycle lengths in October 1988.
[GS92]
F. Gasperoni and Uwe Schweigelshohn. Scheduling loops on parallel processors: A simple algorithm with close to optimum performance. In September 1992.
[G+91]
Gokhale et al. Building and using a highly programmable logic array. 24(1):81-89, January 1991.
[G+92]
A. Gunzinger et al. Architecture and realization of a multi signal processor system. In pages 327-340, 1992.
[GVNG94]
D. J. Gajski, Vahid, S. Narayan, and J. Gong. Prentice Hall, 1994.
[GW92]
B. Greer and J. Webb. Real-time supercomputing on iWarp. In 1992.
[GY92]
A. Gerasoulis and T. Yang. A comparison of clustering heuristics for scheduling directed graphs on multiprocessors. 16:276-291, 1992.
[Ha92]
Ha. Ph.D. thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, April 1992.
[Hal93]
N. Halbwachs. Kluwer Academic Publishers, 1993.
[Hav91]
B. R. Haverkort. Approximate performability analysis using generalized stochastic Petri nets. In pages 176-185, 1991.
[HCA89]
J. J. Hwang, Y. C. Chow, and D. Anger. Scheduling precedence graphs in systems with inter-processor communication times. 18(2):244-257, April 1989.
[HCRP91]
N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud. The synchronous data flow programming language September 1991.
of
for
Sympages 105-
lgenstock, and P. Pirsch. Design of a development system for multimedia applications based on In of pages 1151-1154, 1996.
and E. A. Lee. Compile-time scheduling of dynamic constructs in dataflow program graphs. 46(7), July 1997.
minimum cost to time ratio cycles with small integral transit times. September 1993.
~ ~ u k o t u Considerations n. in the design of ssor"on-a-chip micro~chitecture. ~echnical er eport C S L - T ~ - 9 8 - ~ tanf ~ 9 ,ford ~niversity~ o m ~ u tSystems Lab, February 1998.
ingand
Computer Sciences, ~niversity
of
Array
A.ugust 1992.
n et al. ~ynthesisof synchronous communication in multi~rocessorarchitecture. of V ~ S I 6 : 2 ~ 9 - ~ 9 91993. , allel se~uencingand assembly line problems.
April 1997.
[Joh74]
S. Johnson. Approximation algorithms for combinatorial problems. pages 278, 1974.
[Jr.76]
E. Coffman, Jr. Wiley & Sons, Inc., 1976.
[KA96]
Y. Kwok and I. Ahmad. Dynamic critical path scheduling: An effective technique for allocating task graphs to multiprocessors.
[Kar78]
characterization of the minimum cycle 23, 1978.
e. A general approach to mappi multiprocessor architectures. In pages 1-43, 1988.
[Kim88]
Kim. thesis, Department of Computer Science, University of Texas at Austin, 1988.
Kalavade and E. Lee. A hardware/software codesign methodology for DSP applications. 10(3):16-28, September 1993.
[KLL87]
Lewis, and S. C. Lo. Performance analysis and optimization of VLSI dataflow arrays. pages 592-618, 1987.
Miller. Properties of model cy, termination, queueing, November 1966.
ative scheduling under timing level synthesis of digital cir(6):69~ 718, June 1992.
A preliminary evaluation of critical pat tasks on multiprocessor systems. IEEE pages 1235-1238, December 1975.
[Koh90]
Koh. A Behavioral Simulation. Ph.D. thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, June 1990.
[Kru87]
B. Kruatrachue. Static Task Scheduling and Grain Packing in Parallel Processing Systems. Ph.D. thesis, Department of Computer Science, Oregon State University, 1987.
[KS83]
K. Karplus and A. Strong. Digital synthesis of plucked-string and drum timbres. Computer Music Journal, 7(2):56-69, 1983.
[Kun88]
Y. Kung. Arrays Processors. Prentice Hall, Englewood Cliffs, N.J., 1988.
[LAAG94]
G. Liao, E. R. Altman, V. K. Agarwal, and G. R. Gao. A comparative study of DSP multiprocessor list scheduling heuristics. In Proceedings of the Hawaii International Conference on System Sciences, 1994.
[Lam86]
L. Lamport. The mutual exclusion problem: Part I and II. Journal of the Association for Computing Machinery, 33(2):313-348, April 1986.
[Lam88]
M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM Conference on Programming Language Design and Implementation, pages 318-328, June 1988.
[Lam89]
M. Lam. A Systolic Array Optimizing Compiler. Kluwer Academic Publishers, 1989.
[Lap91]
P. D. Lapsley. Host interface and debugging of dataflow DSP systems. Master’s thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, December 1991.
[Law76]
E. L. Lawler. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, 1976.
[LB90]
E. A. Lee and J. C. Bier. Architectures for statically scheduled dataflow. Journal of Parallel and Distributed Computing, 10:333-348, December 1990.
[LBSL94]
P. Lapsley, J. Bier, A. Shoham, and A. Lee. DSP Processor Fundamentals. Berkeley Design Technology, Inc., 1994.
[LDK98]
S. Y. Liao, S. Devadas, and K. Keutzer. Code density optimization for embedded DSP processors using data compression techniques. IEEE Transactions on Computer-Aided Design, 17(7):601-608, July 1998.
[LDK+95]
S. Liao, S. Devadas, K. Keutzer, S. Tjiang, and A. Wang. Code optimization techniques for embedded DSP microprocessors. In Proceedings of the Design Automation Conference, June 1995.
[LEAP94]
R. Lauwereins, Engels, Ade, and J. A. Peperstraete. Grape-ii: Graphical rapid prototyping environment for digital signal processing systems. In Proceedings of the International Conference on Signal Processing Applications and Technology, 1994.
[Lee86]
E. A. Lee. A Coupled Hardware and Software Architecture for Programmable DSPs. Ph.D. thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, May 1986.
[Lee88a]
E. A. Lee. Programmable DSP architectures: Part I. ASSP Magazine, 5(4), October 1988.
[Lee88b]
E. A. Lee. Recurrences, iteration, and conditionals in statically scheduled block diagram languages. In Proceedings of the International Workshop on VLSI Signal Processing, 1988.
[Lee91]
E. A. Lee. Consistency in dataflow graphs. IEEE Transactions on Parallel and Distributed Systems, 2(2), April 1991.
[Lee93]
E. A. Lee. Representing and exploiting data parallelism using multidimensional dataflow diagrams. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 453-456, April 1993.
[Lee96]
R. B. Lee. Subword parallelism with MAX2. Micro, 16(4), August 1996.
[Lei92]
F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers Inc., 1992.
[LH89]
E. A. Lee and S. Ha. Scheduling strategies for multiprocessor real time DSP. In Global Telecommunications Conference, November 1989.
[LLG+92]
D. Lenoski, J. Laudon, K. Gharachorloo, W. D. Weber, and J. Hennessey. The Stanford DASH multiprocessor. IEEE Computer Magazine, March 1992.
E. A. Lee and D. G. Messerschmitt. Static scheduling of synchronous dataflow programs for digital signal processing. IEEE Transactions on Computers, February 1987.
Li and S. Malik. Performance analysis of embedded software using implicit path enumeration. In Proceedings of the Design Automation Conference, 1995.
E. Lemoine and Merceron. Run-time reconfiguration of FPGA for scanning genomic databases. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 90-98, April 1996.
[Lou93]
J. Lou. Application development on the Intel Warp system. In Proceedings of the SPIE, 1993.
[Lov75]
L. Lovasz. On the ratio of optimal integral and fractional covers. Discrete Mathematics, pages 383-390, 1975.
[LP81]
Lewis and H. Papadimitriou. Elements of the Theory of Computation. Prentice
[LP82]
R. Lewis and H. Papadimitriou. Elements of the Theory of Computation. Prentice
[LP95]
E. A. Lee and T. Parks. Dataflow process networks. Proceedings of the IEEE, pages 773-799, May 1995.
[LP98]
Liu and Prasanna. Utilizing the power of high-performance computing. IEEE Signal Processing 100, September 1998.
[LS91]
E. Leiserson and B. Saxe. Retiming synchronous circuitry. Algorithmica, pages 5-35, 1991.
V. Madisetti. Digital Signal Processors. IEEE Press, 199
Mirsky and A. DeHon. MATRIX: A reconfigurable computing device with configurable instruction distribution and deployable resources. In Proceedings of the Hot Chips Symposium, August 1997.
Messerschmitt. Breaking the recursive bottleneck. In J. Performance Limits in Communication Theory and Practice. Kluwer Academic Publishers, 1988.
an, J. J. Thompson, and istics for scheduling DA processors. In Proceedings of the International Parallel Processing Symposium, pages 446-451, 1994.
[MM92]
F. Moussavi and D. G. Messerschmitt. Statistical memory management for digital signal processing. In Proceedings of the International Symposium on Circuits and Systems, pages 1011-1014, May 1992.
[Mol82]
M. K. Molloy. Performance analysis using stochastic Petri nets. IEEE Transactions on Computers, September 1982.
[Mot89]
Motorola Inc. DSP96002 IEEE Floating-Point Dual-Port Processor User’s Manual, 1989.
[Mot90]
Motorola Inc. DSP96000ADS Application Development System Reference Manual, 1990.
G. Mouney. Parallel solution of linear ODE’s: Implementation on transputer networks. Concurrent Systems Engineering Series, 1996.
N. Morgan et al. The ring array processor: A multiprocessing peripheral for connectionist applications. Journal of Parallel and Distributed Computing, 14(3):248-259, March 1992.
[Mur89]
Murata. Petri nets: Properties, analysis, and applications. Proceedings of the IEEE, pages 39-58, January 1989.
[Nic89]
Nicol. Optimal partitioning of random programs across two processors. IEEE Transactions on Computers, 15(2):134-141, 1989.
[Ols89]
J. Olsder. Performance analysis of data-driven networks. In J. McCanny, J. McWhiter, and E. Swartzlander Jr., editors, Systolic Array Processors; Contributions by Speakers at the International Conference on Systolic Arrays. Prentice Hall, 1989.
[ORVK90]
G. J. Olsder, J. A. C. Resing, R. E. De Vries, and Discrete event systems with stochastic processing times. IEEE Transactions on Automatic Control, 35(3):299-302, March 1990.
[O+96]
Olukotun et al. The case for a single-chip multiprocessor. SIGPLAN Notices, 31(9):2-11, September 1996.
[Ous94]
Ousterhout. An Introduction to and Tk. Addison-Wesley, 1994.
[Pap90]
Papadopoulos. Monsoon: A dataflow computing architecture suitable for intelligent control. In Proceedings of the 5th IEEE International Symposium on Intelligent Control, 1990.
A. Papoulis. Probability, Random Variables and Stochastic Processes. McGraw-Hill, 1991.
[PBL95]
J. L. Pino, S. S. Bhattacharyya, and E. A. Lee. A hierarchical multiprocessor scheduling system for DSP applications. In Proceedings of the Asilomar Conference on Signals, Systems, and Computers, November 1995.
[Pet81]
L. Peterson. Petri Net Theory and the Modeling of Systems. Prentice Hall, 1981.
[PH96]
D. A. Patterson and L. Hennessey. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., second edition, 1996.
[PHLB95]
Pino, S. Ha, E. A. Lee, and J. Buck. Software synthesis for DSP using Ptolemy. Journal of VLSI Signal Processing, 9(1), January 1995.
[PLN92]
D. B. Powell, E. A. Lee, and W. C. Newman. Direct synthesis of optimized DSP assembly code from signal flow block diagrams. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, March 1992.
[PM91]
K. K. Parhi and D. G. Messerschmitt. Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding. IEEE Transactions on Computers, 40(2):178-194, February 1991.
[Pra87]
M. Prastein. Precedence-constrained scheduling with minimum time and communication. Master’s thesis, University of Illinois at Urbana-Champaign, 1987.
[Pri91]
H. Printz. Automatic Mapping of Large Signal Processing Systems to Parallel Machines. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, May 1991.
[Pri92]
H. Printz. Compilation of narrowband spectral detection systems for linear MIMD machines. In Proceedings of the International Conference on Application Specific Array Processors, August 1992.
[PS94]
M. Potkonjac and B. Srivastava. Behavioral synthesis of high performance, and low power application specific processors for linear computations. In pages 45-56, 1994.
[P+97]
D. Patterson et al. A case for intelligent RAM: IRAM. April 1997.
[Pto98]
Department of Electrical Engineering and Computer Sciences, University of California at Berkeley. 1998.
[Pur97]
S. Purcell. Mpact 2 media processor, balanced 2X performance. In 1997.
[PY90]
C. Papadimitriou and M. Yannakakis. Toward an architecture-independent analysis of parallel algorithms. pages 322-328, 1990.
[Rao85]
Rao. PhD. thesis, Stanford University, October 1985.
[RCG72]
C. Ramamoorthy, K. M. Chandy, and M. J. Gonzalez. Optimal scheduling strategies in multiprocessor systems. IEEE February 1972.
[RCHP91]
J. M. Rabaey, C. Chu, Hoang, and M. Potkonjak. Fast prototyping of datapath intensive architectures. 8(2):40-51, June 1991.
D. Regenold. A single-chip multiprocessor DSP solution for communications applications. In pages 437-440, 1994.
[Rei68]
R. Reiter. Scheduling parallel computations. October 1968.
[RH80]
C. Ramamoorthy and G. S. Ho. Performance evaluation of asynchronous concurrent systems using Petri nets. SE-6(5):440-449, September 1980.
[RN81]
M. Renfors and Y. Neuvo. The maximum sampling rate of digital filters under hardware speed constraints. March 1981.
[RPM92]
S. Ritz, M. Pankert, and H. Meyr. High level software synthesis for signal processing systems. In August 1992.
S. Rajsbaum and M. Sidi. On the performance of synchronized programs in distributed networks with random processing times and transmission delays. on 5(9), September 1994.
[RS98]
S. Rathnam and Slavenburg. Processing the new world of interactive media. 15(2), March 1998.
S. Ramaswamy, S. Sapatnekar, and P. Banerjee. A framework for exploiting task and data parallelism on distributed memory multicomputers. on 8(11), November 1997.
[Sar88]
Sarkar. Synchronization using counting semaphores. In on 1988.
[Sar89]
Sarkar. MIT Press, 1989.
N. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli. Graph algorithms for clock schedule optimization. In of pages 132-1 1992.
[Sch88]
H. Schwetman. Using CSIM to model complex systems. In of pages 246-253, 1988.
[Sha89]
P. L. Shaffer. Minimization of interprocessor synchronization in multiprocessors with shared and private memory. In on 1989.
[Sha98]
M. El Sharkawy. Multiprocessor 3d sound system. In of 1998.
[SHL+97]
D. Shoemaker, F. Honore, P. LoPresti, Metcalf, and A unified system for scheduled communication. In July 1997.
[SB85]
D. A. Schwartz and T. P. Barnwell III. Cyclo-static solutions: Optimal multiprocessor realizations of recursive algorithms. In pages 117-128, June 1985.
[Sih91]
G. C. Sih. Ph.D. thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, April 1991.
[SL90]
G. C. Sih and E. A. Lee. Scheduling to account for interprocessor communication within interconnection-constrained processor networks. In on 1990.
[SL93a]
G. C. Sih and E. A. Lee. A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures. on 4(2):75-87, February 1993.
[SL93b]
G. C. Sih and E. A. Lee. Declustering: A new multiprocessor scheduling technique. on 4(6), June 1993.
[SL94]
S. Sriram and E. A. Lee. Statically scheduling communication resources in multiprocessor DSP architectures. In November 1994.
[Sri92]
M. B. Srivastava. Ph.D. thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, June 1992.
[Sri95]
S. Sriram. Ph.D. thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, 1995.
[Sto77]
H. S. Stone. Multiprocessor scheduling with the aid of network flow algorithms. on 3(1):85-93, January 1977.
[Sto91]
A. Stolzle. Ph.D. thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, December 1991.
[SW92]
R. R. Shively and L. J. Application and packaging of the AT&T DSP3 parallel signal processor. In pages 316-326, 1992.
[Tex98]
Texas Instruments. March 1998.
[TONL96]
M. Tremblay, J. M. O'Connor, V. Narayanan, and H. Liang. VIS speeds new media processing. 16(4), August 1996.
[T+95]
A. Trihandoyo et al. Real-time speech recognition architecture for multi-channel interactive voice response system. In 1995.
[TTL95]
J. Teich, L. Thiele, and E. A. Lee. Modeling and simulation of heterogeneous real-time systems based on deterministic discrete event model. In of on pages 156-161, 1995.
[Vai93]
P. P. Vaidyanathan. Prentice Hall, 1993.
[VLS86]
VLSI CAD Group, Stanford University, 1986.
[VPS90]
M. Veiga, J. Parera, and J. Santos. Programming DSP systems on multiprocessor architectures. In April 1990.
[V+96]
J. E. Vuillemin et al. Programmable active memories: Reconfigurable systems come of age. on (VLSI) 4(1), March 1996.
[WLR98]
A. Y. Wu, K. J. R. Liu, and A. Raghupathy. System architecture of an adaptive reconfigurable DSP computing engine. on for February 1998.
[W+97]
E. Waingold et al. Baring it all to software: Raw machines. pages 86-93, September 1997.
[YG94]
T. Yang and A. Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. on 5(9):951-967, September 1994.
[YM96]
J. S. Yu and P. C. Mueller. On-line Cartesian space obstacle avoidance scheme for robot arms. in August 1996.
[Yu84]
W. Yu. LU Ph.D. thesis, University of California at Berkeley, 1984.
[YW93]
L. Yao and C. M. Woodside. Iterative decomposition and aggregation of stochastic marked graph Petri nets. In G. Rosenberg, editor, 1993. Springer-Verlag, 1993.
[ZKM94]
V. Zivojnovic, H. Koerner, and H. Meyr. Multiprocessor scheduling with a-priori node assignment. In on 1994.
[ZRM94]
V. Zivojnovic, S. Ritz, and H. Meyr. Retiming of DSP programs for optimum vectorization. In April 1994.
[ZS89]
A. Zaky and Sadayappan. Optimal static scheduling of sequential loops on multiprocessors. In pages 130-137, 1989.
[ZVSM95]
V. Zivojnovic, J. M. Velarde, C. Schlager, and H. Meyer. DSPstone: A DSP-oriented benchmarking methodology. In 1995.